AI Assurance

The glass box: why AI reasoning is the new audit trail

Feb 22, 2026

1. Executive summary

In an audit file, a conclusion without a documented calculation is an unmanaged risk. Similarly, an AI answer without a visible reasoning process is a compliance liability.

We have fully transitioned from early conversational AI to advanced reasoning models. Tools like ChatGPT 5.2, Claude 4.6 Sonnet and Gemini Pro 3.1 now spend processing power 'thinking' before they respond. For professional services, this generated thought trace serves as a digital audit trail. However, recent AI safety research reveals a 'faithfulness gap': the reasoning a model shows does not always match the process that actually produced its answer. Professionals must learn to audit the AI's logic rather than blindly accept its rationale.

The table below outlines the shift from output-focused review to process-focused assurance.

Feature | The black box (Legacy AI) | The glass box (Reasoning AI)
Processing style | Instant text prediction: generates an immediate answer based on training patterns. | Inference-time scaling: spends computing power to reason through steps before answering.
Auditability | Low: the user only sees the final text; the internal logic is completely hidden. | High: the user can read the Chain of Thought to see exactly how the conclusion was reached.
Primary risk | Hallucination: the model confidently invents facts to fill knowledge gaps. | Faithfulness gap: the model makes an intuitive guess and writes a logical thought trace to justify it.

2. Introduction: system 1 vs system 2 in professional workflows

To understand how modern AI operates, we can borrow a concept from the psychologist Daniel Kahneman. He famously divided human thought into two categories: System 1 and System 2.

System 1 is fast and intuitive. When you look at a scanned invoice and instantly locate the total amount, you are using System 1. Legacy AI models excel at this. They are brilliant at extracting a VAT number or summarising a meeting transcript.

System 2 is slow and deliberative. When you analyse a complex loan structure against current Dutch tax law or check Corporate Sustainability Reporting Directive (CSRD) data points for double counting, you are using System 2. You have to pause, hold multiple rules in your head and work step by step.

The most common failure in modern professional firms is assigning a System 2 compliance task to a System 1 workflow. Asking a fast chatbot to solve a nuanced tax problem is like asking a junior associate to guess a calculation without using a spreadsheet. It results in confident errors.

3. The glass box: inference-time scaling and CoT

The major technological leap in 2026 is that AI has finally unlocked System 2 thinking. Models like ChatGPT 5.2, Claude 4.6 Sonnet and Gemini Pro 3.1 utilise a technique called inference-time scaling.

Instead of answering instantly, these models use additional computing power during the query to build a Chain of Thought (CoT). The model generates internal steps, critiques its own logic and backtracks if it hits a dead end before producing the final output for the user.
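To make this concrete, here is a minimal sketch of what filing a thought trace as an audit artefact could look like. Everything in it is an assumption for illustration: the AuditRecord structure, its field names and the firm-kb:// source identifier are invented, not any vendor's API. The point is simply that the trace gets stored and fingerprinted alongside the conclusion it supports.

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AuditRecord:
    """Illustrative container pairing a model's answer with its reasoning trace."""
    prompt: str
    thought_trace: str  # the Chain of Thought the model exposed
    final_answer: str
    sources: list[str] = field(default_factory=list)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def fingerprint(self) -> str:
        """Hash the whole record so later tampering with the stored trace is detectable."""
        payload = json.dumps(self.__dict__, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

# File the reasoning alongside the conclusion, not just the conclusion.
record = AuditRecord(
    prompt="May the interest on this shareholder loan be deducted?",
    thought_trace="Step 1: classify the loan. Step 2: test against art. 10a...",
    final_answer="Deduction is limited; see analysis.",
    sources=["firm-kb://corporate-income-tax/art-10a"],  # hypothetical identifier
)
print(record.fingerprint())
```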

For auditors and tax advisors, this is a revolutionary change. The EU AI Act now demands strict explainability for AI systems used in high-risk financial workflows. By exposing the Chain of Thought, the AI transitions from a 'black box' where only the input and output are visible, to a 'glass box' where the reasoning is fully transparent.

4. The hidden risk: the faithfulness gap

While a visible thought process is incredibly valuable, it introduces a new risk that technical professionals must understand. The thought summary presented to you on the screen is a translation of the model's neural activations. It is not a perfect mirror.

Recent AI safety research highlights a phenomenon called Implicit Post-Hoc Rationalisation. Sometimes, the AI exhibits an unconscious bias toward a specific answer. For example, if you ask a leading question, the AI wants to please you by agreeing. It will intuitively decide to say 'yes' and then use its Chain of Thought to write a highly logical justification for that 'yes' after the fact.

To manage this risk, auditors must adopt a specific mindset. You must treat the AI's Chain of Thought as a Management Representation Letter, not as independent Audit Evidence. A representation letter explains the intent and logic of the client, but the professional must still independently verify the underlying facts.

5. Practical guide: the 3-level AI review framework

To safely sign off on work generated by reasoning models, firms should implement a structured review process.

Level 1: the source check
You must verify the grounding. Look at the thought trace and ask: Did the AI cite a specific standard or a secure database? If the AI thought "I will apply the 2026 IFRS amendment", you must ensure it actually pulled that text via your firm's secure server rather than relying on outdated data from its initial training. Always click the source link.
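A first pass of this check can even be scripted. The sketch below assumes, purely for illustration, that citations surface in the trace in a [cite: <id>] format and that the firm maintains an allow-list of grounded sources; neither is a real product convention, and clicking through to the source text remains the reviewer's job.

```python
import re

# Hypothetical allow-list: citation ID -> date the grounded text was retrieved.
APPROVED_SOURCES = {
    "IFRS-2026-amendment": "2026-01-15",
    "corporate-income-tax-art-10a": "2025-11-02",
}

def ungrounded_citations(thought_trace: str) -> list[str]:
    """Return citations in the trace that do not resolve to the firm's secure store.

    Assumes citations appear as [cite: <id>]; adapt the pattern to whatever
    format your tooling actually emits.
    """
    cited = re.findall(r"\[cite:\s*([^\]]+)\]", thought_trace)
    return [c.strip() for c in cited if c.strip() not in APPROVED_SOURCES]

trace = (
    "I will apply the 2026 IFRS amendment [cite: IFRS-2026-amendment] "
    "and an older circular [cite: tax-circular-2019]."
)
flagged = ungrounded_citations(trace)
if flagged:
    print("Level 1 flag: citations without a grounded source:", flagged)
```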

Level 2: the logic check
You must check the reasoning for balance and bias. Read the thoughts to see whether the AI considered alternative treatments. A robust reasoning trace should say: "I considered applying Rule X, but Rule Y overrides it because of this specific clause." If the AI only argues one side of a complex tax issue, it is exhibiting sycophancy: it is telling you what you want to hear.
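A crude pre-screen for one-sided traces can be automated, as in the sketch below. The marker phrases are illustrative assumptions, and a keyword scan obviously cannot judge the quality of a counterargument; it only routes suspicious traces to a human for the real Level 2 review.

```python
# Phrases suggesting the model actually weighed an alternative treatment.
# This list is illustrative; tune it to your domain and working language.
COUNTERARGUMENT_MARKERS = (
    "i considered",
    "alternatively",
    "on the other hand",
    "however",
    "overrides",
)

def looks_one_sided(thought_trace: str) -> bool:
    """True if the trace shows no sign of weighing alternatives (sycophancy risk)."""
    lowered = thought_trace.lower()
    return not any(marker in lowered for marker in COUNTERARGUMENT_MARKERS)

trace = "The client's position is clearly correct. Rule X applies, so we conclude yes."
if looks_one_sided(trace):
    print("Level 2 flag: trace argues only one side; review for sycophancy.")
```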

Level 3: the artifact check
You must verify the accuracy of the execution. If the AI's thought process involves complex mathematics, it must call a deterministic tool to get the answer. An AI model should write a Python script to calculate depreciation, for example, rather than estimating the figure in prose. Never trust a model's mental arithmetic, however advanced the reasoner. Trust the script, but verify the thought.
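For illustration, this is the kind of deterministic artifact a reviewer should expect to find attached to the trace. The method and figures are invented for the example (a simple straight-line schedule), but the property that matters is general: the number comes from reproducible arithmetic, not from token prediction.

```python
def straight_line_depreciation(
    cost: float, residual_value: float, useful_life_years: int
) -> list[float]:
    """Deterministic straight-line schedule: an equal charge for each year of life."""
    if useful_life_years <= 0:
        raise ValueError("useful life must be positive")
    annual_charge = (cost - residual_value) / useful_life_years
    return [round(annual_charge, 2)] * useful_life_years

# Example: an asset costing 50,000 with a 5,000 residual value over 5 years.
schedule = straight_line_depreciation(50_000, 5_000, 5)
assert sum(schedule) == 45_000  # the schedule must reconcile to the depreciable base
print(schedule)  # [9000.0, 9000.0, 9000.0, 9000.0, 9000.0]
```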

6. Conclusion

As enterprise governance tightens and the first major wave of CSRD audits for the 2025 financial year concludes, the professional focus has permanently shifted. We no longer just ask "What did the AI write?" We must ask "How did the AI conclude this?"

The ability to audit an AI's reasoning trace is rapidly becoming a core competency for accountants and tax advisors. The final rule for 2026 is simple. If an AI agent cannot clearly expose the steps, sources and tools it used to reach a professional conclusion, its output cannot be signed off by a human practitioner.

building blocks for verifiable AI

© 2026 Prevector B.V.
