IBM spent more than four billion dollars. The evidence existed for years. Nobody was required to close the gap.
That is the Watson Health story in one sentence. But the more important question — the one this issue answers — is whether the same structural gap exists inside your organization right now.
THE PATTERN
Episode 3 traced the evidence record across three documented moments: the MD Anderson partnership ($62.1M, four years, zero patients treated), the STAT News investigation (internal IBM slides documenting “unsafe and incorrect” recommendations), and the 2022 divestiture (Watson Health assets sold for a reported ~$1 billion on $4B+ invested).
The through-line was not technology failure. It was the absence of any institutional mechanism to require that promises be reconciled with evidence before capital continued to flow. IBM made public claims about Watson’s capabilities. Those claims were never tied to a testable, time-bound performance standard. And no governance body — at IBM, at MD Anderson, or at hospitals purchasing the product — ever formally required the gap to close.
The evidence existed. It simply was not required to matter.
This is not a uniquely IBM problem. It is a structural problem in how enterprise AI investments are governed — and it can exist anywhere capital is being allocated to AI without a mechanism to require evidence to matter.
THE FRAMEWORK: THE EVIDENCE DISCIPLINE AUDIT
Episode 3 introduced the Evidence Discipline Framework — a three-layer diagnostic (Promise Layer, Evidence Layer, Accountability Layer) that maps the gap between what is claimed about an AI investment and what is provable. This issue provides the operational instrument: five questions your board should be able to answer about every active AI investment.
These are not audit questions. They are governance design questions. If your organization cannot answer all five for a given investment, the governance architecture has a structural gap — regardless of whether the investment is performing.
Q1. What specific, time-bound, measurable performance claim was made when this investment was approved?
Most AI investment approvals include a business case with projected ROI, productivity gains, or cost reductions. The question is whether those projections are tied to a testable claim with a defined measurement date. “Watson will help physicians identify optimal cancer treatments” is not a testable claim. “Watson will achieve agreement with oncologist recommendations in 90 percent of cases at Memorial Sloan Kettering (MSK) within 18 months” is. The Promise Layer fails when the approved claim cannot be measured against a defined standard at a defined time.
Q2. What evidence exists today that validates or contradicts the claims being made about this system’s performance — and who has seen it?
By 2014, MSK pilot testing was revealing gaps in Watson’s performance. By 2016, the UT System audit was public. By mid-2017, IBM’s own deputy chief health officer was documenting “unsafe and incorrect” recommendations in internal slide decks. None of this information reached the people making capital decisions in a form that required them to act. The Evidence Layer fails when the evidence exists but has not been surfaced to the governance level with decision-making authority.
Q3. Who in this organization is formally responsible for closing the gap between the approved claim and the available evidence?
This is the accountability design question. It is not asking who ‘owns’ the AI system or who sits on the steering committee. It is asking who is formally required to measure the gap, report it upward, and trigger an escalation response if the gap exceeds a defined threshold. In the Watson Health case, no one held this role at any level of the governance chain. The Accountability Layer fails when the responsibility is implied rather than assigned.
Q4. At what evidence threshold would this investment be paused, restructured, or terminated — and is that threshold documented?
Capital stops are the most powerful governance tool available, and they are almost never defined in advance for AI investments. Most governance processes specify escalation paths for budget overruns, schedule delays, and vendor performance failures. They rarely specify what evidence gap — what divergence between promised and actual performance — triggers a formal review or capital stop. Without a defined threshold, governance becomes reactive: the capital stops when the failure becomes undeniable, not when the evidence first indicates it.
Q5. What would a UT System-style audit of this investment reveal if conducted today?
The UT System Audit Office conducted its review of the MD Anderson–IBM partnership in November 2016. It found: project costs of $62.1M with no clinical deployment, repeated scope changes and deadline failures, and procurement irregularities. The audit was not a judgment on Watson’s scientific foundation — it was a financial and operational controls review. If an independent external body reviewed your organization’s active AI investment with the same scope — financial controls, milestone achievement, governance documentation, vendor accountability structures — what would it find? If the answer is uncertain, the governance infrastructure is not yet audit-ready.
APPLIED EXAMPLE
Consider a VP of Operations at a $500M manufacturing company whose organization has deployed an AI-powered demand forecasting system. The business case approved 18 months ago projected a 15 percent reduction in excess inventory within 12 months.
Running the Evidence Discipline Audit:
Q1: The approved claim is specific and time-bound. The Promise Layer has a measurable standard. ✓
Q2: Inventory levels have not decreased by 15 percent. The system has been producing forecasts, but nobody has formally compared the 12-month outcomes against the approved projection. The evidence gap exists but has not been measured. ✗
Q3: The AI system is ‘owned’ by the data science team. The business case was owned by Operations. Nobody is formally assigned to reconcile the two. The accountability gap is real. ✗
Q4: The governance documentation does not specify what performance gap triggers a formal review. The investment continues because nobody has defined the threshold at which it shouldn’t. ✗
Q5: An audit today would find a $300K annual investment running past its promised ROI date with no formal performance review, no accountability assignment, and no documented escalation trigger. ✗
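The audit walk-through above can be sketched as a simple checklist structure. This is an illustrative Python sketch, not an implementation from the episode; the class name, fields, and the example values are assumptions drawn from the manufacturing scenario in this issue.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AIInvestmentAudit:
    """Hypothetical record for the five-question Evidence Discipline Audit."""
    claim: str                             # Q1: the approved performance claim
    claim_is_measurable: bool              # Q1: defined standard and measurement date?
    evidence_reviewed: bool                # Q2: outcomes formally compared to the claim?
    accountable_owner: Optional[str]       # Q3: who must reconcile claim vs. evidence?
    escalation_threshold: Optional[str]    # Q4: documented trigger for review or capital stop

    def gaps(self) -> List[str]:
        """Return the governance layers with structural gaps."""
        out = []
        if not self.claim_is_measurable:
            out.append("Promise Layer")
        if not self.evidence_reviewed:
            out.append("Evidence Layer")
        if self.accountable_owner is None:
            out.append("Accountability Layer")
        if self.escalation_threshold is None:
            out.append("Escalation threshold")
        return out


# The demand-forecasting example from this issue:
forecasting = AIInvestmentAudit(
    claim="15% reduction in excess inventory within 12 months",
    claim_is_measurable=True,   # Q1 passes
    evidence_reviewed=False,    # Q2 fails
    accountable_owner=None,     # Q3 fails
    escalation_threshold=None,  # Q4 fails
)
print(forecasting.gaps())
# → ['Evidence Layer', 'Accountability Layer', 'Escalation threshold']
```

The point of the sketch is that an audit-ready record is mostly structural: each of the first four questions reduces to a field that is either populated and measurable or not, and Q5 is simply what an outside reviewer would see when reading the record.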
THE GOVERNANCE IMPLICATION
This organization’s AI governance has a functional Promise Layer (the claim was measurable) but a structural failure across the Evidence and Accountability layers, with no documented threshold for escalation. The Watson Health failure operated at scale. The same failure pattern operates at every scale; it is just harder to see before the capital destruction becomes undeniable.
THREE QUESTIONS TO ASK MONDAY
1 | For each active AI investment over $250K: who is formally responsible for measuring whether the approved performance claim has been validated? If the answer is “nobody” or “it’s shared,” you have an Accountability Layer gap.
2 | For your organization’s largest current AI investment: what evidence threshold, if crossed, would trigger a formal governance review or capital stop? If that threshold is not documented, the governance process is operating without a defined escalation trigger. |
3 | If the UT System Audit Office reviewed your most significant AI investment tomorrow, what would its financial controls and milestone achievement findings say? If you are uncertain, the governance documentation is not yet audit-ready. |
WHAT’S NEXT
Episode 3 closed with a direct challenge: does your organization have a mechanism to require evidence to matter? The Evidence Discipline Audit in this issue is the instrument for answering that question.
Episode 4 — publishing April 27 — moves one step earlier in the governance sequence. Before evidence discipline can be applied, an organization needs to know what type of risk it is governing. Episode 4 introduces the AI Risk Classification Matrix: a four-quadrant framework that maps any AI investment on two dimensions — Deployment Autonomy and Reversibility of Consequence — and prescribes the governance structure appropriate to each quadrant.
The question Episode 3 leaves open is: who should be in the room when the evidence gap is reviewed? The answer depends on what kind of risk the investment represents. That is Episode 4.
WATCH EPISODE 3
“The $4 Billion Gap: IBM Watson Health and the Evidence Discipline Failure” is available now. Link in the description and on the channel. If you found this issue useful, forwarding it to a colleague who governs AI investments is the most valuable action you can take.
Strategic Risk Lab | The Governance Brief | Issue #3 | April 20, 2026
