The Most Important Question in AI Isn’t “What Happened?” It’s “Why?”

June 8th, 2026

At ENGAGE NYC 2026, attendees heard from leading AI and analytics experts on some of the biggest challenges shaping enterprise AI adoption today.

One of the most thought-provoking sessions came from Foster Provost, Professor of Data Science and Ira Rennert Professor of Entrepreneurship at NYU Stern School of Business, and a widely recognized expert in machine learning, AI and data-driven decision-making. His session, “Help! Why did my AI flag this?,” focused on one of the most urgent and increasingly complex issues organizations face as AI adoption accelerates: explainability.

AI can detect suspicious behavior in seconds. It can identify hidden patterns across millions of transactions, surface emerging risks and flag activity that a human analyst would struggle to find on their own.

But there’s a growing problem at the center of modern AI adoption: What happens when a system flags something, and no one can be 100% certain why?

One of the central themes from Provost’s presentation was that it is vital for organizations to be able to explain AI behavior and have faith in individual decisions and actions. Model scores, feature importance rankings and technical performance metrics may help explain how a model behaves mathematically, but they do not necessarily explain why a system acted on a specific case.

As Provost emphasized during the session, the AI system is not just the model, and the decision is a lot more than the prediction.

That distinction shapes almost every conversation about explainability.

A model score, essentially the AI system’s calculated risk rating, may tell you that something appears risky. A dashboard may show which variables may have contributed to the score. But neither necessarily explains the actual reason an alert occurred. And in high-stakes environments, that difference matters. A lot.

When AI Flags Something Suspicious

Consider a scenario where traders are exchanging Bloomberg messages filled with coded language and market slang.

At first glance, the conversation sounds vague, almost harmless. But to the AI system, something about it crossed the line and generated an alert.

The discussion includes phrases suggesting unusual market activity, information sharing and potentially evasive language. One trader refers to “the elephant in the room” getting “seriously involved in heavy metal,” while another talks about “vacuuming up the lit book all morning.”

To a human analyst, these phrases might suggest hidden meaning, possible information sharing or even front-running activity. So it may appear “normal” or “expected” that the AI generated a high suspicion score. But the real challenge for the system designer and operator is whether the AI can identify the actual evidence behind the decision in a way sponsors can understand.

AI Doesn’t Think Like Humans

Another key theme from the session was that AI systems are fundamentally “evidence-combining systems.” They are not reasoning like human investigators reviewing a case. Instead, they evaluate evidence, combine signals, calculate probabilities and determine whether accumulated evidence crosses a threshold for action.
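To make the "evidence-combining" idea concrete, here is a minimal Python sketch. It assumes a simple logistic-style model; the evidence names, weights, bias and threshold are all hypothetical and were not presented in the session.

```python
import math

# Hypothetical log-odds weights for each piece of evidence (illustrative only).
WEIGHTS = {
    "unusual_market_activity": 0.9,
    "information_sharing": 1.2,
    "evasive_language": 1.5,
}
BIAS = -3.0       # assumed prior: most conversations are not suspicious
THRESHOLD = 0.5   # assumed probability cutoff for raising an alert


def risk_score(evidence):
    """Combine the evidence present in a case into a probability."""
    log_odds = BIAS + sum(WEIGHTS[item] for item in evidence)
    return 1.0 / (1.0 + math.exp(-log_odds))


def should_flag(evidence):
    """Act only when the accumulated evidence crosses the threshold."""
    return risk_score(evidence) >= THRESHOLD


all_evidence = {"unusual_market_activity", "information_sharing", "evasive_language"}
print(should_flag(all_evidence))           # all three signals together
print(should_flag({"evasive_language"}))   # one signal on its own
```

In this toy setup no single signal is enough to trigger an alert; only the combination crosses the threshold. That is the behavior that becomes hard to explain once the model is far more complex than a weighted sum.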

That sounds straightforward, until someone asks the system to explain itself.

Many modern AI systems rely on extremely complex mathematical models that combine evidence in ways humans cannot interpret easily, or perhaps at all. As organizations increasingly incorporate generative AI into enterprise systems, the explainability challenge becomes even more complicated.

One quote from the session captured that tension particularly well: “LLMs don’t actually explain their reasoning, even though they’ll tell you that they can.”

Large language models (LLMs), the technology behind generative AI systems like ChatGPT, are incredibly fluent. They sound authoritative and confident. They generate polished justifications instantly. But those justifications are not true explanations of why the AI system arrived at any specific decision.

And in regulated industries, “the AI said so” is becoming an increasingly unacceptable answer. That’s one reason explainability has moved far beyond a technical discussion. It is now an operational issue, a governance issue, and increasingly, a regulatory requirement. Frameworks such as the EU AI Act, GDPR, financial model risk management guidance and AML compliance expectations are all pushing organizations toward more transparent and defensible AI decision-making.

The Problem with Explaining AI Decisions

One of the biggest challenges highlighted during the presentation is that many common explainability techniques were not originally designed to explain decisions in the first place. Provost pointed specifically to SHAP (SHapley Additive exPlanations), a widely used method for estimating how much each input feature contributed to an AI model’s prediction score. As he explained during the session: “SHAP computes each feature’s averaged contribution to the model prediction, one internal aspect of the AI system, not to the decision to flag.”

In many AI systems, the model first generates a score, such as a fraud or risk score. The system then applies additional rules or thresholds to decide whether the case should actually be flagged. That distinction matters because a feature may strongly influence the score while having little impact on the final decision. Meanwhile, another signal with far less apparent importance may be the factor that actually pushed the case across the threshold and triggered the alert.

In some cases, the highest-ranked SHAP feature can have virtually no effect on the outcome, while another feature with almost no SHAP importance turns out to be decisive. In practical terms, this means organizations can end up explaining the wrong thing entirely.
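The gap between score and decision can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the feature names, the weights, the threshold and the extra business rule are hypothetical, and the contributions are computed by hand in the style of SHAP for a linear score (weight times deviation from a baseline) rather than with the SHAP library itself.

```python
# Hypothetical linear risk model plus one system-level rule (illustrative only).
WEIGHTS = {"message_volume": 0.50, "slang_density": 0.30, "restricted_ticker": 0.02}
BASELINE = {"message_volume": 0.0, "slang_density": 0.0, "restricted_ticker": 0.0}
SCORE_THRESHOLD = 0.3


def score(case):
    """Model prediction: a weighted sum of the feature values."""
    return sum(WEIGHTS[f] * case[f] for f in WEIGHTS)


def contributions(case):
    """SHAP-style attribution for a linear score: weight * (value - baseline)."""
    return {f: WEIGHTS[f] * (case[f] - BASELINE[f]) for f in WEIGHTS}


def flags(case):
    """System decision: high score AND an assumed rule on restricted tickers."""
    return score(case) >= SCORE_THRESHOLD and case["restricted_ticker"] > 0


def decisive_features(case):
    """Features whose removal alone (reset to baseline) stops the flag."""
    if not flags(case):
        return []
    return [f for f in WEIGHTS if not flags({**case, f: BASELINE[f]})]


CASE = {"message_volume": 1.0, "slang_density": 1.0, "restricted_ticker": 1.0}
ranking = contributions(CASE)
print(max(ranking, key=ranking.get))  # feature with the largest score contribution
print(decisive_features(CASE))        # feature(s) that actually control the flag
```

In this sketch the feature with the largest contribution to the score can be removed without changing the decision, while the feature with the smallest contribution is the one the alert depends on, because the decision includes a rule that sits outside the model.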

What Are Counterfactual Explanations?

If traditional explainability techniques struggle to explain why a system actually took action, what should organizations use instead? One of the central concepts explored during the session was counterfactual explanations.

Rather than asking which variables contributed to a score, counterfactual explanations ask a much more powerful question: “What is a collection of evidence such that if it were not present, the decision would not have been made?” Put more simply, counterfactual explanations try to identify sets of evidence whose removal, on its own, would have changed the outcome.

That shift changes explainability from abstract math into something operationally meaningful. Instead of producing generic importance rankings, counterfactual explanations isolate the actual evidence responsible for an alert. Remove that evidence, and the system no longer flags the case.

Provost referred to these as “evidence counterfactuals,” the irreducible pieces of evidence without which the decision would have changed.

To make the idea more relatable, the presentation also introduced a fictional consumer named Mariko receiving a Pottery Barn ad. Why did she receive it? Because the AI system combined evidence from her online behavior: furniture websites, real estate browsing, cooking content and even visits to AmericanIdol.com, which happened to be a surprisingly effective demographic proxy for Pottery Barn shoppers.

Remove those signals, and the ad disappears. Simple example. Big implications.
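The Mariko example can be turned into a rough Python sketch of an evidence-counterfactual search. This is a brute-force illustration under stated assumptions, not the method presented in the session: the weights, the threshold and the extra "news sites" signal are hypothetical, and a real system would need a smarter search than trying every subset.

```python
from itertools import combinations

# Hypothetical evidence weights for showing the ad (integers keep the math exact).
WEIGHTS = {
    "furniture websites": 8,
    "real estate browsing": 9,
    "cooking content": 7,
    "americanidol.com": 11,
    "news sites": 1,  # assumed weak, mostly irrelevant signal
}
THRESHOLD = 26  # assumed cutoff for deciding to show the ad


def shows_ad(evidence):
    """System decision: show the ad if combined evidence reaches the threshold."""
    return sum(WEIGHTS[item] for item in evidence) >= THRESHOLD


def evidence_counterfactuals(evidence):
    """Find irreducible evidence sets whose removal flips the decision.

    Subsets are tried smallest first, and any candidate that contains an
    already-found set is skipped, so every result is minimal.
    """
    evidence = frozenset(evidence)
    if not shows_ad(evidence):
        return []
    found = []
    for size in range(1, len(evidence) + 1):
        for subset in combinations(sorted(evidence), size):
            removed = frozenset(subset)
            if any(prev <= removed for prev in found):
                continue  # a smaller explanation already covers this set
            if not shows_ad(evidence - removed):
                found.append(removed)
    return found


for explanation in evidence_counterfactuals(WEIGHTS):
    print(sorted(explanation))
```

Each printed set is a self-contained answer to "why did Mariko see this ad?": take that evidence away and, under these assumed weights, the ad is no longer shown. A case can have several such explanations, which is part of what makes them useful to an analyst.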

Why This Matters Beyond Data Science

So, what is the actual implication?

Once organizations, particularly those operating in highly regulated industries like financial services, begin thinking in terms of evidence rather than scores, explainability becomes much more actionable.

Analysts can investigate alerts more intelligently instead of guessing. False positives become easier to triage. Managers have more confidence when deciding to approve production systems. Organizations can better identify whether models learned legitimate patterns or dangerous correlations. And perhaps most importantly, users start trusting the AI system more.

One of the strongest themes running throughout the session was that people often resist AI systems when they cannot understand the reasoning behind them, even when the AI demonstrably outperforms human judgment.

This insight may ultimately define the next phase of enterprise AI adoption.

The future of AI will not depend solely on whether systems can generate accurate predictions. It will depend on whether humans can understand, trust, govern and defend the decisions those systems make.

In other words, when the AI system raises a flag, the most important response may no longer be: “What happened?” Rather, it will be: “Show me why.”
