Enterprise AI Is Failing the Same Way Enterprise IT Always Did

A research team at MIT spent two years studying AI deployments across major enterprises and concluded that 95% of generative AI pilots fail to produce measurable P&L impact. A Stanford study published this month examined 51 enterprise AI deployments across nine industries, tracking what separates programs that generate real business value from those that don't. Both bodies of research arrived at roughly the same place, which is that the reason most enterprise AI programs fail has almost nothing to do with AI. The vocabulary is new. The pattern is not.

It is worth asking whether peer-reviewed research from two of the finest academic institutions in the country is genuinely what moves these conclusions onto a board agenda — and if so, what that says about the board agenda. Four decades of enterprise technology history had already made the same argument at considerable cost to the companies involved. For organizations that find academic credentials insufficient, there is always the option of hiring McKinsey to confirm what is already obvious, which has historically been an effective if expensive approach to the same problem.

What both sets of researchers found, and what anyone who has watched enterprise software cycles play out over the last four decades should recognize, is that organizations keep solving the wrong problem. They treat AI adoption as a technology challenge when it is, and has always been, an organizational one.

The organization is always the variable

The Stanford playbook — produced by Erik Brynjolfsson and colleagues, covering deployments from healthcare to financial services to retail — identifies organizational context as the primary differentiator between AI programs that work and those that don't. Companies that redesigned workflows alongside AI deployment generated consistently better outcomes than companies that deployed AI on top of existing processes. McKinsey's State of AI research, covering nearly 2,000 companies, puts a number on this: high performers — the ones actually scaling AI across their organizations — are 2.8 times more likely to have redesigned workflows as part of the program. They are not simply better at deploying technology. They are better at organizational change.

McKinsey finds that 88% of companies are now using AI in at least one business function. Only a third are successfully scaling it beyond that initial footprint. The result is that almost universal adoption coexists with almost universal failure to generate enterprise-wide value. "This time it's different" is, famously, among the most expensive phrases in financial markets, appearing with reliable consistency just before things turn out to be the same. Enterprise AI has attracted some of the highest equity valuations in market history on precisely that argument. The evidence on enterprise outcomes, so far, is not obviously supportive of it.

The Stanford researchers found, in case after case, that AI success correlates with clear business ownership, metrics connected to real financial outcomes, and — importantly — a culture that treats AI as a tool operated by domain experts rather than a capability managed by technologists. The companies generating measurable returns are the ones where the people closest to the business problem are the ones driving the solution. This is not a surprising finding. It is, however, one that a significant portion of the enterprise market is currently structured to ignore.

The structural reason most pilots produce nothing

The "redesign the workflow" finding is almost universally agreed upon and almost universally vague. Saying that successful firms redesigned workflows is true. It is also unspecific in a way that makes it operationally useless to a leader trying to decide where to spend the next AI dollar. There is a sharper formulation available, and it comes from the recent academic literature on the economics of AI rather than the consultancy literature on the implementation of it.

Joseph Ide and Anton Korinek presented a paper at the 2025 Economics of Transformative AI Workshop that proposes a path-based model of the firm. The firm is treated not as a flat collection of tasks but as a directed graph of production nodes. Researchers feed engineers. Engineers feed operations. Operations feed sales. Each node has its own productivity, its own elasticity, and its own capacity to absorb additional throughput from upstream nodes. The output of the firm is what comes out of the final node, constrained by whichever node along the way is the slowest.

The mathematical consequence of this model is the result that explains the 95% failure rate without recourse to any organizational pathology at all. If a firm makes its sales team 25% more productive, the gain shows up in revenue. The sales node is the last step. Nothing downstream is bottlenecking it. If the firm doubles the productivity of its researchers fifty steps back from revenue, the gain is absorbed by every downstream node that has not been touched. The output of the firm does not move. The bottleneck is somewhere between the lab and the customer, and the bottleneck does not care that the researchers are working faster.
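The arithmetic is easy to check with a toy version of the model. The sketch below is a deliberate simplification and not the paper's own formulation: it assumes a straight chain rather than a general graph, gives each node a fixed productivity multiplier and a hard cap on how much input it can absorb, and ignores elasticities entirely. The `Node` class, the function names, and every number are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Node:
    name: str
    productivity: float  # units of output per unit of input the node processes
    capacity: float      # most input the node can absorb in one period


def firm_output(chain, raw_input):
    """Push input through the chain in order; each node caps what it can absorb."""
    flow = raw_input
    for node in chain:
        flow = min(flow, node.capacity) * node.productivity
    return flow


def with_uplift(chain, name, factor):
    """Return a copy of the chain with one node's productivity multiplied by factor."""
    return [
        replace(n, productivity=n.productivity * factor) if n.name == name else n
        for n in chain
    ]


# Hypothetical four-node firm; the numbers are illustrative only.
baseline = [
    Node("research", 1.0, 100.0),
    Node("engineering", 1.0, 50.0),
    Node("operations", 1.0, 40.0),
    Node("sales", 1.0, 100.0),
]

base = firm_output(baseline, raw_input=100.0)
research_doubled = firm_output(with_uplift(baseline, "research", 2.0), raw_input=100.0)
sales_plus_25 = firm_output(with_uplift(baseline, "sales", 1.25), raw_input=100.0)

print(base, research_doubled, sales_plus_25)  # 40.0 40.0 50.0
```

Under these assumptions, doubling research output changes nothing at the end of the chain, because engineering and operations can absorb no more than they already do. The 25% uplift at sales passes straight through to the final number.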

Almost every enterprise AI pilot in the field today is being measured at a single node. The pilot delivers a productivity uplift inside one team. The team reports the uplift. Nobody traces whether the uplift survives the trip through every downstream constraint that determines whether the firm's actual output changes. In most cases, it does not. The pilot is technically successful and financially invisible. This is what the 95% number actually looks like at close range. It is not that the technology failed. It is that the technology delivered exactly what it promised at a node that did not matter to the financial outcome.

This also explains a related fact about how AI is currently being deployed inside large companies, which is that the only ROI mechanism most pilots can demonstrate is cost reduction at the node where the pilot lives. If the productivity uplift cannot push more output through the bottleneck downstream, the only available financial benefit is to take the cost of the node down. That is what most enterprises are doing. They are reporting cost reductions at non-bottleneck nodes and treating them as evidence that the AI is working. The income statement six quarters later will not, in most cases, support that claim. The savings will be partially absorbed by the unintended consequences of removing competence at a node that had been holding the rest of the system together — Sutherland's doorman fallacy at organizational scale — and the productivity uplift the pilot was supposed to deliver will turn out to have been an accounting artefact.

The pattern this produces is the same pattern the Stanford and MIT researchers are documenting, with one difference. Their research describes the symptoms. The path-based model describes the underlying mechanism. Once you see it, the rest of the failure pattern reads as a series of consequences rather than a list of independent organizational pathologies. The Center of Excellence problem, the proof-of-concept purgatory, the coexistence of universal adoption and universal failure to scale — all of these are downstream of a more basic problem, which is that almost no firm has done the analytical work to identify where its bottlenecks actually are. Until that map exists, every AI investment is a guess about whether the spend will land at the bottleneck or somewhere accessory to it. The base rate of guessing right is, evidently, around five percent.
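What building that map involves can be sketched in a few lines. This is an illustration under stated assumptions, not a method taken from the research: it assumes the firm can be drawn as a directed acyclic graph, that each node needs all of its upstream inputs, and that each node has a single known capacity number. The function `find_bottleneck`, the node names, and the capacities are all hypothetical.

```python
def find_bottleneck(upstream, capacity, final_node):
    """Return (binding_node, firm_throughput) for a DAG of production nodes.

    Assumes every node needs all of its upstream inputs, so a node's
    throughput is the minimum of its own capacity and its feeders' throughput.
    """
    memo = {}

    def throughput(node):
        if node not in memo:
            feeds = [throughput(u) for u in upstream.get(node, [])]
            memo[node] = min([capacity[node], *feeds])
        return memo[node]

    total = throughput(final_node)
    node = final_node
    # Walk backwards from revenue until reaching a node whose own capacity binds.
    while capacity[node] > total:
        node = min(upstream[node], key=throughput)
    return node, total


# Hypothetical firm: names, edges, and capacities are illustrative only.
upstream = {
    "sales": ["operations", "marketing"],
    "operations": ["engineering"],
    "engineering": ["research"],
    "research": [],
    "marketing": [],
}
capacity = {"research": 120, "engineering": 60, "operations": 45,
            "marketing": 80, "sales": 100}

bottleneck, total = find_bottleneck(upstream, capacity, "sales")
pilot_node = "research"  # where a hypothetical AI pilot was deployed
print(bottleneck, total, pilot_node == bottleneck)  # operations 45 False
```

If several nodes bind at the same level, the walk returns only one of them. The hard part in practice is not the traversal but obtaining capacity numbers anyone trusts, and that knowledge usually sits with the people running each node.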

Centralization, again

There is a specific failure mode that the research documents, and it deserves its own description because organizations are currently building it at scale and calling it best practice.

Accenture's research suggests that between 80 and 85% of enterprises are stuck in what it describes as "proof of concept purgatory": generating pilots that work in a controlled setting, fail to scale, and get replaced by new pilots. There is a separate question worth asking, which is what happens to those pilots after they fail. The evidence suggests they become the foundation for the next pilot. This is sometimes described as an iterative learning process.

Most enterprise AI governance now centers around a Center of Excellence model — a centralized function, typically reporting into IT or a Chief Data Officer, that owns AI strategy, standards, and implementation across the enterprise. The logic is sensible on its face: AI is complex, risk needs to be managed, and you want to avoid 47 different business units running 47 different experiments with no coherent architecture underneath them. The underlying assumption is that the people who understand the technology are the right people to govern its application to business problems. This is roughly equivalent to concluding that the best person to run a hospital's diagnostic program is the assembly worker who put together the motherboard inside the MRI machine. Not the radiologist interpreting the image. Not even the engineer who designed the scanner. The person with the deepest familiarity with the hardware.

The problem is that this is exactly how enterprise IT has been organized before, and the results have been consistent. It is, in Yogi Berra's formulation, déjà vu all over again. Mainframe computing was centralized. ERP implementation was centralized. CRM deployment was centralized. In each cycle, the central function controlled the technology while the business users — the people who actually understood the operational problems the technology was supposed to solve — were positioned as requirements-providers rather than solution-builders. In each cycle, the implementations that generated value were the ones where domain experts had genuine ownership of the outcome. The ones that failed were the ones where the distance between technical implementation and business context was too large.

The Stanford research documents this dynamic directly. When AI is technology-led and technology-first, it does not work. The invisible costs — organizational disruption, change management, misalignment between technical capability and business need — account for the majority of what determines whether an AI program succeeds or fails. Not the model. Not the infrastructure. The organizational scaffolding around it.

The Center of Excellence model, built on the assumption that AI requires a specialized technical function to own it, systematically recreates the conditions that have historically produced failed enterprise technology programs. It places AI decisions in the hands of people who understand the technology but not the domain — and then wonders why the output doesn't match what the business actually needs. In the path-based model, this is also the structural reason the CoE keeps spending at the wrong nodes. The team picking the deployment targets does not own the bottleneck and is not measured against the bottleneck. They are measured against the count of pilots launched, the number of business functions touched, the model performance metrics inside the pilot. Each of these is a measurement at a node. None of them is the firm's actual constraint. The CoE, by the construction of its incentives, is structurally guaranteed to invest at the wrong place.

The scarcest variable in enterprise AI is not compute. It is domain expertise: the deep, specific understanding of a business problem, its operational drivers, and what a useful prediction or recommendation in that context actually looks like. Domain expertise does not live in the center. It lives in the business functions, the customer teams, the people who deal with the consequences of operational decisions every day. Centralizing AI governance means systematically separating AI capability from the one thing that makes it useful — and from the one source of knowledge in the firm that would, if consulted, identify the bottleneck the spend should be aimed at.

Socializing the failure

The obvious question is why this pattern persists. If centralized IT governance has produced the same outcome across every major technology cycle for forty years, and if the research documenting that outcome is now extensive enough to fill a board presentation, why are organizations building Centers of Excellence at scale and describing them as best practice?

The answer has less to do with ignorance than with incentives. Internally built AI programs represent career opportunities for the people running them — specifically the opportunity to build something significant, with meaningful budget and headcount, attached to their name. The outcome of the program is a secondary consideration. The existence of the program is the primary one.

When those programs fail, the failure is rarely absorbed by the individuals who made the decision to build. It is socialized. The technology was not ready. The data was not clean enough. AI as a concept was overhyped and the market got ahead of itself. These explanations are available in large quantities and have the advantage of being partially true, which makes them difficult to argue with. The decision to build internally rather than deploy a proven vendor solution is almost never identified as the cause, because doing so would require acknowledging that the decision was wrong — and the person who made it was sufficiently senior to have made it with confidence.

This makes internal AI development a relatively low-risk career choice even when it produces poor organizational outcomes. You build something, it fails to scale, the failure is attributed to structural forces beyond anyone's control, and the organization moves to the next program. The costs are absorbed by shareholders and, eventually, customers.

There is a counter-pattern worth naming, because most career advice quietly ignores it. The fastest trajectories in any large organization belong to the people who solve problems significant enough that executives want them in the room next time. Executive leadership, in my experience, is interested in results — not in the dignified management of programs that produce none. The career structured to avoid being blamed for a failure is also structured to never be credited with a success, and over a long enough horizon the absence of credit is the more expensive outcome. Safety, in this sense, is not actually safe. The people running the lowest-risk AI programs in their organizations today are, in many cases, also the people whose careers are quietly stalling because nothing they have built has produced a result worth noticing.

The mechanism that breaks this is straightforward, though it requires a level of leadership honesty that is not common. Senior leaders can require that any team choosing to build internally rather than purchase must formally commit to measurable business outcomes — not program metrics or model performance scores, but revenue, cost, or risk outcomes with a defined timeline and a named owner. If you want to build it your way, you take accountability for what it produces.

The problem with this mechanism is that it only works if measurable outcomes exist to begin with. And if the program has been positioned as a sandbox, a pilot, or a proof of concept — which most of them have — then the bar has already been set low enough that accountability becomes a theoretical construct. The sandbox framing is not accidental. It is the organizational equivalent of not keeping score, which ensures that no one loses and nothing is required to change.

The case for buying rather than building

The MIT research produces one finding that the enterprise market has been slow to absorb: vendor-deployed AI solutions succeed at roughly twice the rate of internally built ones — approximately 67% versus 33%. This gap does not shrink with more internal investment. It is structural.

The structural reason is worth spelling out. Building an enterprise AI capability internally requires domain expertise, data infrastructure, model development, deployment capability, and the organizational capacity to iterate on all of these simultaneously. Most enterprises have some of these things and none of them completely. Vendor solutions — the ones that work — are built by organizations that have already solved the same problem for multiple clients and have incorporated those learnings into the product. The R&D is pooled. The domain knowledge is accumulated. The risk has already been absorbed by someone else's budget.

There is a related accounting convention worth examining. The standard business case for internal AI development typically treats internal employee time as carrying no marginal cost — the people working on the project are, in the logic of the spreadsheet, free. If this principle were applied consistently, presumably those employees would be happy to confirm it by volunteering their salaries for the duration of the project. The fully loaded cost of a mid-size internal AI development team almost certainly exceeds the annual license cost of purpose-built vendor software. The math that makes internal builds look economically attractive depends heavily on not counting the people doing the building.
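The calculation the spreadsheet skips is short enough to write down. The sketch below is only a template: the function names, the overhead-rate convention, and every input figure are placeholders rather than data from the research. The placeholder values were chosen to show how the ranking can flip once people are counted, and whether it flips for a given firm depends entirely on that firm's own numbers.

```python
def fully_loaded_team_cost(headcount, avg_salary, overhead_rate):
    """Annual people cost: salary plus benefits, tooling, and management overhead."""
    return headcount * avg_salary * (1.0 + overhead_rate)


def build_vs_buy(headcount, avg_salary, overhead_rate, infra_cost, vendor_license):
    """Compare a one-year internal build against a one-year vendor license."""
    people = fully_loaded_team_cost(headcount, avg_salary, overhead_rate)
    return {
        "build_spreadsheet_view": infra_cost,       # people treated as free
        "build_fully_loaded": infra_cost + people,  # people counted
        "buy": vendor_license,
    }


# Placeholder inputs only; substitute your own figures.
result = build_vs_buy(
    headcount=8,
    avg_salary=150_000,
    overhead_rate=0.3,
    infra_cost=200_000,
    vendor_license=500_000,
)
for label, cost in result.items():
    print(f"{label}: {cost:,.0f}")
```

The only design choice that matters here is reporting both views of the build cost side by side, so the business case has to state which one it is using.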

The Stanford research adds something important to this picture. Approximately 42% of enterprise AI implementations treat the underlying model as essentially interchangeable — moving between providers without significant impact on outcomes. If you are building an AI program around the assumption that your competitive advantage lies in your choice of foundation model, you are building on the wrong layer. The durable advantage sits above the model, in the orchestration layer: the logic connecting AI capability to operational context, the data pipelines, the workflow integration, the feedback loops. This is what takes years to develop and cannot be replicated quickly. Organizations building that layer internally are absorbing both the cost and the risk of development. Organizations deploying purpose-built applications are accessing it on day one.

I want to be clear that this next observation is a conclusion I draw from the research rather than something the researchers themselves argue. Every technology cycle moves through a predictable sequence. Infrastructure is built first — the picks and shovels. Early applications emerge that are crude but functional. Eventually, purpose-built software displaces DIY assembly as the rational choice for most buyers. The gold miners who built their own shovels in 1849 were eventually replaced by gold miners who bought them. Enterprise AI is currently somewhere in the middle of this transition. The infrastructure layer — foundation models, compute, APIs — is commoditizing faster than most organizations have adjusted their build-versus-buy calculus. The value is migrating to the applications layer: purpose-built, domain-specific software that incorporates accumulated expertise and pooled R&D. Organizations currently investing heavily in internal AI infrastructure are making a bet that the economics of this transition will not play out the way every previous technology transition has played out.

There is a thought experiment that clarifies what kind of bet that is. Imagine a steady state in which every Fortune 1000 company runs its own bespoke AI capability — built internally, on a unique architecture, maintained by its own engineering organization. Does that sound credible? Has any prior technology cycle ever ended in that equilibrium? Mainframe computing, ERP, CRM, supply chain, expense management, payroll — every one of these started with custom internal builds and ended with a small number of dominant vendors serving most of the market. The companies that stuck with homegrown systems through the 1980s spent the next two decades describing them as legacy traps. The cost of maintaining bespoke proprietary technology absorbs the engineering capacity that would otherwise be spent innovating, and the firms carrying that load steadily lose the ability to keep pace with the firms that aren't. The argument that AI is structurally exempt from this pattern requires a specific belief that almost no one has the courage to state out loud — that your company has assembled a top-tier AI engineering team, better than the teams at the firms whose entire business is building this technology. On the available evidence of where AI talent actually works, that belief is not well supported.

Three things worth acting on

The research points in a fairly consistent direction, and the conclusion I'd draw is that most enterprise AI programs are not primarily failing because of technology. They are failing because of how they are structured, measured, and governed. That suggests three decisions worth revisiting.

The first is how you define success. The Stanford research finds that the clearest differentiator between programs that generate value and programs that don't is whether outcomes are defined in financial terms from the start. Not engagement metrics, not adoption rates, not model performance scores — revenue, cost, or risk outcomes that connect to the P&L. If your AI program cannot currently trace a line between what it does and a financial outcome, the probability that it is in the 5% generating real business impact is low. The path-based model adds a sharper test of the same point. Trace the financial outcome backwards through the production graph. Find the slowest-moving step. That is the bottleneck. If the AI spend is somewhere else in the graph, it is paying a bottleneck tax to the rest of the system and the financial outcome is not going to arrive on the timeline the spreadsheet projected.

The second is how you weight domain expertise in build-versus-buy decisions, and over what time horizon you think the bet has to hold. The standard procurement framework optimizes for technical capability, integration flexibility, and price. It rarely treats domain knowledge (accumulated understanding of a specific business problem across many deployments) as a primary evaluation criterion. Given that the MIT data suggests vendor solutions succeed at roughly twice the rate of internal builds, and given that the structural reason for that gap is domain expertise, the procurement framework may be optimizing for the wrong variables. The time-horizon point is connected. Organizations building significant internal AI infrastructure are betting that the technology layer will be their durable advantage. The Stanford research on model interchangeability and the broader pattern of technology commoditization both suggest that advantage has a shorter shelf life than the investment implies. Durable competitive advantage in AI is concentrating in the applications layer: purpose-built, domain-specific, with R&D pooled across multiple clients. That migration changes where the build-versus-buy line should sit.

The third is whether your governance model is recreating the centralization failure. If your AI Center of Excellence is primarily staffed with technologists, reports into IT or a central data function, and positions business units as stakeholders rather than owners, you are running a program that is structurally similar to the ones that failed in every previous enterprise technology cycle. The question is not whether centralization has any value — some coordination and standards-setting is useful — but whether the people with the deepest domain expertise have genuine decision-making authority over AI programs in their areas, or whether they are being asked to provide requirements to a function that will translate them into something unrecognizable.

There is a reasonably simple test of whether any of this has been adequately considered. Take whatever generative AI tool your organization is currently piloting and ask it: "What does the last 40 years of enterprise technology history suggest about centralizing technology governance away from the people who understand the business problem — and does our current AI program repeat that mistake?" The answer will be available in seconds. The willingness to act on it is the harder variable.

For those who prefer a shorter prompt: "Are we the baddies?" The question, in this context, has nothing to do with ethics. It is asking whether your program is structured to produce measurable financial outcomes or structured to avoid being asked whether it has. Most AI tools will answer it honestly.

MIT says 5% of enterprise AI programs are generating measurable financial impact. The question worth asking is not whether you are using AI. It is whether you are in the 5%, and if not, whether the structure of your program gives you any real reason to expect that will change.