Is AI raising your team's bar?

For many, AI makes you confidently wrong... faster.

May 11, 2026

10 minute read

In the past few years, generating code got faster, but understanding the problem well enough to generate the right code didn’t.

Speed went up along with rejected pull requests. More output, but more rework with it, widening the distance between what the business needed and what got built. The velocity metrics still look fine. The value metrics, where teams are measuring them at all, tell a different story.

The pattern is consistent enough to be worth naming clearly. AI is an amplifier. It accelerates what was already working; however, it also accelerates what wasn’t, which means imprecise requirements and missing guardrails produce consequences faster than they used to. The teams pulling ahead have figured that out. The ones still struggling are treating AI adoption primarily as a speed problem.

Friction Has Purpose

Code generation intentionally creates friction. Writing, compiling, and watching tests fail forced engagement with the problem at each step, and that engagement was also how understanding built. AI tools eliminated much of it, and that was the point.

But generating more code was never the real constraint. The constraint was, and still is, understanding the problem space well enough to generate the right code. That means the customer’s actual expectation, not just the stated requirement. It means knowing which edge cases will matter in production even when they didn’t surface in planning.

AI prompted to “solve a problem” tends to produce a literal interpretation of the prompt. When the spec is imprecise, the output is coherent but wrong. When test intent wasn’t defined before generation, the model generates tests that validate its own assumptions, which may not match yours. The loop between writing code and understanding what it does can disappear entirely, and most development workflows aren’t designed around that shift.

This is why the comprehension requirement went up, not down. Working above the generation layer requires knowing what you’re asking for and how you’ll recognize a wrong answer when you see it. Those judgments now need to happen before the prompt rather than during implementation.

CodeRabbit’s analysis of 470 open-source GitHub pull requests compared AI-generated code against human-written code across logic, security, and maintainability dimensions. The gap was larger than most teams expect.

Intent Before Generation

Concepts like test driven development (TDD) and red-green-refactor aren’t universal. Many teams ship solid software without them, and the argument here isn’t that they should have been practicing TDD all along. What’s different now is what happens when intent isn’t documented before generation starts.

The 2025 DORA report found that AI acts as an amplifier across engineering practices. Teams with strong foundations captured more of the productivity gain, and teams without them absorbed more of the quality cost. Teams running test-driven practices found that AI generated more useful, better-bounded code when the tests already defined the intended behavior. That’s not surprising in retrospect. A model generating code against a clear behavioral specification has less ambiguity to resolve through assumption.

A 2026 study on test-driven agentic development found that giving AI agents generic procedural TDD instructions — “follow red-green-refactor” — without specific targeted test context actually increased regression rates, from 6.08% to 9.94%. Providing specific test context with explicit behavioral intent dropped regressions to 1.82%. Generic process instructions without content add overhead. Specific intent gives the model what it needs to generate bounded output.

In practice, this means documenting what the solution needs to accomplish and how you’ll recognize a wrong answer before a model sees the prompt. Not a full specification; just enough precision that the review becomes a genuine quality check rather than a syntax check. The things worth capturing before generation starts are who this serves and what failure looks like in production. The rest can emerge from the generation, but those two anchors force the review to be about intent rather than compilation.

When the same model writes both the implementation and the tests, it can fail identically in both places. A passing test suite validates internal consistency. It doesn’t validate that the right thing got built.

The phrase “tests all pass” carries implicit weight when humans write the tests. The test is itself a record of understanding. That weight doesn’t transfer when both artifacts emerge from the same generation. The discipline worth recovering isn’t the test itself. It’s the intent that preceded it, and the habit of writing that intent down before asking a model to generate anything.

This also matters when the question broadens beyond “did the tests pass?” to whether the behavior can be explained to a teammate today or a regulator tomorrow. The operational infrastructure for AI in production is a version of this same discipline applied downstream, monitoring whether behavior matches intent over time, not just at the moment of release.

The Compounding Cost

The pattern of failure is documented across a large enough sample to be meaningful. MIT’s Project NANDA tracked more than 300 AI initiatives and found that 95% of pilots failed to produce measurable return. In 2025, organizations invested roughly $684 billion in AI, with about 80% failing to deliver intended business value.

The usual explanations are cost overruns and unclear business value. Those are accurate, but they’re downstream symptoms. The upstream cause is almost always identical. Teams that didn’t define what success looked like before building started failed at predictable rates. Projects with explicit pre-approval success metrics succeed at roughly 54%. Projects without them succeed at around 12%. That’s not an engineering gap. It’s a definition gap, and it’s a product leadership problem as much as a technical one.

I loathe the term “AI Slop” as, in most cases, it’s rarely because of the AI itself. It’s because of the lack of discipline around how to use it effectively. The tools are powerful, but they require a different approach to get the best results.

The Cortex 2026 Engineering Benchmark shows the velocity gains landing alongside meaningful increases in incidents per PR and change failure rates, measured together as the acceleration materialized.

When Speed Outpaces Comprehension

Downstream costs at every level

Dev Teams

+23.5%

incidents per pull request, even as PR volume grew 20% year-over-year. Review burden intensifies as output scales.

Organizations

~30%

rise in change failure rates as AI-assisted output scaled without proportional quality gates. Rework absorbs what acceleration saved.

Customers

41%

of consumers say service quality has worsened due to AI. Trust is the metric that doesn't appear in velocity reports until it's already gone.

Sources: Cortex 2026 Engineering Benchmark · SurveyMonkey 2026 Customer Service Statistics

Rework absorbs what acceleration is supposed to save. Teams work more and ship more, producing more of something that needs to be fixed. That’s confident wrongness in the aggregate. Not dramatic failures, but a quiet accumulation of incidents that erodes the rationale for the investment.

A Different Orientation

Not every organization is in this position. The value from AI isn’t distributing randomly; it’s concentrating in organizations that approached the tooling differently.

The distinguishing factor, according to PwC’s 2026 AI Performance Study, isn’t how much AI gets deployed but how it’s directed. The companies in that 20% aren’t primarily using AI for efficiency. They’re applying it toward growth, pursuing capabilities and business models that weren’t previously accessible, and they’re 2.6 times more likely than peers to report that AI improved their ability to change how they operate, not just how fast.

The pattern that distinguishes them shows up in how work is structured before generation rather than how it’s reviewed after. Success gets defined before build starts. Intent gets documented, briefly and specifically, before a model sees the prompt. Review asks whether the output matches stated goals, not just whether it compiles. The things shipped get measured against whether they actually did what users needed.

Smaller organizations are self-correcting on this faster than large enterprises. Without the organizational buffer to absorb rework costs, the gap between speed and value becomes visible sooner. The teams that respond by raising their comprehension standards, not just their output targets, build process advantages that headcount and budget can’t quickly replicate.

In Practice

The patterns above translate differently depending on where you sit, so it’s worth being specific about what each looks like in practice.

The Growth-First flow will look familiar to product teams that already work from user outcomes. Defining success before building and validating against intent rather than compilation is what good product practice has always required. AI didn’t create that discipline. It just raised what skipping it costs.

For engineering teams, the shift is from generating first and validating after to specifying intent first and generating within it. This doesn’t require formal TDD methodology in every context. It does require that the test intent (what behavior should this verify?) exists before the model generates the tests. The TDAD research finding is worth internalizing. Generic TDD instructions increase regressions; specific behavioral context reduces them sharply. Ambiguity, not the model, is the bottleneck.

For product leaders, the most valuable intervention happens before a line of code is generated. Defining what success looks like, framed in terms users would recognize, is the single biggest predictor of project outcome. The MIT failure data shows a 54% vs. 12% success rate split on whether success metrics were agreed before build started. That’s a product decision, not an engineering one. It’s also the entry point for thinking about explainability. If someone outside the team asked why this feature behaves the way it does, would the answer be ready? For customer-facing AI especially, “we can explain this” is increasingly both a product quality standard and a regulatory posture.

For leadership, the orientation question is the right frame. Velocity measures are easy to improve with AI tools. They’re also easy to optimize in ways that don’t translate to value. The organizations capturing most of AI’s gains are asking what they can do now that they couldn’t do before. Not just “how much faster?” but “what became possible?” The measurement framework worth building tracks quality gate adherence and whether shipped features actually did what users needed. The more interesting signal is where AI-assisted work is enabling genuinely new capability rather than just accelerating existing work.

The tell is in what gets verified. An efficiency orientation checks that output compiles and tests pass. A growth orientation verifies that output matches intent and can be explained to the next person who didn’t write the prompt. Edge cases should have reasoning behind them, not just test coverage. Those are different quality gates, and only one of them raises the bar.

The point isn’t to slow down. AI-assisted development at its best is substantially faster than what came before. The point is that the speed is only valuable if what’s being built matches what was needed, and that alignment requires deliberate work at the front of the process, not cleanup at the back.

Getting There

The organizations doing this well share one common pattern. They treat AI competency as a practice, not just AI access as a tool. The practice includes clear intent before generation and targeted test context rather than generic process instructions. Review criteria ask about goal alignment rather than just compilation. That practice is learnable and improvable, and it compounds in the same direction as the tools themselves.

The teams still struggling are treating this primarily as a tooling and speed problem, optimizing for generation volume while the quality signals arrive late. The fix isn’t less AI. It’s raising the front-end discipline to match the back-end capability.

That’s the bar. The tools can clear it, when they’re given what they need.

blog

Home

About

Blog

Categories

Contact

Resume

Recent Posts

The Container That Isn't a Container

The Coach You Didn't Know You Needed

Grammar Is Code