Is Your AI Product Worth It?
Most teams focus on AI features while ignoring the economics that will make or break them

I recently helped retrospect a “successful” AI chatbot implementation as a friendly consultant. The team was celebrating: a 70% engagement rate! Users were interacting with the bot in droves.
I asked two questions:
- “Are those interactions actually solving customer problems?” Silence.
- “And what’s your monthly AI cost running?” More silence.
It turned out nobody had checked whether engaged users were satisfied users. The bot was great at conversation, terrible at resolution. Adding insult to injury, their pilot budget of $200/month had ballooned to $20,000 in the first month. Nobody had modeled whether the cost exceeded the value of what it was replacing: in this case, tier-1 support ticket deflection.
They’d optimized for the wrong metric entirely. This isn’t unusual.
Most AI projects start with the wrong question: “What cool AI features can we build?”
The right question is always: “What customer problems are we trying to solve?”
Only then ask: “Can AI provide an optimization path for us?”
Most problems you want to solve with AI can be solved with other tools and technologies. AI provides a different path, usually optimized for speed, scale, or personalization. But optimization only matters if you’re solving the right problem and the economics work.
The Hidden Economics of AI Implementation
I learned this reality the hard way. Early in one of my startup ventures, we built a recommendation engine that looked brilliant in testing. Production costs blindsided us (back in the day of paying per SQL Server transaction in shared racks). We’d optimized for accuracy without modeling real-world query patterns, like how many times a user might refresh the screen. That expensive lesson shaped how I approach AI economics now.
I’ve written before about the quiet power of invisible AI. But even when you know what to build, most teams drastically underestimate the hidden costs of AI implementation.
The challenge is that most teams aren’t used to accounting for usage this way; it’s new, and it’s hard to estimate for net-new projects. According to Gartner research, 30% of generative AI projects will be abandoned after proof of concept by the end of 2025, with cost overruns being a primary factor.
Understanding Token Pricing
Before diving into costs, here’s how the pricing works. Most AI services (OpenAI, Anthropic, Azure) charge per token, essentially a small chunk of text your system sends and receives. A typical conversation might use 1,000 to 5,000 tokens, and when you’re handling thousands of users daily, those add up fast.
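To make per-token pricing concrete, here’s a back-of-envelope cost model. The prices and usage numbers below are illustrative assumptions, not any provider’s actual rate card; plug in your own.

```python
# Back-of-envelope model for per-token pricing.
# Prices are placeholder assumptions -- check your provider's current rate card.
PRICE_PER_1K_INPUT = 0.003   # USD per 1,000 input tokens (assumed)
PRICE_PER_1K_OUTPUT = 0.015  # USD per 1,000 output tokens (assumed)

def conversation_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one conversation in USD."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + \
           (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

def monthly_cost(daily_users: int, convs_per_user: float,
                 avg_in: int, avg_out: int, days: int = 30) -> float:
    """Scale per-conversation cost to a monthly estimate."""
    return conversation_cost(avg_in, avg_out) * daily_users * convs_per_user * days

# A "small" pilot: 1,000 daily users, 2 conversations each,
# 3,000 input + 1,000 output tokens per conversation.
print(f"${monthly_cost(1000, 2, 3000, 1000):,.2f}/month")
```

Even at these modest assumptions the bill lands in the four figures per month, which is why a pilot that never ran this arithmetic gets surprised.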
The Cost Reality Gap
The RAG Explosion: A knowledge base chatbot was pulling 50,000+ tokens per query because nobody optimized retrieval. The team assumed “more context equals better answers.” Their monthly bill suggested otherwise.
Conversation Memory Overflow: A customer service bot was maintaining full conversation history for “better context.” Each interaction snowballed, with later questions costing 10x more than initial ones due to context window expansion.
The Retry Death Spiral: When AI calls failed, systems retried with full context. Network hiccups became budget disasters as expensive queries fired multiple times.
Going back to our earlier example, that 10x cost multiplier became a quick and painful reality in their first month.
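The retry death spiral in particular is cheap to prevent. Here’s a sketch of a bounded retry wrapper that caps attempts and trims context before retrying, so a network hiccup can’t multiply an already expensive query. `call_model` and the trim threshold are placeholders for your own API wrapper and budget.

```python
import time

MAX_RETRIES = 2          # bound the retry bill, not just the failure rate
TRIM_TO_TOKENS = 2000    # retry with trimmed context, not the full prompt

def call_with_budget(call_model, prompt_tokens: list[str]):
    """Retry a failed model call at most MAX_RETRIES times, trimming
    context on each retry so repeats get cheaper instead of pricier.
    `call_model` is a placeholder for your provider's API wrapper."""
    context = prompt_tokens
    for attempt in range(MAX_RETRIES + 1):
        try:
            return call_model(context)
        except ConnectionError:
            if attempt == MAX_RETRIES:
                raise                            # give up; don't loop forever
            context = context[-TRIM_TO_TOKENS:]  # keep only recent context
            time.sleep(0.1 * 2 ** attempt)       # exponential backoff
```

The key property: the worst-case spend of a flaky request is known in advance, instead of being unbounded.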
The 10x Cost Multiplier
Production costs routinely run an order of magnitude above pilot estimates. Why? Real users have longer conversations, expect comprehensive answers, and trigger complex retrieval patterns you probably didn’t test in pilots.
What “AI-Ready” Actually Means
According to the AI Infrastructure Alliance’s 2024 survey, 96% of companies plan to expand their AI compute capacity and investment. However, Gartner research shows that 63% of organizations either do not have or are unsure if they have the right data management practices for AI. AI-ready means you can model and control costs, not just implement features.
Token Budget Architecture
Before building any AI feature, establish token budgets per user interaction.
Maximum token spend = (Value created per interaction × Success rate × Target margin)
Example: If deflecting a support ticket saves $15 (value created) and your bot successfully resolves 60% of interactions (success rate), your break-even token cost is roughly $9 per conversation. Build your target margin in from there.
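The formula above can be sketched directly, treating target margin as a discount off break-even (one reasonable reading of the formula; adjust to however your finance team defines margin):

```python
def max_token_spend(value_per_interaction: float,
                    success_rate: float,
                    target_margin: float = 0.0) -> float:
    """Maximum spend per interaction that still hits your margin target.
    With target_margin=0 this is the break-even point."""
    return value_per_interaction * success_rate * (1 - target_margin)

# The article's example: $15 ticket deflection, 60% resolution rate.
break_even = max_token_spend(15.0, 0.60)         # break-even: $9.00
with_margin = max_token_spend(15.0, 0.60, 0.50)  # at a 50% margin: $4.50
print(break_even, with_margin)
```

Dividing that dollar ceiling by your per-token price then gives the token budget each conversation has to live within.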
Keep in mind that while we’re focusing on token budgets, you should understand your full end-to-end costs. Don’t forget about:
- API fees (input and output)
- Compute resources
- Data storage for RAG indexes, training data, logs, etc.
and … of course, the cost of your team to build and maintain the system!
Context Management Strategy
Unlimited context doesn’t equal unlimited value. Design conversation flows that maintain just enough context for quality responses without burning tokens unnecessarily.
Example: A financial advisor chatbot needs context about portfolio positions, not the entire conversation history including small talk.
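One simple way to implement that is a context builder that always keeps pinned facts (like portfolio positions) and fills the remaining budget with only the most recent turns. This is a minimal sketch; the chars-divided-by-4 token estimate is a crude assumed heuristic, and real systems would use their tokenizer.

```python
def build_context(history: list[dict], budget_tokens: int,
                  est_tokens=lambda m: len(m["text"]) // 4) -> list[dict]:
    """Keep pinned facts (e.g. portfolio positions) plus the most recent
    turns that fit the token budget; older small talk is dropped first.
    est_tokens is a crude chars/4 heuristic -- swap in a real tokenizer."""
    pinned = [m for m in history if m.get("pinned")]
    spent = sum(est_tokens(m) for m in pinned)
    recent = []
    for msg in reversed([m for m in history if not m.get("pinned")]):
        cost = est_tokens(msg)
        if spent + cost > budget_tokens:
            break               # budget exhausted; older turns are dropped
        recent.append(msg)
        spent += cost
    return pinned + list(reversed(recent))
```

The effect is that each turn costs roughly the same, instead of the 10x snowball the customer service bot above ran into.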
Retrieval Optimization
RAG systems multiply token usage. Retrieve precisely what’s needed, not everything that might be relevant. One healthcare analytics team reduced its time-to-insight by 78% through careful retrieval design that also let it swap out models and data sources without disrupting downstream applications.
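A minimal version of “retrieve precisely what’s needed” is to rank retrieved chunks by relevance and take only what fits a per-query token budget, rather than stuffing everything into the prompt. The scores and token heuristic below are assumptions standing in for your vector store and tokenizer.

```python
def select_chunks(scored_chunks: list[tuple[float, str]],
                  token_budget: int,
                  est_tokens=lambda t: len(t) // 4) -> list[str]:
    """Take the highest-scoring retrieved chunks until the per-query
    token budget is spent. `scored_chunks` is (similarity, text) pairs
    as returned by a vector store; est_tokens is a rough heuristic."""
    chosen, spent = [], 0
    for score, text in sorted(scored_chunks, reverse=True):
        cost = est_tokens(text)
        if spent + cost > token_budget:
            continue  # skip chunks that don't fit rather than truncate mid-chunk
        chosen.append(text)
        spent += cost
    return chosen
```

This caps the 50,000-token queries from the earlier example at a number you chose deliberately.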
The Infrastructure-First Approach
Most teams build AI features and then worry about costs. Successful implementations work backward from economic constraints.
Start with the unit economics:
- What value does each successful interaction create?
- What’s the maximum token cost per interaction that makes sense?
- How does conversation depth affect token usage in your specific use case?
- What’s your retrieval token budget per query?
- How do you handle failed requests without costs spiraling?
Then design within those constraints:
- Context window management strategies
- Fallback and degradation patterns
- Usage monitoring and alerting
- Automatic cost circuit breakers
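The last item on that list, a cost circuit breaker, can be as simple as a counter with a cap. This sketch assumes a daily dollar cap; callers check `open` before each model call and fall back to a cheaper path (cached answer, human handoff) when it trips.

```python
class CostCircuitBreaker:
    """Trips when cumulative spend crosses a daily cap. Callers should
    check `open` before each AI call and use a fallback path when it trips."""

    def __init__(self, daily_cap_usd: float):
        self.daily_cap = daily_cap_usd
        self.spent_today = 0.0  # reset by a daily scheduled job

    def record(self, cost_usd: float) -> None:
        """Record the cost of one completed AI call."""
        self.spent_today += cost_usd

    @property
    def open(self) -> bool:
        """True when spending must stop and fallbacks take over."""
        return self.spent_today >= self.daily_cap

breaker = CostCircuitBreaker(daily_cap_usd=50.0)
breaker.record(49.0)
assert not breaker.open  # still under cap: keep serving AI responses
breaker.record(2.0)
assert breaker.open      # tripped: route to fallback and alert the team
```

A breaker like this is what turns the $200-to-$20,000 surprise into a $200-and-an-alert non-event.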
Starting Without Perfect Data
If your team can’t answer the cost questions yet, here’s your incremental path.
Week 1: Instrument one AI feature to track token usage per interaction. Start with production logs. You probably have more data than you think.
Week 2: Calculate cost per interaction for that single feature. Compare to the value it creates (ticket deflection, conversion lift, time saved).
Week 3: Present findings to stakeholders with simple ROI: “Feature X costs $Y per use, creates $Z in value, giving us $W margin per interaction.”
You don’t need comprehensive analysis across all features. One solid example gives you the conversation starter with your team.
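Week 1’s instrumentation can start as one logging helper that records tokens, cost, and outcome per interaction. The prices and field names here are illustrative assumptions; the point is that one CSV of rows like this answers the Week 2 and Week 3 questions.

```python
import csv
import io

def log_interaction(writer, feature: str, in_tokens: int, out_tokens: int,
                    resolved: bool, price_in=0.003, price_out=0.015) -> float:
    """Append one row per AI interaction: enough to later answer
    'what does feature X cost per use, and did it work?'
    Prices are assumed USD per 1,000 tokens."""
    cost = in_tokens / 1000 * price_in + out_tokens / 1000 * price_out
    writer.writerow([feature, in_tokens, out_tokens, round(cost, 6), resolved])
    return cost

buf = io.StringIO()  # stand-in for a real log file
w = csv.writer(buf)
cost = log_interaction(w, "support_bot", 3000, 1000, True)
print(f"cost/interaction: ${cost:.4f}")
```

Averaging the cost column and the resolved column per feature gives the “$Y per use, $Z in value” sentence for stakeholders.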
AI Readiness Assessment
- Not ready: Can’t model costs or value
- Getting there: Can model costs or value, but not both
- Almost ready: Can model both, but have no controls
- AI-ready: Can model costs and value, and have controls in place
Most teams fall into "Getting There." The path forward is systematic instrumentation, not perfect planning.
Making the Economics Work
The teams getting AI economics and scaling right aren’t necessarily the most technically sophisticated; they’re the ones treating AI costs as a first-class product constraint.
- Budget-driven design: Start with cost targets and design backwards to features that fit those constraints.
- Usage pattern modeling: Understand how real users will interact with your system, not how your demo users do.
- Graceful degradation: Design systems that maintain value while reducing costs as usage scales.
Your AI rollout isn’t wrong because you chose the wrong model or framework. It’s wrong if you’re optimizing for engagement metrics instead of sustainable value creation.
Ask your team this Monday morning: “What’s the real cost per successful customer interaction, and how does that compare to the value we’re creating?”
If they can’t answer both parts of that question confidently, you’re not ready for production scale.
Next in this series: How API-first design patterns can help control AI integration costs without sacrificing functionality.