EVIDENCE

The Measured Cost of Structural Failure in AI-Generated Codebases

Structural failure in AI-generated codebases is not a theoretical risk. It is a measured condition with quantified consequences — in defect rates, resolution time, delivery speed, and deployment reliability. The evidence below is drawn from peer-reviewed research, large-scale industry benchmarks, and empirical studies of AI-generated code. Every claim includes its source, year, and evidence tier.

This page documents what structural failure costs — not in hypothetical projections, but in published, reproducible findings.


Who This Is For

You built an application with Cursor, Lovable, Bolt.new, Replit, or v0. It works — but you suspect something is wrong underneath. You're weighing whether to investigate or keep building.

  • You want to know if structural problems have a real, measurable cost — or if it's just "best practice" advice
  • You want peer-reviewed evidence, not marketing claims
  • You're deciding whether to invest in a diagnostic now or wait
  • You want to understand what happens to codebases like yours over 6, 12, and 24 months

The Core Finding

Across multiple independent studies (2019–2025), three facts are consistently supported:

  1. Low-quality code contains up to 15× more defects than high-quality code — a direct, measurable tax on delivery. — Tornhill & Borg, 2022 (39 proprietary codebases, arXiv:2203.04374)

  2. AI magnifies existing structural conditions — it amplifies strengths and dysfunctions rather than automatically improving delivery outcomes. — DORA, 2025 (5,000 respondents, 100+ hours qualitative data)

  3. ~32% of AI-generated multi-file projects fail to execute without manual intervention — dependency and environment specification failures are now a measurable productivity tax. — arXiv:2512.22387, 2025 (300 LLM-generated projects)

These are not edge cases. They describe the structural baseline of AI-generated codebases when architectural enforcement is absent.


Evidence by Root Cause

RC01: Architecture Drift

Architecture drift — eroding layer boundaries, uncontrolled file growth, and cross-domain imports — is rarely measured directly as "drift." It surfaces as low code maintainability, which has clearly quantified consequences.

Finding | Value | Source | Year
Developer time lost to technical debt | ~23% of working time | Besker, Martini & Bosch (peer-reviewed) | 2019
Defect density in low-quality code | Up to 15× more defects | Tornhill & Borg (39 codebases) | 2022
Issue resolution time penalty | 78–124% longer than in healthy code | Tornhill & Borg | 2022
Onboarding to first meaningful PRs | 72% of orgs report >1 month | Cortex survey (vendor) | 2024

What this means: Architecture drift behaves like compounding debt. ~1 day per week of developer capacity is consumed by structural overhead — before accounting for defect-driven churn. When code quality degrades, both "build new" and "keep running" costs increase simultaneously.

"Low quality code contains 15 times more defects than high quality code." — Tornhill & Borg, 2022
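Drift of this kind can be caught mechanically before it shows up in defect counts. As a minimal sketch — the layer names, their ordering, and the example modules are illustrative assumptions, not taken from any cited study — a check can parse a module's imports and flag any that point from a lower layer to a higher one:

```python
import ast

# Hypothetical layering rule for illustration: lower-ranked layers
# must never import from higher-ranked ones.
LAYERS = {"domain": 0, "services": 1, "api": 2}

def imported_top_packages(source: str) -> set[str]:
    """Return the top-level packages imported by a Python module."""
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            packages.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            packages.add(node.module.split(".")[0])
    return packages

def layer_violations(module_layer: str, source: str) -> list[str]:
    """List imports that point from a lower layer to a higher one."""
    rank = LAYERS[module_layer]
    return sorted(
        pkg for pkg in imported_top_packages(source)
        if pkg in LAYERS and LAYERS[pkg] > rank
    )

# A 'domain' module reaching up into 'api' is drift: the dependency
# direction is inverted and the boundary has eroded.
bad = "from api.handlers import format_response\nimport services.billing"
print(layer_violations("domain", bad))  # ['api', 'services']
```

Run in CI, a check like this turns a boundary from a convention into an enforced rule — the condition the studies above describe is precisely what accumulates when no such enforcement exists.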

RC02: Dependency Graph Corruption

Dependency corruption — circular imports, undeclared runtime dependencies, shared utils overuse — has the strongest AI-specific evidence of any root cause.

Finding | Value | Source | Year
AI projects failing out-of-the-box | ~32% failure rate | arXiv:2512.22387 (300 projects) | 2025
Declared vs. runtime dependency gap | ~13.5× gap (~3 declared → ~37 runtime) | Same study | 2025
Debugging time per failed project | ~15 minutes average | Same study | 2025
Cyclic dependency evolution | Complexity increases over releases | Gnoyke et al., JSS (485 releases, peer-reviewed) | 2024

What this means: A project can appear simple in its declared dependencies while being operationally complex at runtime. When dependency direction rules break down, the hidden complexity grows silently — and each release makes it harder, not easier, to isolate and refactor modules.

"Only 68.3% of projects execute out-of-the-box." — arXiv:2512.22387, 2025
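The declared-vs-runtime gap is straightforward to surface for your own project. A minimal sketch (the requirements parsing is deliberately simplified — it ignores extras, environment markers, and URLs):

```python
import ast

def declared_packages(requirements_text: str) -> set[str]:
    """Parse package names from a requirements.txt-style string."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            # Split off version specifiers like '==', '>=', '~='.
            for sep in ("==", ">=", "<=", "~=", ">", "<"):
                line = line.split(sep)[0]
            names.add(line.strip().lower())
    return names

def runtime_imports(source: str) -> set[str]:
    """Collect top-level modules imported by a Python source file."""
    mods = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            mods.update(a.name.split(".")[0].lower() for a in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            mods.add(node.module.split(".")[0].lower())
    return mods

requirements = "requests==2.31.0\n"
source = "import requests\nimport numpy\nfrom flask import Flask\n"

# Imports with no matching declaration: the declared-vs-runtime gap.
undeclared = runtime_imports(source) - declared_packages(requirements)
print(sorted(undeclared))  # ['flask', 'numpy']
```

One caveat a real check must handle: an import name is not always the PyPI distribution name (e.g. `yaml` is installed as `PyYAML`), so a production version needs a mapping between the two.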

RC03: Structural Entropy

Structural entropy — naming inconsistency, duplicate logic, missing standard files — manifests as high cycle-time variance and unpredictable delivery.

Finding | Value | Source | Year
Worst-case cycle time inflation | ~9× longer in low-quality code | Tornhill & Borg | 2022
Defect amplification | 15× more defects in low-quality code | Tornhill & Borg | 2022
AI-generated code clone rates | Up to ~7.50% Type-1/2 clones | FSE 2025 (conference paper) | 2025
Resolution time penalty | 78–124% longer | Tornhill & Borg | 2022

What this means: Entropy doesn't cause single catastrophic failures. It causes predictability to collapse. When cycle-time volatility reaches 9×, deadlines become unreliable — not because of any single incident, but because the structural condition makes every task's duration unpredictable. AI can scale this entropy by generating near-duplicate implementations unless structure and conventions are enforced.

"Issue resolutions in low quality code involve higher uncertainty … as 9 times longer maximum cycle times." — Tornhill & Borg, 2022
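Type-1 clones — identical code ignoring whitespace and comments — are the easiest form of this entropy to detect. A minimal sketch (the comment stripping is crude and the 3-line window size is an arbitrary choice for illustration): normalize each line, then hash sliding windows and compare the hash sets across files.

```python
import hashlib

def normalized_lines(source: str) -> list[str]:
    """Strip comments and whitespace so Type-1 clones compare equal."""
    lines = []
    for line in source.splitlines():
        code = line.split("#")[0].strip()  # crude comment removal
        if code:
            lines.append(" ".join(code.split()))
    return lines

def clone_windows(source: str, window: int = 3) -> set[str]:
    """Hash every `window`-line slice of normalized code."""
    lines = normalized_lines(source)
    return {
        hashlib.sha256("\n".join(lines[i:i + window]).encode()).hexdigest()
        for i in range(len(lines) - window + 1)
    }

a = """
def total(xs):
    s = 0          # accumulator
    for x in xs:
        s += x
    return s
"""
b = """
def sum_all(xs):
    s = 0
    for x in xs:
        s += x
    return s
"""

# The two functions differ only in name, comments, and spacing, so
# their loop bodies produce identical window hashes.
shared = clone_windows(a) & clone_windows(b)
print(len(shared))  # 2
```

Near-duplicate implementations like `total` and `sum_all` are exactly what AI assistants generate when no shared utility is enforced — each one a future divergence point for bug fixes.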

RC04: Test Infrastructure Failure

Missing or unreliable test infrastructure is measured most directly through delivery performance benchmarks — the speed and stability outcomes that separate high-performing teams from low-performing ones.

Finding | Value | Source | Year
Change failure rate (low performers) | 46–60% vs 0–15% (high) | DORA 2022 (Tier 1) | 2022
Lead time for changes (low performers) | 1–6 months vs 1 day–1 week (high) | DORA 2022 | 2022
Time to restore (low performers) | 1 week–1 month vs <1 day (high) | DORA 2022 | 2022
AI code review effort | 38% of devs say AI code review requires more effort | SonarSource survey (vendor) | 2025

What this means: Without fast, trustworthy tests, the feedback loop between code changes and correctness verification is broken. The result is not just slower releases — it is fundamentally riskier releases. In AI-assisted development, the bottleneck shifts from writing code to verifying and integrating it.

"Manual regression testing is time-consuming to execute and expensive to perform, which makes it a bottleneck…" — DORA, 2025

RC05: No Deployment Safety Net

Missing deployment automation and rollback mechanisms correlate with dramatically slower release cadence and worse stability outcomes.

Finding | Value | Source | Year
Deployment frequency (low performers) | Monthly to semiannual vs on-demand (high) | DORA 2022 (Tier 1) | 2022
AI coding gains absorbed by bottlenecks | Gains "swallowed by bottlenecks" in testing/deployment | DORA 2025 | 2025
AI impact on delivery performance | Small negative estimate on system-level delivery | DORA 2025 | 2025

What this means: AI increases the rate at which code is produced. If the release pathway is not automated and standardized, you grow a delivery bottleneck — increasing work-in-progress, review queues, and late-stage failures. The faster you generate code, the more critical the deployment safety net becomes.

"Gains in coding speed are often swallowed by bottlenecks in testing, security reviews, and complex deployment processes." — DORA, 2025
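The safety net itself is a small amount of logic: deploy, verify, and roll back automatically when verification fails. A minimal sketch — `deploy`, `health_check`, and `rollback` are hypothetical placeholders standing in for whatever release tooling you actually use:

```python
# Automated rollback gate: the release only "sticks" if it passes
# its health check; otherwise the previous version is restored with
# no human in the loop.

def release(version: str, current: str, deploy, health_check, rollback) -> str:
    """Deploy `version`; keep it only if the health check passes."""
    deploy(version)
    if health_check(version):
        return version        # new version becomes 'current'
    rollback(current)         # the safety net fires automatically
    return current

# Simulated tooling for illustration: v2 fails its health check.
log = []
result = release(
    "v2", "v1",
    deploy=lambda v: log.append(f"deploy {v}"),
    health_check=lambda v: v != "v2",
    rollback=lambda v: log.append(f"rollback to {v}"),
)
print(result, log)  # v1 ['deploy v2', 'rollback to v1']
```

The design point is that the rollback decision is part of the pipeline, not a meeting: as AI raises the volume of changes flowing toward production, a gate like this is what keeps a failed deploy a minutes-long event instead of a week-long one.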


How Structural Problems Compound

A fully quantified "interest rate" for structural problems does not exist in published research. What does exist is convergent evidence for compounding mechanisms:

Within ~6 months: Verification and coordination tax becomes visible

  • AI and fast shipping increase the volume of changes; if structure is weak, teams spend more time on review, debugging, and dependency reconciliation rather than feature delivery.
  • Developers lose time to context gathering — often >30 minutes/day in large codebases (Stack Overflow, 2024).

Within ~12 months: Throughput collapses into bottlenecks

  • Broken test and deployment safety nets shift the system into slower performance clusters: longer lead time, higher change failure, longer recovery — observable in DORA benchmark ranges.
  • Increased defect density and longer resolution time convert planned work into unplanned work, increasing volatility and pushing roadmap commitments out.

Within ~24 months: Structural debt locks in

  • Teams avoid refactoring because cross-cutting changes are costly; reduced refactoring increases future refactoring cost (a reinforcing loop).
  • Dependency and architecture smells "merge" into larger, more complex structures across releases, making modular testing and refactoring harder, not easier (Gnoyke et al., 2024).
  • AI adoption — if treated as a tooling rollout rather than a systems change — can magnify downstream disorder.
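The compounding mechanism can be made concrete with a toy model. The ~23% starting overhead comes from Besker et al. (2019); the 2% monthly growth rate is purely an illustrative assumption — no published figure exists, as noted above:

```python
# Toy model of compounding structural overhead. Starting tax of 23%
# is from Besker et al. (2019); the 2% monthly growth is an assumed,
# illustrative "interest rate", not a measured one.

def feature_capacity(months: int, overhead: float = 0.23, growth: float = 0.02) -> float:
    """Fraction of capacity left for features after `months` of
    unaddressed structural debt."""
    for _ in range(months):
        overhead = min(1.0, overhead * (1 + growth))
    return 1.0 - overhead

for m in (0, 6, 12, 24):
    print(f"month {m:2d}: {feature_capacity(m):.0%} capacity for features")
```

Under these assumptions, feature capacity drifts from 77% at month 0 to roughly 63% at month 24 — not a cliff, but a steady reinforcing loop, which matches the qualitative pattern the studies above describe.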

What This Means for Your Codebase

The evidence does not predict your specific situation. It establishes measurable baselines:

  • If your code quality is low, you are likely paying a 15× defect penalty and 78–124% resolution time premium.
  • If your test infrastructure is weak, your change failure rate is likely in the 46–60% range — meaning roughly half your deployments introduce problems.
  • If your dependency graph has circular chains, they are likely to grow more complex over time, not resolve themselves.
  • If you are using AI tools without architectural enforcement, AI is likely amplifying existing structural problems, not solving them.

The diagnostic measures where your codebase sits on these dimensions. The AI Chaos Index quantifies the structural risk across all five root causes — in 24 hours.


Structural risks compound over time.

The evidence shows that early diagnosis reduces long-term remediation cost. Measure your codebase now.