Test Infrastructure Failure in AI-Generated Code: The Root Cause Explained
Test infrastructure failure is a structural root cause in AI-generated codebases where the feedback loop between code changes and correctness verification is absent. No automated tests run before deployment. Regressions reach production undetected. The cost of every change includes an invisible manual verification tax that grows with the codebase.
The defining characteristic of test infrastructure failure is not the absence of test files — it is the absence of a functioning feedback loop. A codebase can have test files that never run, tests that always pass because they test nothing, and a CI/CD pipeline that reports green on every commit while shipping broken code to production. All of these are forms of test infrastructure failure.
This page explains the mechanism by which prompt-driven development produces test infrastructure failure, how to assess the actual state of the feedback loop with concrete checks, and what structural intervention is required to establish a functioning safety net.
Who This Is For
Developers, architects, and technical leads working with AI-generated codebases who need a precise technical understanding of:
- Why test infrastructure failure is a structural consequence of prompt-driven development, not a discipline failure
- How to distinguish between the presence of test files and the presence of a functioning feedback loop
- What the difference is between retroactive test writing (symptom treatment) and test infrastructure establishment (root cause intervention)
- How test infrastructure failure interacts with architecture drift and dependency graph corruption to make stabilization high-risk
For the founder-facing explanation of what test infrastructure failure feels like in practice, see Regression Fear and Hidden Technical Debt.
The Mechanism: Why Prompt-Driven Development Produces Test Infrastructure Failure
Test infrastructure failure is not caused by developers who do not value testing. It is a structural consequence of how prompt-driven development optimizes for the speed of the first ship.
The Local Optimization for Delivery
Each prompt session in AI-assisted development is optimized for a specific outcome: produce working code for the immediate feature. The AI generates application code that solves the immediate problem. Test code is a second-order concern — it requires a separate prompt, a separate session, and a deliberate decision to invest in feedback infrastructure that has no visible impact on the demo.
In practice, that second session rarely happens. The feature works. The demo is ready. The next feature is already in the backlog. The test session is deferred — and deferred sessions accumulate into a codebase with no test infrastructure.
The structural reason: prompt-driven development has no built-in mechanism for enforcing test coverage as a precondition for feature completion. In a manually developed codebase with a CI/CD pipeline that requires passing tests, a feature is not complete until its tests pass. In a prompt-driven workflow without that enforcement, a feature is complete when it works in the browser.
The Retroactive Test Problem
Once a codebase has been developed without tests for 2–4 months, adding tests retroactively becomes structurally difficult — not because of the volume of code, but because of the architectural state of the code.
Tests require isolation: the ability to instantiate a module, call a function, and verify its output without loading the entire application. In a codebase with architecture drift (RC01) and dependency graph corruption (RC02), isolation is structurally compromised:
- A service that should be independently testable imports directly from the database layer — instantiating it requires a live database connection
- A function that should have a single responsibility has accumulated logic from multiple domains — testing it requires understanding the entire codebase
- Circular dependencies make it impossible to import module A without also importing modules B, C, and D — a unit test for A is not a unit test
The result: retroactive test writing in a structurally degraded codebase produces tests that are either (a) integration tests masquerading as unit tests, requiring the full application stack to run, or (b) tests that mock so much of the system that they test nothing meaningful.
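As an illustration of form (b), here is a hypothetical over-mocked test (all names invented for this sketch): every collaborator is replaced by a `MagicMock`, so the assertion verifies the mock's configuration rather than any real pricing logic, and the test stays green no matter how the production behavior changes.

```python
from unittest.mock import MagicMock

# Hypothetical service function (illustrative only): the actual pricing
# logic lives entirely in the collaborators that the test below mocks away.
def calculate_invoice(user_repo, pricing_engine, user_id):
    user = user_repo.get(user_id)
    return pricing_engine.price_for(user)

def test_calculate_invoice():
    # Every dependency is a mock, so this "unit test" only round-trips
    # a value through MagicMock. It passes regardless of what the real
    # repository or pricing engine would do.
    repo = MagicMock()
    engine = MagicMock()
    engine.price_for.return_value = 42
    assert calculate_invoice(repo, engine, "u1") == 42
```

The assertion can never fail unless `MagicMock` itself breaks; the pricing rules the business depends on remain entirely untested.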
The Stale Test Problem
A third form of test infrastructure failure is stale tests — tests that were written at some point but have not been maintained as the codebase evolved. In AI-generated codebases, stale tests appear through a specific mechanism:
Prompt-driven regeneration replaces entire files (FP005). When a file is regenerated, the function signatures, return types, and behavior may change. Tests written against the previous version of the file now test a function that no longer exists — or test a function that exists but behaves differently. The tests continue to pass (because the function name still matches) but verify nothing about the current behavior.
Stale tests are more dangerous than absent tests: they create a false sense of coverage. A CI/CD pipeline that reports 40% test coverage may be reporting 40% of the codebase covered by tests that test the previous version of the code.
Technical Depth: The Three Failure Patterns of Test Infrastructure Failure
FP014: Low Test Coverage Ratio — The Feedback Loop Absence Signal
Low test coverage ratio is the primary quantitative signal of test infrastructure failure. The ratio measures the proportion of production code files that have corresponding test files — a proxy for the proportion of the codebase that has any automated feedback loop.
Scoring thresholds (from AI Chaos Index):
| Test-to-production file ratio | RC04 base severity |
|---|---|
| ≥50% | 0 |
| ≥30% and <50% | 3 |
| ≥10% and <30% | 6 |
| <10% | 9 |
The ratio is a proxy, not a precise coverage measurement. A codebase with 40% test file ratio but stale tests has a lower effective coverage than the ratio suggests. The ratio is the starting point — the stale test check (FP015) and test isolation check (FP016) provide the full picture.
FP015: Stale Tests — The False Coverage Signal
Stale tests are tests that exist but do not verify current behavior. The detection signal: tests that import functions or classes that no longer exist in the production code, or tests that always pass regardless of the production code state.
The mechanism in AI-generated codebases: prompt-driven regeneration changes function signatures without updating the corresponding tests. The test file imports calculateDiscount(price, rate) — but the regenerated production file now exports applyDiscount(item, discountConfig). The test fails to import — or worse, the test imports a stale cached version and passes.
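This check can be sketched more precisely in Python than with grep, using `ast` to parse a test file's from-imports and `importlib` to verify that each imported name still resolves. This is a heuristic sketch, not a complete tool: it assumes the production modules are importable from the current working directory, skips plain and relative imports, and the function name `stale_from_imports` is invented here.

```python
import ast
import importlib

def stale_from_imports(test_source: str) -> list[tuple[str, str]]:
    """Return (module, name) pairs imported by a test that no longer resolve."""
    stale = []
    for node in ast.walk(ast.parse(test_source)):
        if not isinstance(node, ast.ImportFrom) or node.module is None:
            continue  # skip plain `import x` and relative `from . import y`
        try:
            module = importlib.import_module(node.module)
        except ImportError:
            # Whole module is gone: every name imported from it is stale
            stale.extend((node.module, alias.name) for alias in node.names)
            continue
        for alias in node.names:
            if not hasattr(module, alias.name):
                stale.append((node.module, alias.name))
    return stale
```

Run over the scenario above, a test importing `calculateDiscount` from a module that now only exports `applyDiscount` is reported as stale instead of silently passing or failing at collection time.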
FP016: No Test Isolation — The Integration Dependency Signal
No test isolation is the structural consequence of architecture drift in the test layer. Tests that cannot run without a live database, a running API server, or a full application stack are not unit tests — they are integration tests that happen to be in a test file. The consequence: the test suite is slow, flaky, and environment-dependent. It cannot run in CI/CD without a full infrastructure setup. It is not run locally because it takes too long. The feedback loop is absent in practice even if the test files exist.
Detection: Assessing the Actual State of the Feedback Loop
The following checks distinguish between the presence of test files and the presence of a functioning feedback loop.
Step 1: Test Coverage Ratio (primary signal)
# Python: test-to-production file ratio
echo "=== Python test coverage ratio ==="
PROD=$(find . -name "*.py" \
  -not -path "*/test*" \
  -not -name "test_*.py" \
  -not -name "*_test.py" \
  -not -path "*/__pycache__/*" \
  -not -path "*/migrations/*" \
  -not -name "conftest.py" | wc -l)
TEST=$(find . \( -name "test_*.py" -o -name "*_test.py" \) | wc -l)
echo "Production files: $PROD"
echo "Test files: $TEST"
echo "Test ratio: $(echo "scale=1; $TEST * 100 / ($PROD + 1)" | bc)%"
# TypeScript: test-to-production file ratio
echo "=== TypeScript test coverage ratio ==="
PROD=$(find src \( -name "*.ts" -o -name "*.tsx" \) 2>/dev/null | \
  grep -v "\.test\." | grep -v "\.spec\." | grep -v "__tests__" | wc -l)
TEST=$(find src \( -name "*.test.ts" -o -name "*.test.tsx" \
  -o -name "*.spec.ts" -o -name "*.spec.tsx" \
  -o -path "*/__tests__/*.ts" -o -path "*/__tests__/*.tsx" \) 2>/dev/null | wc -l)
echo "Production files: $PROD"
echo "Test files: $TEST"
echo "Test ratio: $(echo "scale=1; $TEST * 100 / ($PROD + 1)" | bc)%"
Step 2: Stale Test Detection (secondary signal)
# Python: find test files importing non-existent functions
echo "=== Stale import detection (Python) ==="
find . -name "test_*.py" -o -name "*_test.py" | while read testfile; do
# Extract imported names from test file
grep "^from\|^import" "$testfile" 2>/dev/null | while read import_line; do
module=$(echo "$import_line" | grep -oP "from \K[\w.]+")
if [ -n "$module" ]; then
module_path=$(echo "$module" | tr '.' '/').py
if [ ! -f "$module_path" ] && [ ! -f "${module_path%/*}/__init__.py" ]; then
echo "STALE: $testfile imports from missing module: $module"
fi
fi
done
done | head -20
# TypeScript: find test files with broken imports
echo "=== Stale import detection (TypeScript) ==="
find src -name "*.test.ts" -o -name "*.test.tsx" 2>/dev/null | while read testfile; do
grep "^import" "$testfile" 2>/dev/null | \
grep -oP "from ['\"]([^'\"]+)['\"]" | \
grep -oP "['\"]([^'\"]+)['\"]" | tr -d "'\"" | while read import_path; do
# Check relative imports only
if [[ "$import_path" == .* ]]; then
dir=$(dirname "$testfile")
resolved="$dir/$import_path"
if [ ! -f "${resolved}.ts" ] && [ ! -f "${resolved}.tsx" ] && \
[ ! -f "${resolved}/index.ts" ]; then
echo "STALE: $testfile → $import_path (not found)"
fi
fi
done
done | head -20
# Quick check: run tests and count failures vs passes
echo "=== Test suite health check ==="
# Python (probe for pytest first: a failed probe inside a pipeline would
# otherwise be masked by tail's exit status)
python -m pytest --version >/dev/null 2>&1 \
  && python -m pytest --tb=no -q 2>/dev/null | tail -3 \
  || echo "pytest not configured"
# TypeScript
npx --no-install jest --version >/dev/null 2>&1 \
  && npx jest --passWithNoTests --silent 2>&1 | tail -3 \
  || echo "jest not configured"
Step 3: Test Isolation Assessment (tertiary signal)
# Check for database/external dependencies in unit test files
echo "=== External dependencies in test files ==="
grep -rn "database\|db\.\|session\.\|redis\|elasticsearch\|requests\.\|fetch(" \
--include="test_*.py" --include="*_test.py" \
--include="*.test.ts" --include="*.test.tsx" \
. 2>/dev/null | grep -v "mock\|patch\|stub\|fake\|fixture" | head -20
# Check for test setup requiring full application stack
echo "=== Full stack test dependencies ==="
grep -rn "TestClient\|app\.test\|supertest\|createServer\|startServer" \
--include="test_*.py" --include="*.test.ts" \
. 2>/dev/null | wc -l | \
xargs -I{} echo "{} test files require full application stack"
# Check for missing test fixtures/factories
echo "=== Test fixture presence ==="
[ -d "tests/fixtures" ] && echo "✓ tests/fixtures/" || echo "✗ tests/fixtures/ MISSING"
[ -d "tests/factories" ] && echo "✓ tests/factories/" || echo "✗ tests/factories/ MISSING"
[ -f "tests/conftest.py" ] && echo "✓ conftest.py" || echo "✗ conftest.py MISSING"
[ -f "src/__tests__/setup.ts" ] && echo "✓ test setup file" || \
echo "✗ test setup file MISSING"
Step 4: RC04 Severity Calculation
primary_signal = test_to_production_ratio (%)
secondary_signals = [
    stale_tests_present,            # boolean
    no_test_isolation,              # boolean (external deps in unit tests)
    ci_cd_absent_or_no_test_step    # boolean
]
secondary_bonus = count(secondary_signals_present) × 0.75
RC04_severity = min(lookup(primary_signal) + secondary_bonus, 10)
Example calculation:
Codebase: 8% test ratio → base score: 9
Secondary: stale_tests(✓) + no_isolation(✓) = 2 × 0.75 = 1.5
RC04_severity = min(9 + 1.5, 10) = 10 (capped)
RC04 contribution to ACI = 10 × 0.20 = 2.0 (maximum possible)
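The lookup and bonus above can be written out as a small function (a sketch that mirrors the scoring table; the name `rc04_severity` is mine):

```python
def rc04_severity(test_ratio_pct: float, stale_tests: bool,
                  no_isolation: bool, ci_absent: bool) -> float:
    # Base score from the test-to-production ratio table
    if test_ratio_pct >= 50:
        base = 0
    elif test_ratio_pct >= 30:
        base = 3
    elif test_ratio_pct >= 10:
        base = 6
    else:
        base = 9
    # Each secondary signal present adds 0.75; total is capped at 10
    bonus = 0.75 * sum([stale_tests, no_isolation, ci_absent])
    return min(base + bonus, 10.0)

# The worked example: 8% ratio, stale tests and no isolation present
assert rc04_severity(8, True, True, False) == 10.0
# RC04 contribution to ACI at its 20% weight
assert rc04_severity(8, True, True, False) * 0.20 == 2.0
```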
The Test Infrastructure Establishment Model
The structural intervention required to address test infrastructure failure is not "write more tests." It is establishing the infrastructure that makes tests possible, maintainable, and automatically enforced.
1. Establish Test Isolation Prerequisites
Before writing tests, the architectural preconditions for testable code must be present:
# ❌ Not testable: service with direct database dependency
class CalculateOverageService:
    def execute(self, user_id: str) -> OverageResult:
        db = get_db_connection()  # direct dependency — cannot test without DB
        credits = db.query(f"SELECT credits FROM users WHERE id = '{user_id}'")
        return OverageResult(overage=max(0, int(credits) - 100))

# ✅ Testable: service with injected repository
class CalculateOverageService:
    def __init__(self, repo: CalculateOverageRepository) -> None:
        self.repo = repo  # injected — can be mocked in tests

    def execute(self, request: CalculateOverageRequest) -> CalculateOverageResponse:
        credits = self.repo.get_credits(request.user_id)
        overage = max(0, int(credits) - 100) * 0.005
        return CalculateOverageResponse(amount=round(overage, 2))
The testable version can be instantiated with a mock repository — no database required. This is the architectural precondition for unit tests: dependency injection rather than direct dependency resolution.
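To make that concrete, here is what a unit test of the injected version can look like, with a hand-rolled fake repository and minimal stand-in request/response types (the dataclass definitions are assumptions, since the page only shows the service itself):

```python
from dataclasses import dataclass

@dataclass
class CalculateOverageRequest:
    user_id: str

@dataclass
class CalculateOverageResponse:
    amount: float

class CalculateOverageService:
    def __init__(self, repo) -> None:
        self.repo = repo  # injected dependency

    def execute(self, request: CalculateOverageRequest) -> CalculateOverageResponse:
        credits = self.repo.get_credits(request.user_id)
        overage = max(0, int(credits) - 100) * 0.005
        return CalculateOverageResponse(amount=round(overage, 2))

class FakeRepository:
    """In-memory stand-in: no database connection required."""
    def __init__(self, credits: int) -> None:
        self._credits = credits

    def get_credits(self, user_id: str) -> int:
        return self._credits

def test_overage_billed_per_credit_above_100():
    service = CalculateOverageService(repo=FakeRepository(credits=300))
    response = service.execute(CalculateOverageRequest(user_id="u1"))
    assert response.amount == 1.0  # (300 - 100) * 0.005

def test_no_overage_at_or_below_100_credits():
    service = CalculateOverageService(repo=FakeRepository(credits=80))
    assert service.execute(CalculateOverageRequest(user_id="u1")).amount == 0.0
```

The fake is ten lines, runs in milliseconds, and exercises the real overage calculation; nothing here needs a database, a server, or the rest of the application.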
2. Establish the Test Baseline
The test baseline is the minimum test coverage required to make the codebase safe to refactor. It is not 100% coverage — it is coverage of the highest-risk code paths:
Priority 1 — Business logic (highest risk):
- Pricing and discount calculations
- Permission and authorization checks
- Data validation logic
- State transition logic
Priority 2 — Integration boundaries (medium risk):
- API endpoint contracts (input/output shapes)
- Database query correctness
- External service adapter behavior
Priority 3 — UI behavior (lower risk for backend-heavy systems):
- Form validation
- State management
- Component rendering
The baseline is established in a dedicated sprint — not incrementally alongside feature development. Incremental test writing under feature pressure produces tests that are written to pass, not tests that are written to catch regressions.
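As an example of what a Priority 1 baseline test pins down, consider a hypothetical tiered-pricing function (both the function and its price bands are invented for illustration). The test fixes the exact boundary behavior, so any regenerated version that silently shifts a band boundary fails immediately:

```python
# Hypothetical pricing logic: free up to 5 seats, $8/seat up to 20 seats,
# volume-discounted $6.50/seat above that.
def tier_price(seats: int) -> float:
    if seats <= 5:
        return 0.0
    if seats <= 20:
        return seats * 8.0
    return seats * 6.5

def test_tier_price_band_boundaries():
    # Pin the boundary cases: these are what a silent regeneration
    # is most likely to break.
    assert tier_price(5) == 0.0      # last free seat
    assert tier_price(6) == 48.0     # first paid band
    assert tier_price(20) == 160.0   # last seat before volume discount
    assert tier_price(21) == 136.5   # volume discount applies
```

Note the test asserts on boundary values rather than one arbitrary input; a single mid-band case would pass under many incorrect implementations.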
3. CI/CD Enforcement
# .github/workflows/ci.yml
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install dependencies
        run: pip install -r requirements.txt  # or npm ci
      - name: Run linters
        run: |
          flake8 .
          mypy .
      - name: Run tests with coverage
        run: |
          # Fail if coverage drops below 30% — enforces the baseline
          pytest --cov=. --cov-report=term-missing --cov-fail-under=30
      - name: Check for stale imports
        run: |
          # Fail the build when test collection hits import errors (stale tests)
          if python -m pytest --collect-only -q 2>&1 | grep "ERROR"; then
            echo "Stale test imports detected"
            exit 1
          fi
The critical enforcement: --cov-fail-under=30 (or equivalent for TypeScript: jest --coverageThreshold). This makes the CI/CD pipeline fail if test coverage drops below the established baseline — preventing regression in the feedback loop itself.
Why Test Infrastructure Failure Makes All Other Root Causes Harder to Address
Test infrastructure failure (RC04) is the root cause that makes every other structural intervention high-risk:
| Root Cause | Without Tests | With Tests |
|---|---|---|
| RC01 Architecture Drift | Refactoring layer violations risks silent regressions | Refactoring is safe — tests catch behavioral changes |
| RC02 Dependency Corruption | Breaking circular deps may change behavior silently | Breaking cycles is verifiable — tests confirm behavior preserved |
| RC03 Structural Entropy | Consolidating duplicate logic risks divergence | Consolidation is safe — tests verify the canonical implementation |
| RC05 No Deployment Safety Net | CI/CD has nothing to enforce | CI/CD enforces test passage before every deployment |
This is why RC04 carries a 20% weight in the AI Chaos Index — equal to RC02. A codebase with high RC04 severity cannot be safely stabilized regardless of how well the other root causes are addressed. Every structural intervention carries regression risk that is only manageable with a functioning test feedback loop.
The correct remediation sequence: establish the test baseline (RC04) before or in parallel with addressing architecture drift (RC01) and dependency graph corruption (RC02). Attempting to refactor a structurally degraded codebase without tests is the highest-risk operation in software development.