Oversized Files in AI-Generated Code: Detection and Remediation
Oversized files are the most reliable early signal of architecture drift in AI-generated codebases. A file that has grown beyond 500 lines of code is not merely large — it is a structural signal that architectural boundaries have eroded and that the file is accumulating logic it should not own.
The mechanism is specific to prompt-driven development: each session adds logic to the most convenient location in the current context window. The most convenient location is typically the largest existing file — because it already contains the most related logic. The file grows. The next session adds more. By the time the file reaches 800 or 1000 lines, it contains business logic from multiple domains, database queries in the wrong layer, and utility functions that belong in dedicated modules.
This page explains how to detect oversized files with precision, how to interpret the findings, and what the remediation path looks like.
What We Observe
Oversized files in AI-generated codebases present with specific structural characteristics that distinguish them from legitimately large files (e.g., generated code, data files, migration scripts):
- Multi-domain logic accumulation — a single file contains authentication logic, billing calculations, and user profile management — three domains that should be in separate modules
- Layer boundary violations — database queries, business logic, and HTTP response formatting coexist in the same file
- Duplicate utility functions — helper functions that appear in multiple files because the file grew too large to search before adding a new one
- Inconsistent naming within the file — the file was modified across many sessions, each using slightly different naming conventions
- High churn rate — the file appears in a disproportionate number of commits because it is the default location for new logic
These are not symptoms of a single problem. They are symptoms of a file that has become a structural accumulation point — the place where architectural decisions go to die.
Detection: Identifying Oversized Files
Primary Detection: File Size Distribution
# Complete file size distribution — the architecture drift fingerprint
echo "=== File size distribution ==="
find . \( -name "*.py" -o -name "*.ts" -o -name "*.tsx" \) \
-not -path "*/node_modules/*" -not -path "*/.git/*" \
-not -path "*/__pycache__/*" \
-not -name "*.test.*" -not -name "*.spec.*" \
-not -name "*.min.*" -not -name "*.generated.*" | \
xargs wc -l 2>/dev/null | grep -v " total$" | \
awk '{
if ($1 < 100) s++
else if ($1 < 300) m++
else if ($1 < 500) l++
else { c++; hotspots[c]=$2" ("$1" LOC)" }
total++
}
END {
print "< 100 LOC (healthy): ", s, "files"
print "100-300 LOC (normal): ", m, "files"
print "300-500 LOC (warning): ", l, "files"
print "> 500 LOC (drift signal):", c, "files"
if (total > 0) print "Drift ratio:", int(c/total*100) "%"
print "\nDrift hotspots (top 10):"
for (i=1; i<=c && i<=10; i++) print " ", hotspots[i]
}'
Secondary Detection: Churn Rate Analysis
# Files with highest commit frequency (structural accumulation points)
echo "=== High-churn files (last 90 days) ==="
git log --since="90 days ago" --name-only --format="" 2>/dev/null | \
grep -E "\.(py|ts|tsx)$" | \
sort | uniq -c | sort -rn | head -15
# Cross-reference: files that are both large AND high-churn
echo ""
echo "=== Large + high-churn files (highest risk) ==="
git log --since="90 days ago" --name-only --format="" 2>/dev/null | \
grep -E "\.(py|ts|tsx)$" | sort | uniq -c | sort -rn | \
awk '{print $2}' | while read f; do
if [ -f "$f" ]; then
lines=$(wc -l < "$f" 2>/dev/null)
if [ "$lines" -gt 400 ]; then
commits=$(git log --since="90 days ago" --oneline -- "$f" 2>/dev/null | wc -l)
echo "$lines LOC, $commits commits: $f"
fi
fi
done | sort -rn | head -10
Tertiary Detection: Multi-Domain Logic Check
# Check if a large file contains logic from multiple domains
# (business logic keywords from different domains in same file)
echo "=== Multi-domain logic in large files ==="
find . \( -name "*.py" -o -name "*.ts" -o -name "*.tsx" \) \
-not -path "*/node_modules/*" -not -path "*/__pycache__/*" | \
xargs wc -l 2>/dev/null | awk '$2 != "total" && $1 > 400 {print $2}' | \
while read f; do
auth=$(grep -c "auth\|login\|token\|password\|session" "$f" 2>/dev/null); auth=${auth:-0}
billing=$(grep -c "price\|payment\|invoice\|discount\|billing" "$f" 2>/dev/null); billing=${billing:-0}
user=$(grep -c "profile\|user\|account\|registration" "$f" 2>/dev/null); user=${user:-0}
domains=0
[ "$auth" -gt 2 ] && domains=$((domains+1))
[ "$billing" -gt 2 ] && domains=$((domains+1))
[ "$user" -gt 2 ] && domains=$((domains+1))
if [ "$domains" -ge 2 ]; then
echo "MULTI-DOMAIN ($domains domains): $f (auth:$auth billing:$billing user:$user)"
fi
done
Interpretation: What the Findings Mean
| Finding | Interpretation | Severity |
|---|---|---|
| >500 LOC, single domain | File is large but focused — monitor, not critical | Low |
| >500 LOC, multi-domain | Architecture drift confirmed — boundary erosion present | Medium |
| >800 LOC, any | Critical accumulation point — refactoring required | High |
| >500 LOC + high churn | Active drift hotspot — every commit adds risk | High |
| >30% of files >500 LOC | Architecture has dissolved — structural intervention required | Critical |
What oversized files are not:
- Generated files (`*.generated.ts`, `schema.prisma`, migration files) — these are legitimately large and should be excluded from analysis
- Test fixtures and seed data files — large by design
- Configuration files (`package-lock.json`, `yarn.lock`) — not application logic
Remediation Path
Oversized files are addressed through a three-step process: map the domains present in the file, split it along those boundaries, then enforce the limit with an automated check.
Step 1: Identify the Split Boundaries
Before splitting, map the logical domains present in the file:
# Generate a function/class inventory of a large file
# Python
grep -n "^def \|^class \|^    def " large_file.py | head -40
# TypeScript
grep -n "^export function\|^export class\|^export const\|^  async\|^function" \
large_file.ts | head -40
Group the functions by domain. Each domain becomes a separate module.
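The grouping step can be automated as a first pass. A minimal Python sketch — the keyword map is hypothetical and should be adapted to your codebase's vocabulary; `group_by_domain` is an illustrative helper, not part of any tool:

```python
# Hypothetical domain keyword map -- adjust to your codebase's vocabulary.
DOMAIN_KEYWORDS = {
    "auth": ("auth", "login", "token", "session"),
    "billing": ("price", "payment", "invoice", "coupon", "discount"),
    "user": ("profile", "user", "account", "preference"),
}

def group_by_domain(function_names):
    """Assign each function name to the first domain whose keywords match."""
    groups = {domain: [] for domain in DOMAIN_KEYWORDS}
    groups["unclassified"] = []
    for name in function_names:
        for domain, keywords in DOMAIN_KEYWORDS.items():
            if any(kw in name.lower() for kw in keywords):
                groups[domain].append(name)
                break
        else:
            groups["unclassified"].append(name)
    return groups

# Names taken from the inventory step above
names = ["authenticate_user", "validate_token", "calculate_discount",
         "apply_coupon", "get_user_profile", "update_user_preferences"]
print(group_by_domain(names))
```

Treat the output as a draft: review `unclassified` names by hand, since keyword matching is a heuristic, not a parser.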
Step 2: Extract to Dedicated Modules
Before (single oversized file):
services/main.py (847 LOC)
- authenticate_user()
- validate_token()
- calculate_discount()
- apply_coupon()
- get_user_profile()
- update_user_preferences()
After (domain-separated modules):
domains/auth/authenticate/service.py (95 LOC)
domains/auth/validate_token/service.py (67 LOC)
domains/billing/calculate_discount/service.py (112 LOC)
domains/billing/apply_coupon/service.py (88 LOC)
domains/user/get_profile/service.py (78 LOC)
domains/user/update_preferences/service.py (91 LOC)
Step 3: Enforce the Boundary
After splitting, add an automated check so the file cannot silently re-accumulate. An import-boundary linter such as dependency-cruiser governs which modules may depend on which, but it does not enforce file size; that requires a custom script or pre-commit hook:
# pre-commit hook: fail if any file exceeds 500 LOC
#!/bin/bash
OVERSIZED=$(git diff --cached --name-only --diff-filter=d | \
grep -E "\.(py|ts|tsx)$" | \
xargs -r wc -l 2>/dev/null | \
awk '$2 != "total" && $1 > 500 {print $2}')
if [ -n "$OVERSIZED" ]; then
echo "❌ Pre-commit: files exceed 500 LOC limit:"
echo "$OVERSIZED"
echo "Split the file before committing."
exit 1
fi
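A local hook can be bypassed with `--no-verify`, so the same gate is worth running in CI. A minimal Python sketch — the threshold, extension list, and excluded directories mirror the hook above and are project conventions, not fixed rules:

```python
"""CI gate: list source files above a LOC threshold.
A minimal sketch; threshold, extensions, and exclusions mirror the
pre-commit hook above and should match your project's conventions."""
from pathlib import Path

THRESHOLD = 500
EXTENSIONS = {".py", ".ts", ".tsx"}
EXCLUDED_DIRS = {"node_modules", ".git", "__pycache__"}

def oversized_files(root: str = ".") -> list:
    """Return (LOC, path) pairs for files over THRESHOLD, largest first."""
    hits = []
    for path in Path(root).rglob("*"):
        if path.suffix not in EXTENSIONS or EXCLUDED_DIRS & set(path.parts):
            continue
        with path.open(errors="ignore") as fh:
            loc = sum(1 for _ in fh)
        if loc > THRESHOLD:
            hits.append((loc, str(path)))
    return sorted(hits, reverse=True)

def main(root: str = ".") -> int:
    hits = oversized_files(root)
    for loc, name in hits:
        print(f"{loc} LOC: {name}")
    return 1 if hits else 0  # nonzero exit fails the CI job
```

Wire it into the pipeline with `sys.exit(main())` so an oversized file fails the build rather than merely printing a warning.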