Oversized Files in AI-Generated Code: Detection and Remediation
Oversized files are the most reliable early signal of architecture drift in AI-generated codebases. A file that has grown beyond 500 lines of code is not merely large — it is a structural signal that architectural boundaries have eroded and that the file is accumulating logic it should not own.
The mechanism is specific to prompt-driven development: each session adds logic to the most convenient location in the current context window. The most convenient location is typically the largest existing file — because it already contains the most related logic. The file grows. The next session adds more. By the time the file reaches 800 or 1000 lines, it contains business logic from multiple domains, database queries in the wrong layer, and utility functions that belong in dedicated modules.
This page explains how to detect oversized files with precision, how to interpret the findings, and what the remediation path looks like.
What We Observe
Oversized files in AI-generated codebases present with specific structural characteristics that distinguish them from legitimately large files (e.g., generated code, data files, migration scripts):
- Multi-domain logic accumulation — a single file contains authentication logic, billing calculations, and user profile management — three domains that should be in separate modules
- Layer boundary violations — database queries, business logic, and HTTP response formatting coexist in the same file
- Duplicate utility functions — helper functions that appear in multiple files because the file grew too large to search before adding a new one
- Inconsistent naming within the file — the file was modified across many sessions, each using slightly different naming conventions
- High churn rate — the file appears in a disproportionate number of commits because it is the default location for new logic
These are not symptoms of a single problem. They are symptoms of a file that has become a structural accumulation point — the place where architectural decisions go to die.
Detection: Identifying Oversized Files
Primary Detection: File Size Distribution
# Complete file size distribution — the architecture drift fingerprint
echo "=== File size distribution ==="
find . \( -name "*.py" -o -name "*.ts" -o -name "*.tsx" \) \
-not -path "*/node_modules/*" -not -path "*/.git/*" \
-not -path "*/__pycache__/*" \
-not -name "*.test.*" -not -name "*.spec.*" \
-not -name "*.min.*" -not -name "*.generated.*" | \
xargs wc -l 2>/dev/null | grep -v " total$" | \
awk '{
if ($1 < 100) s++
else if ($1 < 300) m++
else if ($1 < 500) l++
else { c++; hotspots[c]=$2" ("$1" LOC)" }
total++
}
END {
print "< 100 LOC (healthy): ", s, "files"
print "100-300 LOC (normal): ", m, "files"
print "300-500 LOC (warning): ", l, "files"
print "> 500 LOC (drift signal):", c, "files"
if (total > 0) print "Drift ratio:", int(c/total*100) "%"
print "\nDrift hotspots (top 10):"
for (i=1; i<=c && i<=10; i++) print " ", hotspots[i]
}'
Secondary Detection: Churn Rate Analysis
# Files with highest commit frequency (structural accumulation points)
echo "=== High-churn files (last 90 days) ==="
git log --since="90 days ago" --name-only --format="" 2>/dev/null | \
grep -E "\.(py|ts|tsx)$" | \
sort | uniq -c | sort -rn | head -15
# Cross-reference: files that are both large AND high-churn
echo ""
echo "=== Large + high-churn files (highest risk) ==="
git log --since="90 days ago" --name-only --format="" 2>/dev/null | \
grep -E "\.(py|ts|tsx)$" | sort | uniq -c | sort -rn | \
awk '{print $2}' | while read f; do
if [ -f "$f" ]; then
lines=$(wc -l < "$f" 2>/dev/null)
if [ "$lines" -gt 400 ]; then
commits=$(git log --since="90 days ago" --oneline -- "$f" 2>/dev/null | wc -l)
echo "$lines LOC, $commits commits: $f"
fi
fi
done | sort -rn | head -10
Tertiary Detection: Multi-Domain Logic Check
# Check if a large file contains logic from multiple domains
# (business logic keywords from different domains in same file)
echo "=== Multi-domain logic in large files ==="
find . \( -name "*.py" -o -name "*.ts" -o -name "*.tsx" \) \
-not -path "*/node_modules/*" -not -path "*/__pycache__/*" | \
xargs wc -l 2>/dev/null | awk '$2 != "total" && $1 > 400 {print $2}' | \
while read f; do
auth=$(grep -c "auth\|login\|token\|password\|session" "$f" 2>/dev/null); auth=${auth:-0}
billing=$(grep -c "price\|payment\|invoice\|discount\|billing" "$f" 2>/dev/null); billing=${billing:-0}
user=$(grep -c "profile\|user\|account\|registration" "$f" 2>/dev/null); user=${user:-0}
domains=0
[ "$auth" -gt 2 ] && domains=$((domains+1))
[ "$billing" -gt 2 ] && domains=$((domains+1))
[ "$user" -gt 2 ] && domains=$((domains+1))
if [ "$domains" -ge 2 ]; then
echo "MULTI-DOMAIN ($domains domains): $f (auth:$auth billing:$billing user:$user)"
fi
done
Interpretation: What the Findings Mean
| Finding | Interpretation | Severity |
|---|---|---|
| >500 LOC, single domain | File is large but focused — monitor, not critical | Low |
| >500 LOC, multi-domain | Architecture drift confirmed — boundary erosion present | Medium |
| >800 LOC, any | Critical accumulation point — refactoring required | High |
| >500 LOC + high churn | Active drift hotspot — every commit adds risk | High |
| >30% of files >500 LOC | Architecture has dissolved — structural intervention required | Critical |
What oversized files are not:
- Generated files (`*.generated.ts`, `schema.prisma`, migration files) — these are legitimately large and should be excluded from analysis
- Test fixtures and seed data files — large by design
- Configuration files (`package-lock.json`, `yarn.lock`) — not application logic
Remediation Path
Oversized files are addressed through a three-step process: map the domains present in the file, split it along those boundaries, then enforce the limit with an automated check.
Step 1: Identify the Split Boundaries
Before splitting, map the logical domains present in the file:
# Generate a function/class inventory of a large file
# Python
grep -n "^def \|^class \|^    def " large_file.py | head -40
# TypeScript
grep -n "^export function\|^export class\|^export const\|^  async\|^function" \
large_file.ts | head -40
Group the functions by domain. Each domain becomes a separate module.
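The grouping step can be automated as a first pass. A minimal Python sketch — the keyword map is hypothetical and should be adapted to your codebase's vocabulary; `group_by_domain` is an illustrative helper, not part of any tool:

```python
# Hypothetical domain keyword map -- adjust to your codebase's vocabulary.
DOMAIN_KEYWORDS = {
    "auth": ("auth", "login", "token", "session"),
    "billing": ("price", "payment", "invoice", "coupon", "discount"),
    "user": ("profile", "user", "account", "preference"),
}

def group_by_domain(function_names):
    """Assign each function name to the first domain whose keywords match."""
    groups = {domain: [] for domain in DOMAIN_KEYWORDS}
    groups["unclassified"] = []
    for name in function_names:
        for domain, keywords in DOMAIN_KEYWORDS.items():
            if any(kw in name.lower() for kw in keywords):
                groups[domain].append(name)
                break
        else:
            groups["unclassified"].append(name)
    return groups

# Names taken from the inventory step above
names = ["authenticate_user", "validate_token", "calculate_discount",
         "apply_coupon", "get_user_profile", "update_user_preferences"]
print(group_by_domain(names))
```

Treat the output as a draft: review `unclassified` names by hand, since keyword matching is a heuristic, not a parser.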
Step 2: Extract to Dedicated Modules
Before (single oversized file):
services/main.py (847 LOC)
- authenticate_user()
- validate_token()
- calculate_discount()
- apply_coupon()
- get_user_profile()
- update_user_preferences()
After (domain-separated modules):
domains/auth/authenticate/service.py (95 LOC)
domains/auth/validate_token/service.py (67 LOC)
domains/billing/calculate_discount/service.py (112 LOC)
domains/billing/apply_coupon/service.py (88 LOC)
domains/user/get_profile/service.py (78 LOC)
domains/user/update_preferences/service.py (91 LOC)
Step 3: Enforce the Boundary
After splitting, add an automated check so the file cannot silently re-accumulate. An import-boundary linter such as dependency-cruiser governs which modules may depend on which, but it does not enforce file size; that requires a custom script or pre-commit hook:
# pre-commit hook: fail if any file exceeds 500 LOC
#!/bin/bash
OVERSIZED=$(git diff --cached --name-only --diff-filter=d | \
grep -E "\.(py|ts|tsx)$" | \
xargs -r wc -l 2>/dev/null | \
awk '$2 != "total" && $1 > 500 {print $2}')
if [ -n "$OVERSIZED" ]; then
echo "❌ Pre-commit: files exceed 500 LOC limit:"
echo "$OVERSIZED"
echo "Split the file before committing."
exit 1
fi
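A local hook can be bypassed with `--no-verify`, so the same gate is worth running in CI. A minimal Python sketch — the threshold, extension list, and excluded directories mirror the hook above and are project conventions, not fixed rules:

```python
"""CI gate: list source files above a LOC threshold.
A minimal sketch; threshold, extensions, and exclusions mirror the
pre-commit hook above and should match your project's conventions."""
from pathlib import Path

THRESHOLD = 500
EXTENSIONS = {".py", ".ts", ".tsx"}
EXCLUDED_DIRS = {"node_modules", ".git", "__pycache__"}

def oversized_files(root: str = ".") -> list:
    """Return (LOC, path) pairs for files over THRESHOLD, largest first."""
    hits = []
    for path in Path(root).rglob("*"):
        if path.suffix not in EXTENSIONS or EXCLUDED_DIRS & set(path.parts):
            continue
        with path.open(errors="ignore") as fh:
            loc = sum(1 for _ in fh)
        if loc > THRESHOLD:
            hits.append((loc, str(path)))
    return sorted(hits, reverse=True)

def main(root: str = ".") -> int:
    hits = oversized_files(root)
    for loc, name in hits:
        print(f"{loc} LOC: {name}")
    return 1 if hits else 0  # nonzero exit fails the CI job
```

Wire it into the pipeline with `sys.exit(main())` so an oversized file fails the build rather than merely printing a warning.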