AI Code Quality Metrics: What to Measure and What to Ignore
Most teams measuring AI code quality are tracking the wrong things. Lines of code generated per day. Acceptance rate. Time from prompt to commit. These metrics measure AI usage, not AI quality. Optimizing for them is like optimizing for how fast you can dig holes without checking if you're digging in the right place.
After setting up quality tracking for 8 engineering teams, I've identified the metrics that actually predict whether AI code generation helps or hurts your codebase long-term.
The Metrics That Don't Matter
Let me save you some time. Stop tracking these:
Lines of code generated. More code isn't better code. AI can generate 500 lines in seconds. That doesn't mean 500 lines was the right solution. Often, 50 lines of well-thought-out code would have been better.
Acceptance rate. A high acceptance rate means developers are accepting AI suggestions frequently. This could mean the AI is producing great code. It could also mean developers are vibe coding. You can't tell from the metric alone.
Prompt-to-commit time. Faster isn't better if the code needs to be rewritten in 3 months. This metric actively encourages skipping review.
Developer satisfaction with AI tools. Developers love tools that make them feel productive. But feeling productive and being productive aren't the same thing.
The Metrics That Actually Matter
Tier 1: Leading Indicators (Measure Weekly)
These metrics predict problems before they become expensive.
1. AI Code Modification Rate
Track how often AI-generated code is modified within 30 days of being merged. This is the single most informative metric I've found.
// Pseudocode for tracking this metric
interface AICodeModificationRate {
// How to calculate:
// 1. Tag PRs as AI-assisted or human-written at merge time
// 2. Track subsequent modifications to files from each category
// 3. Calculate the percentage modified within 30 days
aiAssistedFilesModified30d: number; // e.g., 34%
humanWrittenFilesModified30d: number; // e.g., 18%
ratio: number; // e.g., 1.89x
// Target: ratio should be < 1.5x
// If AI code is modified nearly 2x as often as human code,
// your generation quality needs work
}

Healthy range: AI code should be modified at most 1.5x as often as human code. If it's 2x or higher, your team is generating code that doesn't fit your codebase.
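To make this concrete, here's a minimal sketch of the calculation. It assumes you already export merge records from your Git host with an AI-assisted flag (set at merge time, for example via a PR label) and the timestamp of the first post-merge change to each file; the record shape and field names are illustrative, not a specific tool's API.

interface MergedFileRecord {
  path: string;
  aiAssisted: boolean;              // tagged at merge time, e.g., via a PR label
  mergedAt: Date;
  firstModifiedAfterMergeAt?: Date; // undefined if untouched since merge
}

const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

function modifiedWithin30Days(record: MergedFileRecord): boolean {
  return (
    record.firstModifiedAfterMergeAt !== undefined &&
    record.firstModifiedAfterMergeAt.getTime() - record.mergedAt.getTime() <= THIRTY_DAYS_MS
  );
}

function modificationRateRatio(records: MergedFileRecord[]): number {
  // Share of files modified within 30 days, per category
  const rate = (subset: MergedFileRecord[]): number =>
    subset.length === 0 ? 0 : subset.filter(modifiedWithin30Days).length / subset.length;

  const aiRate = rate(records.filter((r) => r.aiAssisted));
  const humanRate = rate(records.filter((r) => !r.aiAssisted));
  return humanRate === 0 ? 0 : aiRate / humanRate; // target: < 1.5
}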
2. Defect Origin Rate
How many production bugs trace back to AI-generated code versus human-written code, normalized to bugs per thousand lines from each origin?
| Origin | Bugs per 1K Lines | Assessment |
|---|---|---|
| AI-generated | < 0.5 | Excellent |
| AI-generated | 0.5 - 1.5 | Acceptable |
| AI-generated | > 1.5 | Needs intervention |
| Human-written (baseline) | typically 0.5 - 1.0 | Reference |
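A sketch of the normalization, assuming you've already traced each production bug back to an originating PR and that PR carries the same AI-assisted label; the field names are placeholders for whatever your bug tracker exports.

interface DefectSample {
  aiBugs: number;     // production bugs traced to AI-assisted PRs
  aiLines: number;    // lines merged from AI-assisted PRs in the same period
  humanBugs: number;
  humanLines: number;
}

function bugsPerThousandLines(bugs: number, lines: number): number {
  return lines === 0 ? 0 : (bugs / lines) * 1000;
}

function defectOriginRates(sample: DefectSample) {
  return {
    ai: bugsPerThousandLines(sample.aiBugs, sample.aiLines),          // target: < 1.5
    human: bugsPerThousandLines(sample.humanBugs, sample.humanLines), // baseline
  };
}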
3. Pattern Conformance Score
Run automated checks on AI-generated code to measure how well it matches your codebase conventions. I use a custom script that checks:
const patternChecks = {
errorHandling: "Uses team's Result<T,E> pattern vs try/catch",
imports: "Uses aliased imports (@/) vs relative paths",
dataAccess: "Uses established data layer vs direct DB calls",
naming: "Follows naming conventions from .eslintrc",
fileStructure: "Follows colocation rules from ARCHITECTURE.md",
};
// Score each check as pass/fail
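// Minimal scoring sketch: how you run each check (linter rules, AST queries,
// plain grep) is up to you; this just turns pass/fail results into a percentage.
type CheckResult = { check: keyof typeof patternChecks; passed: boolean };

function conformanceScore(results: CheckResult[]): number {
  if (results.length === 0) return 0;
  const passed = results.filter((r) => r.passed).length;
  return Math.round((passed / results.length) * 100); // percent, 0-100
}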
// Target: > 80% conformance rate

Tier 2: Lagging Indicators (Measure Monthly)
These confirm trends you're already seeing in Tier 1 metrics.
4. Code Duplication Delta
Run jscpd or a similar tool monthly. Track how duplication changes over time, specifically in files that were AI-generated.
# Baseline measurement
npx jscpd src/ --min-lines 5 --reporters json > baseline.json
# Monthly comparison
npx jscpd src/ --min-lines 5 --reporters json > current.json
# Compare totals
# Healthy: < 2% increase per month
# Concerning: 2-5% increase per month
# Critical: > 5% increase per month

5. Onboarding Velocity
How long does it take a new developer to make their first meaningful contribution? If this is increasing, AI code might be making your codebase harder to understand. This is a lagging indicator but a powerful one.
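If you want a number rather than a feeling, one option is the median time from start date to first merged, non-trivial contribution. A minimal sketch, with a purely illustrative record shape:

interface NewHire {
  startDate: Date;
  firstMeaningfulMergeDate?: Date; // undefined if it hasn't happened yet
}

const MS_PER_DAY = 86_400_000;

function medianOnboardingDays(hires: NewHire[]): number | undefined {
  const days = hires
    .filter((h) => h.firstMeaningfulMergeDate !== undefined)
    .map((h) => (h.firstMeaningfulMergeDate!.getTime() - h.startDate.getTime()) / MS_PER_DAY)
    .sort((a, b) => a - b);
  if (days.length === 0) return undefined;
  const mid = Math.floor(days.length / 2);
  return days.length % 2 ? days[mid] : (days[mid - 1] + days[mid]) / 2;
}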
6. Review Cycle Time for AI PRs
Track how long AI-assisted PRs take to get through code review compared to human-written PRs. If AI PRs take longer to review, reviewers are finding issues that need discussion.
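A sketch of the comparison, assuming you export opened and merged timestamps along with the same AI-assisted label used for the other metrics; the mean is shown here, but the median is less sensitive to a single stalled PR.

interface ReviewedPr {
  aiAssisted: boolean;
  openedAt: Date;
  mergedAt: Date;
}

const MS_PER_HOUR = 3_600_000;

function meanReviewHours(prs: ReviewedPr[]): number {
  if (prs.length === 0) return 0;
  const totalMs = prs.reduce((sum, pr) => sum + (pr.mergedAt.getTime() - pr.openedAt.getTime()), 0);
  return totalMs / prs.length / MS_PER_HOUR;
}

function reviewCycleComparison(prs: ReviewedPr[]) {
  return {
    aiHours: meanReviewHours(prs.filter((p) => p.aiAssisted)),
    humanHours: meanReviewHours(prs.filter((p) => !p.aiAssisted)),
  };
}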
Tier 3: Strategic Indicators (Measure Quarterly)
7. Architecture Integrity Index
Use a tool like dependency-cruiser to measure how well your codebase's dependency graph matches your intended architecture. AI code tends to create unexpected dependencies.
npx depcruise --config .dependency-cruiser.js src/ --output-type err
# Track violation count over time

8. Total Cost of AI Code
Calculate: (Time saved generating) minus (Time spent reviewing + Time spent fixing + Time spent maintaining). If this is negative, your AI usage pattern is costing you money.
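A back-of-the-envelope sketch; every input is a rough estimate in engineer-hours over the same period, which is about as precise as this metric gets.

interface AiCostInputs {
  hoursSavedGenerating: number;
  hoursReviewing: number;
  hoursFixing: number;
  hoursMaintaining: number;
}

function netAiHours(input: AiCostInputs): number {
  // Positive: AI usage is paying for itself. Negative: it's costing you time.
  return (
    input.hoursSavedGenerating -
    (input.hoursReviewing + input.hoursFixing + input.hoursMaintaining)
  );
}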
The AI Quality Dashboard
Here's the dashboard I set up for my teams. Four panels, updated weekly:
+----------------------------+---------------------------+
| AI Modification Rate | Defect Origin Rate |
| Target: < 1.5x human | Target: < 1.0 per 1K LoC |
| Current: [chart] | Current: [chart] |
+----------------------------+---------------------------+
| Pattern Conformance | Duplication Delta |
| Target: > 80% | Target: < 2% monthly |
| Current: [chart] | Current: [chart] |
+----------------------------+---------------------------+
The Contrarian Take: Some Metrics Should Be Ignored on Purpose
I deliberately don't measure code coverage for AI-generated tests. Here's why: AI can generate tests that achieve 95% code coverage and catch zero meaningful bugs. Coverage becomes a vanity metric when AI is generating both the code and the tests. It tells you how much code was executed during testing, not how much behavior was verified.
Instead, I measure mutation testing scores on AI-generated code. Mutation testing modifies your code slightly (e.g., changing > to >=) and checks if any test fails. If no test fails, the test suite has a gap. AI-generated tests routinely score 30-40% on mutation testing while maintaining 90%+ code coverage.
# Using Stryker for JavaScript/TypeScript mutation testing
npx stryker run --mutate "src/ai-generated/**/*.ts"
# Target mutation score: > 60%

Setting Up Your Measurement System
Week 1: Establish Baselines
Tag all PRs as AI-assisted or human-written for one week. Measure your current duplication ratio, defect rate, and pattern conformance.
Weeks 2-4: Track Leading Indicators
Set up automated collection of Tier 1 metrics. Present them in your weekly standup. Don't make changes yet; just observe.
Month 2: Identify Patterns
Look for correlations. Which types of AI-generated code have the highest modification rate? Which developers produce the most conformant AI code? Which prompt patterns produce the best metrics?
Month 3: Act on Data
Now you have enough data to make informed decisions. Maybe AI-generated API routes are fine but AI-generated data migrations are problematic. Adjust your AI usage policy based on actual metrics, not gut feeling.
The Bottom Line
You can't improve what you don't measure. But measuring the wrong things is worse than measuring nothing, because it gives you false confidence. Focus on the metrics that predict long-term code health, not short-term productivity. The investment pays off within one quarter.