Code Complexity Metrics: Which Ones Actually Predict Bugs
For three years, I tracked every production incident at a mid-size SaaS company and correlated them back to code metrics. I expected cyclomatic complexity to be the top predictor. It wasn't even in the top three.
The thing most teams get wrong about complexity metrics is treating them as absolute measures. A function with cyclomatic complexity of 15 isn't inherently buggy. A function with a complexity of 3 that's been modified by 12 different developers in 6 months? That's where your bugs live.
The Metrics Menu
Let's start with what's available, then narrow down to what actually matters.
Cyclomatic Complexity (McCabe, 1976)
Counts the number of independent paths through a function. Every if, else if, for, while, case, and catch adds 1 to the count.
```typescript
// Cyclomatic complexity: 4
function processPayment(payment: Payment): Result {
  if (!payment.amount) return { error: "No amount" };   // +1
  if (payment.amount < 0) return { error: "Negative" }; // +1
  if (payment.currency !== "USD") {                     // +1
    payment = convertCurrency(payment);
  }
  return chargeCard(payment);
}
```

The problem: Cyclomatic complexity treats all branches as equal. A 10-branch switch statement that maps enum values to strings has the same complexity as a 10-branch nested if-else tree with side effects. One is trivial. The other is a bug factory.
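To see why that matters, compare two hypothetical functions (the names and types below are invented for illustration). Both score the same; only one is likely to hurt you.

```typescript
type Status = "draft" | "active" | "suspended" | "closed";

// Cyclomatic complexity: 4 -- but it's effectively a lookup table.
function statusLabel(status: Status): string {
  switch (status) {
    case "draft": return "Draft";          // +1
    case "active": return "Active";        // +1
    case "suspended": return "Suspended";  // +1
    default: return "Closed";
  }
}

interface Order { total: number; isVip: boolean; }

// Also cyclomatic complexity 4 -- but each branch reads or writes state
// that the previous branch may have changed.
function discountedTotal(order: Order): number {
  let discount = 0;
  if (order.total > 1000) {   // +1
    discount = 0.1;
  }
  if (order.isVip) {          // +1
    discount += 0.05;
  }
  if (discount > 0.12) {      // +1
    discount = 0.12;
  }
  return order.total * (1 - discount);
}
```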
Cognitive Complexity (SonarSource, 2016)
Designed to fix cyclomatic complexity's shortcomings. It penalizes nesting, recognizes that some structures are harder to understand than others, and ignores shorthand patterns.
```typescript
// Cognitive complexity: 6
function processOrder(order: Order) {
  if (order.isPriority) {               // +1
    for (const item of order.items) {   // +2 (nesting)
      if (item.requiresApproval) {      // +3 (nesting)
        requestApproval(item);
      }
    }
  }
}
```

```typescript
// Cognitive complexity: 4
function processOrder(order: Order) {
  if (!order.isPriority) return;        // +1
  for (const item of order.items) {     // +1
    if (item.requiresApproval) {        // +2 (one level less nesting)
      requestApproval(item);
    }
  }
}
```

Same logic, different complexity scores. Cognitive complexity rewards the early-return pattern, which matches how humans actually read code.
Halstead Metrics (1977)
Based on counting operators and operands. Produces metrics like volume, difficulty, and estimated effort. Mostly academic. I haven't found them useful in practice.
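For completeness, the formulas themselves are simple once you have the counts. A minimal sketch (the counts and helper names are made up for illustration; real tools derive them from the parse tree):

```typescript
// Halstead metrics from operator/operand counts.
interface HalsteadCounts {
  distinctOperators: number; // n1
  distinctOperands: number;  // n2
  totalOperators: number;    // N1
  totalOperands: number;     // N2
}

function halstead({ distinctOperators: n1, distinctOperands: n2,
                    totalOperators: N1, totalOperands: N2 }: HalsteadCounts) {
  const vocabulary = n1 + n2;
  const length = N1 + N2;
  const volume = length * Math.log2(vocabulary);  // "size" of the implementation
  const difficulty = (n1 / 2) * (N2 / n2);        // how error-prone it supposedly is
  const effort = difficulty * volume;             // estimated mental effort
  return { volume, difficulty, effort };
}

// Example with invented counts:
// halstead({ distinctOperators: 12, distinctOperands: 20, totalOperators: 45, totalOperands: 60 })
```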
Lines of Code (LOC)
The simplest metric. And surprisingly predictive. A 2007 study by Nagappan et al. at Microsoft found that LOC was as good as or better than more sophisticated metrics for predicting post-release defects. Not because long functions are inherently buggy, but because LOC correlates with everything that does cause bugs: more logic, more state, more developers touching the code.
Churn (Change Frequency)
How often a file or function changes. This is a process metric, not a code metric, and that's exactly why it works. High churn means the code is either unstable (bugs keep getting found) or under active development (bugs keep getting introduced).
Coupling (Afferent and Efferent)
Afferent coupling: how many other modules depend on this one. Efferent coupling: how many modules this one depends on. High afferent coupling means a change here breaks many things. High efferent coupling means changes in many places could break this module.
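If you want a quick read on efferent coupling without a dedicated tool, counting distinct import targets gets you most of the way. A rough sketch, assuming a Node/TypeScript project (regex-based, so treat it as an approximation, not a real dependency analysis):

```typescript
import { readFileSync } from "node:fs";

// Rough efferent-coupling estimate: how many distinct modules does this file import?
function efferentCoupling(filePath: string): number {
  const source = readFileSync(filePath, "utf8");
  const imports = new Set<string>();
  for (const match of source.matchAll(/from\s+["']([^"']+)["']/g)) {
    imports.add(match[1]);
  }
  return imports.size;
}

// Afferent coupling is the same count taken from the other direction:
// how many files in the repo import *this* module.
```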
The Metrics That Actually Predict Bugs
Based on my own data and published research, here's my ranking:
The Bug Prediction Power Ranking
| Rank | Metric | Predictive Power | Why It Works |
|---|---|---|---|
| 1 | Churn x Complexity | Very High | Hot, complex code = bugs |
| 2 | Number of Authors | High | More hands = more misunderstanding |
| 3 | Cognitive Complexity | High | Matches human comprehension limits |
| 4 | File Size (LOC) | Moderate-High | Simple proxy for everything |
| 5 | Coupling (Efferent) | Moderate | More dependencies = more failure modes |
| 6 | Cyclomatic Complexity | Moderate | Useful but overrated |
| 7 | Code Duplication | Moderate | Fixes applied in one place, not others |
| 8 | Test Coverage (inverse) | Low-Moderate | Low coverage = undetected bugs |
The number one predictor is the combination of churn and complexity. This concept comes from Adam Tornhill's work in "Your Code as a Crime Scene." The idea is simple: code that's both complex and frequently changed is where bugs congregate.
Complex code that never changes is fine. It's stable. Simple code that changes frequently is also fine. It's easy to get right. But complex code that keeps changing? That's your hotspot.
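A minimal sketch of the scoring idea, with illustrative numbers (churn would come from git history, complexity from a tool like lizard; CodeScene does this properly):

```typescript
interface FileMetrics {
  path: string;
  churn: number;      // commits touching the file in the last 6 months
  complexity: number; // e.g. summed cognitive complexity of its functions
}

// Rank files by churn x complexity, highest first.
function rankHotspots(files: FileMetrics[]): FileMetrics[] {
  return [...files].sort(
    (a, b) => b.churn * b.complexity - a.churn * a.complexity
  );
}

const hotspots = rankHotspots([
  { path: "src/orders/OrderService.ts", churn: 24, complexity: 68 },
  { path: "src/billing/invoice.ts",     churn: 3,  complexity: 91 }, // complex but stable
  { path: "src/utils/format.ts",        churn: 19, complexity: 4  }, // busy but simple
]);
// OrderService.ts comes out on top: it's both hot and complex.
```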
My Contrarian Take: Stop Optimizing for Metrics
I was wrong for years about how to use these metrics. I'd set up SonarQube, configure quality gates, and block PRs that exceeded complexity thresholds. The result? Developers gamed the metrics. They'd split one complex function into three functions that called each other, reducing cyclomatic complexity while making the code harder to understand.
Metrics are diagnostic tools, not targets. The moment you make a metric a target, it ceases to be a good metric. This is Goodhart's Law, and it applies to code quality just as much as it applies to economics.
Here's what I do instead.
The Hotspot Analysis Framework
Run this analysis monthly. It takes about 30 minutes and gives you more actionable insight than any dashboard.
Step 1: Find your hotspots.
```bash
# Get the 20 most frequently changed files in the last 6 months
git log --since="6 months ago" --pretty=format: --name-only | \
  grep -v '^$' | sort | uniq -c | sort -rn | head -20
```

Step 2: Measure complexity for each hotspot.
Use your tool of choice. For TypeScript, I like ts-complex. For general use, lizard supports 20+ languages.
```bash
# pip install lizard
lizard src/orders/OrderService.ts
```

Step 3: Plot churn vs. complexity.
You don't need a fancy tool. A spreadsheet works. Put churn on the X axis and complexity on the Y axis. The files in the upper-right quadrant are your highest-priority targets for refactoring.
Step 4: Check the author count.
```bash
# How many different authors have touched this file?
git log --format="%aN" src/orders/OrderService.ts | sort -u | wc -l
```

Files with high churn, high complexity, and many authors are the most dangerous. They're where miscommunication and conflicting assumptions turn into bugs.
Step 5: Prioritize and act.
Don't try to fix everything. Pick the top 3 hotspots. For each one, decide: refactor, add tests, or split into smaller modules. Make it a sprint goal, not a side project.
Setting Useful Thresholds
If you want guidelines (not rules), here's what I've found reasonable:
| Metric | Green | Yellow | Red |
|---|---|---|---|
| Cognitive Complexity | < 15 | 15-25 | > 25 |
| Function LOC | < 50 | 50-100 | > 100 |
| File LOC | < 300 | 300-600 | > 600 |
| Monthly Churn (commits) | < 5 | 5-15 | > 15 |
| Number of Authors (6mo) | < 3 | 3-5 | > 5 |
| Efferent Coupling | < 5 | 5-10 | > 10 |
These aren't universal truths. A 200-line React component might be perfectly readable. A 50-line cryptography function might be incomprehensible. Context matters more than thresholds.
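If you do want to surface these in a report (a report, not a merge gate), mirroring the table is enough. A sketch, with the metric names and structure invented here:

```typescript
type Rating = "green" | "yellow" | "red";

interface Thresholds { yellow: number; red: number; } // >= yellow is yellow, > red is red

// These values just mirror the guideline table above.
const GUIDELINES: Record<string, Thresholds> = {
  cognitiveComplexity: { yellow: 15, red: 25 },
  functionLoc:         { yellow: 50, red: 100 },
  fileLoc:             { yellow: 300, red: 600 },
  monthlyChurn:        { yellow: 5, red: 15 },
  authorsSixMonths:    { yellow: 3, red: 5 },
  efferentCoupling:    { yellow: 5, red: 10 },
};

function rate(metric: keyof typeof GUIDELINES, value: number): Rating {
  const t = GUIDELINES[metric];
  if (value > t.red) return "red";
  if (value >= t.yellow) return "yellow";
  return "green";
}

// rate("cognitiveComplexity", 18) -> "yellow": a prompt for a conversation, not a rejection.
```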
Tools Worth Using
- CodeClimate. Good SaaS option. Calculates cognitive complexity, duplication, and file length. Integrates with GitHub PRs.
- SonarQube/SonarCloud. The enterprise standard. Tracks 30+ metrics. Free for open source.
- lizard. Command-line tool that calculates cyclomatic complexity for 20+ languages. No configuration needed.
- git-of-theseus. Visualizes how code ages and who wrote what. Great for identifying knowledge silos.
- CodeScene. Adam Tornhill's tool for hotspot analysis. Best-in-class for churn-based analysis. Commercial but worth it for large codebases.
The Takeaway
Stop chasing a single complexity number. Start thinking about code complexity as a function of three variables: inherent complexity of the code, rate of change, and number of people changing it. The intersection of all three is where your bugs live, and that's where your attention should go.
Measure, diagnose, act. In that order.