
Engineering Metrics That Drive Behavior (For Better or Worse)

Vaibhav Verma
8 min read
engineering metrics · developer productivity · Goodhart's Law · engineering management · code coverage · story points

Every metric you put on a dashboard changes behavior. This is Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. And in engineering organizations, the behavioral consequences of bad metrics are devastating.

I learned this the hard way. In 2021, I introduced "PR cycle time" as a team metric. Within two months, cycle time dropped 40%. I was thrilled. Then I noticed something: average PR size had tripled. Developers were merging larger PRs less frequently instead of shipping smaller ones more often. They'd optimized for fewer trips through the pipeline, not faster delivery.

The metric went down. The actual outcome I cared about got worse.

That experience turned me into a student of metric-driven behavior. Here's what I've learned about which metrics help, which hurt, and why the difference matters.

Metrics That Poison Teams

Story Point Velocity

Velocity was designed as a planning tool. It answers: "How much work can this team typically finish in a sprint?" It was never meant to be a performance metric. But the moment management starts tracking velocity as a KPI, the gaming begins.

Teams inflate point estimates. A task that was a 3 becomes a 5 becomes an 8. Velocity goes up. Actual output doesn't change. Worse, teams avoid taking on uncertain or hard work because it threatens their velocity number. Innovation dies. Only predictable, well-understood work gets prioritized.

I've seen a team with a "velocity" of 80 points per sprint that shipped less customer value than a team with a "velocity" of 30. The numbers were meaningless because the point scales were completely different.

The fix: Use velocity only for sprint planning within a team. Never compare velocity between teams. Never report it upward. If leadership asks for velocity numbers, that's a red flag.

Individual Commit or PR Counts

Counting commits or PRs per developer seems objective. It's not. It incentivizes:

  • Splitting logical work into artificially small commits
  • Avoiding large, valuable refactors that produce "only one PR"
  • Skipping code review (reviewing doesn't produce commits)
  • Creating trivial PRs (README typo fixes, config formatting) to pad the count

I once worked with a developer who averaged 4 PRs per day. Impressive on paper. Every PR was 5-20 lines. Meaningful PRs that required actual thought and testing? Maybe one per week. But his metrics looked fantastic.

The fix: Track PR throughput at the team level only. Focus on PR size distribution, not count. The goal is a steady flow of small-to-medium PRs, not a high count of trivial ones.
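To make "size distribution, not count" concrete, here's a minimal sketch of the kind of report I mean, assuming you've already pulled lines-changed totals for recent team PRs from your hosting platform. The data and the bucket thresholds below are made up:

```python
from collections import Counter

# Hypothetical lines-changed totals (additions + deletions) for recent team PRs.
pr_sizes = [12, 48, 210, 35, 870, 60, 25, 400, 15, 95]

def size_bucket(lines_changed: int) -> str:
    """Bucket a PR by total lines changed; thresholds are illustrative."""
    if lines_changed <= 50:
        return "small (<=50 lines)"
    if lines_changed <= 250:
        return "medium (51-250 lines)"
    return "large (>250 lines)"

# Report the share of PRs in each bucket rather than a per-person count.
distribution = Counter(size_bucket(n) for n in pr_sizes)
for bucket, count in distribution.most_common():
    print(f"{bucket}: {count / len(pr_sizes):.0%}")
```

A healthy distribution is dominated by the small and medium buckets; a growing large bucket is the early warning that work is batching up.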

Code Coverage Percentage

"We need to hit 80% code coverage" is a statement I've heard at nearly every company I've worked with. And nearly every time, it produces tests that are worse than no tests.

When coverage is a target, developers write tests that execute code without verifying behavior. They test getters and setters. They mock everything, achieving "coverage" without any actual testing. The suite passes, the coverage number is green, and bugs ship anyway because nobody tested the important paths.
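Here's what that looks like in practice, as a minimal hypothetical pytest example. The first test executes the happy path and bumps the coverage number, but any return value at all would let it pass; the second one actually pins down behavior:

```python
import pytest

# A hypothetical function under test.
def apply_discount(price: float, percent: float) -> float:
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price * (1 - percent / 100)

def test_discount_coverage_only():
    # Runs the code, asserts nothing: green coverage, zero verification.
    apply_discount(100.0, 10.0)

def test_discount_behavior():
    # Verifies the result and the error path that callers rely on.
    assert apply_discount(100.0, 10.0) == pytest.approx(90.0)
    with pytest.raises(ValueError):
        apply_discount(100.0, 150.0)
```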

The fix: Track coverage trends, not absolute numbers. A team whose coverage is going from 45% to 50% is improving. A team at 90% with a suite full of meaningless tests is wasting CI time. Replace coverage targets with mutation testing scores if you want to measure test quality.
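If mutation testing is new to you, the idea (automated by tools like mutmut or PIT, but sketched by hand here with invented functions) is to introduce a small deliberate change, a mutant, and see whether the suite notices. An assertion-free test like the one above lets the mutant survive; a behavioral test kills it.

```python
# Original implementation.
def is_adult(age: int) -> bool:
    return age >= 18

# The kind of mutant a mutation-testing tool generates: ">=" flipped to ">".
def is_adult_mutant(age: int) -> bool:
    return age > 18

def suite_passes(impl) -> bool:
    """Run a tiny behavioral 'suite' against a given implementation."""
    try:
        assert impl(18) is True
        assert impl(17) is False
    except AssertionError:
        return False
    return True

print(suite_passes(is_adult))         # True: original behavior verified
print(suite_passes(is_adult_mutant))  # False: the mutant is killed, so the test has real teeth
```

The mutation score is the share of mutants your suite kills, and unlike raw coverage it can't be satisfied by tests that never assert anything.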

Lines of Code

I covered this in depth in a separate post, but the short version: LoC incentivizes verbosity, discourages refactoring, and punishes the most valuable engineering work (simplification, deletion, abstraction). Never use it.

Metrics That Actually Help

Cycle Time (With the Right Definition)

Cycle time is the duration from first commit to deployment in production. It's useful because it captures the entire delivery pipeline: coding, review, CI, merge, deployment.

The key is measuring it at the team level and looking at the distribution, not just the average. A team with a median cycle time of 4 hours and a p95 of 48 hours has a very different problem than a team with a median of 24 hours and a p95 of 72 hours. The first team has outlier PRs getting stuck. The second team has a systemically slow pipeline.
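A rough sketch of that distribution view, assuming you already have first-commit and production-deploy timestamps for each change (the data below is invented):

```python
from datetime import datetime
from math import ceil
from statistics import median

# Hypothetical (first_commit, deployed_to_production) timestamps per change.
changes = [
    (datetime(2024, 3, 4, 9, 0), datetime(2024, 3, 4, 13, 30)),
    (datetime(2024, 3, 4, 10, 0), datetime(2024, 3, 5, 9, 0)),
    (datetime(2024, 3, 5, 11, 0), datetime(2024, 3, 7, 16, 0)),
    (datetime(2024, 3, 6, 9, 30), datetime(2024, 3, 6, 12, 0)),
    (datetime(2024, 3, 7, 8, 0), datetime(2024, 3, 7, 18, 15)),
]

cycle_hours = sorted(
    (deployed - first_commit).total_seconds() / 3600
    for first_commit, deployed in changes
)

p50 = median(cycle_hours)
p95 = cycle_hours[ceil(0.95 * len(cycle_hours)) - 1]  # nearest-rank 95th percentile

print(f"median cycle time: {p50:.1f}h, p95: {p95:.1f}h")
```

Looking at both numbers side by side is what separates "a few PRs get stuck" from "everything is slow".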

Behavior it drives: Small PRs (they cycle faster), prompt reviews (they unblock others), pipeline investment (slow CI hurts the number).

Review Turnaround Time

How long from "PR opened" to "first substantive review comment"? This metric drives the single highest-impact behavioral change I've seen in engineering teams: developers start reviewing each other's code promptly.

Fast reviews unblock others. They keep PRs small (because there's no penalty for shipping frequently). They improve code quality because reviewers engage while the context is fresh.

Target: first review within 4 hours during business hours.
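A minimal sketch of measuring this, assuming you've already fetched each PR's opened time and the time of its first substantive review from your hosting platform. The records below are invented, and the business-hours adjustment is left out for brevity:

```python
from datetime import datetime, timedelta

# Hypothetical (pr_number, opened_at, first_substantive_review_at) records.
prs = [
    (101, datetime(2024, 3, 4, 9, 0), datetime(2024, 3, 4, 10, 15)),
    (102, datetime(2024, 3, 4, 14, 0), datetime(2024, 3, 5, 11, 30)),
    (103, datetime(2024, 3, 5, 9, 30), datetime(2024, 3, 5, 12, 45)),
]

TARGET = timedelta(hours=4)

for number, opened_at, first_review_at in prs:
    turnaround = first_review_at - opened_at
    status = "ok" if turnaround <= TARGET else "over target"
    print(f"PR #{number}: {turnaround.total_seconds() / 3600:.1f}h ({status})")

within_target = sum(1 for _, opened, reviewed in prs if reviewed - opened <= TARGET)
print(f"{within_target}/{len(prs)} PRs reviewed within {TARGET.total_seconds() / 3600:.0f}h")
```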

Behavior it drives: Review prioritization, smaller PRs, better team collaboration.

Deployment Frequency

How often the team deploys to production. This metric is hard to game because the only way to increase it is to actually improve your delivery pipeline: faster CI, better tests, smaller changes, automated deployments.

A team deploying multiple times per day has fundamentally different engineering practices than a team deploying weekly. The daily deployers have invested in the tooling and processes that make small, safe changes easy.
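Measuring it is correspondingly simple. Here's a sketch that counts production deploys per ISO week from a list of deploy timestamps; the timestamps are invented, and in practice they'd come from whatever tool performs your deploys:

```python
from collections import Counter
from datetime import datetime

# Hypothetical production deploy timestamps.
deploys = [
    datetime(2024, 3, 4, 11, 0),
    datetime(2024, 3, 4, 16, 30),
    datetime(2024, 3, 5, 10, 15),
    datetime(2024, 3, 7, 9, 45),
    datetime(2024, 3, 11, 14, 0),
]

# Group by (ISO year, ISO week); the trend over weeks matters more than any single number.
per_week = Counter((d.isocalendar()[0], d.isocalendar()[1]) for d in deploys)

for (year, week), count in sorted(per_week.items()):
    print(f"{year}-W{week:02d}: {count} deploys")
```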

Behavior it drives: Pipeline investment, small changes, automated testing, deployment confidence.

Developer Experience Survey Score

A quarterly survey asking developers to rate their satisfaction with tools, processes, and ability to do their best work. This metric drives investment in developer experience because it makes DX visible to leadership.

The key is keeping the survey short (10 questions max), consistent (same questions each quarter), and anonymous. Track trends, not absolute scores. A declining trend is always worth investigating, even if the absolute number looks OK.

Behavior it drives: Leadership attention to DX, tool investment, process improvement.

The Contrarian Take: Pair Metrics or Don't Bother

Single metrics always get gamed. Always. The solution is metric pairing: for every speed metric, pair a quality metric. For every output metric, pair an outcome metric.

Some examples, with the speed metric first and its quality pair second:

  • Deployment frequency → Change failure rate
  • PR cycle time → Defect escape rate
  • Sprint velocity → Customer satisfaction score
  • Build time → Test suite reliability

When metrics are paired, gaming one at the expense of the other becomes immediately visible. A team that doubles deployment frequency while their change failure rate triples is clearly cutting corners. A team that doubles velocity while customer satisfaction drops is building the wrong things.

Paired metrics create tension. That tension is healthy. It prevents the single-minded optimization that makes individual metrics dangerous.
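One way to keep that tension visible is a small automated check that compares each speed metric's trend against its quality pair and flags divergence. The metric names, numbers, and thresholds below are all placeholders:

```python
# Each pair: (name, previous value, current value, which direction is "better").
# The values are made up; in practice they'd come from your dashboards.
pairs = [
    {
        "speed": ("deploys_per_week", 10, 21, "up_is_better"),
        "quality": ("change_failure_rate", 0.05, 0.16, "down_is_better"),
    },
    {
        "speed": ("pr_cycle_time_hours", 30, 12, "down_is_better"),
        "quality": ("defect_escape_rate", 0.02, 0.02, "down_is_better"),
    },
]

def improvement(prev: float, curr: float, direction: str) -> float:
    """Fractional improvement; positive means the metric got better."""
    change = (curr - prev) / prev
    return change if direction == "up_is_better" else -change

for pair in pairs:
    s_name, s_prev, s_curr, s_dir = pair["speed"]
    q_name, q_prev, q_curr, q_dir = pair["quality"]
    speed_gain = improvement(s_prev, s_curr, s_dir)
    quality_loss = -improvement(q_prev, q_curr, q_dir)
    # A big speed gain alongside a big quality loss is the signature of a gamed metric.
    if speed_gain > 0.5 and quality_loss > 0.5:
        print(f"WARNING: {s_name} improved {speed_gain:+.0%} "
              f"while {q_name} degraded {quality_loss:+.0%}")
```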

The Stealable Framework: The Metric Health Check

Run this exercise before introducing any new engineering metric. It takes 15 minutes and prevents months of unintended consequences.

Ask five questions about the proposed metric:

1. What behavior does this metric incentivize? Not "what behavior do we want." What behavior does the metric actually incentivize? Be cynical. Assume people will optimize for the number.

2. How can this metric be gamed? If you can think of three ways to improve the metric without improving actual outcomes, the metric is dangerous.

3. What does this metric ignore? Every metric has blind spots. Identify them explicitly. What valuable work will this metric fail to capture?

4. What's the pairing metric? Every speed metric needs a quality pair. Every output metric needs an outcome pair. If you can't identify a natural pair, reconsider the metric.

5. Will we act on this metric? If the number drops 30%, what would you do differently? If the answer is "nothing," you don't need the metric. If the answer is "investigate and fix the root cause," the metric is useful.

I've used this health check to kill more metrics than I've introduced. That's a good thing. Every metric you don't track is cognitive load you don't impose on your team.

The Bottom Line

Metrics are tools. They're useful when they make invisible problems visible and dangerous when they become targets that distort behavior.

The best engineering organizations I've worked with measure few things, measure them carefully, and act on what they learn. The worst ones have dashboards full of metrics that nobody trusts and everyone games.

Choose your metrics wisely. Pair them for balance. Review them for unintended consequences. And have the courage to remove metrics that aren't driving the behavior you want, even if the dashboard looks impressive.
