How to Evaluate Engineering Team Performance Without Gaming
Every metric you choose will be gamed. That's not cynicism. It's Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.
I've watched teams game story points (just inflate estimates), lines of code (write verbose code), deployment frequency (deploy empty commits), PR count (split one change into ten PRs), and test coverage (write meaningless assertions).
The solution isn't to stop measuring. It's to measure things that are hard to game and that actually correlate with what you care about.
The Contrarian Take: Individual Performance Metrics Destroy Teams
Most engineering orgs try to measure individual engineer performance. Stack ranking. Individual velocity. Personal PR stats. This approach reliably destroys collaboration.
When you measure individuals, engineers stop helping each other. Pair programming drops because it "halves your output." Code reviews become cursory because reviewing someone else's code doesn't count toward your metrics. Knowledge sharing stops because it makes other people more productive at the expense of your own numbers.
I stopped measuring individual performance three years ago. Team-level metrics only. Our output went up, not down, because engineers started optimizing for the team's success instead of their personal dashboard.
Individual evaluation still happens. It just happens through qualitative assessment, peer feedback, and observed behavior, not through metrics dashboards.
The Three Layers of Team Performance
I evaluate teams across three layers. Each captures something the others miss.
Layer 1: Delivery Health
This is the closest to traditional metrics. It answers: "Is this team shipping reliably?"
Cycle time (not velocity). How long does it take from first commit to production deployment? I track the median, not the mean, because outliers skew the mean heavily. A healthy team has a median cycle time under 48 hours. If it's creeping up, something is bottlenecking: review, CI, deployment, or decision-making.
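A minimal sketch of the calculation, assuming you can export first-commit and deploy timestamps per change (the record format and field names here are hypothetical):

```python
from datetime import datetime
from statistics import median

# Hypothetical export: one record per change, with the timestamp of its
# first commit and the timestamp of the deploy that shipped it.
changes = [
    {"first_commit": "2024-03-01T09:15:00", "deployed": "2024-03-02T14:40:00"},
    {"first_commit": "2024-03-01T11:00:00", "deployed": "2024-03-05T10:05:00"},
    {"first_commit": "2024-03-03T08:30:00", "deployed": "2024-03-03T17:20:00"},
]

def hours_between(start: str, end: str) -> float:
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 3600

cycle_times = [hours_between(c["first_commit"], c["deployed"]) for c in changes]

# Median, not mean: one change that sat in review for a week
# shouldn't dominate the number.
print(f"median cycle time: {median(cycle_times):.1f}h")
```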
Change failure rate. What percentage of deployments cause an incident or require a rollback? The DORA research puts elite teams under 5%. Most teams I've worked with land between 8% and 15%. Track the trend, not the absolute number.
Recovery time. When something breaks, how fast does the team fix it? This measures operational maturity. A team that ships fast but recovers slowly is a liability.
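Change failure rate and recovery time can come out of the same deployment log. A rough sketch, assuming a hypothetical log where failed deploys carry a resolution timestamp (recovery is measured from the deploy itself here, as a proxy for when the break was detected):

```python
from datetime import datetime

# Hypothetical deployment log. "incident_resolved" is set only for deploys
# that caused an incident or needed a rollback.
deploys = [
    {"deployed": "2024-03-01T10:00:00", "incident_resolved": None},
    {"deployed": "2024-03-02T16:00:00", "incident_resolved": "2024-03-02T19:30:00"},
    {"deployed": "2024-03-04T09:00:00", "incident_resolved": None},
]

def hours(start: str, end: str) -> float:
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 3600

failures = [d for d in deploys if d["incident_resolved"]]

# Change failure rate: share of deploys that caused an incident or rollback.
cfr = len(failures) / len(deploys)

# Recovery time: hours from the bad deploy to resolution, averaged over failures.
recovery = (
    sum(hours(d["deployed"], d["incident_resolved"]) for d in failures) / len(failures)
    if failures else 0.0
)

print(f"change failure rate: {cfr:.0%}, mean recovery time: {recovery:.1f}h")
```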
Why these are hard to game: cycle time rewards simplicity and good process, not volume. Change failure rate penalizes recklessness. Recovery time rewards operational discipline. None of them can be improved by inflating numbers.
Layer 2: Decision Quality
This is the layer most orgs skip. It answers: "Is this team making good technical decisions?"
ADR follow-through rate. What percentage of architecture decision records (ADRs) are still being followed 6 months after they were written? Low follow-through means either the decisions were wrong or the team isn't disciplined enough to maintain them. Both are problems.
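The arithmetic is trivial; the real work is keeping an honest register of which decisions are still being followed. A sketch with a hypothetical register:

```python
# Hypothetical ADR register: each record notes whether the decision is still
# being followed six months after it was written.
adrs = [
    {"id": "ADR-012", "still_followed": True},
    {"id": "ADR-015", "still_followed": False},
    {"id": "ADR-019", "still_followed": True},
]

follow_through = sum(a["still_followed"] for a in adrs) / len(adrs)
print(f"ADR follow-through: {follow_through:.0%}")
```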
Rework rate. What percentage of code shipped in the last 30 days gets changed within the next 30 days? High rework means the team is building things wrong the first time. Some rework is natural (requirements change, learnings emerge), but rates above 25% indicate a planning or quality problem.
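A sketch of the rework calculation, assuming you can extract, for each shipped change, whether and when it was touched again (the records below are hypothetical):

```python
from datetime import date, timedelta

# Hypothetical per-change records: when the code shipped and, if it was
# touched again, when the follow-up change landed.
shipped = [
    {"shipped": date(2024, 3, 1), "changed_again": date(2024, 3, 10)},
    {"shipped": date(2024, 3, 3), "changed_again": None},
    {"shipped": date(2024, 3, 5), "changed_again": date(2024, 5, 1)},  # outside the window
    {"shipped": date(2024, 3, 8), "changed_again": None},
]

WINDOW = timedelta(days=30)

reworked = [
    s for s in shipped
    if s["changed_again"] and s["changed_again"] - s["shipped"] <= WINDOW
]

rework_rate = len(reworked) / len(shipped)
# Above ~25% suggests a planning or quality problem.
print(f"rework rate: {rework_rate:.0%}")
```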
Prediction accuracy. When the team estimates that something will take two weeks, how often does it actually take two weeks? I use a simple ratio: actual time / estimated time. A healthy team is between 0.8 and 1.3. Consistently above 1.5 means the team is systematically underestimating, which usually means they don't understand the codebase well enough or scope isn't being managed.
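A sketch of the ratio, using hypothetical estimates and actuals in working days:

```python
# Hypothetical estimates vs. actuals, in working days.
work_items = [
    {"estimated": 10, "actual": 9},
    {"estimated": 5, "actual": 8},
    {"estimated": 10, "actual": 13},
]

ratios = [w["actual"] / w["estimated"] for w in work_items]
avg_ratio = sum(ratios) / len(ratios)

# Healthy band: roughly 0.8-1.3. Consistently above 1.5 points to
# systematic underestimation; well below 0.8 suggests padded estimates.
if 0.8 <= avg_ratio <= 1.3:
    verdict = "healthy"
elif avg_ratio > 1.3:
    verdict = "underestimating"
else:
    verdict = "padding estimates"

print(f"actual/estimated ratio: {avg_ratio:.2f} ({verdict})")
```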
Layer 3: Team Health
This is the qualitative layer. It answers: "Is this team sustainable?"
Retention rate. Are people staying? If you're losing more than 15% of your engineers per year (excluding company-wide issues), something is wrong with the team environment.
Engagement signals. Do engineers participate in design discussions? Do they suggest improvements unprompted? Do they help each other? Do they push back on bad requirements? Engaged engineers do all of these. Disengaged engineers complete their tickets and go home.
Growth velocity. Are engineers on the team getting better? Track promotions, scope increases, and skill development over 12-month periods. A team where nobody grows is a team where nobody stays.
I gather this data through quarterly anonymous surveys (5 questions, takes 3 minutes), skip-level 1:1s, and my own observations.
The Evaluation Cadence
I run this evaluation quarterly. Monthly is too frequent because the data is noisy. Annually is too rare because problems compound.
Week 1 of the quarter: Collect data. Pull the delivery metrics automatically. Send the team health survey. Review ADRs.
Week 2: Synthesize. Write a one-page team health report. Traffic-light format: green (healthy), yellow (watch), red (act now).
Week 3: Share with the team. No surprises. Walk through the report together. Ask what they think. Their interpretation of the data is often more valuable than the data itself.
Week 4: Action plan. Pick one yellow or red item. Define what "green" looks like. Set specific actions. Track to completion.
The Stealable Framework: The 3x3 Performance Grid
Draw a 3x3 grid. Rows are the three layers (Delivery, Decisions, Health). Columns are the three states (Green, Yellow, Red).
For each layer, place the team:
| | Green | Yellow | Red |
|---|---|---|---|
| Delivery | Cycle time <48h, CFR <10%, recovery <4h | Any metric trending wrong for 2+ sprints | Any metric 2x worse than target |
| Decisions | Rework <15%, predictions within 0.8-1.3x | Rework 15-25% or prediction ratio drifting outside 0.8-1.3x | Rework >25%, repeated missed predictions |
| Health | Retention >90%, high engagement, growth visible | Any signal declining | Attrition spike, disengagement, no growth |
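As a sketch, the grid translates directly into a small classifier. The numeric thresholds mirror the table above; where the table is qualitative (especially the health layer), the cutoffs below are illustrative assumptions, not prescriptions:

```python
# A minimal sketch of the 3x3 grid as a classifier. Input fields are hypothetical.
def rate_delivery(cycle_time_h, cfr, recovery_h, trending_wrong_sprints=0):
    if cycle_time_h > 96 or cfr > 0.20 or recovery_h > 8:  # roughly 2x worse than target
        return "red"
    if trending_wrong_sprints >= 2 or cycle_time_h > 48 or cfr > 0.10 or recovery_h > 4:
        return "yellow"
    return "green"

def rate_decisions(rework_rate, prediction_ratio):
    if rework_rate > 0.25 or prediction_ratio > 1.5:
        return "red"
    if rework_rate > 0.15 or not (0.8 <= prediction_ratio <= 1.3):
        return "yellow"
    return "green"

def rate_health(retention, engagement_declining, growth_visible):
    # Assumed cutoffs: the table only says "attrition spike, disengagement, no growth".
    if retention < 0.85 or not growth_visible:
        return "red"
    if retention < 0.90 or engagement_declining:
        return "yellow"
    return "green"

grid = {
    "Delivery": rate_delivery(cycle_time_h=36, cfr=0.07, recovery_h=3),
    "Decisions": rate_decisions(rework_rate=0.18, prediction_ratio=1.1),
    "Health": rate_health(retention=0.92, engagement_declining=False, growth_visible=True),
}
print(grid)  # {'Delivery': 'green', 'Decisions': 'yellow', 'Health': 'green'}
```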
A team that's green across all three is high-performing. Keep them happy and stay out of their way.
A team with one yellow needs monitoring and a conversation. Something is slipping.
A team with any red needs immediate intervention. Don't wait for the next quarter.
The beauty of this grid is that it prevents you from overlooking one dimension. A team that ships fast (green delivery) but has high turnover (red health) is not a high-performing team. It's a team that's burning out.
Common Pitfalls
Pitfall: Comparing teams to each other. Every team has different constraints, different codebases, and different complexity. Comparing Team A's cycle time to Team B's cycle time is meaningless unless they're working on comparable problems. Compare teams to their own trajectory.
Pitfall: Over-indexing on delivery metrics. Delivery is the easiest to measure and the most tempting to focus on. But a team that ships fast while accumulating decision debt and burning out engineers is heading for a cliff.
Pitfall: Ignoring the qualitative data. Numbers are comfortable. Conversations are messy. But the skip-level 1:1 where an engineer tells you "I don't feel challenged anymore" is worth more than any dashboard. Don't skip the human part.
The goal of performance evaluation isn't to judge teams. It's to help them get better. When you measure the right things at the right cadence, performance evaluation becomes a coaching tool instead of a surveillance system.