How Netflix, Google, and Stripe Think About Tech Debt
I've spent the last 18 months studying how large engineering organizations manage technical debt. I read published engineering blog posts, talked to 12 engineers and managers who've worked at Netflix, Google, Stripe, Amazon, and Shopify, and analyzed every public postmortem and architecture document I could find.
What I discovered contradicts most of what the industry believes about how "elite" companies handle debt. They don't have less debt. They don't have cleaner code. They have fundamentally different systems for making debt visible, making decisions about it, and preventing the kind of debt that kills velocity.
Here's what each company gets right, what they get wrong, and what you can steal for your own team.
Netflix: Freedom and Responsibility Applied to Debt
Netflix's engineering culture is built on the "Freedom and Responsibility" principle. Engineers have enormous autonomy. There's no central architecture review board. Teams choose their own tech stacks, their own patterns, their own pace of debt paydown.
What They Get Right
1. Chaos Engineering as Debt Detection
Netflix doesn't wait for debt to cause problems. They actively probe for weakness with Chaos Monkey and its successors. When a service can't handle a simulated failure, that's debt made visible.
This is brilliant because it flips the usual debt conversation. Instead of "we think this might be a problem someday," it's "we proved this fails under stress last Tuesday." The data from chaos experiments creates urgency that no amount of code analysis ever could.
2. Paved Roads, Not Mandates
Netflix builds "paved roads": standardized tools and platforms that are optional but so much better than the alternatives that teams adopt them voluntarily. Their internal platform handles deployment, monitoring, and traffic management. Teams that stay on the paved road accumulate less operational debt automatically.
The key insight: they don't mandate compliance. They make compliance the path of least resistance.
3. Full Ownership Model
The team that builds a service operates it. There's no separate ops team to absorb the pain of bad architecture decisions. This creates a direct feedback loop: if you accumulate debt, you feel the operational pain yourself. On-call rotations turn that pain into steady pressure to pay the debt down.
What They Get Wrong
Netflix's freedom model means there's no consistent way to measure aggregate debt across the organization. Individual teams might be great at managing their own debt, but systemic issues (like 15 different logging frameworks across 800 services) emerge slowly and nobody owns the cross-cutting problem.
Steal This: The chaos engineering approach to debt detection. You don't need Netflix-scale tooling. Run a quarterly "what happens if X fails?" exercise for your critical paths. The failures you discover are your highest-priority debt items. Build a simple game day: turn off a service, introduce latency, kill a database replica. Document what breaks.
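To make the game day idea concrete, here is a minimal sketch of a fault-injection harness. Everything in it is my own illustration, not Netflix's tooling or Chaos Monkey's API: the FaultInjector class, the run_game_day function, and the checkout and recommendations names are all hypothetical. It assumes your critical paths can be exercised as plain function calls.

```python
import random
import time
from dataclasses import dataclass, field


class InjectedFault(Exception):
    """Raised by the wrapper to simulate a dependency outage."""


@dataclass
class FaultInjector:
    # Hypothetical helper for illustration; not a real chaos tool's API.
    failure_rate: float = 0.0      # probability of a simulated outage per call
    added_latency_s: float = 0.0   # artificial delay added to every call
    rng: random.Random = field(default_factory=random.Random)

    def call(self, dependency, *args, **kwargs):
        if self.added_latency_s:
            time.sleep(self.added_latency_s)
        if self.rng.random() < self.failure_rate:
            raise InjectedFault(f"simulated failure of {dependency.__name__}")
        return dependency(*args, **kwargs)


def run_game_day(scenarios):
    """Run each scenario and record which ones break.

    `scenarios` maps a scenario name to a zero-argument callable that
    exercises one critical path under an injected fault.
    """
    findings = {}
    for name, exercise in scenarios.items():
        try:
            exercise()
            findings[name] = "survived"
        except Exception as exc:  # any unhandled error is a debt finding
            findings[name] = f"broke: {exc}"
    return findings


# Hypothetical critical path: checkout depends on a recommendations service.
def fetch_recommendations(user_id):
    return ["item-1", "item-2"]


def checkout_without_fallback(injector):
    return injector.call(fetch_recommendations, "user-42")


def checkout_with_fallback(injector):
    try:
        return injector.call(fetch_recommendations, "user-42")
    except InjectedFault:
        return []  # degrade gracefully: no recommendations, checkout continues


outage = FaultInjector(failure_rate=1.0)  # the dependency is always "down"
report = run_game_day({
    "checkout, no fallback": lambda: checkout_without_fallback(outage),
    "checkout, with fallback": lambda: checkout_with_fallback(outage),
})
for scenario, outcome in report.items():
    print(f"{scenario}: {outcome}")
```

The scenario that breaks is the debt item. In a real exercise you would inject the fault at the network or infrastructure level, but the output you want is the same: a list of named scenarios and what happened to each.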
Google: Process-Driven Debt Management
Google takes the opposite approach from Netflix. Where Netflix trusts individuals, Google trusts process. Their engineering practices are deeply codified in documents, review systems, and automated enforcement.
What They Get Right
1. Readability Reviews
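At Google, code in a given language generally needs approval from an engineer who holds "readability" certification for that language, and engineers earn that certification by having their own code reviewed against the language's style guide. This isn't about syntax. It's about writing code that follows Google's conventions so deeply that any other Google engineer can understand and maintain it.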
The effect on debt: consistency reduces comprehension debt massively. When every Go file across the entire company follows the same patterns, context-switching between projects is fast and safe. This is a debt prevention mechanism, not a debt cure.
2. Large-Scale Changes (LSCs)
Google has tooling and processes for making automated changes across thousands of projects in its monorepo. Need to update a deprecated API? An LSC can modify 10,000 files across 5,000 projects as one coordinated change, split into patches that are tested and reviewed project by project.
This is how they handle version drift debt at scale. Rather than relying on individual teams to update, a central team can push the update everywhere with automated testing to verify nothing breaks.
Google's LSC Process (simplified):
1. Author proposes a change (e.g., migrate from API v2 to v3)
2. Automated tool generates patches for all affected code
3. Each patch is tested against the affected project's test suite
4. Patches that pass are auto-approved (with team notification)
5. Patches that fail are sent to the owning team for manual review
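The five steps above can be sketched as a small routing pipeline. This is a toy model of the simplified process, not Google's internal tooling: the Patch class, the function names, and the string-replacement "patch" are all assumptions I made for illustration, and the test runner is a stand-in you would replace with each project's real test suite.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Patch:
    project: str
    old_call: str
    new_call: str


def generate_patches(sources: dict[str, str], old_call: str, new_call: str) -> list[Patch]:
    """Step 2: one patch per project whose source mentions the old API."""
    return [Patch(project, old_call, new_call)
            for project, src in sources.items() if old_call in src]


def apply_patch(source: str, patch: Patch) -> str:
    # A real tool would rewrite the syntax tree; plain text replacement
    # keeps the sketch short.
    return source.replace(patch.old_call, patch.new_call)


def route_patches(sources: dict[str, str], patches: list[Patch],
                  run_tests: Callable[[str, str], bool]):
    """Steps 3-5: test each patch, then split into auto-approve and manual review."""
    auto_approved, needs_review = [], []
    for patch in patches:
        patched = apply_patch(sources[patch.project], patch)
        if run_tests(patch.project, patched):
            auto_approved.append(patch.project)   # step 4: notify the team
        else:
            needs_review.append(patch.project)    # step 5: owning team decides
    return auto_approved, needs_review


# Hypothetical projects, keyed by name, with their source as a string.
sources = {
    "billing": "client.fetch_v2(id)",
    "search": "client.fetch_v2(q, legacy=True)",
    "docs": "print('hello')",
}


def run_tests(project: str, patched_source: str) -> bool:
    # Stand-in test suite: pretend v3 rejects the legacy flag.
    return "legacy=True" not in patched_source


patches = generate_patches(sources, "fetch_v2", "fetch_v3")
auto_approved, needs_review = route_patches(sources, patches, run_tests)
print("auto-approved:", auto_approved)
print("needs review:", needs_review)
```

The useful property is that the central team only spends human attention on the patches that fail. Everything else flows through on the strength of the owning project's own tests.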
3. Hyrum's Law Awareness
Hyrum's Law, named after Google engineer Hyrum Wright, states: "With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody." This shapes how they think about debt. Every public interface is a potential debt source because changing it will break someone.
This leads to extremely careful API design upfront, which is a debt prevention strategy. It also means they over-invest in backward compatibility, which itself becomes a form of debt.
What They Get Wrong
Google's process-heavy approach creates process debt. Getting a change approved through the review system can take days. The LSC system is powerful but requires dedicated teams to operate. Smaller teams within Google often struggle with the overhead.
Their monorepo, while great for LSCs, creates coupling debt. A change to a core library triggers thousands of test runs. Build times for some teams are measured in hours. The infrastructure required to manage this is itself a massive investment.
Steal This: The readability review concept. You don't need Google's formal process. But establishing a "conventions owner" for each language in your stack, plus a conventions document that's referenced in every code review, prevents consistency debt from accumulating. Also, if you have more than 5 services, build a simple tool that can push dependency updates across all of them simultaneously.
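A "simple tool" for cross-service dependency updates can be very small. The sketch below assumes a layout I made up for the example: one directory per service under a common root, each with a requirements.txt that pins exact versions, one per line, with no trailing comments. The bump_pin function and the service names are hypothetical.

```python
import re
import tempfile
from pathlib import Path


def bump_pin(root: Path, package: str, new_version: str) -> list[Path]:
    """Rewrite `package==<old>` pins in every requirements.txt under `root`.

    Returns the files that changed so they can be tested and reviewed.
    """
    pin = re.compile(rf"^{re.escape(package)}==\S+$", re.MULTILINE)
    changed = []
    for req in sorted(root.glob("*/requirements.txt")):
        original = req.read_text()
        updated = pin.sub(lambda _match: f"{package}=={new_version}", original)
        if updated != original:
            req.write_text(updated)
            changed.append(req)
    return changed


# Demo against a throwaway directory tree with three hypothetical services.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    services = {
        "billing": "requests==2.28.0\nflask==2.2.0\n",
        "search": "requests==2.25.1\n",
        "docs": "markdown==3.4\n",
    }
    for name, pins in services.items():
        (root / name).mkdir()
        (root / name / "requirements.txt").write_text(pins)

    changed = bump_pin(root, "requests", "2.31.0")
    changed_names = [path.parent.name for path in changed]
    billing_after = (root / "billing" / "requirements.txt").read_text()

print("updated services:", changed_names)
```

In practice you would follow the rewrite with each service's test suite and open one pull request per service, so a failing update stays isolated to the team that owns it.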
Stripe: Quality as a Product Decision
Stripe treats code quality as a product feature, not an engineering preference. Their reasoning: when your product is an API that developers integrate into their payment systems, reliability isn't a nice-to-have. It's the product.
What They Get Right
1. Explicit Quality Budget
Stripe allocates a percentage of engineering capacity to what they call "quality of life" work. This isn't informal. It's planned capacity with explicit goals and metrics, just like feature work.
The number I've heard from multiple sources: approximately 20% of engineering capacity goes to non-feature work, split between infrastructure, debt paydown, and developer tooling. This isn't the "20% rule" that most companies try and fail at. It's a budgeted line item that's defended at the executive level.
2. Ruby Migration as Case Study
Stripe's codebase started as a Ruby monolith. As they grew, they invested heavily in type checking (Sorbet), gradual migration paths, and automated refactoring tools. Instead of a big rewrite, they incrementally improved the existing system while building new services in Go and other languages where appropriate.
Stripe's gradual migration approach:
- Phase 1: Add type annotations to critical paths (Sorbet)
- Phase 2: Build new services in appropriate languages
- Phase 3: Extract well-defined domains from monolith
- Phase 4: Continue running the monolith for stable features
Key: They NEVER planned to fully decompose the monolith.
Some things work fine as a monolith. Leave them there.
This is a masterclass in debt management. They didn't try to eliminate all debt. They identified which debt was actively harmful, fixed that, and left benign debt alone.
3. API Versioning Discipline
Stripe supports multiple API versions simultaneously. This creates deliberate debt (maintaining old versions) but prevents a worse form of debt: forcing customers to migrate and breaking integrations.
Every API version has a documented end-of-life date. The debt is time-bounded by design. When a version reaches EOL, the code is removed. This is deliberate debt with built-in remediation triggers, exactly the pattern I recommend.
What They Get Wrong
Stripe's emphasis on quality and backward compatibility means they move slower on infrastructure changes than they could. Multiple engineers I talked to mentioned that internal tooling upgrades can take quarters because the bar for reliability is so high. There's a point where quality standards become their own form of debt, slowing down the improvements that would reduce future debt.
Steal This: The explicit quality budget with executive backing. Stop pretending that "20% for tech debt" will happen organically. Make it a line item. Put a number on it. Defend it in planning. And adopt Stripe's principle of deliberate debt with built-in expiration dates. Every time you take a shortcut, document when it needs to be fixed and what condition triggers the fix.
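Documenting a shortcut with an expiration date and a trigger condition needs nothing more than a small register that someone checks on a schedule. Here is one possible shape for it. The field names, the example entries, and the dates are all invented for illustration; none of this is Stripe's actual tooling.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DebtItem:
    description: str
    owner: str
    fix_by: date               # expiration date agreed when the shortcut was taken
    trigger: str               # condition that forces the fix earlier
    trigger_met: bool = False  # flipped by whoever observes the condition


def due_items(register, today):
    """Return items whose expiration date has passed or whose trigger has fired."""
    return [item for item in register if item.trigger_met or item.fix_by <= today]


# Hypothetical register entries.
register = [
    DebtItem("Hard-coded retry count in payment client", "payments",
             fix_by=date(2025, 6, 30),
             trigger="second incident caused by retries"),
    DebtItem("Skipped pagination on admin export", "internal-tools",
             fix_by=date(2025, 12, 31),
             trigger="export exceeds 10,000 rows", trigger_met=True),
    DebtItem("Manual certificate rotation", "platform",
             fix_by=date(2026, 3, 31),
             trigger="third service onboarded"),
]

due = due_items(register, today=date(2025, 7, 1))
for item in due:
    print(f"{item.owner}: {item.description} (fix by {item.fix_by}, trigger: {item.trigger})")
```

Run something like this in CI or before each planning cycle, and the due list becomes the agenda for the quality budget.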
The Meta-Pattern
Across all three companies, I see the same underlying principle expressed differently:
NETFLIX: Make debt painful for the team that created it.
(Full ownership + chaos engineering)
GOOGLE: Make debt impossible to create silently.
(Readability reviews + LSCs + monorepo)
STRIPE: Make debt a conscious, budgeted business decision.
(Quality budget + deliberate versioning)
None of them try to eliminate debt. All of them make debt visible and connect it to consequences.
The Framework You Can Steal
Here's the combined framework, pulling the best ideas from all three:
1. Detect (from Netflix)
Run quarterly failure mode exercises. Don't guess where debt is. Prove it by testing what breaks.
2. Prevent (from Google)
Establish conventions and enforce them in code review. Build simple tooling for cross-service updates. Make the right thing the easy thing.
3. Budget (from Stripe)
Allocate explicit capacity for debt work. Make it a planned, measurable investment with executive visibility. Document deliberate debt with expiration dates.
4. Own (from Netflix)
Ensure the team that creates code bears the operational cost of that code's quality. No separate ops team absorbing the consequences of shortcuts.
5. Automate (from Google)
Build tooling that handles repetitive debt prevention: dependency updates, convention enforcement, contract validation. Every manual process is a debt accumulation opportunity.
The Contrarian Take
The biggest lesson from studying these companies isn't a technique or framework. It's this: they don't have less technical debt than you. They have more. Much more. Google has millions of lines of code with known debt. Netflix has hundreds of services with known architectural compromises. Stripe's Ruby monolith is still running.
The difference isn't the amount of debt. It's the relationship with debt. At these companies, debt is visible, measured, discussed at the executive level, and managed as a business variable. At most companies, debt is invisible, unmeasured, discussed only among frustrated engineers, and managed as an afterthought.
You don't need Netflix's budget or Google's tooling or Stripe's talent to manage debt well. You need to make it visible. That's where everything starts.