Technical Debt Intelligence

Technical Debt in Microservices: Why It's Different

Vaibhav Verma
9 min read
technical-debt · microservices · distributed-systems · architecture · platform-engineering

I spent 4 years building and operating microservices at two different companies. The first had 38 services, the second had 127. Both had massive technical debt problems. And in both cases, the teams tried to manage that debt using the same playbook they'd used in monoliths.

It didn't work. Microservices debt is fundamentally different from monolith debt, and treating it the same way is why most distributed systems slowly rot from the inside out.

The Monolith Debt Mental Model Is Wrong for Microservices

In a monolith, technical debt is usually concentrated. You have a few bad modules, some tangled dependencies, maybe a legacy ORM holding you back. The debt lives in code you can see, search, and refactor in a single repository.

Microservices debt is distributed. It lives in the spaces between services: the contracts, the communication patterns, the shared assumptions, the operational complexity that nobody owns. You can have 127 individually clean services and still drown in debt because the system-level architecture is a mess.

Here's the contrarian take most architects won't say out loud: microservices don't reduce technical debt. They redistribute it from code complexity to operational complexity. And operational complexity is harder to see, harder to measure, and harder to fix.

The 5 Categories of Microservices-Specific Debt

1. Contract Debt

Every service-to-service communication relies on a contract, whether that's a REST API schema, a gRPC proto, a message format, or an event payload. Contract debt accumulates when these contracts drift.

```typescript
// Service A sends this (added "priority" field 3 months ago)
interface OrderEvent {
  orderId: string;
  customerId: string;
  amount: number;
  priority: "standard" | "rush" | "overnight";
  createdAt: string;
}

// Service B still expects this (hasn't been updated)
interface OrderEvent {
  orderId: string;
  customerId: string;
  amount: number;
  createdAt: string;
}
// Service B silently ignores "priority" and defaults all orders to standard processing
```

At one company, we had 14 services consuming order events. After an audit, we found that 6 of them were using outdated schemas. Three had bugs caused directly by schema drift. One had been silently dropping a field for 5 months.

How to measure it: Run a quarterly contract audit. For every inter-service communication, verify that producer and consumer schemas match. Track the number of mismatches over time.
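An audit like this doesn't need fancy tooling to start; diffing field lists catches most drift. A minimal sketch (the `diffContract` helper and the field lists are illustrative, not a real tool):

```typescript
// Minimal contract-drift check: compare the fields a producer emits
// against the fields a consumer expects.
function diffContract(producer: string[], consumer: string[]) {
  const produced = new Set(producer);
  const consumed = new Set(consumer);
  return {
    // Fields the producer sends that the consumer silently ignores
    ignoredByConsumer: producer.filter((f) => !consumed.has(f)),
    // Fields the consumer expects that the producer no longer sends
    missingFromProducer: consumer.filter((f) => !produced.has(f)),
  };
}

// The OrderEvent example from above:
const drift = diffContract(
  ["orderId", "customerId", "amount", "priority", "createdAt"],
  ["orderId", "customerId", "amount", "createdAt"]
);
console.log(drift.ignoredByConsumer); // ["priority"]
```

In practice you would generate the field lists from the schemas themselves (JSON Schema, proto descriptors, OpenAPI specs) rather than hand-writing them, but the comparison logic stays this simple.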

2. Dependency Graph Debt

In a monolith, your dependency graph is visible in your import statements. In microservices, it's hidden in network calls, message queues, and shared databases.

I once spent 2 days debugging a latency spike in our checkout service. The root cause? A "minor" change to a recommendation service that was 4 hops away in the call chain. Checkout called pricing, pricing called inventory, inventory called recommendations. Nobody on the checkout team even knew that dependency existed.

The dependency chain nobody documented:

checkout-service
  -> pricing-service
    -> inventory-service
      -> recommendations-service  <-- latency spike here
        -> product-catalog-service
          -> external-supplier-api  <-- actual root cause (timeout)

The Dependency Graph Audit Framework:

  1. Generate a runtime dependency map (not just what's documented, but what actually calls what in production)
  2. Identify chains longer than 3 hops
  3. For each long chain, ask: does the originating team know this chain exists?
  4. Classify each dependency as critical (sync, blocking) or non-critical (async, optional)
  5. Score each service by its "blast radius" (how many other services fail if it fails)

Services with a blast radius above 5 need circuit breakers, fallbacks, or architectural changes. No exceptions.
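Step 5 is just a reverse-graph traversal once you have the runtime call edges. A sketch, using the checkout chain above as a stand-in edge list (a real map would come from tracing or mesh telemetry):

```typescript
// Blast radius: how many services transitively depend on a given service.
const calls: [string, string][] = [
  ["checkout-service", "pricing-service"],
  ["pricing-service", "inventory-service"],
  ["inventory-service", "recommendations-service"],
  ["recommendations-service", "product-catalog-service"],
];

// Build reverse adjacency: callee -> direct callers
const callers = new Map<string, string[]>();
for (const [from, to] of calls) {
  if (!callers.has(to)) callers.set(to, []);
  callers.get(to)!.push(from);
}

function blastRadius(service: string): number {
  const affected = new Set<string>();
  const stack = [service];
  while (stack.length > 0) {
    const current = stack.pop()!;
    for (const caller of callers.get(current) ?? []) {
      if (!affected.has(caller)) {
        affected.add(caller);
        stack.push(caller);
      }
    }
  }
  return affected.size; // services that fail (or degrade) if `service` fails
}

console.log(blastRadius("recommendations-service")); // 3
```

Note this counts sync, blocking dependencies at their worst case; an async consumer would degrade rather than fail, which is exactly why step 4's classification matters before you act on the score.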

3. Data Ownership Debt

This is the debt that kills microservices architectures. It happens when the theoretical "each service owns its data" principle meets reality.

The theory:
  [Order Service] -> [Order DB]
  [Customer Service] -> [Customer DB]
  [Payment Service] -> [Payment DB]

The reality at most companies:
  [Order Service] -> [Order DB] <-- also read by Payment Service directly
  [Customer Service] -> [Customer DB] <-- also read by Order Service, Shipping Service
  [Payment Service] -> [Payment DB] <-- also read by Reporting Service
  [Shared "Reference" DB] <-- read by everyone, owned by nobody

The shared reference database is the microservices equivalent of a god object. I've seen it at every company that's been running microservices for more than 2 years. It starts as a "temporary" solution for sharing lookup data and becomes a coupling point that prevents any service from evolving independently.

How to measure it: Count the number of cross-service database reads. If any service reads directly from another service's database, that's data ownership debt. Track the count quarterly.
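If you can extract per-service database access (from connection audit logs or credentials inventory), the count is a one-line filter against an ownership map. A sketch with illustrative service and database names:

```typescript
// Data ownership audit: flag reads that cross service boundaries.
interface DbAccess {
  service: string;
  database: string;
}

// Which service owns which database (illustrative)
const ownership: Record<string, string> = {
  "order-db": "order-service",
  "customer-db": "customer-service",
  "payment-db": "payment-service",
};

function ownershipViolations(accesses: DbAccess[]): DbAccess[] {
  return accesses.filter((a) => ownership[a.database] !== a.service);
}

const violations = ownershipViolations([
  { service: "order-service", database: "order-db" }, // fine: owner
  { service: "payment-service", database: "order-db" }, // violation
  { service: "shipping-service", database: "customer-db" }, // violation
]);
console.log(violations.length); // 2
```

The hard part is not the filter; it's building an honest ownership map, because the shared reference database by definition has no entry in it.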

4. Operational Toil Debt

Monoliths have one deploy pipeline, one logging system, one monitoring configuration. Microservices multiply all of that. Operational toil debt is the accumulated cost of running infrastructure that should be automated but isn't.

At the 127-service company, we had:

  • 41 different logging formats across services
  • 23 services with no health checks
  • 67 services with custom (non-standard) deployment scripts
  • 0 services with end-to-end distributed tracing when I started

Each of these individually was "not that bad." Together, they meant our on-call engineers spent 60% of incident response time just figuring out what was happening before they could start fixing anything.

How to measure it: Track Mean Time to Detect and Mean Time to Diagnose as separate metrics from Mean Time to Resolve (MTTR). If diagnosis takes longer than resolution, you have operational toil debt.
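Splitting the phases is straightforward if your incident records carry all four timestamps. A sketch with illustrative timestamps in minutes:

```typescript
// Split incident time into detect / diagnose / resolve phases
// and compare the averages.
interface Incident {
  startedAt: number; // when the fault began
  detectedAt: number; // when an alert fired
  diagnosedAt: number; // when the root cause was identified
  resolvedAt: number; // when service was restored
}

const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;

function phaseMeans(incidents: Incident[]) {
  return {
    detect: mean(incidents.map((i) => i.detectedAt - i.startedAt)),
    diagnose: mean(incidents.map((i) => i.diagnosedAt - i.detectedAt)),
    resolve: mean(incidents.map((i) => i.resolvedAt - i.diagnosedAt)),
  };
}

const { diagnose, resolve } = phaseMeans([
  { startedAt: 0, detectedAt: 5, diagnosedAt: 65, resolvedAt: 80 },
  { startedAt: 0, detectedAt: 10, diagnosedAt: 40, resolvedAt: 55 },
]);

// Diagnosis dominating resolution is the operational-toil signal:
console.log(diagnose > resolve); // true
```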

5. Version Drift Debt

In a monolith, you upgrade a dependency once. In microservices, you upgrade it in every service. Or more commonly, you upgrade it in 3 services and forget about the other 35.

```bash
# Audit: which version of Express is each service running?
# Real output from a company I consulted for:
order-service:     express@4.18.2
payment-service:   express@4.17.1
user-service:      express@4.18.2
shipping-service:  express@4.16.4  # 2 minor versions behind
notification-svc:  express@4.17.3
analytics-service: express@4.15.5  # security vulnerability!
```

At scale, version drift creates security vulnerabilities, inconsistent behavior, and makes it impossible to share code or libraries between services.

How to measure it: Build a dependency matrix. For each shared dependency, list the version in every service. Flag anything more than 1 major version behind. Score by: (number of outdated instances) x (severity of the gap).
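The score is simple arithmetic once the matrix exists. A sketch for one shared dependency, using the Express audit above; the minor-version gap stands in for "severity of the gap" (a real tool would weight major versions and known CVEs much more heavily):

```typescript
// Version-drift score for one shared dependency:
// (number of outdated instances) x (severity of the gap).
const latest = "4.18.2";
const versions: Record<string, string> = {
  "order-service": "4.18.2",
  "payment-service": "4.17.1",
  "user-service": "4.18.2",
  "shipping-service": "4.16.4",
  "notification-svc": "4.17.3",
  "analytics-service": "4.15.5",
};

// Naive semver parsing on purpose; these are all 4.x versions
const minor = (v: string) => parseInt(v.split(".")[1], 10);

const gaps = Object.values(versions)
  .map((v) => minor(latest) - minor(v))
  .filter((gap) => gap > 0);

const score = gaps.length * Math.max(...gaps);
console.log(gaps.length, score); // 4 outdated instances, score 4 * 3 = 12
```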

The Microservices Debt Scoring Framework

Here's the framework I use to assess microservices debt. Score each category 1-5 quarterly.

MICROSERVICES DEBT SCORECARD
=============================

Category                  | Score (1-5) | Trend    | Owner
--------------------------|-------------|----------|--------
Contract Debt             |     ___     |  up/down | ______
Dependency Graph Debt     |     ___     |  up/down | ______
Data Ownership Debt       |     ___     |  up/down | ______
Operational Toil Debt     |     ___     |  up/down | ______
Version Drift Debt        |     ___     |  up/down | ______
                          |             |          |
TOTAL                     |    ___/25   |          |

Scoring guide:
  1 = Minimal debt, well-managed
  2 = Some debt, trending stable
  3 = Moderate debt, needs attention this quarter
  4 = Significant debt, actively causing problems
  5 = Critical debt, blocking team productivity

Action thresholds:
  Total 5-10:  Healthy. Continue monitoring.
  Total 11-15: Caution. Allocate 15% capacity to remediation.
  Total 16-20: Warning. Dedicated remediation sprint needed.
  Total 21-25: Critical. Stop feature work until stabilized.
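If you track the scorecard in a dashboard rather than a spreadsheet, the thresholds above reduce to a trivial mapping; the value is in making the action unambiguous rather than negotiable:

```typescript
// Map a total scorecard value (5-25) to the action thresholds above.
function actionFor(total: number): string {
  if (total <= 10) return "Healthy. Continue monitoring.";
  if (total <= 15) return "Caution. Allocate 15% capacity to remediation.";
  if (total <= 20) return "Warning. Dedicated remediation sprint needed.";
  return "Critical. Stop feature work until stabilized.";
}

console.log(actionFor(17)); // "Warning. Dedicated remediation sprint needed."
```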

The Fix: Platform Thinking

The single biggest lever for microservices debt is platform investment. Not a "platform team" that becomes a bottleneck, but shared tooling that makes the right thing easy.

Standardized service template: Every new service starts from a template with logging, health checks, tracing, deployment config, and contract validation built in. This prevents 80% of operational toil debt.

Contract registry: A central schema registry (like Confluent Schema Registry for events or a shared OpenAPI spec repo) that validates contracts at build time. Breaking changes fail CI. This prevents contract debt.

Dependency dashboard: Automated scanning that flags version drift weekly. Make updating easy by automating the PR creation. This prevents version drift debt.

Service mesh or API gateway: Centralized traffic management that gives you visibility into the actual dependency graph. This makes dependency graph debt visible.

What I Got Wrong

For years, I thought the solution to microservices debt was better engineering discipline. "If teams just followed the standards, we wouldn't have these problems." I was wrong.

Discipline doesn't scale. If doing the right thing requires extra effort, some percentage of teams will skip it some percentage of the time. That's not a character flaw. It's probability.

The real solution is making the right thing the default. When the service template includes tracing, teams don't skip tracing. When the CI pipeline validates contracts, teams don't break contracts. When the dependency scanner auto-creates PRs, teams update dependencies.

Design your platform so that accumulating debt requires more effort than avoiding it. That's the only approach I've seen work at scale.
