Toil in Software Engineering: Finding and Eliminating It

Vaibhav Verma

Google's SRE team popularized the concept of toil: work that is manual, repetitive, and automatable, and that scales linearly with system size. They set a target of keeping toil below 50% of an SRE's time. Most software engineering teams I've worked with don't even measure their toil, and when they do, the number is usually horrifying.

I ran a toil audit on my team last quarter. The result: engineers spent 38% of their time on work that was repetitive, manual, and automatable. That's 38% of salaries, 38% of energy, 38% of the team's finite capacity spent on tasks that a script could do. And where Google's definition focuses on SRE and operations work, engineering toil extends far beyond operations into the daily development workflow.

The Contrarian Take: Most Engineering "Best Practices" Create Toil

Here's what nobody talks about: many of the processes teams adopt in the name of quality and rigor are actually toil generators. Manual QA checklists. Required approval from 3 reviewers. Hand-written changelog entries. Manually updated dependency graphs. Each one sounds responsible. Together, they create a death spiral where the process of building software takes longer than actually building it.

The question isn't "is this process valuable?" The question is "is this process valuable enough to justify its cost in human time, given that it could be automated?"

Identifying Toil: The Four Properties

Not all manual work is toil. Designing a system architecture is manual but not toil. Writing a one-off migration script is manual but not toil. True toil has four properties:

  1. Repetitive: You've done this task before, in essentially the same way
  2. Manual: A human has to execute the steps
  3. Automatable: A computer could do this with the right tooling
  4. Scaling: The work grows as your system or team grows
typescript
// A toil classifier
interface EngineeringTask {
  name: string;
  isRepetitive: boolean;    // Done more than once per month
  isManual: boolean;        // Requires human execution
  isAutomatable: boolean;   // Could be scripted/automated
  scalesWithGrowth: boolean; // More instances as system grows
}

function isToil(task: EngineeringTask): boolean {
  return task.isRepetitive && task.isManual && task.isAutomatable && task.scalesWithGrowth;
}

// Examples:
const tasks: EngineeringTask[] = [
  {
    name: "Update version numbers before release",
    isRepetitive: true,    // every release
    isManual: true,         // engineer edits files
    isAutomatable: true,    // semantic-release does this
    scalesWithGrowth: true, // more packages = more version updates
  }, // TOIL

  {
    name: "Design API for new feature",
    isRepetitive: false,   // each API is unique
    isManual: true,
    isAutomatable: false,   // requires human judgment
    scalesWithGrowth: false,
  }, // NOT TOIL
];

The Toil Catalog: Where Engineering Time Disappears

After auditing four teams, I've cataloged the most common sources of engineering toil. Each percentage is that category's average share of total toil across those teams.

Category 1: Release Toil (28% of total toil)

  • Manually updating version numbers
  • Writing changelog entries by reading git log
  • Running manual smoke tests before deploy
  • Manually tagging releases and creating GitHub releases
  • Coordinating release timing across teams
bash
# Toil example: manually creating a changelog
# Engineer reads through all PRs since last release,
# writes summaries, categorizes changes, formats output
# Time: 30-90 minutes per release

# Automated alternative: conventional commits + auto-changelog
# npx conventional-changelog -p angular -i CHANGELOG.md -s
# Time: 0 minutes (runs in CI)
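
To make the automated alternative concrete, here is a minimal sketch of the idea behind it: if commit subjects follow the Conventional Commits format, a changelog can be assembled mechanically. This is a simplified illustration, not how conventional-changelog is implemented; the function names, the section titles, and the choice to include only `feat`, `fix`, and `perf` are assumptions made for the example.

```typescript
// Simplified sketch: group Conventional Commit subjects into changelog sections.
// Hypothetical helper names; not the conventional-changelog implementation.

// Matches "type(scope)!: description"; the scope and "!" are optional.
const COMMIT_PATTERN = /^(\w+)(?:\(([^)]*)\))?(!)?:\s+(.+)$/;

// Assumed mapping from commit type to section title; other types are omitted.
function sectionTitle(type: string): string | undefined {
  switch (type) {
    case "feat":
      return "Features";
    case "fix":
      return "Bug Fixes";
    case "perf":
      return "Performance Improvements";
    default:
      return undefined;
  }
}

function buildChangelog(version: string, subjects: string[]): string {
  const sections: Record<string, string[]> = {};
  for (const subject of subjects) {
    const match = COMMIT_PATTERN.exec(subject);
    if (!match) continue; // skip commits that do not follow the convention
    const type = match[1] ?? "";
    const scope = match[2];
    const description = match[4] ?? "";
    const title = sectionTitle(type);
    if (!title) continue; // chore, docs, and similar types are left out
    const line = scope ? `- ${scope}: ${description}` : `- ${description}`;
    const existing = sections[title] ?? [];
    existing.push(line);
    sections[title] = existing;
  }
  // Sections appear in the order their first commit was seen.
  const body = Object.keys(sections)
    .map((title) => `${title}\n${(sections[title] ?? []).join("\n")}`)
    .join("\n\n");
  return `${version}\n\n${body}`;
}
```

Calling `buildChangelog("1.2.0", subjects)` with the commit subjects since the last tag returns the text for a changelog entry. The point is that the 30-90 minutes of reading and categorizing become a side effect of writing structured commit messages.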

Category 2: Environment Toil (22% of total toil)

  • Setting up local development environments
  • Debugging "works on my machine" issues
  • Manually seeding test databases
  • Resetting environments after failed tests
  • Managing local service dependencies

Category 3: Code Review Toil (19% of total toil)

  • Manually checking for style violations (should be linters)
  • Verifying test coverage meets thresholds (should be CI)
  • Checking for missing documentation (should be CI)
  • Reviewing auto-generated code (migration files, schemas)

Category 4: Testing Toil (17% of total toil)

  • Manually running integration tests locally
  • Updating test fixtures after data model changes
  • Rerunning tests due to flaky failures
  • Manually testing UI flows that could be automated

Category 5: Communication Toil (14% of total toil)

  • Status update meetings that could be async
  • Writing deployment notifications manually
  • Repeating the same onboarding explanations
  • Answering the same "how do I...?" questions

The Toil Elimination Process

Step 1: Toil Tracking (1 week)

Have each engineer log toil for one week using a simple format:

typescript
interface ToilEntry {
  task: string;
  category: 'release' | 'environment' | 'review' | 'testing' | 'communication';
  timeMinutes: number;
  frequency: 'daily' | 'weekly' | 'per-release' | 'per-feature';
  automationDifficulty: 'easy' | 'medium' | 'hard';
}

Don't overthink this. A shared spreadsheet works fine. The goal is to make invisible work visible.
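
If the log ends up in code rather than a spreadsheet, totalling it per category is a few lines. The sketch below repeats the `ToilEntry` shape so it stands alone; `minutesByCategory` is a hypothetical helper, and the sample entries are invented for illustration, not data from my audit.

```typescript
type ToilCategory = 'release' | 'environment' | 'review' | 'testing' | 'communication';

interface ToilEntry {
  task: string;
  category: ToilCategory;
  timeMinutes: number;
  frequency: 'daily' | 'weekly' | 'per-release' | 'per-feature';
  automationDifficulty: 'easy' | 'medium' | 'hard';
}

// Sum logged minutes per category so the biggest sources of toil stand out.
function minutesByCategory(entries: ToilEntry[]): Record<ToilCategory, number> {
  const totals: Record<ToilCategory, number> = {
    release: 0,
    environment: 0,
    review: 0,
    testing: 0,
    communication: 0,
  };
  for (const entry of entries) {
    totals[entry.category] += entry.timeMinutes;
  }
  return totals;
}

// Hypothetical one-week log, for illustration only.
const sampleLog: ToilEntry[] = [
  { task: 'Write changelog', category: 'release', timeMinutes: 45, frequency: 'per-release', automationDifficulty: 'easy' },
  { task: 'Comment on style nits', category: 'review', timeMinutes: 20, frequency: 'daily', automationDifficulty: 'easy' },
  { task: 'Rerun flaky suite', category: 'testing', timeMinutes: 30, frequency: 'daily', automationDifficulty: 'medium' },
  { task: 'Rerun flaky suite', category: 'testing', timeMinutes: 15, frequency: 'daily', automationDifficulty: 'medium' },
];

const weeklyTotals = minutesByCategory(sampleLog);
```

For this sample, `weeklyTotals.testing` is 45 minutes and `weeklyTotals.environment` is 0, which is the kind of at-a-glance view the tracking week is meant to produce.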

Step 2: Calculate the Toil Budget (1 day)

Aggregate the data and compute your team's toil percentage:

Total Toil Hours/Week = sum of all toil entries
Total Available Hours/Week = team size x 40
Toil Percentage = (Total Toil Hours/Week / Total Available Hours/Week) x 100

Set a target. Google targets below 50% for SRE. For product engineering teams, I target below 20%. If you're above 30%, toil elimination should be your top priority, because toil is consuming nearly a third or more of your engineering investment.
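
The budget calculation and the thresholds above can be sketched in a few lines. The function and type names here are hypothetical, the 40-hour week comes from the formula, and the status labels are my own shorthand for the 20% and 30% lines.

```typescript
const HOURS_PER_ENGINEER_WEEK = 40; // from the Total Available Hours formula

// Toil as a percentage of the team's weekly capacity.
function toilPercentage(toilHoursPerWeek: number, teamSize: number): number {
  if (teamSize <= 0) {
    throw new RangeError("teamSize must be positive");
  }
  return (toilHoursPerWeek / (teamSize * HOURS_PER_ENGINEER_WEEK)) * 100;
}

type ToilStatus = "on-target" | "above-target" | "top-priority";

// Thresholds follow the text: target below 20%, top priority above 30%.
function toilStatus(percentage: number): ToilStatus {
  if (percentage > 30) return "top-priority";
  if (percentage >= 20) return "above-target";
  return "on-target";
}

// Hypothetical team: 6 engineers logging 60 toil hours per week in total.
const examplePercentage = toilPercentage(60, 6); // 60 / 240 = 25%
const exampleStatus = toilStatus(examplePercentage); // "above-target"
```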

Step 3: Rank by ROI (1 day)

For each toil item, calculate:

Annual Time Cost (hours) = frequency_per_year x time_per_occurrence x engineers_affected
Automation Cost (days) = estimated engineering days to build automation
ROI = Annual Time Cost / Automation Cost (annual hours saved per day of automation work)

| Toil Item | Annual Hours | Automation Days | ROI |
| --- | --- | --- | --- |
| Manual changelog | 78 | 2 | 39x |
| Dev environment setup | 160 | 10 | 16x |
| Rerunning flaky tests | 312 | 5 | 62x |
| Manual style review | 208 | 3 | 69x |
| Status update meetings | 520 | 1 (async tool) | 520x |
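
The ranking itself is easy to script once the estimates exist. This is a minimal sketch using the figures from the table; `ToilItem`, `rankByRoi`, and the variable names are hypothetical, and ROI here is annual hours saved per day of automation work, as in the formula.

```typescript
interface ToilItem {
  name: string;
  annualHours: number;
  automationDays: number;
}

interface RankedToilItem extends ToilItem {
  roi: number; // annual hours saved per day of automation work
}

// Compute ROI for each item and sort from highest to lowest.
function rankByRoi(items: ToilItem[]): RankedToilItem[] {
  return items
    .map((item) => ({ ...item, roi: item.annualHours / item.automationDays }))
    .sort((a, b) => b.roi - a.roi);
}

// Figures from the ROI table.
const auditItems: ToilItem[] = [
  { name: "Manual changelog", annualHours: 78, automationDays: 2 },
  { name: "Dev environment setup", annualHours: 160, automationDays: 10 },
  { name: "Rerunning flaky tests", annualHours: 312, automationDays: 5 },
  { name: "Manual style review", annualHours: 208, automationDays: 3 },
  { name: "Status update meetings", annualHours: 520, automationDays: 1 },
];

const ranked = rankByRoi(auditItems);
const topThree = ranked.slice(0, 3).map((item) => item.name);
```

With these numbers, `topThree` is status update meetings, manual style review, and rerunning flaky tests, in that order. The estimates are rough, so treat close rankings as ties rather than as a precise ordering.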

Step 4: Automate Top 3 per Quarter

Take the top 3 items by ROI. Assign them to engineers as first-class project work, not side projects. Track completion and impact.

The most important cultural shift: toil elimination is not optional "when you have time" work. It's engineering work with measurable ROI that deserves sprint allocation.

The Stealable Framework: The TOIL Dashboard

Build a simple dashboard that tracks toil over time:

typescript
interface ToilDashboard {
  currentToilPercentage: number;    // Target: below 20%
  toilTrend: 'increasing' | 'decreasing' | 'stable';
  topToilItems: Array<{
    name: string;
    hoursPerWeek: number;
    automationStatus: 'identified' | 'in-progress' | 'automated';
  }>;
  toilEliminatedThisQuarter: number;  // Hours saved per week
  cumulativeSavings: number;          // Total hours saved since tracking began
}

Review the dashboard monthly. Celebrate toil elimination the same way you celebrate feature launches. An automation that saves 5 hours per week is equivalent to hiring 12.5% of an engineer. That's worth celebrating.
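
Two of the dashboard's numbers can be derived rather than entered by hand. The sketch below shows one way to do it; the helper names are hypothetical, the 40-hour week matches the toil budget formula, and the 1-point tolerance for calling a trend "stable" is an assumption you would tune.

```typescript
const HOURS_PER_ENGINEER_WEEK = 40; // same 40-hour week as the toil budget

// Fraction of one engineer's week freed by an automation.
function engineerEquivalent(hoursSavedPerWeek: number): number {
  return hoursSavedPerWeek / HOURS_PER_ENGINEER_WEEK;
}

type ToilTrend = "increasing" | "decreasing" | "stable";

// Compare the two most recent monthly toil percentages.
// Changes within the tolerance (in percentage points) count as stable.
function toilTrend(monthlyPercentages: number[], tolerance: number = 1): ToilTrend {
  if (monthlyPercentages.length < 2) return "stable";
  const latest = monthlyPercentages[monthlyPercentages.length - 1] ?? 0;
  const previous = monthlyPercentages[monthlyPercentages.length - 2] ?? 0;
  const delta = latest - previous;
  if (delta > tolerance) return "increasing";
  if (delta < -tolerance) return "decreasing";
  return "stable";
}

const fiveHourSaving = engineerEquivalent(5); // 0.125, i.e. 12.5% of an engineer
```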

The Compound Effect

Here's what makes toil elimination so powerful: the savings compound. When you automate a task that took 5 hours per week, you don't just save 5 hours this week. You save 5 hours every week, forever. Over a year, that's 260 hours. Over three years, 780 hours. And the engineer-hours freed up can be spent on more toil elimination, creating a virtuous cycle.
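
The arithmetic behind those figures, as a small sketch. The function name is hypothetical, and it assumes the task would otherwise have been done all 52 weeks of the year.

```typescript
const WEEKS_PER_YEAR = 52;

// Total hours saved by an automation after a given number of years.
function cumulativeHoursSaved(hoursPerWeek: number, years: number): number {
  return hoursPerWeek * WEEKS_PER_YEAR * years;
}

// A 5-hours-per-week automation over one, two, and three years.
const savingsByYear = [1, 2, 3].map((years) => cumulativeHoursSaved(5, years));
// savingsByYear is [260, 520, 780]
```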

I tracked this on my team. In Q1, we automated 8 hours per week of toil. In Q2, using some of those freed hours, we automated another 12 hours per week. By Q4, we'd reduced total toil from 38% to 14%. The team shipped 40% more features that year with the same headcount.

That's not magic. It's arithmetic. But you have to actually do the work of identifying, measuring, and eliminating toil. Most teams never start because toil feels like "just part of the job." It isn't. It's waste, and your team deserves better.
