Toil in Software Engineering: Finding and Eliminating It
Google's SRE team popularized the concept of toil: work that's manual, repetitive, automatable, and scales linearly with system size. They set a target of keeping toil below 50% of an SRE's time. Most software engineering teams I've worked with don't even measure their toil, and when they do, the number is usually horrifying.
I ran a toil audit on my team last quarter. The result: engineers spent 38% of their time on work that was repetitive, manual, and could be automated. That's 38% of salaries, 38% of energy, 38% of your team's finite capacity spent on tasks that a script could do. And unlike Google's SRE-focused definition, engineering toil extends far beyond operations into the daily development workflow.
The Contrarian Take: Most Engineering "Best Practices" Create Toil
Here's what nobody talks about: many of the processes teams adopt in the name of quality and rigor are actually toil generators. Manual QA checklists. Required approval from 3 reviewers. Hand-written changelog entries. Manually updated dependency graphs. Each one sounds responsible. Together, they create a death spiral where the process of building software takes longer than actually building it.
The question isn't "is this process valuable?" The question is "is this process valuable enough to justify its cost in human time, given that it could be automated?"
Identifying Toil: The Four Properties
Not all manual work is toil. Designing a system architecture is manual but not toil. Writing a one-off migration script is manual but not toil. True toil has four properties:
- Repetitive: You've done this task before, in essentially the same way
- Manual: A human has to execute the steps
- Automatable: A computer could do this with the right tooling
- Scaling: The work grows as your system or team grows
// A toil classifier
interface EngineeringTask {
  name: string;
  isRepetitive: boolean;     // Done more than once per month
  isManual: boolean;         // Requires human execution
  isAutomatable: boolean;    // Could be scripted/automated
  scalesWithGrowth: boolean; // More instances as system grows
}

function isToil(task: EngineeringTask): boolean {
  return (
    task.isRepetitive &&
    task.isManual &&
    task.isAutomatable &&
    task.scalesWithGrowth
  );
}
// Examples:
// Examples:
const tasks: EngineeringTask[] = [
  {
    name: "Update version numbers before release",
    isRepetitive: true,     // every release
    isManual: true,         // engineer edits files
    isAutomatable: true,    // semantic-release does this
    scalesWithGrowth: true, // more packages = more version updates
  }, // TOIL
  {
    name: "Design API for new feature",
    isRepetitive: false,    // each API is unique
    isManual: true,
    isAutomatable: false,   // requires human judgment
    scalesWithGrowth: false,
  }, // NOT TOIL
];

The Toil Catalog: Where Engineering Time Disappears
After auditing four teams, I've cataloged the most common sources of engineering toil. The percentages are each category's average share of total toil across those teams.
Category 1: Release Toil (28% of total toil)
- Manually updating version numbers
- Writing changelog entries by reading git log
- Running manual smoke tests before deploy
- Manually tagging releases and creating GitHub releases
- Coordinating release timing across teams
# Toil example: manually creating a changelog
# Engineer reads through all PRs since last release,
# writes summaries, categorizes changes, formats output
# Time: 30-90 minutes per release
# Automated alternative: conventional commits + auto-changelog
# npx conventional-changelog -p angular -i CHANGELOG.md -s
# Time: 0 minutes (runs in CI)

Category 2: Environment Toil (22% of total toil)
- Setting up local development environments
- Debugging "works on my machine" issues
- Manually seeding test databases
- Resetting environments after failed tests
- Managing local service dependencies
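Much of this toil comes down to state that was set up by hand and can't be reproduced. As a minimal sketch (the `SeedSpec` shape and `generateSeedRows` helper are hypothetical names, not a real library API), you can make test-database seeding deterministic by generating fixtures from a small spec, so every machine gets identical data:

```typescript
// Hypothetical sketch: deterministic seed data so "seed the test DB"
// becomes one command instead of a manual checklist.
interface SeedSpec {
  table: string;
  rows: number;
  columns: string[];
}

function generateSeedRows(spec: SeedSpec): Record<string, string>[] {
  // Values derive only from the spec, so every run on every machine
  // produces identical data -- no more hand-curated fixtures.
  return Array.from({ length: spec.rows }, (_, i) =>
    Object.fromEntries(spec.columns.map((col) => [col, `${col}_${i}`]))
  );
}

const rows = generateSeedRows({ table: "users", rows: 3, columns: ["id", "email"] });
// rows[0] is { id: "id_0", email: "email_0" }
```

In a real setup the generated rows would be inserted by a migration or seed script that runs in CI and on `npm run setup`, not pasted in by hand.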
Category 3: Code Review Toil (19% of total toil)
- Manually checking for style violations (should be linters)
- Verifying test coverage meets thresholds (should be CI)
- Checking for missing documentation (should be CI)
- Reviewing auto-generated code (migration files, schemas)
Category 4: Testing Toil (17% of total toil)
- Manually running integration tests locally
- Updating test fixtures after data model changes
- Rerunning tests due to flaky failures
- Manually testing UI flows that could be automated
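Rerunning flaky tests by hand is the clearest case: a retry wrapper does the rerun for you. This is a simplified synchronous sketch (a real version would wrap async test runs, and `retryFlaky` is an illustrative helper, not part of any test framework):

```typescript
// Hypothetical sketch: auto-retry flaky operations instead of a human
// clicking "re-run" in CI.
function retryFlaky<T>(fn: () => T, attempts: number = 3): T {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return fn(); // Success on any attempt ends the loop.
    } catch (err) {
      lastError = err; // Record the failure and try again.
    }
  }
  throw lastError; // Surface the error only after every attempt fails.
}

// Simulate a test that fails twice, then passes.
let calls = 0;
const result = retryFlaky(() => {
  calls++;
  if (calls < 3) throw new Error("flaky failure");
  return "pass";
});
```

Retries hide flakiness rather than fix it, so pair this with tracking: log every retried test so the flakiest ones get fixed, not just tolerated.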
Category 5: Communication Toil (14% of total toil)
- Status update meetings that could be async
- Writing deployment notifications manually
- Repeating the same onboarding explanations
- Answering the same "how do I...?" questions
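Deployment notifications are a good first target because the message is pure data you already have. A minimal sketch (the `Deploy` shape and `formatDeployNotification` helper are hypothetical, not a real integration):

```typescript
// Hypothetical sketch: build the deployment announcement from data the
// pipeline already has, instead of an engineer writing it in chat.
interface Deploy {
  service: string;
  version: string;
  commits: string[]; // commit subject lines since the last deploy
}

function formatDeployNotification(d: Deploy): string {
  const changes = d.commits.map((c) => `- ${c}`).join("\n");
  return `Deployed ${d.service} ${d.version}\nChanges:\n${changes}`;
}

const msg = formatDeployNotification({
  service: "billing-api",
  version: "v2.4.1",
  commits: ["fix: retry failed webhooks", "feat: add invoice export"],
});
// msg begins "Deployed billing-api v2.4.1"
```

Hook a function like this into the deploy pipeline and post its output to your chat tool's webhook, and one recurring manual message disappears entirely.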
The Toil Elimination Process
Step 1: Toil Tracking (1 week)
Have each engineer log toil for one week using a simple format:
interface ToilEntry {
  task: string;
  category: 'release' | 'environment' | 'review' | 'testing' | 'communication';
  timeMinutes: number;
  frequency: 'daily' | 'weekly' | 'per-release' | 'per-feature';
  automationDifficulty: 'easy' | 'medium' | 'hard';
}

Don't overthink this. A shared spreadsheet works fine. The goal is to make invisible work visible.
Step 2: Calculate the Toil Budget (1 day)
Aggregate the data and compute your team's toil percentage:
Total Toil Hours/Week = sum of all toil entries
Total Available Hours/Week = team size x 40
Toil Percentage = Total Toil Hours / Total Available Hours
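The formulas above can be computed straight from the Step 1 log. In this sketch the frequency-to-weekly multipliers are assumptions (one release and two features per week); substitute your team's actual cadence. The interface is a trimmed re-declaration of Step 1's ToilEntry:

```typescript
// Sketch of Step 2: aggregate logged toil into a team toil percentage.
interface ToilEntry {
  task: string;
  timeMinutes: number;
  frequency: 'daily' | 'weekly' | 'per-release' | 'per-feature';
}

// Assumed cadence: adjust these multipliers to your team's reality.
const WEEKLY_MULTIPLIER: Record<ToilEntry['frequency'], number> = {
  daily: 5,         // workdays per week
  weekly: 1,
  'per-release': 1, // assumption: one release per week
  'per-feature': 2, // assumption: two features shipped per week
};

function toilPercentage(entries: ToilEntry[], teamSize: number): number {
  const toilHoursPerWeek = entries.reduce(
    (sum, e) => sum + (e.timeMinutes / 60) * WEEKLY_MULTIPLIER[e.frequency],
    0
  );
  const availableHoursPerWeek = teamSize * 40;
  return toilHoursPerWeek / availableHoursPerWeek;
}

// Example: 5 engineers, two recurring toil tasks.
const pct = toilPercentage(
  [
    { task: 'reset staging env', timeMinutes: 30, frequency: 'daily' },
    { task: 'write changelog', timeMinutes: 60, frequency: 'per-release' },
  ],
  5
);
// (0.5h * 5 + 1h * 1) / 200h = 3.5 / 200 = 0.0175
```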
Set a target. Google targets below 50% for SRE. For product engineering teams, I target below 20%. If you're above 30%, toil elimination should be your top priority because it's consuming almost a third of your engineering investment.
Step 3: Rank by ROI (1 day)
For each toil item, calculate:
Annual Time Cost (hours) = frequency_per_year x time_per_occurrence x engineers_affected
Automation Cost (days) = estimated engineering days to build the automation
ROI = Annual Time Cost / Automation Cost (i.e., hours saved per engineering day invested)
| Toil Item | Annual Hours | Automation Days | ROI |
|---|---|---|---|
| Manual changelog | 78 | 2 | 39x |
| Dev environment setup | 160 | 10 | 16x |
| Rerunning flaky tests | 312 | 5 | 62x |
| Manual style review | 208 | 3 | 69x |
| Status update meetings | 520 | 1 (async tool) | 520x |
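The ranking step is a one-liner once the data is structured. This sketch reproduces the table above (the `ToilItem` shape is illustrative):

```typescript
// Sketch of Step 3: rank toil items by hours saved per engineering day invested.
interface ToilItem {
  name: string;
  annualHours: number;    // Annual Time Cost, in hours
  automationDays: number; // estimated days to build the automation
}

function rankByRoi(items: ToilItem[]): Array<ToilItem & { roi: number }> {
  return items
    .map((item) => ({ ...item, roi: item.annualHours / item.automationDays }))
    .sort((a, b) => b.roi - a.roi); // highest ROI first
}

// The rows from the table above:
const ranked = rankByRoi([
  { name: 'Manual changelog', annualHours: 78, automationDays: 2 },
  { name: 'Dev environment setup', annualHours: 160, automationDays: 10 },
  { name: 'Rerunning flaky tests', annualHours: 312, automationDays: 5 },
  { name: 'Manual style review', annualHours: 208, automationDays: 3 },
  { name: 'Status update meetings', annualHours: 520, automationDays: 1 },
]);
// ranked[0] is the async-status-tool item, at ROI 520
```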
Step 4: Automate Top 3 per Quarter
Take the top 3 items by ROI. Assign them to engineers as first-class project work, not side projects. Track completion and impact.
The most important cultural shift: toil elimination is not optional "when you have time" work. It's engineering work with measurable ROI that deserves sprint allocation.
The Stealable Framework: The TOIL Dashboard
Build a simple dashboard that tracks toil over time:
interface ToilDashboard {
  currentToilPercentage: number; // Target: below 20%
  toilTrend: 'increasing' | 'decreasing' | 'stable';
  topToilItems: Array<{
    name: string;
    hoursPerWeek: number;
    automationStatus: 'identified' | 'in-progress' | 'automated';
  }>;
  toilEliminatedThisQuarter: number; // Hours saved per week
  cumulativeSavings: number; // Total hours saved since tracking began
}

Review the dashboard monthly. Celebrate toil elimination the same way you celebrate feature launches. An automation that saves 5 hours per week is equivalent to hiring 12.5% of an engineer. That's worth celebrating.
The Compound Effect
Here's what makes toil elimination so powerful: the savings compound. When you automate a task that took 5 hours per week, you don't just save 5 hours this week. You save 5 hours every week, forever. Over a year, that's 260 hours. Over three years, 780 hours. And the engineer-hours freed up can be spent on more toil elimination, creating a virtuous cycle.
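That arithmetic is easy to sanity-check (52 working weeks per year is the simplifying assumption used above):

```typescript
// Weekly savings compound into large annual totals.
function cumulativeHoursSaved(hoursPerWeek: number, years: number): number {
  const WEEKS_PER_YEAR = 52; // simplifying assumption: no holidays deducted
  return hoursPerWeek * WEEKS_PER_YEAR * years;
}

const oneYear = cumulativeHoursSaved(5, 1);    // 260 hours
const threeYears = cumulativeHoursSaved(5, 3); // 780 hours
```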
I tracked this on my team. In Q1, we automated 8 hours per week of toil. In Q2, using some of those freed hours, we automated another 12 hours per week. By Q4, we'd reduced total toil from 38% to 14%. The team shipped 40% more features that year with the same headcount.
That's not magic. It's arithmetic. But you have to actually do the work of identifying, measuring, and eliminating toil. Most teams never start because toil feels like "just part of the job." It isn't. It's waste, and your team deserves better.