The Tribal Knowledge Problem: When Your Senior Dev Leaves
Marcus knew everything about the payment system. Where the edge cases lived, why the retry logic had that weird 7-second delay, which third-party API would return HTML error pages instead of JSON on Tuesdays. When Marcus gave his two weeks' notice, we realized something terrifying: nobody else understood how the most critical system in our infrastructure actually worked.
Marcus was generous with his remaining time. We scheduled knowledge transfer sessions, recorded Loom videos, and wrote documentation. We captured maybe 30% of what he knew. The other 70% walked out the door with him.
I've since seen this play out at four different companies. The details change. The pattern doesn't.
What Tribal Knowledge Actually Is
Tribal knowledge isn't "undocumented knowledge." It's the stuff that's hard to document because the people who hold it don't know they hold it. It's unconscious expertise. When Marcus debugged a payment failure, he didn't think "I should check the retry queue timestamp offset." He just checked it. It was muscle memory built over three years of incidents.
There are three types of tribal knowledge, each with different extraction challenges:
Type 1: Factual Knowledge. Specific facts about the system. "The batch job runs at 3am UTC." "The API rate limit is 100 requests per minute per tenant." This is the easiest to extract and document.
Type 2: Procedural Knowledge. How to do things. "When the payment gateway times out, check the webhook queue first, then check the dead letter queue, then check the gateway's status page." This is harder to extract because the expert often skips steps they consider obvious.
Type 3: Intuitive Knowledge. Pattern recognition and gut feelings. "This error message usually means the database connection pool is exhausted." "When this metric spikes, it's usually a bot, not real traffic." This is the hardest to extract because the expert can't always articulate why they know what they know.
Measuring Your Exposure
Before you can fix the problem, you need to quantify it. Here's the Knowledge Risk Matrix I use:
| System/Module | Primary Expert | Backup Expert | Documented? | Risk Level |
|---|---|---|---|---|
| Payment processing | Marcus | Nobody | Partially | Critical |
| User authentication | Sarah | James | Yes | Low |
| Order fulfillment | Sarah | Nobody | No | High |
| Search indexing | Marcus | Sarah | No | High |
| Reporting pipeline | James | Nobody | No | High |
Fill this out for your top 10 systems. Any row where the "Backup Expert" column says "Nobody" is a single point of failure. Any row where both "Backup Expert" is "Nobody" and "Documented" is "No" is a ticking time bomb.
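If you'd rather keep the matrix as data than as a wiki table, here's a minimal Python sketch of the same classification. It assumes you record each system with its primary expert, backup expert, and documentation status; the field names and example entries are illustrative, and the rules mirror the two sentences above.

```python
from dataclasses import dataclass

@dataclass
class SystemEntry:
    name: str
    primary_expert: str
    backup_expert: str | None  # None stands in for "Nobody"
    documented: bool

def classify(entry: SystemEntry) -> str:
    # Mirrors the rules above: no backup is a single point of failure;
    # no backup and no documentation is a ticking time bomb.
    if entry.backup_expert is None and not entry.documented:
        return "ticking time bomb"
    if entry.backup_expert is None:
        return "single point of failure"
    return "covered"

# Illustrative entries based on rows from the table above.
systems = [
    SystemEntry("User authentication", "Sarah", "James", documented=True),
    SystemEntry("Order fulfillment", "Sarah", None, documented=False),
    SystemEntry("Reporting pipeline", "James", None, documented=False),
]

for s in systems:
    print(f"{s.name}: {classify(s)}")
```

Re-run it (or re-fill the table) quarterly; the point isn't the script, it's having an answer on record for every critical system.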
I was wrong about bus factor for years. I thought of it as a morbid hypothetical. "What if someone gets hit by a bus?" The reality is more mundane. People take vacations. They go on parental leave. They get sick. They get promoted to a different team. They quit. The bus factor isn't about catastrophe. It's about everyday personnel changes that happen constantly.
The Knowledge Extraction Framework
When you identify a knowledge silo, here's the systematic process for extracting it. This works during normal operations, not just during someone's notice period.
Step 1: Shadow Sessions (Week 1)
The backup person shadows the expert for one week. Not pair programming. Shadowing. The backup watches while the expert works. The backup takes notes. Specifically, they write down:
- Every tool the expert opens
- Every command the expert runs
- Every file the expert checks
- Every question the expert asks themselves out loud (or doesn't say out loud, in which case the backup should prompt them to narrate)
The output is a raw, messy document of how the expert actually works. Not how they think they work. How they actually work.
Step 2: Incident Replay (Week 2)
Pull up the last 5 incidents that involved this system. Walk through each one with the expert. The key questions:
- How did you know where to start looking?
- What did you rule out, and how?
- What was the fix, and how did you know it would work?
- What's the fastest way you've seen this system fail?
Record these sessions. The recordings are more valuable than any summary because they capture the expert's reasoning process, not just the conclusions.
Step 3: Boundary Documentation (Week 3)
Have the expert draw the system's boundaries. What goes in, what comes out, what it connects to. Then have them annotate it with the things nobody knows:
- "This connection uses TLS 1.2 because the vendor doesn't support 1.3 yet"
- "This queue has a max message size of 256KB, which we hit once during the Black Friday incident"
- "This cron job must run before the daily report job or the numbers will be wrong"
Step 4: Test the Transfer (Week 4)
Give the backup person a realistic task: debug a simulated issue, implement a small change, or handle a staged incident. The expert is available for questions but doesn't drive. This reveals gaps in the knowledge transfer that documentation alone would miss.
The Contrarian Take: Code Review Is Your Best Knowledge Distribution Tool
Most articles about tribal knowledge recommend documentation, wikis, and knowledge bases. Those help, but they're secondary. The single most effective tool for distributing knowledge is mandatory code review, with one specific rule: no one reviews their own module exclusively.
If Marcus is the only one who reviews payment PRs, the knowledge stays concentrated. If Marcus and two other engineers rotate payment PR reviews, knowledge distributes naturally. Every PR review is a micro-training session. The reviewer reads the code, asks questions about things they don't understand, and builds familiarity with the system over time.
The rule I enforce: every module must have at least two approved reviewers, and the primary expert can review no more than 50% of the PRs in their area. This forces knowledge distribution as a side effect of normal work.
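Auditing that 50% cap is a few lines once you have the data. Here's a rough sketch, assuming you can export merged PRs from your code host as (module, reviewer) pairs; the export step and the sample data below are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical export: one (module, reviewer) pair per approved review.
reviews = [
    ("payments", "marcus"), ("payments", "marcus"), ("payments", "priya"),
    ("payments", "marcus"), ("search", "sarah"), ("search", "marcus"),
]

per_module: dict[str, Counter] = defaultdict(Counter)
for module, reviewer in reviews:
    per_module[module][reviewer] += 1

for module, counts in per_module.items():
    total = sum(counts.values())
    top_reviewer, top_count = counts.most_common(1)[0]
    share = top_count / total
    flag = "  <-- over the 50% cap" if share > 0.5 else ""
    print(f"{module}: {top_reviewer} reviewed {share:.0%} of {total} PRs{flag}")
```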
Preventive Measures
Extraction is damage control. Prevention is better.
Rotation
Rotate on-call responsibility across the team, even for systems where one person is the expert. On-call rotations force people to learn systems they didn't build. Yes, the first few rotations will be slower. That's an investment, not a cost.
Internal Tech Talks
Monthly 30-minute talks where each team member presents a system they own. The audience asks questions. The talk is recorded. Over a year, you build a library of system explanations that new hires can watch.
Architecture Decision Records
I've written about this separately, but ADRs capture the "why" behind decisions. When the expert leaves, the code tells you "what." ADRs tell you "why." The combination is almost as good as having the expert available.
Mandatory Documentation for Deploys
Any deployment that requires special steps, any migration that has specific ordering requirements, any release that needs manual verification: document it in a runbook. Not a wiki page. A runbook in the repo, co-located with the code it describes.
The Knowledge Preservation Checklist
Use this quarterly to assess your team's knowledge distribution:
- Every critical system has at least 2 people who can debug production issues
- Code review responsibilities are distributed (no single reviewer per module)
- On-call rotation covers all systems, not just the ones everyone knows
- The top 5 "things only one person knows" are documented
- New hires can set up and run every service locally without asking for help
- Incident post-mortems include "what did we learn about the system" as a section
- Architecture Decision Records exist for the top 10 technical decisions
- Runbooks exist for non-standard deployments and migrations
What to Do When It's Already Too Late
Sometimes the expert has already left and you're stuck. Here's the triage plan:
- Identify the gaps. What questions are you getting that nobody can answer? Track them for two weeks.
- Read the git history. `git log --author="Marcus"` shows every commit they made. Read the messages. Read the diffs for the most recent 50 commits. You'll learn a lot about what they were working on and how they thought. (A scripted version of this step follows the list.)
- Read their PR reviews. If your team uses GitHub, search for their review comments. Experts leave breadcrumbs in code reviews: "be careful here because..." and "this will break if..." comments are pure gold.
- Read the incident history. Post-mortems written by the departed expert contain their most concentrated knowledge.
- Reach out. Most engineers are happy to answer a few questions after they've left. Don't abuse this, but a focused Slack message with 3 specific questions is usually welcome.
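For the git-history step, a small script helps you work through the departed engineer's commits methodically instead of eyeballing the log. This is a minimal sketch assuming a local clone; the author string and the 50-commit limit come straight from the triage plan above.

```python
import subprocess

AUTHOR = "Marcus"  # assumption: matches the author name used in their commits
LIMIT = 50         # the "most recent 50 commits" from the triage plan

# List the departed engineer's most recent commits, oldest context last.
log = subprocess.run(
    ["git", "log", f"--author={AUTHOR}", f"--max-count={LIMIT}",
     "--pretty=format:%h %ad %s", "--date=short"],
    capture_output=True, text=True, check=True,
)

for line in log.stdout.splitlines():
    print(line)

# Then read each diff in turn with: git show <hash>
```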
The tribal knowledge problem isn't solved with a tool or a single initiative. It's solved with a culture that treats knowledge distribution as a first-class engineering activity, not an afterthought. Every sprint should include some knowledge-sharing work, even if it's just a 30-minute code reading session or a single ADR.
Your bus factor is probably lower than you think. Check it today.