Code Intelligence & Analysis

How to Use Git History as a Codebase Analysis Tool

Vaibhav Verma
7 min read
Tags: git, code analysis, developer tools, code intelligence, version control

Your git history is the most underutilized dataset in your engineering organization. Every commit is a record of what changed, who changed it, when, and (if your commit messages aren't garbage) why. That's years of data about how your codebase evolves, sitting right there in your .git directory.

Most teams use git for exactly two things: committing code and resolving merge conflicts. That's like buying a Formula 1 car to drive to the grocery store.

I've been mining git history for insights since 2016, and I consistently find patterns that no other tool surfaces. Here's how you can do the same.

The Basics: What Git Log Actually Contains

Every git commit stores:

  • Author and timestamp: Who made the change and when
  • Diff: Exactly what changed, line by line
  • Message: The developer's stated intent
  • Parent commits: The change's relationship to the project timeline

From these four pieces of data, you can derive:

  • Change frequency per file, per module, per author
  • Temporal coupling between files (which files change together)
  • Knowledge distribution (who knows what parts of the codebase)
  • Development patterns (when does the team work, how big are typical changes)
  • Risk indicators (which changes tend to precede bugs)

Let me walk through each of these with the actual commands.

Finding Your Hotspots

Your codebase's hotspots are files that change disproportionately often. These are your highest-risk, highest-priority areas for review and refactoring.

```bash
# Files with the most commits in the last 6 months
git log --since="6 months ago" --pretty=format: --name-only | \
  grep -v '^$' | sort | uniq -c | sort -rn | head -25
```

When I run this on a typical project, the results always surprise the team. Files they thought were stable turn out to be constantly changing. Files they worried about turn out to be untouched.

But raw commit count doesn't tell the whole story. Pair it with size:

```bash
# For each hotspot, show its line count alongside commit count
git log --since="6 months ago" --pretty=format: --name-only | \
  grep -v '^$' | sort | uniq -c | sort -rn | head -25 | \
  while read -r count file; do
    if [ -f "$file" ]; then
      lines=$(wc -l < "$file")
      echo "$count commits | $lines lines | $file"
    fi
  done
```

A 50-line config file with 30 commits is a different story than a 2,000-line module with 30 commits.
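If you want a single ranking out of those two numbers, one option is a composite score. The commits-times-lines weighting, the sample data, and the output format below are my own illustration, not a standard metric:

```bash
# Rank files by a composite hotspot score (commits * lines).
# Input lines use the "<n> commits | <n> lines | <file>" format
# produced by the loop above; sample data is inlined so this runs as-is.
printf '%s\n' \
  '30 commits | 50 lines | config/settings.yml' \
  '30 commits | 2000 lines | src/billing/core.py' \
  '45 commits | 120 lines | src/api/routes.py' | \
awk -F' \\| ' '{
  split($1, c, " "); split($2, l, " ")    # c[1] = commits, l[1] = lines
  printf "%10d  %s\n", c[1] * l[1], $3    # score = commits * lines
}' | sort -rn
```

In practice you would pipe the earlier loop's output straight into the awk stage instead of the printf.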

Discovering Temporal Coupling

Temporal coupling reveals hidden dependencies. If two files always change together, they're coupled, regardless of whether the code has an explicit dependency.

```bash
# Find files that frequently change in the same commit
git log --since="6 months ago" --pretty=format:'---' --name-only | \
  awk '
    function flush(  i, j) {
      for (i = 0; i < n; i++)
        for (j = i + 1; j < n; j++)
          if (files[i] < files[j]) print files[i], files[j]; else print files[j], files[i]
      n = 0
    }
    /^---$/ { flush(); next }   # commit boundary
    /^$/    { next }            # skip blank separator lines
            { files[n++] = $0 }
    END     { flush() }         # do not drop the last commit
  ' | sort | uniq -c | sort -rn | head -20
```

This command shows you the file pairs that change together most often. Some of these will be expected (a component and its test file). But you'll also find surprising pairs: a UI component and an unrelated API handler. A database migration and a seemingly unrelated utility function.

These unexpected couplings are architectural smells. They mean a change in one area requires a change in another, and that relationship isn't captured in the code's import graph. It's implicit knowledge that lives in developers' heads, and when those developers leave, the coupling causes bugs.
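To judge how strong a coupling is, I normalize the co-change count by the smaller of the two files' own commit counts. That normalization choice, the function name, and the numbers below are my own illustration:

```bash
# Coupling strength = co-changes / min(changes of A, changes of B).
# The counts come from the two analyses above; these values are invented.
coupling() {                       # args: co_changes changes_a changes_b
  co=$1; a=$2; b=$3
  min=$a
  if [ "$b" -lt "$min" ]; then min=$b; fi
  echo "$((100 * co / min))%"
}
coupling 18 40 35                  # 18 shared commits -> prints "51%"
```

Anything above roughly 50% is worth a closer look: half the time one file moves, the other moves with it.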

Mapping Knowledge Distribution

This is one of the most actionable analyses you can do. For every file in your codebase, who are the experts?

```bash
# Top contributor for each of your most-changed files
for file in $(git log --since="12 months ago" --pretty=format: --name-only | \
  grep -v '^$' | sort | uniq -c | sort -rn | head -20 | awk '{print $2}'); do
  echo "=== $file ==="
  git log --since="12 months ago" --pretty=format:'%an' -- "$file" | \
    sort | uniq -c | sort -rn | head -3
done
```

When one name dominates a critical file, that's a bus factor of 1. I've seen this kill teams. Senior developer leaves, and suddenly nobody can confidently modify the payment processing module.

The fix isn't just documentation. It's deliberate knowledge spreading: pairing junior developers with the expert on changes to that module, having the expert record architecture walkthroughs, and rotating review responsibilities.
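The "one name dominates" signal can be turned into an automatic check. The 75% threshold, the function name, and the sample data here are arbitrary choices of mine:

```bash
# Flag files where the top author owns >= 75% of recent commits.
# Input pairs mirror the "count author" output of the loop above.
flag_ownership() {                 # args: file, then "count:author" pairs
  file=$1; shift
  total=0; top=0; owner=""
  for pair in "$@"; do
    count=${pair%%:*}
    total=$((total + count))
    if [ "$count" -gt "$top" ]; then top=$count; owner=${pair#*:}; fi
  done
  if [ $((100 * top / total)) -ge 75 ]; then
    echo "BUS FACTOR 1: $owner owns $file"
  fi
}
flag_ownership "src/payments/charge.py" "48:alice" "5:bob" "2:carol"
```

Run it over your top-20 hotspots and you have a bus factor report in a few lines of shell.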

Analyzing Commit Patterns

Commit patterns reveal how your team actually works, as opposed to how you think they work.

```bash
# Commits by day of week
git log --since="6 months ago" --format='%ad' --date=format:'%A' | \
  sort | uniq -c | sort -rn

# Commits by hour
git log --since="6 months ago" --format='%ad' --date=format:'%H' | \
  sort | uniq -c | sort -rn

# Average commit size (lines changed)
git log --since="3 months ago" --shortstat --pretty=format: | \
  grep -v '^$' | awk '{ins+=$4; del+=$6; n++} END {print "Avg:", int((ins+del)/n), "lines per commit over", n, "commits"}'
```

I use the commit-size analysis as a health check. Teams with an average commit size over 200 lines are likely batching too many changes together. Teams with an average under 20 lines might be committing too granularly (or their CI is so slow they split work to avoid long queues).

The Framework: MINE Your Git History

M - Map hotspots: Run the frequency analysis. Identify your top 20 most-changed files. This is your focus area.

I - Identify coupling: Run temporal coupling analysis. Find the unexpected file pairs. Document the hidden dependencies.

N - Name the experts: Map knowledge distribution. Identify bus factor risks. Start knowledge-spreading initiatives for critical areas.

E - Examine trends: Don't just look at a snapshot. Run these analyses monthly and track the trends. Is your bus factor improving or getting worse? Are your hotspots stabilizing or growing?

The Contrarian Take: Commit Messages Are Overrated for This

Most guides on git analysis emphasize the importance of good commit messages. And yes, good commit messages help humans understand changes. But for analytical purposes, the diff is far more valuable than the message.

I've mined insights from codebases with terrible commit messages ("fix," "wip," "stuff") because the structural data (what changed, who changed it, when, and what else changed at the same time) is all in the metadata, not the message.

Don't wait for your team to adopt conventional commits before you start analyzing your git history. The data you need is already there.

Automating the Analysis

Running git commands manually is fine for exploration, but you want automation for ongoing tracking. Here's what I recommend:

  1. Write a script that runs your key analyses (hotspots, coupling, knowledge distribution) weekly
  2. Output the results to a JSON file
  3. Store the JSON in a simple dashboard (even a Google Sheet works)
  4. Review the trends monthly in your engineering retrospective
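A minimal version of steps 1 and 2 might look like the sketch below. The JSON shape and the output file name are my own choices, and you would extend it with the coupling and knowledge-distribution queries:

```bash
#!/bin/sh
# Weekly snapshot: top hotspots -> JSON, ready for a dashboard.
since="6 months ago"
hotspots=$(git log --since="$since" --pretty=format: --name-only | \
  grep -v '^$' | sort | uniq -c | sort -rn | head -5 | \
  awk '{printf "%s{\"file\":\"%s\",\"commits\":%d}", sep, $2, $1; sep=","}')
printf '{"generated":"%s","hotspots":[%s]}\n' "$(date +%F)" "$hotspots" \
  > "metrics-$(date +%F).json"
```

Drop it in cron (or a scheduled CI job) and the trend data accumulates on its own.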

The total investment is about a day of scripting. The payoff is continuous visibility into how your codebase is actually evolving, not how you hope it's evolving.

Tools like CodeScene and gitlog-to-csv can accelerate this. But the git commands I've shared above will get you 80% of the way there for zero cost.

Start mining. Your git history has been collecting data for years. It's time to use it.
