Why AI-Generated Tests Give You False Confidence
I had a PR with 94% code coverage. Every test passed. The CI pipeline was green across the board. We shipped on Friday afternoon. By Saturday morning, production was down because a null pointer exception hit a path that none of our tests covered meaningfully.
The tests were all AI-generated. They looked thorough. They covered every function. And they were almost completely useless.
This post explains why AI-generated tests create a dangerous illusion of safety, and how to fix it.
The Core Problem
AI generates tests by looking at the implementation and writing assertions that match what the code does. That's fundamentally backwards. Tests should verify what the code should do, regardless of how it currently does it.
Here's the difference in practice:
// The implementation
function calculateDiscount(order: Order): number {
if (order.total > 100) return order.total * 0.1;
if (order.items.length > 5) return order.total * 0.05;
return 0;
}
// AI-generated test (mirrors the implementation)
describe("calculateDiscount", () => {
it("returns 10% for orders over $100", () => {
const order = { total: 150, items: [{ id: "1" }] };
expect(calculateDiscount(order)).toBe(15);
});
it("returns 5% for orders with more than 5 items", () => {
const order = { total: 50, items: Array(6).fill({ id: "1" }) };
expect(calculateDiscount(order)).toBe(2.5);
});
it("returns 0 for small orders", () => {
const order = { total: 20, items: [{ id: "1" }] };
expect(calculateDiscount(order)).toBe(0);
});
});
Looks great, right? 100% coverage. All green. But this test suite misses:
- What happens when total is exactly 100? (boundary condition)
- What happens when total is negative? (invalid input)
- What happens when items is an empty array? (edge case)
- What happens when an order qualifies for both discounts? The implementation applies only the 10% rule. Is that intended? (business logic question)
- What happens when total is 101 and items.length is 6? (priority question)
The AI didn't test any of these because the AI doesn't understand the business rules. It only understands the code.
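Here's a sketch of the tests a human would add for those questions. The expected values assume the current behavior is intended (no discount at exactly $100, the 10% rule wins when both apply, negative totals fall through to zero); confirming those assumptions with the business is the whole point of writing them.
// Boundary and edge-case tests; expected values assume the current behavior is intended
describe("calculateDiscount edge cases", () => {
  it("gives no discount at exactly $100", () => {
    expect(calculateDiscount({ total: 100, items: [{ id: "1" }] })).toBe(0);
  });

  it("applies only the 10% discount when an order qualifies for both", () => {
    const order = { total: 150, items: Array(6).fill({ id: "1" }) };
    expect(calculateDiscount(order)).toBe(15); // documents the current priority
  });

  it("returns 0 for a negative total instead of a negative discount", () => {
    // Whether this should throw instead is a product decision, not an AI guess
    expect(calculateDiscount({ total: -10, items: [] })).toBe(0);
  });
});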
Five Patterns of Useless AI Tests
Pattern 1: The Mirror Test
The test is literally a restatement of the implementation.
// Implementation
function formatName(first: string, last: string): string {
return `${first} ${last}`.trim();
}
// AI test (useless: just re-implements the function)
test("formats name", () => {
expect(formatName("John", "Doe")).toBe("John Doe");
});
// What it should also test:
// formatName("", "Doe") => "Doe" (not " Doe")
// formatName("John", "") => "John" (not "John ")
// formatName(" John ", " Doe ") => "John Doe" or "John Doe"?Pattern 2: The Happy Path Only
Pattern 2: The Happy Path Only
AI tests almost always test the success case. Error paths, timeouts, and failure modes are either missing or trivially tested.
// AI only tests the success case
test("fetches user successfully", async () => {
const user = await getUser("valid-id");
expect(user).toBeDefined();
expect(user.name).toBe("Test User");
});
// Missing tests:
// - What happens with an invalid ID format?
// - What happens when the API returns 500?
// - What happens when the network times out?
// - What happens when the response is malformed JSON?
// - What happens under rate limiting?
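Here's a sketch of just the first missing case. It assumes getUser wraps the global fetch and is supposed to surface HTTP failures as thrown errors; both are assumptions about the contract that the AI never stopped to ask about.
// Error-path test; assumes getUser uses global fetch and throws on non-2xx responses
test("throws a descriptive error when the API returns 500", async () => {
  jest.spyOn(globalThis, "fetch").mockResolvedValue(new Response(null, { status: 500 }));

  await expect(getUser("valid-id")).rejects.toThrow(/500|server/i);

  jest.restoreAllMocks();
});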
Pattern 3: The Over-Mocked Test
AI loves mocking. It mocks so aggressively that the test doesn't test anything real.
// AI-generated: everything is mocked, nothing is tested
test("processOrder creates order", async () => {
const mockDb = { order: { create: jest.fn().mockResolvedValue({ id: "1" }) } };
const mockPayment = { charge: jest.fn().mockResolvedValue({ success: true }) };
const mockEmail = { send: jest.fn().mockResolvedValue(true) };
const result = await processOrder(
{ items: [], total: 100 },
mockDb as any,
mockPayment as any,
mockEmail as any,
);
expect(mockDb.order.create).toHaveBeenCalled();
expect(mockPayment.charge).toHaveBeenCalled();
expect(mockEmail.send).toHaveBeenCalled();
expect(result.success).toBe(true);
});
// This test passes even if processOrder has a bug,
// because we mocked away all the real behavior
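Here's the same test with the same fakes, but the assertions target behavior a bug would actually break. The charge payload shape and the { success: false } return are assumptions about processOrder's contract; the point is that the test now encodes that contract instead of echoing the mocks.
// Asserts on values and on the failure path, not just "was it called"
test("does not send a confirmation email when payment fails", async () => {
  const mockDb = { order: { create: jest.fn().mockResolvedValue({ id: "1" }) } };
  const mockPayment = { charge: jest.fn().mockRejectedValue(new Error("card declined")) };
  const mockEmail = { send: jest.fn() };

  const result = await processOrder(
    { items: [{ id: "1", price: 100 }], total: 100 },
    mockDb as any,
    mockPayment as any,
    mockEmail as any,
  );

  // Assumed contract: the charge amount matches the order total, and a failed
  // charge means no confirmation email and success: false
  expect(mockPayment.charge).toHaveBeenCalledWith(expect.objectContaining({ amount: 100 }));
  expect(mockEmail.send).not.toHaveBeenCalled();
  expect(result.success).toBe(false);
});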
Pattern 4: The Snapshot Trap
AI generates snapshot tests for complex outputs. These tests break every time the output format changes, even if the behavior is correct. They become noise that teams eventually stop maintaining.
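The alternative is a targeted assertion on the value that actually matters. formatInvoice and the expected string below are illustrative stand-ins, not code from the project:
const order = { total: 150, items: [{ id: "1", price: 150 }] };

// Snapshot: fails on every cosmetic change, so failures stop meaning anything
test("renders invoice", () => {
  expect(formatInvoice(order)).toMatchSnapshot();
});

// Targeted: survives formatting churn, fails when the number is wrong
test("invoice shows the discounted total", () => {
  expect(formatInvoice(order)).toContain("Total: $135.00");
});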
Pattern 5: The "It Works" Assertion
// The most useless assertion AI generates
test("should work", async () => {
const result = await complexOperation(input);
expect(result).toBeDefined(); // This passes for literally any return value
expect(result).not.toBeNull(); // Also passes for wrong values
});
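The fix is to assert on specific, observable outcomes. The fields below are placeholders, because I don't know complexOperation's contract either; the difference is that the AI ships the test anyway instead of asking.
// Assert on concrete outcomes; the field names stand in for the real contract
test("processes every input record without errors", async () => {
  const result = await complexOperation(input);
  expect(result.status).toBe("completed");
  expect(result.records).toHaveLength(input.length);
  expect(result.errors).toEqual([]);
});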
The Mutation Testing Proof
To prove that AI tests provide false confidence, I ran mutation testing on three projects. Mutation testing makes small changes to your code (flipping operators, removing lines) and checks whether your tests catch each change.
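For example, a typical mutant flips the boundary comparison in calculateDiscount. None of the AI-generated tests from earlier use a total of exactly 100, so nothing fails and the mutant survives:
// Original
if (order.total > 100) return order.total * 0.1;
// Mutant: survives, because no test exercises total === 100
if (order.total >= 100) return order.total * 0.1;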
| Project | Code Coverage | Mutation Score | Gap |
|---|---|---|---|
| Project A (AI tests) | 91% | 34% | 57% |
| Project B (AI tests) | 87% | 41% | 46% |
| Project C (human tests) | 76% | 68% | 8% |
Project C had lower code coverage but a much higher mutation score. Its tests actually caught bugs. Projects A and B had high coverage but their tests let most mutations survive. Those tests give you a green dashboard while bugs slip through.
The Fix: Test-First AI Workflow
Here's the workflow I use now. It takes slightly longer but produces tests that actually protect your code.
Step 1: Write Test Descriptions First (Human)
describe("OrderProcessor", () => {
// Happy paths
it("should create an order with valid items and payment");
it("should apply volume discount for orders over 10 items");
// Edge cases
it("should reject orders with zero items");
it("should reject orders with negative prices");
it("should handle exactly 10 items without discount");
// Failure modes
it("should roll back order if payment fails");
it("should retry payment once on timeout");
it("should not send confirmation email on payment failure");
// Concurrency
it("should prevent double-charging with idempotency key");
});
Step 2: Let AI Implement Test Bodies (AI-Assisted)
Give the AI your test descriptions and ask it to implement them. This constrains the AI to test what matters, not what's easy.
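For example, handed the descriptions above, the AI might fill in the zero-items case like this. The OrderProcessor wiring and the error message are assumptions it should confirm against your actual code:
it("should reject orders with zero items", async () => {
  const processor = new OrderProcessor(db, paymentGateway, emailService);

  await expect(
    processor.process({ items: [], paymentMethod: "card" }),
  ).rejects.toThrow("Order must contain at least one item");
});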
Step 3: Verify with Mutation Testing (Automated)
npx stryker run --mutate "src/order-processor.ts"
# Target: > 60% mutation score
Step 4: Fill Gaps (Human)
Mutation testing reveals which code changes aren't caught by tests. Write targeted tests for those gaps.
The Quality Gate
Add this to your CI pipeline. Block merges when AI-generated tests don't meet the bar:
# .github/workflows/test-quality.yml
- name: Mutation Testing
run: npx stryker run
env:
STRYKER_THRESHOLD_HIGH: 70
STRYKER_THRESHOLD_LOW: 50
STRYKER_THRESHOLD_BREAK: 40 # Fail CI below this score
The Takeaway
Code coverage is a necessary but insufficient metric. When AI generates your tests, coverage becomes almost meaningless. Mutation testing is the metric that tells you whether your tests actually protect against bugs.
The rule is simple: humans decide what to test, AI helps with how to test it. Reverse that order, and you get a green dashboard that means nothing.