Community Article

Testing Pyramid vs Trophy: Pick the Right Shape

Most teams ship the testing pyramid by accident. The trophy is what actually matches modern frontend work. Here is how to choose.

testing

unit-testing

craftsmanship

reliability

code-organization

By @aishasantos

March 4, 2026

Updated May 18, 2026

Most teams I have joined ship the testing pyramid by accident, not by design. They write a lot of unit tests because unit tests are easy to write, a handful of integration tests because someone read a blog post, and a few end-to-end tests that flake every Tuesday. The shape works out. Nobody chose it. Then a senior engineer reads the Kent C. Dodds essay about the testing trophy and proposes inverting the team's whole strategy, and the team spends six months arguing about something they never decided in the first place.

My take, after running this argument across four companies and three test frameworks: the pyramid is right for some codebases and the trophy is right for others, and the choice is mechanical once you know what to look for. This article covers the criteria I use, the costs each shape pushes onto the team, and the migration path I take when I inherit a team that picked the wrong one.

The two shapes, stated honestly

The testing pyramid, as Mike Cohn drew it in 2009, is a stack: many unit tests at the bottom, fewer integration tests in the middle, very few end-to-end (UI) tests at the top. The argument was that unit tests are fast and cheap, end-to-end tests are slow and brittle, so most of your test investment should be at the bottom.

The testing trophy, popularized by Kent C. Dodds around 2018, moved the weight to the middle. It has a small base of static analysis (TypeScript, ESLint), a layer of unit tests, a large layer of integration tests, and a small cap of end-to-end tests. The argument was that for modern frontend work, integration tests give you the most confidence per dollar, and unit tests of small components are mostly tautologies.

The two pictures look different but they share a structure: cheap, fast checks at the bottom, a narrow cap of expensive end-to-end tests at the top, and one thick layer that is expected to carry most of the confidence. The fight is about whether that thick layer should be unit tests or integration tests, and how big it should be.

Pyramid vs Trophy comparison

     PYRAMID                           TROPHY
     -------                           ------
     [E2E]                             [E2E]
    [INTEGR]                       [INTEGRATION]
   [   UNIT   ]                       [UNIT]
                                      [STATIC]

What each shape costs

Both shapes have hidden costs that nobody mentions in the introduction-to-testing posts.

The pyramid's hidden cost is mock maintenance. When most of your tests are unit tests, most of your tests use mocks. Mocks lie. They lie when you change the real dependency and forget to update the mock. They lie when the real dependency adds a new method that the mock does not implement. They lie when the real dependency's contract changes in a way that is backward compatible for real callers but that the mock, frozen at the old behavior, no longer matches. I have shipped at least three regressions in my career where every unit test passed because the mocks still gave the old answers, and the moment the code hit a real database it broke. Mock maintenance is a cost that grows with the test count.
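A minimal sketch of one way a mock drifts. The names (PriceSource, CatalogService, priceLabel) are hypothetical and not from any real project. The assumption is that the real service switched from dollars to integer cents while the hand-written mock kept returning dollars. Both sides are typed as number, so the compiler cannot help.

```typescript
// Hypothetical contract shared by the real service and its mock.
interface PriceSource {
  priceOf(sku: string): number;
}

// The real dependency after a refactor: prices are now integer cents.
class CatalogService implements PriceSource {
  private cents = new Map<string, number>([["sku_1", 1999]]);

  priceOf(sku: string): number {
    const value = this.cents.get(sku);
    if (value === undefined) throw new Error(`unknown sku: ${sku}`);
    return value;
  }
}

// The hand-written mock was never updated and still returns dollars.
const staleCatalogMock: PriceSource = { priceOf: () => 19.99 };

// Code under test, written against the old dollars contract.
function priceLabel(source: PriceSource, sku: string): string {
  return `$${source.priceOf(sku).toFixed(2)}`;
}

// The unit test sees the mock and stays green.
const againstMock = priceLabel(staleCatalogMock, "sku_1"); // "$19.99"

// The same code against the real dependency is off by a factor of 100.
const againstReal = priceLabel(new CatalogService(), "sku_1"); // "$1999.00"
```

Nothing in the unit suite would flag the second result, because the unit suite never talks to CatalogService.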

The trophy's hidden cost is integration test flake. When most of your tests run against a real database, real services, and a real DOM, the tests are slower and some failures are flaky. A 200-test suite that takes 12 minutes is fine. A 2,000-test suite that takes 90 minutes blocks the whole team's deploys. Flake compounds: every flaky test that the team learns to retry instead of fix is a place where a real bug can hide until it ships. The trophy works if you invest in test infrastructure (parallel test runners, fast test databases, deterministic clocks, headless browsers); it fails if you do not.
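A back-of-the-envelope sketch of why flake compounds with suite size. It assumes every test fails spuriously with the same small probability and that failures are independent, which real suites only approximate. The 0.1% rate is an illustrative number, not a measurement.

```typescript
// Probability that the whole suite is green on a single run, assuming each
// test independently fails spuriously with probability `flakeRate`.
function suitePassProbability(testCount: number, flakeRate: number): number {
  return Math.pow(1 - flakeRate, testCount);
}

// Expected number of full runs until one comes back green
// (mean of a geometric distribution).
function expectedRunsUntilGreen(testCount: number, flakeRate: number): number {
  return 1 / suitePassProbability(testCount, flakeRate);
}

const smallSuite = suitePassProbability(200, 0.001);  // roughly 0.82
const largeSuite = suitePassProbability(2000, 0.001); // roughly 0.14
const rerunsForLarge = expectedRunsUntilGreen(2000, 0.001); // a bit over 7
```

Under these assumptions, the same per-test flake rate that is a minor annoyance at 200 tests makes a 2,000-test suite red most of the time, which is how a team ends up retrying by reflex.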

Neither cost is a deal-breaker. Both are real and shape the choice.

The criteria I actually use

When I join a team and have to recommend a shape, I look at four things. The answers usually point cleanly at one shape.

Criterion 1: Where does the value live? A library or pure-logic codebase (a parser, a tax calculator, a cryptographic primitive) lives in pure functions. The thing you want to test is whether parse('2x + 3') returns the right AST, and unit tests are exactly that. The pyramid fits. A frontend app or a typical CRUD service lives in the seams between components: the page renders, the form submits, the data round-trips through a database. Testing those seams is integration work. The trophy fits.

Criterion 2: How much static safety does the language give you? TypeScript with strict mode catches the typo bugs that a unit test would otherwise have to catch. So does a mature compiler. So does a good IDE. A statically typed codebase has a thicker static base, which means it can lean on fewer unit tests for the same confidence. A dynamically typed codebase (vanilla JS, Python without type hints, untyped Ruby) needs more unit tests to fill that gap. This is a real input to the shape.
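A small sketch of the kind of bug the static layer absorbs. OrderStatus and statusLabel are hypothetical names. The point is that a misspelled status or a forgotten case is rejected at compile time under strict mode, so no unit test has to exist for it.

```typescript
type OrderStatus = "pending" | "paid" | "shipped";

function statusLabel(status: OrderStatus): string {
  switch (status) {
    case "pending":
      return "Awaiting payment";
    case "paid":
      return "Paid";
    case "shipped":
      return "On its way";
    default: {
      // If a new member is added to OrderStatus and not handled above,
      // this assignment stops compiling.
      const unreachable: never = status;
      throw new Error(`unhandled status: ${String(unreachable)}`);
    }
  }
}

// statusLabel("payed"); // rejected by the compiler: not assignable to OrderStatus

const label = statusLabel("paid");
```

In untyped JavaScript the same typo would only surface at runtime, which is the gap the extra unit tests are filling.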

Criterion 3: How expensive is your integration environment to spin up? If a single test can spin up a Postgres in a Docker container in 600 ms, run a real query, and tear it down, you can afford a lot of integration tests. If your integration tests need a Kubernetes cluster, three external services, and a 5-minute warm-up, you cannot, and the pyramid is a more honest match for the constraint.
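A rough estimate of what the setup cost does to wall-clock time. This sketch assumes every test pays the environment setup itself and that work divides evenly across workers. The 2,000-test count, the 80 ms run time, and the four workers are illustrative inputs, and SuiteProfile is a hypothetical type.

```typescript
interface SuiteProfile {
  testCount: number;
  setupMs: number; // cost to bring up the environment for one test
  runMs: number;   // cost of the test body itself
  workers: number; // parallel test processes
}

// Wall-clock minutes, assuming tests are spread evenly across workers.
function estimateWallClockMinutes(p: SuiteProfile): number {
  const testsPerWorker = Math.ceil(p.testCount / p.workers);
  return (testsPerWorker * (p.setupMs + p.runMs)) / 60000;
}

// Postgres in a container at 600 ms per test: under six minutes.
const cheapEnv = estimateWallClockMinutes({
  testCount: 2000, setupMs: 600, runMs: 80, workers: 4,
});

// A 5-minute warm-up paid per test: the suite no longer fits in a day.
const costlyEnv = estimateWallClockMinutes({
  testCount: 2000, setupMs: 5 * 60000, runMs: 80, workers: 4,
});
```

Sharing one warm environment across many tests changes the arithmetic, but it also reintroduces shared state, which is its own source of flake.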

Criterion 4: How often do you change the seams vs the internals? A codebase where the public API is stable and the internals churn (a payment processor, a compiler) benefits from heavy unit tests of the internals because the seams rarely change. A codebase where the public API churns and the internals are stable (a startup product, a feature-flag-heavy codebase) benefits from heavy integration tests because they are robust to internal refactoring.
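One possible way to write the four questions down as a checklist. This is a sketch, not a formula: the field names, the one-vote-per-criterion scoring, the 1,000 ms affordability threshold, and the "no clear answer" outcome are all assumptions made for illustration.

```typescript
type Shape = "pyramid" | "trophy";

interface CodebaseProfile {
  valueLivesIn: "pure-logic" | "seams";  // criterion 1
  staticallyTyped: boolean;              // criterion 2
  integrationSetupMs: number;            // criterion 3
  churnsMostly: "internals" | "seams";   // criterion 4
}

function recommendShape(
  p: CodebaseProfile,
  affordableSetupMs = 1000, // hypothetical cutoff for "cheap to spin up"
): Shape | "no clear answer" {
  // Each criterion casts one vote for the trophy when it is true.
  const trophyVotes = [
    p.valueLivesIn === "seams",
    p.staticallyTyped, // a thick static base reduces the need for unit tests
    p.integrationSetupMs <= affordableSetupMs,
    p.churnsMostly === "seams",
  ].filter(Boolean).length;

  if (trophyVotes >= 3) return "trophy";
  if (trophyVotes <= 1) return "pyramid";
  return "no clear answer";
}

const webApp = recommendShape({
  valueLivesIn: "seams", staticallyTyped: true,
  integrationSetupMs: 800, churnsMostly: "seams",
});

const parserLibrary = recommendShape({
  valueLivesIn: "pure-logic", staticallyTyped: true,
  integrationSetupMs: 300000, churnsMostly: "internals",
});
```

A split vote is a signal to look harder at the codebase, not to average the two shapes.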

A 3,000-test suite that caught nothing

A recent team I joined was building a Next.js app with a Prisma + Postgres backend. They had inherited a 3,000-test unit suite that took 4 minutes to run and caught approximately zero real bugs. The bugs they shipped were UI regressions, schema migration mistakes, and broken Stripe webhooks. The unit tests were checking the inputs and outputs of internal helper functions that had not changed in two years.

The four criteria pointed cleanly at the trophy. Value lived in seams (rendered pages, submitted forms, webhook receivers). The codebase was strict TypeScript. Postgres in Docker took 800 ms to spin up. The seams changed every week.

We spent a quarter migrating. We deleted about 1,800 unit tests that were testing internal implementation details. We added integration tests that rendered pages with React Testing Library against a real database, exercised the Stripe webhook handlers with synthesized signed payloads, and verified migrations against snapshots of production data. The full suite, post-migration, ran in 6 minutes and caught real regressions in the first week.
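A sketch of what a synthesized signed payload can look like using only Node's crypto module. It follows Stripe's documented scheme, in which the Stripe-Signature header carries a timestamp and an HMAC-SHA256 of "timestamp.rawBody" keyed with the endpoint secret. The function names, the secret, and the event body are hypothetical. verifyStripeHeader is a stand-in so the example is self-contained; in a real handler the check would normally go through the Stripe library's constructEvent.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Build a Stripe-Signature header value for a raw request body.
function signStripePayload(rawBody: string, secret: string, timestamp: number): string {
  const signature = createHmac("sha256", secret)
    .update(`${timestamp}.${rawBody}`, "utf8")
    .digest("hex");
  return `t=${timestamp},v1=${signature}`;
}

// Simplified verifier for the sketch. It does not enforce a timestamp
// tolerance, which a production check should.
function verifyStripeHeader(rawBody: string, header: string, secret: string): boolean {
  const parts = new Map(header.split(",").map((kv) => kv.split("=") as [string, string]));
  const t = parts.get("t");
  const v1 = parts.get("v1");
  if (!t || !v1) return false;

  const expected = createHmac("sha256", secret)
    .update(`${t}.${rawBody}`, "utf8")
    .digest("hex");
  const given = Buffer.from(v1, "hex");
  const wanted = Buffer.from(expected, "hex");
  return given.length === wanted.length && timingSafeEqual(given, wanted);
}

const secret = "whsec_test_secret"; // hypothetical test-only secret
const body = JSON.stringify({ id: "evt_test_1", type: "checkout.session.completed" });
const header = signStripePayload(body, secret, 1700000000);

const acceptsGenuine = verifyStripeHeader(body, header, secret);
const rejectsTampered = !verifyStripeHeader(body + " ", header, secret);
```

Signing the exact raw body matters: if the test serializes the event one way and the handler re-serializes it another way, the signature will not match.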

If we had been writing a TypeScript JSON parser instead, we would have done the opposite: kept the unit tests, deleted the integration scaffolding, and put 95% of our test budget on input/output pairs of the parser core.

The integration test that earns its keep

This is the test shape I write most often these days, and it is what I mean when I say "integration".

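// Assumes supertest for the HTTP call. app, db, and the helpers
// (createTestUser, givenCart, ...) come from the project's own test setup.
import request from 'supertest';

test('checkout endpoint creates an order and clears the cart', async () => {
    // Arrange: a user with a cart, enough stock, and a Stripe stub that succeeds.
    const user = await createTestUser();
    await givenCart(user.id, [
        { productId: 'prod_a', quantity: 2 },
        { productId: 'prod_b', quantity: 1 },
    ]);
    await givenStock({ prod_a: 5, prod_b: 1 });
    await givenStripeWebhookSucceeds();

    // Act: call the real route through the real session middleware.
    const res = await request(app)
        .post('/checkout')
        .set('Cookie', sessionCookieFor(user.id))
        .send({ token: 'tok_test_123' });

    // Assert: the request succeeded, one order exists, and the cart row is gone.
    expect(res.status).toBe(200);
    expect(await db.order.findMany({ where: { userId: user.id } })).toHaveLength(1);
    expect(await db.cart.findUnique({ where: { userId: user.id } })).toBeNull();
});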

This test exercises the HTTP layer, the session middleware, the database, the Stripe webhook stub, and the business logic, in one shot. If any of those break, this test fails. If any of those are refactored without changing behavior, this test passes. That is the property I want from a test, and it is what unit tests of calculateOrderTotal() cannot give me.

The time cost is 80 ms per test on a warm database. The setup cost is real: I had to build createTestUser, givenCart, givenStock, givenStripeWebhookSucceeds, and sessionCookieFor. I built them once. I have written 200 tests on top of them in the past year. The amortized per-test cost is small.

Where the pyramid still wins

I want to be careful not to come across as anti-pyramid. There are codebases where the pyramid is correct, and on those codebases the trophy is wrong. The pure-logic codebases mentioned above are the cleanest example. A second category is performance-critical inner loops: a sorting algorithm, a hashmap implementation, a checksum routine. Unit tests of those are fast, the inputs and outputs are well-defined, and integration tests would just measure the same thing slower.
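A sketch of what pyramid-style tests look like for a checksum routine: a pure function plus a table of known input/output pairs. Adler-32 is used here only as a convenient example, and this version assumes ASCII input to keep the code short.

```typescript
// Adler-32 over the character codes of an ASCII string.
function adler32(input: string): number {
  const MOD = 65521; // largest prime below 2^16
  let a = 1;
  let b = 0;
  for (let i = 0; i < input.length; i++) {
    a = (a + (input.charCodeAt(i) & 0xff)) % MOD;
    b = (b + a) % MOD;
  }
  // >>> 0 keeps the result an unsigned 32-bit value.
  return ((b << 16) | a) >>> 0;
}

// Known input/output pairs. Each row is one unit test.
const cases: Array<[string, number]> = [
  ["", 0x00000001],
  ["a", 0x00620062],
  ["Wikipedia", 0x11e60398],
];

const failures = cases.filter(([input, expected]) => adler32(input) !== expected);
```

In a Jest suite the same table would feed test.each, and adding a case is one line. No database, no HTTP, no mocks, which is why the pyramid is the honest shape here.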

A third category, less obvious: legacy codebases with no integration scaffolding. Building the scaffolding to run integration tests against a 15-year-old monolith with no test database is a quarter of work before the first test runs. In that situation, characterization unit tests of pure functions inside the monolith are the right move because they pay off immediately.

How I migrate teams that picked the wrong shape

The migration is mechanical, low-drama, and takes a quarter of focused work. I do not delete any tests on day one. I add the new shape's tests for new features only. After three months of new tests in the new shape, I look at the old tests, identify the ones that are tautologies or redundant with the new tests, and delete them in batches. The team's confidence in the test suite goes up first; the test count goes down second.

The order matters. If you delete the old tests first, the team panics. If you add the new tests first and then delete, nobody notices the old tests leaving because their job is being done by the new ones.

A test suite is not the goal

The argument I keep coming back to, when teams litigate the pyramid vs trophy choice on Slack: a test suite is not the goal. The goal is fast, confident shipping. The shape that makes that easier on your codebase is the right shape, and the shape that makes it harder is the wrong one. The pyramid and the trophy are two reasonable defaults; both can be wrong for your situation, and the criteria above are the way I figure out which.

The one anti-pattern that is wrong everywhere: a test suite that everyone knows is flaky and that the team has learned to retry. That is not a test suite. That is a slow CI that occasionally tells the truth. The fix is not a different shape. The fix is a culture where flaky tests are a P0 to investigate, the same way a real bug would be. Once that is in place, either shape works. Without it, neither does.
