A travel-booking flow I once worked on had to do four things in sequence: hold a flight seat, hold a hotel room, charge the credit card, and send a confirmation email. Each one was a separate service call to a separate vendor. The simplest version of the code did them in order, with no recovery logic. When the third call (credit card) failed because of a card decline, the user got an error, and the seat and the hotel room stayed held until those vendors timed out. The user then tried again with a new card, the booking succeeded, and now they had two seats and two hotel rooms held under their name. Vendor support calls followed.
The fix was a saga. Not the textbook saga, exactly. A pragmatic one, with compensations that worked, idempotent steps that retried correctly, and a coordination model that was easy to reason about three months later. This article is what I learned designing that saga and the next two.
My stance: the saga pattern is the right answer when distributed transactions are not available, which is most of the time. But the hard part is not the pattern itself; it is designing compensations that genuinely undo what the forward step did. The choreography-vs-orchestration debate is real but it is a second-order concern. Get the compensations right first.
What a saga actually is
A saga is a sequence of local transactions, where each step has a compensating action that semantically undoes that step if a later step fails. The classic example is exactly the booking flow: hold flight, hold hotel, charge card, send email. If the charge fails, you compensate by releasing the hotel hold and then releasing the flight hold, in reverse order.
The pattern is named after a 1987 paper by Garcia-Molina and Salem about long-lived transactions in databases. The original idea was to break a long transaction into smaller, individually-committable chunks, with rollback handled by application logic instead of the database. The microservices community picked it up because the same shape applies to a sequence of service calls.
The key word in the definition is semantically. The compensation does not roll back the database to its prior state; it issues a new operation that has the effect of undoing the original. "Release the seat hold" is a real operation that the airline's API supports; it is not a database rollback. Designing compensations is the hard part precisely because the underlying systems are usually one-way (you cannot un-send an email, you cannot un-charge a credit card without explicitly issuing a refund).
What 2PC would do, and why we cannot use it
The textbook alternative to a saga is a two-phase commit (2PC). In the first phase, the coordinator asks every participant to prepare, and each one votes yes or no. In the second phase, the coordinator tells everyone to commit if all votes were yes, or tells everyone to abort and roll back if any vote was no. 2PC gives you atomicity across services: either all four operations happen or none do.
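To make the two phases concrete, here is a minimal in-memory sketch. The `Participant` class and its `can_commit` flag are hypothetical stand-ins; a real participant would lock resources in `prepare` and write a durable log, which this sketch leaves out.

```python
from typing import List


class Participant:
    """In-memory stand-in for a service that speaks 2PC (hypothetical)."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self) -> bool:
        # A real participant would lock resources here and hold them until phase 2.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: List[Participant]) -> bool:
    # Phase 1: ask everyone to prepare and collect the votes.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only on a unanimous yes, otherwise abort everyone.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

Note what the sketch assumes: every participant exposes `prepare`, `commit`, and `abort`. That assumption is exactly what fails with third-party vendors.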
Why I do not use 2PC for cross-service operations:
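- Most third-party APIs do not support it. The airline's booking API, the credit card processor, the email service: none of them implement the 2PC protocol. You cannot ask Stripe to take part in a coordinator-driven prepare and commit alongside the airline and the hotel; a card authorization that you capture later looks similar, but it is a vendor-specific hold, not a vote in a shared transaction. The protocol requires participation from every service in the transaction, and you do not control the third-party services.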
- It blocks resources during the prepare phase. A 2PC transaction holds resources locked between prepare and commit. For a flight seat, with several vendors in the same transaction, that lock can last minutes, blocking other customers from booking the same seat. That is not acceptable.
- It does not survive coordinator failure cleanly. If the coordinator crashes after participants have voted yes but before they all learn the outcome, the prepared participants are stuck in doubt: they can neither commit nor abort on their own, and they keep holding their locks until the coordinator recovers. The recovery protocol is non-trivial and easy to get wrong.
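For a transaction that touches several tables in a single database, you do not need 2PC at all: an ordinary local transaction is already atomic. For a transaction that spans several databases you operate yourself, 2PC (for example through XA) can be workable. For a sequence of service calls across vendors, 2PC is not a real option. Sagas are.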
Compensation design: the hard part
Three properties every compensation must have:
- Semantic inverse. The compensation must produce the effect of undoing the forward step, even if it does not literally roll back state. "Refund $50" is the semantic inverse of "Charge $50"; it does not erase the original charge from the ledger.
- Idempotent. A compensation can be called multiple times by a retrying coordinator. Calling "release seat hold for booking 42" twice should leave the system in the same state as calling it once.
- Always succeeds (eventually). If a compensation fails, the saga is stuck. Compensations must be designed to retry indefinitely until they succeed, or escalate to a human after a bounded number of failures.
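The second and third properties can be sketched as a small retry wrapper. This is a minimal sketch, not a definitive implementation: `run_compensation`, `CompensationEscalated`, and the backoff parameters are hypothetical names and values, and the wrapper is only safe if the compensation it calls is itself idempotent.

```python
import time
from typing import Callable


class CompensationEscalated(Exception):
    """Raised when a compensation exhausts its retries and needs a human."""


def run_compensation(
    compensate: Callable[[], None],
    max_attempts: int = 5,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry an idempotent compensation with exponential backoff.

    Returns the number of attempts used. Raises CompensationEscalated
    once max_attempts is exhausted so the caller can alert a human.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            compensate()
            return attempt
        except Exception as exc:
            if attempt == max_attempts:
                raise CompensationEscalated(str(exc)) from exc
            # Back off before retrying: 0.5s, 1s, 2s, ... with the defaults.
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the wrapper can be tested without waiting. Whether escalation means a pager alert or a row in a review queue is a decision for the team operating the saga.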
The hardest case I have encountered: compensating an operation that has external visibility. The confirmation email is the canonical example. Once the email is sent, the user has seen it; you cannot send a "please disregard the confirmation" email without making the situation worse. Two design choices that work: send the email last, or delay the send.
The first is what most well-designed sagas do. Send the confirmation email only after every other step has succeeded; if any earlier step fails, the email is never sent. "Confirmation last" is a standard saga design rule.
The second is more subtle. Instead of "send email immediately", queue the email for delivery five minutes from now, and let the saga compensate by canceling the queued send. The email is irreversible only after it is delivered; if you delay delivery, you create a window where compensation is still possible. This works for emails and for some other notifications. It does not work for everything.
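A minimal sketch of the delayed-send idea, assuming an in-memory queue keyed by booking ID. `DelayedOutbox` and `QueuedEmail` are hypothetical names; a real system would keep the queue in a durable store and run `deliver_due` from a scheduler.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class QueuedEmail:
    booking_id: str
    send_at: float           # epoch seconds when delivery becomes allowed
    cancelled: bool = False
    delivered: bool = False


@dataclass
class DelayedOutbox:
    delay_s: float = 300.0   # the five-minute compensation window
    queue: Dict[str, QueuedEmail] = field(default_factory=dict)

    def enqueue(self, booking_id: str, now: float) -> None:
        # Forward step. Idempotent: a repeated call keeps the first entry.
        self.queue.setdefault(booking_id, QueuedEmail(booking_id, now + self.delay_s))

    def cancel(self, booking_id: str) -> bool:
        """Compensation. Returns False only if the email was already delivered."""
        email = self.queue.get(booking_id)
        if email is None:
            return True      # nothing was queued, so nothing to undo
        if email.delivered:
            return False     # too late: the window has closed
        email.cancelled = True
        return True

    def deliver_due(self, now: float) -> List[str]:
        # Mark every due, non-cancelled email as delivered and report which ones.
        sent = []
        for email in self.queue.values():
            if not email.cancelled and not email.delivered and email.send_at <= now:
                email.delivered = True
                sent.append(email.booking_id)
        return sent
```

The `False` return from `cancel` is the case the saga still has to handle some other way, which is why this technique complements "confirmation last" and does not replace it.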
Two compensation gotchas I learned the hard way
The first one: timing-dependent compensations. The flight hold expires after 15 minutes if not converted to a booking. If your saga runs longer than that, say 20 minutes because some step retried for a long time, and then you start compensating, the flight release call hits a hold that has already auto-expired. The vendor returns "unknown booking ID" and the compensation "fails". Most teams handle this by treating "already gone" as success: if the resource is not held anymore, the compensation has nothing to do. Code the compensation to handle the "already released" case as a no-op success, not as an error. Otherwise the saga gets stuck on a phantom failure.
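A sketch of that rule, assuming the vendor client signals "unknown booking ID" with a dedicated exception. `HoldNotFound` and `FakeAirline` are hypothetical; a real vendor API will report the condition in its own way, often as an HTTP 404 or an error code in the response body.

```python
class HoldNotFound(Exception):
    """Hypothetical error for the vendor's 'unknown booking ID' response."""


class FakeAirline:
    """Stand-in for a vendor client; real APIs will differ."""

    def __init__(self) -> None:
        self.holds = {"hold-1"}

    def release(self, hold_id: str) -> None:
        if hold_id not in self.holds:
            raise HoldNotFound(hold_id)
        self.holds.remove(hold_id)


def release_flight_hold(client: FakeAirline, hold_id: str) -> str:
    """Compensation that is a no-op success when the hold is already gone."""
    try:
        client.release(hold_id)
        return "released"
    except HoldNotFound:
        # Expired on its own or released by an earlier attempt:
        # there is nothing left to undo, so report success.
        return "already_released"
```

Only the "not found" case is swallowed. Timeouts and server errors still propagate, so the orchestrator keeps retrying those.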
The second one: partial compensations that appear to succeed but do not actually undo everything. The classic example is a charge that included a transaction fee. "Refund $50" reverses the principal, but the original charge carried a $1.50 vendor fee that is non-refundable. The customer ends up $1.50 short of their original balance. From a customer-experience standpoint the compensation has not really compensated. The fix is either to absorb the fee on your end (refund the principal and cover the $1.50 fee from your own funds) or to be explicit with the user that compensation is not always perfect and they may need to escalate. There is no purely technical answer here; the answer is a product decision about who pays for partial compensations.
Choreography vs orchestration
The literature talks a lot about this distinction. In choreography, each service publishes events when it completes a step, and the next service in the chain consumes those events. There is no central coordinator; the saga emerges from the event flow.
In orchestration, a single orchestrator service drives the saga, calling each step in order and handling compensations.
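To illustrate the choreography side, here is a minimal in-memory sketch of the booking flow as an event chain. The `EventBus` class and the event names are hypothetical, and each lambda stands in for a separate service; in production the bus would be a message broker and each handler would do real work before publishing.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class EventBus:
    """Synchronous in-memory bus; a stand-in for a real message broker."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.log: List[str] = []

    def subscribe(self, event: str, handler: Handler) -> None:
        self.handlers[event].append(handler)

    def publish(self, event: str, payload: Dict) -> None:
        self.log.append(event)
        for handler in self.handlers[event]:
            handler(payload)


def wire_choreography(bus: EventBus, card_ok: bool) -> None:
    # Forward path: each "service" reacts to the previous service's event.
    bus.subscribe("BookingRequested", lambda p: bus.publish("FlightHeld", p))
    bus.subscribe("FlightHeld", lambda p: bus.publish("HotelHeld", p))
    bus.subscribe(
        "HotelHeld",
        lambda p: bus.publish("CardCharged" if card_ok else "ChargeFailed", p),
    )
    bus.subscribe("CardCharged", lambda p: bus.publish("EmailSent", p))
    # Compensation path: also event-driven, in reverse order.
    bus.subscribe("ChargeFailed", lambda p: bus.publish("HotelReleased", p))
    bus.subscribe("HotelReleased", lambda p: bus.publish("FlightReleased", p))
```

No single function in this sketch describes the whole saga; its shape exists only in the subscription wiring.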
The advantage of choreography is loose coupling: each service knows only its predecessors and successors via events, not the full saga shape. The disadvantage is that the saga is implicit, distributed across services, and hard to debug. "Why did the booking fail at step three?" requires looking at logs in three different services.
The advantage of orchestration is that the saga is explicit and centralized: one service holds the state machine, one service decides when to compensate, and one log file shows the entire flow. The disadvantage is that the orchestrator becomes a coupling point; every saga change involves changing the orchestrator.
I have built both. My stance: orchestration is better for sagas with more than three steps, and choreography is fine for two-step or three-step flows. The reason is operational: when (not if) a saga goes wrong, debugging an orchestrator's state machine is dramatically easier than tracing events across services. The loose-coupling benefit of choreography matters less than the debuggability benefit of orchestration once the saga has more than a few hops.
The exception is when steps are owned by truly independent teams and the loose coupling is a feature. For our travel-booking saga, all steps were owned by the same team and orchestration was clearly right.
A worked orchestrator
The orchestrator I would build today, in pseudocode:
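A minimal sketch of that shape in runnable Python rather than pseudocode. The names `Step`, `SagaState`, `run_saga`, and the in-memory `DURABLE_STORE` are illustrative assumptions; a real orchestrator would persist to a database and wrap each call in retry logic.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for a durable store; use a database table in practice.
DURABLE_STORE: Dict[str, dict] = {}


@dataclass
class Step:
    name: str
    action: Callable[[], None]
    compensate: Optional[Callable[[], None]] = None  # None for read-only steps


@dataclass
class SagaState:
    saga_id: str
    status: str = "PENDING"
    current_step: int = 0
    completed: List[int] = field(default_factory=list)

    def persist(self) -> None:
        DURABLE_STORE[self.saga_id] = {
            "status": self.status,
            "current_step": self.current_step,
            "completed": list(self.completed),
        }


def run_saga(state: SagaState, steps: List[Step]) -> str:
    if state.status == "PENDING":
        # Resume after the last completed step (0 for a fresh saga).
        for i in range(len(state.completed), len(steps)):
            state.current_step = i
            state.persist()           # before the step: record what we are about to try
            try:
                steps[i].action()     # must be idempotent, a resume may re-run it
            except Exception:
                state.status = "COMPENSATING"
                state.persist()
                break
            state.completed.append(i)
            state.persist()           # after the step: record that it finished
        else:
            state.status = "COMPLETED"
            state.persist()

    if state.status == "COMPENSATING":
        # Reverse order, completed steps only.
        for i in reversed(list(state.completed)):
            compensate = steps[i].compensate
            if compensate is not None:
                compensate()          # if this raises, status stays COMPENSATING for a later resume
            state.completed.remove(i)
            state.persist()
        state.status = "COMPENSATED"
        state.persist()
    return state.status
```

Because `run_saga` starts from whatever `SagaState` it is given, the same function serves both a fresh saga and one reloaded from the store after a crash.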
Two non-obvious things in this code. First, state.persist() writes the saga state to a database (or durable store) before and after each step. If the orchestrator crashes mid-saga, restarting it can read the persisted state and resume (or compensate) from the right point. Without that persistence, a crash leaves the saga in limbo.
Second, the compensation loop iterates in reverse order, only over steps that were completed. If the saga failed at step 2, only steps 0 and 1 are compensated. The order matters for cases like "release the hotel hold before releasing the flight hold", where downstream compensations might depend on upstream state.
State persistence is non-negotiable
I want to emphasize this. A saga without durable state is a saga that loses bookings on crash.
The minimal state per saga is:
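One plausible shape for that row, sketched with SQLite. The table name, column names, and helper functions are assumptions for illustration, not the schema of the original system; the statuses PENDING and COMPENSATING are the ones the sweeper looks for.

```python
import json
import sqlite3
from typing import List, Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS saga_state (
    saga_id      TEXT PRIMARY KEY,
    booking_id   TEXT NOT NULL,
    status       TEXT NOT NULL,     -- PENDING, COMPLETED, COMPENSATING, COMPENSATED
    current_step INTEGER NOT NULL,
    completed    TEXT NOT NULL,     -- JSON list of finished step indexes
    last_error   TEXT,
    updated_at   REAL NOT NULL      -- epoch seconds, used to find stale sagas
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_state(
    conn: sqlite3.Connection,
    saga_id: str,
    booking_id: str,
    status: str,
    current_step: int,
    completed: List[int],
    now: float,
    last_error: Optional[str] = None,
) -> None:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO saga_state VALUES (?, ?, ?, ?, ?, ?, ?)",
            (saga_id, booking_id, status, current_step,
             json.dumps(completed), last_error, now),
        )


def load_state(conn: sqlite3.Connection, saga_id: str) -> Optional[dict]:
    row = conn.execute(
        "SELECT status, current_step, completed FROM saga_state WHERE saga_id = ?",
        (saga_id,),
    ).fetchone()
    if row is None:
        return None
    return {"status": row[0], "current_step": row[1], "completed": json.loads(row[2])}
```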
This goes in a SQL table, not in memory. On orchestrator crash and restart, a sweeper scans for sagas in PENDING or COMPENSATING state older than some threshold, and resumes them. The forward step or compensation must be idempotent (because resume can re-execute it after a crash that happened mid-step), which folds back into the compensation-design properties above.
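The sweeper's query can be sketched as follows. The `saga_state` table with `status` and `updated_at` columns is an assumed layout, and the five-minute default threshold is an arbitrary placeholder; `demo_connection` exists only to make the example runnable.

```python
import sqlite3
from typing import List


def find_stale_sagas(
    conn: sqlite3.Connection, now: float, threshold_s: float = 300.0
) -> List[str]:
    """Return IDs of unfinished sagas not updated within threshold_s, oldest first."""
    rows = conn.execute(
        "SELECT saga_id FROM saga_state "
        "WHERE status IN ('PENDING', 'COMPENSATING') AND updated_at < ? "
        "ORDER BY updated_at",
        (now - threshold_s,),
    ).fetchall()
    return [row[0] for row in rows]


def demo_connection() -> sqlite3.Connection:
    # Tiny fixture with one row per interesting case.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE saga_state (saga_id TEXT PRIMARY KEY, status TEXT, updated_at REAL)"
    )
    conn.executemany(
        "INSERT INTO saga_state VALUES (?, ?, ?)",
        [
            ("old-pending", "PENDING", 100.0),
            ("old-compensating", "COMPENSATING", 200.0),
            ("old-completed", "COMPLETED", 100.0),
            ("fresh-pending", "PENDING", 990.0),
        ],
    )
    return conn
```

The threshold should sit comfortably above the longest legitimate step duration, so the sweeper does not resume a saga that a live orchestrator is still working on.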
In the travel-booking saga, the state table had two thousand rows on a typical day and fewer than ten failed-and-stuck sagas per week. Those few cases were what the on-call engineer looked at on Monday morning. Having the state table made "why did this booking fail" answerable in five minutes.
Where sagas have not worked for me
Two cases I would flag as anti-patterns:
The first: a saga that wraps a single database transaction. If all four steps are in the same database, just use a database transaction. The complexity of saga state management, compensations, and orchestrator recovery is not worth it when the database can give you atomicity for free. Sagas are for cross-service coordination, not for in-database operations.
The second: a saga where the compensations are not reliable. If payment.refund() regularly fails and requires human intervention, the saga is structurally broken. The system will accumulate stuck sagas faster than humans can clear them. Either fix the underlying API to make refunds reliable (with retries, with idempotency keys, with whatever it takes), or accept that the workflow cannot use sagas and design something else (a queue with manual review, a lower-stakes operation).
A subtler anti-pattern is a saga that issues a compensating action for a step that was a query, not a state change. Queries do not need compensations. If your saga has a step that just reads data, the compensation list should have None (or the equivalent) for that step. I have reviewed sagas that included "compensations" for read-only steps; they were no-ops at best and wrong-thinking at worst.
Observability for sagas
A saga with no observability is impossible to operate. The minimum I would build includes a per-saga timeline log and a replay tool.
The timeline log is the highest-leverage piece. When a customer calls support saying "I tried to book but it failed", the support engineer should be able to find the saga by booking ID, see the full timeline (which steps ran, which succeeded, where it stopped, what compensations ran), and explain what happened in plain language. Without that timeline, the support engineer has to ask the on-call engineer, the on-call engineer has to grep three services' logs, and the customer waits. With the timeline, the support engineer can usually answer in seconds.
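A minimal sketch of a timeline renderer, assuming each step execution is recorded as one event. `TimelineEvent` and its fields are hypothetical, and sorting by the timestamp string only works if every event uses the same ISO-8601 format and time zone.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TimelineEvent:
    at: str           # ISO-8601 timestamp, same format and zone for all events
    step: str
    kind: str         # "forward" or "compensation"
    outcome: str      # "ok" or "failed"
    detail: str = ""


def render_timeline(booking_id: str, events: List[TimelineEvent]) -> str:
    """Render one saga's events as plain text a support engineer can read."""
    lines = [f"Saga for booking {booking_id}:"]
    for event in sorted(events, key=lambda e: e.at):
        verb = "ran" if event.kind == "forward" else "undid"
        suffix = f" ({event.detail})" if event.detail else ""
        lines.append(f"  {event.at}  {verb} {event.step}: {event.outcome}{suffix}")
    return "\n".join(lines)
```

The important design choice is the lookup key: the timeline is found by booking ID, the identifier the customer and the support engineer actually have, not by an internal saga ID.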
The replay tool is the next most important. After a code bug is fixed, you typically have a backlog of stuck sagas that should have succeeded. A tool that lets an operator pick a saga ID, advance its state to "resume from step N", and trigger a re-execution is the difference between a one-hour fix-and-replay and a one-day refund-and-rebuild.
What I tell teams designing their first saga
Three rules, in order:
- Sequence irreversible steps last. If you cannot compensate it, do it after every other step has succeeded. The booking flow puts "send email" last. The financial flow puts "transfer funds" last. The provisioning flow puts "send credentials" last.
- Make every step idempotent. A retrying orchestrator must be able to re-execute a step without compounding side effects. Use idempotency keys, dedupe tables, or natural idempotence at the operation level.
- Persist saga state. The orchestrator must be crash-safe. A saga is a long-running operation by definition; treating it as in-memory is a bug.
If your team can do those three things, the rest of the saga design (orchestration vs choreography, monitoring, tooling) becomes far more tractable. If your team cannot do those three things, no amount of clever pattern selection will save you.
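The second rule is the one most often hand-waved, so here is a minimal sketch of an idempotency-key dedupe layer. `IdempotentExecutor` is a hypothetical name, and the in-memory dictionary stands in for a durable dedupe table; in production the check and the record must happen atomically, or the key must be passed to a vendor that honors idempotency keys.

```python
from typing import Callable, Dict


class IdempotentExecutor:
    """Runs each keyed operation at most once and replays the stored result."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}   # stand-in for a durable dedupe table

    def run(self, key: str, operation: Callable[[], str]) -> str:
        if key in self._results:
            # A retry of a step that already ran: return the recorded result
            # without repeating the side effect.
            return self._results[key]
        result = operation()
        self._results[key] = result
        return result
```

A natural key for a saga step is the saga ID combined with the step name, for example `"saga-42:charge_card"`, so that a resumed orchestrator reuses the same key when it re-executes the step.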
A final claim
The saga pattern is widely talked about and partially understood. The talked-about part is the structural pattern (forward path, compensation path, orchestration vs choreography). The partially-understood part is that the structural pattern is the easy part. The hard parts are: designing compensations that genuinely invert their forward steps, sequencing irreversible operations last, persisting saga state durably, and operating the inevitable stuck sagas that only human action can unstick. A team that gets the structural pattern but misses the operational pattern ends up with a half-built saga that drops customer state on the floor. A team that gets both has a sound foundation for cross-service workflows that 2PC will never give them. The pattern earns its complexity, but only when the work below the surface is funded.
A practical aside on choosing the orchestrator's home. I have seen three options: a custom service that owns the saga state machine, a workflow engine like Temporal or Camunda, or a homegrown library inside the existing application. The custom service is the right answer for one or two sagas. The workflow engine is the right answer when you have five or more long-running flows; the engine handles state persistence, retries, timeouts, and replay for you, and the only code you write is the business logic of each step. The homegrown library is the wrong answer in almost every case: it ends up reinventing the engine badly, with bugs the engine fixed years ago. If you are designing your second saga and you wrote the first one custom, that is the moment to evaluate Temporal or something similar. Saving yourself the work of state machines and timers usually pays for the integration cost within a quarter.
