I have shipped infrastructure-as-code in all three of these tools across different teams: Terraform on a multi-cloud platform team, AWS CDK on a single-cloud product team, and Pulumi on a small startup. The community discussion of these tools tends to flatten down to language-vs-DSL flame wars. The actual differences that have determined whether a stack was a joy or a daily landmine were almost always about state model, drift detection, and blast radius, not syntax.
This comparison comes from operating each one, not just spinning up a sample project. It covers the trade-offs that matter, the picks I would make today by team shape, and the failure modes that recurred across all three.
What all three of them are doing
All three tools take a description of cloud infrastructure (a network, a database, a Lambda, an IAM role) and turn it into API calls against the cloud provider. They all keep a record of what was created (a state file for Terraform and Pulumi, the CloudFormation stack itself for CDK), so they can compute the diff between the desired state and the current state on the next run. They all support modules and reusable components. Where they differ is the language the description is written in, where the state lives, and how they handle the gap between the description and reality.
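The diff step can be sketched in a few lines. This is a deliberately simplified model, not how any of the three engines is implemented: the flat attribute maps, the `computePlan` name, and the resource addresses are all made up for illustration, and real planners also handle dependencies, replacement, and provider-specific rules.

```typescript
// Hypothetical, simplified model of a plan: real tools track far more
// than a flat string-to-string attribute map per resource.
type Attrs = Record<string, string>;
type Resources = Record<string, Attrs>; // keyed by resource address

type Action = "create" | "update" | "delete";

interface PlannedChange {
  address: string;
  action: Action;
}

// True when both attribute maps have the same keys and values.
function sameAttrs(a: Attrs, b: Attrs): boolean {
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  return keysA.length === keysB.length && keysA.every((k) => k in b && a[k] === b[k]);
}

// Compare what the code asks for (desired) with what the last run recorded (state).
function computePlan(desired: Resources, state: Resources): PlannedChange[] {
  const changes: PlannedChange[] = [];
  for (const address of Object.keys(desired)) {
    if (!(address in state)) {
      changes.push({ address, action: "create" });
    } else if (!sameAttrs(desired[address], state[address])) {
      changes.push({ address, action: "update" });
    }
  }
  // Anything recorded in state but no longer described gets deleted.
  for (const address of Object.keys(state)) {
    if (!(address in desired)) {
      changes.push({ address, action: "delete" });
    }
  }
  return changes;
}

// Usage with made-up addresses: one update, one create, one delete.
const examplePlan = computePlan(
  { "bucket.logs": { versioning: "on" }, "role.app": { path: "/" } },
  { "bucket.logs": { versioning: "off" }, "queue.old": { fifo: "false" } },
);
console.log(examplePlan);
```

Note what the sketch does not look at: the real cloud. If the recorded state is stale, the computed plan is wrong, which is why the refresh and drift steps discussed later matter.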
The state file is the part that breaks people. Read on.
Terraform: the lingua franca
Terraform's pitch: a declarative DSL (HCL) describes resources, and the planner computes what would change. It is mature, supported across every cloud, and has a registry of community modules.
Where Terraform shines:
- Multi-cloud or vendor-agnostic shops. Terraform's provider model treats AWS, GCP, Azure, Cloudflare, GitHub, and Datadog all the same way. A team that runs in two clouds is almost always a Terraform shop.
- Established teams with platform engineers. HCL is its own thing to learn, but it is unambiguous and predictable. The planner output is the most readable of the three (a tree of resource changes, easy to review in a PR).
- Drift detection. terraform plan with no changes shows you what has drifted in the cloud since the last apply. This is the single most useful operational feature of the tool.
Where Terraform hurts:
- Programming logic. Looping over a list to create N resources is a workout. The HCL for_each and count are powerful but get confusing when the list is dynamic. Anything that needs real conditional logic ends up as a tortured dynamic block or three.
- State management. The state file is precious. Concurrent applies can corrupt it; misnamed remote backends can fork it; deleting it is a recovery operation, not an undo. Locking via DynamoDB or Terraform Cloud is mandatory in any team setting.
- The provider/version pinning dance. Provider versions, Terraform CLI versions, module versions all have to agree across the team. CI lock files and .terraform-version markers help but are not free.
A tiny representative file:
Readable, deterministic, easy to review, easy to lint. This is HCL at its best.
AWS CDK: code that compiles to CloudFormation
CDK's pitch: write infrastructure in TypeScript / Python / Go / Java / .NET, and the framework synthesizes a CloudFormation template that AWS deploys. CloudFormation handles the actual orchestration; CDK is the authoring layer.
Where CDK shines:
- AWS-native teams. CDK is an AWS product, follows AWS API changes immediately, and integrates with CloudFormation's drift detection, change sets, and stack policies. If you live in AWS and only AWS, CDK is the smoothest path.
- High-level constructs. A RestApi construct creates the API Gateway, the IAM roles, the CloudWatch log groups, and the deploy stage with a few lines. Compare that to the 80 lines of HCL needed to express the same thing in Terraform.
- Programming-language ergonomics. Loops, conditionals, sharing helper functions across stacks, all native. Refactoring is a normal IDE operation, not a search-and-replace exercise on text files.
Where CDK hurts:
- CloudFormation's pace. New AWS services usually land in CloudFormation after they land in the AWS API. CDK can wrap them as escape-hatch L1 constructs, but for a few months you may be writing raw CloudFormation in your TypeScript.
- Asset uploading and bootstrapping. CDK uses an "asset bucket" in each account/region pair. You bootstrap the account once before any stack can deploy. The bootstrap stack is a moving target across CDK versions; a dormant stack can fail to redeploy because its bootstrapping is two versions stale.
- Generated CloudFormation can be huge. A 100-line CDK app can produce a 4000-line CloudFormation template. Reading the synthesized output during a debugging session is grim.
A representative snippet for the same S3 bucket:
The loop-over-environments and shared helpers cases are where CDK pulls ahead, and they are the cases that make HCL ugly.
Pulumi: programming languages all the way down
Pulumi's pitch: like CDK, but multi-cloud, with its own state management instead of CloudFormation. Write infrastructure in TypeScript, Python, Go, .NET, or Java; Pulumi's engine talks directly to the cloud APIs.
Where Pulumi shines:
- Multi-cloud teams that want a real programming language. This is the niche Pulumi was built for. Loops, helpers, conditionals, real package managers, all working across AWS / GCP / Azure / Cloudflare.
- Testing. You can unit-test Pulumi programs with Jest or pytest because they are real programs. Mocking out cloud calls works the way you would expect from any other test setup.
- Custom dynamic providers. Need to manage a third-party SaaS that nobody has written a Terraform provider for? Pulumi lets you write a custom provider in your normal language. The escape hatch is much more comfortable than Terraform's.
Where Pulumi hurts:
- Smaller community. Fewer Stack Overflow answers, fewer pre-baked modules, fewer war stories online. The official docs are good; the long tail of "someone hit this exact bug last year" is thin.
- State backend lock-in pressure. Pulumi's free tier puts state in their cloud. Self-hosting state in S3 is supported but is the less-trodden path; you will be the one debugging odd state issues if they appear.
- Pricing on the SaaS plan. Once your team scales, the per-resource pricing on Pulumi Cloud is something to budget for. Self-hosting state mitigates this but adds operational work.
The same S3 bucket, in Pulumi TypeScript:
Reads like the CDK version, runs against a multi-cloud engine. The cost is the smaller community and the more custom-feeling path for state.
A side-by-side that fits on one screen
*CDK has a CDK-for-Terraform variant (CDKTF) and a CDK-for-Kubernetes (CDK8s); the table refers to vanilla AWS CDK.
The failure modes that hit all three
After shipping each of these in production, three classes of pain were not specific to any one tool:
Drift between the IaC source and reality. Someone clicks something in the AWS console. Now the next plan/apply will revert it. Or a different IaC stack manages an overlapping resource and the two stacks fight. Drift detection (terraform plan, cdk diff, pulumi refresh) is the answer; running it weekly on every stack is the discipline. None of the tools makes this discipline automatic. You have to schedule the runs.
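For the Terraform case, a scheduled check can lean on the documented exit codes of terraform plan -detailed-exitcode (0 for no changes, 1 for an error, 2 for a non-empty diff). On an unchanged checkout, a non-empty diff means reality has moved. The sketch below only interprets results; the scheduler, the process runner, and the stack names are assumptions left out or invented for illustration.

```typescript
type DriftStatus = "clean" | "drifted" | "error";

// Exit codes documented for `terraform plan -detailed-exitcode`:
// 0 = succeeded with no changes, 1 = error, 2 = succeeded with changes present.
function classifyPlanExit(exitCode: number): DriftStatus {
  if (exitCode === 0) return "clean";
  if (exitCode === 2) return "drifted";
  return "error"; // 1, or anything unexpected, is treated as a failed check
}

// Hypothetical shape: one entry per stack from a scheduled run.
interface StackResult {
  stack: string;
  exitCode: number;
}

// Summarize the run; anything that is not clean needs a human to look at it.
function stacksNeedingAttention(results: StackResult[]): string[] {
  return results
    .filter((r) => classifyPlanExit(r.exitCode) !== "clean")
    .map((r) => `${r.stack}: ${classifyPlanExit(r.exitCode)}`);
}

// Usage with made-up stack names.
console.log(
  stacksNeedingAttention([
    { stack: "network", exitCode: 0 },
    { stack: "database", exitCode: 2 },
    { stack: "edge", exitCode: 1 },
  ]),
);
```

Treating a failed check the same as drift is a deliberate choice in this sketch: a stack whose plan cannot run is a stack nobody can vouch for.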
Blast radius from a single change. A renamed resource gets recreated, not modified. A small change to a security group sometimes triggers a database failover (because the database depends on it). The plan output is the only defense; reading it carefully before applying is the most underrated CI/CD discipline. I have a personal rule: any plan that says "replace" on a stateful resource (database, load balancer, EFS volume) requires a second pair of eyes and a written ack in the PR. Cross-team incidents from accidental replaces are the most expensive class of IaC mistake I have seen.
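That rule can be partly automated for Terraform. A saved plan rendered with terraform show -json lists each resource under resource_changes, with a change.actions array; a replace appears as a delete paired with a create. The sketch below flags any change that deletes a stateful resource, which covers both replace and plain destroy. The list of stateful types and the function name are my own illustrative choices, not a standard, and a CI job would still need to turn the result into a required review.

```typescript
// Minimal subset of the JSON plan representation used here.
interface ResourceChange {
  address: string;
  type: string;
  change: { actions: string[] };
}

interface PlanJson {
  resource_changes?: ResourceChange[];
}

// Assumption: an illustrative list only; extend it with your own stateful types.
const STATEFUL_TYPES = new Set(["aws_db_instance", "aws_lb", "aws_efs_file_system"]);

// Return a human-readable line for every stateful resource the plan would delete.
// A replace shows up as a delete paired with a create, so it is caught too.
function flagStatefulDeletes(plan: PlanJson): string[] {
  return (plan.resource_changes ?? [])
    .filter((rc) => STATEFUL_TYPES.has(rc.type) && rc.change.actions.includes("delete"))
    .map((rc) => `${rc.address} (${rc.change.actions.join("+")})`);
}

// Usage with a hand-written plan fragment; the addresses are made up.
const samplePlan: PlanJson = {
  resource_changes: [
    { address: "aws_db_instance.main", type: "aws_db_instance", change: { actions: ["delete", "create"] } },
    { address: "aws_security_group.app", type: "aws_security_group", change: { actions: ["update"] } },
  ],
};
console.log(flagStatefulDeletes(samplePlan));
```

A check like this does not replace reading the plan. It only makes sure the most expensive class of change cannot slip through a 600-line diff unnoticed.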
The big-bang refactor. A 2000-line stack works. The team wants to split it into four smaller stacks. The migration is hand-rolled (move resources between state files, hope), and one wrong step recreates a critical database. All three tools have a state mv / state import mechanism, and all three of them are operationally tense to use in anger. I have done this on every tool and the rule is the same: do it during a low-traffic window, do it with a second engineer watching, and stage every move via a plan-then-apply rather than batching them.
What I would pick today, by team shape
- Multi-cloud platform team, established engineering org: Terraform. The DSL is an annoyance up front, but the predictability, drift detection, and ecosystem maturity pay off when the stack lives for years and people rotate through.
- AWS-only product team, mostly application engineers: CDK. The high-level constructs save real time, the language is one the team already speaks, and the AWS-native integration removes a class of provider-version pain.
- Multi-cloud startup, small team, application engineers wear the IaC hat: Pulumi. The same language as the app, real testing, multi-cloud reach. Mind the state-backend choice and budget for the SaaS plan from the start.
- Existing Terraform shop being asked to move: stay on Terraform. The migration cost is real and the wins are smaller than people expect once a team is fluent.
The choice that actually matters more than the tool
The biggest IaC outage I have caused was an accidental destroy of a production load balancer because the plan output was 600 lines and I trusted it. The biggest one I have prevented was catching a similar replace in the plan and pausing the apply. Neither of those moments turned on the language; they turned on whether the plan output was being read carefully and whether the change had been split small enough that the diff was reviewable. The day-to-day quality of IaC in a team is mostly about discipline (small changes, careful plan reviews, tested escape hatches, weekly drift checks), and the tool choice barely cracks the top three factors.
Pick one, write the tooling around the discipline, and stop arguing about syntax. The engineers who thrive in any of these three are the ones who treat the plan output as a first-class artifact and review it the way they would review a hot-path code change. The engineers who get burned are the ones who skim the plan, hit apply, and find out about the consequences from the customer.
