System Design Article
Serverless Architecture & FaaS
Difficulty: Medium
Serverless does not mean 'no servers'. It means the cloud provider runs the servers, scales them to zero when idle, and bills you per request rather than per running hour. Functions-as-a-Service (AWS Lambda, Cloud Functions, Azure Functions) is the most visible flavor. The pattern is powerful for spiky workloads, glue code, and small teams who want to skip the infrastructure tax. It is a poor fit for steady high-throughput services, latency-critical paths, and stateful systems. This lesson covers how serverless actually executes (cold starts, warm pools, concurrency limits), the architectural patterns it enables, the patterns it breaks, and the honest cost model.
Motivation
The traditional model of running a service is: you pick an instance size, you start the process, you keep it running 24/7, you scale it manually or with auto-scaling rules, and you pay for every hour the instance exists whether it is doing work or not.
For a service with steady traffic that fully utilizes its CPU and memory, this is fine. For most workloads, it is wasteful:
- A nightly batch job: 10 minutes of work; 23 hours 50 minutes of idle paid compute.
- An image-resize endpoint that gets 5 requests per minute on average and 200 per minute during a viral moment: dramatically over- or under-provisioned almost all the time.
- A webhook receiver: bursty by definition; sometimes zero traffic for hours.
- A small internal API: maybe 1000 requests per day; a full instance is enormous overkill.
Serverless inverts the model: the cloud provider keeps the runtime warm somewhere, you give them code that runs per event, and you pay only for the milliseconds your code actually executed. Idle costs nothing. Auto-scaling is automatic and goes to zero. Operations (patching, capacity planning, instance failure recovery) are no longer your problem.
The price you pay: cold starts, vendor-specific runtimes, hard execution limits, painful state management, observability that is harder than it should be, and a cost model that flips on you at high steady throughput.
Knowing both the wins and the costs is what makes the architectural choice 'serverless or not' a senior-level decision rather than a 'because everyone does it' default.
Deep Dive
What 'serverless' actually means
The term spans several distinct services. Pinning down exactly what you mean is half of any interview answer.
| Layer | Examples | What 'serverless' means here |
|---|---|---|
| FaaS (functions) | AWS Lambda, GCP Cloud Functions, Azure Functions, Cloudflare Workers | You upload a function. Provider runs it per event. Pay per invocation + GB-seconds. |
| Serverless containers | AWS Fargate, Cloud Run, Azure Container Apps | You upload a container. Provider scales instances 0..N based on traffic. Pay per CPU-second / GB-second. |
| Serverless databases | DynamoDB on-demand, Aurora Serverless, Firestore, Cosmos DB serverless | You write queries; capacity scales automatically; pay per request unit. |
| Serverless event services | EventBridge, SNS, SQS, Kinesis on-demand | Pay per event; no broker to size. |
The common thread: no instances to manage, scale to zero when idle, billed by usage. This lesson focuses primarily on FaaS because it is the prototype, but the pattern generalizes.
How a FaaS request actually executes
When a function is triggered (HTTP, event, schedule, queue message), the provider needs an executable environment. There are three lifecycle states.
+--- Init (cold start) ----+   +--- Invoke (warm) ----+   +--- Idle ----+
| 1) Find/start a sandbox  |   | reuse same sandbox   |   | sandbox     |
| 2) Download code/layers  |-->| just call the handler|-->| kept warm   |
| 3) Start runtime         |   | (1-10 ms overhead)   |   | for ~5-15 m |
| 4) Run init code         |   |                      |   |             |
| 5) Call handler          |   |                      |   |             |
+--------------------------+   +----------------------+   +-------------+
  typically 50ms - 2s            fast                     provider tears
  (much worse for JVM/           path                     down silently
  .NET, much better for
  Go / Rust / Cloudflare
  isolates)

Cold start: the first request to a new sandbox pays for sandbox creation, code download, runtime startup, and any init code (e.g., loading models, opening connections). Cold-start times in 2026 typically range from ~50 ms (Cloudflare Workers, AWS Lambda SnapStart) to ~2 s (large JVM or .NET Lambdas with heavy init).
Warm invocation: subsequent requests to the same sandbox skip init; only the handler runs. Per-invocation overhead is in single-digit milliseconds.
Idle: after some time without traffic (typically 5-15 minutes), the provider tears down the sandbox. The next request pays a cold start.
Why does this matter? Because cold starts dominate the tail latency of low-traffic functions. A function invoked less often than the idle timeout (say, once every 20 minutes) pays a cold start on nearly every request, so its p99, and often its median, is the cold-start time rather than the warm-path time.
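To put a rough number on how often a low-traffic function hits a cold start, the sketch below assumes a single sandbox, Poisson (memoryless) arrivals, and a fixed idle timeout. None of these come from a provider specification: real idle timeouts are unpublished and variable, so treat the output as an order-of-magnitude estimate only.

```python
import math


def cold_start_probability(requests_per_minute: float, idle_timeout_minutes: float) -> float:
    """Chance that a request arrives after the sandbox has already been reclaimed.

    Assumes Poisson arrivals: the gap between consecutive requests is
    exponentially distributed, so P(gap > timeout) = exp(-rate * timeout).
    """
    if requests_per_minute <= 0:
        return 1.0  # no traffic keeps the sandbox alive, so every request is cold
    return math.exp(-requests_per_minute * idle_timeout_minutes)


if __name__ == "__main__":
    # The 10-minute idle timeout is an assumed value inside the 5-15 minute range.
    for rate in (1.0, 0.2, 0.05):
        p = cold_start_probability(rate, idle_timeout_minutes=10)
        print(f"{rate:>5} req/min -> {p:.1%} of requests hit a cold start")
```

Under these assumptions the cold-start share is negligible at one request per minute and climbs steeply once the average gap between requests approaches the idle timeout.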
Concurrency model
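A FaaS function processes one request per sandbox at a time. This is the standard model in AWS Lambda and first-generation Cloud Functions; Azure Functions and second-generation Cloud Functions can serve several requests per instance. If 100 requests arrive simultaneously under the one-per-sandbox model, the provider spins up 100 sandboxes. This is wonderful for scaling but it has consequences: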
- Connection storms: every cold sandbox opens its own database connection. 1000 concurrent invocations = 1000 DB connections. RDS connection limits get hit fast.
- Burst limits: every provider has a per-account concurrency cap (e.g., AWS Lambda's 1000 default per region). Hit it and requests get throttled.
- No internal state: a counter you increment in memory dies with the sandbox; a cache you build in memory benefits one sandbox only.
- Per-request CPU and memory: AWS Lambda allocates CPU proportionally to memory. A 128 MB function gets a small fraction of a core; a 1769 MB function gets a full vCPU.
Cloudflare Workers and similar 'isolate'-based runtimes break the one-request-per-sandbox rule (multiple requests share a V8 isolate), trading some isolation for much lower per-request overhead and far better cold-start behavior.
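Under the one-request-per-sandbox model, the number of sandboxes needed at steady state follows from Little's law: concurrency is arrival rate times average duration. The sketch below is a back-of-the-envelope helper; the function names are illustrative and the 1000 default simply mirrors the account cap quoted above.

```python
import math


def required_concurrency(requests_per_second: float, avg_duration_seconds: float) -> int:
    """Little's law: in-flight requests = arrival rate * time in system."""
    # Round first so floating-point noise (e.g. 100.00000000000001) does not
    # push the ceiling up by one.
    return math.ceil(round(requests_per_second * avg_duration_seconds, 9))


def is_throttled(requests_per_second: float, avg_duration_seconds: float,
                 concurrency_cap: int = 1000) -> bool:
    """True when steady-state demand exceeds the account concurrency cap."""
    return required_concurrency(requests_per_second, avg_duration_seconds) > concurrency_cap


if __name__ == "__main__":
    print(required_concurrency(500, 0.2))   # sandboxes needed at 500 RPS, 200 ms each
    print(is_throttled(6000, 0.2))          # demand vs. the default cap of 1000
```

The same figure is also the upper bound on database connections when each sandbox holds one connection, which is why the connection-storm problem scales with concurrency rather than with request rate.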
Hard limits to respect
Every FaaS platform has rigid limits that constrain architecture:
| Limit | AWS Lambda (typical) | Implication |
|---|---|---|
| Max execution time | 15 minutes | Long-running jobs need to be split or run elsewhere |
| Max memory | 10 GB | Large in-memory computations or models may not fit |
| Max payload (sync) | 6 MB | Large uploads must go through S3 with a presigned URL |
| Max payload (async) | 256 KB | Event size bounded |
| Temp storage | 10 GB (/tmp) | Large data must come from object storage |
| Concurrency cap | 1000 per account per region by default | High-traffic functions need a quota increase |
| Cold start | 50 ms - 2 s | Latency-sensitive paths must keep warm or use provisioned concurrency |
These are not optional. Designing serverless without naming them is a junior signal.
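One way to make these limits part of a design review is to encode them and check a proposed workload against them. The sketch below uses the AWS Lambda figures from the table; the `Workload` type and its field names are hypothetical, invented here for illustration.

```python
from dataclasses import dataclass

# Limits copied from the table above (AWS Lambda, typical values).
LIMITS = {
    "runtime_s": 15 * 60,
    "memory_mb": 10 * 1024,
    "sync_payload_bytes": 6 * 1024 * 1024,
    "tmp_storage_mb": 10 * 1024,
}


@dataclass
class Workload:
    """Hypothetical description of what one invocation needs."""
    runtime_s: float
    memory_mb: int
    sync_payload_bytes: int
    tmp_storage_mb: int


def limit_violations(workload: Workload) -> list[str]:
    """Return the names of every limit the workload would exceed."""
    return [
        name
        for name, ceiling in LIMITS.items()
        if getattr(workload, name) > ceiling
    ]


if __name__ == "__main__":
    video_job = Workload(runtime_s=40 * 60, memory_mb=4096,
                         sync_payload_bytes=50 * 1024 * 1024, tmp_storage_mb=2048)
    print(limit_violations(video_job))
```

A non-empty result is the cue to split the job, move the payload to object storage, or run the work on a container platform instead.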
State: where it lives
FaaS is fundamentally stateless compute. You cannot keep state across invocations. Every state need has to land somewhere external:
- Caching layer: Redis (ElastiCache / MemoryDB), DynamoDB, or in-memory only when warmth is acceptable.
- Database: DynamoDB and similar serverless databases pair naturally; relational DBs need a connection pooler (RDS Proxy) to survive concurrency storms.
- Workflow state: Step Functions or Durable Functions store the state of a multi-step workflow so individual steps can be stateless.
- Object storage: S3 / GCS for any large artifacts (uploads, intermediate files).
A realistic serverless architecture is one part code, three parts managed services.
Triggers: the event surface
The interesting thing about FaaS is that anything can be a trigger:
- HTTP request via API Gateway or function URL.
- Object created in S3.
- Message arrived in SQS, SNS, EventBridge, Kinesis.
- Row changed in DynamoDB stream.
- Scheduled by cron expression.
- IoT device event.
- Workflow step in Step Functions.
This trigger ecosystem is the real win. Wiring 'when a file lands in S3, transcode it, store the result, send a notification' becomes four small functions and three event subscriptions. No long-running orchestrator to babysit.
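As a minimal sketch of the first link in such a chain, here is a Python handler that reads the objects named in an S3 notification event. The `Records[].s3.bucket.name` and `Records[].s3.object.key` fields follow the S3 event notification format, where keys arrive URL-encoded; the handler body that prints instead of transcoding is a placeholder, not a real implementation.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event: dict) -> list[tuple[str, str]]:
    """Pull (bucket, key) pairs out of an S3 notification event."""
    objects = []
    for record in event.get("Records", []):
        s3 = record.get("s3")
        if s3 is None:
            continue  # not an S3 record; ignore it
        bucket = s3["bucket"]["name"]
        # Keys are URL-encoded in the event, with spaces sent as '+'.
        key = unquote_plus(s3["object"]["key"])
        objects.append((bucket, key))
    return objects


def handler(event, context):
    """Lambda-style entry point; the transcoding step is only a placeholder."""
    objects = extract_s3_objects(event)
    for bucket, key in objects:
        print(f"would transcode s3://{bucket}/{key}")
    return {"processed": len(objects)}
```

Keeping the event parsing in a pure function makes it testable locally with a hand-written event dictionary, without deploying anything.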
Implementation
A canonical serverless pipeline
A classic 'upload, process, notify' workflow looks like this:
  user upload
       |
       v
+-------------+  S3:ObjectCreated   +------------+  writes to  +-----------+
| S3 bucket   | ------------------> | Lambda A   | ----------> | DynamoDB  |
+-------------+                     | (process)  |             +-----------+
                                    +-----+------+
                                          |
                                          | publish event
                                          v
                                   +-------------+
                                   | EventBridge |
                                   +------+------+
                                          |
                                          v
                                   +-------------+     +----------------+
                                   | Lambda B    | --> | SES / SNS push |
                                   | (notify)    |     +----------------+
                                   +-------------+

Notable features:
- No servers to manage anywhere. Every box is managed.
- Failure isolation: if Lambda B fails, it is retried independently of Lambda A; events that still fail after retries can be routed to a dead-letter queue and replayed later.
- Scale: the pipeline auto-scales from 1 file per hour to 10,000 per minute without any code change.
- Cost at idle: zero. You pay only when files arrive.
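The 'publish event' hop in the pipeline can be sketched as a small function that builds the entry Lambda A would hand to EventBridge. The entry shape (`Source`, `DetailType`, `Detail` as a JSON string, `EventBusName`) matches what the EventBridge `PutEvents` call accepts; the source name, detail type, and detail fields below are made up for illustration.

```python
import json


def build_processing_complete_entry(bucket: str, key: str, bus_name: str = "default") -> dict:
    """Build one EventBridge entry announcing that an upload was processed."""
    return {
        "Source": "example.media-pipeline",     # hypothetical source name
        "DetailType": "ProcessingComplete",     # hypothetical detail type
        "Detail": json.dumps({"bucket": bucket, "key": key}),  # must be a JSON string
        "EventBusName": bus_name,
    }


if __name__ == "__main__":
    entry = build_processing_complete_entry("uploads", "raw/video.mp4")
    print(entry)
    # Inside Lambda A this would typically be sent with boto3:
    #   boto3.client("events").put_events(Entries=[entry])
```

Lambda B never needs to know about Lambda A; it only subscribes to a rule matching this source and detail type, which is what keeps the two functions independently deployable.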
Cold start mitigation
Four established techniques:
- Provisioned concurrency (AWS Lambda) / min instances (Cloud Run, Cloud Functions): keep N sandboxes pre-warmed. Pay for them whether used or not. The classic 'serverless minus the scale-to-zero benefit' trade.
- SnapStart (AWS Lambda Java, .NET, Python): take a snapshot of the initialized runtime and resume from snapshot instead of cold-booting. Often 90% reduction in cold-start time.
- Smaller, faster runtimes: Go, Rust, and Node tend to have sub-100 ms cold starts; large JVM apps have multi-second cold starts. Choosing the runtime is choosing the cold-start floor.
- Architecture: keep latency-sensitive paths off FaaS or put a fast cache in front of them.
Not a fix: 'pinging the function every minute to keep it warm'. This works for one sandbox. Under any concurrency, new sandboxes still cold-start.
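The arithmetic behind that last point is simple enough to write down. This is a deliberately simplified model, assuming every request in a burst arrives at once and each sandbox serves one request at a time.

```python
def cold_starts_in_burst(concurrent_requests: int, warm_sandboxes: int) -> int:
    """Requests in a simultaneous burst that must wait for a new sandbox."""
    return max(0, concurrent_requests - warm_sandboxes)


if __name__ == "__main__":
    burst = 100
    # A keep-warm ping holds exactly one sandbox warm.
    print("ping only:", cold_starts_in_burst(burst, warm_sandboxes=1))
    # Provisioned concurrency of 50 (an example setting) halves the damage.
    print("provisioned 50:", cold_starts_in_burst(burst, warm_sandboxes=50))
```

The number of warm sandboxes has to be sized against expected burst concurrency, which is exactly the capacity-planning exercise scale-to-zero was supposed to remove.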
The connection-storm problem
A naive serverless function that opens a Postgres connection per invocation works fine at low traffic. Under burst it opens hundreds or thousands of connections in seconds and the database falls over.
The standard fixes:
- Use a connection pooler outside the function (RDS Proxy, PgBouncer). The function sees a small pool; the pool multiplexes onto a small number of real DB connections.
- Use a serverless database (DynamoDB, Aurora Serverless v2, Firestore) that scales connections naturally.
- Use HTTP-based data APIs (Aurora Data API, Hasura, Supabase REST) that hide the connection layer.
- Keep connections in module scope (outside the handler) so warm invocations reuse them within a sandbox. This caps connections at 'max concurrency' rather than 'requests per second'.
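The module-scope pattern from the last bullet can be sketched as follows. To keep the example runnable without a database server, it uses the standard-library `sqlite3` module as a stand-in for a Postgres driver; the `connections_opened` counter exists only to make the reuse visible.

```python
import sqlite3

# Module scope: this state survives across warm invocations of one sandbox.
_connection = None
connections_opened = 0  # illustration only: counts real connection attempts


def get_connection():
    """Open the connection lazily on first use, then reuse it."""
    global _connection, connections_opened
    if _connection is None:
        # Stand-in for something like a Postgres connect call.
        _connection = sqlite3.connect(":memory:")
        connections_opened += 1
    return _connection


def handler(event, context):
    """Lambda-style entry point that reuses the sandbox-wide connection."""
    conn = get_connection()
    row = conn.execute("SELECT 1").fetchone()
    return {"ok": row[0] == 1, "connections_opened": connections_opened}


if __name__ == "__main__":
    handler({}, None)
    print(handler({}, None))  # second warm call reuses the first connection
```

A production version would also need to detect a stale connection and reconnect, since the database may drop idle connections while the sandbox is frozen between invocations.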
Observability is harder
With serverless you get:
- Metrics: per-function invocation count, duration, errors, throttles, cold-start count. Easy.
- Logs: per-invocation logs to CloudWatch / Cloud Logging (formerly Stackdriver). Easy but expensive to query at scale.
- Traces: distributed tracing across function chains needs explicit propagation (X-Ray, OpenTelemetry SDK). Harder.
- Profiling and debugging: no SSH to attach a debugger. You add log statements and redeploy. The feedback loop is slower than for a long-running service.
A real serverless platform investment includes structured logging, trace context propagation, and cost-aware log retention.
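A minimal sketch of structured logging with trace propagation might look like this. `context.aws_request_id` is the request identifier on the Lambda context object; carrying a `trace_id` field inside the event is a convention assumed here for illustration, not something the platform provides.

```python
import json
import time
from typing import Optional


def log_line(message: str, request_id: str, trace_id: Optional[str] = None, **fields) -> str:
    """Render one log record as a single JSON line that log tooling can query."""
    record = {
        "ts": time.time(),
        "message": message,
        "request_id": request_id,
        "trace_id": trace_id,
        **fields,
    }
    return json.dumps(record, sort_keys=True)


def handler(event, context):
    """Reuse the caller's trace id if present, else start one from the request id."""
    trace_id = event.get("trace_id") or context.aws_request_id
    print(log_line("invocation started", context.aws_request_id, trace_id))
    # Downstream events should carry the same trace_id so logs can be joined.
    return {"trace_id": trace_id}
```

With one JSON object per line, a single query on `trace_id` can stitch together the logs of every function an upload passed through.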
Cost model and the crossover point
Serverless cost is roughly:
cost = invocations * unit_price + GB-seconds * memory_price

AWS Lambda (representative 2026 pricing):
- $0.20 per million requests.
- $0.0000166667 per GB-second of compute.
Worked example: a 200 ms function with 512 MB of memory invoked 10 million times a day:
- Requests: 10M * 30 days = 300M -> ~$60/month for invocation cost.
- GB-seconds: 0.5 GB * 0.2 s * 300M = 30M GB-s -> ~$500/month for compute.
- Total: ~$560/month.
A t3.medium EC2 instance running the same workload as a long-running service can be had for ~$30/month. Even with 4-5 instances behind a load balancer for redundancy and burst, you are looking at ~$200/month all-in.
At steady high throughput, dedicated compute is dramatically cheaper. The break-even point is workload-dependent but usually lands around 'busy more than ~30% of the day'. Below that, serverless wins on cost AND operations.
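The worked example can be reproduced with a few lines of Python, which also makes it easy to find the request volume at which Lambda matches a fixed monthly bill. The prices are the representative figures quoted above; the sketch ignores the free tier, data transfer, and API Gateway charges.

```python
REQUEST_PRICE_PER_MILLION = 0.20      # USD, from the pricing above
PRICE_PER_GB_SECOND = 0.0000166667    # USD, from the pricing above


def lambda_monthly_cost(requests_per_day: float, duration_s: float,
                        memory_gb: float, days: int = 30) -> float:
    """Invocation cost plus compute cost for one month."""
    requests = requests_per_day * days
    request_cost = requests / 1_000_000 * REQUEST_PRICE_PER_MILLION
    compute_cost = requests * duration_s * memory_gb * PRICE_PER_GB_SECOND
    return request_cost + compute_cost


def break_even_requests_per_day(fixed_monthly_cost: float, duration_s: float,
                                memory_gb: float, days: int = 30) -> float:
    """Daily request volume at which Lambda costs the same as a fixed bill."""
    cost_per_request = (REQUEST_PRICE_PER_MILLION / 1_000_000
                        + duration_s * memory_gb * PRICE_PER_GB_SECOND)
    return fixed_monthly_cost / (cost_per_request * days)


if __name__ == "__main__":
    print(round(lambda_monthly_cost(10_000_000, 0.2, 0.5), 2))
    print(round(break_even_requests_per_day(200, 0.2, 0.5)))
```

Against the ~$200/month dedicated setup described above, the break-even for this function shape comes out at a few million requests per day; above that, the fixed fleet is cheaper.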
When to Use
Strong fit for serverless
- Spiky / unpredictable traffic (webhook receivers, image processing, IoT ingestion).
- Event-driven glue code: 'when X happens, do Y' wiring between managed services.
- Low-traffic APIs and admin tools where idle cost dominates.
- Scheduled jobs: cron-style work that runs minutes per day.
- Stateless transformations: data validation, format conversion, enrichment.
- Edge compute: Cloudflare Workers / Lambda@Edge for personalization, A/B testing, auth at the edge.
- Prototypes and MVPs where shipping speed matters more than per-request cost.
Probably do not use serverless
- Steady high-throughput services where dedicated compute is cheaper and warmer.
- Latency-critical paths (sub-50ms p99 targets) where cold starts can blow the SLO.
- Long-running jobs beyond the platform's max execution time (15 min on Lambda).
- Stateful workloads that benefit from in-memory state, sticky sessions, or open connections.
- WebSocket / streaming connections at scale (the connection model fights the per-invocation billing).
- Heavy CPU or GPU work that exceeds FaaS resource limits.
- Vendor-portability requirements where being locked into provider-specific event services is unacceptable.
Compared to alternatives
| Pattern | When it wins | When it loses |
|---|---|---|
| FaaS (Lambda) | Spiky, event-driven, scale-to-zero | Steady high QPS, sub-50ms latency, long jobs |
| Serverless containers (Cloud Run, Fargate) | Same scale-to-zero benefit but you control the container; longer max runtime | Higher per-request floor than pure FaaS; still a vendor lock |
| Managed K8s (EKS, GKE) | Mixed workloads, polyglot, full control | Operational overhead is real |
| Self-managed VMs | Steady traffic, cost-optimized at scale | All ops on you |
| PaaS (Heroku, Fly.io, Render) | Small teams, simple apps, fast deploys | Cost at scale; less flexibility |
Hybrid is the realistic answer
Most serious companies running serverless run a hybrid: long-running services for the steady core, FaaS for the spiky edges (webhooks, image processing, scheduled tasks). The interview answer 'serverless for these workloads, dedicated compute for these other workloads, here is my reason' is consistently stronger than 'all-in on either side'.
Case Studies
iRobot: serverless from day one
iRobot publicly described running their entire backend (millions of Roomba devices, billions of telemetry events) on AWS Lambda + DynamoDB + Kinesis with no traditional servers. Their team is famously small for the workload because they never built any of the infrastructure those services replace. This is one of the canonical 'serverless at scale' references.
Lesson: an event-driven IoT workload is almost the textbook fit for serverless, and the operational savings are enormous when the team is small relative to the workload.
Netflix: serverless for media pipelines
Netflix uses AWS Lambda extensively for video processing pipelines: each new master file triggers transcoding, validation, packaging, and publishing through a chain of functions. The streaming serving path itself remains on long-running services because of the QPS and latency profile, but the asynchronous media-processing work is a natural FaaS fit.
Lesson: a single company's architecture mixes serverless and dedicated compute deliberately; the boundary is set by per-workload economics and latency.
Coca-Cola vending machines
Coca-Cola described a serverless rewrite of their vending-machine payment system that cut backend operating cost dramatically by moving from always-on EC2 to per-transaction Lambda. The workload is naturally bursty (transactions during the day, near zero overnight) and matches the serverless cost model precisely.
Lesson: workloads with high idle ratios are where serverless wins on raw dollars, not just operations.
A typical cautionary tale: the always-busy API
Many engineering blogs describe teams porting a steady-throughput API (constant ~1000 RPS) to Lambda and watching the bill explode. The cost per million invocations multiplied by the constant traffic exceeded what a small autoscaling group of EC2s would cost by 5-10x. The teams either added provisioned concurrency (which mostly negates the serverless cost advantage) or migrated those endpoints back to ECS / Fargate.
Lesson: serverless is not a universal cost win. The break-even with dedicated compute lands in favor of dedicated when traffic is steady and high.
Cloudflare Workers: a different cold-start story
Cloudflare Workers run on V8 isolates rather than per-request containers. Cold starts are typically <5 ms because there is no container to spin up; the runtime stays warm and the function is just a new isolate inside it. This makes Workers viable for edge personalization on every page load, a workload that would be marginal on traditional Lambda due to cold starts.
Lesson: 'serverless cold start' is platform-dependent. Pick the runtime that matches the latency requirements; isolates are a different point on the trade-off curve from per-request containers.
Quick Review
- Serverless = scale to zero, pay per usage, ops handled by the provider. FaaS is the most visible flavor.
- Cold starts dominate the tail latency of low-traffic functions; warm invocations are fast.
- Concurrency model is one request per sandbox (except for isolate runtimes like Cloudflare Workers).
- Hard limits (15 min max runtime, 10 GB memory, 6 MB payload, account concurrency caps) drive architecture.
- State lives outside: in serverless DBs, caches, object storage, or workflow engines.
- Cost wins for spiky workloads; loses for steady high-throughput workloads.
- Hybrid (FaaS at the edges, dedicated compute for the steady core) is the realistic enterprise pattern.
Real-World Examples
How real systems implement this in production
iRobot
Runs the backend for millions of Roomba devices on a serverless stack (AWS Lambda, DynamoDB, Kinesis, IoT Core). Famously operates with a small team because there are no servers to patch, scale, or replace. A canonical reference for IoT workloads where event volume is huge but per-event work is small.
Trade-off: Trades raw cost-per-request for operational simplicity; works because device traffic is naturally bursty and the team is small.
Netflix
Uses AWS Lambda for the asynchronous video-processing pipeline (transcoding orchestration, validation, packaging) while keeping the actual streaming-serving path on long-running services. Demonstrates the 'hybrid by workload' pattern: serverless for spiky asynchronous work, dedicated compute for steady high-QPS serving.
Trade-off: Hybrid is the right answer at scale: serverless where workloads are bursty, dedicated compute where they are not, with no ideological commitment either way.
Coca-Cola
Migrated their vending-machine payment backend from always-on EC2 to AWS Lambda. Reported significant cost reduction because vending traffic is highly bursty (busy during the day, near-zero at night), matching the serverless cost model. A clean case where idle cost was the dominant expense.
Trade-off: Highly bursty workloads with idle periods are the perfect serverless fit; steady-state workloads at this scale would be cheaper on dedicated compute.
Cloudflare Workers
Runs JavaScript / WebAssembly functions on V8 isolates at every edge POP worldwide. Cold starts are typically under 5 ms because the runtime is shared and only a new isolate is created per request. Used by major sites for edge-side personalization, A/B testing, and authentication where Lambda cold starts would be too slow.
Trade-off: V8 isolates eliminate the cold-start tail at the cost of a more constrained programming model (no native code, tight memory limits).
Amazon Aurora Serverless
A serverless flavor of Aurora that scales database capacity smoothly with load and bills per ACU-second rather than per instance-hour. Designed to pair with serverless compute (Lambda, Fargate) without the connection-storm problem of fixed-size RDS instances. Demonstrates that 'serverless' has spread well beyond compute into databases and event services.
Trade-off: Removes the connection-storm and capacity-planning pain of pairing FaaS with relational databases, but adds Aurora's premium pricing on top.
Common Interview Questions
Questions you might be asked about this topic
What happens during a cold start?
The provider needs an environment to run the code. 1) Find or allocate a sandbox (microVM in AWS Lambda's case, called Firecracker). 2) Download the function code and any layers from object storage. 3) Start the language runtime (Node, Python, Java, etc.). 4) Run any init code outside the handler (loading libraries, opening connections). 5) Call the handler with the event payload. The full sequence is the cold start; subsequent invocations to the same sandbox skip steps 1-4 and only run the handler. Cold start times range from ~50 ms (Cloudflare Workers, AWS SnapStart) to 1-2 s (large JVM Lambdas).
When would you choose Lambda over ECS or Fargate?
Lambda wins when traffic is spiky or unpredictable, when idle cost matters (low traffic most of the time), when the workload is naturally event-driven (S3, queues, schedules), when the team is small and doesn't want to operate infrastructure, and for prototypes / MVPs. ECS or Fargate wins when traffic is steady and high-throughput (cost crosses over), when latency requirements rule out cold starts, when the workload is long-running, when you need WebSocket / streaming connections, or when state and connection pooling are essential.
How do you handle database connections from Lambda?
Naive code opens a new connection per invocation; under burst this exhausts the database's connection pool fast. Solutions: 1) Open the connection in module scope (outside the handler) so warm invocations reuse it within a sandbox; this caps connections at 'max concurrent invocations'. 2) Use RDS Proxy or another connection pooler outside the function so the function sees a small pool that multiplexes onto few real DB connections. 3) Use a serverless database (DynamoDB, Aurora Serverless, Firestore) designed for the access pattern. 4) Use HTTP-based data APIs that hide the connection layer entirely.
How would you design a serverless image-processing pipeline?
1) Client uploads to S3 with a presigned URL. 2) S3 ObjectCreated event triggers Lambda A, which validates and resizes (or sends large jobs to Fargate / SageMaker if outside Lambda's resource limits). 3) Lambda A writes metadata to DynamoDB and the resized variants back to S3. 4) Lambda A publishes a 'processing complete' event to EventBridge or SNS. 5) Lambda B subscribes to that event and notifies the user (push, email via SES). Each step has DLQs (dead-letter queues) for retries; observability is CloudWatch metrics + X-Ray traces. The whole pipeline scales from 1 to thousands of uploads per minute with zero infra changes and idle cost is zero.
What are the main downsides of serverless?
Cold starts blow the latency tail of low-traffic functions; provisioned concurrency mitigates but eats into the cost benefit. Connection storms exhaust databases under burst; use RDS Proxy or serverless DBs. Hard execution limits (15 min, 10 GB) constrain workload shape. Observability is harder than for long-running services. Cost flips negative at steady high throughput. Vendor lock-in via the trigger ecosystem is real. Local development of an event-driven serverless system is harder than running a single binary. Naming several of these honestly is a much stronger answer than reciting only the marketing.
Interview Tips
How to discuss this topic effectively
Always frame serverless as 'scale-to-zero compute the provider runs'. Most candidates default to 'I would use Lambda' without naming what serverless actually is, which is a junior signal.
Quote concrete cold-start numbers (50 ms for Cloudflare Workers, 100-300 ms for a small Node Lambda, 1-2 s for a large JVM Lambda). Specific numbers signal hands-on experience.
Volunteer the cost crossover point. Saying 'serverless wins below ~30% utilization, dedicated compute wins above' demonstrates economic literacy, not just architectural enthusiasm.
Mention the connection-storm problem and the fix (RDS Proxy, serverless databases, or module-scope connections). Interviewers love this because it is a real production trap.
Position the answer as hybrid by default. 'I would use Lambda for these spiky workloads and ECS for these steady ones' is a much stronger answer than 'all serverless'.
Common Mistakes
Pitfalls to avoid in interviews
Putting a steady high-throughput API on Lambda without checking the math
At constant high QPS, dedicated compute is typically 5-10x cheaper than Lambda for the same workload. Run the cost model before committing; serverless is not a universal cost win.
Opening a fresh database connection inside every Lambda invocation
Concurrent invocations multiply connections and exhaust the DB connection pool fast. Use a connection pooler (RDS Proxy, PgBouncer), a serverless database, or keep the connection in module scope so warm invocations reuse it.
Trying to keep functions warm with a 1-minute ping
Warming one sandbox does not warm the others created by burst traffic. For real cold-start mitigation use provisioned concurrency, SnapStart, or a runtime with low cold-start cost.
Designing a long-running job inside Lambda without splitting it
The 15-minute max runtime is a hard wall. Split work into smaller units triggered by SQS, Step Functions, or recursive invocations, or run the job on Fargate / Batch instead.
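Splitting usually starts with sizing the units. The sketch below groups items into batches whose estimated runtime stays inside a budget; the 600-second default is an assumed safety margin under the 15-minute wall, and the per-item estimate is something you would have to measure for your own workload.

```python
def split_into_batches(items: list, seconds_per_item: float,
                       budget_seconds: float = 600.0) -> list[list]:
    """Group items so each batch's estimated runtime fits within the budget."""
    if seconds_per_item <= 0:
        raise ValueError("seconds_per_item must be positive")
    # At least one item per batch, even if a single item exceeds the budget.
    per_batch = max(1, int(budget_seconds // seconds_per_item))
    return [items[i:i + per_batch] for i in range(0, len(items), per_batch)]


if __name__ == "__main__":
    files = [f"file-{n}.csv" for n in range(10)]
    for batch in split_into_batches(files, seconds_per_item=200):
        print(batch)  # each batch would become one queue message or workflow step
```

Each batch then maps to one SQS message or one Step Functions map iteration, so a single slow batch can be retried without redoing the whole job.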
Treating serverless as vendor-neutral
Lambda + EventBridge + DynamoDB is deeply AWS-specific. The trigger ecosystem is the lock-in. If portability matters, prefer container-based serverless (Cloud Run, Fargate) or build your event glue with portable abstractions (Kafka).
