System Design Article

Serverless Architecture & FaaS

Difficulty: Medium

Serverless does not mean 'no servers'. It means the cloud provider runs the servers, scales them to zero when idle, and bills you per request rather than per running hour. Functions-as-a-Service (AWS Lambda, Cloud Functions, Azure Functions) is the most visible flavor; serverless containers such as Cloud Run follow the same model. The pattern is powerful for spiky workloads, glue code, and small teams who want to skip the infrastructure tax. It is a bad fit for steady high-throughput services, latency-critical paths, and stateful systems. This lesson covers how serverless actually executes (cold starts, warm pools, concurrency limits), the architectural patterns it enables, the patterns it breaks, and the honest cost model.



Motivation

The traditional model of running a service is: you pick an instance size, you start the process, you keep it running 24/7, you scale it manually or with auto-scaling rules, and you pay for every hour the instance exists whether it is doing work or not.

For a service with steady traffic that fully utilizes its CPU and memory, this is fine. For most workloads, it is wasteful:

  • A nightly batch job: 10 minutes of work; 23 hours 50 minutes of idle paid compute.
  • An image-resize endpoint that gets 5 requests per minute on average and 200 per minute during a viral moment: dramatically over- or under-provisioned almost all the time.
  • A webhook receiver: bursty by definition; sometimes zero traffic for hours.
  • A small internal API: maybe 1000 requests per day; a full instance is enormous overkill.

Serverless inverts the model: the cloud provider keeps the runtime warm somewhere, you give them code that runs per event, and you pay only for the milliseconds your code actually executed. Idle costs nothing. Auto-scaling is automatic and goes to zero. Operations (patching, capacity planning, instance failure recovery) are no longer your problem.

The price you pay: cold starts, vendor-specific runtimes, hard execution limits, painful state management, observability that is harder than it should be, and a cost model that flips on you at high steady throughput.

Knowing both the wins and the costs is what makes the architectural choice 'serverless or not' a senior-level decision rather than a 'because everyone does it' default.

Deep Dive

What 'serverless' actually means

The term spans several distinct services. Pinning down exactly what you mean is half of any interview answer.

Layer | Examples | What 'serverless' means here
FaaS (functions) | AWS Lambda, GCP Cloud Functions, Azure Functions, Cloudflare Workers | You upload a function. Provider runs it per event. Pay per invocation + GB-seconds.
Serverless containers | AWS Fargate, Cloud Run, Azure Container Apps | You upload a container. Provider scales instances 0..N based on traffic. Pay per CPU-second / GB-second.
Serverless databases | DynamoDB on-demand, Aurora Serverless, Firestore, Cosmos DB serverless | You write queries; capacity scales automatically; pay per request unit.
Serverless event services | EventBridge, SNS, SQS, Kinesis on-demand | Pay per event; no broker to size.

The common thread: no instances to manage, scale to zero when idle, billed by usage. This lesson focuses primarily on FaaS because it is the prototype, but the pattern generalizes.

How a FaaS request actually executes

When a function is triggered (HTTP, event, schedule, queue message), the provider needs an executable environment. There are three lifecycle states.

+--- Init (cold start) ----+   +--- Invoke (warm) ----+   +--- Idle ----+
| 1) Find/start a sandbox  |   | reuse same sandbox   |   | sandbox     |
| 2) Download code/layers  |-->| just call the handler|-->| kept warm   |
| 3) Start runtime         |   | (1-10 ms overhead)   |   | for ~5-15 m |
| 4) Run init code         |   |                      |   |             |
| 5) Call handler          |   |                      |   |             |
+--------------------------+   +----------------------+   +-------------+
      typically 50ms - 2s          fast                       provider tears
      (much worse for JVM/        path                        down silently
      .NET, much better for
      Go / Rust / Cloudflare
      isolates)

Cold start: the first request to a new sandbox pays for sandbox creation, code download, runtime startup, and any init code (e.g., loading models, opening connections). Cold-start times in 2026 typically range from a few milliseconds on isolate runtimes (Cloudflare Workers), through roughly 100-300 ms for a small Node Lambda, to ~2 s for large JVM or .NET Lambdas with heavy init.

Warm invocation: subsequent requests to the same sandbox skip init; only the handler runs. Per-invocation overhead is in single-digit milliseconds.

Idle: after some time without traffic (typically 5-15 minutes), the provider tears down the sandbox. The next request pays a cold start.

Why does this matter? Because cold starts dominate the tail latency of low-traffic functions. A function that gets one request every 20 minutes finds its sandbox already torn down almost every time, so nearly every request pays a cold start; its p99 will be the cold-start time, not the warm-path time.
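
The relationship between traffic rate and cold-start exposure can be sketched with a simple model. This is an illustration under stated assumptions, not a provider formula: arrivals are Poisson, a single sandbox serves the function, and the idle timeout is a fixed 10 minutes (the text quotes 5-15). The warm and cold latencies passed in are hypothetical.

```python
import math

IDLE_TIMEOUT_S = 10 * 60  # assumed; providers do not publish an exact value


def cold_start_fraction(requests_per_second: float, idle_timeout_s: float) -> float:
    """Fraction of requests that arrive after the sandbox was torn down.

    With Poisson arrivals, the gap between requests exceeds the idle
    timeout with probability exp(-rate * timeout).
    """
    return math.exp(-requests_per_second * idle_timeout_s)


def p99_latency_ms(cold_fraction: float, warm_ms: float, cold_ms: float) -> float:
    """p99 when every request is either warm or cold (two-point distribution)."""
    return cold_ms if cold_fraction > 0.01 else warm_ms


sparse = cold_start_fraction(1 / 1800, IDLE_TIMEOUT_S)  # one request per 30 minutes
busy = cold_start_fraction(10, IDLE_TIMEOUT_S)          # ten requests per second

print(f"sparse: {sparse:.0%} cold, p99 = {p99_latency_ms(sparse, 20, 800)} ms")
print(f"busy:   {busy:.0%} cold, p99 = {p99_latency_ms(busy, 20, 800)} ms")
```

As soon as more than 1% of requests are cold, the p99 is set entirely by the cold path, which is why sparse functions feel slow even when their warm path is fast.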

Concurrency model

In the classic FaaS model, each sandbox processes one request at a time. AWS Lambda works this way; newer generations of Cloud Functions and Azure Functions can run several requests per instance, but the one-at-a-time model is the safe default to design for. If 100 requests arrive simultaneously, the provider spins up 100 sandboxes. This is wonderful for scaling but it has consequences:

  • Connection storms: every cold sandbox opens its own database connection. 1000 concurrent invocations = 1000 DB connections. RDS connection limits get hit fast.
  • Burst limits: every provider has a per-account concurrency cap (e.g., AWS Lambda's 1000 default per region). Hit it and requests get throttled.
  • No internal state: a counter you increment in memory dies with the sandbox; a cache you build in memory benefits one sandbox only.
  • Per-request CPU and memory: AWS Lambda allocates CPU proportionally to memory. A 128 MB function gets a small fraction of a core; a 1769 MB function gets a full vCPU.

Cloudflare Workers and similar 'isolate'-based runtimes break the one-request-per-sandbox rule (multiple requests share a V8 isolate), trading some isolation for much lower per-request overhead and far better cold-start behavior.
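
The one-request-per-sandbox arithmetic above can be estimated before deploying. A minimal sketch, assuming Little's law applies (steady arrivals), the 1000 default concurrency cap quoted above, and CPU that scales linearly with memory up to one vCPU at 1769 MB; the function names are hypothetical.

```python
import math

ACCOUNT_CONCURRENCY_CAP = 1000  # AWS Lambda default quoted in the text
MB_PER_FULL_VCPU = 1769         # memory size at which Lambda grants one full vCPU


def required_sandboxes(requests_per_second: float, avg_duration_s: float) -> int:
    """Little's law: in-flight requests = arrival rate * time per request.

    With one request per sandbox, this is also the number of sandboxes
    (and, in the naive case, database connections).
    """
    return math.ceil(requests_per_second * avg_duration_s)


def is_throttled(requests_per_second: float, avg_duration_s: float,
                 cap: int = ACCOUNT_CONCURRENCY_CAP) -> bool:
    """True when steady-state concurrency would exceed the account cap."""
    return required_sandboxes(requests_per_second, avg_duration_s) > cap


def vcpu_share(memory_mb: int) -> float:
    """Approximate CPU share for a given memory setting (linear assumption)."""
    return memory_mb / MB_PER_FULL_VCPU


print(required_sandboxes(400, 0.25))  # steady load of 400 rps at 250 ms each
print(is_throttled(6000, 0.25))       # a burst that needs 1500 sandboxes
print(round(vcpu_share(128), 3))      # CPU fraction of the smallest function
```

Running the same numbers against the database's connection limit is a quick way to spot a connection storm on paper.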

Hard limits to respect

Every FaaS platform has rigid limits that constrain architecture:

Limit | AWS Lambda (typical) | Implication
Max execution time | 15 minutes | Long-running jobs need to be split or run elsewhere
Max memory | 10 GB | Large in-memory computations or models may not fit
Max payload (sync) | 6 MB | Large uploads must go through S3 with a presigned URL
Max payload (async) | 256 KB | Event size bounded
Temp storage | 10 GB (/tmp) | Large data must come from object storage
Concurrency cap | 1000/account by default | High-traffic functions need a quota increase
Cold start | 50 ms - 2 s | Latency-sensitive paths must keep warm or use provisioned concurrency

These are not optional. Designing serverless without naming them is a junior signal.
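
One way to make these limits part of design review is to check a workload description against them. The sketch below hard-codes the typical values from the table above (verify them against current quotas); `WorkloadSpec` and `limit_violations` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LambdaLimits:
    """Typical AWS Lambda limits taken from the table above."""
    max_duration_s: int = 15 * 60
    max_memory_mb: int = 10 * 1024
    max_sync_payload_bytes: int = 6 * 1024 * 1024
    max_tmp_storage_mb: int = 10 * 1024


@dataclass(frozen=True)
class WorkloadSpec:
    """Hypothetical description of one unit of work."""
    expected_duration_s: float
    peak_memory_mb: int
    sync_payload_bytes: int
    tmp_storage_mb: int


def limit_violations(spec: WorkloadSpec,
                     limits: LambdaLimits = LambdaLimits()) -> list[str]:
    """Return one human-readable message per limit the workload would exceed."""
    checks = [
        ("execution time (s)", spec.expected_duration_s, limits.max_duration_s),
        ("memory (MB)", spec.peak_memory_mb, limits.max_memory_mb),
        ("sync payload (bytes)", spec.sync_payload_bytes, limits.max_sync_payload_bytes),
        ("temp storage (MB)", spec.tmp_storage_mb, limits.max_tmp_storage_mb),
    ]
    return [f"{name}: {value} exceeds limit {limit}"
            for name, value, limit in checks if value > limit]


transcode = WorkloadSpec(expected_duration_s=40 * 60, peak_memory_mb=4096,
                         sync_payload_bytes=50 * 1024 * 1024, tmp_storage_mb=2048)
thumbnail = WorkloadSpec(expected_duration_s=2, peak_memory_mb=512,
                         sync_payload_bytes=200_000, tmp_storage_mb=50)

for message in limit_violations(transcode):
    print(message)
print(limit_violations(thumbnail))
```

A workload that trips the execution-time or payload check is a signal to split the job or move the data path through object storage.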

State: where it lives

FaaS is fundamentally stateless compute. You cannot rely on in-memory state surviving across invocations, because any sandbox can be torn down at any time. Every durable state need has to land somewhere external:

  • Caching layer: Redis (ElastiCache / MemoryDB), DynamoDB, or in-memory only when warmth is acceptable.
  • Database: DynamoDB and similar serverless databases pair naturally; relational DBs need a connection pooler (RDS Proxy) to survive concurrency storms.
  • Workflow state: Step Functions or Durable Functions store the state of a multi-step workflow so individual steps can be stateless.
  • Object storage: S3 / GCS for any large artifacts (uploads, intermediate files).

A realistic serverless architecture is one part code, three parts managed services.

Triggers: the event surface

The interesting thing about FaaS is that anything can be a trigger:

  • HTTP request via API Gateway or function URL.
  • Object created in S3.
  • Message arrived in SQS, SNS, EventBridge, Kinesis.
  • Row changed in DynamoDB stream.
  • Scheduled by cron expression.
  • IoT device event.
  • Workflow step in Step Functions.

This trigger ecosystem is the real win. Wiring 'when a file lands in S3, transcode it, store the result, send a notification' becomes four small functions and three event subscriptions. No long-running orchestrator to babysit.
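
To make the 'file lands in S3' trigger concrete, here is a minimal sketch of a handler that reads an S3 event notification. The `Records[].s3.bucket.name` and `Records[].s3.object.key` fields follow the S3 notification format; the bucket name, key, and the shape of the return value are made up for the example.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # S3 URL-encodes object keys in notifications (spaces arrive as '+').
        key = unquote_plus(s3["object"]["key"])
        objects.append((bucket, key))
    return objects


def handler(event, context):
    """Lambda-style entry point; one invocation may carry several records."""
    processed = [f"s3://{bucket}/{key}" for bucket, key in extract_s3_objects(event)]
    return {"processed": processed}


sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "uploads"},
                "object": {"key": "raw/holiday+photo.jpg", "size": 1024},
            },
        }
    ]
}

print(handler(sample_event, None))
```

Decoding the key before using it avoids a common bug where objects with spaces in their names appear to be missing.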

Implementation

A canonical serverless pipeline

A classic 'upload, process, notify' workflow looks like this:

user upload
      |
      v
  +-------------+  S3:ObjectCreated  +------------+ writes to +-----------+
  | S3 bucket   | -----------------> | Lambda A   | --------> | DynamoDB  |
  +-------------+                    | (process)  |           +-----------+
                                     +-----+------+
                                           |
                                           | publish event
                                           v
                                     +-------------+
                                     | EventBridge |
                                     +------+------+
                                            |
                                            v
                                     +-------------+    +----------------+
                                     | Lambda B    |--> | SES / SNS push |
                                     | (notify)    |    +----------------+
                                     +-------------+

Notable features:

  • No servers to manage anywhere. Every box is managed.
  • Failure isolation: if Lambda B fails, it is retried independently of Lambda A; events that exhaust their retries can be parked in a dead-letter queue and replayed once the downstream service recovers.
  • Scale: the pipeline auto-scales from 1 file per hour to 10,000 per minute without any code change.
  • Cost at idle: zero. You pay only when files arrive.
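
The failure-isolation property can be illustrated with an in-memory stand-in for the event bus. This is a toy model, not how EventBridge is implemented: delivery here is synchronous, the retry budget of 3 is an assumption, and `deliver`, `lambda_a`, and `flaky_notifier` are hypothetical names.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed retry budget; real services have their own policies


def deliver(event: dict, subscriber, dead_letters: deque) -> bool:
    """Invoke a subscriber with retries; park the event in the DLQ on failure."""
    for _ in range(MAX_ATTEMPTS):
        try:
            subscriber(event)
            return True
        except Exception:
            continue  # a real bus would back off between attempts
    dead_letters.append(event)
    return False


def lambda_a(upload_key: str, table: dict, publish) -> None:
    """Process step: record the result, then announce it as an event."""
    table[upload_key] = {"status": "processed"}
    publish({"detail-type": "FileProcessed", "detail": {"key": upload_key}})


def flaky_notifier(event: dict) -> None:
    """Stand-in for Lambda B while its downstream provider is unavailable."""
    raise RuntimeError("email provider is down")


table: dict = {}
dead_letters: deque = deque()

lambda_a("raw/cat.jpg", table, lambda e: deliver(e, flaky_notifier, dead_letters))

print(table)               # the processing result is stored despite the failure
print(len(dead_letters))   # the undelivered notification waits for replay
```

The processing step completes and its result is stored even though the notification step keeps failing; the undelivered event is retained rather than lost.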

Cold start mitigation

Four established techniques:

  1. Provisioned concurrency (AWS Lambda) / min instances (Cloud Run, Cloud Functions): keep N sandboxes pre-warmed. Pay for them whether used or not. The classic 'serverless minus the scale-to-zero benefit' trade.
  2. SnapStart (AWS Lambda Java, .NET, Python): take a snapshot of the initialized runtime and resume from snapshot instead of cold-booting. Often 90% reduction in cold-start time.
  3. Smaller, faster runtimes: Go, Rust, and Node tend to have sub-100 ms cold starts; large JVM apps have multi-second cold starts. Choosing the runtime is choosing the cold-start floor.
  4. Architecture: keep latency-sensitive paths off FaaS or put a fast cache in front of them.

Not a fix: 'pinging the function every minute to keep it warm'. This works for one sandbox. Under any concurrency, new sandboxes still cold-start.
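
A back-of-the-envelope sketch shows why the ping does not help under burst. It assumes one request per sandbox and that all requests in the burst arrive at the same moment; the function name is hypothetical.

```python
def cold_starts_in_burst(concurrent_requests: int, warm_sandboxes: int) -> int:
    """Each in-flight request needs its own sandbox; only warm ones skip init."""
    return max(0, concurrent_requests - warm_sandboxes)


burst = 50
print("keep-warm ping (1 warm sandbox):", cold_starts_in_burst(burst, 1))
print("provisioned concurrency of 50:  ", cold_starts_in_burst(burst, 50))
```

The ping saves exactly one cold start per burst, while provisioned concurrency sized to the burst removes them all, at the price of paying for idle sandboxes.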

The connection-storm problem

A naive serverless function that opens a Postgres connection per invocation works fine at low traffic. Under burst it opens hundreds or thousands of connections in seconds and the database falls over.

The standard fixes:

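  • Use a connection pooler outside the function (RDS Proxy, PgBouncer). Functions connect to the pooler, and the pooler multiplexes those many client connections onto a small number of real DB connections.
  • Use a serverless database. DynamoDB and Firestore are request-based and have no connection pool to exhaust; Aurora Serverless v2 scales capacity with load but still uses ordinary database connections, so it benefits from a pooler or the Data API.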
  • Use HTTP-based data APIs (Aurora Data API, Hasura, Supabase REST) that hide the connection layer.
  • Keep connections in module scope (outside the handler) so warm invocations reuse them within a sandbox. This caps connections at 'max concurrency' rather than 'requests per second'.
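
The module-scope technique from the last bullet can be sketched as follows. To keep the example runnable without a database server, `sqlite3` stands in for a Postgres client; the `connections_opened` counter exists only to make the reuse visible.

```python
import sqlite3

_connection = None        # module scope: survives across warm invocations
connections_opened = 0    # illustration only, not something to ship


def get_connection():
    """Open the connection lazily, once per sandbox."""
    global _connection, connections_opened
    if _connection is None:
        _connection = sqlite3.connect(":memory:")  # stand-in for a Postgres client
        connections_opened += 1
    return _connection


def handler(event, context):
    """Every warm invocation reuses the connection opened by the first one."""
    row = get_connection().execute("SELECT ? + 1", (event["n"],)).fetchone()
    return {"result": row[0]}


# Simulate five invocations landing on the same warm sandbox.
results = [handler({"n": n}, None)["result"] for n in range(5)]
print(results, "connections opened:", connections_opened)
```

A production version would also need to detect a connection that went stale while the sandbox was idle and reconnect, which this sketch omits.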

Observability is harder

With serverless you get:

  • Metrics: per-function invocation count, duration, errors, throttles, cold-start count. Easy.
  • Logs: per-invocation logs to CloudWatch / Cloud Logging (formerly Stackdriver). Easy but expensive to query at scale.
  • Traces: distributed tracing across function chains needs explicit propagation (X-Ray, OpenTelemetry SDK). Harder.
  • Profiling and debugging: no SSH to attach a debugger. You add log statements and redeploy. The feedback loop is slower than for a long-running service.

A real serverless platform investment includes structured logging, trace context propagation, and cost-aware log retention.
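
Structured logging with trace propagation can be sketched with the standard library alone. The `trace_id` event field is an assumption made for this example; real deployments usually carry trace context in X-Ray or OpenTelemetry headers instead.

```python
import json
import time
import uuid


def log(level: str, message: str, trace_id: str, **fields) -> str:
    """Emit one JSON object per line so log queries can filter on fields."""
    line = json.dumps({"ts": time.time(), "level": level, "msg": message,
                       "trace_id": trace_id, **fields})
    print(line)
    return line


def handler(event, context):
    # Reuse the caller's trace id if present; otherwise start a new trace.
    trace_id = event.get("trace_id") or str(uuid.uuid4())
    log("INFO", "processing started", trace_id, key=event.get("key"))
    # Return the same id so the next function in the chain can log under it.
    return {"trace_id": trace_id, "key": event.get("key")}


first = handler({"key": "raw/cat.jpg"}, None)   # starts a new trace
second = handler(first, None)                   # continues the same trace
```

Because both invocations log the same `trace_id`, a single query can reconstruct the path of one request across the function chain.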

Cost model and the crossover point

Serverless cost is roughly:

cost = invocations * unit_price + GB-seconds * memory_price

AWS Lambda (representative 2026 pricing):

  • $0.20 per million requests.
  • $0.0000166667 per GB-second of compute.

Worked example: a 200 ms function with 512 MB of memory invoked 10 million times a day:

  • Requests: 10M * 30 days = 300M -> ~$60/month for invocation cost.
  • GB-seconds: 0.5 GB * 0.2 s * 300M = 30M GB-s -> ~$500/month for compute.
  • Total: ~$560/month.

A t3.medium EC2 instance costs ~$30/month on demand. Even with 4-5 instances behind a load balancer for redundancy and burst, running the same workload as a long-running service comes to ~$200/month all-in.

At steady high throughput, dedicated compute is dramatically cheaper. The break-even point is workload-dependent but usually lands around 'busy more than ~30% of the day'. Below that, serverless wins on cost AND operations.
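
The worked example can be reproduced in a few lines, which makes it easy to rerun with your own numbers. The prices are the representative figures quoted above, the free tier is ignored, and the $200/month figure for dedicated compute is the article's estimate rather than a quote.

```python
PRICE_PER_MILLION_REQUESTS = 0.20   # USD, from the text
PRICE_PER_GB_SECOND = 0.0000166667  # USD, from the text
DAYS_PER_MONTH = 30


def lambda_monthly_cost(requests_per_day: float, duration_s: float,
                        memory_gb: float) -> float:
    """Monthly cost = request charge + compute (GB-second) charge."""
    requests = requests_per_day * DAYS_PER_MONTH
    request_cost = requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    compute_cost = requests * duration_s * memory_gb * PRICE_PER_GB_SECOND
    return request_cost + compute_cost


def break_even_requests_per_day(fixed_monthly_cost: float, duration_s: float,
                                memory_gb: float) -> float:
    """Daily request volume at which Lambda costs as much as fixed compute."""
    per_request = (PRICE_PER_MILLION_REQUESTS / 1_000_000
                   + duration_s * memory_gb * PRICE_PER_GB_SECOND)
    return fixed_monthly_cost / (per_request * DAYS_PER_MONTH)


cost = lambda_monthly_cost(10_000_000, duration_s=0.2, memory_gb=0.5)
crossover = break_even_requests_per_day(200, duration_s=0.2, memory_gb=0.5)

print(f"Lambda: ${cost:,.0f}/month")
print(f"Break-even vs $200/month of instances: {crossover:,.0f} requests/day")
```

For this function shape the crossover sits at roughly a third of the example's traffic, which is the same ballpark as the '~30% busy' rule of thumb.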

When to Use

Strong fit for serverless

  • Spiky / unpredictable traffic (webhook receivers, image processing, IoT ingestion).
  • Event-driven glue code: 'when X happens, do Y' wiring between managed services.
  • Low-traffic APIs and admin tools where idle cost dominates.
  • Scheduled jobs: cron-style work that runs minutes per day.
  • Stateless transformations: data validation, format conversion, enrichment.
  • Edge compute: Cloudflare Workers / Lambda@Edge for personalization, A/B testing, auth at the edge.
  • Prototypes and MVPs where shipping speed matters more than per-request cost.

Probably do not use serverless

  • Steady high-throughput services where dedicated compute is cheaper and warmer.
  • Latency-critical paths (sub-50ms p99 targets) where cold starts can blow the SLO.
  • Long-running jobs beyond the platform's max execution time (15 min on Lambda).
  • Stateful workloads that benefit from in-memory state, sticky sessions, or open connections.
  • WebSocket / streaming connections at scale (the connection model fights the per-invocation billing).
  • Heavy CPU or GPU work that exceeds FaaS resource limits.
  • Vendor-portability requirements where being locked into provider-specific event services is unacceptable.

Compared to alternatives

PatternWhen it winsWhen it loses
FaaS (Lambda)Spiky, event-driven, scale-to-zeroSteady high QPS, sub-50ms latency, long jobs
Serverless containers (Cloud Run, Fargate)Same scale-to-zero benefit but you control the container; longer max runtimeHigher per-request floor than pure FaaS; still a vendor lock
Managed K8s (EKS, GKE)Mixed workloads, polyglot, full controlOperational overhead is real
Self-managed VMsSteady traffic, cost-optimized at scaleAll ops on you
PaaS (Heroku, Fly.io, Render)Small teams, simple apps, fast deploysCost at scale; less flexibility

Hybrid is the realistic answer

Most serious companies running serverless run a hybrid: long-running services for the steady core, FaaS for the spiky edges (webhooks, image processing, scheduled tasks). The interview answer 'serverless for these workloads, dedicated compute for these other workloads, here is my reason' is consistently stronger than 'all-in on either side'.

Case Studies

iRobot: serverless from day one

iRobot publicly described running their entire backend (millions of Roomba devices, billions of telemetry events) on AWS Lambda + DynamoDB + Kinesis with no traditional servers. Their team is famously small for the workload because they never built any of the infrastructure those services replace. This is one of the canonical 'serverless at scale' references.

Lesson: an event-driven IoT workload is almost the textbook fit for serverless, and the operational savings are enormous when the team is small relative to the workload.

Netflix: serverless for media pipelines

Netflix uses AWS Lambda extensively for video processing pipelines: each new master file triggers transcoding, validation, packaging, and publishing through a chain of functions. The streaming serving path itself remains on long-running services because of the QPS and latency profile, but the asynchronous media-processing work is a natural FaaS fit.

Lesson: a single company's architecture mixes serverless and dedicated compute deliberately; the boundary is set by per-workload economics and latency.

Coca-Cola vending machines

Coca-Cola described a serverless rewrite of their vending-machine payment system that cut backend operating cost dramatically by moving from always-on EC2 to per-transaction Lambda. The workload is naturally bursty (transactions during the day, near zero overnight) and matches the serverless cost model precisely.

Lesson: workloads with high idle ratios are where serverless wins on raw dollars, not just operations.

A typical cautionary tale: the always-busy API

Many engineering blogs describe teams porting a steady-throughput API (constant ~1000 RPS) to Lambda and watching the bill explode. The per-request and per-GB-second charges, multiplied by constant traffic, came to 5-10x what a small autoscaling group of EC2 instances would cost. The teams either added provisioned concurrency (which mostly negates the serverless cost advantage) or migrated those endpoints back to ECS / Fargate.

Lesson: serverless is not a universal cost win. The break-even with dedicated compute lands in favor of dedicated when traffic is steady and high.

Cloudflare Workers: a different cold-start story

Cloudflare Workers run on V8 isolates rather than per-request containers. Cold starts are typically <5 ms because there is no container to spin up; the runtime stays warm and the function is just a new isolate inside it. This makes Workers viable for edge personalization on every page load, a workload that would be marginal on traditional Lambda due to cold starts.

Lesson: 'serverless cold start' is platform-dependent. Pick the runtime that matches the latency requirements; isolates are a different point on the trade-off curve from per-request containers.

Quick Review

  • Serverless = scale to zero, pay per usage, ops handled by the provider. FaaS is the most visible flavor.
  • Cold starts dominate the tail latency of low-traffic functions; warm invocations are fast.
  • Concurrency model is one request per sandbox (except for isolate runtimes like Cloudflare Workers).
  • Hard limits (15 min max runtime, 10 GB memory, 6 MB payload, account concurrency caps) drive architecture.
  • State lives outside: in serverless DBs, caches, object storage, or workflow engines.
  • Cost wins for spiky workloads; loses for steady high-throughput workloads.
  • Hybrid (FaaS at the edges, dedicated compute for the steady core) is the realistic enterprise pattern.

Real-World Examples

How real systems implement this in production

iRobot

Runs the backend for millions of Roomba devices on a serverless stack (AWS Lambda, DynamoDB, Kinesis, IoT Core). Famously operates with a small team because there are no servers to patch, scale, or replace. A canonical reference for IoT workloads where event volume is huge but per-event work is small.

Trade-off: Trades raw cost-per-request for operational simplicity; works because device traffic is naturally bursty and the team is small.

Netflix media pipeline

Uses AWS Lambda for the asynchronous video-processing pipeline (transcoding orchestration, validation, packaging) while keeping the actual streaming-serving path on long-running services. Demonstrates the 'hybrid by workload' pattern: serverless for spiky asynchronous work, dedicated compute for steady high-QPS serving.

Trade-off: Hybrid is the right answer at scale: serverless where workloads are bursty, dedicated compute where they are not, with no ideological commitment either way.

Coca-Cola vending payments

Migrated their vending-machine payment backend from always-on EC2 to AWS Lambda. Reported significant cost reduction because vending traffic is highly bursty (busy during the day, near-zero at night), matching the serverless cost model. A clean case where idle cost was the dominant expense.

Trade-off: Highly bursty workloads with idle periods are the perfect serverless fit; steady-state workloads at this scale would be cheaper on dedicated compute.

Cloudflare Workers

Runs JavaScript / WebAssembly functions on V8 isolates at every edge POP worldwide. Cold starts are typically under 5 ms because the runtime is shared and starting a Worker only means creating a new isolate, which then serves many requests. Used by major sites for edge-side personalization, A/B testing, and authentication where Lambda cold starts would be too slow.

Trade-off: V8 isolates eliminate the cold-start tail at the cost of a more constrained programming model (no native code, tight memory limits).

AWS Aurora Serverless v2

A serverless flavor of Aurora that scales database capacity smoothly with load and bills per ACU-second rather than per instance-hour. Designed to pair with bursty serverless compute (Lambda, Fargate) better than fixed-size RDS instances do, although it still uses ordinary database connections, so a pooler or the Data API remains advisable under heavy concurrency. Demonstrates that 'serverless' has spread well beyond compute into databases and event services.

Trade-off: Removes most of the capacity-planning pain of pairing FaaS with relational databases and eases connection pressure, but adds Aurora's premium pricing on top.

Quick Interview Phrases

Key terms to use in your answer

scale to zero
cold start vs warm invocation
provisioned concurrency
connection storm
event-driven glue code
serverless cost crossover

Common Interview Questions

Questions you might be asked about this topic

What happens, step by step, when a function cold-starts?

The provider needs an environment to run the code. 1) Find or allocate a sandbox (in AWS Lambda's case, a Firecracker microVM). 2) Download the function code and any layers from object storage. 3) Start the language runtime (Node, Python, Java, etc.). 4) Run any init code outside the handler (loading libraries, opening connections). 5) Call the handler with the event payload. The full sequence is the cold start; subsequent invocations to the same sandbox skip steps 1-4 and only run the handler. Cold start times range from a few milliseconds (Cloudflare Workers isolates), through 100-300 ms (a small Node Lambda), to 1-2 s (large JVM Lambdas).

Interview Tips

How to discuss this topic effectively

  1. Always frame serverless as 'scale-to-zero compute the provider runs'. Most candidates default to 'I would use Lambda' without naming what serverless actually is, which is a junior signal.
  2. Quote concrete cold-start numbers (under 5 ms for Cloudflare Workers, 100-300 ms for a small Node Lambda, 1-2 s for a large JVM Lambda). Specific numbers signal hands-on experience.
  3. Volunteer the cost crossover point. Saying 'serverless wins below ~30% utilization, dedicated compute wins above' demonstrates economic literacy, not just architectural enthusiasm.
  4. Mention the connection-storm problem and the fix (RDS Proxy, serverless databases, or module-scope connections). Interviewers love this because it is a real production trap.
  5. Position the answer as hybrid by default. 'I would use Lambda for these spiky workloads and ECS for these steady ones' is a much stronger answer than 'all serverless'.

Common Mistakes

Pitfalls to avoid in interviews

Putting a steady high-throughput API on Lambda without checking the math

At constant high QPS, dedicated compute is typically 5-10x cheaper than Lambda for the same workload. Run the cost model before committing; serverless is not a universal cost win.

Opening a fresh database connection inside every Lambda invocation

Concurrent invocations multiply connections and exhaust the DB connection pool fast. Use a connection pooler (RDS Proxy, PgBouncer), a serverless database, or keep the connection in module scope so warm invocations reuse it.

Trying to keep functions warm with a 1-minute ping

Warming one sandbox does not warm the others created by burst traffic. For real cold-start mitigation use provisioned concurrency, SnapStart, or a runtime with low cold-start cost.

Designing a long-running job inside Lambda without splitting it

The 15-minute max runtime is a hard wall. Split work into smaller units triggered by SQS, Step Functions, or recursive invocations, or run the job on Fargate / Batch instead.
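
Splitting a job usually starts with sizing the chunks so each one finishes well inside the time limit. A minimal sketch, assuming a known average time per item and a 50% safety margin (both are assumptions to tune); each resulting chunk would become one queue message or one workflow step.

```python
import math

MAX_RUNTIME_S = 15 * 60  # Lambda's hard execution limit
SAFETY_FACTOR = 0.5      # assumed headroom for slow items and retries


def chunk_size_for(seconds_per_item: float) -> int:
    """Largest item count that fits in the time budget (at least one item)."""
    budget_s = MAX_RUNTIME_S * SAFETY_FACTOR
    return max(1, math.floor(budget_s / seconds_per_item))


def split_into_chunks(items: list, chunk_size: int) -> list[list]:
    """Cut the work list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


work = list(range(1000))  # e.g. 1000 records at ~2 s each
size = chunk_size_for(2.0)
chunks = split_into_chunks(work, size)

print(f"{len(chunks)} invocations of up to {size} items each")
```

If a single item can exceed the limit on its own, chunking does not help and the job belongs on Fargate or Batch instead.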

Treating serverless as vendor-neutral

Lambda + EventBridge + DynamoDB is deeply AWS-specific. The trigger ecosystem is the lock-in. If portability matters, prefer container-based serverless (Cloud Run, Fargate) or build your event glue with portable abstractions (Kafka).