System Design Article
Data Pipelines & ETL/ELT
Difficulty: Medium
Data pipelines move data from operational systems (your transactional databases, event logs, third-party APIs) into analytical systems (warehouses, lakes, search indexes, ML feature stores). The 'shape' of the pipeline (ETL vs ELT, batch vs incremental, push vs pull) determines latency, cost, and how painful schema changes will be. This lesson covers the architectural choices: ingestion patterns, transformation engines (dbt, Spark, Beam), orchestration (Airflow, Dagster, Prefect), data quality, lineage, and the standard production layout (raw / staging / mart). It also covers the failure modes you must design for: late-arriving data, idempotency, backfills, schema evolution, and the silent corruption that comes from not testing your pipelines.
Data Pipelines & ETL/ELT
Data pipelines move data from operational systems (your transactional databases, event logs, third-party APIs) into analytical systems (warehouses, lakes, search indexes, ML feature stores). The 'shape' of the pipeline (ETL vs ELT, batch vs incremental, push vs pull) determines latency, cost, and how painful schema changes will be. This lesson covers the architectural choices: ingestion patterns, transformation engines (dbt, Spark, Beam), orchestration (Airflow, Dagster, Prefect), data quality, lineage, and the standard production layout (raw / staging / mart). It also covers the failure modes you must design for: late-arriving data, idempotency, backfills, schema evolution, and the silent corruption that comes from not testing your pipelines.
161 views
1
Motivation
A running product generates data in dozens of shapes: rows in Postgres, events in Kafka, files dumped by partners, third-party API responses, click logs. Almost none of it is in the shape that analytics, ML, or finance teams actually need. The job of a data pipeline is to take that mess and produce trusted, queryable, well-modeled data in the analytical store.
The naive approach is one ad-hoc script per question. Marketing wants daily signups by country; someone writes a cron that copies a Postgres table to a CSV and uploads it. Six months later the company has 80 such scripts, no two are documented, half of them silently broke when someone renamed a column, and the dashboards quietly disagree with the source database.
The disciplined approach is a pipeline: a defined set of stages, each one's input and output explicit, each step orchestrated, tested, monitored, and version-controlled. The right architecture turns 'analytics data is broken again' from a weekly incident into a non-event.
Why this matters at senior level: pipelines are where business decisions get their numbers. A bad pipeline does not crash, it lies. The lying is invisible until the CFO presents the wrong revenue number to the board. Architecting pipelines that are correct, observable, and recoverable is a senior data role at every serious tech company.
Deep Dive
ETL vs ELT: the modern default flipped
ETL (Extract, Transform, Load) was the dominant pattern for 30 years. You pulled data from sources, transformed it on a separate engine (Informatica, Talend, custom Python), then loaded the cleaned result into the warehouse.
Why ETL? Warehouses used to be expensive, slow, and limited (Teradata, Oracle Exadata). You did NOT want to pay them to do generic SQL transforms; you bought a separate transformation engine.
ELT (Extract, Load, Transform) became the new default once cloud warehouses (Snowflake, BigQuery, Redshift) made compute cheap and elastic. You dump raw source data into the warehouse first, then transform inside the warehouse with SQL.
ETL:
[ Source ] -> [ Transform engine (Spark / Python) ] -> [ Warehouse ]
ELT:
[ Source ] -> [ Warehouse: raw layer ] -> [ Warehouse: SQL transforms ] -> [ Marts ]The ELT advantages are real:
- Raw data is preserved. If you discover six months later that a transform was wrong, you can recompute from raw without re-extracting from the source.
- SQL is universal. Analysts and data scientists can read and contribute. Spark / Python pipelines are a separate skill.
- Cloud warehouse compute is elastic. You scale up for a backfill and down afterward.
- Tools (dbt above all) standardized this. Every modern team's transformation layer is some flavor of versioned SQL.
ETL is still right when:
- The transformation needs computation the warehouse cannot do (heavy ML, image processing, complex joins on huge volumes that would melt the warehouse).
- Compliance forbids landing raw PII in the warehouse; you must mask before loading.
- You are working on-prem with an old warehouse and cannot trivially scale compute.
The standard layered pipeline (raw / staging / mart)
Every serious team converges on roughly this layout, sometimes named differently (bronze / silver / gold in Databricks land, raw / staging / mart in dbt land):
+--------+ +-----------+ +--------------+ +----------+
| Source | -> | Raw layer | -> | Staging | -> | Marts |
| | | (as-is) | | (cleaned, | | (busi- |
| | | | | typed, | | ness |
| | | | | deduped) | | views) |
+--------+ +-----------+ +--------------+ +----------+
| | |
| | |
schema close 1 model per 1 model per
to source source table use case- Raw: a near-verbatim copy of the source. Schema close to source. No business logic. The point of the raw layer is fidelity and idempotency: re-running ingestion of the same window must produce the same raw rows.
- Staging: light cleaning. Cast types, rename columns, drop obvious junk, deduplicate. One staging model per source table. Still source-shaped, not business-shaped.
- Marts: business-shaped.
dim_customer,fct_orders,mart_revenue_daily. Multiple sources are joined, business definitions encoded ('what counts as an active user'), and these are what dashboards and downstream consumers query.
The rule: dashboards never query raw or staging directly. They query marts. This is what lets you refactor the upstream layers without breaking every dashboard.
Ingestion patterns
How data lands in the raw layer falls into a few standard shapes.
Full snapshot: every run replaces the entire table. Simple, deterministic, but expensive and loses history. Fine for small dimension tables (10k country codes), wrong for fact tables (10B order rows).
Incremental append: each run adds rows newer than the last high-water mark. Efficient for append-only data (events, logs, immutable orders). Watch for clock skew and late-arriving rows.
Incremental upsert (merge): rows can change. Each run upserts records by primary key. Requires either a reliable updated_at column on the source or change data capture (CDC).
Change data capture (CDC): a stream of row-level inserts / updates / deletes from the source database, usually via the WAL / binlog. Tools: Debezium, Fivetran, AWS DMS. Best fidelity, but more moving parts.
Event ingestion: the source produces events (Kafka, Kinesis, webhook) and the pipeline consumes them in micro-batches (every minute) or in true streaming.
A single product usually uses all five for different sources at the same time. Choosing the right one per source is the first design decision.
Batch vs micro-batch vs streaming
Latency budget drives the choice:
| Pattern | Typical latency | Tools | When |
|---|---|---|---|
| Daily batch | hours | Airflow + dbt | Finance, monthly reporting |
| Hourly batch | ~1 hour | Airflow + dbt + incremental | Most BI dashboards |
| Micro-batch | 1-5 minutes | Spark Structured Streaming, Databricks | Operational dashboards, near-real-time alerts |
| True streaming | seconds | Flink, Kafka Streams | Fraud detection, live ML features |
The trap: every product manager asks for 'real-time'. The cost of going from hourly batch to true streaming is often 5-10x in infrastructure and engineering. Push back: 'what business decision is made faster than every hour, and what is it worth?' If the honest answer is 'none', stay with hourly batch.
Orchestration: the layer that ties it together
A modern pipeline is dozens of tasks with dependencies. The orchestrator is what runs them in the right order, retries the failed ones, and tells you when something broke.
The market converged on a few:
- Apache Airflow: the de facto standard. Python DAGs, mature ecosystem. Showing its age (operator-centric model, weak data-awareness) but ubiquitous.
- Dagster: data-asset-centric (you declare assets and their dependencies, not tasks). Better for the modern dbt + Python world.
- Prefect: lighter weight, good UX. Strong cloud offering.
- Cloud-native (Step Functions, Cloud Composer, Workflows): simple cases on AWS / GCP.
The orchestrator's job is scheduling, dependencies, retries, observability, and backfills. It is NOT where your transformations live (that is dbt / Spark). Mixing them is a common anti-pattern: an Airflow DAG with 200 lines of business logic in PythonOperator calls is impossible to test or version separately.
Idempotency: the property that makes pipelines recoverable
A pipeline task is idempotent if running it twice produces the same result as running it once. This is what lets you safely retry, safely backfill, and safely recover from a half-failed run.
Non-idempotent transform (broken):
INSERT INTO daily_revenue (day, revenue)
SELECT CURRENT_DATE, SUM(amount) FROM orders WHERE day = CURRENT_DATE;
-- Re-running this on the same day duplicates the row.Idempotent transform (correct):
MERGE INTO daily_revenue tgt
USING (
SELECT day, SUM(amount) AS revenue
FROM orders
WHERE day = '{{ run_date }}'
GROUP BY day
) src
ON tgt.day = src.day
WHEN MATCHED THEN UPDATE SET revenue = src.revenue
WHEN NOT MATCHED THEN INSERT (day, revenue) VALUES (src.day, src.revenue);The mental model: a task takes an input partition (a date, a hour, a window), produces an output partition, and overwriting the output partition is safe.
Watermarks and late-arriving data
In batch pipelines, a 'day' is straightforward in the warehouse but not in the source. Events arrive late (mobile app was offline, network glitch, time zones, clock skew). What about an event with event_time = 2024-04-12T23:59:00Z that arrives at 2024-04-13T01:30:00Z?
Two approaches:
- Closed window with grace period: the daily-revenue job for 2024-04-12 runs not at 00:00 the next day but at, say, 02:00 with a 2-hour grace. Anything arriving later goes into the next day's partition (deliberate inaccuracy) or triggers a re-run (correct but operationally painful).
- Reprocessing window: every day re-runs the last 7 days. Late arrivals up to 7 days late are eventually corrected. Simple, expensive, surprisingly common in production.
In streaming pipelines this is the famous watermark concept: the system tracks the maximum event time it has seen and treats everything older than watermark - allowed_lateness as 'too late, dropped or sent to a side output'. Same idea, different mechanics.
Backfills: the operational reality
A backfill is when you re-run the pipeline for a historical window. Reasons: you discovered a bug in a transform, you added a new column and need to populate it for old data, you onboarded a new source.
The questions to design for upfront:
- Are tasks parameterized by date so 'run for 2024-04-12' is a one-line invocation? If not, backfills are pull-out-the-laptop manual work.
- Is each partition's output independent so backfills can run in parallel? If task N+1 reads task N's output for the SAME day only, parallel days are safe. If it reads across days, you have a serial backfill.
- Will the backfill exceed warehouse / source capacity? Backfilling 3 years of daily jobs in parallel can melt your source database. Plan for rate-limited backfills.
- Do downstream consumers handle the partial state during the backfill? A dashboard reading a partially backfilled mart shows misleading numbers. Use a 'backfill-in-progress' flag or a side table to swap atomically at the end.
Data quality: the testing layer pipelines almost never have
Software engineers test their code. Data engineers historically did not. The result: pipelines that run successfully (no exceptions) but produce wrong numbers, often for months.
The modern toolkit:
- Schema tests: column exists, column is non-null, column is in expected enum. dbt tests are the most common implementation; Great Expectations is the standalone library.
- Volume tests: row count is within X% of yesterday's. A daily ingestion that suddenly returns 0 rows is almost always a silent bug, not a real number.
- Freshness tests: the latest row in this table is at most N hours old. Catches stalled ingestion.
- Referential tests: every
customer_idinfct_ordersexists indim_customer. Catches join skew and bad upstream data. - Distribution tests: average order value is within 3 standard deviations of the trailing 30 days. Catches semantic regressions a schema test would miss.
The rule: every published mart has at least the first three. Marts that drive money have all five.
Schema evolution
Upstream schemas change. A new column is added, a column is renamed, a type is widened. Your pipeline has to survive this without silently dropping data or crashing.
Defenses, in order of strength:
- Schema-on-read (raw layer): dump JSON or Parquet with permissive schema in the raw layer. Loss is impossible; the cleaning happens downstream where you can adapt.
- Schema registry (Confluent / Glue): the producer registers a schema; the pipeline checks compatibility (backward, forward, full). Breaks at deploy time, not at runtime.
- Contracts: explicit upstream-downstream agreements ('this team owns this table; here is its schema and SLA'). Common in data-mesh setups.
- Tests in the staging layer: 'these columns must exist with these types'. Catches anything the contract did not.
The rule: never let an upstream schema change reach the marts unmediated. Always have a staging layer that pins the column names and types your marts expect.
Lineage: knowing what depends on what
When a transform breaks, the question is 'what downstream tables, dashboards, ML features, and consumers are affected?'. Without lineage, the answer is 'we will know when someone complains'.
Lineage is a graph of which dataset is built from which. Modern tools (dbt's manifest, Dagster's asset graph, OpenLineage standard, vendor offerings like Monte Carlo / Datafold) emit this graph automatically from the pipeline definition.
Lineage is also the foundation for impact analysis ('if I change this column, what breaks?') and incident response ('this dashboard is wrong; trace back to find which upstream is bad').
Implementation
A typical modern stack (pick one per row)
| Layer | Common choices |
|---|---|
| Ingestion | Fivetran / Airbyte (managed), Debezium (CDC), Spark / Beam (custom) |
| Storage (raw) | S3 / GCS Parquet, Snowflake / BigQuery raw schemas |
| Transformation | dbt (SQL), Spark / Databricks (mixed), Beam / Flink (streaming) |
| Orchestration | Airflow, Dagster, Prefect |
| Quality | dbt tests, Great Expectations, Monte Carlo |
| Lineage / catalog | dbt docs, OpenLineage, DataHub, Atlan |
A realistic small-to-mid company stack: Fivetran for SaaS sources + Debezium for the OLTP DB + Snowflake as warehouse + dbt for transforms + Airflow for orchestration + dbt tests + dbt docs for lineage.
Orchestrating one daily mart
A daily revenue mart in dbt + Airflow:
Python (Airflow DAG)
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta
with DAG(
'daily_revenue',
start_date=datetime(2024, 1, 1),
schedule_interval='0 2 * * *', # 02:00 UTC, 2-hour grace for late events
catchup=False,
default_args={'retries': 3, 'retry_delay': timedelta(minutes=15)},
) as dag:
ingest = BashOperator(
task_id='ingest_orders',
bash_command='fivetran trigger orders_connector --wait'
)
transform = BashOperator(
task_id='dbt_run',
bash_command='dbt run --select +mart_revenue_daily --vars run_date={{ ds }}'
)
test = BashOperator(
task_id='dbt_test',
bash_command='dbt test --select +mart_revenue_daily'
)
publish = BashOperator(
task_id='publish_metric',
bash_command='python publish_metric.py --date {{ ds }}'
)
ingest >> transform >> test >> publishJavaScript (TypeScript Dagster equivalent sketch using the JS CLI)
// Dagster is Python-native. This is a Node script that triggers a Dagster job
// via its GraphQL API for use in JS-controlled pipelines.
async function triggerRun(runDate) {
const res = await fetch('https://dagster.example.com/graphql', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
query: 'mutation($pipeline: String!, $vars: JSON!) { launchRun(pipeline: $pipeline, runConfigData: $vars) { runId } }',
variables: { pipeline: 'daily_revenue', vars: { run_date: runDate } },
}),
});
const json = await res.json();
return json.data.launchRun.runId;
}
await triggerRun('2024-04-12');The dbt model itself is a parameterized SQL file:
-- models/marts/mart_revenue_daily.sql
{{ config(materialized='incremental', unique_key='day') }}
SELECT
DATE(o.created_at) AS day,
SUM(o.amount) AS revenue,
COUNT(DISTINCT o.customer_id) AS paying_customers
FROM {{ ref('stg_orders') }} o
WHERE 1 = 1
{% if is_incremental() %}
AND DATE(o.created_at) = '{{ var("run_date") }}'::date
{% endif %}
GROUP BY 1And the test:
-- tests/mart_revenue_daily_volume.sql
-- Fail if today's revenue is more than 5x or less than 0.2x yesterday's.
WITH r AS (
SELECT day, revenue FROM {{ ref('mart_revenue_daily') }}
WHERE day >= CURRENT_DATE - 1
)
SELECT today.day
FROM (SELECT * FROM r WHERE day = CURRENT_DATE) today
JOIN (SELECT * FROM r WHERE day = CURRENT_DATE - 1) yest
ON 1 = 1
WHERE today.revenue > yest.revenue * 5 OR today.revenue < yest.revenue * 0.2This is a complete, idempotent, tested daily pipeline in maybe 60 lines.
Streaming pipeline sketch
A streaming variant for live operational metrics, using Flink as the engine and Kafka as input:
JavaScript (sample Flink job structured in code)
// Conceptual Flink job in TypeScript-like pseudocode for a JS-based codebase
// using a hypothetical Flink Node binding. Real Flink jobs are Java / Scala / Python.
const stream = env.fromKafkaTopic('orders');
const windowed = stream
.keyBy(e => e.region)
.window(TumblingEventTimeWindows.of(Time.minutes(1)))
.allowedLateness(Time.minutes(5))
.reduce((a, b) => ({ region: a.region, revenue: a.revenue + b.revenue }));
windowed.sinkTo(new JdbcSink('mart_revenue_per_minute'));
env.execute('live-revenue');Python (PyFlink equivalent)
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.common import Time
from pyflink.datastream.window import TumblingEventTimeWindows
env = StreamExecutionEnvironment.get_execution_environment()
stream = env.add_source(kafka_source('orders'))
windowed = (stream
.key_by(lambda e: e['region'])
.window(TumblingEventTimeWindows.of(Time.minutes(1)))
.allowed_lateness(Time.minutes(5))
.reduce(lambda a, b: {'region': a['region'], 'revenue': a['revenue'] + b['revenue']}))
windowed.add_sink(jdbc_sink('mart_revenue_per_minute'))
env.execute('live-revenue')Note the explicit watermark / allowed-lateness; this is the streaming analog of the batch grace period.
When to Use
Use ELT (modern default) when
- You have a cloud warehouse (Snowflake, BigQuery, Redshift, Databricks SQL).
- The transformations are mostly SQL-shaped (joins, aggregations, window functions).
- Analysts and data scientists need to read and contribute to transforms.
- You want raw data preserved for later recomputation.
Use ETL when
- Compliance forbids landing raw PII / PHI in the warehouse.
- The transform needs computation the warehouse cannot do (heavy ML inference, image processing, complex graph algorithms).
- The volume is so large that pre-aggregation upstream is cheaper than warehouse compute.
- You are stuck with an old on-prem warehouse where compute is fixed and expensive.
Use streaming pipelines when
- A real business decision is made on data younger than 5 minutes (fraud detection, live trading, real-time ML).
- The cost (5-10x batch) is justified by the value of the freshness.
- The team has Flink / Kafka Streams operational experience or is ready to invest in it.
Use batch pipelines when
- Hourly or daily latency is enough for the consumer (almost all BI, finance, analytics).
- Operational simplicity matters more than freshness.
- Cost optimization matters and batch lets you use spot instances and bulk warehouse pricing.
Pick orchestrators by team
- Airflow if the team already runs it; it is the safe boring choice.
- Dagster if you are starting fresh and your model is asset-centric (dbt + Python). The asset model maps better to modern data thinking.
- Prefect if you want a managed cloud offering and lighter weight.
- Step Functions / Cloud Composer for simple, low-volume pipelines on AWS / GCP without a dedicated platform team.
Case Studies
Airbnb's Minerva (metrics platform)
Airbnb's data platform team described a multi-year journey from ad-hoc SQL queries to a centralized metric definition layer (Minerva). The lesson is not the specific tool but the architectural insight: dashboards multiplied faster than the team could keep them consistent, so they pulled metric definitions out of dashboards and into versioned, reviewed SQL with explicit lineage. Pipelines populate the metrics; consumers query the metrics, not the underlying tables. This pattern (a 'semantic layer' between marts and consumers) is now standard at large data orgs.
Lesson: pipelines stop scaling the moment metric definitions live in many places. Centralize early.
Netflix's data infrastructure
Netflix has publicly described a stack with S3 as the raw data lake, Iceberg for table format, Spark for heavy batch transforms, Flink for streaming, and Genie for orchestration. The hard problem they emphasized in talks is not 'process N petabytes' (that part is solved), it is 'know which dataset is correct, fresh, and lineage-traceable when something breaks'. This is why data quality and lineage tooling are first-class at their scale.
Lesson: at petabyte scale the bottleneck shifts from compute to confidence. Invest in observability and lineage.
Stripe's data ingestion (CDC at scale)
Stripe has discussed using CDC (binlog-based change capture) from many sharded source databases into their warehouse, with strict ordering and reconciliation guarantees. They built dedicated reconciliation pipelines that re-derive aggregates from the warehouse and compare them against the source-of-truth ledger. Disagreements page on-call. This is the kind of investment financial-services data demands.
Lesson: when wrong numbers cost money, build pipelines that reconcile against the source, not just transforms that hope.
Spotify's event delivery system
Spotify's data platform uses a Kafka-based event delivery layer with strong schema management. Producers register schemas; the pipeline rejects events that violate the contract. Downstream pipelines therefore never need to handle 'a field appeared one day with a different type'. The cost is more upfront discipline; the benefit is dramatically fewer downstream incidents.
Lesson: schema contracts at the producer are cheaper than schema repair downstream.
Quick Review
- ELT (load raw, then transform in the warehouse with SQL) is the modern default; ETL still applies for PII, heavy compute, or on-prem.
- Layer pipelines as raw / staging / marts. Dashboards query marts, never raw or staging.
- Choose ingestion per source: full snapshot, incremental append, incremental upsert, CDC, or event stream.
- Latency budget drives batch vs micro-batch vs streaming. Default to batch and only escalate when the business value is real.
- Idempotency, watermarking, and parameterized backfills are the properties that make pipelines recoverable.
- Tests (schema, volume, freshness, referential, distribution) catch silent corruption that the runtime would never raise.
- Lineage is the prerequisite for incident response and impact analysis at scale.
Real-World Examples
How real systems implement this in production
Airbnb's metrics platform centralized metric definitions out of dashboards and into versioned SQL with explicit lineage and ownership. Pipelines populate the metric layer; consumers query the metrics. Public engineering blog posts described the multi-year migration and the cultural change required. The architecture pattern (semantic layer between marts and consumers) is now widely copied by other large data orgs.
Trade-off: Centralized metric definitions remove the divergent-dashboard problem, but they also create a chokepoint: changing a metric requires review and migration, slowing ad-hoc analysis.
Netflix runs a multi-petabyte data lake on S3 with the Iceberg table format, Spark for heavy batch transforms, Flink for streaming, and internally built orchestration (Genie, Maestro). Public talks emphasize data quality and lineage tooling as first-class concerns at their scale. The compute is solved; the hard problem is knowing which dataset is correct and traceable.
Trade-off: Building bespoke platform tooling lets them optimize for their scale and cost profile, but it is a permanent investment in platform engineering that smaller companies cannot justify.
Stripe has publicly discussed CDC-based ingestion from sharded source databases into their warehouse, with reconciliation pipelines that re-derive aggregates from the warehouse and compare against the source-of-truth ledger. Disagreements page on-call. This is the kind of belt-and-braces discipline financial-services data requires.
Trade-off: Reconciliation pipelines double the compute and engineering work, but they catch silent data bugs before they reach billing or finance reports where errors are extremely costly.
Spotify's data platform uses Kafka-based event delivery with strict schema management at the producer side. Events that violate the registered schema are rejected at ingest, so downstream pipelines never see malformed data. The platform team has described this in talks and blog posts as the foundation that makes thousands of downstream pipelines tractable.
Trade-off: Producer-side contracts shift effort upstream (every schema change requires producer review) but eliminate a huge class of silent downstream corruption.
Quick Interview Phrases
Key terms to use in your answer
Common Interview Questions
Questions you might be asked about this topic
Use ELT into a cloud warehouse (Snowflake / BigQuery / Redshift). Ingestion: CDC from Postgres via Debezium or a managed tool (Fivetran), Stripe via Fivetran or its events API, Segment direct to warehouse. Land everything in a raw schema. Build staging models per source (clean, type, dedup). Build marts that join across sources (`fct_orders`, `dim_customer`, `mart_revenue_daily`). Orchestrate with Airflow or Dagster on hourly or daily cadence. Add dbt tests for schema, volume, freshness, and referential integrity. Mention lineage via dbt docs / OpenLineage so impact analysis is possible. Default to hourly batch unless a real-time use case is justified.
Define explicitly what 'late' means and choose a policy. Option 1: closed-window with a grace period (run at 02:00 UTC for the previous day, dropping or side-channeling anything later). Option 2: reprocessing window (every day re-runs the last 7 days) so late events are eventually included. Option 3 (streaming): explicit watermark with allowed-lateness, late events go to a side output. Mention the trade-off: dropped accuracy vs cost of reprocessing. Most batch teams choose option 2 because it is operationally simple.
1) Confirm the upstream raw data covers the window (otherwise re-ingest first). 2) Verify the task is idempotent on its date partition. 3) Run the backfill in a parameterized way: `dbt run --vars run_date=...` per day, parallelized within source / warehouse capacity limits. 4) Avoid melting the warehouse: cap parallelism, use a dedicated warehouse / virtual cluster for the backfill. 5) Hold downstream publishing until the backfill completes (or use a 'backfill in progress' flag) so dashboards do not show partial state. 6) Re-run tests on the backfilled output before publishing. 7) Document the incident and the fix in a runbook.
Layer your tests. Schema tests: columns exist with expected types, primary keys are unique and non-null. Volume tests: row count is within X% of yesterday's. Freshness tests: latest row is at most N hours old. Referential tests: every foreign key resolves. Distribution tests: averages and percentiles are within tolerance of the trailing window. Wire failures to alert the owning team, not silently to a dashboard nobody reads. For business-critical numbers, add reconciliation against the source-of-truth system (e.g., revenue mart vs Stripe's API).
Only when a real business decision is made on data younger than ~5 minutes (fraud detection, live trading, ML feature serving, real-time alerting). The cost (5-10x batch in infrastructure and on-call burden) must be justified by the value of the freshness. If the honest answer to 'what decision happens faster than hourly' is 'none', stay with batch and avoid the operational complexity. Push back on PMs who ask for 'real-time' as an aesthetic preference.
Interview Tips
How to discuss this topic effectively
Default to ELT and only justify ETL when you have a real reason (PII, on-prem, heavy non-SQL compute). It signals you understand modern stacks.
Always describe the layered model. 'Raw is a verbatim copy, staging cleans, marts are business-shaped' is a one-line answer that telegraphs maturity.
When asked 'how do you handle real-time?', interrogate the requirement first. Senior signal: 'what business decision is made faster than every hour?' If the answer is 'we just want it to feel modern', stay with batch.
Volunteer idempotency before you are asked. Saying 'each task is idempotent on its date partition so backfills are safe' is a senior tell.
Mention data quality tests by name (schema, volume, freshness, referential). It separates 'I have built pipelines' from 'I have read about pipelines'.
Common Mistakes
Pitfalls to avoid in interviews
Putting business logic in Airflow PythonOperators instead of dbt / Spark
The orchestrator schedules and observes. The transform engine transforms. Mixing them gives you untestable, unversioned business logic spread across 200-line DAG files. Keep DAGs small and call out to dbt or Spark for the actual work.
Letting dashboards query staging or raw tables directly
Dashboards are consumers; they should depend only on marts. If dashboards query staging, every refactor in the staging layer breaks production analytics. The mart layer is the contract.
Treating a successful pipeline run as proof that the data is correct
A pipeline can run for months producing wrong numbers (a join lost rows, a unit got swapped, an upstream renamed a column you defaulted to NULL). You need explicit volume, freshness, distribution, and referential tests. No tests = no signal.
Hand-rolling backfills as one-off scripts
Parameterize every task by its run date so 'backfill April 2024' is `airflow backfill --start-date 2024-04-01 --end-date 2024-04-30`. If your tasks are not date-parameterized, every backfill is bespoke labor.
Ignoring schema evolution until it breaks production
Use a permissive raw layer, a schema registry or contract for events, and explicit type / column tests in staging. Then upstream changes break at the staging boundary (where engineers will fix them) instead of silently corrupting marts.
