System Design Article

Data Pipelines & ETL/ELT

Difficulty: Medium

Data pipelines move data from operational systems (your transactional databases, event logs, third-party APIs) into analytical systems (warehouses, lakes, search indexes, ML feature stores). The 'shape' of the pipeline (ETL vs ELT, batch vs incremental, push vs pull) determines latency, cost, and how painful schema changes will be. This lesson covers the architectural choices: ingestion patterns, transformation engines (dbt, Spark, Beam), orchestration (Airflow, Dagster, Prefect), data quality, lineage, and the standard production layout (raw / staging / mart). It also covers the failure modes you must design for: late-arriving data, idempotency, backfills, schema evolution, and the silent corruption that comes from not testing your pipelines.

Data Pipelines & ETL/ELT

System Design

Medium

data-pipelines

etl

elt

data-engineering

system-design

advanced

premium

orchestration

161 views

Motivation

A running product generates data in dozens of shapes: rows in Postgres, events in Kafka, files dumped by partners, third-party API responses, click logs. Almost none of it is in the shape that analytics, ML, or finance teams actually need. The job of a data pipeline is to take that mess and produce trusted, queryable, well-modeled data in the analytical store.

The naive approach is one ad-hoc script per question. Marketing wants daily signups by country; someone writes a cron that copies a Postgres table to a CSV and uploads it. Six months later the company has 80 such scripts, no two are documented, half of them silently broke when someone renamed a column, and the dashboards quietly disagree with the source database.

The disciplined approach is a pipeline: a defined set of stages, each one's input and output explicit, each step orchestrated, tested, monitored, and version-controlled. The right architecture turns 'analytics data is broken again' from a weekly incident into a non-event.

Why this matters at senior level: pipelines are where business decisions get their numbers. A bad pipeline does not crash, it lies. The lying is invisible until the CFO presents the wrong revenue number to the board. Architecting pipelines that are correct, observable, and recoverable is a senior data role at every serious tech company.

Deep Dive

ETL vs ELT: the modern default flipped

ETL (Extract, Transform, Load) was the dominant pattern for 30 years. You pulled data from sources, transformed it on a separate engine (Informatica, Talend, custom Python), then loaded the cleaned result into the warehouse.

Why ETL? Warehouses used to be expensive, slow, and limited (Teradata, Oracle Exadata). You did NOT want to pay them to do generic SQL transforms; you bought a separate transformation engine.

ELT (Extract, Load, Transform) became the new default once cloud warehouses (Snowflake, BigQuery, Redshift) made compute cheap and elastic. You dump raw source data into the warehouse first, then transform inside the warehouse with SQL.

Text

ETL:
  [ Source ] -> [ Transform engine (Spark / Python) ] -> [ Warehouse ]

ELT:
  [ Source ] -> [ Warehouse: raw layer ] -> [ Warehouse: SQL transforms ] -> [ Marts ]

The ELT advantages are real:

Raw data is preserved. If you discover six months later that a transform was wrong, you can recompute from raw without re-extracting from the source.
SQL is universal. Analysts and data scientists can read and contribute. Spark / Python pipelines are a separate skill.
Cloud warehouse compute is elastic. You scale up for a backfill and down afterward.
Tools (dbt above all) standardized this. Every modern team's transformation layer is some flavor of versioned SQL.

ETL is still right when:

The transformation needs computation the warehouse cannot do (heavy ML, image processing, complex joins on huge volumes that would melt the warehouse).
Compliance forbids landing raw PII in the warehouse; you must mask before loading.
You are working on-prem with an old warehouse and cannot trivially scale compute.

The standard layered pipeline (raw / staging / mart)

Every serious team converges on roughly this layout, sometimes named differently (bronze / silver / gold in Databricks land, raw / staging / mart in dbt land):

Text

+--------+    +-----------+    +--------------+    +----------+
| Source | -> | Raw layer | -> | Staging      | -> | Marts    |
|        |    | (as-is)   |    | (cleaned,    |    | (busi-   |
|        |    |           |    |  typed,      |    |  ness    |
|        |    |           |    |  deduped)    |    |  views)  |
+--------+    +-----------+    +--------------+    +----------+
                  |                  |                   |
                  |                  |                   |
             schema close       1 model per       1 model per
             to source          source table      use case

Raw: a near-verbatim copy of the source. Schema close to source. No business logic. The point of the raw layer is fidelity and idempotency: re-running ingestion of the same window must produce the same raw rows.
Staging: light cleaning. Cast types, rename columns, drop obvious junk, deduplicate. One staging model per source table. Still source-shaped, not business-shaped.
Marts: business-shaped. dim_customer, fct_orders, mart_revenue_daily. Multiple sources are joined, business definitions encoded ('what counts as an active user'), and these are what dashboards and downstream consumers query.

The rule: dashboards never query raw or staging directly. They query marts. This is what lets you refactor the upstream layers without breaking every dashboard.

Ingestion patterns

How data lands in the raw layer falls into a few standard shapes.

Full snapshot: every run replaces the entire table. Simple, deterministic, but expensive and loses history. Fine for small dimension tables (10k country codes), wrong for fact tables (10B order rows).

Incremental append: each run adds rows newer than the last high-water mark. Efficient for append-only data (events, logs, immutable orders). Watch for clock skew and late-arriving rows.

Incremental upsert (merge): rows can change. Each run upserts records by primary key. Requires either a reliable updated_at column on the source or change data capture (CDC).

Change data capture (CDC): a stream of row-level inserts / updates / deletes from the source database, usually via the WAL / binlog. Tools: Debezium, Fivetran, AWS DMS. Best fidelity, but more moving parts.

Event ingestion: the source produces events (Kafka, Kinesis, webhook) and the pipeline consumes them in micro-batches (every minute) or in true streaming.

A single product usually uses all five for different sources at the same time. Choosing the right one per source is the first design decision.

Batch vs micro-batch vs streaming

Latency budget drives the choice:

Pattern	Typical latency	Tools	When
Daily batch	hours	Airflow + dbt	Finance, monthly reporting
Hourly batch	~1 hour	Airflow + dbt + incremental	Most BI dashboards
Micro-batch	1-5 minutes	Spark Structured Streaming, Databricks	Operational dashboards, near-real-time alerts
True streaming	seconds	Flink, Kafka Streams	Fraud detection, live ML features

The trap: every product manager asks for 'real-time'. The cost of going from hourly batch to true streaming is often 5-10x in infrastructure and engineering. Push back: 'what business decision is made faster than every hour, and what is it worth?' If the honest answer is 'none', stay with hourly batch.

Orchestration: the layer that ties it together

A modern pipeline is dozens of tasks with dependencies. The orchestrator is what runs them in the right order, retries the failed ones, and tells you when something broke.

The market converged on a few:

Apache Airflow: the de facto standard. Python DAGs, mature ecosystem. Showing its age (operator-centric model, weak data-awareness) but ubiquitous.
Dagster: data-asset-centric (you declare assets and their dependencies, not tasks). Better for the modern dbt + Python world.
Prefect: lighter weight, good UX. Strong cloud offering.
Cloud-native (Step Functions, Cloud Composer, Workflows): simple cases on AWS / GCP.

The orchestrator's job is scheduling, dependencies, retries, observability, and backfills. It is NOT where your transformations live (that is dbt / Spark). Mixing them is a common anti-pattern: an Airflow DAG with 200 lines of business logic in PythonOperator calls is impossible to test or version separately.

Idempotency: the property that makes pipelines recoverable

A pipeline task is idempotent if running it twice produces the same result as running it once. This is what lets you safely retry, safely backfill, and safely recover from a half-failed run.

Non-idempotent transform (broken):

SQL

INSERT INTO daily_revenue (day, revenue)
SELECT CURRENT_DATE, SUM(amount) FROM orders WHERE day = CURRENT_DATE;
-- Re-running this on the same day duplicates the row.

Idempotent transform (correct):

SQL

MERGE INTO daily_revenue tgt
USING (
  SELECT day, SUM(amount) AS revenue
  FROM orders
  WHERE day = '{{ run_date }}'
  GROUP BY day
) src
ON tgt.day = src.day
WHEN MATCHED THEN UPDATE SET revenue = src.revenue
WHEN NOT MATCHED THEN INSERT (day, revenue) VALUES (src.day, src.revenue);

The mental model: a task takes an input partition (a date, a hour, a window), produces an output partition, and overwriting the output partition is safe.

Watermarks and late-arriving data

In batch pipelines, a 'day' is straightforward in the warehouse but not in the source. Events arrive late (mobile app was offline, network glitch, time zones, clock skew). What about an event with event_time = 2024-04-12T23:59:00Z that arrives at 2024-04-13T01:30:00Z?

Two approaches:

Closed window with grace period: the daily-revenue job for 2024-04-12 runs not at 00:00 the next day but at, say, 02:00 with a 2-hour grace. Anything arriving later goes into the next day's partition (deliberate inaccuracy) or triggers a re-run (correct but operationally painful).
Reprocessing window: every day re-runs the last 7 days. Late arrivals up to 7 days late are eventually corrected. Simple, expensive, surprisingly common in production.

In streaming pipelines this is the famous watermark concept: the system tracks the maximum event time it has seen and treats everything older than watermark - allowed_lateness as 'too late, dropped or sent to a side output'. Same idea, different mechanics.

Backfills: the operational reality

A backfill is when you re-run the pipeline for a historical window. Reasons: you discovered a bug in a transform, you added a new column and need to populate it for old data, you onboarded a new source.

The questions to design for upfront:

Are tasks parameterized by date so 'run for 2024-04-12' is a one-line invocation? If not, backfills are pull-out-the-laptop manual work.
Is each partition's output independent so backfills can run in parallel? If task N+1 reads task N's output for the SAME day only, parallel days are safe. If it reads across days, you have a serial backfill.
Will the backfill exceed warehouse / source capacity? Backfilling 3 years of daily jobs in parallel can melt your source database. Plan for rate-limited backfills.
Do downstream consumers handle the partial state during the backfill? A dashboard reading a partially backfilled mart shows misleading numbers. Use a 'backfill-in-progress' flag or a side table to swap atomically at the end.

Data quality: the testing layer pipelines almost never have

Software engineers test their code. Data engineers historically did not. The result: pipelines that run successfully (no exceptions) but produce wrong numbers, often for months.

The modern toolkit:

Schema tests: column exists, column is non-null, column is in expected enum. dbt tests are the most common implementation; Great Expectations is the standalone library.
Volume tests: row count is within X% of yesterday's. A daily ingestion that suddenly returns 0 rows is almost always a silent bug, not a real number.
Freshness tests: the latest row in this table is at most N hours old. Catches stalled ingestion.
Referential tests: every customer_id in fct_orders exists in dim_customer. Catches join skew and bad upstream data.
Distribution tests: average order value is within 3 standard deviations of the trailing 30 days. Catches semantic regressions a schema test would miss.

The rule: every published mart has at least the first three. Marts that drive money have all five.

Schema evolution

Upstream schemas change. A new column is added, a column is renamed, a type is widened. Your pipeline has to survive this without silently dropping data or crashing.

Defenses, in order of strength:

Schema-on-read (raw layer): dump JSON or Parquet with permissive schema in the raw layer. Loss is impossible; the cleaning happens downstream where you can adapt.
Schema registry (Confluent / Glue): the producer registers a schema; the pipeline checks compatibility (backward, forward, full). Breaks at deploy time, not at runtime.
Contracts: explicit upstream-downstream agreements ('this team owns this table; here is its schema and SLA'). Common in data-mesh setups.
Tests in the staging layer: 'these columns must exist with these types'. Catches anything the contract did not.

The rule: never let an upstream schema change reach the marts unmediated. Always have a staging layer that pins the column names and types your marts expect.

Lineage: knowing what depends on what

When a transform breaks, the question is 'what downstream tables, dashboards, ML features, and consumers are affected?'. Without lineage, the answer is 'we will know when someone complains'.

Lineage is a graph of which dataset is built from which. Modern tools (dbt's manifest, Dagster's asset graph, OpenLineage standard, vendor offerings like Monte Carlo / Datafold) emit this graph automatically from the pipeline definition.

Lineage is also the foundation for impact analysis ('if I change this column, what breaks?') and incident response ('this dashboard is wrong; trace back to find which upstream is bad').

Implementation

A typical modern stack (pick one per row)

Layer	Common choices
Ingestion	Fivetran / Airbyte (managed), Debezium (CDC), Spark / Beam (custom)
Storage (raw)	S3 / GCS Parquet, Snowflake / BigQuery raw schemas
Transformation	dbt (SQL), Spark / Databricks (mixed), Beam / Flink (streaming)
Orchestration	Airflow, Dagster, Prefect
Quality	dbt tests, Great Expectations, Monte Carlo
Lineage / catalog	dbt docs, OpenLineage, DataHub, Atlan

A realistic small-to-mid company stack: Fivetran for SaaS sources + Debezium for the OLTP DB + Snowflake as warehouse + dbt for transforms + Airflow for orchestration + dbt tests + dbt docs for lineage.

Orchestrating one daily mart

A daily revenue mart in dbt + Airflow:

Python (Airflow DAG)

Python

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

with DAG(
    'daily_revenue',
    start_date=datetime(2024, 1, 1),
    schedule_interval='0 2 * * *',  # 02:00 UTC, 2-hour grace for late events
    catchup=False,
    default_args={'retries': 3, 'retry_delay': timedelta(minutes=15)},
) as dag:

    ingest = BashOperator(
        task_id='ingest_orders',
        bash_command='fivetran trigger orders_connector --wait'
    )

    transform = BashOperator(
        task_id='dbt_run',
        bash_command='dbt run --select +mart_revenue_daily --vars run_date={{ ds }}'
    )

    test = BashOperator(
        task_id='dbt_test',
        bash_command='dbt test --select +mart_revenue_daily'
    )

    publish = BashOperator(
        task_id='publish_metric',
        bash_command='python publish_metric.py --date {{ ds }}'
    )

    ingest >> transform >> test >> publish

JavaScript (TypeScript Dagster equivalent sketch using the JS CLI)

JavaScript

// Dagster is Python-native. This is a Node script that triggers a Dagster job
// via its GraphQL API for use in JS-controlled pipelines.
async function triggerRun(runDate) {
    const res = await fetch('https://dagster.example.com/graphql', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
            query: 'mutation($pipeline: String!, $vars: JSON!) { launchRun(pipeline: $pipeline, runConfigData: $vars) { runId } }',
            variables: { pipeline: 'daily_revenue', vars: { run_date: runDate } },
        }),
    });
    const json = await res.json();
    return json.data.launchRun.runId;
}

await triggerRun('2024-04-12');

The dbt model itself is a parameterized SQL file:

SQL

-- models/marts/mart_revenue_daily.sql
{{ config(materialized='incremental', unique_key='day') }}

SELECT
    DATE(o.created_at)            AS day,
    SUM(o.amount)                 AS revenue,
    COUNT(DISTINCT o.customer_id) AS paying_customers
FROM {{ ref('stg_orders') }} o
WHERE 1 = 1
{% if is_incremental() %}
    AND DATE(o.created_at) = '{{ var("run_date") }}'::date
{% endif %}
GROUP BY 1

And the test:

SQL

-- tests/mart_revenue_daily_volume.sql
-- Fail if today's revenue is more than 5x or less than 0.2x yesterday's.
WITH r AS (
    SELECT day, revenue FROM {{ ref('mart_revenue_daily') }}
    WHERE day >= CURRENT_DATE - 1
)
SELECT today.day
FROM (SELECT * FROM r WHERE day = CURRENT_DATE) today
JOIN (SELECT * FROM r WHERE day = CURRENT_DATE - 1) yest
  ON 1 = 1
WHERE today.revenue > yest.revenue * 5 OR today.revenue < yest.revenue * 0.2

This is a complete, idempotent, tested daily pipeline in maybe 60 lines.

Streaming pipeline sketch

A streaming variant for live operational metrics, using Flink as the engine and Kafka as input:

JavaScript (sample Flink job structured in code)

JavaScript

// Conceptual Flink job in TypeScript-like pseudocode for a JS-based codebase
// using a hypothetical Flink Node binding. Real Flink jobs are Java / Scala / Python.
const stream = env.fromKafkaTopic('orders');

const windowed = stream
    .keyBy(e => e.region)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .allowedLateness(Time.minutes(5))
    .reduce((a, b) => ({ region: a.region, revenue: a.revenue + b.revenue }));

windowed.sinkTo(new JdbcSink('mart_revenue_per_minute'));
env.execute('live-revenue');

Python (PyFlink equivalent)

Python

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.common import Time
from pyflink.datastream.window import TumblingEventTimeWindows

env = StreamExecutionEnvironment.get_execution_environment()
stream = env.add_source(kafka_source('orders'))

windowed = (stream
    .key_by(lambda e: e['region'])
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .allowed_lateness(Time.minutes(5))
    .reduce(lambda a, b: {'region': a['region'], 'revenue': a['revenue'] + b['revenue']}))

windowed.add_sink(jdbc_sink('mart_revenue_per_minute'))
env.execute('live-revenue')

Note the explicit watermark / allowed-lateness; this is the streaming analog of the batch grace period.

When to Use

Use ELT (modern default) when

You have a cloud warehouse (Snowflake, BigQuery, Redshift, Databricks SQL).
The transformations are mostly SQL-shaped (joins, aggregations, window functions).
Analysts and data scientists need to read and contribute to transforms.
You want raw data preserved for later recomputation.

Use ETL when

Compliance forbids landing raw PII / PHI in the warehouse.
The transform needs computation the warehouse cannot do (heavy ML inference, image processing, complex graph algorithms).
The volume is so large that pre-aggregation upstream is cheaper than warehouse compute.
You are stuck with an old on-prem warehouse where compute is fixed and expensive.

Use streaming pipelines when

A real business decision is made on data younger than 5 minutes (fraud detection, live trading, real-time ML).
The cost (5-10x batch) is justified by the value of the freshness.
The team has Flink / Kafka Streams operational experience or is ready to invest in it.

Use batch pipelines when

Hourly or daily latency is enough for the consumer (almost all BI, finance, analytics).
Operational simplicity matters more than freshness.
Cost optimization matters and batch lets you use spot instances and bulk warehouse pricing.

Pick orchestrators by team

Airflow if the team already runs it; it is the safe boring choice.
Dagster if you are starting fresh and your model is asset-centric (dbt + Python). The asset model maps better to modern data thinking.
Prefect if you want a managed cloud offering and lighter weight.
Step Functions / Cloud Composer for simple, low-volume pipelines on AWS / GCP without a dedicated platform team.

Case Studies

Airbnb's Minerva (metrics platform)

Airbnb's data platform team described a multi-year journey from ad-hoc SQL queries to a centralized metric definition layer (Minerva). The lesson is not the specific tool but the architectural insight: dashboards multiplied faster than the team could keep them consistent, so they pulled metric definitions out of dashboards and into versioned, reviewed SQL with explicit lineage. Pipelines populate the metrics; consumers query the metrics, not the underlying tables. This pattern (a 'semantic layer' between marts and consumers) is now standard at large data orgs.

Lesson: pipelines stop scaling the moment metric definitions live in many places. Centralize early.

Netflix's data infrastructure

Netflix has publicly described a stack with S3 as the raw data lake, Iceberg for table format, Spark for heavy batch transforms, Flink for streaming, and Genie for orchestration. The hard problem they emphasized in talks is not 'process N petabytes' (that part is solved), it is 'know which dataset is correct, fresh, and lineage-traceable when something breaks'. This is why data quality and lineage tooling are first-class at their scale.

Lesson: at petabyte scale the bottleneck shifts from compute to confidence. Invest in observability and lineage.

Stripe's data ingestion (CDC at scale)

Stripe has discussed using CDC (binlog-based change capture) from many sharded source databases into their warehouse, with strict ordering and reconciliation guarantees. They built dedicated reconciliation pipelines that re-derive aggregates from the warehouse and compare them against the source-of-truth ledger. Disagreements page on-call. This is the kind of investment financial-services data demands.

Lesson: when wrong numbers cost money, build pipelines that reconcile against the source, not just transforms that hope.

Spotify's event delivery system

Spotify's data platform uses a Kafka-based event delivery layer with strong schema management. Producers register schemas; the pipeline rejects events that violate the contract. Downstream pipelines therefore never need to handle 'a field appeared one day with a different type'. The cost is more upfront discipline; the benefit is dramatically fewer downstream incidents.

Lesson: schema contracts at the producer are cheaper than schema repair downstream.

Quick Review

ELT (load raw, then transform in the warehouse with SQL) is the modern default; ETL still applies for PII, heavy compute, or on-prem.
Layer pipelines as raw / staging / marts. Dashboards query marts, never raw or staging.
Choose ingestion per source: full snapshot, incremental append, incremental upsert, CDC, or event stream.
Latency budget drives batch vs micro-batch vs streaming. Default to batch and only escalate when the business value is real.
Idempotency, watermarking, and parameterized backfills are the properties that make pipelines recoverable.
Tests (schema, volume, freshness, referential, distribution) catch silent corruption that the runtime would never raise.
Lineage is the prerequisite for incident response and impact analysis at scale.

Real-World Examples

How real systems implement this in production

Airbnb Minerva

Airbnb's metrics platform centralized metric definitions out of dashboards and into versioned SQL with explicit lineage and ownership. Pipelines populate the metric layer; consumers query the metrics. Public engineering blog posts described the multi-year migration and the cultural change required. The architecture pattern (semantic layer between marts and consumers) is now widely copied by other large data orgs.

Trade-off: Centralized metric definitions remove the divergent-dashboard problem, but they also create a chokepoint: changing a metric requires review and migration, slowing ad-hoc analysis.

Netflix data platform

Netflix runs a multi-petabyte data lake on S3 with the Iceberg table format, Spark for heavy batch transforms, Flink for streaming, and internally built orchestration (Genie, Maestro). Public talks emphasize data quality and lineage tooling as first-class concerns at their scale. The compute is solved; the hard problem is knowing which dataset is correct and traceable.

Trade-off: Building bespoke platform tooling lets them optimize for their scale and cost profile, but it is a permanent investment in platform engineering that smaller companies cannot justify.

Stripe ingestion and reconciliation

Stripe has publicly discussed CDC-based ingestion from sharded source databases into their warehouse, with reconciliation pipelines that re-derive aggregates from the warehouse and compare against the source-of-truth ledger. Disagreements page on-call. This is the kind of belt-and-braces discipline financial-services data requires.

Trade-off: Reconciliation pipelines double the compute and engineering work, but they catch silent data bugs before they reach billing or finance reports where errors are extremely costly.

Spotify event delivery

Spotify's data platform uses Kafka-based event delivery with strict schema management at the producer side. Events that violate the registered schema are rejected at ingest, so downstream pipelines never see malformed data. The platform team has described this in talks and blog posts as the foundation that makes thousands of downstream pipelines tractable.

Trade-off: Producer-side contracts shift effort upstream (every schema change requires producer review) but eliminate a huge class of silent downstream corruption.

Quick Interview Phrases

Key terms to use in your answer

ELT into the warehouse

raw, staging, mart layering

idempotent transforms

watermark and late-arriving data

transactional outbox or CDC ingestion

data quality tests on every mart

Common Interview Questions

Questions you might be asked about this topic

Walk me through how you would build the analytics pipeline for a SaaS product with Postgres as the OLTP, Stripe for payments, and Segment for product events.

Use ELT into a cloud warehouse (Snowflake / BigQuery / Redshift). Ingestion: CDC from Postgres via Debezium or a managed tool (Fivetran), Stripe via Fivetran or its events API, Segment direct to warehouse. Land everything in a raw schema. Build staging models per source (clean, type, dedup). Build marts that join across sources (`fct_orders`, `dim_customer`, `mart_revenue_daily`). Orchestrate with Airflow or Dagster on hourly or daily cadence. Add dbt tests for schema, volume, freshness, and referential integrity. Mention lineage via dbt docs / OpenLineage so impact analysis is possible. Default to hourly batch unless a real-time use case is justified.

How do you handle late-arriving events in a daily aggregation?

How do you safely backfill a mart that has been broken for two weeks?

How would you test a pipeline so that silent data corruption gets caught?

When would you choose streaming over batch for a pipeline?

Interview Tips

How to discuss this topic effectively

Default to ELT and only justify ETL when you have a real reason (PII, on-prem, heavy non-SQL compute). It signals you understand modern stacks.

Always describe the layered model. 'Raw is a verbatim copy, staging cleans, marts are business-shaped' is a one-line answer that telegraphs maturity.

When asked 'how do you handle real-time?', interrogate the requirement first. Senior signal: 'what business decision is made faster than every hour?' If the answer is 'we just want it to feel modern', stay with batch.

Volunteer idempotency before you are asked. Saying 'each task is idempotent on its date partition so backfills are safe' is a senior tell.

Mention data quality tests by name (schema, volume, freshness, referential). It separates 'I have built pipelines' from 'I have read about pipelines'.

Common Mistakes

Pitfalls to avoid in interviews

Putting business logic in Airflow PythonOperators instead of dbt / Spark

The orchestrator schedules and observes. The transform engine transforms. Mixing them gives you untestable, unversioned business logic spread across 200-line DAG files. Keep DAGs small and call out to dbt or Spark for the actual work.

Letting dashboards query staging or raw tables directly

Dashboards are consumers; they should depend only on marts. If dashboards query staging, every refactor in the staging layer breaks production analytics. The mart layer is the contract.

Treating a successful pipeline run as proof that the data is correct

A pipeline can run for months producing wrong numbers (a join lost rows, a unit got swapped, an upstream renamed a column you defaulted to NULL). You need explicit volume, freshness, distribution, and referential tests. No tests = no signal.

Hand-rolling backfills as one-off scripts

Parameterize every task by its run date so 'backfill April 2024' is `airflow backfill --start-date 2024-04-01 --end-date 2024-04-30`. If your tasks are not date-parameterized, every backfill is bespoke labor.

Ignoring schema evolution until it breaks production

Use a permissive raw layer, a schema registry or contract for events, and explicit type / column tests in staging. Then upstream changes break at the staging boundary (where engineers will fix them) instead of silently corrupting marts.

Back to System Design