Stream Processing
System Design
Stream Processing (Kafka Streams, Flink)
Stream processing is the discipline of computing on continuous, unbounded data as it arrives, instead of in periodic batches. This lesson covers the core stream-processing primitives: stateful operators, event time vs processing time, watermarks, windowing (tumbling, sliding, session), exactly-once semantics, and state checkpointing. We compare the leading engines (Kafka Streams, Apache Flink, Spark Structured Streaming) and walk through real production patterns: real-time analytics, fraud detection, ML feature pipelines, and CDC-driven materialized views. By the end you should be able to sketch a Flink pipeline on a whiteboard and defend its windowing and checkpointing choices.
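To make the windowing and watermark vocabulary concrete, here is a minimal sketch in plain Python, not Flink or Kafka Streams code. It assumes integer event timestamps, a watermark defined as the largest event time seen so far minus an allowed lateness, and a policy of dropping events whose window has already been emitted. The names `tumbling_counts`, `window_size`, and `allowed_lateness` are illustrative, not any engine's API.

```python
from collections import defaultdict


def tumbling_counts(events, window_size, allowed_lateness=0):
    """Count events per key in tumbling event-time windows.

    events: iterable of (event_time, key) pairs in arrival order,
            which may differ from event-time order.
    Yields (window_start, {key: count}) once the watermark passes
    the end of a window.
    """
    open_windows = defaultdict(lambda: defaultdict(int))
    watermark = float("-inf")  # assumed: max event time seen minus lateness

    for event_time, key in events:
        start = event_time - event_time % window_size
        if start + window_size <= watermark:
            # The window was already emitted; this sketch drops the late event.
            continue
        open_windows[start][key] += 1
        watermark = max(watermark, event_time - allowed_lateness)

        # Emit every window whose end is at or before the watermark.
        closed = sorted(w for w in open_windows if w + window_size <= watermark)
        for w in closed:
            yield w, dict(open_windows.pop(w))

    # End of input: flush whatever is still open.
    for w in sorted(open_windows):
        yield w, dict(open_windows.pop(w))


events = [(1, "a"), (3, "b"), (12, "a"), (4, "a"), (25, "b")]
results = list(tumbling_counts(events, window_size=10))
```

In this run the event `(4, "a")` arrives after the watermark has reached 12, so its window `[0, 10)` is already closed and the event is dropped. Raising `allowed_lateness` trades result latency for completeness, which is the kind of choice the lesson asks you to defend.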
Batch vs Stream Processing (Lambda/Kappa)
Batch processing computes results over a finite, bounded dataset. Stream processing computes results continuously over an unbounded, ever-arriving dataset. The two paradigms have different latency, cost, correctness, and operational profiles, and choosing the wrong one is among the most expensive architectural mistakes a senior engineer can make. This lesson covers the mental model (bounded vs unbounded data, event time vs processing time, watermarks, windows), the two classical reference architectures (Lambda and Kappa), the modern unified models (Beam, Flink), and the production realities of exactly-once semantics, late data, replays, and operational complexity. The goal is to leave you able to choose batch, streaming, or a hybrid for any system, and to defend the choice in an interview.
Community
Streaming Aggregations With a Single Pass (JS)
Welford's online algorithm for mean and variance, plus a 30-line streaming p99 estimator. The version I use when the data does not fit in memory or arrives over WebSocket.
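As a rough illustration of the single-pass idea, here is Welford's update for mean and variance, sketched in Python rather than the post's JS. The class name `RunningStats` and the choice to report sample variance (dividing by n - 1) are assumptions made for this example. The p99 estimator is not shown.

```python
class RunningStats:
    """Single-pass mean and variance using Welford's update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the old mean (in delta) and the updated mean (in x - self.mean).
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; defined as 0.0 here for fewer than two values.
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.push(value)
```

Only three numbers are kept regardless of how many values arrive, which is why the approach suits data that does not fit in memory.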
Streaming JSONL Parser Without Loading the File
When the file is 8GB you cannot json.load it. Here is the generator-based JSONL reader I ship in every data pipeline, plus the malformed-line policy that has saved me twice.
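A generator-based reader along these lines might look like the sketch below. The function name `read_jsonl` and the `on_error` policy values (`"skip"` or `"raise"`) are assumptions for illustration; the post's actual malformed-line policy is not described here.

```python
import json


def read_jsonl(lines, on_error="skip"):
    """Yield one parsed object per non-empty line, without buffering the input.

    lines: any iterable of strings, for example an open text file.
    on_error: "skip" drops malformed lines, "raise" stops with the line number.
    """
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            if on_error == "raise":
                raise ValueError(f"malformed JSON on line {lineno}: {exc}") from exc
            # on_error == "skip": drop the line and keep going


sample = ['{"id": 1}', "", "{not json}", '{"id": 2}']
records = list(read_jsonl(sample))
```

With a real file, `with open(path, encoding="utf-8") as f: for record in read_jsonl(f): ...` reads one line at a time, so memory use stays flat regardless of file size.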
Iterators, Generators, and Async Generators
One protocol, three layers. The iterator protocol with its single next method, generators as sugar over it, and async generators for streaming data with back-pressure. The lazy pipeline pattern I reach for every week.
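The three layers can be sketched in Python, where the protocol's single method is spelled `__next__` and is called through the built-in `next()`. The helper names below (`naturals`, `squares`, `take`, `ticker`, `collect`) are made up for this example.

```python
import asyncio
from itertools import islice

# Layer 1: the iterator protocol. next() pulls one value at a time.
it = iter([10, 20])
first = next(it)


# Layer 2: generators as sugar over the protocol, composed into a lazy pipeline.
def naturals():
    n = 0
    while True:  # infinite, but nothing runs until a consumer pulls
        yield n
        n += 1


def squares(xs):
    for x in xs:
        yield x * x


def take(n, xs):
    return list(islice(xs, n))


# Layer 3: async generators. The consumer awaits each item, so the producer
# only advances when the consumer is ready for more (pull-based back-pressure).
async def ticker(count, delay=0.0):
    for i in range(count):
        await asyncio.sleep(delay)
        yield i


async def collect(agen):
    return [item async for item in agen]


lazy_result = take(4, squares(naturals()))
async_result = asyncio.run(collect(ticker(3)))
```

Because each stage pulls from the previous one, `take(4, ...)` stops the infinite `naturals()` source after four values without any explicit cancellation.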
