Generators, yield, and Lazy Pipelines

Generators turn a 4GB log-processing job into a 50MB one without changing the consumer code. Here is the mental model, the pipeline pattern I reuse, and the four traps that make hand-rolled generators leak.

By @sofiacollins

May 4, 2026

Updated June 9, 2026

I once watched a colleague's data pipeline crash because they were doing list(open(huge_file)) and the file was 4GB. The fix was deleting the list(...) call. The pipeline ran on a laptop afterwards. That is the entire pitch for generators in three sentences: the consumer did not care that the producer was a list; the producer did not have to materialise; the only real change was that the program stopped running out of memory.

The argument I want to make is that yield is not just a syntactic alternative to building a list. It is a control-flow primitive that lets a function pause, hand a value to the caller, and resume right after that yield the next time the caller asks for a value. That property turns Python into a streaming language wherever you want it to be one, and the pipeline pattern that falls out of it is the cleanest way I know to compose ETL-style code.

The mental model: a generator is a function with a pause button

A regular function runs from top to bottom and returns once. A generator function runs until it hits a yield, hands a value to the caller, and waits. Calling the generator function does not run the body; it returns a generator object. Calling next() on that object runs the body up to the next yield, returns that value, and pauses again.

def counter():
    n = 0
    while True:
        yield n
        n += 1

g = counter()
print(next(g))  # 0
print(next(g))  # 1
print(next(g))  # 2

The local state (n = 0, the while loop position) is preserved across yield. From the function's point of view, nothing strange happened; from the caller's point of view, the function paused and woke up. This single property is what makes generators the right answer for streaming data and lazy pipelines.
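A small sketch of the pause button in action. The events list and the traced_counter name are my own additions, there only to make the order of execution visible; they are not part of the counter example above.

```python
events = []  # hypothetical log, only here to show when the body runs

def traced_counter(limit):
    events.append("body started")
    n = 0
    while n < limit:
        events.append(f"about to yield {n}")
        yield n
        n += 1
    events.append("body finished")

g = traced_counter(2)
events_before_next = list(events)   # still empty: calling the function ran nothing
first = next(g)                     # runs the body up to the first yield
events_after_first = list(events)
rest = list(g)                      # resumes after the yield and runs to the end
```

The snapshot taken before the first next() is empty, which is the point: creating the generator costs nothing until somebody pulls from it.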

A for loop iterates a generator the same way it iterates a list. for x in counter() calls next() repeatedly under the hood, and stops when the generator raises StopIteration (which happens when it reaches a return or runs off the end).
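Spelled out by hand, the loop looks roughly like the sketch below. countdown is a hypothetical finite generator I made up for the demonstration.

```python
def countdown(n):
    # hypothetical finite generator for the demonstration
    while n > 0:
        yield n
        n -= 1

# roughly what `for x in countdown(3): seen.append(x)` does under the hood
seen = []
it = iter(countdown(3))      # for a generator, iter() returns the generator itself
while True:
    try:
        x = next(it)
    except StopIteration:    # raised when the generator body finishes
        break
    seen.append(x)
```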

The first win: streaming a file without loading it

The canonical case is reading a large file line-by-line. The naive version blows up memory; the generator version does not.

# bad: reads the whole file into memory
def read_log_lines(path):
    return open(path).readlines()

# good: yields one line at a time
def read_log_lines(path):
    with open(path) as f:
        for line in f:
            yield line.rstrip('\n')

The second version uses a constant amount of memory regardless of file size. Note the with statement: the generator owns the file handle, and the handle stays open for as long as the generator is paused inside the with block. The file is closed when the generator runs to completion, when someone calls close() on it, or when it is garbage-collected (in CPython that usually happens as soon as the last reference disappears). If you forget the with and the consumer bails early, nothing closes the handle deterministically and it can leak.

For stdlib ergonomics, files are themselves iterators of lines, so I usually skip the wrapper entirely:

with open(path) as f:
    for line in f:
        process(line)

But once you need to do something more than iterate (say, parse JSON, or skip headers), wrapping the file in a generator that yields parsed records is the right call.
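A minimal sketch of such a wrapper, assuming a JSON-lines file whose header lines start with '#'. The read_records name and the header convention are assumptions for illustration, and io.StringIO stands in for an open file so the example runs anywhere.

```python
import io
import json

def read_records(lines):
    """Yield one parsed dict per JSON line, skipping blanks and '#' header lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue          # header or blank line: nothing to parse
        yield json.loads(line)

# made-up sample content; a real caller would pass an open file instead
sample = io.StringIO(
    '# header line\n'
    '{"level": "info"}\n'
    '\n'
    '{"level": "error", "user_id": 7}\n'
)
records = list(read_records(sample))
```

The wrapper accepts any iterable of strings, so the same function works on a file object, a list of lines in a test, or the output of another generator.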

A worked benchmark: list versus generator on a real file

If the memory story still feels abstract, tracemalloc makes it concrete. I generated a 200MB log file (about 2 million lines) and read it two ways: once into a list, once through a generator. Same total bytes processed, very different peak memory.

import tracemalloc

def sum_lengths_eager(path):
    with open(path) as f:
        lines = f.readlines()  # materialises the whole file
    return sum(len(line) for line in lines)

def sum_lengths_lazy(path):
    with open(path) as f:
        return sum(len(line) for line in f)  # one line at a time

for name, fn in [('eager', sum_lengths_eager), ('lazy', sum_lengths_lazy)]:
    tracemalloc.start()
    fn('big.log')
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"{name}: peak {peak / 1024 / 1024:.1f} MB")

On my laptop the eager version peaked around 230MB; the lazy version peaked around 0.05MB. Both produced the same answer in about the same wall time, because the bottleneck is disk reads, not allocation. The generator did not buy speed; it bought a reduction in peak memory of more than 4000x. That is the single number I keep in my head whenever I am tempted to call .readlines() on something I do not control the size of.

The pipeline pattern

The trick that turned generators from "a curiosity" into "a daily tool" for me was learning to compose them. Each stage of an ETL is a generator that takes an iterable, transforms it, and yields downstream. Five lines of glue, and you have a pipeline that processes terabytes in constant memory.

import json

def parse_lines(lines):
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue

def filter_errors(records):
    for r in records:
        if r.get('level') == 'error':
            yield r

def extract_user_ids(errors):
    for e in errors:
        uid = e.get('user_id')
        if uid:
            yield uid

# the pipeline
with open('app.log') as f:
    pipeline = extract_user_ids(filter_errors(parse_lines(f)))
    for user_id in pipeline:
        record_to_db(user_id)

Nothing materialises until the final for loop pulls. The file is opened once, parsed once, filtered once, and each error's user ID lands in the database one at a time. The intermediate pipeline variable is a generator object, not a list, even though the call sites read like list operations.
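To see the laziness rather than take it on faith, here is a toy two-stage pipeline that records who ran when. The trace list and the source and double stages are hypothetical stand-ins for the real stages above.

```python
trace = []  # hypothetical event log, only for illustration

def source():
    for n in (1, 2, 3):
        trace.append(f"source yields {n}")
        yield n

def double(items):
    for n in items:
        trace.append(f"double sees {n}")
        yield n * 2

pipeline = double(source())
trace_before = list(trace)     # empty: building the pipeline pulled nothing
results = []
for value in pipeline:
    trace.append(f"consumer gets {value}")
    results.append(value)
```

The trace shows each item travelling through every stage before the next item is read, which is why no stage ever holds more than one item at a time.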

When I am writing this style of pipeline, I think of each stage as a Unix-pipe equivalent. parse_lines is jq. filter_errors is grep. extract_user_ids is awk. The composition has the same shell-pipe shape; Python just lets me write it in one process with type annotations.

Generator expressions: the inline form

Most pipeline stages do not need a named function. A generator expression (the parenthesised cousin of a list comprehension) is the inline form.

import json

with open('app.log') as f:
    parsed = (json.loads(line) for line in f)
    errors = (r for r in parsed if r.get('level') == 'error')
    user_ids = (r['user_id'] for r in errors if r.get('user_id'))
    for uid in user_ids:
        record_to_db(uid)

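Same data flow, no named functions. One behavioural difference: this version has no try/except around json.loads, so a malformed line raises JSONDecodeError and stops the pipeline instead of being skipped. I prefer the named-function form when the stages are reusable or have non-trivial logic, and the expression form for ad-hoc transformations. Mixing both in the same pipeline is fine.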

yield from: delegating to another generator

When one generator wants to flatten the output of another, yield from is the clean way.

import os

def walk_directory(path):
    for entry in os.listdir(path):
        full = os.path.join(path, entry)
        if os.path.isdir(full):
            yield from walk_directory(full)
        else:
            yield full

Without yield from, the recursive case would need an inner for loop to re-yield each value. yield from does the loop for you, plus it forwards send, throw, and return correctly (which matters if you are using generator-as-coroutine). For the common case of "this generator should produce all of that generator's values", yield from is one keyword for the right behaviour.
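For comparison, here are the two spellings side by side on a nested list instead of a directory tree, so the sketch runs without touching the filesystem. flatten_loop and flatten are hypothetical names.

```python
def flatten_loop(nested):
    for item in nested:
        if isinstance(item, list):
            for inner in flatten_loop(item):   # re-yield each value by hand
                yield inner
        else:
            yield item

def flatten(nested):
    for item in nested:
        if isinstance(item, list):
            yield from flatten(item)           # same values, one line
        else:
            yield item

data = [1, [2, [3, 4]], 5]
flat_loop = list(flatten_loop(data))
flat = list(flatten(data))
```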

itertools: the missing pipeline operators

Writing custom generator stages is fine, but itertools already ships the three I reach for most. They take an iterable in, give an iterator out, and chain into pipelines without materialising anything.

itertools.chain concatenates iterables. The shape "process file A, then file B, as if they were one stream" is one call.

from itertools import chain

def all_log_lines(paths):
    return chain.from_iterable(open(p) for p in paths)

for line in all_log_lines(['yesterday.log', 'today.log']):
    process(line)

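chain.from_iterable is the variant I use most: it takes an iterable of iterables and yields from each in turn. Because the argument is a generator expression, each file is opened only when the chain reaches it. One caveat: nothing in this version closes the files explicitly. Each handle stays open until the file object is garbage-collected, which CPython usually does shortly after the chain moves past it, but that is an implementation detail rather than a guarantee. If the consumer stops early or you are running on another interpreter, wrap each open in a with inside a small generator instead.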
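One way to get deterministic cleanup is a small generator that owns each file in turn. This is a sketch under my own naming (all_log_lines_closing), not a replacement the stdlib provides; the temporary files exist only so the example is runnable.

```python
import os
import tempfile

def all_log_lines_closing(paths):
    """Yield lines from each file in turn, closing each file before opening the next."""
    for p in paths:
        with open(p) as f:
            yield from f

# demo with two throwaway files; the names and contents are placeholders
with tempfile.TemporaryDirectory() as tmpdir:
    paths = []
    for name, text in [('a.log', 'one\ntwo\n'), ('b.log', 'three\n')]:
        path = os.path.join(tmpdir, name)
        with open(path, 'w') as out:
            out.write(text)
        paths.append(path)

    lines = [line.rstrip('\n') for line in all_log_lines_closing(paths)]
```

At most one file is open at any moment, and if the consumer abandons the generator, closing it unwinds the with block for the file currently in progress.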

itertools.islice is [start:stop] for iterators. Useful for sampling, paging, or grabbing the first N items of a generator without consuming the rest.

from itertools import islice

with open('huge.log') as f:
    first_thousand = list(islice(f, 1000))     # head -n 1000
    page_two = list(islice(f, 1000))           # next 1000 lines

When I am debugging a pipeline, list(islice(stage, 5)) is my go-to: it pulls five items, runs them through the upstream stages, and gives me something printable without iterating the full input.
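One thing to keep in mind when peeking this way: the items islice pulls are consumed, so the stage continues from where the peek stopped. A short sketch, with squares as a hypothetical stand-in for a long pipeline stage:

```python
from itertools import count, islice

def squares():
    for n in count():          # infinite stream, standing in for a long pipeline stage
        yield n * n

stage = squares()
peek = list(islice(stage, 5))  # pulls exactly five items through the stage
next_item = next(stage)        # the stage resumes where islice stopped
```

For debugging I would therefore peek at a freshly built pipeline and rebuild it before the real run.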

itertools.tee is the one with a footgun. It splits one iterator into N independent iterators, but it does so by buffering every item that one consumer has read and the others have not yet reached. If your consumers move at very different speeds, the buffer grows without bound, and you have given back the memory savings the generator was supposed to buy.

from itertools import tee

lines = (line.rstrip() for line in open('app.log'))
stream_a, stream_b = tee(lines, 2)  # safe ONLY if both consumers stay in lockstep

My rule: use tee when the two consumers are in the same loop (zipped together, processed in parallel one item at a time). If they run in separate phases, materialise the data once instead.
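A sketch of the lockstep case: two views of the same stream, one shifted by a single item, zipped together so the internal buffer never holds more than one element. The deltas helper and the numbers are made up for the example.

```python
from itertools import tee

def deltas(values):
    """Yield the difference between each value and the one before it."""
    current, following = tee(values, 2)
    next(following, None)                 # advance one stream by a single item
    for a, b in zip(current, following):  # both streams advance together
        yield b - a

timestamps = iter([100, 103, 109, 110])   # made-up numbers for the demo
gaps = list(deltas(timestamps))
```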

Four traps that bite hand-rolled generators

Trap 1: a generator can only be iterated once. This is the most common bug.

g = (x * x for x in range(5))
list(g)  # [0, 1, 4, 9, 16]
list(g)  # [] -- the generator is exhausted

If you need to iterate the same data twice, materialise it (results = list(g)) or return a fresh generator each time. Code that takes a generator and iterates it twice without materialising is a bug.
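The "fresh generator each time" option can be as small as a class whose __iter__ is itself a generator function. Squares here is a hypothetical example class.

```python
class Squares:
    """Re-iterable wrapper: each loop over the object gets its own fresh generator."""

    def __init__(self, limit):
        self.limit = limit

    def __iter__(self):
        for x in range(self.limit):
            yield x * x

squares = Squares(5)
first_pass = list(squares)
second_pass = list(squares)   # not empty: __iter__ built a new generator
```

The object stays lazy, nothing is stored, and callers can loop over it as many times as they like.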

Trap 2: cleanup runs on close(), not on next(). A generator that holds a resource (file, lock, db cursor) needs try/finally to release it.

def read_with_lock(path, lock):
    lock.acquire()
    try:
        with open(path) as f:
            for line in f:
                yield line
    finally:
        lock.release()

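If the consumer stops iterating early, the generator stays paused at its yield until it is closed. That happens when someone calls close() on it explicitly, or when the generator object is garbage-collected (in CPython, usually as soon as the last reference goes away). close() raises GeneratorExit at the paused yield, the finally block runs, and the lock is released. Without try/finally, the lock would leak. If you cannot rely on prompt garbage collection, call close() yourself or wrap the generator in contextlib.closing.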
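A runnable sketch of the early-exit path, using contextlib.closing so that close() is called at a known point. locked_items is a simplified, hypothetical variant of read_with_lock that iterates a list instead of a file.

```python
import threading
from contextlib import closing

def locked_items(items, lock):
    lock.acquire()
    try:
        for item in items:
            yield item
    finally:
        lock.release()

lock = threading.Lock()
with closing(locked_items(['a', 'b', 'c'], lock)) as stream:
    first = next(stream)               # generator is now paused at the yield
    held_while_paused = lock.locked()  # the lock is still held here
# leaving the block calls stream.close(), which raises GeneratorExit at the yield
released_after_close = not lock.locked()
```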

Trap 3: return inside a generator does not return a value to the consumer. It stops the iteration. The value is attached to the StopIteration exception, which the for loop swallows.

def gen():
    yield 1
    yield 2
    return "final"  # not visible to a `for` loop

for x in gen():
    print(x)  # prints 1, then 2; "final" is never seen

If you really want the return value, catch StopIteration manually and read its value attribute, or delegate with yield from, which evaluates to that value. In practice this is a sign that you should be using a regular function that returns a tuple of (items, summary), not a generator. Mixing the two patterns in one function reads badly.
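For completeness, here are the two ways I know to get at that value, shown on the same gen function. The outer helper and the captured list are hypothetical names for the demonstration.

```python
def gen():
    yield 1
    yield 2
    return "final"

# option 1: drive the generator by hand and read StopIteration.value
g = gen()
items = []
while True:
    try:
        items.append(next(g))
    except StopIteration as stop:
        summary = stop.value
        break

# option 2: a delegating generator receives the value as the result of `yield from`
captured = []

def outer():
    result = yield from gen()
    captured.append(result)

outer_items = list(outer())
```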

Trap 4: a broad except around next() silently swallows real bugs. This one cost me a half-day of debugging in production. The consumer wraps next() in a try/except Exception to handle StopIteration gracefully, and then a real exception (a KeyError from a malformed record, an OSError from a closed handle) gets eaten by the same except.

def bad_consumer(stream):
    while True:
        try:
            value = next(stream)
        except Exception:        # WRONG: swallows everything, not just StopIteration
            return
        yield process(value)

The pipeline appears to terminate cleanly, the metrics show "job complete", and the records that triggered the bug are silently dropped. The fix is to catch StopIteration specifically and let everything else propagate.

def good_consumer(stream):
    while True:
        try:
            value = next(stream)
        except StopIteration:    # right: only catch end-of-stream
            return
        yield process(value)

In practice I just write for value in stream: and let Python handle the StopIteration for me. The pattern above only shows up when I need pull-based control over iteration, and even then I make the except precise.
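When pull-based control is genuinely needed, the two-argument form of next() avoids the try/except entirely, so there is no except clause to get wrong. A sketch, with take_pairs, broken, and the _DONE sentinel as hypothetical names:

```python
_DONE = object()   # unique sentinel, so None stays a legal stream value

def take_pairs(stream):
    """Pull two items per step; stop cleanly when the stream runs out."""
    stream = iter(stream)
    while True:
        first = next(stream, _DONE)
        if first is _DONE:
            return
        second = next(stream, _DONE)
        if second is _DONE:
            yield (first, None)   # odd item out, paired with None
            return
        yield (first, second)

pairs = list(take_pairs([1, 2, 3, 4, 5]))

def broken():
    yield 1
    raise KeyError("bad record")   # a real bug, not end-of-stream

try:
    list(take_pairs(broken()))
    propagated = False
except KeyError:
    propagated = True              # the real error surfaces instead of vanishing
```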

When I do not reach for a generator

The pipeline pattern wins when the data is large, the work per item is independent, and the consumer either iterates once or bails early. It loses in three cases worth knowing about.

  • Random access required. A generator does not support indexing. If the consumer needs data[42], materialise to a list.
  • Multiple consumers, same data. A generator is single-shot. If two parts of the program need to iterate the same data, either materialise once and share the list, or use itertools.tee (with care, because tee buffers internally).
  • Tiny data. If the dataset is small enough to fit in memory five times over, the lazy machinery is overhead, not benefit. A list comprehension is simpler to read for sub-megabyte inputs.

The single line I use to decide is "could this dataset get 10x bigger next quarter?" If yes, generator pipeline. If no and it is already small, list comprehension. The cost of switching from list to generator later is small (changing [...] to (...) and a list() cast at the end); the cost of running out of memory is bigger.
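The switch really is that small. A sketch comparing the two spellings on a made-up range of numbers; note that sys.getsizeof reports the size of the container object only, not of the items it refers to, so treat the sizes as a rough indication.

```python
import sys

values = range(100_000)

as_list = [v * v for v in values]      # every result held in memory at once
as_gen = (v * v for v in values)       # one result at a time

list_size = sys.getsizeof(as_list)     # size of the list object itself, not its items
gen_size = sys.getsizeof(as_gen)       # small and independent of the input length

total_from_list = sum(as_list)
total_from_gen = sum(as_gen)           # consumes the generator; it cannot be reused
```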

Why yield is worth the syntax it costs

For a long time I read generators as "functions that produce iterables". That is true, but it is the wrong frame. The frame that helped: a generator is two pieces of code talking to each other through a synchronous channel. The producer says yield x, hands x to the consumer, and waits. The consumer takes x, does something, and asks for the next one with next() or implicitly with the next iteration of a for loop. The two pieces interleave in lockstep, in one process, with no threads, no callbacks, no queues. That cooperative dance is what makes pipelines composable, what keeps memory bounded, and what makes the extra keyword pay for itself. Once you see generators as a control-flow primitive instead of a list builder, you start using them everywhere they fit, and your laptop stops crashing on 4GB log files.
