Community Python Snippet
Streaming JSONL Parser Without Loading the File
When the file is 8GB you cannot json.load it. Here is the generator-based JSONL reader I ship in every data pipeline, plus the malformed-line policy that has saved me twice.
By @clarachoi
December 21, 2025 · Updated May 18, 2026
The shape is a generator over a line iterator, which keeps memory bounded by the longest single line rather than by file size. I take Iterable[str] rather than a path because that lets me feed the same parser a real file, an io.StringIO, a network stream, or a gzip.open handle opened in text mode. Skipping blank lines is necessary for the loose JSONL found in the wild; producers like Cloud Logging emit trailing blank lines all the time. The basic version is about eight lines, and that is enough for a clean dataset.
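A minimal sketch of the reader described above. The function name iter_jsonl is a placeholder of mine, not necessarily the name used in the original snippet.

```python
import json
from typing import Any, Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[Any]:
    """Yield one parsed object per non-blank line (hypothetical name)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, including trailing ones
        yield json.loads(line)  # raises json.JSONDecodeError on a bad line
```

Because the argument is any iterable of strings, the same function works with open(path, encoding="utf-8"), gzip.open(path, "rt"), or a plain list of strings in a test.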
The first version dies on the first bad line, which is the right default for a small clean file but the wrong one for an 8GB log dump where the occasional corrupt frame is expected. Yielding (lineno, obj) makes downstream errors traceable to the exact source line; it has saved me more than once when the bad rows turned out to come from a single misbehaving producer. The 'count' mode is what I ship: skip malformed lines silently in the hot path, count them, then emit one summary log line at the end. The 'fail' mode is for unit tests, where any malformed input means a test bug.
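One way the two policies could look, as a sketch under assumptions: the names iter_jsonl_numbered and on_error, the use of the logging module, and the choice to raise ValueError in 'fail' mode are all mine; the text only specifies the 'count' and 'fail' behaviors and the (lineno, obj) shape.

```python
import json
import logging
from typing import Any, Iterable, Iterator, Tuple

logger = logging.getLogger(__name__)


def iter_jsonl_numbered(
    lines: Iterable[str], on_error: str = "count"
) -> Iterator[Tuple[int, Any]]:
    """Yield (lineno, obj) pairs; on_error is 'count' or 'fail' (assumed names)."""
    if on_error not in ("count", "fail"):
        # Inside a generator, this check runs on the first next() call.
        raise ValueError(f"unknown on_error mode: {on_error!r}")
    bad = 0
    for lineno, line in enumerate(lines, start=1):  # 1-based source line numbers
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            if on_error == "fail":
                raise ValueError(f"line {lineno}: {exc}") from exc
            bad += 1  # 'count' mode: skip quietly, report once at the end
            continue
        yield lineno, obj
    if bad:
        logger.warning("skipped %d malformed JSONL line(s)", bad)
```

Note that in this sketch the summary line is only logged when the consumer exhausts the generator; if early termination matters, wrapping the loop in try/finally is one option.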
The generator-of-batches shape is what real ETL stages look like: each pull from the consumer drives one batch read from the source, so memory stays bounded at batch_size rows. The size 500 is a starting point; for Postgres INSERT ... VALUES (...), (...) I tune it to keep each statement within the wire protocol's limits (with bind parameters, the cap is 65,535 per statement), and for Kafka producers I size it with the producer's linger.ms and batch.size settings in mind, remembering that batch.size is measured in bytes, not records. Always emit the trailing partial batch with the if buf: yield buf after the loop; forgetting it silently drops up to batch_size - 1 rows and is the bug I have shipped most often in this category.
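The batching stage can be sketched as follows. The name in_batches is hypothetical; the buf variable, the default of 500, and the trailing if buf: yield buf come from the description above.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], batch_size: int = 500) -> Iterator[List[T]]:
    """Group items into lists of at most batch_size (hypothetical name)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    buf: List[T] = []
    for item in items:
        buf.append(item)
        if len(buf) >= batch_size:
            yield buf
            buf = []  # start a fresh list so the yielded batch is not mutated
    if buf:
        yield buf  # trailing partial batch; omitting this drops rows silently
```

On Python 3.12 and later, itertools.batched covers the same ground but yields tuples instead of lists.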
