Community Python Snippet
groupby Then Aggregate With defaultdict (Python)
Pure stdlib group-then-aggregate: defaultdict(list) for the grouping pass, then a tiny per-group reducer. The version I reach for before importing pandas, plus the multi-stat variant.
By @elenamuller
February 8, 2026 · Updated May 18, 2026
Two passes: the first builds a defaultdict(list) keyed by group, the second walks each group and emits whatever aggregate you need. I prefer this over itertools.groupby because groupby only merges consecutive items with equal keys: unless the input is already sorted by the key, the same key comes back in several fragments and nothing raises an error. defaultdict does not care about input order. The from __future__ import annotations at the top is what lets the dict[str, list[dict]] type hint parse on Python 3.8 (the playground's Python). For datasets up to a few hundred thousand rows this is typically faster than constructing a DataFrame, and it lets the aggregation logic stay in plain Python.
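A minimal sketch of the two passes described above, with the itertools.groupby behavior shown for contrast. The orders rows and their region and amount fields are made up for illustration; they are not from the original snippet.

```python
from __future__ import annotations

from collections import defaultdict
from itertools import groupby

# Hypothetical sample rows; the field names are illustrative only.
orders = [
    {"region": "eu", "amount": 10},
    {"region": "us", "amount": 5},
    {"region": "eu", "amount": 7},
]

# Pass 1: bucket rows by key. Input order does not matter.
groups: dict[str, list[dict]] = defaultdict(list)
for row in orders:
    groups[row["region"]].append(row)

# Pass 2: reduce each bucket to the aggregate you need (here, a total).
totals = {
    region: sum(r["amount"] for r in rows)
    for region, rows in groups.items()
}

# For contrast: groupby on the same unsorted input yields "eu" twice,
# because it only groups consecutive rows with equal keys.
runs = [key for key, _ in groupby(orders, key=lambda r: r["region"])]

print(totals)  # {'eu': 17, 'us': 5}
print(runs)    # ['eu', 'us', 'eu']
```

Sorting by the key before calling groupby would also fix the fragmentation, at the cost of an O(n log n) sort that the defaultdict version does not need.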
The helper is a few lines of grouping code with a parameterized agg: the caller picks how to reduce each group. Splitting key and agg is what makes the helper composable: agg=len counts, agg=sum totals, agg=list returns the grouped rows themselves, and any custom reducer fits the same slot. I keep this in a utils/group.py module on every Python project; it is short enough that it never needs to grow into pandas. The annotations subscript the built-in dict and list directly, which only works at runtime from Python 3.9 onward, so the from __future__ import annotations is mandatory on Python 3.8 to keep the file importable.
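One way such a helper could look. This is a sketch, not the author's exact code: the name group_agg, the TypeVar-based signature, and the parity example are assumptions chosen for illustration. Note that agg=sum only works directly when the grouped items are numbers; for dict rows you would pass a small custom reducer.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Callable, Hashable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def group_agg(
    items: Iterable[T],
    key: Callable[[T], Hashable],
    agg: Callable[[list[T]], R],
) -> dict[Hashable, R]:
    """Group items by key(item), then reduce each group's list with agg."""
    groups: defaultdict[Hashable, list[T]] = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return {k: agg(members) for k, members in groups.items()}


def parity(n: int) -> str:
    """Illustrative key function: label a number as odd or even."""
    return "even" if n % 2 == 0 else "odd"


numbers = [1, 2, 3, 4, 5]

counts = group_agg(numbers, key=parity, agg=len)    # group sizes
totals = group_agg(numbers, key=parity, agg=sum)    # per-group sums
members = group_agg(numbers, key=parity, agg=list)  # the grouped items

# A custom reducer fits the same slot, e.g. the largest item per group.
largest = group_agg(numbers, key=parity, agg=max)
```

Here counts is {'odd': 3, 'even': 2} and totals is {'odd': 9, 'even': 6}. The local annotation inside the function is never evaluated, and the future import keeps the signature annotations as strings, so the sketch should import cleanly on Python 3.8.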
This is the shape of every per-route latency report I have shipped. The aggregation is a single loop that builds per-group lists, then one comprehension that emits the stats dict; the only reason to change that shape is memory pressure, which forces you to track running stats incrementally instead of keeping every sample. statistics.median is the right p50 implementation in the stdlib; for other percentiles you want statistics.quantiles(samples, n=100)[k-1] for the kth percentile. I deliberately do not show p99 here: statistics.quantiles is available from Python 3.8, but it behaves subtly on tiny samples, where it interpolates between data points and, with the default method, can extrapolate beyond the observed values. Mention that in your real code's docstring so a junior does not call it on a 5-element list.
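A sketch of the multi-stat variant, assuming the input is a list of (route, latency_ms) pairs. The sample data, the variable names, and the choice of stats are illustrative, not taken from a real report. In keeping with the note above, p99 is left out.

```python
from __future__ import annotations

import statistics
from collections import defaultdict

# Hypothetical (route, latency_ms) samples.
samples = [
    ("/home", 12.0),
    ("/search", 48.0),
    ("/home", 20.0),
    ("/search", 52.0),
    ("/home", 16.0),
]

# Single loop: collect every latency under its route.
by_route: dict[str, list[float]] = defaultdict(list)
for route, latency_ms in samples:
    by_route[route].append(latency_ms)

# One comprehension: emit a stats dict per route.
report = {
    route: {
        "count": len(values),
        "mean": statistics.fmean(values),
        "p50": statistics.median(values),
        "max": max(values),
    }
    for route, values in by_route.items()
}

for route, stats in report.items():
    print(route, stats)
```

statistics.fmean is used for the mean because it always returns a float; like statistics.quantiles, it requires Python 3.8 or newer.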
