Community Python Snippet
An Embedding Cache With Content-Hash Keys
Re-embedding the same paragraphs on every deploy was costing us $400 a month. This is the SQLite-backed cache I shipped: the key is sha256(model + normalized text), TTL is per-row, and a single batch call backfills misses.
An Embedding Cache With Content-Hash Keys
Re-embedding the same paragraphs on every deploy was costing us $400 a month. This is the SQLite-backed cache I shipped: the key is sha256(model + normalized text), TTL is per-row, and a single batch call backfills misses.
By @amaragupta
April 12, 2026
·
Updated May 20, 2026
477 views
9
4.3 (11)
The cache is a 40-line SQLite table with a single (key, vector, expires_at) row per text. The interesting choice is the key: NFKC-normalized text plus the model id, hashed with sha256. NFKC means caf\u00e9 (single codepoint) and cafe\u0301 (combining accent) collapse to the same key, which actually matters because OpenAI tokenizes them identically and we do not want two cache rows for the same vector. The TTL is per-row rather than per-cache because some embeddings (product names) churn weekly while others (legal text) stay valid for a year. The model id in the key is the safety net: when we upgrade from text-embedding-3-small to large, every key changes and the cache invalidates itself.
The batch path is the one that actually saved us money. get_many does a single IN (?, ?, ?, ...) query so a request embedding 200 documents costs one round-trip to SQLite, not 200. The miss list keeps the original index alongside the text so we can splice the freshly-embedded vectors back into the result without zipping order-dependent lists. The 128 batch size matches OpenAI's recommended max per request; for Cohere I drop it to 96. The print at the end confirms only three unique strings made it into the cache, which is what the dedup should produce.
I run evict() once an hour on a cron. It does two passes: drop everything past its TTL (cheap, indexed by expires_at if you add that index in production), then if the table still has more than max_rows, delete the coldest rows by last_seen. The two-phase order matters: a row with a very old last_seen but a not-yet-expired TTL is still useful, and we want TTL expiry to be the dominant signal. I cap at 100k rows because at our text length that comfortably fits in 200MB of SQLite, which is small enough to keep in memory and fsync without blocking writes.
