Vector search has gone from research curiosity to production-default in about three years. Most product teams now have a feature that uses it: semantic search, recommendation-by-similarity, retrieval for an LLM, near-duplicate detection. The mental model in the docs ("we embed the text, we do nearest-neighbor search") is true but incomplete. The production gotchas are concentrated in three places (chunking, the metric you choose, and the hybrid retrieval question), and getting them wrong leaves you with a feature that demos well and disappoints in real usage.
This is the writeup I wish I had been handed before my first vector-search ship. It covers what an embedding actually is, why cosine similarity is the metric you reach for, the dimension and index choices that move performance, and the three production decisions that decide whether the feature ships or sits.
What an embedding actually is
An embedding is a vector of numbers (typically 256 to 4096 floating-point dimensions) produced by a model from a piece of text, image, or other input. The vector's specific numbers are not directly meaningful. What is meaningful is the geometric relationship between vectors: similar inputs produce vectors that point in similar directions; dissimilar inputs produce vectors that point in different directions.
A tiny example, in Python pseudocode using a hypothetical embedding API:
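As a minimal sketch, the version below replaces the embedding call with a hypothetical `embed` function backed by three hand-picked 3-dimensional vectors, so it runs without any model. The numbers are invented for illustration; only their relative directions matter.

```python
import math

# Hypothetical stand-in for a real embedding API: hand-picked 3-d vectors,
# chosen only so that related words point in similar directions.
TOY_VECTORS = {
    "cat": [0.90, 0.40, 0.10],
    "feline": [0.85, 0.50, 0.05],
    "Postgres": [0.05, 0.10, 0.99],
}

def embed(text: str) -> list[float]:
    """Return the toy vector for `text` (a real model would compute this)."""
    return TOY_VECTORS[text]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: dot product over both lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat, feline, postgres = embed("cat"), embed("feline"), embed("Postgres")
sim_close = cosine_similarity(cat, feline)    # related meanings
sim_far = cosine_similarity(cat, postgres)    # unrelated meanings
print(f"cat vs feline:   {sim_close:.2f}")
print(f"cat vs Postgres: {sim_far:.2f}")
```

With a real model the vectors would have hundreds or thousands of dimensions, but the comparison step would look the same.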
The embedding model is a learned function. Its training objective was something like "make sentences with similar meaning produce vectors close to each other". The model does not know what a cat is; it knows that the contexts "cat" appears in are statistically similar to the contexts "feline" appears in, and that those contexts are statistically different from the contexts "Postgres" appears in. The geometry is a side effect of the training objective.
Why this is useful: traditional keyword search asks "do these strings overlap?" It cannot find "feline" given a query for "cat", because the word "cat" does not appear in the indexed sentence. Embedding search asks "are these meanings close?" and answers yes for that pair.
The metric: why cosine similarity, almost always
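Three common ways to measure how close two vectors are: cosine similarity (the angle between them, ignoring length), dot product (direction and length together), and Euclidean distance (the straight-line distance between their endpoints).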
For most embedding models you actually use, the right metric is cosine similarity, and it is right because the embedding models are trained to produce vectors whose direction (not magnitude) encodes meaning. Two near-duplicate sentences produce vectors that point in nearly the same direction; the lengths of those vectors are not meaningful, so the metric should ignore length. Cosine similarity does exactly that.
A crucial production detail: many embedding models output normalized vectors (length 1) by default, though not all do, so check yours. When the vectors are normalized, cosine similarity is identical to dot product, and the dot product is cheaper to compute. Vector databases often take advantage of this: they normalize at insert time and then use dot product internally. The metric you ask for in the API is cosine; under the hood the implementation is dot product on normalized vectors.
Euclidean distance also works, but it conflates direction with magnitude, and it interacts poorly with embedding models that do not produce length-1 vectors. Stick with cosine unless you have a specific reason; the surprises are fewer.
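To make the difference between the measures concrete, here is a small sketch in plain Python. The two vectors are made up: `b` points in exactly the same direction as `a` but is twice as long. The helper names are illustrative, not taken from any particular library.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    """Scale a vector to length 1 without changing its direction."""
    n = norm(a)
    return [x / n for x in a]

a = [3.0, 4.0]
b = [6.0, 8.0]   # same direction as a, twice the length

cos_ab = cosine_similarity(a, b)             # ignores length
dot_ab = dot(a, b)                           # grows with length
dist_ab = euclidean_distance(a, b)           # nonzero although directions match
dot_unit = dot(normalize(a), normalize(b))   # dot product after normalizing
```

Cosine treats the two vectors as identical in direction, Euclidean distance does not, and the dot product of the normalized copies matches the cosine value up to floating-point error.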
The dimension and index choices that move performance
Two dials that determine the speed and quality trade-off:
Dimension count. Lower-dimensional vectors are faster to search, cheaper to store, and slightly less expressive. Higher-dimensional vectors are slower and more expensive but capture more nuance. Modern providers offer multiple dimension options (256, 512, 1024, 1536, 3072 are common). Some recent embedding models support "Matryoshka" (truncatable) embeddings: you can store the full vector and at query time use only the first N dimensions, trading quality for speed without re-embedding. The default I reach for: 1024 or 1536 dimensions for general-purpose retrieval. 256 or 512 for high-volume similarity (de-dup, reco) where the corpus is large and quality is less critical than speed.
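A sketch of the truncation step, assuming the model was trained so that a prefix of the vector is still a usable embedding (this does not hold for arbitrary models). The function name `truncate_embedding` is hypothetical. The prefix of a unit vector is generally shorter than length 1, so the sketch rescales it before it is used with cosine or dot product.

```python
import math

def truncate_embedding(vec, n_dims):
    """Keep the first n_dims values and rescale the result to unit length."""
    if not 0 < n_dims <= len(vec):
        raise ValueError("n_dims must be between 1 and len(vec)")
    head = vec[:n_dims]
    length = math.sqrt(sum(x * x for x in head))
    if length == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / length for x in head]

full = [0.5, 0.5, 0.5, 0.5]           # toy 4-d unit vector
short = truncate_embedding(full, 2)   # 2-d prefix, rescaled to length 1
```

Stored vectors and query vectors need the same truncation length, otherwise they cannot be compared.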
Index type. Searching N vectors of dimension D by brute force is O(N*D). At a million vectors with 1024 dimensions, that is too slow for interactive search (hundreds of milliseconds at best). The fix is an approximate-nearest-neighbor (ANN) index. The two families you will most often meet are graph-based indexes such as HNSW and cluster-based indexes such as IVF.
Approximate means the index can miss the true nearest neighbor in rare cases. For typical retrieval workloads, recall@10 of 95-99% is achievable with HNSW at sub-10ms query times. That is more than enough for almost any product use case I have shipped.
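Recall is measured against exact search, so a brute-force scorer is still useful as ground truth even when it is too slow to serve traffic. The sketch below assumes unit-length vectors held in a plain dictionary; the function names and the `approx` list (standing in for whatever an ANN index returned) are illustrative.

```python
import heapq

def brute_force_top_k(query, vectors, k):
    """Exact search: score every stored vector, O(N*D) per query.

    `vectors` maps an id to a unit-length vector, so dot product == cosine.
    """
    scored = (
        (sum(q * v for q, v in zip(query, vec)), vec_id)
        for vec_id, vec in vectors.items()
    )
    return [vec_id for _, vec_id in heapq.nlargest(k, scored)]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate result also returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

vectors = {
    "a": [1.0, 0.0],
    "b": [0.8, 0.6],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
exact = brute_force_top_k([1.0, 0.0], vectors, k=2)
approx = ["a", "c"]   # pretend output of an ANN index that missed "b"
score = recall_at_k(approx, exact)
```

Averaging this number over a sample of real queries gives the recall@k figure to compare against your latency budget.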
The three production decisions that determine whether the feature ships
The quality of a vector-search feature in production almost always comes down to three decisions, in order of impact.
Decision 1: chunking
The input documents are rarely the right unit to embed. A 50-page PDF embedded as one vector loses all the granularity that made the search useful. The same PDF split into 200-token chunks yields many separately retrievable vectors, so the search can surface the most relevant section instead of the whole document.
The chunking heuristics I use as a starting point:
- 200 to 500 tokens per chunk for prose. Long enough to carry meaning, short enough to be retrievable.
- 50 to 100 tokens of overlap between adjacent chunks, so that a fact that straddles a chunk boundary is captured in at least one chunk.
- Split on natural boundaries first (paragraphs, headings, sentences), then on token count if the chunk is still too long.
- Prepend the document title (or section path) to each chunk. A short tag like "Doc: API Reference > Pagination >" makes near-duplicate chunks distinguishable in retrieval.
A simple chunker in pseudocode:
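One way to turn the heuristics above into runnable Python, as a sketch rather than a reference implementation. The function name `chunk_document`, its parameters, and the use of whitespace-separated words as a stand-in for tokens are all assumptions made for illustration.

```python
import re

def chunk_document(title, text, max_tokens=300, overlap=50):
    """Split `text` into overlapping chunks, each prefixed with the title.

    Tokens are approximated by whitespace-separated words; a real pipeline
    would count tokens with the embedding model's own tokenizer.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")

    # 1. Split on natural boundaries first (blank-line-separated paragraphs).
    paragraphs = [p.split() for p in re.split(r"\n\s*\n", text) if p.strip()]

    # Pieces leave room for the overlap so no chunk exceeds max_tokens.
    step = max_tokens - overlap
    chunks, current = [], []
    for words in paragraphs:
        # 2. A paragraph longer than the budget is split on word count.
        for i in range(0, len(words), step):
            piece = words[i:i + step]
            if current and len(current) + len(piece) > max_tokens:
                chunks.append(current)
                # 3. Carry the tail of the finished chunk forward as overlap.
                current = current[-overlap:] if overlap else []
            current = current + piece
    if current:
        chunks.append(current)

    # 4. Prepend the document title so similar chunks stay distinguishable.
    return [f"Doc: {title} > " + " ".join(c) for c in chunks]

sample = "w1 w2 w3 w4 w5\n\nw6 w7 w8 w9 w10"
chunks = chunk_document("T", sample, max_tokens=6, overlap=2)
```

Small paragraphs are packed together until the budget is reached, so chunk boundaries fall on paragraph breaks whenever the paragraphs fit.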
Not fancy. The right shape is the simplest that respects natural boundaries.
Decision 2: hybrid retrieval (vectors + keywords)
Pure vector search misses things. Specifically: rare proper nouns, exact identifier matches (an order ID, a user handle, a product SKU), and queries dominated by structure rather than semantics. A user searching for ORD-2025-7842 does not want a fuzzy semantic match; they want the document that contains exactly that string.
The pattern that fixes this: hybrid retrieval. Run a keyword search (BM25 or a similar lexical ranker, in Postgres full-text search, Elasticsearch, or Tantivy) and a vector search in parallel, and combine the two result sets. The combination function is usually Reciprocal Rank Fusion (RRF), which has a single constant that rarely needs tuning and works well out of the box.
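A minimal sketch of RRF in Python. Each document's fused score is the sum of 1 / (k + rank) over every list it appears in, with ranks starting at 1. The document ids and the two input lists are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked lists of document ids; score = sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))

keyword_hits = ["doc_7", "doc_2", "doc_9"]   # e.g. from the keyword index
vector_hits = ["doc_2", "doc_4", "doc_7"]    # e.g. from the vector index
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Only ranks are used, never raw scores, which is why the two retrievers do not need comparable score scales.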
Reciprocal rank fusion takes the union of the top results from each retrieval, weighting them by inverse rank. Documents that show up high in both lists score highest; documents that appear in only one but rank well still surface. The hyperparameter k is a stabilizer (defaulting to 60 is common in published work) and is rarely worth tuning; the algorithm is robust by design.
In my experience, hybrid retrieval is usually 10-25% better than pure vector search on common evaluations: it preserves the recall benefit of vectors while not losing the exact-match cases keyword search nails.
Decision 3: re-ranking
The top 50 results from hybrid retrieval are decent but not perfect. A second-stage re-ranker (a smaller cross-encoder model that scores each query+document pair) can reorder them based on a more fine-grained relevance signal. Rerankers add latency and cost, but improve precision-at-top-K significantly.
The pattern: retrieve 50 to 100 candidates with hybrid search, then rerank those candidates and return the top 10. The retrieval is fast (vector index + keyword index, a few ms each); the reranker only sees a small candidate set, so even a slower model is cheap at that scale.
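The second stage can be sketched as a function that takes the candidate set and a pairwise scorer. Here `score_pair` is a hypothetical stand-in for a cross-encoder call, and `word_overlap` is a toy scorer used only so the example runs without a model.

```python
def rerank(query, candidates, score_pair, top_n=10):
    """Reorder `candidates` by score_pair(query, text), best first.

    `candidates` is a list of (doc_id, text) pairs from first-stage retrieval.
    `score_pair` is a hypothetical stand-in for a cross-encoder call.
    """
    scored = [(score_pair(query, text), doc_id) for doc_id, text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_n]]

def word_overlap(query, text):
    """Toy scorer: number of query words that appear in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))

candidates = [
    ("d1", "Cursor pagination for list endpoints"),
    ("d2", "Rate limits and retry headers"),
    ("d3", "How pagination cursor tokens expire"),
]
top = rerank("pagination cursor tokens", candidates, word_overlap, top_n=2)
```

Because the scorer runs once per candidate, the cost of this stage scales with the size of the candidate set, not with the size of the corpus.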
Whether to add re-ranking depends on the use case. For RAG (retrieval to feed an LLM), the LLM's tolerance for noisy context is high, and pure hybrid is often enough. For user-facing search where the top 5 results need to be ordered well, the reranker is usually worth it.
A note on multi-tenant filters
In any non-trivial product, the vectors in your index are not all visible to everyone. Tenant A's documents must not surface in Tenant B's searches; private documents must not appear for users without access. The discipline that keeps this watertight: every chunk is stored with a tenant_id (and any other access labels) in its metadata, and every query attaches a metadata filter that the vector DB applies before similarity ranking.
The gotcha is that approximate-nearest-neighbor indexes do not always handle pre-filtering efficiently. If the filter is highly selective (1% of the corpus matches), the index may still scan the full graph and discard 99% of the results, which can be slower than just brute-force searching the matching subset. Most modern vector DBs (Qdrant, pgvector, Weaviate) have improved here, but it is worth measuring on your actual access patterns before declaring the architecture sound. The wrong shape (post-filtering, where the index returns top-K without the filter and you discard the unauthorized results in app code) is both slower and a security bug waiting to happen if anyone forgets to apply the filter on the way out.
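The filter-then-rank order can be shown with a small in-memory sketch. The `Chunk` class and `search_for_tenant` function are hypothetical; a real vector DB would take the filter as part of the query rather than as Python code.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str
    tenant_id: str
    vector: list   # assumed unit-length, so dot product == cosine

def search_for_tenant(query_vec, chunks, tenant_id, k=10):
    """Filter by tenant first, then rank only the chunks that remain."""
    visible = [c for c in chunks if c.tenant_id == tenant_id]
    visible.sort(
        key=lambda c: sum(q * v for q, v in zip(query_vec, c.vector)),
        reverse=True,
    )
    return [c.chunk_id for c in visible[:k]]

store = [
    Chunk("a1", "tenant_a", [1.0, 0.0]),
    Chunk("b1", "tenant_b", [1.0, 0.0]),   # best match, but wrong tenant
    Chunk("a2", "tenant_a", [0.0, 1.0]),
]
hits = search_for_tenant([1.0, 0.0], store, tenant_id="tenant_a", k=2)
```

Because the other tenant's chunk is removed before any scoring happens, it cannot appear in the result no matter how well it matches.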
What changes from prototype to production
A quick checklist of the things that almost always come up after the demo:
The item that bites the most often is the first one: when you upgrade the embedding model (because a better one was released, because the provider deprecated the old one), every vector in your index is now in a different geometric space. New queries cannot meaningfully be compared to old vectors; you must re-embed everything. Plan the migration before you ship; budget the cost.
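One way to make that migration tractable, offered as an assumption rather than a prescribed method, is to record which model produced each stored vector so that stale entries can be found later. The model labels `embed-v1` and `embed-v2` and the names below are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredVector:
    chunk_id: str
    model: str      # hypothetical label for the model that produced the vector
    vector: tuple

def needs_reembedding(stored, current_model):
    """Return ids of vectors that were produced by a different model."""
    return [s.chunk_id for s in stored if s.model != current_model]

index = [
    StoredVector("c1", "embed-v1", (0.6, 0.8)),
    StoredVector("c2", "embed-v2", (1.0, 0.0)),
]
stale = needs_reembedding(index, current_model="embed-v2")
```

The same label can be checked at query time so that a query embedded with one model is never compared against vectors from another.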
The simple decision tree I use
If I am picking a vector retrieval setup for a new feature, the decision tree fits on a notecard.
That is the path that has worked for every vector-search feature I have shipped. The teams who skip step 4 ship features that miss exact matches and look bad in the first user demo. The teams who skip step 6 ship features that quietly drift in quality and nobody notices until support tickets pile up. The other steps tend to take care of themselves once you have the eval loop in place; the eval loop is what turns vector search from a one-time setup into something you can keep improving.
