NLP Foundations
Embeddings and Semantic Search
How dense vectors turn text into a geometry of meaning, and how cosine similarity lets you find related content without keywords.
beginner · 7 min read
Embeddings map text to dense vectors such that semantic similarity corresponds to vector proximity. They underpin most modern semantic search, RAG, and recommendation systems.
How they are produced
The dominant pattern: take a transformer encoder, pool its hidden states into a single vector (typically by averaging the token vectors), and train it with a contrastive objective that pulls similar pairs close and pushes dissimilar pairs apart. Sentence-Transformers, E5, BGE, and Voyage all broadly follow this recipe.
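from sentence_transformers import SentenceTransformer

# Load a small general-purpose sentence encoder (384-dimensional output).
model = SentenceTransformer("all-MiniLM-L6-v2")
# encode() returns one vector per input sentence, here an array of shape (2, 384).
vecs = model.encode(["Cats are loud.", "Felines vocalise often."])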
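To make the pooling and contrastive steps concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not how any particular library implements training: the function names `mean_pool` and `info_nce` are hypothetical, the loss is the common in-batch (InfoNCE-style) variant, and the temperature of 0.05 is just a plausible default.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, skipping padding."""
    dim = len(token_vectors[0])
    kept = [v for v, m in zip(token_vectors, attention_mask) if m]
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(anchors, positives, temperature=0.05):
    """In-batch contrastive loss (assumed variant, for illustration).

    positives[i] is the matching text for anchors[i]; every other
    positive in the batch serves as a negative for that anchor.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every candidate, sharpened by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the candidates.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        # Cross-entropy with the true pair as the target class.
        total += log_denominator - logits[i]
    return total / len(a)
```

The loss is near zero when each anchor is most similar to its own positive and grows when a wrong pairing scores higher, which is what drives similar texts together during training.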
Similarity metrics
- Cosine similarity is the workhorse: range -1 to 1, scale-invariant.
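- Dot product equals cosine similarity when vectors are L2-normalised. Many embedding models normalise their output, but not all, so check before relying on it.
- Euclidean distance is rarely used directly for semantic similarity; on L2-normalised vectors it produces the same ranking as cosine.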
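The relationships between these metrics can be checked on a toy pair of vectors. This is a small sketch using only the standard library; the helper names and the example vectors are made up for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]
scaled_a = [10 * x for x in a]

# Scale invariance: stretching a vector does not change its cosine similarity.
cos_ab = cosine(a, b)
cos_scaled = cosine(scaled_a, b)

# On unit vectors the dot product is the cosine similarity.
ua, ub = normalize(a), normalize(b)
dot_unit = dot(ua, ub)

# On unit vectors, squared Euclidean distance is 2 - 2 * cosine,
# so sorting by distance gives the same order as sorting by cosine.
dist_sq = euclidean(ua, ub) ** 2
```

Here `cos_ab`, `cos_scaled`, and `dot_unit` all agree up to floating-point rounding, and `dist_sq` matches `2 - 2 * cos_ab`.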
Approximate nearest neighbour
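Brute-force cosine search costs O(n·d) per query for n vectors of dimension d, so query time grows linearly with corpus size. For corpora over ~100k vectors you reach for an approximate index: HNSW (used by pgvector, Qdrant, Pinecone), IVF (FAISS), or LSH. Index parameters trade recall against speed, for example the search breadth in HNSW (often called ef or ef_search) or the number of lists probed in IVF (nprobe).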
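As a rough illustration of the trade-off, the sketch below puts an exact linear scan next to a toy random-hyperplane LSH index. It is a teaching sketch, not a production index: the class name `HyperplaneLSH`, the single hash table, and the parameter values are all assumptions chosen for brevity. Real systems use several tables or a graph index such as HNSW.

```python
import math
import random
from collections import defaultdict

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def brute_force(query, vectors, k=3):
    """Exact search: score every vector, so cost grows linearly with the corpus."""
    order = sorted(range(len(vectors)), key=lambda i: cosine(query, vectors[i]), reverse=True)
    return order[:k]

class HyperplaneLSH:
    """Toy LSH index: vectors on the same side of every random hyperplane share a bucket."""

    def __init__(self, dim, n_bits=6, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
        self.buckets = defaultdict(list)
        self.vectors = []

    def _signature(self, v):
        # One bit per hyperplane: which side of the plane the vector falls on.
        return tuple(dot(p, v) >= 0 for p in self.planes)

    def add(self, v):
        self.vectors.append(v)
        self.buckets[self._signature(v)].append(len(self.vectors) - 1)

    def query(self, q, k=3):
        # Only the query's own bucket is scored, which is faster but can miss neighbours.
        candidates = self.buckets.get(self._signature(q), [])
        ranked = sorted(candidates, key=lambda i: cosine(q, self.vectors[i]), reverse=True)
        return ranked[:k]

# Small demo on random data (illustrative only).
rng = random.Random(1)
corpus = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(200)]
index = HyperplaneLSH(dim=16)
for vector in corpus:
    index.add(vector)

exact = brute_force(corpus[5], corpus, k=3)
approximate = index.query(corpus[5], k=3)
```

More bits per signature mean smaller buckets and faster queries but a higher chance that a true neighbour lands in a different bucket; that is the recall-versus-speed dial in miniature.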
Pitfalls
- Domain shift. A generic embedding model often underperforms on legal, medical, or proprietary content. Fine-tune on in-domain pairs or use a domain-specific model.
- Length sensitivity. Long documents and short queries tend to embed into different regions of vector space. Asymmetric retrieval (separate encoders, or separate input prefixes, for queries and documents) often outperforms embedding both the same way.
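One lightweight form of asymmetric retrieval keeps a single model but marks each input with its role. The E5 models, for instance, expect inputs to start with "query: " or "passage: ". The sketch below shows that convention; the helper names are hypothetical, and `encode` stands for any callable you supply that maps a list of strings to a list of vectors.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Encoder = Callable[[Sequence[str]], List[Vector]]

# Role prefixes in the style used by E5 models; other models may use
# different prefixes or none at all, so check the model card.
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "

def embed_queries(encode: Encoder, queries: Sequence[str]) -> List[Vector]:
    """Embed short search queries with the query-side prefix."""
    return encode([QUERY_PREFIX + q.strip() for q in queries])

def embed_passages(encode: Encoder, passages: Sequence[str]) -> List[Vector]:
    """Embed documents or chunks with the passage-side prefix."""
    return encode([PASSAGE_PREFIX + p.strip() for p in passages])
```

With Sentence-Transformers, `encode` could be the `encode` method of a loaded E5 model. The important part is that index-time and query-time code use their respective helpers consistently.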