Text embeddings have become the connective tissue between content and AI answer engines. Understanding how they work is now essential for any content team optimizing for visibility in ChatGPT, Perplexity, Google AI Overviews, and similar systems.
These engines retrieve content at the passage level using dense vector similarity — not keyword matching — then verify answer quality with large language models before generating cited responses.1 The practical implication: content must be optimized not just for topical relevance but for extractability as discrete, self-contained answer blocks.
This article covers:
- How embeddings encode meaning
- How AI engines use them internally
- How to build auditing workflows that simulate retrieval pipelines
- The critical failure modes that make naive embedding-based analysis unreliable
How transformers turn text into searchable numbers
The encoding pipeline converts text into a vector through four deterministic stages:
- Tokenization — A WordPiece tokenizer splits input into subword tokens, prepending a `[CLS]` token and appending `[SEP]`. Each token maps to a learnable 768-dimensional embedding, summed with positional and segment embeddings.
- Contextualization — The token matrix passes through 12 stacked transformer layers (BERT-base). Multi-head self-attention lets every token attend to every other, producing fully contextual representations. “Bank” gets a different vector near “river” than near “account.”
- Pooling — The final layer outputs a `[sequence_length, 768]` matrix, one vector per token. This collapses to a single vector: BGE uses `[CLS]` token pooling; most sentence-transformer models use mean pooling across all tokens.
- L2 normalization — The vector is normalized to unit length, making dot product and cosine similarity mathematically equivalent.
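The last two stages are simple enough to sketch in plain Python. The toy token matrix below stands in for real transformer output; the pooling and normalization logic follows the description above.

```python
import math

def mean_pool(token_matrix):
    """Collapse a [seq_len, dim] token matrix to one vector by averaging."""
    dim = len(token_matrix[0])
    return [sum(tok[i] for tok in token_matrix) / len(token_matrix)
            for i in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Toy stand-in for the final transformer layer's output (3 tokens, 4 dims).
tokens = [[0.2, 0.8, 0.1, 0.4],
          [0.6, 0.1, 0.9, 0.2],
          [0.1, 0.3, 0.2, 0.7]]

embedding = l2_normalize(mean_pool(tokens))
print(round(sum(x * x for x in embedding), 6))  # unit length → 1.0
```

After normalization, comparing two embeddings needs only a dot product, which is what makes approximate nearest-neighbor search over millions of vectors cheap.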
Cosine similarity in practice
For unit-length vectors, cosine similarity reduces to the dot product: sim(A, B) = Σ(Aᵢ × Bᵢ). Scores range from -1 to +1:
- +1 — identical direction in embedding space
- 0 — orthogonal (unrelated)
- -1 — opposite direction
Critical caveat: absolute scores are not interpretable as relevance thresholds.2 Contrastive training compresses the score distribution (BGE v1.0 pushed most scores into 0.6–1.0), so only relative ordering within a comparison set is meaningful. A fixed threshold like “cosine > 0.7 means relevant” is unreliable.
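A toy illustration of that caveat (the scores below are invented to mimic a compressed distribution like BGE v1.0's): ranking within the set stays informative even when every absolute score looks "relevant."

```python
# Toy similarity scores from a model whose contrastive training compressed
# everything into a narrow 0.6-1.0 band. The absolute values all look
# "relevant", but only the relative ordering is meaningful.
scores = {
    "passage_a": 0.91,  # actually answers the query
    "passage_b": 0.78,  # same topic, no answer
    "passage_c": 0.64,  # loosely related
}

ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])  # passage_a

# A fixed threshold like "cosine > 0.7" misclassifies passage_b as
# relevant, and would behave differently on a corpus whose score
# distribution clusters elsewhere.
above_threshold = [p for p, s in scores.items() if s > 0.7]
print(above_threshold)  # ['passage_a', 'passage_b']
```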
How models learn: contrastive fine-tuning
Models are trained with InfoNCE loss on query–passage pairs:3
- Maximize similarity between a query and its known positive passage
- Minimize similarity with “hard negatives” — topically related passages that don’t actually answer the query
- Training data: millions to billions of pairs from MS MARCO, Natural Questions, and synthetically mined corpora
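The InfoNCE objective itself fits in a few lines. This is a minimal scalar sketch using toy cosine similarities; production training operates on batched tensors, and BGE specifically uses τ=0.01 (a milder temperature is used here for readability).

```python
import math

def info_nce_loss(pos_sim, neg_sims, temperature=0.05):
    """InfoNCE: negative log-softmax of the positive pair's similarity
    over the positive plus all negatives, sharpened by a temperature."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    log_denom = math.log(sum(math.exp(l) for l in logits))
    return -(pos_sim / temperature - log_denom)

# Query vs. its positive passage and two hard negatives (toy cosines).
loss_good = info_nce_loss(0.9, [0.3, 0.2])   # positive clearly separated
loss_bad = info_nce_loss(0.5, [0.45, 0.4])   # hard negatives nearly as close

print(loss_good < loss_bad)  # True
```

Minimizing this loss pushes the model toward the first regime: the positive pair pulled together, hard negatives pushed apart, which is exactly what compresses the score distribution noted above.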
Bi-encoders vs. cross-encoders
| | Bi-encoder | Cross-encoder |
|---|---|---|
| How it works | Encodes query and document independently | Processes query + document together |
| Speed | Millisecond retrieval via ANN search | O(n) cost per query |
| Ranking quality | Scalable, lower precision | ~18% better than bi-encoder4 |
| Use case | Broad retrieval | Reranking small candidate sets |
This tradeoff drives the two-stage pipeline architecture discussed later.
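The composition can be sketched with stub scorers — the bi-encoder stub works on precomputed vectors (cheap, one pass per document), while the cross-encoder stub must process every (query, document) pair. The token-overlap "cross-encoder" here is a placeholder for real cross-attention, not how any production model scores.

```python
def bi_encoder_score(query_vec, doc_vec):
    """Cheap: dot product of independently precomputed vectors."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))

def cross_encoder_score(query, doc):
    """Placeholder for full cross-attention: token overlap ratio."""
    q_tokens, d_tokens = set(query.split()), set(doc.split())
    return len(q_tokens & d_tokens) / len(q_tokens)

# Toy corpus: (text, precomputed embedding) per document.
docs = {
    "d1": ("error codes appear in api integrations", [0.9, 0.1]),
    "d2": ("error code ts-999 means the token expired", [0.8, 0.3]),
    "d3": ("our pricing page lists all plans", [0.1, 0.9]),
}
query_text, query_vec = "what does error code ts-999 mean", [1.0, 0.2]

# Stage 1: fast bi-encoder pass over everything, keep a candidate set.
candidates = sorted(docs, key=lambda d: bi_encoder_score(query_vec, docs[d][1]),
                    reverse=True)[:2]
# Stage 2: expensive cross-encoder pass over the small candidate set only.
best = max(candidates, key=lambda d: cross_encoder_score(query_text, docs[d][0]))
print(best)  # d2
```

Note the stage-1 ranking alone puts d1 first even though it never mentions TS-999 — the reranking stage is what corrects it, which previews the failure modes discussed later.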
How Google, Perplexity, and ChatGPT retrieve content using embeddings
The common pattern across all major AI answer engines is a multi-stage pipeline: broad retrieval → reranking → generation. Here’s how each system implements it.
Google Search and AI Overviews

Google’s ranking pipeline was confirmed via DOJ antitrust trial testimony by VP of Search Pandu Nayak:5
- Stage 1 (BM25 index): narrows trillions of documents to tens of thousands of candidates
- Stage 2 (RankEmbed, deployed 2018): neural matching encodes queries and documents into a shared embedding space, adding semantically relevant results via ANN search6
- Stage 3 (DeepRank): a BERT-based cross-encoder scores only the top 20–30 documents; full cross-attention is too expensive for broader use
- Passage-level ranking: a separate system evaluates individual sections of pages, so a buried paragraph can surface even if the page overall isn’t the best match
Google AI Overviews adds query fan-out, breaking complex queries into subtopics and issuing multiple searches. Key stats:
- 40–54% of AI Overview citations overlap with top-10 organic results7
- #1 ranked page has a ~33% probability of being cited in an AI Overview8
Perplexity AI
Published architecture documentation (September 2025):1
- Index: over 200 billion unique URLs
- Retrieval: simultaneous hybrid search, BM25 + dense vector embedding via the Vespa backend9
- Reranking: multi-stage, fast embedding scorers early, cross-encoder models late
- Granularity: sub-document retrieval, emphasizing “surfacing the most atomic units possible”10
ChatGPT
- Browsing mode: uses the Bing Search API, reformulates queries, fetches pages, extracts passages, injects as context with citation tracking
- Custom GPTs: traditional RAG pipeline, documents chunked, embedded, stored in a vector database, retrieved via semantic similarity at query time
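A minimal sketch of that Custom GPT-style RAG loop, with an L2-normalized bag-of-words function standing in for a real embedding model (chunking, storage, and retrieval structure is the point, not the scoring):

```python
import math
from collections import Counter

def embed(text):
    """Stand-in for a real embedding model: L2-normalized bag of words."""
    counts = Counter(text.lower().replace(".", "").split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()}

def cosine(a, b):
    """Cosine similarity for sparse unit vectors (dicts)."""
    return sum(a[w] * b.get(w, 0.0) for w in a)

# 1. Chunk the document (here, one chunk per line for brevity).
document = """BGE models use CLS token pooling.
Sentence transformers default to mean pooling.
Cosine similarity compares unit vectors."""
chunks = document.split("\n")

# 2. Embed and store each chunk in a toy vector store.
store = [(chunk, embed(chunk)) for chunk in chunks]

# 3. At query time, retrieve the most similar chunk as context.
query_vec = embed("which pooling do sentence transformers use")
best_chunk = max(store, key=lambda item: cosine(query_vec, item[1]))[0]
print(best_chunk)
```

Real pipelines swap in a trained model for `embed`, an ANN index (e.g. a vector database) for the list scan, and pass the retrieved chunk to the LLM as cited context.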
What this means for AEO
The unit of optimization shifts from entire pages to individual passages. The matching mechanism shifts from keywords and backlinks to semantic embedding similarity. The success metric shifts from ranking position to being selected as a cited source.
BAAI/bge-base-en-v1.5: best open-source size-to-performance ratio
The BGE base model is a 109M parameter, 12-layer BERT encoder producing 768-dimensional embeddings. MIT-licensed, free for commercial use.11
| Property | Value |
|---|---|
| Architecture | BERT-base (encoder-only) |
| Parameters | ~109M |
| Embedding dimensions | 768 |
| Max sequence length | 512 tokens |
| Pooling | [CLS] token |
| MTEB average score | ~63.55 |
| Model size on disk | 438 MB |
| License | MIT |
Training pipeline (C-Pack paper, Xiao et al., 2023)3
- RetroMAE pre-training — a masked auto-encoder objective: the encoder produces embeddings from corrupted input and a lightweight decoder reconstructs the original, teaching the model to compress semantic information into dense vectors
- Contrastive fine-tuning: over 1 billion text pairs, InfoNCE loss with temperature τ=0.01, hard negatives mined via BM25 and model self-distillation
- v1.5 update (September 2023): addressed score distribution compression from v1.0; removed the requirement for the instruction prefix on queries11
How it compares
| Model | MTEB score | Notes |
|---|---|---|
| BGE-base-en-v1.5 | 63.55 | 438 MB, runs on CPU |
| BGE-large-en-v1.5 | 64.23 | 3× larger |
| OpenAI text-embedding-3-large | 64.6 | API cost per token |
| Cohere embed-v4 | 65.2 | API cost per token |
| Qwen-3-Embedding | 70.7 | Newer, higher cost |
BGE-base remains the de facto default for open-source semantic search thanks to its inference speed, native support in LangChain, LlamaIndex, and ChromaDB, and zero API costs.
Limitations: English-only, 512-token context window. For multilingual or long-document needs, use BGE-M3 (8,192 tokens, 568M parameters).
Where embeddings fail: four critical traps for content teams
1. Semantic similarity ≠ the content answers the question
This is the most important failure mode. A section can score high on embedding similarity because it discusses the same topic, while completely failing to provide the answer.
LlamaIndex’s Great Gatsby experiment: asked “Who was driving the car that hit Myrtle?” the top two results described the crash in detail but never named the driver. Only the third result contained the actual answer (Daisy).12
Why this happens structurally:
- Bi-encoders encode query and document independently
- They cannot perform the token-level cross-attention needed to verify a passage contains the specific information requested
- A passage about “error codes in API integrations” scores highly against “What does error code TS-999 mean?” even if TS-999 is never mentioned
2. Chunk size creates a precision-context tradeoff
| Chunk size | Problem |
|---|---|
| Too large | Important details “washed out” into a muddy centroid, somewhat similar to many queries, strongly similar to none |
| Too small | Matches on surface features but lacks surrounding context needed to serve as an answer |
NVIDIA’s empirical findings:13
- Factoid queries → best with 256–512 token chunks
- Complex analytical queries → best with 1,024 tokens or page-level
Jina AI (April 2025) also documented a systematic length bias: longer texts produce higher average similarity scores regardless of relevance, because their embeddings spread across more semantic space.2
3. Keyword stuffing creates false positives
Microsoft Research demonstrated at SIGIR 2024 that content densely packed with topically relevant keywords inflates relevance scores even when it doesn’t genuinely answer queries.14 Dense keyword co-occurrence maps to cluster proximity in embedding space, generating retrieval false positives that:
- Waste LLM context windows
- Degrade answer quality
- Make your content rank highly in retrieval but never get cited
4. Absolute similarity scores cannot determine relevance
You cannot set a universal cosine similarity threshold.2 The contrastive training process, score distribution compression, and length bias all make absolute scores unreliable:
- A score of 0.82 might indicate strong relevance in one corpus and weak relevance in another
- Only rank ordering within a single comparison set is meaningful
- “Comparing embedding vectors can only tell you about relative similarity, not relevance”
How to chunk content for AEO
Five strategies exist on a spectrum from simple to sophisticated:
| Strategy | How it works | Best for |
|---|---|---|
| Fixed-size | Split every N tokens | Simple pipelines, consistent chunk sizes |
| Recursive | Split by paragraph → sentence → token | General use, 30–50% better precision than fixed |
| Semantic | Split at meaning transitions via embedding similarity | Variable-length natural splits |
| Heading-aware | Split at H1/H2/H3 boundaries | AEO — most critical strategy |
| Hierarchical | Multi-level: section → subsection → paragraph | Both broad and precise retrieval |
Practical AEO recommendations
Based on NVIDIA’s empirical benchmarks across five diverse datasets:13
- Use H2/H3 headings as primary chunk boundaries
- Target 256–512 tokens per chunk (matches typical answer engine citation length)
- Add 10–20% overlap between adjacent chunks (15% performed best in NVIDIA’s testing)
- Prepend heading hierarchy (H1 > H2 > H3) as metadata to each chunk
- Keep chunk sizes within a narrow range to avoid length bias distortion
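The heading-aware recommendations above can be sketched as a small splitter. This version uses markdown `#` markers and whitespace word counts as simplifications — real pipelines should count tokens with the embedding model's own tokenizer.

```python
import re

def heading_aware_chunks(markdown):
    """Split markdown at H2/H3 boundaries, prepending the heading trail
    (H1 > H2 > H3) to each chunk as retrieval metadata."""
    chunks, trail, body = [], {}, []

    def flush():
        if body:
            prefix = " > ".join(trail[k] for k in sorted(trail))
            chunks.append((prefix, " ".join(body)))
            body.clear()

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,3})\s+(.*)", line)
        if match:
            level = len(match.group(1))
            if level >= 2:          # new H2/H3 section: close current chunk
                flush()
            trail[level] = match.group(2)
            for deeper in [l for l in trail if l > level]:
                del trail[deeper]   # drop stale deeper headings
        elif line.strip():
            body.append(line.strip())
    flush()
    return chunks

doc = "# Guide\n## Setup\nInstall it.\n## Usage\nRun it daily."
for prefix, text in heading_aware_chunks(doc):
    print(prefix, "|", text)
```

A production version would additionally merge undersized sections toward the 256–512 token target and add the 10–20% overlap between adjacent chunks.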
Contextual Retrieval (Anthropic)15
Before embedding each chunk, an LLM generates a 50–100 token context description situating the chunk within the overall document:
- Reduced top-20 retrieval failure rates by 35%
- Combined with BM25 and reranking: 67% failure reduction
- Cost: ~$1.02 per million document tokens
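The mechanics reduce to string assembly around one LLM call. The template below is paraphrased from Anthropic's published description, not their verbatim production prompt, and the LLM is stubbed so the sketch runs offline.

```python
# Hypothetical prompt template in the spirit of Anthropic's contextual
# retrieval; the exact production wording is an assumption here.
CONTEXT_PROMPT = """<document>
{document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Write a short (50-100 token) context that situates this chunk within
the overall document, to improve search retrieval of the chunk."""

def contextualize(document, chunk, llm):
    """Prepend an LLM-written context string to a chunk before embedding."""
    context = llm(CONTEXT_PROMPT.format(document=document, chunk=chunk))
    return f"{context}\n\n{chunk}"

# Stub LLM so the sketch runs without an API call.
fake_llm = lambda prompt: "From the pricing section of the ACME API docs."
enriched = contextualize("...full doc...", "Rates are per 1,000 calls.", fake_llm)
print(enriched.splitlines()[0])
```

The enriched string — context plus original chunk — is what gets embedded and BM25-indexed, so retrieval sees the situating information the bare chunk lacks.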
The two-stage pipeline: embed first, then verify with an LLM
The most reliable AEO auditing workflow mirrors production answer engines: use embeddings for fast candidate retrieval, then use an LLM to verify whether candidates actually answer the query.12
Stage 1: Embedding retrieval
- Embed target queries (the questions your audience asks) and content chunks (sections split by headings)
- Compute cosine similarity to identify the top 10–50 candidate chunks per query
- Runs in milliseconds, prioritizes recall, casting a wide net
- Tools: Screaming Frog v22 integrates this directly — connect it to an embedding API, crawl your site, and surface the pages with the highest semantic similarity on a 0–100 scale
Stage 2: LLM verification
Pass each candidate chunk and query to an LLM with a structured prompt:
Given this content section and this user query, rate 0–3: (0) unrelated, (1) topically related but doesn’t answer, (2) partial answer but unclear or incomplete, (3) directly and clearly answers the query. Identify what specific information is missing to fully answer the query.
Reading the diagnostic gap
| Stage 1 score | Stage 2 score | What it means | Action |
|---|---|---|---|
| High | High | Strong citation candidate | Keep |
| High | Low | Topically close, doesn’t answer | Priority rewrite |
| Low | Low | Not relevant | New content needed |
| Low | High | Rare — check chunking | Review chunk boundaries |
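Wiring the table into an audit loop is straightforward. Everything below is a toy: the similarity scores are invented, the LLM verdicts are hard-coded in place of a real API call, and the similarity cutoff is only meaningful relative to this one comparison set (per the thresholding caveat earlier).

```python
# Toy audit: per chunk, a stage-1 similarity score and a stage-2
# LLM verdict on the 0-3 rubric (stubbed instead of a real LLM call).
audit = {
    "pricing-faq":   {"sim": 0.88, "verdict": 3},  # high / high
    "pricing-blog":  {"sim": 0.84, "verdict": 1},  # high / low
    "press-release": {"sim": 0.31, "verdict": 0},  # low / low
}

def action(sim, verdict, sim_cutoff=0.7):
    """Map the (stage-1, stage-2) pair to the diagnostic-gap action.
    The cutoff is relative to this comparison set, not a universal value."""
    high_sim, answers = sim >= sim_cutoff, verdict >= 2
    if high_sim and answers:
        return "keep"
    if high_sim:
        return "priority rewrite"        # topically close, doesn't answer
    if answers:
        return "review chunk boundaries" # rare: answer hidden by bad splits
    return "new content needed"

plan = {name: action(r["sim"], r["verdict"]) for name, r in audit.items()}
print(plan["pricing-blog"])  # priority rewrite
```

The high-similarity/low-verdict bucket is where the audit pays off: those chunks already win retrieval but lose citation, so rewriting them has the highest expected return.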
Real-world results
- Anthropic contextual retrieval (top-150 → rerank to top-20): 67% reduction in retrieval failures15
- Cost: cross-encoder rerankers (BGE Reranker, Cohere Rerank) are 75× cheaper than LLM-based evaluation13
- LlamaIndex’s `LLMRerank` postprocessor: retrieve top-K via embeddings, then the LLM reorders — it returned a single correct result vs. three partially relevant ones in the Great Gatsby test12
Conclusion: embeddings are necessary but not sufficient
Embeddings solve the retrieval problem but not the answer problem. They efficiently identify which content sections live in the same semantic neighborhood as target queries — an essential first step no manual audit can replicate at scale. But semantic proximity is not answer quality.
The actionable framework is three layers:
- Structure for extractability: heading-aware sections of 256–512 tokens, each self-contained enough to serve as a standalone answer
- Embed for candidate identification: map which sections are likely retrieval candidates for which queries, and where gaps exist
- Verify with an LLM: distinguish “topically close” from “actually answers the question,” and prioritize rewrites where the gap is largest
This structure-retrieve-verify pipeline directly simulates what Perplexity, Google AI Overviews, and ChatGPT do when deciding which content to cite.
References
1. Perplexity AI: Architecting and Evaluating an AI-First Search API
2. Jina AI: On the Size Bias of Text Embeddings and Its Impact in Search
3. Xiao et al., 2023: C-Pack: Packed Resources for General Chinese Embeddings
4. Pinecone: Rerankers and Two-Stage Retrieval
5. Keywords People Use: How Google Search Works
6. Natzir Turrado: How Google Works
7. Search Engine Journal / BrightEdge: Google AI Overviews Overlaps Organic Search by 54%
8. xSeek: Do Top Google Rankings Drive More AI Overview Citations?
9. Vespa.ai: How Perplexity uses Vespa
10. Perplexity AI: Introducing the Perplexity Search API
11. BAAI: bge-base-en-v1.5 Model Card
12. LlamaIndex: Using LLMs for Retrieval and Reranking
13. NVIDIA Developer Blog: Finding the Best Chunking Strategy for Accurate AI Responses
14. Microsoft Research / SIGIR 2024: LLMs can be Fooled into Labelling a Document as Relevant
15. Anthropic: Introducing Contextual Retrieval