Hallucinations & What Rerankers Can’t Fix
If you’re building AI products, you’ve probably discovered something uncomfortable: your system hallucinates. Frequently enough that you can’t ship it to enterprise customers without tomes of disclaimers.
So you’ve tried tuning your prompts, swapping LLMs, and tweaking your RAG pipeline. And still, roughly a third of your responses contain confidently stated nonsense. Eventually you realize the problem isn’t your prompts or your model. It’s retrieval.
Introducing Re-Rankers & What They Actually Do
Most RAG systems work like this: convert everything to vectors, find similar vectors, feed the top matches to the LLM. It’s appealingly simple. It’s also wrong about 40% of the time.
Here’s why. When you encode “side effects of aspirin in elderly patients” and “aspirin is effective for elderly patients” into vectors, they look similar. Same words, same domain. Your retrieval system sometimes can’t tell that one answers the question and the other doesn’t.
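You can reproduce the problem in a few lines. A minimal sketch, assuming the off-the-shelf all-MiniLM-L6-v2 model from sentence-transformers (an illustrative choice; any general-purpose bi-encoder behaves similarly):

```python
# A minimal sketch of the failure mode, assuming the off-the-shelf
# all-MiniLM-L6-v2 model; any general-purpose embedding model behaves similarly.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

a = "side effects of aspirin in elderly patients"
b = "aspirin is effective for elderly patients"

emb_a, emb_b = model.encode([a, b], convert_to_tensor=True)

# The cosine similarity comes back high: same words, same domain.
# Nothing in that single number says only one sentence is about side effects.
print(util.cos_sim(emb_a, emb_b).item())
```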
A re-ranker, by contrast, is a model that refines the ordering of retrieved documents by scoring each one against the specific query. Here’s how it fits into the broader retrieval pipeline. In a typical system, you first pull an initial set of candidate documents (say, the top 100) using fast but less precise methods. A re-ranker then examines these documents more carefully and reorders them so the most relevant ones appear at the top. Think of it as a two-stage filter: the first stage casts a wide net quickly, and the second applies more scrutiny to rank what was caught.
In Retrieval-Augmented Generation (RAG), the typical workflow is:
- User asks a question
- System retrieves relevant documents from a knowledge base
- These documents are added to the LLM’s context
- LLM generates an answer based on the retrieved context
Re-rankers improve step 2 by ensuring the most relevant documents are retrieved and passed to the LLM. This is crucial because you typically can’t send 100 documents to the LLM; you need to select the best 5-10. Re-rankers help you make optimal use of that limited context window by filling it with the most relevant documents rather than marginally useful ones. Even with large context windows, the quality of retrieved content matters more than its quantity.
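To make that constraint concrete, here’s a minimal sketch; the fit_to_context helper and the 4-characters-per-token estimate are illustrative assumptions, not part of any particular framework:

```python
# Illustrative only: a crude character count stands in for a real tokenizer.
def fit_to_context(ranked_docs: list[str], budget_tokens: int = 4000) -> list[str]:
    """Greedily keep documents, best-first, until the token budget is spent."""
    selected, used = [], 0
    for doc in ranked_docs:      # assumes docs arrive ordered best-first
        cost = len(doc) // 4     # rough 4-chars-per-token approximation
        if used + cost > budget_tokens:
            break
        selected.append(doc)
        used += cost
    return selected

# Whatever survives this cut is all the LLM ever sees, which is why
# the ordering handed to this step matters so much.
```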
How Re-Rankers Achieve What Embeddings Alone Don’t
Re-rankers typically use cross-encoders: a neural network architecture that takes two pieces of text (query + document) as a single combined input, captures the interaction between them, and outputs a relevance score. This is fundamentally different from how embeddings work, where query and document are encoded separately and only compared afterwards.
Cross-attention is the operative word for capturing these interactions. When processing the concatenated input, the transformer’s attention layers allow tokens from the query to attend directly to tokens in the document and vice versa. This captures:
- Semantic matching: Does the document answer the query’s intent?
- Term importance: Are key query terms prominent in the document?
- Context understanding: Does the document context align with query context?
- Negation & qualification: Does the document contradict or merely qualify what the query asks?
So, if the query is “side effects of aspirin” and one document says “aspirin has no side effects” while another discusses “common side effects include…”, the cross-encoder can detect the negation in the first document that a simple embedding similarity might miss.
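In code, the scoring step looks roughly like this. A sketch using sentence-transformers’ CrossEncoder with the public ms-marco-MiniLM-L-6-v2 checkpoint (an illustrative choice, not a recommendation):

```python
# Each (query, document) pair is scored jointly, so the model can weigh
# the negation in the first document against the query's intent.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "side effects of aspirin"
docs = [
    "Aspirin has no side effects.",
    "Common side effects include stomach upset, heartburn, and bleeding.",
]

scores = reranker.predict([(query, doc) for doc in docs])
for doc, score in sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.2f}  {doc}")
```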
Reranking Process
- Initial retrieval gets top-k candidates (e.g., 100 documents) using embeddings
- For each candidate, create input: [query, document]
- Pass through cross-encoder → get relevance score
- Sort by score, take top-n (e.g., 5-10) for the LLM
- These become your RAG context
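Wired together, the whole loop fits in a short function. A minimal sketch with illustrative model choices; in production you would embed the corpus once and keep it in a vector database rather than re-encoding it per query:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")               # stage 1: fast, approximate
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # stage 2: slow, precise

def retrieve(query: str, corpus: list[str], k: int = 100, n: int = 5) -> list[str]:
    # Stage 1: vector search over the whole corpus, keep the top-k candidates.
    corpus_embs = embedder.encode(corpus, convert_to_tensor=True)
    query_emb = embedder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_embs, top_k=k)[0]
    candidates = [corpus[hit["corpus_id"]] for hit in hits]

    # Stage 2: score every (query, candidate) pair jointly, keep the top-n.
    scores = reranker.predict([(query, doc) for doc in candidates])
    reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in reranked[:n]]   # these become your RAG context
```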
The difference matters. Embeddings tell you “these texts are about similar things.” Re-rankers tell you “this text answers that question.” For technical documentation, customer support, or any domain where precision matters, that distinction is everything.
Where Re-Rankers Fit
Think of retrieval as a two-stage filter:
Stage 1 (Embeddings): Cast a wide net fast. Your vector database scans millions of documents in milliseconds, narrowing down to the top 100 candidates.
Stage 2 (Reranker): Apply scrutiny. The re-ranker examines these 100 carefully, reordering them so the truly relevant ones rise to the top.
You need both. Embeddings give you speed; re-rankers give you accuracy. Neither alone is sufficient for production. The reranker sits between your initial retrieval and your LLM.
The Hallucination Numbers
At GraaS, we tested this extensively on technical documentation systems. Here’s what actually happens:
| KB Quality | No ReRanker | Single ReRanker | Two-Stage ReRanker |
| --- | --- | --- | --- |
| Baseline (Q1) | 38% | 30% | 26% |
| Improved (Q2) | 20% | 13% | 8% |
Look at that table carefully. Three insights jump out:
First, KB quality matters more than anything else. Going from baseline to improved KB reduces hallucinations by 18 percentage points, more than any amount of re-ranking.
Second, re-rankers help significantly. Single-stage re-ranking cuts hallucinations by 8 percentage points with baseline KB, 7 points with improved KB.
Third, there’s an anti-pattern to avoid: baseline KB with two-stage re-ranker (26%) performs worse than improved KB with no re-ranker (20%). If your content is mediocre, sophisticated retrieval just helps you find mediocre answers faster.
How to Use ReRankers: The Practical Bit
API Services (Easiest): Cohere Rerank 3 is production-ready. Send your query and candidates, get reordered results. Integrates in an afternoon, costs fractions of a penny per query.
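The integration really is small. A sketch assuming Cohere’s Python SDK; the exact model name and response fields are worth confirming against their current documentation:

```python
import cohere

co = cohere.Client("YOUR_API_KEY")   # placeholder key

candidate_docs = ["doc text 1", "doc text 2"]   # in practice, your top-100 from vector search

response = co.rerank(
    model="rerank-english-v3.0",     # Cohere Rerank 3; check the docs for the current name
    query="side effects of aspirin in elderly patients",
    documents=candidate_docs,
    top_n=5,
)
for result in response.results:
    print(result.relevance_score, candidate_docs[result.index])
```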
Open Source (More Control): bge-reranker-v2-m3 is excellent for self-hosting. Better for data privacy or high-volume optimization.
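Self-hosting isn’t much more code. A sketch using the FlagEmbedding package published by the BGE authors (sentence-transformers’ CrossEncoder can load the same checkpoint if you prefer that route):

```python
from FlagEmbedding import FlagReranker

# use_fp16 trades a little precision for speed on GPU
reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

pairs = [
    ["side effects of aspirin", "Aspirin has no side effects."],
    ["side effects of aspirin", "Common side effects include stomach upset and bleeding."],
]
scores = reranker.compute_score(pairs)   # one relevance score per (query, document) pair
print(scores)
```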
Start with APIs. Move to self-hosted only at serious scale. The setup is straightforward; the hard part is everything else 😊
The Real Problem
Here’s what we’ve learned building these systems: rerankers are remarkable but they’re solving the wrong problem. They’re like hiring an excellent librarian for a library with terrible books.
Your baseline KB probably has: outdated documentation from three product versions ago, contradictory information between different docs, incomplete guides that assume context readers don’t have, poorly structured content that’s hard to chunk effectively.
Re-rankers genuinely help. They do reduce hallucinations. But they’re amplifying whatever signal exists in your knowledge base. If that signal is weak, you’re just amplifying weakness.
The unglamorous truth: spend 80% of your effort on KB quality and 20% on retrieval sophistication, not the reverse. Version your documentation. Remove outdated content. Fill gaps. Validate accuracy. Structure information clearly. Reduce entropy. Optimize content for chunking. Or simply use CohGent and fix these issues automatically at a tenth of the time and cost.
We’ve seen our clients’ systems go from 38% to 8% hallucination rates. The companies that succeeded didn’t do it with clever reranking algorithms. They did it by curating excellent knowledge bases and then, only then, adding rerankers to surface that quality content effectively.
Rerankers are essential infrastructure for any serious RAG system. But they’re infrastructure. The actual product is your knowledge base. Get that right first.
If you’re building AI products that need to be accurate, you need re-rankers. But if you’re still seeing high hallucination rates after adding them, your problem isn’t retrieval—it’s content. Fix the source, then optimize how you find it.
