Why Your RAG Application Gives Wrong Answers: Every Root Cause With Fixes and Working Code
RAG applications fail in specific, diagnosable ways. Bad retrieval, context poisoning, re-ranking failures, and generation errors each leave distinct fingerprints in the output. This guide diagnoses every failure mode with working Python code for each fix.
Priya Nair
June 19, 2026
Introduction
A RAG system that gives wrong answers has failed at one of three stages: retrieval, context preparation, or generation. The failure looks the same from the outside (a wrong answer), but the fix is completely different depending on which stage broke. This guide diagnoses each failure mode and provides the specific code fix for each one.
The Problem: Three Types of Wrong Answers
- Type 1 (retrieval failure): the right document exists in the knowledge base but was not retrieved. The answer is wrong because the model did not have the information.
- Type 2 (context poisoning): irrelevant documents were retrieved alongside relevant ones and the model weighted them incorrectly, producing a blended wrong answer.
- Type 3 (generation failure): the right documents were retrieved but the model misread, misunderstood, or ignored them and generated an answer from its training data instead.
Causes: Why Each Type Happens
- Retrieval failure cause 1: semantic mismatch between query and document. The user asks in casual language, the document uses technical language, and the two embeddings end up too far apart for the document to rank in the top-k.
- Retrieval failure cause 2: bad chunking strategy. The answer is split across two chunks and neither chunk alone is semantically close enough to the query to be retrieved.
- Context poisoning cause 1: top-k is set too high. Retrieving 10 documents when only 2 are relevant floods the context with noise.
- Context poisoning cause 2: no re-ranking step. Pure vector similarity does not distinguish between topically adjacent but factually different documents.
- Generation failure cause 1: context too long. When the context exceeds the length the model can attend to effectively, the model falls back on training data for the answer.
- Generation failure cause 2: the prompt does not instruct the model to prefer retrieved context over its training knowledge.
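The last cause in the list above is usually addressed in the prompt template itself. The following is a minimal sketch rather than a definitive template: the function name `build_grounded_prompt`, the `max_chars` budget, and the instruction wording are all assumptions introduced for illustration. It places the retrieved chunks before the question, tells the model to answer only from them, and stops adding chunks once a character budget is spent so the context cannot grow without bound.

```python
def build_grounded_prompt(
    question: str,
    chunks: list[str],
    max_chars: int = 6000,  # hypothetical budget; tune for your model
) -> str:
    """Assemble a prompt with retrieved context first and the question last."""
    kept: list[str] = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        block = f"[{i}] {chunk.strip()}"
        # Stop adding chunks once the character budget would be exceeded
        if used + len(block) > max_chars:
            break
        kept.append(block)
        used += len(block)

    context = "\n\n".join(kept) if kept else "(no context retrieved)"
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, reply exactly: "
        '"I do not have relevant information."\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    print(build_grounded_prompt(
        "What is the request timeout?",
        ["The request timeout is 30 seconds."],
    ))
```

Chunks are assumed to arrive in relevance order, so the budget drops the least relevant ones first. A character count is only a rough proxy for tokens; a tokenizer-based budget would be more precise.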
Solutions: Fixes for Each Failure Type
Fix 1: Hybrid Search for Retrieval Failure
# Hybrid search combines dense (semantic) and sparse (keyword) retrieval
# Fixes cases where semantic embedding misses exact keyword matches
from pinecone import Pinecone
from pinecone_text.sparse import BM25Encoder
from openai import OpenAI
import numpy as np

client = OpenAI()
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("your-hybrid-index")
bm25 = BM25Encoder.default()

def get_dense_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def hybrid_search(
    query: str,
    top_k: int = 5,
    alpha: float = 0.7  # 0=pure sparse, 1=pure dense
) -> list[dict]:
    """
    Hybrid search blending dense and sparse retrieval.
    alpha=0.7 means 70% weight on semantic, 30% on keyword.
    Adjust alpha based on your content type:
    - Technical docs with exact terms: lower alpha (0.4-0.5)
    - Natural language queries: higher alpha (0.7-0.8)
    """
    dense_vector = get_dense_embedding(query)
    sparse_vector = bm25.encode_queries(query)

    # Scale vectors by alpha
    scaled_dense = [v * alpha for v in dense_vector]
    scaled_sparse = {
        "indices": sparse_vector["indices"],
        "values": [v * (1 - alpha) for v in sparse_vector["values"]]
    }

    results = index.query(
        vector=scaled_dense,
        sparse_vector=scaled_sparse,
        top_k=top_k,
        include_metadata=True
    )
    return results["matches"]

# Diagnostic: check if retrieval failure is happening
def diagnose_retrieval(query: str, expected_doc_id: str) -> dict:
    """
    Run during development to check if the right document is retrieved.
    """
    results = hybrid_search(query, top_k=10)
    retrieved_ids = [r["id"] for r in results]
    return {
        "expected_retrieved": expected_doc_id in retrieved_ids,
        "rank_if_found": retrieved_ids.index(expected_doc_id) + 1
            if expected_doc_id in retrieved_ids else None,
        "top_retrieved": retrieved_ids[:3]
    }

Fix 2: Re-Ranking to Eliminate Context Poisoning
# Re-ranking with Cohere: improves precision after initial retrieval
# Reduces context poisoning by filtering irrelevant retrieved documents
import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")

def retrieve_and_rerank(
    query: str,
    documents: list[dict],
    initial_top_k: int = 20,
    final_top_k: int = 4,
    relevance_threshold: float = 0.5
) -> list[dict]:
    """
    Two-stage retrieval:
    Stage 1: Retrieve initial_top_k documents by vector similarity
    Stage 2: Re-rank and keep only final_top_k above threshold
    The threshold prevents low-quality documents from entering context
    even if no better documents exist.
    """
    # Stage 1: Initial retrieval (already done, documents passed in)
    texts = [doc["content"] for doc in documents[:initial_top_k]]
    # Nothing to re-rank: skip the API call
    if not texts:
        return []

    # Stage 2: Cohere re-ranking
    rerank_results = co.rerank(
        query=query,
        documents=texts,
        model="rerank-english-v3.0",
        top_n=final_top_k
    )

    # Filter by relevance threshold
    filtered = [
        {
            "content": documents[r.index]["content"],
            "relevance_score": r.relevance_score,
            "original_rank": r.index
        }
        for r in rerank_results.results
        if r.relevance_score >= relevance_threshold
    ]

    # Safety: if no documents pass threshold, return empty
    # Agent should respond 'I do not have relevant information'
    # rather than answering from poisoned context
    if not filtered:
        return []
    return filtered

# Bad chunking diagnostic (reuses get_dense_embedding and np from Fix 1)
def test_chunking_coverage(text: str, query: str, chunk_size: int = 512) -> dict:
    """
    Tests whether the answer to a query would be split across chunks.
    Run this during development to tune chunk size.
    Note: chunk_size counts characters here, not tokens.
    """
    chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
    if not chunks:
        raise ValueError("text is empty")

    query_embedding = np.array(get_dense_embedding(query))
    scores = []
    for chunk in chunks:
        chunk_embedding = np.array(get_dense_embedding(chunk))
        # Cosine similarity
        score = np.dot(query_embedding, chunk_embedding) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(chunk_embedding)
        )
        scores.append(float(score))

    # Convert NumPy types to plain Python so the result is JSON-serializable
    best_idx = int(np.argmax(scores))
    best_score = scores[best_idx]
    # 0.75 is a heuristic cutoff; typical scores vary by embedding model
    likely_split = best_score < 0.75
    return {
        "best_chunk_index": best_idx,
        "best_similarity": round(best_score, 3),
        "answer_likely_split": likely_split,
        "recommendation": (
            "Try larger chunks, overlap, or structure-aware splitting"
            if likely_split else "Chunk size OK"
        )
    }

Examples: Before and After Fixes
- Before hybrid search: query 'RFC 2616 timeout behavior' retrieved documents about general HTTP without the specific RFC. After hybrid search with alpha=0.4: correct RFC document retrieved at rank 1.
- Before re-ranking: top-k=10 retrieved 6 irrelevant documents about related policies. Model averaged them into a wrong answer. After re-ranking with threshold 0.5: only 3 relevant documents passed, correct answer produced.
- Before chunking fix: 512-token chunks split a table across two chunks, and neither chunk was retrieved. After switching to semantic chunking on paragraph boundaries: the table stayed in one chunk and was correctly retrieved.
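The chunking fix in the last example can be approximated without any library. This is a minimal sketch under stated assumptions: paragraphs are separated by blank lines, `chunk_by_paragraph` and `max_chars` are hypothetical names, and size is measured in characters as a rough stand-in for tokens. It packs whole paragraphs greedily and never cuts inside one, so a table written as a single block stays together.

```python
def chunk_by_paragraph(text: str, max_chars: int = 2000) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_chars.

    A paragraph longer than max_chars is kept intact as its own
    oversized chunk rather than being split.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    for para in paragraphs:
        # +2 accounts for the blank line used to rejoin paragraphs
        extra = len(para) + (2 if current else 0)
        if current and current_len + extra > max_chars:
            # Adding this paragraph would overflow: close the current chunk
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
            extra = len(para)
        current.append(para)
        current_len += extra

    if current:
        chunks.append("\n\n".join(current))
    return chunks


if __name__ == "__main__":
    doc = "Intro paragraph.\n\n| a | b |\n| 1 | 2 |\n\nClosing paragraph."
    for chunk in chunk_by_paragraph(doc, max_chars=40):
        print(repr(chunk))
```

Real documents usually need format-specific boundaries (headings, code fences, table markers), so treat paragraph splitting as a starting point and validate it with retrieval metrics.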
Common Mistakes
- Mistake 1: Fixed chunk size for all document types. Code documentation, markdown, and prose each have different natural boundaries. Use semantic chunking that respects document structure.
- Mistake 2: Top-k set to 10 or higher without re-ranking. More retrieved documents without quality filtering increases noise faster than it increases relevant signal.
- Mistake 3: Embedding the raw user query without query expansion. Short ambiguous queries produce poor embeddings. Expand the query before embedding it.
- Mistake 4: Ignoring retrieval quality metrics. Without recall@k and MRR measurements during development, retrieval failures are invisible until production.
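- Mistake 5: Putting retrieved context after the question. Where context sits in the prompt affects how well the model uses it, and models tend to use context better when it comes before the question. Put context first in the prompt and confirm the effect on your own evaluation set.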
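The two metrics named in Mistake 4 take only a few lines to compute. The sketch below assumes you already have, for each test query, the ranked list of retrieved document IDs and a labeled set of relevant IDs; the function names and data layout are illustrative, not taken from any library.

```python
def recall_at_k(
    retrieved: list[list[str]],
    relevant: list[set[str]],
    k: int,
) -> float:
    """Mean over queries of (relevant docs found in the top k) / (relevant docs)."""
    scores = []
    for ids, gold in zip(retrieved, relevant):
        if not gold:
            continue  # skip queries with no labeled relevant documents
        hits = len(gold & set(ids[:k]))
        scores.append(hits / len(gold))
    return sum(scores) / len(scores) if scores else 0.0


def mean_reciprocal_rank(
    retrieved: list[list[str]],
    relevant: list[set[str]],
) -> float:
    """Mean over queries of 1 / rank of the first relevant doc (0 if none found)."""
    total = 0.0
    for ids, gold in zip(retrieved, relevant):
        for rank, doc_id in enumerate(ids, start=1):
            if doc_id in gold:
                total += 1.0 / rank
                break
    return total / len(retrieved) if retrieved else 0.0


if __name__ == "__main__":
    retrieved = [["a", "b", "c"], ["x", "y", "z"]]
    relevant = [{"b"}, {"q"}]
    print(recall_at_k(retrieved, relevant, k=3))      # 0.5
    print(mean_reciprocal_rank(retrieved, relevant))  # 0.25
```

Tracking both numbers per release makes retrieval regressions visible before they surface as wrong answers.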
Best Practices
- Implement a RAG evaluation pipeline during development with at least 50 question-answer pairs and measure retrieval recall before moving to generation quality.
- Use sentence-transformers/multi-qa-mpnet-base-dot-v1 for technical content and text-embedding-3-small for general content; model choice significantly affects retrieval quality.
- Store chunk metadata including source document ID, page number, and surrounding chunk IDs so the generation model can request adjacent context when needed.
- Implement query expansion: before retrieving, use an LLM to generate 2-3 alternative phrasings of the query and retrieve for all of them, then merge and re-rank.
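The merge step of query expansion can be sketched as follows. The article does not say how to merge the per-query result lists; this example uses reciprocal rank fusion as one common option, which is an assumption on my part. The alternative phrasings are assumed to have been generated already, and `search` stands in for whatever retriever you use (any callable that returns ranked document IDs). The fused list would then go to the re-ranker.

```python
from collections import defaultdict
from typing import Callable


def merge_expanded_results(
    queries: list[str],
    search: Callable[[str], list[str]],
    top_k: int = 20,
    rrf_k: int = 60,  # conventional RRF smoothing constant
) -> list[str]:
    """Fuse ranked ID lists from several query phrasings with reciprocal rank fusion."""
    fused: dict[str, float] = defaultdict(float)
    for query in queries:
        for rank, doc_id in enumerate(search(query), start=1):
            # Documents ranked highly by several phrasings accumulate more score
            fused[doc_id] += 1.0 / (rrf_k + rank)
    # Sort by score descending; break ties by ID for deterministic output
    ranked = sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))
    return ranked[:top_k]


if __name__ == "__main__":
    # Stand-in retriever backed by a fixed table, for demonstration only
    fake_index = {
        "reset password": ["doc-7", "doc-2"],
        "recover account access": ["doc-2", "doc-9"],
    }
    merged = merge_expanded_results(list(fake_index), lambda q: fake_index[q])
    print(merged)  # ['doc-2', 'doc-7', 'doc-9']
```

Rank-based fusion avoids comparing raw similarity scores across queries, which are not on a common scale.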
Comparison: Retrieval Strategies
- Pure dense retrieval vs hybrid: hybrid consistently outperforms pure dense on technical content with specific terminology by 15 to 25 percent on recall@5.
- Fixed chunking vs semantic chunking: semantic chunking improves retrieval recall by 10 to 20 percent on document-heavy knowledge bases at the cost of slightly more complex ingestion.
- BM25 + dense vs BM25 + dense + re-ranking: adding re-ranking improves precision@3 by 20 to 35 percent with approximately 300ms additional latency from the re-ranking API call.
FAQ
- Q: What chunk size should I use? A: Start with 512 tokens with 64 token overlap for general text. Reduce to 256 for technical documentation. Always validate with retrieval metrics on your actual queries.
- Q: Should I use re-ranking on every query? A: Yes if latency budget allows. Re-ranking adds 200-400ms but improves answer quality significantly. If latency is critical, cache re-ranking results for frequent queries.
- Q: How many documents should I retrieve (top-k)? A: Retrieve 20 initially for re-ranking, then pass only the top 4-6 after re-ranking to the generation model. Never pass more than 6 to 8 chunks unless the model context is very large.
- Q: How do I know if my RAG is failing at retrieval or generation? A: Add a retrieval logging step that stores retrieved documents for failed queries. If the correct document was retrieved: generation failure. If not: retrieval failure. Different fixes apply.
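The triage rule in the last answer can be written as a small function over logged data. This sketch assumes each failed query has been logged with the ID of the document that should have answered it, the IDs returned by retrieval, and the IDs that actually reached the prompt after re-ranking or trimming; all names here are hypothetical. It adds a middle case for the context preparation stage named in the introduction.

```python
from collections import Counter


def classify_failure(
    expected_doc_id: str,
    retrieved_ids: list[str],
    context_ids: list[str],
) -> str:
    """Name the stage that most likely caused a wrong answer."""
    if expected_doc_id not in retrieved_ids:
        return "retrieval"  # the right document never came back
    if expected_doc_id not in context_ids:
        return "context_preparation"  # retrieved, then dropped before the prompt
    return "generation"  # the model saw the document and still answered wrongly


def summarize_failures(logs: list[dict]) -> Counter:
    """Count failures per stage across a batch of logged wrong answers."""
    return Counter(
        classify_failure(log["expected"], log["retrieved"], log["context"])
        for log in logs
    )


if __name__ == "__main__":
    logs = [
        {"expected": "d1", "retrieved": ["d2", "d3"], "context": ["d2"]},
        {"expected": "d1", "retrieved": ["d1", "d2"], "context": ["d2"]},
        {"expected": "d1", "retrieved": ["d1", "d2"], "context": ["d1"]},
    ]
    print(summarize_failures(logs))
```

A skew toward one stage in the summary tells you which fix to work on first.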
Conclusion
Wrong answers in RAG applications trace to retrieval failures, context poisoning, or generation failures, each requiring different fixes. Hybrid search addresses retrieval gaps, re-ranking eliminates context noise, and proper prompt engineering ensures the model uses retrieved context rather than defaulting to training knowledge. Build a diagnostic pipeline early that identifies which stage failed on any given wrong answer and the fixes become straightforward.