Where Vector Search Fails
We started with a standard RAG pipeline for contract analysis: chunk documents, embed them, retrieve by similarity, pass to LLM. It worked for simple questions — “What is the payment term?” retrieves the right clause and the model extracts the answer.
It broke on questions that require structural reasoning:
- “Which obligations changed after Amendment 2?” — Vector search cannot traverse the amendment-to-clause relationship. It finds chunks that mention amendments and chunks that mention obligations, but it does not know which amendment modified which obligation.
- “Are there conflicting liability terms across our contract set?” — This requires cross-document comparison. Vector search retrieves similar chunks, but “similar” and “conflicting” are different relationships.
- “What is the effective indemnification cap considering all amendments?” — This requires following a chain: original cap → Amendment 1 modification → Amendment 2 override → side letter exception. Vector search gives you fragments. Not the chain.
Here is the pattern: vector search handles “find me content about X” well. It fails at “show me the relationship between X and Y” and “trace the chain from X to Z through Y.”
What GraphRAG Adds
GraphRAG builds a knowledge graph over your documents: entities (clauses, parties, obligations, dates, amounts) become nodes, and relationships (modifies, supersedes, references, constrains) become edges. When a query arrives, you do two things in parallel: vector search for semantic relevance, and graph traversal for structural context. Then you merge and rerank.
What This Looks Like in Practice
For our contract intelligence system, the graph contains:
- Document nodes: MSA, SOW, Amendments, Side Letters — each with metadata (date, parties, status)
- Clause nodes: Individual clauses extracted and classified (indemnification, liability, payment, termination)
- Relationship edges: “Amendment 2 modifies MSA Section 4.3”, “Side Letter overrides Amendment 1 Section 2”, “SOW references MSA Section 7”
- Entity nodes: Parties, dates, amounts, obligations — linked to their source clauses
When someone asks “What is the current liability cap?”, the graph traversal finds the original cap in the MSA, follows modification edges through amendments, and returns the chain. The vector search simultaneously retrieves relevant text. The LLM receives both: the structured chain for accuracy, the raw text for grounding.
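To make that traversal concrete, here is a minimal sketch using networkx. The node names, dates, and dollar figures are illustrative rather than our production schema; the point is that explicit "modifies" edges let you walk from the original clause to the latest amendment and return the whole chain.

```python
# Minimal sketch of the modification-chain traversal (illustrative data).
import networkx as nx

g = nx.DiGraph()
# Clause nodes carry the text of the version they introduce.
g.add_node("MSA §9.2", text="Liability capped at $1,000,000.", date="2022-01-15")
g.add_node("Amd1 §3", text="Cap raised to $2,000,000.", date="2022-09-01")
g.add_node("Amd2 §1", text="Cap raised to $3,000,000 for data breaches.", date="2023-04-10")
# "modifies" edges point from the amending clause to the clause it changes.
g.add_edge("Amd1 §3", "MSA §9.2", relation="modifies")
g.add_edge("Amd2 §1", "Amd1 §3", relation="modifies")

def modification_chain(g: nx.DiGraph, clause: str) -> list[str]:
    """Walk 'modifies' edges backward from the original clause to the
    latest amendment, returning the chain oldest-first."""
    chain = [clause]
    while True:
        preds = [u for u, _, d in g.in_edges(chain[-1], data=True)
                 if d["relation"] == "modifies"]
        if not preds:
            return chain
        # If several amendments touch the same clause, follow the most recent.
        chain.append(max(preds, key=lambda n: g.nodes[n]["date"]))

print(" -> ".join(modification_chain(g, "MSA §9.2")))
# MSA §9.2 -> Amd1 §3 -> Amd2 §1
```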
Hybrid Retrieval Architecture

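A simplified sketch of the two-path flow. The `vector_search`, `graph_neighborhood`, and `rerank` stubs below stand in for whatever vector store, graph store, and reranker you actually run; the part worth copying is the merge, which keeps one entry per chunk ID and records which paths surfaced it so the reranker can boost double hits.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins for a real vector store, graph store, and reranker.
def vector_search(query: str, k: int) -> list[dict]:
    return [{"id": "msa-9.2", "text": "...", "score": 0.82, "source": "vector"}]

def graph_neighborhood(query: str, k: int) -> list[dict]:
    return [{"id": "msa-9.2", "text": "...", "score": 0.70, "source": "graph"},
            {"id": "amd2-1", "text": "...", "score": 0.65, "source": "graph"}]

def rerank(query: str, hits: list[dict]) -> list[dict]:
    # Simplest useful policy: prefer hits found by both paths, then by score.
    return sorted(hits, key=lambda h: (len(h["sources"]), h["score"]), reverse=True)

def hybrid_retrieve(query: str, k: int = 10) -> list[dict]:
    # Run both retrieval paths in parallel.
    with ThreadPoolExecutor(max_workers=2) as pool:
        vec = pool.submit(vector_search, query, k)
        gra = pool.submit(graph_neighborhood, query, k)
        hits = vec.result() + gra.result()

    # Merge on ID: one entry per chunk/node, tracking which paths found it.
    merged: dict[str, dict] = {}
    for h in hits:
        entry = merged.setdefault(h["id"], {**h, "sources": set()})
        entry["sources"].add(h["source"])
        entry["score"] = max(entry["score"], h["score"])

    return rerank(query, list(merged.values()))[:k]

print([h["id"] for h in hybrid_retrieve("What is the current liability cap?")])
# ['msa-9.2', 'amd2-1']
```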
When Graph Helps, When It Hurts
| Use Case | Vector Only | Graph Helps | Why |
| --- | --- | --- | --- |
| Simple factual lookup | Sufficient | Overkill | One chunk answers the question. No traversal needed. |
| Multi-hop questions | Fails | Essential | Answer requires following relationships across documents. |
| Cross-document comparison | Partial | Strong | Graph connects related clauses across documents explicitly. |
| Rapidly changing corpus | Good | Expensive | Graph extraction and updates add latency to ingestion. |
| Small corpus (<100 docs) | Sufficient | Over-engineering | Vector search coverage is good enough at small scale. |
| Compliance / audit trail | Weak | Strong | Graph provides citation chains: which clause, which version, which modification. |
Our honest assessment: if your queries are all simple lookups and your corpus is under 200 documents, vector search with good chunking and metadata filtering is probably enough. Graph adds value when you have relational questions, cross-document dependencies, or compliance requirements that demand citation chains.
Two Failures That Shaped Our Approach
Failure 1: The entity extraction cascade. Our initial graph was built with aggressive entity extraction. Every noun phrase became a node. The graph was huge, noisy, and slow. Queries that should have returned 3 relevant clauses were returning 40+ nodes. The model drowned in context. The fix: we restricted entity extraction to a curated taxonomy — document types, clause types, party names, dates, monetary amounts, and obligation types. Everything else is handled by vector search. Less graph, better graph.
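In code, the fix amounts to a hard gate between extraction and the graph. This is a sketch; `extract_entities` stands in for our real NER/LLM extraction step, and the taxonomy mirrors the types listed above.

```python
# Curated taxonomy: only these entity types may become graph nodes.
ENTITY_TAXONOMY = {
    "document_type",    # MSA, SOW, amendment, side letter
    "clause_type",      # indemnification, liability, payment, termination
    "party",
    "date",
    "monetary_amount",
    "obligation",
}

def extract_entities(text: str) -> list[dict]:
    # Stand-in for the real extraction step (NER model or LLM pass).
    return [
        {"type": "monetary_amount", "value": "$1,000,000"},
        {"type": "noun_phrase", "value": "reasonable efforts"},  # noise
    ]

def graph_candidates(chunk_text: str) -> list[dict]:
    # The gate: anything outside the taxonomy never becomes a node;
    # it is left to vector search instead.
    return [e for e in extract_entities(chunk_text) if e["type"] in ENTITY_TAXONOMY]
```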
Failure 2: The amendment that did not link. A master agreement was modified by three amendments. Our entity extraction correctly identified all four documents. But it missed the relationship between Amendment 3 and the specific clause it modified, because the amendment text said “Section 4.3 is hereby replaced” without repeating the original clause text. The embedding similarity between the amendment and the original clause was low. Our fix: we added explicit document-structure parsing that detects modification language (“is hereby replaced,” “is amended to read,” “notwithstanding Section X”) and creates graph edges from these patterns, not just from embedding similarity.
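A condensed version of that parsing pass. The patterns below are a subset of what we match, and resolving a section number to the correct clause node in the governing document is a separate linking step, but the shape is:

```python
# Detect modification language in amendment text and emit explicit
# graph edges, independent of embedding similarity.
import re

MODIFICATION_PATTERNS = [
    # e.g. "Section 4.3 is hereby replaced ..."
    (re.compile(r"Section\s+([\d.]+)\s+is\s+hereby\s+(?:replaced|deleted)", re.I), "replaces"),
    # e.g. "Section 7.1 is amended to read ..."
    (re.compile(r"Section\s+([\d.]+)\s+is\s+amended\s+to\s+read", re.I), "amends"),
    # e.g. "Notwithstanding Section 9.2 ..."
    (re.compile(r"notwithstanding\s+Section\s+([\d.]+)", re.I), "overrides"),
]

def modification_edges(amendment_id: str, amendment_text: str):
    """Yield (source, relation, target) triples for the graph."""
    for pattern, relation in MODIFICATION_PATTERNS:
        for match in pattern.finditer(amendment_text):
            yield (amendment_id, relation, f"Section {match.group(1)}")

text = "Section 4.3 is hereby replaced in its entirety with the following: ..."
print(list(modification_edges("Amendment 3", text)))
# [('Amendment 3', 'replaces', 'Section 4.3')]
```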
Decisions and Trade-offs
| Decision | What We Chose | What We Gave Up |
| --- | --- | --- |
| Graph scope | Curated taxonomy (limited entity types) | Coverage of edge-case entities. But noise dropped 80% and query latency dropped 60%. |
| Retrieval strategy | Hybrid: vector + graph + rerank | Simplicity. Two retrieval paths to maintain, merge, and test. But accuracy on relational queries went from ~55% to ~85%. |
| Graph update strategy | Re-extract on document change, batch nightly full refresh | Real-time freshness. A document change takes 5-10 minutes to reflect in the graph. Acceptable for our use case (contracts change infrequently). |
| Chunking | Section-aware: respect document structure, keep clause boundaries intact | Uniform chunk sizes. Some chunks are 50 tokens, some are 500. But clause boundaries are never broken. |
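The section-aware chunking in the last row is worth showing. A minimal sketch, assuming numbered clause headings like "4.3 Limitation of Liability"; real contracts need per-format heading rules:

```python
import re

# Matches numbered clause headings at the start of a line. Illustrative;
# real contracts also need rules for exhibits, schedules, and lettered
# subsections.
HEADING = re.compile(r"^\s*\d+(?:\.\d+)*\s+[A-Z]", re.M)

def chunk_by_clause(document: str) -> list[str]:
    starts = [m.start() for m in HEADING.finditer(document)]
    if not starts or starts[0] != 0:
        starts = [0] + starts            # keep any preamble as its own chunk
    bounds = starts + [len(document)]
    # Chunks vary widely in size, but a clause is never split in half.
    return [document[a:b].strip() for a, b in zip(bounds, bounds[1:])]
```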
Your Retrieval Architecture Checklist
- Categorize your queries: what percentage require relational reasoning vs simple lookup? If under 20%, vector-only may be enough.
- If using a graph, restrict entity extraction to a curated taxonomy. More entities is not better.
- Chunk at semantic boundaries (section, clause, paragraph), not arbitrary token counts.
- Set retrieval score thresholds: below threshold means “I don’t know”, not “here’s my best guess” (see the sketch after this list).
- Test retrieval independently from generation: does the right content reach the model?
- Monitor retrieval score distributions over time — drift indicates corpus or query distribution changes.
- Measure graph freshness: how long after a document change does the graph reflect it?
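The threshold gate from the checklist is only a few lines. The floor below is an illustrative value, not a recommendation; calibrate it on your own retrieval eval set:

```python
MIN_RETRIEVAL_SCORE = 0.35  # illustrative; calibrate on your eval set

def gated_context(hits: list[dict]) -> list[dict] | None:
    """Return usable context, or None to answer "I don't know"
    instead of generating from weak retrievals."""
    confident = [h for h in hits if h["score"] >= MIN_RETRIEVAL_SCORE]
    return confident or None
```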


