RAG vs. Long-Context Windows
We pitted long-context prompting against vector retrieval for legal document summarization. The winner was not what we expected.
When OpenAI released GPT-4 Turbo with a 128K context window, and Google followed with Gemini's 1M token capacity, a question emerged: do we still need RAG?
We ran the experiment. The results surprised us.
The Setup
We tested both approaches on a real client task: summarizing legal contracts for a mid-size law firm. The dataset included 200 contracts ranging from 5,000 to 80,000 tokens each.
Approach A: Traditional RAG with chunking (512 tokens), embeddings via text-embedding-3-large, and top-k retrieval (k=10) before synthesis.
Approach B: Full document loading into Claude's 200K context window. No chunking, no retrieval. Just the raw document and a summarization prompt.
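The retrieval stage of Approach A can be sketched as follows. This is an illustrative stand-in, not our production pipeline: whitespace splitting approximates real tokenization, and the term-overlap scoring stands in for text-embedding-3-large cosine similarity.

```python
def chunk_document(text: str, chunk_size: int = 512) -> list[str]:
    """Split a document into ~chunk_size-token pieces.

    Whitespace splitting is a rough proxy for tokenization here;
    a real pipeline would use the embedding model's tokenizer.
    """
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


def top_k_chunks(query: str, chunks: list[str], k: int = 10) -> list[str]:
    """Return the k chunks most relevant to the query.

    Term overlap stands in for embedding similarity (assumption);
    the real system scores cosine distance over dense vectors.
    """
    q_terms = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q_terms & set(c.lower().split())),
                    reverse=True)
    return scored[:k]
```

With k=10 and 512-token chunks, the synthesis prompt sees at most ~5K tokens of retrieved context regardless of how long the source contract is, which is the property the cost analysis below depends on.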
The Results
Long-context won on accuracy, by roughly five points. But here's where it gets interesting: RAG was 47% cheaper per document.
The Tradeoff Nobody Talks About
Long-context models charge by the token. When you load an 80K token document, you pay for 80K input tokens every single time. RAG only retrieves the relevant chunks, typically 5-10K tokens.
For documents under 20K tokens, long-context wins on both accuracy and cost. Above that threshold, the math flips: RAG's input cost stays roughly fixed at k times the chunk size, while long-context cost grows linearly with document length. RAG becomes the economically rational choice, even with slightly lower accuracy.
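The arithmetic for a worst-case 80K-token contract looks like this. The per-token price is a placeholder assumption, and embedding and retrieval overhead are ignored:

```python
# Placeholder price: $3 per 1M input tokens (assumption, not a quoted rate).
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000

def prompt_cost(input_tokens: int) -> float:
    """Input-token cost of a single summarization call."""
    return input_tokens * PRICE_PER_INPUT_TOKEN

long_context = prompt_cost(80_000)   # the whole document, every single time
rag = prompt_cost(10 * 512)          # top-k retrieval: k=10 chunks of 512 tokens
savings = 1 - rag / long_context
```

For this 80K-token worst case the raw input-token saving is well above the 47% dataset average, which reflects the 5K-80K document mix and the retrieval overhead the sketch leaves out.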
Our Recommendation
Use a hybrid approach:
- Short documents (<20K tokens): Long-context. The accuracy gain is worth the marginal cost increase.
- Long documents (>20K tokens): RAG with high-quality chunking. Accept the 5% accuracy tradeoff for 40%+ cost savings.
- Mission-critical tasks: Long-context with human review. When accuracy matters more than cost, don't optimize prematurely.
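The three rules above reduce to a simple router. The 20K threshold comes from our results; the function name and return labels are illustrative:

```python
def choose_approach(token_count: int, mission_critical: bool = False) -> str:
    """Pick a summarization strategy per the thresholds above."""
    if mission_critical:
        return "long-context + human review"  # accuracy matters more than cost
    if token_count < 20_000:
        return "long-context"                 # accuracy gain worth marginal cost
    return "rag"                              # ~5% accuracy loss for 40%+ savings
```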
The Bottom Line
"RAG is dead" is wrong. "Long-context solves everything" is also wrong. The right answer depends on your document size distribution and your accuracy-cost tolerance.
We've built our agents to automatically select the approach based on document length. No manual configuration needed.
Want more research like this?
Subscribe to our research notes