Retrieval-Augmented Generation (RAG) has been the standard solution for grounding LLMs in external knowledge bases. However, the rise of models supporting context windows of 1 million tokens or more has sparked a debate: is vector search still necessary when you can fit an entire codebase directly in the prompt?
Feeding large amounts of raw data directly into the context window can yield better coherence, because the model can evaluate relationships across the entire dataset at once. But this approach is constrained by cost: in standard self-attention, computation scales quadratically with the number of tokens, making long-context prompts expensive and slow.
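To make the quadratic scaling concrete, here is a rough back-of-the-envelope sketch in Python. It assumes plain full self-attention, where every token is scored against every other token, and it ignores everything else that affects real cost (feed-forward layers, KV caching, optimized attention kernels, provider pricing). The function names and the two prompt sizes are illustrative choices, not measurements.

```python
def attention_pair_count(num_tokens: int) -> int:
    """Number of query-key pairs scored by full self-attention over num_tokens."""
    return num_tokens * num_tokens


def relative_attention_cost(long_tokens: int, short_tokens: int) -> float:
    """How many times more pairwise scores the longer prompt needs."""
    return attention_pair_count(long_tokens) / attention_pair_count(short_tokens)


full_context = 1_000_000   # a prompt that fills a 1M-token window
filtered_context = 50_000  # a much smaller, pre-filtered prompt

ratio = relative_attention_cost(full_context, filtered_context)
print(f"{full_context // filtered_context}x more tokens -> {ratio:.0f}x more attention pairs")
# prints "20x more tokens -> 400x more attention pairs"
```

Under these simplifying assumptions, a prompt that is 20 times longer needs roughly 400 times as many attention scores, which is why prompt length matters far more than a linear cost model would suggest.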
Ultimately, a hybrid approach is winning out. A vector database acts as a high-speed filter, narrowing millions of documents down to a relevant subset on the order of 50,000 tokens, which is then processed inside the long-context window. This trades a small amount of recall for a practical balance of retrieval speed, context depth, and cost efficiency.
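The filter-then-read pattern can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: `embed` is a bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, `count_tokens` approximates tokens by splitting on whitespace, and all function names are hypothetical. A production system would use a real embedding model, an approximate nearest-neighbor index, and the model's own tokenizer.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def count_tokens(text: str) -> int:
    """Crude token estimate (whitespace split); a real tokenizer would differ."""
    return len(text.split())


def select_context(query: str, documents: list[str], token_budget: int = 50_000) -> list[str]:
    """Rank documents by similarity to the query, then keep the best ones that fit the budget."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    selected, used = [], 0
    for doc in ranked:
        cost = count_tokens(doc)
        if used + cost > token_budget:
            continue  # skip documents that would overflow the budget
        selected.append(doc)
        used += cost
    return selected


docs = [
    "vector databases index embeddings for fast similarity search",
    "a recipe for sourdough bread with a long fermentation",
    "long context windows let the model read many documents at once",
]
question = "how does vector search work"
context = select_context(question, docs, token_budget=12)

# The selected documents become the long-context prompt sent to the model.
prompt = "\n\n".join(context) + "\n\nQuestion: " + question
```

With the tiny 12-token budget used here, only the first document is selected. Note that this sketch fills the budget greedily and does not drop documents with zero similarity; a minimum-score cutoff would be a natural addition when the budget is large relative to the corpus.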