Vector DB / RAG Implementation Guide (Hands-On)

Vector DB / RAG Implementation Guide — From Naive RAG to Production

Table of Contents

1. Recap: the limits of naive RAG
2. The modern RAG pipeline at a glance
3. ① Chunking (the most important)
4. ② Choosing an embedding model
5. ③ Choosing a vector DB (comparison)
6. ④ Hybrid search (BM25 + vector)
7. ⑤ Reranking (retrieve-then-rerank)
8. Frameworks (LlamaIndex/LangChain)
9. RAG vs long context
10. Productionization caveats
Summary
FAQ

You know what RAG is. But many people who actually build one hit the same wall: "it kind of works, but the key answer is off." The cause is almost always that the system is still "naive RAG": documents chopped carelessly, then a plain vector search.

Practical RAG in 2026 has clearly moved past that. The key is a multi-stage pipeline: "smart chunking → the right embedding → hybrid search (keyword + vector) → reranking." As the implementation follow-up to article 030, this guide covers the concrete how-to of each stage, how to choose a vector DB (Chroma/pgvector/Qdrant/Pinecone/Weaviate/Milvus), frameworks, and the question "do we still need RAG in the 1M-token era?"

[Figure: The 5 stages of modern RAG, from naive RAG to RAG that works in production. ① Chunk: split smartly → ② Embed: vectorize with embeddings → ③ Store: in a vector DB → ④ Retrieve: hybrid search → ⑤ Rerank: rerank the best. The two biggest accuracy wins are hybrid search (BM25 + vector) and reranking; adding just these two greatly fixes naive RAG's "the answer is off" problem.]

* Tool names, methods, and benchmark figures are based on official sources and several tech outlets (as of 2026). This space evolves fast and the best options change. Benchmark figures are source-reported and vary by data and conditions. Evaluate on your own data (below) before choosing.

1. Recap: the limits of naive RAG

The minimal RAG is "split the document → vectorize with embeddings → store in a vector DB → vectorize the question and fetch the nearest chunks → pass them to the LLM to answer." That's the basics of RAG. But this "naive RAG" has typical weaknesses.

Sloppy chunks: cut mid-sentence, breaking the meaning.
Vector search only: weak on exact terms like product names or model numbers. It retrieves passages that are semantically close to the query but can miss the one that contains the exact string.
Pass the top N as-is: the truly most relevant items get buried.
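As a minimal sketch of the naive pipeline described above, the following uses only the Python standard library. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and `naive_chunks`, `naive_retrieve`, and the sample text are hypothetical names chosen for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def naive_chunks(text: str, size: int) -> list[str]:
    """Fixed-size character split: this is what cuts sentences in half."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def naive_retrieve(question: str, chunks: list[str], top_n: int = 1) -> list[str]:
    """Vector search only, then pass the top N along as-is."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_n]


document = "The X100 router supports WPA3. Reset it by holding the button for ten seconds."
chunks = naive_chunks(document, size=40)
context = naive_retrieve("How do I reset the router?", chunks)
prompt = "Answer using only this context:\n" + "\n".join(context)
```

In this toy run the top chunk ends at "Reset it", while the actual instruction landed in the other chunk, which is the kind of miss the weaknesses above produce.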

Practical RAG in 2026 crushes these three with "smart splitting," "hybrid search," and "reranking." Let's go in order.

2. The modern RAG pipeline at a glance

A production RAG data flow has two tracks: preparation (indexing) and query time (search & generate).

Two phases

Preparation (offline): chunk the document smartly → vectorize with embeddings → store in a vector DB (build a keyword index at the same time).

Query time (online): fetch the top 50-100 with hybrid search (BM25 + vector) → narrow to a few with reranking → pass to the LLM to generate the answer.

The difference from naive RAG is whether "④ hybrid search" and "⑤ reranking" are present. These two stages raise search accuracy to a practical level.

3. ① Chunking (the most important)

It's fair to say chunking decides half of RAG quality. Here are the main strategies.

Strategy	What it does	Good for
recursive 512 tokens	A pragmatic default. Reported #1 of 7 strategies in a Feb 2026 benchmark	When in doubt, this
semantic	Split where meaning shifts, so each chunk is topically coherent	Dense technical docs
structural	Respect headings, code blocks, HTML sections	Documentation and code
parent-child (hierarchical)	Search precisely on small chunks; return the surrounding parent chunk at answer time	Balancing precision and context
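The recursive strategy in the table can be sketched roughly as follows: try the coarsest separator first, merge pieces up to the size limit, and fall back to finer separators only for pieces that are still too large. This is an illustrative sketch, not any library's implementation. It assumes whitespace-separated words as a rough proxy for tokens, and `recursive_split`, `count_tokens`, and `SEPARATORS` are hypothetical names.

```python
SEPARATORS = ("\n\n", "\n", ". ", " ")  # coarse to fine (assumed order)


def count_tokens(text: str) -> int:
    """Assumption: whitespace words as a rough proxy for model tokens."""
    return len(text.split())


def recursive_split(text: str, max_tokens: int = 512,
                    separators: tuple[str, ...] = SEPARATORS) -> list[str]:
    if count_tokens(text) <= max_tokens:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: hard-cut by words
        words = text.split()
        return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]
    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    buffer = ""
    for piece in text.split(sep):
        candidate = buffer + sep + piece if buffer else piece
        if count_tokens(candidate) <= max_tokens:
            buffer = candidate  # still fits: keep merging
            continue
        if buffer:
            chunks.append(buffer)
        if count_tokens(piece) <= max_tokens:
            buffer = piece
        else:
            # Piece is still too large: retry with the next, finer separator
            chunks.extend(recursive_split(piece, max_tokens, finer))
            buffer = ""
    if buffer.strip():
        chunks.append(buffer)
    return chunks
```

Note that the separator at a chunk boundary is dropped in this sketch. A real tokenizer should replace `count_tokens` before the 512 limit means actual model tokens.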

If context loss at boundaries is the issue, Contextual Retrieval (attach whole-document context to each chunk) or Late Chunking helps. Anthropic reports that Contextual Retrieval + reranking cut top-20 retrieval failures by up to 67% (a reported figure). The realistic order: start with "recursive 512," and add semantic / parent-child / Contextual Retrieval if accuracy falls short.
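The parent-child strategy from the table can be illustrated with a small sketch: search over small child chunks, then hand the LLM the larger parent each child came from. The word-overlap score is a toy stand-in for real retrieval, and `Child`, `build_parent_child`, and `retrieve_parents` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Child:
    text: str
    parent_id: int  # index of the parent chunk this child was cut from


def build_parent_child(parents: list[str], child_words: int = 8) -> list[Child]:
    """Split each parent chunk into small child chunks that remember their parent."""
    children = []
    for pid, parent in enumerate(parents):
        words = parent.split()
        for i in range(0, len(words), child_words):
            children.append(Child(" ".join(words[i:i + child_words]), pid))
    return children


def overlap(query: str, text: str) -> int:
    """Toy relevance score (assumption): number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query: str, parents: list[str], children: list[Child],
                     top_k: int = 1) -> list[str]:
    """Rank the small children, then return their deduplicated parents."""
    ranked = sorted(children, key=lambda c: overlap(query, c.text), reverse=True)
    seen: set[int] = set()
    result: list[str] = []
    for child in ranked:
        if child.parent_id not in seen:
            seen.add(child.parent_id)
            result.append(parents[child.parent_id])
        if len(result) == top_k:
            break
    return result
```

The match is made on a short, focused child, but the answer is generated from the full parent, so precision and context do not have to be traded against each other.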

4. ② Choosing an embedding model

The embedding model converts chunks into vectors — the foundation of search accuracy.

Safe default: OpenAI text-embedding-3-large. A good balance of retrieval quality and ease of integration.
Other options: Cohere, Voyage, Gemini embeddings, and various OSS models.
Important: many OSS embeddings are plenty for production when combined with hybrid search + reranking. Don't obsess over the embedding alone.

The point is to treat the embedding as one part of the whole search pipeline. Rather than swapping in an expensive embedding, adding hybrid search and reranking is often more cost-effective.

5. ③ Choosing a vector DB (comparison)

The vector DB stores and searches vectors. Here are the 2026 leaders by character.

DB	Character / strength	Who it's for
Chroma	AI-native, local-first, simple Python API	Individuals/PoCs prototyping RAG fastest
pgvector	A Postgres extension. No second DB, transactional consistency	Teams already on Postgres
Qdrant	Low latency (p50 ~4ms; reported vs Milvus ~6ms / Pinecone ~8ms)	Speed-focused, production
Pinecone	Fully managed. Zero infra, start with just an API key	Want ops handled, cloud-first
Weaviate	Hybrid-search champion (BM25 + vector + metadata in one query)	Heavy hybrid-search users
Milvus	Enterprise-grade, handles billions of vectors	Very large scale

Selection axes: "scale, managed vs self-hosted, existing stack, budget." When in doubt — Chroma for prototyping, pgvector if you have Postgres, Qdrant/Pinecone for balanced production, Weaviate for hybrid-heavy. For most RAG workloads, Pinecone / Weaviate / Qdrant are considered strong choices.
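To make "stores and searches vectors" concrete, here is a minimal in-memory sketch of the core operation that every product in the table performs. It does exact brute-force search, whereas real vector DBs typically add approximate indexes, persistence, and filtering. `TinyVectorIndex` is a hypothetical class and does not mirror any product's API.

```python
import heapq
import math


class TinyVectorIndex:
    """Exact (brute-force) nearest-neighbor search over unit-normalized vectors."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    @staticmethod
    def _normalize(vec: list[float]) -> list[float]:
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0:
            raise ValueError("zero vector cannot be normalized")
        return [x / norm for x in vec]

    def add(self, doc_id: str, vector: list[float]) -> None:
        """Store a vector under an ID, normalized so dot product equals cosine."""
        self._items.append((doc_id, self._normalize(vector)))

    def query(self, vector: list[float], k: int = 3) -> list[tuple[str, float]]:
        """Return the k stored IDs with the highest cosine similarity."""
        q = self._normalize(vector)
        scored = ((sum(a * b for a, b in zip(q, v)), doc_id) for doc_id, v in self._items)
        return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]


index = TinyVectorIndex()
index.add("a", [1.0, 0.0])
index.add("b", [0.0, 1.0])
index.add("c", [1.0, 1.0])
nearest = index.query([1.0, 0.1], k=2)
```

Every query here scans all stored vectors, which is fine for a prototype and is exactly the cost that dedicated vector DBs are built to avoid at scale.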

6. ④ Hybrid search (BM25 + vector)

What fixes naive RAG's biggest weakness — being weak on exact terms — is hybrid search. It fuses BM25 (keyword/lexical search) with dense vector search.
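For intuition on the lexical half, here is a compact sketch of BM25 scoring in plain Python. It uses a common formulation with the usual `k1` and `b` parameters and a smoothed IDF term. The default values and the whitespace tokenizer are assumptions for illustration, and production systems would normally rely on the search engine's built-in BM25.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Assumption: lowercase whitespace tokenization."""
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Length normalization: long documents get less credit per occurrence
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "replacement filter for model ax-200",
    "how to clean an air purifier",
    "air purifier filter guide",
]
scores = bm25_scores("ax-200", docs)
```

Only the document containing the literal model number scores above zero, which is the exact-term behavior that dense vectors alone tend to lack.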

[Figure: Hybrid search fuses lexical + semantic. BM25 (lexical): strong on exact terms like model numbers and proper nouns. Vector (semantic): captures paraphrase and intent. Fuse with RRF: combine both ranks without score tuning. Fusing the two with Reciprocal Rank Fusion (RRF) is reported to consistently beat either approach alone (higher NDCG).]

In practice, it's easiest to use a DB that returns hybrid results in a single query, like Weaviate. Leave the fusion to RRF: because it combines ranks rather than raw scores, BM25 scores and vector similarities, which live on different scales, never need to be normalized against each other. The result is search that is strong on both exact terms and paraphrases.
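If your DB does not fuse results for you, RRF is short enough to write by hand. Each document's fused score is the sum of 1 / (k + rank) over every ranking it appears in. The sketch below uses k = 60, the constant commonly used for RRF; the document IDs and both input rankings are made up for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists into one, using ranks only (no raw scores)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


bm25_ranking = ["doc_a", "doc_b", "doc_c"]    # hypothetical lexical results
vector_ranking = ["doc_c", "doc_a", "doc_d"]  # hypothetical semantic results
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
fused_ids = [doc_id for doc_id, _ in fused]   # doc_a, doc_c, doc_b, doc_d
```

Documents that rank well in both lists (doc_a, doc_c) rise above documents that appear in only one, regardless of how the underlying scores were scaled.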

7. ⑤ Reranking (retrieve-then-rerank)

The 2026 standard is "two stages: retrieve broadly first → then narrow (rerank)."

Stage 1 (retrieve): fetch the top 50-100 quickly with a bi-encoder (embeddings).
Stage 2 (rerank): with a cross-encoder, score (question, chunk) jointly and narrow to the truly most relevant few.
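The two stages can be sketched as one function with pluggable scorers. In a real system `fast_score` would be embedding similarity and `pair_score` a cross-encoder. Here both are toy word-based functions so the example runs without any model, and all names are hypothetical.

```python
from typing import Callable

Scorer = Callable[[str, str], float]  # (query, chunk) -> relevance score


def retrieve_then_rerank(query: str, candidates: list[str],
                         fast_score: Scorer, pair_score: Scorer,
                         retrieve_k: int = 50, final_k: int = 5) -> list[str]:
    # Stage 1: cheap score over all candidates, keep a broad shortlist
    shortlist = sorted(candidates, key=lambda c: fast_score(query, c), reverse=True)[:retrieve_k]
    # Stage 2: expensive joint (query, chunk) scoring over the shortlist only
    return sorted(shortlist, key=lambda c: pair_score(query, c), reverse=True)[:final_k]


def word_overlap(query: str, chunk: str) -> float:
    """Toy stage-1 scorer: shared words, order ignored."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


def phrase_aware(query: str, chunk: str) -> float:
    """Toy stage-2 scorer: reads query and chunk together and rewards the exact phrase."""
    bonus = 2.0 if query.lower() in chunk.lower() else 0.0
    return word_overlap(query, chunk) + bonus


chunks = [
    "reset the factory settings of the router",
    "factory reset steps for the router",
    "router factory tour",
]
best = retrieve_then_rerank("factory reset", chunks, word_overlap, phrase_aware,
                            retrieve_k=2, final_k=1)
```

Stage 1 cannot tell the first two chunks apart, since both share the same words with the query. Stage 2 looks at the pair jointly and promotes the chunk that actually contains the phrase.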

Popular rerankers are Cohere Rerank 3.5, Voyage rerank-2.5, BGE reranker-v2, Jina Reranker v2. Reranking adds about 50-200ms of latency and cost, but because the chunks passed to the LLM become a select few, it often reduces LLM token consumption and lowers total cost. On both accuracy and cost, reranking is becoming a stage with "no reason not to add it."
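A back-of-the-envelope calculation shows where the token savings come from. The figures below are assumptions for illustration: 512-token chunks, 50 stage-1 candidates, and 5 chunks kept after reranking. The comparison only applies if you would otherwise pass the whole candidate set to the LLM.

```python
CHUNK_TOKENS = 512       # default chunk size used in this guide
RETRIEVED = 50           # stage-1 candidates (low end of the 50-100 range)
KEPT_AFTER_RERANK = 5    # assumption: "a few" chunks survive reranking

tokens_without_rerank = RETRIEVED * CHUNK_TOKENS
tokens_with_rerank = KEPT_AFTER_RERANK * CHUNK_TOKENS
saved_fraction = 1 - tokens_with_rerank / tokens_without_rerank

print(tokens_without_rerank, tokens_with_rerank, f"{saved_fraction:.0%}")
# prints: 25600 2560 90%
```

Whether this offsets the reranker's own cost depends on your LLM and reranker pricing, so plug in your own numbers.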

8. Frameworks (LlamaIndex/LangChain)

Using a framework is faster than writing everything yourself. Here's the 2026 division of labor.

LlamaIndex: retrieval-focused. Strong at document indexing, search quality, and fast RAG iteration.
LangChain / LangGraph: the orchestration (control) side. Complex workflows and agent coordination.
Combined pattern: in practice, many use LlamaIndex as the retrieval layer and LangGraph as the control layer.

Note that lately, with tool standardization via MCP and the rise of Agent SDKs, there's a trend of agents building pipelines on the fly without heavyweight LLM frameworks. Still, if you're crafting search quality, the LlamaIndex family remains strong. For agent building in general, see how to build an AI agent.

9. RAG vs long context

"With a 1M-token context window, can't I just stuff everything in and skip RAG?" — a common question. The answer is "no, RAG isn't replaced."

The "stuff it all in" trap

Token-inefficient: sending a huge, redundant context every time is costly.
Lost in the middle: information in the middle of long text tends to be ignored.
Distraction: the more irrelevant info, the lower the answer quality.

The guideline: use long context for small corpora and fast iteration, add prompt caching early for stable repeated context, and add RAG the moment "freshness, scale, provenance" become requirements. "Just stuff it in the context window" is the new "just add a vector DB" — not a cure-all. Pair this with an understanding of the context window.

10. Productionization caveats

Build an eval first: you can't iterate on "it kind of got better." Quantify retrieval accuracy with a set of representative questions × expected sources, and compare the numbers after every change.
Monitoring: continuously watch retrieval hit rate, post-rerank relevance, and answer grounding.
Cost design: billing comes from embeddings, reranking, and the LLM. Cutting LLM tokens via reranking is the standard cost saver. See token-saving.
Freshness and provenance: design for reflecting data updates (re-indexing) and always attaching the source (which document) to answers — key to fighting hallucination.
Confidential data: be careful when vectorizing internal documents and placing them externally. See corporate AI usage guidelines.
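For the eval and monitoring items above, a minimal sketch of two common retrieval metrics follows: hit rate at k (did any expected source appear in the top k?) and mean reciprocal rank (how high was the first expected source?). The question IDs, file names, and function names are hypothetical.

```python
def hit_rate_at_k(results: dict[str, list[str]], expected: dict[str, set[str]], k: int = 5) -> float:
    """Fraction of questions with at least one expected source in the top k."""
    hits = sum(1 for q, docs in results.items() if expected[q] & set(docs[:k]))
    return hits / len(results)


def mean_reciprocal_rank(results: dict[str, list[str]], expected: dict[str, set[str]]) -> float:
    """Average of 1/rank of the first expected source (0 if it never appears)."""
    total = 0.0
    for q, docs in results.items():
        for rank, doc in enumerate(docs, start=1):
            if doc in expected[q]:
                total += 1.0 / rank
                break
    return total / len(results)


# Eval set: representative questions x expected sources
expected = {"q1": {"manual.pdf"}, "q2": {"faq.md"}}
# What the retriever returned for each question, best first
results = {
    "q1": ["manual.pdf", "blog.md"],
    "q2": ["blog.md", "pricing.md", "faq.md"],
}

hit2 = hit_rate_at_k(results, expected, k=2)
hit3 = hit_rate_at_k(results, expected, k=3)
mrr = mean_reciprocal_rank(results, expected)
```

Run the same eval set before and after each change (a new chunker, hybrid search, a reranker) and keep the change only if the numbers move.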

Summary

Practical RAG has evolved from naive "chop and vector-search" into a multi-stage pipeline: "smart chunking → embedding → vector DB → hybrid search → reranking." The two biggest accuracy wins are hybrid search (fuse BM25 + vector with RRF) and reranking (retrieve-then-rerank) — just adding these two greatly fixes "the answer is off."

For vector DBs: Chroma for prototyping, pgvector if you have Postgres, Qdrant/Pinecone for production, Weaviate for hybrid-heavy. For frameworks: LlamaIndex for retrieval, LangChain/LangGraph for control. And even at 1M tokens, RAG isn't replaced — if you need freshness, scale, and provenance, it's RAG. One last must: "build the eval set first." You can't improve what you can't measure. Keep that and your RAG will reliably evolve from "kind of works" to "works in production."

To understand embeddings (vectors) from the ground up, see what is an embedding too — it explains how meaning becomes numbers and how to choose a model, for beginners.

FAQ

Q. My RAG accuracy is low. What should I fix first?
A. Adding the two things — "hybrid search" and "reranking" — is the most effective. Plain vector search alone is weak on exact terms like model numbers and proper nouns, and the most relevant chunks get buried. Fusing with BM25 (via RRF) + a cross-encoder rerank raises search quality to a practical level. If that's still not enough, revisit your chunking.

Q. Which vector DB should I choose?
A. The easy starts are Chroma for prototyping and pgvector if you already have Postgres. For balanced production, Qdrant (low latency) or Pinecone (fully managed); for heavy hybrid search, Weaviate; for very large scale, Milvus. Choose by scale, ops preference, existing stack, and budget.

Q. What chunk size is good?
A. Around 512 tokens with recursive splitting is a pragmatic default (reported near the top in 2026 benchmarks). For technical docs, semantic (split by meaning); for documentation/code, structural (respect headings/code); for precision plus context, parent-child. Start at 512 and tune while evaluating.

Q. With a 1M-token window, isn't RAG unnecessary?
A. It doesn't become unnecessary. Stuffing everything into context drops quality via token inefficiency, "lost in the middle," and distraction. Use long context for small corpora and prototypes, and RAG once freshness, scale, and provenance are requirements — that's the realistic split.

Q. LangChain or LlamaIndex — which should I use?
A. LlamaIndex if you're crafting search quality; LangChain/LangGraph for complex control and agent coordination. In practice, many combine them — "LlamaIndex as the retrieval layer, LangGraph as the control layer." Lately there's also a trend of assembling lightly with Agent SDKs, so choose by your requirements.
