Document ingestion, chunking strategies, embedding pipelines, vector retrieval, hybrid search, re-ranking, context window management, citation attribution, and evaluation — the complete architecture for building AI features grounded in your data.

RAG (Retrieval-Augmented Generation)

Document ingestion, chunking strategies, embedding pipelines, vector retrieval, hybrid search, re-ranking, context window management, citation attribution, and evaluation — the complete architecture for building AI features grounded in your data.

Principles

1. RAG Architecture Overview

RAG gives LLMs access to your private data without fine-tuning. Instead of hoping the model knows the answer, you retrieve relevant documents and include them in the prompt. The model generates a response grounded in your actual data.

The three phases:

Ingest — split documents into chunks, embed them, store in a vector database
Retrieve — when a user asks a question, find the most relevant chunks
Generate — pass the retrieved chunks + the question to an LLM, get a grounded answer

User Question → Embed Query → Vector Search → Retrieved Chunks → LLM → Grounded Answer

Why RAG beats alternatives:

Approach	Pros	Cons
Fine-tuning	Model "knows" the data	Expensive, slow to update, hallucination risk
RAG	Fresh data, citations, lower cost	Retrieval quality matters, latency
Long context	Simple, no pipeline	Expensive per query, limited by window
RAG + Long context	Best of both	Most complex pipeline

Use RAG when: your data changes frequently, you need citations, data is larger than the context window, or you need to control which data each user can access.

Use long context (no RAG) when: data fits in the context window (<100K tokens), data rarely changes, and you do not need citations.

2. Document Ingestion Pipeline

Ingestion turns raw documents (PDFs, HTML, Markdown, database records) into embedded chunks ready for retrieval.

Pipeline steps:

Extract — convert source format to plain text
Clean — remove boilerplate, headers/footers, navigation, ads
Chunk — split into segments of appropriate size
Enrich — add metadata (source, date, section title, page number)
Embed — generate vector embeddings
Store — save chunks with embeddings and metadata to the vector database

// lib/rag/ingest.ts
interface RawDocument {
  id: string;
  title: string;
  content: string;
  source: string;
  sourceUrl?: string;
  mimeType: string;
  updatedAt: Date;
}

interface Chunk {
  id: string;
  documentId: string;
  content: string;
  metadata: {
    title: string;
    source: string;
    sourceUrl?: string;
    chunkIndex: number;
    totalChunks: number;
    section?: string;
  };
}

export async function ingestDocument(doc: RawDocument): Promise<Chunk[]> {
  // 1. Clean the content
  const cleaned = cleanContent(doc.content, doc.mimeType);

  // 2. Chunk the content
  const textChunks = chunkText(cleaned, {
    chunkSize: 500,
    chunkOverlap: 50,
    strategy: 'recursive',
  });

  // 3. Create chunk records with metadata
  const chunks: Chunk[] = textChunks.map((text, index) => ({
    id: `${doc.id}-chunk-${index}`,
    documentId: doc.id,
    content: text,
    metadata: {
      title: doc.title,
      source: doc.source,
      sourceUrl: doc.sourceUrl,
      chunkIndex: index,
      totalChunks: textChunks.length,
    },
  }));

  // 4. Embed all chunks
  const embeddings = await generateEmbeddings(chunks.map((c) => c.content));

  // 5. Store in database
  await storeChunks(chunks, embeddings);

  return chunks;
}

function cleanContent(content: string, mimeType: string): string {
  let cleaned = content;

  // Remove excessive whitespace
  cleaned = cleaned.replace(/\n{3,}/g, '\n\n');
  cleaned = cleaned.replace(/[ \t]+/g, ' ');

  // Remove common boilerplate
  cleaned = cleaned.replace(/^(Copyright|All rights reserved|Terms of).*/gim, '');

  return cleaned.trim();
}

Source format handling:

Markdown/HTML — strip tags, preserve headings as metadata
PDF — use pdf-parse or @mozilla/readability for extraction
Database records — concatenate relevant fields with field labels
Web pages — use Mozilla Readability to extract article content
Code files — preserve structure, use language-aware chunking

3. Chunking Strategies

Chunking is the most impactful decision in a RAG pipeline. Too large and retrieval is imprecise. Too small and chunks lack context. The right strategy depends on your content type.

Fixed-size chunking — split every N tokens with overlap:

function fixedSizeChunk(
  text: string,
  chunkSize: number = 500,
  overlap: number = 50
): string[] {
  const words = text.split(/\s+/);
  const chunks: string[] = [];

  for (let i = 0; i < words.length; i += chunkSize - overlap) {
    const chunk = words.slice(i, i + chunkSize).join(' ');
    if (chunk.trim()) chunks.push(chunk);
  }

  return chunks;
}

Recursive character splitting — split on natural boundaries (paragraphs → sentences → words), trying the largest delimiter first:

function recursiveChunk(
  text: string,
  options: {
    chunkSize?: number;
    chunkOverlap?: number;
    separators?: string[];
  } = {}
): string[] {
  const {
    chunkSize = 500,
    chunkOverlap = 50,
    separators = ['\n\n', '\n', '. ', ', ', ' '],
  } = options;

  const chunks: string[] = [];

  function split(text: string, separatorIndex: number): string[] {
    if (text.length <= chunkSize) return [text];
    if (separatorIndex >= separators.length) {
      // Last resort: split by character count
      return fixedSizeChunk(text, chunkSize, chunkOverlap);
    }

    const separator = separators[separatorIndex];
    const parts = text.split(separator);
    const result: string[] = [];
    let current = '';

    for (const part of parts) {
      const candidate = current ? current + separator + part : part;

      if (candidate.split(/\s+/).length > chunkSize) {
        if (current) result.push(current);
        // Recursively split the oversized part with the next separator
        result.push(...split(part, separatorIndex + 1));
        current = '';
      } else {
        current = candidate;
      }
    }

    if (current) result.push(current);
    return result;
  }

  return split(text, 0);
}

Semantic chunking — split where the topic changes (using embedding similarity between sentences):

Best for heterogeneous documents where a single page covers multiple topics. More expensive (requires embedding each sentence) but produces the most coherent chunks.

Section-aware chunking — split on document structure (headings, chapters, sections):

Best for structured content like documentation, technical manuals, and help centers. Preserves the document's natural organization.

Guidelines:

Content Type	Strategy	Chunk Size	Overlap
Documentation / help center	Section-aware	300-500 tokens	50 tokens
Blog posts / articles	Recursive	400-600 tokens	50-100 tokens
Legal / contracts	Paragraph-based	200-400 tokens	100 tokens
Code	Function/class boundaries	Varies	0 (use full functions)
Chat transcripts	Message boundaries	5-10 messages	2-3 messages

4. Retrieval Strategies

Retrieval is the bottleneck of RAG quality. If you retrieve the wrong chunks, the LLM cannot produce a good answer — no matter how capable the model is.

Top-K similarity — the simplest approach. Embed the query, find the K most similar chunks.

const results = await semanticSearch(query, { limit: 5 });

Hybrid retrieval — combine vector + full-text search with RRF (see Embeddings guide for implementation).

Multi-query retrieval — generate multiple reformulations of the user's question, search with each, and merge results:

import { generateObject } from 'ai';
import { z } from 'zod';

async function multiQueryRetrieve(
  originalQuery: string,
  limit: number = 10
): Promise<SearchResult[]> {
  // Generate query variations
  const { object } = await generateObject({
    model: openai('gpt-4o-mini'),
    temperature: 0.7,
    schema: z.object({
      queries: z
        .array(z.string())
        .length(3)
        .describe('Three different ways to ask the same question'),
    }),
    prompt: `Generate 3 different search queries that capture different aspects of this question:
"${originalQuery}"

Make each query focus on different keywords or phrasings while preserving the intent.`,
  });

  // Search with all queries (including original)
  const allQueries = [originalQuery, ...object.queries];
  const allResults = await Promise.all(
    allQueries.map((q) => semanticSearch(q, { limit: limit * 2 }))
  );

  // Deduplicate and rank by appearance count
  const scoreMap = new Map<string, { result: SearchResult; score: number }>();

  for (const results of allResults) {
    results.forEach((result, rank) => {
      const existing = scoreMap.get(result.id);
      const rrfScore = 1 / (60 + rank);

      if (existing) {
        existing.score += rrfScore;
      } else {
        scoreMap.set(result.id, { result, score: rrfScore });
      }
    });
  }

  return Array.from(scoreMap.values())
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((entry) => entry.result);
}

Maximal Marginal Relevance (MMR) — balance relevance with diversity. Avoids retrieving 5 chunks that all say the same thing:

function mmrRerank(
  results: Array<{ id: string; embedding: number[]; similarity: number }>,
  queryEmbedding: number[],
  k: number = 5,
  lambda: number = 0.5 // 0 = max diversity, 1 = max relevance
): typeof results {
  const selected: typeof results = [];
  const candidates = [...results];

  // Select first by pure relevance
  candidates.sort((a, b) => b.similarity - a.similarity);
  selected.push(candidates.shift()!);

  while (selected.length < k && candidates.length > 0) {
    let bestScore = -Infinity;
    let bestIdx = 0;

    for (let i = 0; i < candidates.length; i++) {
      const relevance = candidates[i].similarity;

      // Max similarity to any already-selected document
      const maxSimilarity = Math.max(
        ...selected.map((s) => cosineSimilarity(candidates[i].embedding, s.embedding))
      );

      const mmrScore = lambda * relevance - (1 - lambda) * maxSimilarity;

      if (mmrScore > bestScore) {
        bestScore = mmrScore;
        bestIdx = i;
      }
    }

    selected.push(candidates.splice(bestIdx, 1)[0]);
  }

  return selected;
}

5. Re-Ranking

Initial retrieval (vector search) is fast but approximate. Re-ranking uses a more powerful model to re-order the candidates for higher precision.

Cohere Rerank — the standard re-ranking API:

// lib/rag/rerank.ts
interface RerankResult {
  index: number;
  relevanceScore: number;
}

export async function rerankResults(
  query: string,
  documents: string[],
  topN: number = 5
): Promise<RerankResult[]> {
  const response = await fetch('https://api.cohere.ai/v2/rerank', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.COHERE_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'rerank-v3.5',
      query,
      documents,
      top_n: topN,
      return_documents: false,
    }),
  });

  const data = await response.json();
  return data.results;
}

When to re-rank:

After initial retrieval (vector or hybrid), before passing to the LLM
When retrieval returns 20+ candidates and you need the best 5
When precision matters more than latency (e.g., customer support, legal)

When to skip re-ranking:

Low-latency requirements (adds 100-300ms)
Simple retrieval tasks where top-K is good enough
Budget constraints (Cohere Rerank costs per search)

6. Context Window Management

Retrieved chunks must fit in the LLM's context window alongside the system prompt, conversation history, and output space. Stuffing too many chunks degrades quality — the model struggles to find the relevant information in a sea of text.

The context budget:

Total context window:          128,000 tokens (GPT-4o)
- System prompt:                 ~500 tokens
- Conversation history:        ~2,000 tokens
- Retrieved context:           ~4,000 tokens (target)
- Reserved for output:         ~2,000 tokens
= Available for retrieval:     ~4,000 tokens ≈ 5-8 chunks

Strategies:

Limit chunk count — retrieve 5-8 chunks max. More is rarely better.
Summarize long chunks — if a chunk is over 500 tokens, summarize it before including.
Progressive disclosure — start with 3 chunks. If the model says it needs more information, retrieve additional chunks.
Chunk compression — use a fast model to extract only the relevant sentences from each chunk.

function buildRAGContext(
  chunks: Array<{ content: string; metadata: { source: string; title: string } }>,
  maxTokens: number = 4000
): string {
  let context = '';
  let tokenCount = 0;

  for (const chunk of chunks) {
    const chunkTokens = estimateTokens(chunk.content);
    if (tokenCount + chunkTokens > maxTokens) break;

    context += `[Source: ${chunk.metadata.title}]\n${chunk.content}\n\n---\n\n`;
    tokenCount += chunkTokens;
  }

  return context;
}

7. Citation and Source Attribution

Users need to know where answers come from. Citations build trust and allow verification. Every RAG response should include source references.

Approaches:

Inline citations — [1] markers in the response text, with a reference list at the end
Per-statement citations — each claim tagged with its source
Source cards — UI components showing the source documents alongside the answer

// Prompt template for citation-aware generation
const citationPrompt = `Answer the user's question using ONLY the provided sources.

RULES:
- Cite sources using [1], [2], etc. after each claim
- If a claim cannot be attributed to a source, do not make it
- If the sources don't contain the answer, say "I don't have information about that in my knowledge base"
- Never invent information not present in the sources

SOURCES:
${chunks
  .map(
    (chunk, i) =>
      `[${i + 1}] ${chunk.metadata.title}\n${chunk.content}`
  )
  .join('\n\n')}

QUESTION: ${userQuestion}`;

8. RAG Evaluation

You cannot improve what you do not measure. RAG evaluation is non-negotiable for production systems.

Key metrics:

Metric	What It Measures	How to Measure
Retrieval precision	Are retrieved chunks relevant?	LLM-as-judge or human labels
Retrieval recall	Are all relevant chunks retrieved?	Requires labeled dataset
Faithfulness	Does the answer match the sources?	LLM-as-judge: "Is claim X supported by source Y?"
Answer relevance	Does the answer address the question?	LLM-as-judge or user feedback
Citation accuracy	Do citations point to correct sources?	Automated check: does source [N] contain the cited claim?

RAGAS framework — the standard evaluation framework for RAG:

Faithfulness — fraction of claims in the answer that are supported by the context
Answer relevancy — how relevant the answer is to the question
Context precision — fraction of retrieved documents that are relevant
Context recall — fraction of relevant documents that were retrieved

LLM Instructions

RAG PIPELINE INSTRUCTIONS

1. BUILD A DOCUMENT INGESTION PIPELINE:
   - Accept documents in multiple formats (Markdown, HTML, PDF, plain text)
   - Clean content: remove boilerplate, normalize whitespace, strip irrelevant sections
   - Chunk using recursive splitting with ~500 tokens per chunk and ~50 token overlap
   - Enrich chunks with metadata: source URL, title, section heading, chunk index, date
   - Embed chunks using AI SDK embedMany() in batches of 100
   - Store chunks in pgvector with the embedding, content, and metadata
   - Track which documents have been ingested and when they were last updated
   - Implement incremental ingestion: only re-process documents that have changed

2. IMPLEMENT RETRIEVAL:
   - Default: hybrid search (vector + full-text) with Reciprocal Rank Fusion
   - Retrieve 15-20 candidates, then re-rank to top 5-8
   - Use Cohere Rerank API for re-ranking (model: rerank-v3.5)
   - Apply metadata filters BEFORE vector search (category, date, permissions)
   - For complex questions, use multi-query retrieval (generate 3 query variations)
   - Return similarity scores with results for debugging and threshold filtering

3. CREATE A RAG CHAT ENDPOINT:
   - Build context from retrieved chunks (max ~4000 tokens of context)
   - Include source metadata in the context for citation
   - Use a system prompt that requires citations [1], [2], etc.
   - Instruct the model to say "I don't know" if sources don't contain the answer
   - Stream the response using streamText + toDataStreamResponse
   - Include source documents in the response metadata (not just the text)

4. ADD CITATION SUPPORT:
   - Number sources sequentially in the context: [1], [2], [3]...
   - Prompt the model to cite sources inline using these numbers
   - Parse citations from the response to create clickable source links
   - Validate that cited source numbers actually exist
   - Return source metadata (title, URL, relevance score) alongside the response

5. EVALUATE RAG QUALITY:
   - Create a test set: 50-100 questions with expected answers and relevant documents
   - Measure retrieval precision and recall
   - Use LLM-as-judge for faithfulness (are claims supported by context?)
   - Use LLM-as-judge for answer relevance (does answer address the question?)
   - Track metrics over time — run evaluation after every pipeline change
   - Set minimum thresholds: faithfulness > 0.8, answer relevance > 0.7

Examples

Example 1: Complete RAG Pipeline

End-to-end RAG implementation: ingestion, retrieval, generation, and citations.

// lib/rag/pipeline.ts
import { streamText, embed, embedMany } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
import { openai } from '@ai-sdk/openai';
import { db } from '@/lib/db';

const EMBEDDING_MODEL = openai.embedding('text-embedding-3-small');

// --- INGESTION ---

export async function ingestDocuments(
  documents: Array<{
    id: string;
    title: string;
    content: string;
    source: string;
    sourceUrl?: string;
  }>
) {
  const allChunks: Array<{
    documentId: string;
    content: string;
    metadata: Record<string, string>;
  }> = [];

  for (const doc of documents) {
    const chunks = recursiveChunk(doc.content, {
      chunkSize: 500,
      chunkOverlap: 50,
    });

    for (let i = 0; i < chunks.length; i++) {
      allChunks.push({
        documentId: doc.id,
        content: chunks[i],
        metadata: {
          title: doc.title,
          source: doc.source,
          sourceUrl: doc.sourceUrl || '',
          chunkIndex: String(i),
          totalChunks: String(chunks.length),
        },
      });
    }
  }

  // Batch embed
  const batchSize = 100;
  for (let i = 0; i < allChunks.length; i += batchSize) {
    const batch = allChunks.slice(i, i + batchSize);
    const { embeddings } = await embedMany({
      model: EMBEDDING_MODEL,
      values: batch.map((c) => c.content),
    });

    // Store in database
    await db.$transaction(
      batch.map((chunk, idx) =>
        db.$executeRaw`
          INSERT INTO "Chunk" (id, "documentId", content, metadata, embedding)
          VALUES (
            ${`${chunk.documentId}-${chunk.metadata.chunkIndex}`},
            ${chunk.documentId},
            ${chunk.content},
            ${JSON.stringify(chunk.metadata)}::jsonb,
            ${JSON.stringify(embeddings[idx])}::vector
          )
          ON CONFLICT (id) DO UPDATE SET
            content = EXCLUDED.content,
            metadata = EXCLUDED.metadata,
            embedding = EXCLUDED.embedding
        `
      )
    );
  }

  return { chunksCreated: allChunks.length };
}

// --- RETRIEVAL ---

export async function retrieve(
  query: string,
  options: { limit?: number; category?: string } = {}
): Promise<
  Array<{
    id: string;
    content: string;
    metadata: Record<string, string>;
    similarity: number;
  }>
> {
  const { limit = 10, category } = options;

  const { embedding } = await embed({
    model: EMBEDDING_MODEL,
    value: query,
  });

  const results = await db.$queryRaw<
    Array<{
      id: string;
      content: string;
      metadata: Record<string, string>;
      similarity: number;
    }>
  >`
    SELECT
      id, content, metadata,
      1 - (embedding <=> ${JSON.stringify(embedding)}::vector) AS similarity
    FROM "Chunk"
    WHERE embedding IS NOT NULL
    ${category ? db.$queryRaw`AND metadata->>'source' = ${category}` : db.$queryRaw``}
    ORDER BY embedding <=> ${JSON.stringify(embedding)}::vector
    LIMIT ${limit}
  `;

  return results;
}

// --- RETRIEVAL WITH RE-RANKING ---

export async function retrieveAndRerank(
  query: string,
  options: { limit?: number; category?: string } = {}
): Promise<
  Array<{
    id: string;
    content: string;
    metadata: Record<string, string>;
    relevanceScore: number;
  }>
> {
  const { limit = 5 } = options;

  // Over-fetch for re-ranking
  const candidates = await retrieve(query, { ...options, limit: 20 });

  if (candidates.length === 0) return [];

  // Re-rank with Cohere
  const rerankResponse = await fetch('https://api.cohere.ai/v2/rerank', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.COHERE_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'rerank-v3.5',
      query,
      documents: candidates.map((c) => c.content),
      top_n: limit,
      return_documents: false,
    }),
  });

  const rerankData = await rerankResponse.json();

  return rerankData.results.map(
    (r: { index: number; relevance_score: number }) => ({
      ...candidates[r.index],
      relevanceScore: r.relevance_score,
    })
  );
}

// --- GENERATION ---

export function buildRAGPrompt(
  chunks: Array<{ content: string; metadata: Record<string, string> }>,
  question: string
): { system: string; prompt: string } {
  const context = chunks
    .map(
      (chunk, i) =>
        `[${i + 1}] (Source: ${chunk.metadata.title})\n${chunk.content}`
    )
    .join('\n\n---\n\n');

  const system = `You are a knowledgeable assistant that answers questions based on provided sources.

RULES:
- Use ONLY the information from the provided sources to answer
- Cite sources using [1], [2], etc. after each claim or piece of information
- If the sources don't contain enough information to answer, say: "I don't have enough information in my knowledge base to answer that question."
- Never invent or assume information not in the sources
- Be concise and direct
- If sources conflict, mention the discrepancy`;

  const prompt = `SOURCES:
${context}

QUESTION: ${question}`;

  return { system, prompt };
}

// app/api/rag/chat/route.ts
import { streamText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
import { retrieveAndRerank, buildRAGPrompt } from '@/lib/rag/pipeline';
import { auth } from '@/lib/auth';

export const maxDuration = 30;

export async function POST(req: Request) {
  const session = await auth();
  if (!session?.user) return new Response('Unauthorized', { status: 401 });

  const { messages } = await req.json();
  const latestMessage = messages[messages.length - 1].content;

  // Retrieve relevant chunks
  const chunks = await retrieveAndRerank(latestMessage, { limit: 5 });

  // Build the RAG prompt
  const { system, prompt: ragContext } = buildRAGPrompt(chunks, latestMessage);

  // Generate response with sources
  const result = streamText({
    model: anthropic('claude-sonnet-4-20250514'),
    system,
    messages: [
      ...messages.slice(0, -1),
      { role: 'user', content: ragContext },
    ],
    maxTokens: 2000,
    onFinish: async ({ text, usage }) => {
      await db.ragLog.create({
        data: {
          userId: session.user.id,
          query: latestMessage,
          chunksRetrieved: chunks.length,
          chunkIds: chunks.map((c) => c.id),
          response: text,
          promptTokens: usage.promptTokens,
          completionTokens: usage.completionTokens,
        },
      });
    },
  });

  return result.toDataStreamResponse({
    // Include source metadata in the stream
    getErrorMessage: () => 'An error occurred while generating the response.',
  });
}

Example 2: Recursive Chunking with Overlap

A production-ready chunking implementation that respects document structure.

// lib/rag/chunking.ts

interface ChunkOptions {
  chunkSize: number;        // Target tokens per chunk
  chunkOverlap: number;     // Overlap tokens between chunks
  minChunkSize: number;     // Minimum chunk size (skip tiny chunks)
  separators: string[];     // Split hierarchy
}

const DEFAULT_OPTIONS: ChunkOptions = {
  chunkSize: 500,
  chunkOverlap: 50,
  minChunkSize: 50,
  separators: [
    '\n## ',     // Markdown H2
    '\n### ',    // Markdown H3
    '\n\n',      // Paragraph break
    '\n',        // Line break
    '. ',        // Sentence
    ' ',         // Word
  ],
};

export function recursiveChunk(
  text: string,
  options: Partial<ChunkOptions> = {}
): string[] {
  const opts = { ...DEFAULT_OPTIONS, ...options };
  const chunks = splitRecursive(text, opts, 0);

  // Add overlap between consecutive chunks
  return addOverlap(chunks, opts.chunkOverlap);
}

function splitRecursive(
  text: string,
  options: ChunkOptions,
  separatorIndex: number
): string[] {
  const wordCount = text.split(/\s+/).length;

  // Base case: text fits in one chunk
  if (wordCount <= options.chunkSize) {
    return wordCount >= options.minChunkSize ? [text.trim()] : [];
  }

  // Try each separator in order
  for (let i = separatorIndex; i < options.separators.length; i++) {
    const separator = options.separators[i];

    if (!text.includes(separator)) continue;

    const parts = text.split(separator);
    if (parts.length <= 1) continue;

    const chunks: string[] = [];
    let currentChunk = '';

    for (const part of parts) {
      const candidate = currentChunk
        ? currentChunk + separator + part
        : part;

      if (candidate.split(/\s+/).length > options.chunkSize) {
        // Current chunk is full — save it
        if (currentChunk.trim()) chunks.push(currentChunk.trim());

        // If the part itself is too large, recurse with next separator
        if (part.split(/\s+/).length > options.chunkSize) {
          chunks.push(...splitRecursive(part, options, i + 1));
          currentChunk = '';
        } else {
          currentChunk = part;
        }
      } else {
        currentChunk = candidate;
      }
    }

    if (currentChunk.trim() && currentChunk.split(/\s+/).length >= options.minChunkSize) {
      chunks.push(currentChunk.trim());
    }

    if (chunks.length > 0) return chunks;
  }

  // Fallback: split by word count
  return fixedSizeChunk(text, options.chunkSize);
}

function addOverlap(chunks: string[], overlapSize: number): string[] {
  if (overlapSize === 0 || chunks.length <= 1) return chunks;

  return chunks.map((chunk, i) => {
    if (i === 0) return chunk;

    // Get last N words from previous chunk
    const prevWords = chunks[i - 1].split(/\s+/);
    const overlapWords = prevWords.slice(-overlapSize);

    return overlapWords.join(' ') + ' ' + chunk;
  });
}

function fixedSizeChunk(text: string, chunkSize: number): string[] {
  const words = text.split(/\s+/);
  const chunks: string[] = [];

  for (let i = 0; i < words.length; i += chunkSize) {
    const chunk = words.slice(i, i + chunkSize).join(' ');
    if (chunk.trim()) chunks.push(chunk.trim());
  }

  return chunks;
}

Example 3: Hybrid Retrieval with Re-Ranking

Combine vector search, full-text search, and Cohere re-ranking for optimal retrieval.

// lib/rag/retrieval.ts
import { embed } from 'ai';
import { openai } from '@ai-sdk/openai';
import { db } from '@/lib/db';

const RRF_K = 60;

interface RetrievalResult {
  id: string;
  content: string;
  metadata: Record<string, string>;
  score: number;
  retrievalMethod: 'vector' | 'fulltext' | 'hybrid';
}

export async function hybridRetrieveAndRerank(
  query: string,
  options: {
    limit?: number;
    category?: string;
    rerankEnabled?: boolean;
  } = {}
): Promise<RetrievalResult[]> {
  const { limit = 5, category, rerankEnabled = true } = options;
  const candidateLimit = rerankEnabled ? 20 : limit;

  // Run vector and full-text search in parallel
  const [vectorResults, fulltextResults] = await Promise.all([
    vectorSearch(query, candidateLimit, category),
    fulltextSearch(query, candidateLimit, category),
  ]);

  // Fuse with RRF
  const fusedMap = new Map<
    string,
    RetrievalResult & { rrfScore: number }
  >();

  vectorResults.forEach((result, rank) => {
    fusedMap.set(result.id, {
      ...result,
      score: 0,
      rrfScore: 1 / (RRF_K + rank + 1),
      retrievalMethod: 'vector',
    });
  });

  fulltextResults.forEach((result, rank) => {
    const existing = fusedMap.get(result.id);
    const rrfScore = 1 / (RRF_K + rank + 1);

    if (existing) {
      existing.rrfScore += rrfScore;
      existing.retrievalMethod = 'hybrid';
    } else {
      fusedMap.set(result.id, {
        ...result,
        score: 0,
        rrfScore,
        retrievalMethod: 'fulltext',
      });
    }
  });

  let candidates = Array.from(fusedMap.values())
    .sort((a, b) => b.rrfScore - a.rrfScore)
    .slice(0, candidateLimit);

  // Re-rank with Cohere
  if (rerankEnabled && candidates.length > 0) {
    const rerankResponse = await fetch('https://api.cohere.ai/v2/rerank', {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${process.env.COHERE_API_KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        model: 'rerank-v3.5',
        query,
        documents: candidates.map((c) => c.content),
        top_n: limit,
        return_documents: false,
      }),
    });

    const rerankData = await rerankResponse.json();

    candidates = rerankData.results.map(
      (r: { index: number; relevance_score: number }) => ({
        ...candidates[r.index],
        score: r.relevance_score,
      })
    );
  } else {
    candidates = candidates.slice(0, limit).map((c) => ({
      ...c,
      score: c.rrfScore,
    }));
  }

  return candidates;
}

async function vectorSearch(
  query: string,
  limit: number,
  category?: string
) {
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-small'),
    value: query,
  });

  return db.$queryRaw<
    Array<{ id: string; content: string; metadata: Record<string, string> }>
  >`
    SELECT id, content, metadata
    FROM "Chunk"
    WHERE embedding IS NOT NULL
    ${category ? db.$queryRaw`AND metadata->>'source' = ${category}` : db.$queryRaw``}
    ORDER BY embedding <=> ${JSON.stringify(embedding)}::vector
    LIMIT ${limit}
  `;
}

async function fulltextSearch(
  query: string,
  limit: number,
  category?: string
) {
  return db.$queryRaw<
    Array<{ id: string; content: string; metadata: Record<string, string> }>
  >`
    SELECT id, content, metadata
    FROM "Chunk"
    WHERE search_vector @@ plainto_tsquery('english', ${query})
    ${category ? db.$queryRaw`AND metadata->>'source' = ${category}` : db.$queryRaw``}
    ORDER BY ts_rank_cd(search_vector, plainto_tsquery('english', ${query})) DESC
    LIMIT ${limit}
  `;
}

Example 4: Chat-With-Your-Docs Endpoint

A complete "chat with your documents" feature with streaming, citations, and conversation history.

// app/api/docs/chat/route.ts
import { streamText, type Message } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
import { hybridRetrieveAndRerank } from '@/lib/rag/retrieval';
import { auth } from '@/lib/auth';

export const maxDuration = 30;

export async function POST(req: Request) {
  const session = await auth();
  if (!session?.user) return new Response('Unauthorized', { status: 401 });

  const { messages }: { messages: Message[] } = await req.json();
  const latestQuestion = messages[messages.length - 1].content;

  // Retrieve relevant chunks
  const chunks = await hybridRetrieveAndRerank(latestQuestion, {
    limit: 6,
    rerankEnabled: true,
  });

  // Build context with numbered sources
  const sourcesContext = chunks
    .map(
      (chunk, i) =>
        `[${i + 1}] "${chunk.metadata.title}" (${chunk.metadata.source})\n${chunk.content}`
    )
    .join('\n\n---\n\n');

  const systemPrompt = `You are a documentation assistant. Answer questions using the provided sources.

RULES:
- Use ONLY information from the numbered sources below
- Cite every claim with [1], [2], etc. matching the source numbers
- If sources don't contain the answer, say: "I couldn't find information about that in the documentation. Could you rephrase your question?"
- Be concise — prefer short, direct answers
- For how-to questions, provide step-by-step instructions
- For conceptual questions, explain clearly and cite the relevant source
- You may combine information from multiple sources

SOURCES:
${sourcesContext}`;

  const result = streamText({
    model: anthropic('claude-sonnet-4-20250514'),
    system: systemPrompt,
    messages,
    maxTokens: 1500,
  });

  // Return stream with source metadata as headers
  const response = result.toDataStreamResponse();

  // Attach sources as a custom header for the client to parse
  response.headers.set(
    'X-Sources',
    JSON.stringify(
      chunks.map((c, i) => ({
        index: i + 1,
        title: c.metadata.title,
        source: c.metadata.source,
        sourceUrl: c.metadata.sourceUrl,
        score: c.score,
      }))
    )
  );

  return response;
}

Example 5: RAG Evaluation Script

Automated evaluation of retrieval quality and answer faithfulness.

// scripts/evaluate-rag.ts
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
import { retrieve, retrieveAndRerank, buildRAGPrompt } from '@/lib/rag/pipeline';
import { generateText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

// Test dataset: questions with expected relevant document IDs
const testSet = [
  {
    question: 'How do I reset my password?',
    expectedDocIds: ['doc-password-reset', 'doc-account-security'],
    expectedAnswer: 'Go to Settings > Security > Change Password',
  },
  {
    question: 'What are the API rate limits?',
    expectedDocIds: ['doc-api-limits', 'doc-api-reference'],
    expectedAnswer: '1000 requests per minute for Pro tier',
  },
  // ... more test cases
];

// Evaluation schemas
const FaithfulnessEval = z.object({
  claims: z.array(
    z.object({
      claim: z.string(),
      supportedByContext: z.boolean(),
      sourceIndex: z.number().optional(),
    })
  ),
  faithfulnessScore: z.number().min(0).max(1).describe(
    'Fraction of claims supported by the context'
  ),
});

const RelevanceEval = z.object({
  isRelevant: z.boolean(),
  relevanceScore: z.number().min(0).max(1),
  reasoning: z.string(),
});

async function evaluateRetrieval(
  question: string,
  retrievedIds: string[],
  expectedIds: string[]
): { precision: number; recall: number } {
  const relevantRetrieved = retrievedIds.filter((id) =>
    expectedIds.some((eid) => id.startsWith(eid))
  );

  const precision =
    retrievedIds.length > 0
      ? relevantRetrieved.length / retrievedIds.length
      : 0;

  const recall =
    expectedIds.length > 0
      ? relevantRetrieved.length / expectedIds.length
      : 0;

  return { precision, recall };
}

async function evaluateFaithfulness(
  answer: string,
  context: string
): Promise<z.infer<typeof FaithfulnessEval>> {
  const { object } = await generateObject({
    model: openai('gpt-4o'),
    temperature: 0,
    schema: FaithfulnessEval,
    prompt: `Evaluate the faithfulness of this answer to the provided context.

CONTEXT:
${context}

ANSWER:
${answer}

Extract each factual claim from the answer. For each claim, determine if it is supported by the context.`,
  });

  return object;
}

async function evaluateRelevance(
  question: string,
  answer: string
): Promise<z.infer<typeof RelevanceEval>> {
  const { object } = await generateObject({
    model: openai('gpt-4o'),
    temperature: 0,
    schema: RelevanceEval,
    prompt: `Does this answer adequately address the question?

QUESTION: ${question}
ANSWER: ${answer}

Rate relevance from 0 (completely irrelevant) to 1 (perfectly relevant).`,
  });

  return object;
}

// Run full evaluation
async function runEvaluation() {
  const results = [];

  for (const testCase of testSet) {
    console.log(`Evaluating: ${testCase.question}`);

    // 1. Retrieve
    const chunks = await retrieveAndRerank(testCase.question, { limit: 5 });

    // 2. Measure retrieval quality
    const retrievalMetrics = await evaluateRetrieval(
      testCase.question,
      chunks.map((c) => c.id),
      testCase.expectedDocIds
    );

    // 3. Generate answer
    const { system, prompt } = buildRAGPrompt(chunks, testCase.question);
    const { text: answer } = await generateText({
      model: anthropic('claude-sonnet-4-20250514'),
      system,
      prompt,
      maxTokens: 1000,
    });

    // 4. Evaluate faithfulness
    const faithfulness = await evaluateFaithfulness(
      answer,
      chunks.map((c) => c.content).join('\n\n')
    );

    // 5. Evaluate relevance
    const relevance = await evaluateRelevance(testCase.question, answer);

    results.push({
      question: testCase.question,
      retrievalPrecision: retrievalMetrics.precision,
      retrievalRecall: retrievalMetrics.recall,
      faithfulness: faithfulness.faithfulnessScore,
      relevance: relevance.relevanceScore,
      answer: answer.slice(0, 200),
    });
  }

  // Summary
  const avg = (arr: number[]) => arr.reduce((a, b) => a + b, 0) / arr.length;

  console.log('\n=== RAG Evaluation Summary ===');
  console.log(`Retrieval Precision: ${(avg(results.map((r) => r.retrievalPrecision)) * 100).toFixed(1)}%`);
  console.log(`Retrieval Recall:    ${(avg(results.map((r) => r.retrievalRecall)) * 100).toFixed(1)}%`);
  console.log(`Faithfulness:        ${(avg(results.map((r) => r.faithfulness)) * 100).toFixed(1)}%`);
  console.log(`Answer Relevance:    ${(avg(results.map((r) => r.relevance)) * 100).toFixed(1)}%`);

  return results;
}

runEvaluation().then(console.log).catch(console.error);

Common Mistakes

1. Wrong Chunk Size

Wrong: Using 2000-token chunks because "more context is better."

Fix: Aim for 300-500 tokens per chunk. Large chunks dilute the embedding — a 2000-token chunk about three different topics matches everything weakly. Small chunks are more precise. Start with 500 tokens, measure retrieval quality, and adjust.

2. No Chunk Overlap

Wrong: Splitting cleanly at every 500 tokens with no overlap, cutting sentences and ideas in half.

Fix: Use 10-20% overlap between consecutive chunks (50-100 tokens for 500-token chunks). This ensures that information at chunk boundaries is not lost. The overlap costs minimal extra storage and embedding cost.

3. Too Many or Too Few Chunks Retrieved

Wrong: Retrieving 20 chunks and stuffing them all into the prompt, or retrieving only 1 chunk.

Fix: Retrieve 5-8 chunks for most questions. Over-retrieve (15-20) and re-rank down to the top 5-8. More chunks increase noise and reduce answer quality. Fewer chunks risk missing relevant information. Use re-ranking to get the best of both worlds.

4. No Source Attribution

Wrong: The RAG system generates answers with no indication of where the information came from. Users cannot verify claims.

Fix: Number sources in the context ([1], [2]) and instruct the model to cite them inline. Return source metadata (title, URL, relevance score) alongside the answer. Build UI components that let users click citations to see the original document.

5. Ignoring Metadata Filtering

Wrong: Searching all documents when the user only has access to a specific subset (their team's docs, their plan's features, a specific product version).

Fix: Apply metadata filters before vector search. Filter by tenant, category, date range, and access permissions using SQL WHERE clauses. This is a security requirement, not just an optimization — users must not see documents they lack permission to access.

6. No Evaluation System

Wrong: Deploying RAG and assuming it works because the demo looked good.

Fix: Build an evaluation pipeline from day one. Create a test set of 50-100 questions. Measure retrieval precision, recall, faithfulness, and answer relevance. Run evaluation after every change to the pipeline (new chunking strategy, different model, updated prompts). Set quality thresholds and alert when they drop.

7. Stuffing the Entire Context Window

Wrong: Retrieving as many chunks as will fit in the context window (100+ chunks for a 128K model).

Fix: More context is not better. Research shows that retrieval quality degrades significantly when the LLM must find relevant information in a large, noisy context. Retrieve 5-8 high-quality chunks. Use re-ranking to ensure the best chunks are selected. Reserve context space for conversation history and output.

8. Not Handling Stale Content

Wrong: Documents are updated on the website, but the RAG pipeline still serves old embeddings.

Fix: Track document update timestamps. Re-ingest documents when they change. Use incremental ingestion: compare the document hash or updatedAt timestamp, and only re-embed changed documents. Set up a cron job or webhook to trigger re-ingestion when source data changes.

9. Using RAG When You Do Not Need It

Wrong: Building a full RAG pipeline for a FAQ with 20 questions that fits in a single prompt.

Fix: If your content fits in the context window (<50K tokens), just include it directly. RAG adds complexity — ingestion pipeline, vector database, chunking, retrieval tuning, evaluation. Only use RAG when: content exceeds context limits, content changes frequently, you need citations, or you need per-user access control.

10. No Fallback for Empty Retrieval

Wrong: When retrieval returns zero relevant chunks, the LLM generates an answer from its training data — potentially hallucinating.

Fix: Check retrieval results before generation. If no chunks pass the similarity threshold (e.g., all below 0.5), return a specific message: "I don't have information about that in my knowledge base." Never let the model fill gaps with its own knowledge in a RAG system — that defeats the purpose.

See also: Embeddings | LLM-Patterns | Prompt-Engineering | Backend/Background-Jobs | Backend/Database-Design

Last reviewed: 2026-03

By Ryan Lind, Assisted by Claude Code and Google Gemini.

RAG (Retrieval-Augmented Generation)

On this page