RAG · LLM · Next.js

How I Built a RAG Pipeline for a Client in One Week

A practical walkthrough of designing, building, and shipping a retrieval-augmented generation system from scratch — in just five days.

Faysal Bsata
8 min read

The Challenge

A client came to me with a clear problem: their support team was drowning in repetitive questions that the existing documentation already answered. They needed an AI assistant that could read thousands of pages of internal docs and answer questions accurately, without hallucinating.

The constraint: one week, production-ready.

The Architecture

I chose a classic RAG (Retrieval-Augmented Generation) setup:

1. Ingestion pipeline — documents are chunked, embedded, and stored in pgvector.

2. Retrieval layer — at query time, the user's question is embedded and the top-k most relevant chunks are fetched via cosine similarity.

3. Generation layer — retrieved chunks are stuffed into a Claude prompt as context, and the answer is streamed back to the user.

The stack: Next.js App Router, Vercel AI SDK, pgvector on Supabase, and Claude (claude-sonnet-4-6) for generation.
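To make the retrieval step concrete, here is a minimal TypeScript sketch of top-k selection by cosine similarity. It runs in memory purely for illustration; in the stack above, pgvector does this ranking inside Postgres. The `Chunk` and `ScoredChunk` types, the function names, and the `minScore` parameter are assumptions made for this example, not code from the project.

```typescript
// Hypothetical shapes for illustration; the real schema lives in Supabase.
type Chunk = { id: string; source: string; text: string; embedding: number[] };
type ScoredChunk = { chunk: Chunk; score: number };

// Cosine similarity: dot product divided by the product of the vector norms.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as unrelated
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk, drop those below the threshold, keep the k best.
function topK(query: number[], chunks: Chunk[], k = 5, minScore = 0): ScoredChunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .filter((scored) => scored.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

A linear scan like this is only practical for small collections. At the scale of a real document set, the database's vector index does the same job.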

Day-by-Day Breakdown

Day 1: Data pipeline. I wrote a Node.js script to parse PDFs, split them into ~512-token overlapping chunks, and batch-embed them using OpenAI's text-embedding-3-small. Total: ~14,000 chunks stored in Supabase.

Day 2: Retrieval API. A Next.js Route Handler that takes a query, embeds it, queries pgvector with a similarity threshold, and returns the top 5 chunks.

Day 3: Generation layer. Wired Claude to the retrieval API with a carefully crafted system prompt instructing it to answer only from context and cite sources.

Day 4: UI and streaming. Built the chat interface using the Vercel AI SDK's useChat hook. Streaming felt instantaneous.

Day 5: Evaluation and tuning. I ran 50 test questions against the system, measured answer quality, and tuned chunk size, overlap, and the number of retrieved documents.
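The Day 1 chunking step can be sketched roughly as a sliding window over a token array. This is an illustrative version only: the function name is hypothetical, the input is assumed to be already tokenized (a real pipeline would use the embedding model's tokenizer), and the default overlap of 64 tokens is just one value inside the 10–15% range rather than the setting used in the project.

```typescript
// Split a token sequence into fixed-size windows that share `overlap` tokens
// with their predecessor, so text near a boundary appears in both chunks.
function chunkWithOverlap(tokens: string[], chunkSize = 512, overlap = 64): string[][] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be at least 0 and smaller than chunkSize");
  }
  const step = chunkSize - overlap; // how far the window advances each time
  const chunks: string[][] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize));
    // Stop once a window has reached the end, to avoid a redundant tail chunk.
    if (start + chunkSize >= tokens.length) break;
  }
  return chunks;
}

// Usage with whitespace "tokens" as a crude stand-in for a real tokenizer.
const sampleTokens = "the quick brown fox jumps over the lazy dog today".split(" ");
const sampleChunks = chunkWithOverlap(sampleTokens, 4, 1);
console.log(sampleChunks.map((chunk) => chunk.join(" ")));
```

With a window of 4 and an overlap of 1, each chunk starts with the last token of the previous one, which is the property that helps when an answer straddles a boundary.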

What Worked Well

  • pgvector is surprisingly fast for this scale: sub-50ms retrieval consistently.
  • Overlapping chunks (10–15% overlap) dramatically reduced cases where the answer straddled a chunk boundary.
  • Asking Claude to cite the source document name in its answer built trust with end users.
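One way to get source citations is to label each retrieved chunk with its document name before it goes into the prompt. The sketch below shows that idea. The prompt wording, the `RetrievedChunk` type, and the `buildGroundedPrompt` helper are all hypothetical, written for this example rather than taken from the production system prompt.

```typescript
// Hypothetical shape of what the retrieval API returns.
type RetrievedChunk = { source: string; text: string };

// Build a system/user message pair that keeps the model grounded in the
// retrieved context and gives it a document name to cite for each chunk.
function buildGroundedPrompt(
  question: string,
  chunks: RetrievedChunk[],
): { system: string; user: string } {
  // Number each chunk and put its source name on the first line.
  const context = chunks
    .map((chunk, i) => `[${i + 1}] Source: ${chunk.source}\n${chunk.text}`)
    .join("\n\n");

  const system = [
    "Answer using only the context provided in the user message.",
    "If the context does not contain the answer, say that you don't know.",
    "Cite the source document name for every claim you make.",
  ].join(" ");

  const user = `Context:\n${context}\n\nQuestion: ${question}`;
  return { system, user };
}
```

The returned `system` and `user` strings can then be passed to whichever client sends the request to the model.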

What I'd Do Differently

  • Add a re-ranking step (Cohere Rerank or a cross-encoder) to improve retrieval precision.
  • Implement hybrid search (BM25 + vector) for better handling of exact-match queries like product codes.
  • Build an eval harness from day one rather than bolting it on at the end.
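For the hybrid search idea, one common way to merge a BM25 ranking with a vector ranking is reciprocal rank fusion (RRF). This is a sketch of that technique, not something built in this project: the function name is hypothetical, the inputs are assumed to be lists of chunk IDs ordered best first, and `k = 60` is the conventional default for the smoothing constant.

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per item,
// and items are ordered by their summed contribution.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Usage: a keyword ranking and a vector ranking over made-up chunk IDs.
const bm25Ranking = ["sku-123", "a", "b"];
const vectorRanking = ["a", "c", "sku-123"];
console.log(reciprocalRankFusion([bm25Ranking, vectorRanking]));
```

RRF only looks at ranks, so it avoids having to put BM25 scores and cosine similarities on a common scale.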

Takeaways

RAG is mature enough that you can ship something genuinely useful in a week. The hard part isn't the technology — it's the data quality, the chunking strategy, and building enough eval coverage to trust it in production.
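As a starting point for eval coverage, even a very small harness helps. The sketch below grades an answer by checking that it mentions some required phrases and the expected source document. The article does not say how answer quality was measured, so this substring check is only a crude stand-in; the `EvalCase` and `EvalResult` types and both function names are assumptions made for the example.

```typescript
// Hypothetical test-case shape: phrases the answer must contain, plus the
// document it is expected to cite.
type EvalCase = { question: string; mustInclude: string[]; expectedSource: string };
type EvalResult = { question: string; passed: boolean; missing: string[] };

// Case-insensitive substring check against the generated answer.
function gradeAnswer(testCase: EvalCase, answer: string): EvalResult {
  const haystack = answer.toLowerCase();
  const required = [...testCase.mustInclude, testCase.expectedSource];
  const missing = required.filter((term) => !haystack.includes(term.toLowerCase()));
  return { question: testCase.question, passed: missing.length === 0, missing };
}

// Fraction of cases that passed; 0 for an empty run.
function passRate(results: EvalResult[]): number {
  if (results.length === 0) return 0;
  return results.filter((result) => result.passed).length / results.length;
}
```

Tracking the pass rate over the same set of questions gives a simple regression signal when chunk size, overlap, or the number of retrieved documents changes.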
