Building RAG pipelines with local models
Large language models know a lot, but they do not know about your internal documentation, your codebase, or your company's specific processes. Retrieval-augmented generation (RAG) bridges this gap by finding relevant documents and including them in the model's context.
Running RAG locally means your documents and queries never leave your network. For proprietary codebases, internal docs, or sensitive data, this matters. And the tooling has gotten good enough that you can build a genuinely useful pipeline in an afternoon.
How RAG works
The pipeline has three steps:
- Index your documents by splitting them into chunks and creating vector embeddings
- Retrieve relevant chunks by comparing the query embedding to document embeddings
- Generate an answer by passing the retrieved chunks as context to the LLM
The model does not memorize your documents. It reads the relevant ones at query time and uses them to answer your question. This means you can update your knowledge base by re-indexing documents without retraining anything.
```mermaid
graph LR
    subgraph Indexing
        D[Documents] --> Ch[Chunking]
        Ch --> E[Embed]
        E --> V[(Vector DB)]
    end
    subgraph Retrieval
        Q[Query] --> QE[Embed]
        QE --> S[Similarity Search]
        V --> S
        S --> R[Retrieved Chunks]
        R --> L[LLM]
        L --> A[Answer]
    end
```
The components
You need four things:
- Documents to index (Markdown files, PDFs, code, whatever)
- An embedding model to convert text to vectors (runs locally)
- A vector database to store and search embeddings
- An LLM to generate answers from retrieved context
Setting it up with Ollama and ChromaDB
Install the pieces:
```shell
# Ollama for both embedding and generation
ollama pull llama3.1
ollama pull nomic-embed-text

# ChromaDB for vector storage
pip install chromadb ollama
```

Chunking: the most important decision
How you split documents into chunks has more impact on retrieval quality than your choice of embedding model or vector database. Get chunking wrong and the rest of the pipeline cannot compensate.
Fixed-size chunks (the starting point)
The simplest approach: split text at a fixed character or token count with some overlap between chunks.
```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks
```

This works but has an obvious problem: it cuts mid-sentence and mid-paragraph, losing context. A chunk that starts with "it increased by 30%" is useless without knowing what "it" refers to.
Semantic chunking (by document structure)
For Markdown files, splitting on headings preserves natural document structure:
```python
def chunk_markdown(text):
    """Split markdown by H2 headers."""
    sections = text.split("\n## ")
    chunks = []
    for i, section in enumerate(sections):
        if i > 0:
            section = "## " + section
        section = section.strip()
        if section:
            chunks.append(section)
    return chunks
```

For code, split by functions or classes rather than arbitrary character counts. For HTML, split by structural elements. The goal is to keep semantically coherent units together.
Practical guidelines
Chunk size: 256-512 tokens is the sweet spot for most use cases. Larger chunks provide more context but reduce retrieval precision (a 2,000 token chunk might match on a small portion that is relevant while the rest is noise). Smaller chunks improve precision but can lose surrounding context.
Overlap: 10-20% of chunk size. Overlap ensures information at chunk boundaries is not lost. A fact that spans two chunks gets captured in both.
Start simple. Recursive character splitting at 400-500 tokens with 10-20% overlap works well for general documents. Move to structure-aware chunking only if your metrics show you need it.
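That starting point can be sketched in a few lines. Below is a minimal recursive splitter of my own (not from a library): it works in characters rather than tokens (roughly 4 characters per token, so ~1800 characters approximates 450 tokens), tries coarse separators first and falls back to finer ones, and omits overlap for brevity.

```python
def recursive_split(text, chunk_size=1800, separators=("\n\n", "\n", ". ", " ")):
    """Recursively split text, preferring the coarsest separator that fits.

    chunk_size is in characters (~450 tokens at ~4 chars/token).
    Overlap is omitted here to keep the sketch short.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= chunk_size:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    # A single part may still be oversized: recurse on it
                    if len(part) > chunk_size:
                        chunks.extend(recursive_split(part, chunk_size, separators))
                        current = ""
                    else:
                        current = part
            if current:
                chunks.append(current)
            return chunks
    # No separator worked: hard split at the character limit
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Paragraph breaks win when they exist; a run of text with no separators at all degrades to a fixed-size split.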
Indexing documents
```python
import chromadb
import ollama
import os

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("docs")

def index_file(filepath):
    with open(filepath) as f:
        content = f.read()
    chunks = chunk_markdown(content)
    for i, chunk in enumerate(chunks):
        response = ollama.embed(model="nomic-embed-text", input=chunk)
        embedding = response["embeddings"][0]
        collection.add(
            documents=[chunk],
            embeddings=[embedding],
            ids=[f"{filepath}_{i}"],
            metadatas=[{"source": filepath}],
        )

# Index all markdown files in a directory
for root, dirs, files in os.walk("./docs"):
    for file in files:
        if file.endswith(".md"):
            index_file(os.path.join(root, file))
```

Choosing an embedding model
The embedding model converts text into vectors (arrays of numbers) that capture semantic meaning. Similar texts produce similar vectors, which is how retrieval works.
For local use with Ollama, the main options:
| Model | Dimensions | Best for |
|---|---|---|
| nomic-embed-text | 768 | Good all-rounder, supports Matryoshka dimensions |
| mxbai-embed-large | 1024 | Strong English-only performance |
| bge-m3 | 1024 | Best multilingual, supports hybrid search natively |
nomic-embed-text is the default recommendation. It supports Matryoshka Representation Learning, which means you can truncate embeddings from 768 to 256 dimensions with minimal quality loss, trading accuracy for storage and speed.
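Truncating a Matryoshka embedding is just slicing off the leading dimensions and re-normalizing, since these models front-load information into the early dimensions. A minimal sketch (the helper name is mine; 256 is the cut mentioned above):

```python
import math

def truncate_embedding(vec, dims=256):
    """Truncate a Matryoshka-trained embedding and re-normalize to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]
```

Truncate at indexing time and at query time with the same `dims`, or the vectors will not be comparable.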
For English-only use cases, the practical difference between these models is smaller than the difference good chunking makes. Pick one and focus your optimization effort on chunking strategy and retrieval quality.
Querying
```python
def query_docs(question, n_results=5):
    # Embed the question
    response = ollama.embed(model="nomic-embed-text", input=question)
    query_embedding = response["embeddings"][0]

    # Find relevant chunks
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
    )

    # Build context from retrieved chunks
    context = "\n\n---\n\n".join(results["documents"][0])

    # Generate answer with context
    answer = ollama.chat(
        model="llama3.1",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the user's question based on the following context. "
                    "If the context doesn't contain enough information, say so. "
                    "Cite which sections you're drawing from."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n\n{context}\n\nQuestion: {question}",
            },
        ],
    )
    return answer["message"]["content"]

# Use it
result = query_docs("How do I configure the database connection?")
print(result)
```

Choosing a vector database
ChromaDB is the fastest path from idea to working prototype. It runs embedded (in-process), requires zero configuration, and handles millions of chunks on a single machine.
But it is not the only option:
pgvector adds vector search to PostgreSQL. If you already run Postgres, this is compelling because your documents and embeddings live in the same database with the same transactions and the same SQL. With the pgvectorscale extension, it handles 50M+ vectors efficiently.
Qdrant is written in Rust and excels at combined vector similarity plus complex metadata filtering. Good when you need to filter by source, date, category, or other metadata during retrieval.
LanceDB is embedded and serverless like ChromaDB, built on a columnar format that handles larger-than-memory datasets. Good for local-first applications.
For a local RAG pipeline, ChromaDB or LanceDB are the right starting points. Move to pgvector or Qdrant when you need production features like complex filtering, multi-user access, or integration with existing infrastructure.
Improving retrieval quality
The basic pipeline above works, but there are several techniques that meaningfully improve results.
Reranking
Reranking is the single highest-ROI improvement you can add. After retrieving the top N chunks by vector similarity, pass them through a reranking model that scores each one against the query with much higher accuracy.
The reason: embedding models compress meaning into a fixed-size vector (fast but lossy). A reranker reads the query and each chunk together and scores relevance directly (slow but accurate). The trick is combining them: retrieve broadly with embeddings (top 20-50 chunks), then rerank to select the best 3-5.
Reranking typically improves accuracy by 20-35% with 200-500ms additional latency. For local use, bge-reranker-v2-m3 is open source and self-hostable.
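The retrieve-broadly-then-rerank flow reduces to a few lines once the scoring model is abstracted away. In this sketch `score_fn` is a stand-in I've introduced for the cross-encoder; in practice it would wrap something like sentence-transformers' `CrossEncoder("BAAI/bge-reranker-v2-m3")`, batch-scoring all (query, chunk) pairs at once rather than one call per pair:

```python
def rerank(query, chunks, score_fn, top_k=5):
    """Score each (query, chunk) pair and keep the top_k highest-scoring chunks.

    score_fn(query, chunk) -> float is a stand-in for a cross-encoder model.
    """
    ranked = sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:top_k]
```

The pipeline change is small: ask the vector database for 20-50 results instead of 5, then call `rerank` before building the context string.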
Hybrid search
Pure vector search captures semantic meaning but struggles with exact identifiers, product codes, abbreviations, and proper names that get lost in embedding space. BM25 (keyword-based search) handles exact matches well but cannot understand synonyms or semantic relationships.
Combining both gives you the best of each: precision from keyword matching plus understanding from vector search. Most modern vector databases support hybrid search natively. Qdrant, Weaviate, and Milvus all have built-in BM25 plus vector search. For pgvector, combine it with PostgreSQL's full-text search.
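If your database does not fuse the two result lists for you, a common way to combine them is reciprocal rank fusion (RRF), which needs only the two rankings, not the raw scores. A sketch (function name mine; k=60 is the conventional damping constant):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids with reciprocal rank fusion.

    rankings: a list of ranked id lists, e.g. one from BM25 and one
    from vector search. Documents near the top of either list score high;
    documents present in both lists score higher still.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only looks at positions, it sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.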
Embed questions, not just documents
If your docs have a FAQ section, index the questions separately. Query embeddings are closer in vector space to other questions than to declarative paragraphs. A user asking "How do I reset my password?" matches better against an indexed question than against a paragraph that happens to mention password resets.
Use metadata filtering
Tag chunks with source file, date, category, or other metadata. Filter at query time to narrow the search before vector similarity even runs. This is especially useful when you have documents from multiple products, time periods, or teams.
Advanced patterns
Contextual retrieval
A technique from Anthropic that addresses a fundamental problem: chunks lose context when separated from their source document. A chunk that says "Revenue grew 3%" is useless without knowing which company and time period.
The fix: before embedding each chunk, use an LLM to generate a 50-100 token context snippet that situates the chunk within its source document. Prepend that context to the chunk before embedding. Anthropic reported this reduces failed retrievals by 49%, and by 67% when combined with reranking.
The tradeoff is cost at indexing time (one LLM call per chunk), but you only pay it once.
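The indexing-time step is straightforward to sketch. Here `generate` is a stand-in for an LLM call (e.g. `ollama.chat` with a small local model), and the prompt wording is my own approximation, not Anthropic's published prompt:

```python
def contextualize_chunk(document, chunk, generate):
    """Prepend an LLM-written context snippet to a chunk before embedding.

    generate(prompt) -> str is a stand-in for an LLM call.
    """
    prompt = (
        f"<document>\n{document}\n</document>\n\n"
        f"Here is a chunk from that document:\n<chunk>\n{chunk}\n</chunk>\n\n"
        "Write a short (50-100 token) context that situates this chunk "
        "within the overall document, to improve search retrieval. "
        "Answer with only the context."
    )
    return generate(prompt) + "\n\n" + chunk
```

The contextualized string is what you embed and store; at query time nothing changes.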
Parent-child retrieval
Create small child chunks (100-250 tokens) for precise retrieval and larger parent chunks (500-1500 tokens) for generation context. Vector search matches against child chunks, but the parent chunk gets sent to the LLM. You get surgical retrieval precision with rich generation context.
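The bookkeeping for this is a mapping from each child back to its parent. A character-based sketch (function name mine; a real version would reuse your chunker for the splits):

```python
def build_parent_child_index(parents, child_size=400):
    """Split each parent chunk into small child chunks for retrieval.

    Returns (children, child_to_parent): embed and search the children,
    then map a child hit back to its parent, which is what the LLM sees.
    """
    children, child_to_parent = [], {}
    for p_idx, parent in enumerate(parents):
        for start in range(0, len(parent), child_size):
            child_to_parent[len(children)] = p_idx
            children.append(parent[start:start + child_size])
    return children, child_to_parent
```

At query time, deduplicate: several child hits from the same parent should yield that parent only once in the context.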
HyDE (Hypothetical Document Embeddings)
Instead of embedding the user's short query directly, use the LLM to generate a hypothetical answer, then embed that. The hypothesis is closer in embedding space to real documents than a terse query would be. Adds an LLM call at query time but can significantly improve retrieval for vague or underspecified queries.
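The technique is a one-function wrapper around the two model calls. In this sketch `generate` and `embed` are stand-ins for the LLM and embedding calls (e.g. `ollama.chat` and `ollama.embed`), and the prompt wording is my own:

```python
def hyde_embed(question, generate, embed):
    """Embed a hypothetical answer instead of the raw question (HyDE).

    generate(prompt) -> str and embed(text) -> list[float] are stand-ins
    for LLM and embedding-model calls.
    """
    hypothetical = generate(
        "Write a short passage that plausibly answers this question. "
        "Invented details are fine; only the phrasing matters.\n\n"
        f"Question: {question}"
    )
    return embed(hypothetical)
```

Use the returned vector in place of the plain query embedding when searching the vector database.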
Common failure modes
Wrong chunks retrieved. The retriever returns chunks that are semantically related to query terms but do not contain the answer. Fix with better chunking, reranking, or hybrid search.
Model ignores retrieved context. The LLM can receive the correct evidence but generate an answer from its training data instead. More common with smaller models. Fix with stronger system prompts that instruct the model to only use provided context.
Lost in the middle. Research from Stanford and Meta found that when critical information sits in the middle of the context window (rather than at the beginning or end), model performance drops by 20+ percentage points. Mitigation: place the most relevant chunks first, limit the number of chunks, and use reranking to ensure the best content is most prominent.
Empty results treated as absence of knowledge. If no chunks are retrieved above the similarity threshold, the model should say "I don't know" rather than hallucinate. Your system prompt needs to explicitly handle this case.
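One way to enforce this is to filter retrieved chunks by distance before building the prompt, and skip the LLM call (or tell the model nothing was found) when every chunk fails the cutoff. A sketch against ChromaDB's result shape — note that ChromaDB defaults to L2 distance, so this assumes the collection was created with `metadata={"hnsw:space": "cosine"}`, and the 0.6 cutoff is an arbitrary starting point to tune on your own data:

```python
def context_or_none(results, max_distance=0.6):
    """Keep only retrieved chunks whose distance beats a threshold.

    results mirrors ChromaDB's query output (documents and distances
    are lists of lists, one inner list per query). Returns the joined
    context string, or None if nothing was relevant enough.
    """
    docs = results["documents"][0]
    distances = results["distances"][0]
    kept = [doc for doc, dist in zip(docs, distances) if dist <= max_distance]
    return "\n\n---\n\n".join(kept) if kept else None
```

When this returns None, answer "I couldn't find anything relevant in the docs" directly instead of handing the model weak context to hallucinate from.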
When to use RAG vs fine-tuning
RAG is better when your documents change frequently (it is just a database update) or when you need the model to cite specific sources. Fine-tuning is better when you need the model to internalize a pattern, style, or reasoning approach rather than look up facts.
The best pattern for 2026 is combining both: fine-tune for behavior and style (how to respond), RAG for dynamic knowledge (what to respond with). A legal AI fine-tuned on legal reasoning patterns, using RAG to pull current case law, outperforms either approach alone.
One important nuance: for knowledge bases under ~200,000 tokens, you might not need RAG at all. Full-context prompting with prompt caching (sending the entire knowledge base in the system prompt) can be faster and cheaper than building retrieval infrastructure. RAG becomes essential when your corpus exceeds what fits in the context window.
Do you need a framework?
LangChain, LlamaIndex, and Haystack are popular RAG frameworks. They provide abstractions for chunking, embedding, retrieval, and generation. LlamaIndex is the most focused on pure RAG and has the simplest API. LangChain is broader, covering agentic workflows and tool use beyond just retrieval.
But a basic RAG pipeline is genuinely simple. The code in this post is about 50 lines. Building from scratch gives you full understanding of every piece and avoids dependency churn. The hardest parts are chunking strategy and evaluation, not the retrieval plumbing itself.
My recommendation: build from scratch first to understand the pipeline, then adopt a framework if you need features like evaluation, hybrid search orchestration, or complex multi-step retrieval that would be tedious to build yourself.