
Browser-Based RAG: Run Retrieval-Augmented Generation Locally

RAG (Retrieval-Augmented Generation) usually implies a server architecture: vector database, retrieval API, prompt builder, and LLM endpoint. But for many developer-facing products, especially docs assistants and local knowledge tools, you can move major pieces into the browser. This tutorial explains a browser RAG architecture where retrieval runs locally using altor-vec, and generation can use a local model (or a remote fallback). We focus on implementation trade-offs, not marketing demos.

Install altor-vec: npm install altor-vec

RAG recap in one paragraph

RAG improves LLM answers by retrieving relevant context chunks and injecting them into the prompt. Instead of relying only on model pretraining, the system grounds responses in your corpus. The quality of retrieval often dominates answer quality: poor context yields poor generation even with strong models.

Why run RAG in the browser?

Running retrieval locally keeps documents and queries on-device, removes a network round trip from every search, and makes costs predictable. Even if generation remains remote, local retrieval still reduces total dependency on backend services and can cut prompt token volume through smarter context selection.

Architecture diagram

Browser (Client)
┌────────────────────────────────────────────────────────────┐
│ Query Input                                                │
│   │                                                        │
│   ├─> Embed Query (local model / worker)                   │
│   │                                                        │
│   ├─> altor-vec HNSW Retrieval (index.bin)                 │
│   │     └─> top-k chunk IDs + scores                       │
│   │                                                        │
│   ├─> Prompt Assembler (metadata + chunk text)             │
│   │                                                        │
│   └─> Local LLM generation OR remote fallback endpoint     │
└────────────────────────────────────────────────────────────┘

Step 1: build retrievable chunks

Chunk your corpus along semantic boundaries (paragraphs/sections) rather than arbitrary token windows alone. Keep the source URL and title with each chunk for citation rendering.

// chunk record
{
  "id": 4401,
  "source": "/docs/indexing#hnsw-params",
  "title": "HNSW Parameters",
  "text": "M and ef_search control graph connectivity and query breadth..."
}
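Records like the one above can be produced by a simple paragraph-level chunker. A minimal sketch — the `chunkDocument` helper and its blank-line splitting rule are illustrative, not part of altor-vec:

```javascript
// Split one source document into paragraph-level chunks, carrying
// the source URL and title on every chunk for citation rendering.
// Splitting on blank lines is a simple stand-in for real section-aware chunking.
function chunkDocument(doc, startId = 0) {
  return doc.text
    .split(/\n{2,}/)            // blank-line paragraph boundaries
    .map(p => p.trim())
    .filter(p => p.length > 0)
    .map((text, i) => ({
      id: startId + i,
      source: doc.source,
      title: doc.title,
      text,
    }));
}

const docChunks = chunkDocument({
  source: '/docs/indexing#hnsw-params',
  title: 'HNSW Parameters',
  text: 'M controls graph connectivity.\n\nef_search controls query breadth.',
});
// docChunks[0] and docChunks[1] each carry id, source, title, and text
```

Pass a running `startId` across documents so chunk ids stay unique corpus-wide.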

Step 2: retrieval engine setup (altor-vec)

import init, { WasmSearchEngine } from 'altor-vec';

await init();
const bytes = new Uint8Array(await (await fetch('/rag/index.bin')).arrayBuffer());
const chunks = await (await fetch('/rag/chunks.json')).json();
const engine = new WasmSearchEngine(bytes);
const chunkById = new Map(chunks.map(c => [c.id, c]));

export function retrieve(queryVector, topK = 5) {
  const pairs = JSON.parse(engine.search(new Float32Array(queryVector), topK));
  // Look up chunks by id (chunk ids are not array positions),
  // and convert distance to a similarity-style score.
  return pairs.map(([id, distance]) => ({ ...chunkById.get(id), score: 1 - distance }));
}
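The "say you don't know" instruction in the prompt (Step 3, below) works best when paired with a retrieval-side gate, so weak matches never reach the prompt at all. A sketch; `filterConfident` and the 0.35 cutoff are assumptions to tune against your own eval set:

```javascript
// Drop low-confidence results before prompt assembly so the model is
// told context is insufficient instead of being fed weak chunks.
// MIN_SCORE is an assumed starting point; tune it on fixed eval questions.
const MIN_SCORE = 0.35;

function filterConfident(results, minScore = MIN_SCORE) {
  return results.filter(r => r.score >= minScore);
}

const gated = filterConfident([
  { id: 1, score: 0.82 },
  { id: 2, score: 0.12 },
]);
// gated keeps only the id-1 result
```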

Step 3: prompt assembly with citations

Prompt assembly should preserve provenance and enforce answer constraints. Provide explicit instruction to cite source URLs and to say “I don’t know” when retrieval confidence is low.

function buildPrompt(userQuery, contexts) {
  const contextBlock = contexts.map((c, i) =>
    `[${i+1}] ${c.title} (${c.source})\n${c.text}`
  ).join('\n\n');

  return `You are a docs assistant. Answer only from provided context.
If context is insufficient, say you don't know.
Include citations as [n].

Question: ${userQuery}

Context:
${contextBlock}`;
}
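Since one motivation for local retrieval is cutting prompt token volume, it can help to pack contexts into an explicit budget before calling buildPrompt. A sketch under stated assumptions — `packContexts` and the rough chars-per-token heuristic are not altor-vec APIs:

```javascript
// Greedily keep the highest-scoring chunks that fit a character budget.
// maxChars loosely approximates a token budget (~4 chars per token is a
// common rule of thumb; measure with your actual tokenizer).
function packContexts(contexts, maxChars = 6000) {
  const picked = [];
  let used = 0;
  for (const c of contexts) {   // assumes contexts arrive sorted by score
    const cost = c.title.length + c.source.length + c.text.length;
    if (used + cost > maxChars) break;
    picked.push(c);
    used += cost;
  }
  return picked;
}
```

Retrieve a few more chunks than you plan to use (e.g. top 8), then let the budget decide how many survive.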

Step 4: generation path options

Option A — local generation

Use an in-browser model runtime (WebGPU/WebAssembly) for full local flow. This maximizes privacy but is limited by device capability and model size.

Option B — remote fallback

Keep retrieval local and send only compact, selected context to a hosted LLM endpoint. This often gives better output quality while still reducing egress scope.

async function answer(query) {
  const qVec = await embedLocal(query);
  const docs = retrieve(qVec, 6);
  const prompt = buildPrompt(query, docs);

  if (supportsLocalGeneration()) {
    return generateLocal(prompt);
  }
  // Response shape depends on your /api/generate contract.
  const res = await fetch('/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt })
  });
  return res.json();
}
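The flow above leaves supportsLocalGeneration() undefined. A minimal capability probe might treat WebGPU availability as a proxy; real deployments should also consider device memory and whether model weights are already downloaded:

```javascript
// Rough capability probe for the local-generation path.
// WebGPU presence is a proxy, not a guarantee the model will fit;
// returns false in non-browser environments.
function supportsLocalGeneration() {
  return typeof navigator !== 'undefined' && 'gpu' in navigator;
}
```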

Quality controls for browser RAG

Treat retrieval quality as a first-class concern: enforce a minimum similarity score before answering, deduplicate near-identical chunks, and keep the "answer only from provided context" instruction in every prompt, so low-confidence retrieval degrades into "I don't know" rather than hallucination.

Limitations you must accept

Browser RAG has constraints. Large models can exceed memory budgets or fail on older devices. Cold start for local embedding or generation can be noticeable. Background tab throttling may affect long-running generation. Also, browser caches are not a secure enclave; sensitive corpus data should only be shipped when your access model allows it.

Performance tuning checklist

  1. Preload index when user opens assistant panel.
  2. Run embedding and generation in workers where possible.
  3. Use smaller embedding model for latency, then tune retrieval hyperparameters.
  4. Measure answer quality with fixed eval questions, not anecdotal chats.
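For item 4, even a tiny retrieval eval that runs in the browser or CI beats anecdotal chats. A sketch; the eval-set shape and `retrievalHitRate` helper are assumptions:

```javascript
// Fraction of eval questions whose expected source appears in top-k
// retrieval results. Tracks retrieval quality across index or
// embedding-model changes with a fixed, repeatable question set.
function retrievalHitRate(evalSet, retrieveFn, topK = 5) {
  let hits = 0;
  for (const { queryVector, expectedSources } of evalSet) {
    const results = retrieveFn(queryVector, topK);
    if (results.some(r => expectedSources.includes(r.source))) hits += 1;
  }
  return hits / evalSet.length;
}
```

Plug in the retrieve() function from Step 2 as `retrieveFn` and pre-embedded query vectors as inputs.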

End-to-end walkthrough summary

At runtime, query text is embedded locally, top-k chunks are retrieved via altor-vec, and a prompt is assembled with citations. The generation step can be local or remote, but retrieval remains local and deterministic. This architecture is often enough for technical docs assistants, onboarding copilots, and internal knowledge tools where privacy and cost predictability matter.

Conclusion

Browser-based RAG is not a universal replacement for server infrastructure, but it is a credible architecture for many product surfaces. If your corpus is moderate and your priority is low latency with tighter data boundaries, local retrieval plus optional local generation can deliver strong results. altor-vec provides a compact retrieval core that makes this feasible in standard JavaScript stacks.

CTA: npm install altor-vec · Star on GitHub