Browser-Based RAG: Run Retrieval-Augmented Generation Locally
RAG (Retrieval-Augmented Generation) usually implies a server architecture: vector database, retrieval API, prompt builder, and LLM endpoint. But for many developer-facing products, especially docs assistants and local knowledge tools, you can move major pieces into the browser. This tutorial explains a browser RAG architecture where retrieval runs locally using altor-vec, and generation can use a local model (or a remote fallback). We focus on implementation trade-offs, not marketing demos.
npm install altor-vec
RAG recap in one paragraph
RAG improves LLM answers by retrieving relevant context chunks and injecting them into the prompt. Instead of relying only on model pretraining, the system grounds responses in your corpus. The quality of retrieval often dominates answer quality: poor context yields poor generation even with strong models.
Why run RAG in the browser?
- Privacy: user queries and retrieved chunks can stay on-device.
- Cost: local retrieval removes recurring vector API costs.
- Latency: nearest-neighbor search avoids network roundtrips.
- Offline mode: docs/chat assistants still work for retrieval when disconnected.
Even if generation remains remote, local retrieval still reduces total dependency on backend services and can cut prompt token volume through smarter context selection.
Architecture diagram
Browser (Client)
┌────────────────────────────────────────────────────────────┐
│ Query Input │
│ │ │
│ ├─> Embed Query (local model / worker) │
│ │ │
│ ├─> altor-vec HNSW Retrieval (index.bin) │
│ │ └─> top-k chunk IDs + scores │
│ │ │
│ ├─> Prompt Assembler (metadata + chunk text) │
│ │ │
│ └─> Local LLM generation OR remote fallback endpoint │
└────────────────────────────────────────────────────────────┘
Step 1: build retrievable chunks
Chunk your corpus by semantic boundaries (paragraphs/sections), not arbitrary token windows only. Keep source URL/title with each chunk for citation rendering.
// chunk record
{
  "id": 4401,
  "source": "/docs/indexing#hnsw-params",
  "title": "HNSW Parameters",
  "text": "M and ef_search control graph connectivity and query breadth..."
}
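A minimal chunker along these lines can be sketched as a hypothetical helper that splits a page on blank-line paragraph boundaries and carries `source` and `title` on every record (real pipelines usually also split on headings and merge very short paragraphs):

```javascript
// Sketch: split a docs page into paragraph-level chunk records.
// Source and title are attached to every chunk for citation rendering.
function chunkPage(text, source, title, startId = 0) {
  return text
    .split(/\n\s*\n/)              // paragraph boundaries, not fixed token windows
    .map(p => p.trim())
    .filter(p => p.length > 0)
    .map((p, i) => ({ id: startId + i, source, title, text: p }));
}
```

The resulting array, serialized as `chunks.json`, is what retrieval later indexes by `id`.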
Step 2: retrieval engine setup (altor-vec)
// Load the prebuilt HNSW index and the chunk metadata that accompanies it.
import init, { WasmSearchEngine } from 'altor-vec';
await init();
const bytes = new Uint8Array(await (await fetch('/rag/index.bin')).arrayBuffer());
const chunks = await (await fetch('/rag/chunks.json')).json();
const engine = new WasmSearchEngine(bytes);

export function retrieve(queryVector, topK = 5) {
  // search returns a JSON string of [id, distance] pairs
  const pairs = JSON.parse(engine.search(new Float32Array(queryVector), topK));
  // score = 1 - distance assumes a cosine-style distance metric
  return pairs.map(([id, distance]) => ({ ...chunks[id], score: 1 - distance }));
}
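To make the engine's job concrete: HNSW is an approximate, scalable version of exact nearest-neighbor search. For a tiny corpus, or as a fallback while `index.bin` is still downloading, a brute-force cosine search (a hypothetical helper, assuming unit-normalized vectors) returns the same kind of id/score pairs:

```javascript
// Sketch: exact brute-force retrieval over unit-normalized vectors.
// HNSW approximates this ranking at much larger scale.
function cosineTopK(queryVec, chunkVecs, topK = 5) {
  const scored = chunkVecs.map((v, id) => {
    let dot = 0;
    for (let i = 0; i < v.length; i++) dot += v[i] * queryVec[i];
    return { id, score: dot };        // dot product = cosine similarity when normalized
  });
  return scored.sort((a, b) => b.score - a.score).slice(0, topK);
}
```

This also gives you a cheap correctness check: on a sample of queries, the HNSW top-k should largely agree with the brute-force top-k.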
Step 3: prompt assembly with citations
Prompt assembly should preserve provenance and enforce answer constraints. Provide an explicit instruction to cite source URLs and to say “I don’t know” when retrieval confidence is low.
function buildPrompt(userQuery, contexts) {
  const contextBlock = contexts.map((c, i) =>
    `[${i + 1}] ${c.title} (${c.source})\n${c.text}`
  ).join('\n\n');
  return `You are a docs assistant. Answer only from provided context.
If context is insufficient, say you don't know.
Include citations as [n].
Question: ${userQuery}
Context:
${contextBlock}`;
}
Step 4: generation path options
Option A — local generation
Use an in-browser model runtime (WebGPU/WebAssembly) for full local flow. This maximizes privacy but is limited by device capability and model size.
Option B — remote fallback
Keep retrieval local and send only compact, selected context to a hosted LLM endpoint. This often gives better output quality while still reducing egress scope.
async function answer(query) {
  const qVec = await embedLocal(query);   // local embedding model (Step 2 setup)
  const docs = retrieve(qVec, 6);
  const prompt = buildPrompt(query, docs);
  if (supportsLocalGeneration()) {
    return generateLocal(prompt);
  }
  // Remote fallback: only the compact, selected prompt leaves the device.
  return fetch('/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt })
  }).then(r => r.json());
}
Quality controls for browser RAG
- Retrieval threshold: if the top score is below a threshold, ask a clarifying question instead of hallucinating.
- Chunk diversity: avoid feeding near-duplicate chunks; diversify by source/section.
- Citation requirement: reject generated outputs that omit references when your UX requires them.
- Context budget: keep prompt compact to reduce local generation cost.
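The first two controls can be combined into a single post-retrieval filter. The sketch below is a hypothetical helper (the threshold value is corpus-dependent and should be tuned against your eval set); returning `null` signals the UI to ask a clarifying question rather than generate:

```javascript
// Sketch: drop low-confidence hits and cap near-duplicate sources.
// Returns null when nothing clears the threshold.
function filterContexts(results, { minScore = 0.35, maxPerSource = 2 } = {}) {
  const kept = [];
  const perSource = new Map();
  for (const r of results) {
    if (r.score < minScore) continue;          // retrieval threshold
    const n = perSource.get(r.source) ?? 0;
    if (n >= maxPerSource) continue;           // diversity: cap chunks per source page
    perSource.set(r.source, n + 1);
    kept.push(r);
  }
  return kept.length > 0 ? kept : null;
}
```

Running this between `retrieve()` and `buildPrompt()` also serves the context-budget goal, since it shrinks the prompt before assembly.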
Limitations you must accept
Browser RAG has constraints. Large models can exceed memory budgets or fail on older devices. Cold start for local embedding or generation can be noticeable. Background tab throttling may affect long-running generation. Also, browser caches are not a secure enclave; sensitive corpus data should only be shipped when your access model allows it.
Performance tuning checklist
- Preload index when user opens assistant panel.
- Run embedding and generation in workers where possible.
- Use a smaller embedding model for latency, then tune retrieval hyperparameters.
- Measure answer quality with fixed eval questions, not anecdotal chats.
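A fixed-question eval can be as small as a recall@k harness. This is a sketch under the assumption that each test case lists the source pages a correct answer must draw from; `retrieveFn` is whatever wraps your embed-then-search pipeline:

```javascript
// Sketch: fixed-question retrieval eval, reporting recall@k
// (fraction of questions with at least one relevant page in the top-k).
async function evalRetrieval(cases, retrieveFn, k = 5) {
  let hits = 0;
  for (const { query, expectedSources } of cases) {
    const got = new Set((await retrieveFn(query, k)).map(r => r.source));
    if (expectedSources.some(s => got.has(s))) hits++;
  }
  return hits / cases.length;
}
```

Tracking this number across index rebuilds and hyperparameter changes is far more reliable than anecdotal chat impressions.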
End-to-end walkthrough summary
At runtime, query text is embedded locally, top-k chunks are retrieved via altor-vec, and a prompt is assembled with citations. The generation step can be local or remote, but retrieval remains local and deterministic. This architecture is often enough for technical docs assistants, onboarding copilots, and internal knowledge tools where privacy and cost predictability matter.
Conclusion
Browser-based RAG is not a universal replacement for server infrastructure, but it is a credible architecture for many product surfaces. If your corpus is moderate and your priority is low latency with tighter data boundaries, local retrieval plus optional local generation can deliver strong results. altor-vec provides a compact retrieval core that makes this feasible in standard JavaScript stacks.