Next.js guide
Browser RAG in Next.js with altor-vec
Use altor-vec to add browser RAG to your Next.js app. Retrieval-Augmented Generation (RAG) retrieves relevant document chunks from a local vector index, then injects them as context into an LLM prompt. With altor-vec the retrieval step runs entirely in the browser, with no search server, no API keys, and zero per-query cost. Only the LLM call itself may go to a remote model.
npm install altor-vec @xenova/transformers
Implementation
The component uses the App Router with the 'use client' directive. It keeps the search engine, the embedder, and the chunk list in useRef, and stores the answer and sources in useState.
// Next.js: browser RAG with a pre-built index (App Router)
// app/assistant/page.tsx
'use client';

import { useState, useEffect, useRef } from 'react';
import init, { WasmSearchEngine } from 'altor-vec';
import { pipeline } from '@xenova/transformers';

type Chunk = { text: string; source: string };

export default function AssistantPage() {
  // Refs hold long-lived objects that should not trigger re-renders.
  const engine = useRef<WasmSearchEngine | null>(null);
  const embedder = useRef<any>(null); // feature-extraction pipeline
  const chunks = useRef<Chunk[]>([]);
  const [answer, setAnswer] = useState('');
  const [sources, setSources] = useState<Chunk[]>([]);
  const [loading, setLoading] = useState(false);

  useEffect(() => {
    (async () => {
      await init(); // load the WASM module
      embedder.current = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
      const [indexRes, chunksRes] = await Promise.all([
        fetch('/rag-index.json'),
        fetch('/rag-chunks.json'),
      ]);
      chunks.current = await chunksRes.json();
      // Set the engine last: ask() treats it as the "ready" flag.
      engine.current = WasmSearchEngine.from_json(await indexRes.text());
    })();
  }, []);

  async function ask(question: string) {
    if (!engine.current || !embedder.current) return; // still loading
    setLoading(true);
    try {
      // Embed the question, then fetch the 3 nearest chunks from the local index.
      const out = await embedder.current(question, { pooling: 'mean', normalize: true });
      const hits = JSON.parse(engine.current.search(new Float32Array(out.data), 3));
      const relevant: Chunk[] = hits.map((h: any) => chunks.current[h.id]);
      setSources(relevant);
      const resp = await fetch('/api/chat', { // Next.js Route Handler
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          context: relevant.map(c => c.text).join('\n\n'),
          question,
        }),
      });
      const data = await resp.json();
      setAnswer(data.answer);
    } finally {
      setLoading(false); // clear the spinner even if a step throws
    }
  }

  // The markup below is a minimal reconstruction; element choices are assumptions.
  return (
    <main>
      <h1>Ask about our docs</h1>
      <input
        onKeyDown={e => {
          if (e.key === 'Enter') ask((e.target as HTMLInputElement).value);
        }}
      />
      {loading && <p>Searching...</p>}
      {answer && (
        <section>
          <h2>Answer</h2>
          <p>{answer}</p>
        </section>
      )}
      {sources.length > 0 && (
        <section>
          <h2>Sources</h2>
          <ul>
            {sources.map((s, i) => (
              <li key={i}>{s.source}: {s.text.slice(0, 150)}...</li>
            ))}
          </ul>
        </section>
      )}
    </main>
  );
}
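The component posts `context` and `question` to `/api/chat`, but the guide does not show that handler. As a rough sketch of the prompt-assembly step such a handler might perform, the helper below combines the two fields into one prompt string. The name `buildPrompt`, the instruction wording, and the fallback sentence are assumptions for illustration, not part of altor-vec or Next.js.

```typescript
// Hypothetical helper for the /api/chat Route Handler; not part of altor-vec.
type ChatRequest = { context: string; question: string };

function buildPrompt({ context, question }: ChatRequest): string {
  // An empty context means retrieval found nothing usable.
  const contextBlock =
    context.trim().length > 0
      ? context.trim()
      : 'No relevant information found in the document set.';

  return [
    'Answer the question using only the context below.',
    'If the context does not contain the answer, say so.',
    '',
    'Context:',
    contextBlock,
    '',
    `Question: ${question.trim()}`,
  ].join('\n');
}
```

The handler would pass the returned string to whichever LLM client it uses and send the model's reply back as `{ answer }`, which is the shape the component reads.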
Performance
Retrieval from a 10K-chunk index takes under 1 ms, so total RAG latency is dominated by the LLM call (1–30 s depending on the model). The figures below were measured on an M2 MacBook Pro in Chrome 124. Mobile devices are typically 2–4× slower, so test on target devices before deploying.
| Index size | Dimensions | Query p50 | Memory |
|---|---|---|---|
| 1,000 vectors | 384 | ~0.1ms | ~2MB |
| 10,000 vectors | 384 | ~0.4ms | ~17MB |
| 50,000 vectors | 384 | ~0.9ms | ~85MB |
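The memory column can be sanity-checked with simple arithmetic. The sketch below assumes each vector is stored as float32 (4 bytes per dimension), which matches the `Float32Array` used for queries but is not confirmed for the index internals. The function names are illustrative.

```typescript
// Assumption: vectors are stored as float32 (4 bytes per dimension).
const BYTES_PER_FLOAT32 = 4;

function rawVectorBytes(vectorCount: number, dimensions: number): number {
  return vectorCount * dimensions * BYTES_PER_FLOAT32;
}

function toMegabytes(bytes: number): number {
  return bytes / 1e6; // decimal megabytes
}

// Print the raw vector payload for each index size in the table.
for (const count of [1000, 10000, 50000]) {
  const mb = toMegabytes(rawVectorBytes(count, 384));
  console.log(`${count} vectors: ${mb.toFixed(1)} MB of raw vector data`);
}
```

For 10,000 vectors this gives 10,000 × 384 × 4 = 15,360,000 bytes, a little under the ~17MB in the table. The remainder is presumably index structure overhead, though the guide does not break it down.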
When this approach works best
- Privacy-sensitive apps where document content must never leave the device
- Offline-first AI assistants that run on cached content without network access
- Demo or prototype RAG pipelines with no infrastructure budget
Limitations
- Context window limits: you can only inject a finite number of retrieved chunks — chunk size and k must be tuned
- Browser-side LLM inference (WebLLM) is slow on low-end devices; use an API-based LLM for production
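One way to respect the context window limit is to cap the retrieved chunks by an approximate token budget before building the prompt. This is a minimal sketch: the 4-characters-per-token estimate is a rough heuristic for English text, and `estimateTokens` and `fitToBudget` are hypothetical helper names.

```typescript
type Chunk = { text: string; source: string };

// Rough heuristic (assumption): about 4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep chunks in ranked order until the next one would exceed the budget.
function fitToBudget(ranked: Chunk[], maxContextTokens: number): Chunk[] {
  const kept: Chunk[] = [];
  let used = 0;
  for (const chunk of ranked) {
    const cost = estimateTokens(chunk.text);
    if (used + cost > maxContextTokens) break;
    kept.push(chunk);
    used += cost;
  }
  return kept;
}
```

In the component, this would sit between the search call and the `/api/chat` request, so a larger k can be requested without risking an oversized prompt. For exact counts, swap the heuristic for the tokenizer of the target model.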
Frequently asked questions
Which LLM can I use with browser-side RAG?
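For fully offline RAG, use WebLLM (Llama 3.1, Phi-3, Mistral 7B quantized). For online RAG with a client-side retrieval step, send the retrieved context to the OpenAI or Anthropic API, ideally through a server-side route such as the /api/chat handler in the example so the API key stays off the client. In that setup only the LLM call is remote.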
How large should my document chunks be for RAG retrieval?
Chunk size of 200–400 tokens (roughly 150–300 words) works well for most use cases. Shorter chunks give more precise retrieval; longer chunks give more context. Use overlapping chunks (50–100 token overlap) to avoid cutting off mid-thought.
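The overlapping-window idea can be sketched as follows. To stay dependency-free, this example counts whitespace-separated words rather than model tokens, which is an approximation. `chunkWords` is an illustrative name, not an altor-vec function.

```typescript
// Assumption: whitespace-separated words stand in for tokens.
function chunkWords(text: string, size: number, overlap: number): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error('require size > 0 and 0 <= overlap < size');
  }
  const words = text.split(/\s+/).filter(Boolean);
  const step = size - overlap; // how far each window advances
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(' '));
    if (start + size >= words.length) break; // last window reached the end
  }
  return chunks;
}
```

With the guidance above, a call such as `chunkWords(doc, 250, 50)` would produce windows of roughly 250 words that each share 50 words with the previous one.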
How do I prevent the LLM from hallucinating if no relevant chunks are found?
Check the top result's score: if the highest similarity is below a threshold (0.6 is a reasonable starting point; tune it for your embedding model and data), tell the LLM 'No relevant information found in the document set' rather than injecting empty or low-relevance context. This reduces hallucinations caused by irrelevant context.
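A minimal sketch of that check is shown below. It assumes each hit carries a `score` field where higher means more similar. The guide's code only reads `id` from the search results, so the field name and its meaning are assumptions to verify against the actual output. `selectContext` is a hypothetical helper.

```typescript
type Hit = { id: number; score: number }; // `score` is an assumed field name

const NO_CONTEXT = 'No relevant information found in the document set';

function selectContext(hits: Hit[], texts: string[], minScore = 0.6): string {
  // Highest similarity among the hits; -Infinity when there are none.
  const top = hits.reduce((best, h) => Math.max(best, h.score), -Infinity);
  if (top < minScore) return NO_CONTEXT;

  // Drop individual low-relevance hits as well, then join the rest.
  return hits
    .filter(h => h.score >= minScore)
    .map(h => texts[h.id])
    .join('\n\n');
}
```

If the engine reports a distance instead of a similarity, the comparison would need to be inverted.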