React guide

Browser RAG in React with altor-vec

Use altor-vec to add browser RAG to your React app. Retrieval-Augmented Generation (RAG) retrieves relevant document chunks from a local vector index, then injects them as context into an LLM prompt. With altor-vec the retrieval step runs entirely in the browser, with no server, no API keys, and zero per-query cost. Only the LLM call is remote, unless you also run the model in the browser (for example with WebLLM).

Install: npm install altor-vec @xenova/transformers

Implementation

Works with Vite, CRA, or any React 18+ setup. Uses useState + useRef for the engine.

// BrowserRAG.tsx — full RAG pipeline in React (no server)
import { useState, useEffect, useRef } from 'react';
import init, { WasmSearchEngine } from 'altor-vec';
import { pipeline } from '@xenova/transformers';

type Chunk = { id: number; text: string; source: string };

export function BrowserRAG({ chunks }: { chunks: Chunk[] }) {
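  // Typed refs and state so the assignments below compile under strict TypeScript
  const engine = useRef<WasmSearchEngine | null>(null);
  const embedder = useRef<any>(null); // transformers.js feature-extraction pipeline
  const [ready, setReady] = useState(false);
  const [answer, setAnswer] = useState('');
  const [context, setContext] = useState<Chunk[]>([]);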

  useEffect(() => {
    (async () => {
      await init();
      embedder.current = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
      const DIM = 384;
      const vecs = new Float32Array(chunks.length * DIM);
      for (const [i, c] of chunks.entries()) {
        const out = await embedder.current(c.text, { pooling: 'mean', normalize: true });
        vecs.set(out.data, i * DIM);
      }
      engine.current = WasmSearchEngine.from_vectors(vecs, DIM, 16, 200, 50);
      setReady(true);
    })();
  }, []);

  async function ask(question: string) {
    if (!engine.current) return;
    // 1. Retrieve top-3 relevant chunks
    const qOut = await embedder.current(question, { pooling: 'mean', normalize: true });
    const hits = JSON.parse(engine.current.search(new Float32Array(qOut.data), 3));
    const relevant = hits.map((h: any) => chunks[h.id]);
    setContext(relevant);

    // 2. Build prompt with retrieved context
    const prompt = `Answer based on this context only:\n\n${
      relevant.map(c => c.text).join('\n\n')
    }\n\nQuestion: ${question}\nAnswer:`;

    // 3. Call your LLM (OpenAI API, Anthropic, or WebLLM for fully offline)
    const res = await fetch('https://api.openai.com/v1/chat/completions', {
      method: 'POST',
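      // A key bundled into client code is visible to users; proxy this call in production.
      // The env var syntax depends on your bundler (for example import.meta.env.VITE_* in Vite).
      headers: { 'Authorization': `Bearer ${process.env.NEXT_PUBLIC_OPENAI_KEY}`,
                 'Content-Type': 'application/json' },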
      body: JSON.stringify({ model: 'gpt-4o-mini',
        messages: [{ role: 'user', content: prompt }] })
    });
    const data = await res.json();
    setAnswer(data.choices[0].message.content);
  }

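  return (
    <div>
      {ready ? (
        // Minimal question form (a stand-in; swap in your own input UI)
        <form onSubmit={e => {
          e.preventDefault();
          ask(String(new FormData(e.currentTarget).get('q') ?? ''));
        }}>
          <input name="q" placeholder="Ask a question..." />
        </form>
      ) : (
        <p>Loading vector index...</p>
      )}
      {context.length > 0 && (
        <div>
          <h3>Sources used</h3>
          {context.map(c => (
            <p key={c.id}>{c.source}: {c.text.slice(0, 100)}...</p>
          ))}
        </div>
      )}
      {answer && <p>Answer: {answer}</p>}
    </div>
  );
}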
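
To make the retrieval step less of a black box, here is a minimal sketch of what "vectors packed into one Float32Array, then searched" means. It is an exhaustive scan written for illustration and is not how altor-vec searches internally; the names packVectors, topK, and Hit are hypothetical and not part of any library.

```typescript
// Exhaustive top-k search over vectors packed row by row in one Float32Array.
// Illustration only: this is not altor-vec's search algorithm.
type Hit = { id: number; score: number };

function packVectors(rows: number[][], dim: number): Float32Array {
  const flat = new Float32Array(rows.length * dim);
  rows.forEach((row, i) => flat.set(row, i * dim)); // row i starts at offset i * dim
  return flat;
}

function topK(flat: Float32Array, dim: number, query: Float32Array, k: number): Hit[] {
  const hits: Hit[] = [];
  for (let i = 0; i < flat.length / dim; i++) {
    let dot = 0;
    for (let j = 0; j < dim; j++) dot += flat[i * dim + j] * query[j];
    hits.push({ id: i, score: dot }); // dot product equals cosine similarity for unit vectors
  }
  return hits.sort((a, b) => b.score - a.score).slice(0, k);
}

// Three unit-length 2D vectors; the query points along the second axis.
const flat = packVectors([[1, 0], [0, 1], [0.6, 0.8]], 2);
const hits = topK(flat, 2, new Float32Array([0, 1]), 2);
console.log(hits.map(h => h.id)); // [1, 2]
```

Because the embeddings in the component are requested with normalize: true, a plain dot product is enough to rank them by cosine similarity.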

Performance

Retrieval from 10K chunks takes under 1ms, so total RAG latency is dominated by the LLM call (1–30s depending on model). The figures below were measured on an M2 MacBook Pro in Chrome 124. Mobile is typically 2–4× slower, so test on target devices before deploying.

| Index size | Dimensions | Query p50 | Memory |
|---|---|---|---|
| 1,000 vectors | 384 | ~0.1ms | ~2MB |
| 10,000 vectors | 384 | ~0.4ms | ~17MB |
| 50,000 vectors | 384 | ~0.9ms | ~85MB |
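
As a sanity check on the memory column, the raw vector storage can be estimated from first principles: each float32 takes 4 bytes, so n vectors of 384 dimensions need n × 384 × 4 bytes. The sketch below uses decimal megabytes and a hypothetical helper name, rawVectorMB. The measured figures are somewhat higher than these raw numbers, presumably because the index keeps additional structures beyond the vectors themselves.

```typescript
// Raw storage for n float32 vectors: n * dim * 4 bytes, reported in decimal MB.
function rawVectorMB(n: number, dim: number): number {
  return (n * dim * Float32Array.BYTES_PER_ELEMENT) / 1e6;
}

for (const n of [1_000, 10_000, 50_000]) {
  console.log(`${n} vectors x 384 dims: ${rawVectorMB(n, 384).toFixed(1)} MB raw`);
}
// 1000 vectors x 384 dims: 1.5 MB raw
// 10000 vectors x 384 dims: 15.4 MB raw
// 50000 vectors x 384 dims: 76.8 MB raw
```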

When this approach works best

Limitations

Frequently asked questions

Which LLM can I use with browser-side RAG?

For fully offline RAG, use WebLLM (Llama 3.1, Phi-3, Mistral 7B quantized). For online RAG with a client-side retrieval step, call the OpenAI or Anthropic API with the retrieved context — only the LLM call is remote.

How large should my document chunks be for RAG retrieval?

Chunk size of 200–400 tokens (roughly 150–300 words) works well for most use cases. Shorter chunks give more precise retrieval; longer chunks give more context. Use overlapping chunks (50–100 token overlap) to avoid cutting off mid-thought.
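
A minimal sketch of overlapping chunking, assuming words are an acceptable stand-in for tokens (a real tokenizer will count differently). The names chunkWords and TextChunk are hypothetical. Each chunk's id equals its position in the array, which is what the component above relies on when it looks up chunks[h.id].

```typescript
type TextChunk = { id: number; text: string };

// Split text into windows of `size` words, each sharing `overlap` words with the previous one.
function chunkWords(text: string, size: number, overlap: number): TextChunk[] {
  if (overlap >= size) throw new Error('overlap must be smaller than size');
  const words = text.split(/\s+/).filter(Boolean);
  const step = size - overlap;
  const chunks: TextChunk[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push({ id: chunks.length, text: words.slice(start, start + size).join(' ') });
    if (start + size >= words.length) break; // this window already reached the end
  }
  return chunks;
}

// Tiny demo: windows of 4 words with 1 word of overlap.
const demo = chunkWords('a b c d e f g h i j', 4, 1);
console.log(demo.map(c => c.text)); // ['a b c d', 'd e f g', 'g h i j']
```

For the sizes suggested above, a call such as chunkWords(doc, 250, 50) would be a reasonable starting point.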

How do I prevent the LLM from hallucinating if no relevant chunks are found?

Check the top result's score: if the highest similarity score is below 0.6, tell the LLM 'No relevant information found in the document set' rather than injecting empty or low-relevance context. This keeps the model from improvising an answer out of unrelated text. Treat 0.6 as a starting point and tune it for your embedding model and data.
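
The score check can be sketched as a small prompt builder. This assumes each hit carries a similarity score where higher means more relevant; if your search results report a distance instead, the comparison flips. The names ScoredChunk, NO_CONTEXT, and buildPrompt are hypothetical.

```typescript
// Assumed hit shape: score is a similarity, higher is better.
type ScoredChunk = { text: string; score: number };

const NO_CONTEXT = 'No relevant information found in the document set';

function buildPrompt(question: string, hits: ScoredChunk[], minScore = 0.6): string {
  // Best score among the hits; -Infinity when nothing was retrieved.
  const best = hits.reduce((m, h) => Math.max(m, h.score), -Infinity);
  const context = best < minScore
    ? NO_CONTEXT
    : hits.map(h => h.text).join('\n\n');
  return `Answer based on this context only:\n\n${context}\n\nQuestion: ${question}\nAnswer:`;
}

// The only hit scores 0.31, so the fallback message replaces the context.
const weak = buildPrompt('What is the refund policy?', [{ text: 'Unrelated text', score: 0.31 }]);
console.log(weak.includes(NO_CONTEXT)); // true
```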
