RAG Without a Server — Complete Browser Implementation in JavaScript
Retrieval-Augmented Generation is usually presented as a server-side architecture: a Python backend calls an embedding API, queries a vector database, and feeds retrieved chunks to an LLM. But the retrieval step — finding relevant chunks from your knowledge base — can run entirely in the browser. This post shows a complete implementation where retrieval and embedding are 100% local, and generation is either local or an optional API call.
npm install altor-vec @huggingface/transformers
What "RAG without a server" actually means
RAG has two phases with different requirements:
- Retrieval: given a user query, find the most relevant chunks from your knowledge base. This is a nearest-neighbor search over float vectors. It's computationally cheap and requires no external state — perfect for the browser.
- Generation: given retrieved chunks as context, generate a natural language answer. This requires a language model. Running a capable LLM locally in the browser is possible but requires a large download and significant memory (700MB+). The alternative is a single API call to OpenAI or Claude: a server call, but only for generation. Only the question and the retrieved chunks are sent; the rest of your corpus stays in the browser.
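To make the retrieval half concrete, here is a minimal sketch of nearest-neighbor search written as a brute-force scan in plain JavaScript. It is not how altor-vec works internally; `cosineSimilarity` and `topK` are hypothetical helper names used only for illustration, and the three-dimensional vectors are toy data.

```javascript
// Brute-force nearest-neighbor search over float vectors.
// Illustrative only: a real index avoids scanning every vector.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every vector against the query and keep the k best
function topK(query, vectors, k) {
  return vectors
    .map((vec, id) => ({ id, score: cosineSimilarity(query, vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy 3-dimensional "embeddings"
const corpus = [
  new Float32Array([1, 0, 0]),
  new Float32Array([0, 1, 0]),
  new Float32Array([0.9, 0.1, 0]),
];
const results = topK(new Float32Array([1, 0, 0]), corpus, 2);
console.log(results.map(r => r.id)); // ids of the two closest vectors
```

A scan like this is linear in the number of chunks, which is fine for a few thousand vectors; a dedicated index matters as the corpus grows.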
| Approach | Retrieval | Generation | Data privacy | Latency |
|---|---|---|---|---|
| Fully local | Browser (altor-vec) | Browser (WebLLM) | Complete — nothing leaves browser | First query slow (model load); then fast |
| Hybrid (recommended) | Browser (altor-vec) | OpenAI/Claude API | Query + chunks go to API; corpus stays local | Fast — 0.5ms retrieval + 1-2s generation |
| Traditional server RAG | Server vector DB | Server LLM | Everything on server | Depends on backend |
The hybrid approach is the most practical starting point: your knowledge base never leaves the user's browser, query latency is fast, and you don't need to ship a 700MB model. This guide implements both options.
Architecture overview
Knowledge base (markdown/text files)
  └── Build time: embed each chunk → WasmSearchEngine.from_vectors()
        └── Write: rag-index.bin + rag-chunks.json → /public

Runtime (browser):
  User question
    └── 1. Embed question → Float32Array (Transformers.js, 23MB model, local)
    └── 2. Search index → top-5 chunks (altor-vec, <1ms, local)
    └── 3. Build prompt: [system] + [retrieved chunks] + [question]
    └── 4a. POST to OpenAI/Claude API → stream response back
        OR
    └── 4b. Generate locally → WebLLM (no server, needs 700MB+ model)
    └── 5. Stream answer to user
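The runtime flow above can be sketched as one small orchestration function. This is an illustration under assumptions, not the post's final code: `answerQuestion` and the stub stages are hypothetical names, and each stage is injected so the real embedder, index, and generator can be swapped in later.

```javascript
// Orchestrates the runtime steps; each stage is passed in so it can be swapped or stubbed.
async function answerQuestion(question, { embed, retrieve, generate }, onToken) {
  const queryVector = await embed(question);      // step 1: embed the question
  const chunks = await retrieve(queryVector, 5);  // step 2: top-5 search
  if (chunks.length === 0) {
    onToken("I don't have information on that.");
    return { sources: [] };
  }
  await generate(question, chunks, onToken);      // steps 3-5: prompt, generate, stream
  return { sources: [...new Set(chunks.map(c => c.url))] };
}

// Stub stages standing in for Transformers.js, altor-vec and the LLM
const stubs = {
  embed: async () => new Float32Array([1, 0]),
  retrieve: async () => [{ url: '/install', text: 'Run npm install.' }],
  generate: async (q, chunks, onToken) => onToken(`Based on ${chunks.length} chunk(s).`),
};

let output = '';
const done = answerQuestion('How do I install?', stubs, t => { output += t; });
done.then(({ sources }) => console.log(output, sources));
```

Keeping the stages injectable also makes it straightforward to switch between the API-based and fully local generators without touching the retrieval code.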
Step 1: Chunk and index your knowledge base
RAG quality depends heavily on chunking strategy. For documentation, chunk by section (H2/H3). For long articles, split into fixed-size chunks with about 100 characters of overlap between neighbors, as the script below does. Each chunk should be self-contained enough to answer a question on its own.
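As a standalone illustration of overlapping chunks, here is a minimal character-based sliding window. `chunkWithOverlap` is a hypothetical helper, and the sizes in the usage lines are deliberately tiny so the overlap is easy to see.

```javascript
// Split text into fixed-size windows where each window repeats the
// last `overlap` characters of the previous one.
function chunkWithOverlap(text, chunkSize = 500, overlap = 100) {
  if (overlap >= chunkSize) throw new RangeError('overlap must be smaller than chunkSize');
  const step = chunkSize - overlap;
  const out = [];
  for (let start = 0; start < text.length; start += step) {
    out.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window already reached the end
  }
  return out;
}

const sample = 'abcdefghij'.repeat(3); // 30 characters
const pieces = chunkWithOverlap(sample, 12, 4);
console.log(pieces);
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.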
// scripts/build-rag-index.mjs
import fs from 'node:fs/promises';
import { glob } from 'glob';
import { pipeline } from '@huggingface/transformers';
import init, { WasmSearchEngine } from 'altor-vec';

await init();
const embed = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

const files = await glob('./docs/**/*.md');
const vectors = []; // flat list of floats: 384 per chunk, in chunk order
const chunks = []; // metadata per chunk

for (const file of files) {
  const text = await fs.readFile(file, 'utf8');
  const title = text.match(/^#\s+(.+)/m)?.[1] ?? file;
  // glob may return the path with or without a leading "./" (and with "\" on Windows)
  const url = '/' + file.replace(/\\/g, '/').replace(/^(\.\/)?docs\//, '').replace(/\.md$/, '');

  // Split into sections by H2 headings
  const sections = text.split(/(?=^##\s)/m).filter(s => s.trim().length > 50);

  for (const section of sections) {
    const heading = section.match(/^#+\s+(.+)/m)?.[1] ?? title;
    const body = section.replace(/^#+\s+.+\n/, '').trim();
    if (!body) continue;

    // For long sections, split into overlapping 500-char chunks
    const chunkSize = 500;
    const overlap = 100;
    for (let i = 0; i < body.length; i += chunkSize - overlap) {
      const chunk = body.slice(i, i + chunkSize);
      if (chunk.trim().length < 50) continue;

      // Prefix the heading so short chunks keep their topic
      const textToEmbed = `${heading}: ${chunk}`;
      const out = await embed(textToEmbed, { pooling: 'mean', normalize: true });
      vectors.push(...Array.from(out.data));
      chunks.push({
        id: chunks.length,
        title: heading,
        text: chunk,
        url,
        pageTitle: title,
      });
    }
  }
}

const engine = WasmSearchEngine.from_vectors(new Float32Array(vectors), 384, 16, 200, 50);
await fs.writeFile('./public/rag-index.bin', Buffer.from(engine.to_bytes()));
await fs.writeFile('./public/rag-chunks.json', JSON.stringify(chunks));
console.log(`Indexed ${chunks.length} chunks from ${files.length} files`);
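The build script relies on an implicit contract: the flat vector array holds exactly 384 floats per chunk, in the same order as the metadata array. A small sanity check before writing the index can catch a mismatch early. The helpers below, `checkIndexInputs` and `vectorAt`, are hypothetical names for this sketch and expect a typed array rather than the plain array the script accumulates.

```javascript
const DIMS = 384; // embedding size used in the build script

// Verify that the flat array lines up with the chunk metadata
function checkIndexInputs(flatVectors, chunkCount, dims = DIMS) {
  if (flatVectors.length !== chunkCount * dims) {
    throw new Error(`Expected ${chunkCount * dims} floats, got ${flatVectors.length}`);
  }
  for (let i = 0; i < flatVectors.length; i++) {
    if (!Number.isFinite(flatVectors[i])) {
      throw new Error(`Non-finite value in vector ${Math.floor(i / dims)}`);
    }
  }
}

// Row i of the flat array is the embedding for chunk i (a view, not a copy)
function vectorAt(flatVectors, i, dims = DIMS) {
  return flatVectors.subarray(i * dims, (i + 1) * dims);
}

const flat = new Float32Array(2 * DIMS).fill(0.05);
checkIndexInputs(flat, 2);
console.log(vectorAt(flat, 1).length);
```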
Step 2: Build the retrieval module
// lib/retrieval.ts
import type { WasmSearchEngine } from 'altor-vec';

interface Chunk {
  id: number;
  title: string;
  text: string;
  url: string;
  pageTitle: string;
}

export interface RetrievedChunk extends Chunk {
  score: number;
}

let engine: WasmSearchEngine | null = null;
let chunks: Chunk[] = [];
let initPromise: Promise<void> | null = null;

export async function initRetrieval(): Promise<void> {
  if (engine) return;
  if (initPromise) return initPromise;
  initPromise = (async () => {
    const { default: init, WasmSearchEngine } = await import('altor-vec');
    await init();
    const [indexBuf, chunkData] = await Promise.all([
      fetch('/rag-index.bin').then(r => r.arrayBuffer()),
      fetch('/rag-chunks.json').then(r => r.json()),
    ]);
    engine = new WasmSearchEngine(new Uint8Array(indexBuf));
    chunks = chunkData;
  })();
  return initPromise;
}

export async function retrieve(queryVector: Float32Array, topK = 5): Promise<RetrievedChunk[]> {
  await initRetrieval();
  if (!engine) return [];
  const hits = JSON.parse(engine.search(queryVector, topK)) as [number, number][];
  return hits
    .map(([id, distance]) => ({ ...chunks[id], score: 1 - distance }))
    .filter(c => c.score > 0.5); // filter low-relevance chunks
}
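The last step of `retrieve()` can be isolated as a pure function, which makes the scoring and filtering easy to test without loading WASM. This sketch assumes, as the module above does, that each hit is an `[id, distance]` pair and that `1 - distance` is a similarity score; `toRetrievedChunks` is a hypothetical name, and the guard against unknown ids is an addition.

```javascript
// Convert raw [id, distance] hits into scored chunks, best first.
function toRetrievedChunks(hits, chunks, minScore = 0.5) {
  return hits
    .filter(([id]) => chunks[id] !== undefined) // index and metadata out of sync
    .map(([id, distance]) => ({ ...chunks[id], score: 1 - distance }))
    .filter(c => c.score > minScore)
    .sort((a, b) => b.score - a.score);
}

const meta = [
  { id: 0, title: 'Install', text: 'Run npm install.' },
  { id: 1, title: 'Usage', text: 'Call init() first.' },
];
// Toy hits: id 9 has no metadata, id 0 is too far away
const retrieved = toRetrievedChunks([[1, 0.2], [0, 0.7], [9, 0.1]], meta);
console.log(retrieved);
```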
Step 3: Embed the user's query
// lib/embed.ts
import { pipeline } from '@huggingface/transformers';

type EmbedPipeline = Awaited<ReturnType<typeof pipeline>>;

// Loaded lazily on the first query, then reused
let embedder: EmbedPipeline | null = null;

export async function embedQuery(text: string): Promise<Float32Array> {
  if (!embedder) {
    embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
  }
  const out = await embedder(text, { pooling: 'mean', normalize: true });
  // out.data is a typed array, not an ArrayBuffer; copy it into a fresh Float32Array
  return Float32Array.from(out.data as Float32Array);
}
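For intuition about the `pooling: 'mean', normalize: true` options, here is a simplified sketch of what mean pooling followed by L2 normalization computes. It is a plain-JavaScript illustration, not the Transformers.js implementation: real pooling also skips padding tokens via the attention mask, which is omitted here, and `meanPoolAndNormalize` is a hypothetical name.

```javascript
// Average per-token vectors into one sentence vector, then scale it to unit length.
function meanPoolAndNormalize(tokenVectors) {
  const dims = tokenVectors[0].length;
  const pooled = new Float32Array(dims);
  for (const vec of tokenVectors) {
    for (let d = 0; d < dims; d++) pooled[d] += vec[d] / tokenVectors.length;
  }
  let norm = 0;
  for (let d = 0; d < dims; d++) norm += pooled[d] * pooled[d];
  norm = Math.sqrt(norm);
  if (norm === 0) return pooled; // avoid dividing by zero
  for (let d = 0; d < dims; d++) pooled[d] /= norm;
  return pooled;
}

// Two toy token vectors become one unit-length vector
const sentenceVec = meanPoolAndNormalize([[3, 0], [0, 4]]);
console.log(Array.from(sentenceVec));
```

Because the result has unit length, the dot product of two such vectors equals their cosine similarity, which is why the query and the indexed chunks must be embedded with the same options.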
Step 4a: Generate with OpenAI API (hybrid approach)
// lib/generate.ts
import type { RetrievedChunk } from './retrieval';

export async function generateAnswer(
  question: string,
  context: RetrievedChunk[],
  apiKey: string,
  onToken: (token: string) => void
): Promise<void> {
  if (context.length === 0) {
    onToken("I don't have relevant information to answer that question.");
    return;
  }

  const contextText = context
    .map(c => `[${c.pageTitle} - ${c.title}]\n${c.text}`)
    .join('\n\n---\n\n');

  const response = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model: 'gpt-4o-mini',
      stream: true,
      messages: [
        {
          role: 'system',
          content: `You are a helpful assistant. Answer questions using only the provided context. If the context doesn't contain enough information to answer, say so clearly.\n\nContext:\n${contextText}`,
        },
        { role: 'user', content: question },
      ],
    }),
  });

  // Surface HTTP errors (bad key, rate limit) instead of silently showing nothing
  if (!response.ok || !response.body) {
    onToken(`The generation request failed (HTTP ${response.status}).`);
    return;
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = ''; // holds a partial line when an event is split across reads

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // keep the trailing incomplete line for the next read
    for (const line of lines) {
      if (!line.startsWith('data: ') || line.trim() === 'data: [DONE]') continue;
      try {
        const data = JSON.parse(line.slice(6));
        const token = data.choices[0]?.delta?.content ?? '';
        if (token) onToken(token);
      } catch {
        // Ignore lines that are not valid JSON events
      }
    }
  }
}
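The stream-parsing loop can also be pulled out into a small function that needs no network, which makes it unit-testable. The sketch below assumes the `data: {...}` line format handled in `generateAnswer`; `createSseTokenParser` is a hypothetical name. It buffers incomplete lines, so an event split across two reads is still parsed.

```javascript
// Returns a feed(text) function that extracts tokens from streamed "data:" lines.
function createSseTokenParser(onToken) {
  let buffer = '';
  return function feed(textChunk) {
    buffer += textChunk;
    const lines = buffer.split('\n');
    buffer = lines.pop(); // last element is an incomplete line (or '')
    for (const raw of lines) {
      const line = raw.trim();
      if (!line.startsWith('data: ') || line === 'data: [DONE]') continue;
      try {
        const payload = JSON.parse(line.slice(6));
        const token = payload.choices?.[0]?.delta?.content ?? '';
        if (token) onToken(token);
      } catch {
        // Skip lines that are not valid JSON
      }
    }
  };
}

let received = '';
const feed = createSseTokenParser(t => { received += t; });
// One event deliberately split across two reads
feed('data: {"choices":[{"delta":{"content":"Hel');
feed('lo"}}]}\n\ndata: {"choices":[{"delta":{"content":" world"}}]}\n\ndata: [DONE]\n\n');
console.log(received);
```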
Step 4b: Fully local generation with WebLLM
For a completely server-free implementation, use WebLLM for in-browser generation. The tradeoff is a large model download (700MB+) on first use.
import { CreateMLCEngine } from '@mlc-ai/web-llm';
import type { RetrievedChunk } from './retrieval';

// Created on first use; the model download happens here
let llmEngine: Awaited<ReturnType<typeof CreateMLCEngine>> | null = null;

export async function generateLocalAnswer(
  question: string,
  context: RetrievedChunk[],
  onToken: (token: string) => void
): Promise<void> {
  if (!llmEngine) {
    llmEngine = await CreateMLCEngine('Llama-3.2-1B-Instruct-q4f32_1-MLC', {
      initProgressCallback: (p) => console.log(p.text),
    });
  }

  const contextText = context.map(c => c.text).join('\n\n');
  const stream = await llmEngine.chat.completions.create({
    messages: [
      { role: 'system', content: `Answer using only this context:\n${contextText}` },
      { role: 'user', content: question },
    ],
    stream: true,
  });

  for await (const chunk of stream) {
    const token = chunk.choices[0]?.delta?.content ?? '';
    if (token) onToken(token);
  }
}
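Depending on how the in-browser engine is configured, the usable context window may be much smaller than with a hosted API, so it can help to cap how much retrieved text goes into the prompt. The helper below is a rough sketch: `fitContext` is a hypothetical name, the character budget is an arbitrary assumption (characters are only a crude proxy for tokens), and chunks are assumed to arrive best first.

```javascript
// Keep the highest-ranked chunks that fit inside a character budget.
function fitContext(chunks, maxChars = 4000) {
  const kept = [];
  let used = 0;
  for (const chunk of chunks) {
    if (used + chunk.text.length > maxChars) break; // stop at the first chunk that overflows
    kept.push(chunk);
    used += chunk.text.length;
  }
  return kept;
}

const ranked = [
  { text: 'a'.repeat(10) },
  { text: 'b'.repeat(10) },
  { text: 'c'.repeat(10) },
];
const fitted = fitContext(ranked, 25);
console.log(fitted.length);
```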
Step 5: Wire it all together in a React component
'use client';
import { useState, useEffect } from 'react';
import { initRetrieval, retrieve } from '@/lib/retrieval';
import { embedQuery } from '@/lib/embed';
import { generateAnswer } from '@/lib/generate';

export function RagChat({ apiKey }: { apiKey: string }) {
  const [question, setQuestion] = useState('');
  const [answer, setAnswer] = useState('');
  const [sources, setSources] = useState<string[]>([]);
  const [loading, setLoading] = useState(false);

  // Start loading the index as soon as the component mounts
  useEffect(() => { initRetrieval(); }, []);

  async function ask() {
    if (!question.trim() || loading) return;
    setLoading(true);
    setAnswer('');
    setSources([]);
    try {
      const queryVec = await embedQuery(question);
      const chunks = await retrieve(queryVec, 5);
      setSources([...new Set(chunks.map(c => c.url))]);
      await generateAnswer(question, chunks, apiKey, (token) => {
        setAnswer(prev => prev + token);
      });
    } finally {
      // Re-enable the input even if embedding, retrieval or generation throws
      setLoading(false);
    }
  }

  return (
    <div>
      <div style={{ display: 'flex', gap: 8 }}>
        <input
          value={question}
          onChange={e => setQuestion(e.target.value)}
          onKeyDown={e => e.key === 'Enter' && ask()}
          placeholder="Ask a question..."
          disabled={loading}
        />
        <button onClick={ask} disabled={loading || !question.trim()}>
          {loading ? 'Thinking…' : 'Ask'}
        </button>
      </div>
      {answer && <div><p>{answer}</p></div>}
      {sources.length > 0 && (
        <div>
          <p>Sources:</p>
          <ul>{sources.map(s => <li key={s}><a href={s}>{s}</a></li>)}</ul>
        </div>
      )}
    </div>
  );
}
Handling "I don't know"
The most important quality feature in RAG is knowing when not to answer. If retrieval returns chunks with low similarity scores, the context is probably irrelevant to the question. Pass low-quality context to an LLM and it will hallucinate a confident-sounding answer.
// In retrieve(): filter by minimum score
return hits
  .map(([id, distance]) => ({ ...chunks[id], score: 1 - distance }))
  .filter(c => c.score > 0.65); // tune this threshold

// If retrieve() returns [], don't call generateAnswer()
// Instead show: "I don't have information on that in the current knowledge base"
A cosine-similarity threshold of 0.65 is a reasonable starting point for documentation-style content (the retrieve() function in Step 2 uses a more permissive 0.5). Tune it by testing with 20-30 representative queries: if you're getting hallucinations, raise the threshold; if you're getting too many "I don't know" responses, lower it.
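One way to make that tuning loop concrete is to record, for each test query, the best retrieval score and whether the knowledge base can actually answer it, then count both kinds of mistake at several thresholds. The sketch below uses made-up sample scores purely for illustration; `evaluateThreshold` is a hypothetical helper.

```javascript
// Count the two failure modes for a given threshold.
function evaluateThreshold(samples, threshold) {
  let wrongAnswers = 0;  // unanswerable query passed the filter (hallucination risk)
  let wrongRefusals = 0; // answerable query was filtered out ("I don't know")
  for (const { topScore, answerable } of samples) {
    const passes = topScore > threshold;
    if (passes && !answerable) wrongAnswers++;
    if (!passes && answerable) wrongRefusals++;
  }
  return { threshold, wrongAnswers, wrongRefusals };
}

// Made-up scores for five test queries (illustrative, not measured)
const samples = [
  { topScore: 0.82, answerable: true },
  { topScore: 0.71, answerable: true },
  { topScore: 0.62, answerable: true },
  { topScore: 0.68, answerable: false },
  { topScore: 0.41, answerable: false },
];
const report = [0.5, 0.65, 0.7].map(t => evaluateThreshold(samples, t));
console.table(report);
```

Raising the threshold trades wrong answers for wrong refusals; which side to favor depends on how costly a confident but unsupported answer is for your users.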
Chunking strategy matters more than the model
The biggest quality lever in RAG is chunking, not the embedding model or the LLM. Chunks that are too long lose precision (the embedding averages too much). Chunks that are too short lose context (the LLM doesn't have enough to generate a good answer).
Starting recommendations:
- Documentation pages: chunk by H2 section, maximum 500 characters per chunk
- FAQ pages: one Q&A pair per chunk
- Long articles: 400-character chunks with 80-character overlap
- Code-heavy docs: keep code blocks together with their explanation paragraph
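The last recommendation, keeping code blocks with their explanation, can be sketched as a two-pass splitter: first split the markdown into blank-line-separated blocks without breaking inside a fence, then attach each code block to the paragraph before it. This is a minimal illustration that only handles backtick fences; `splitBlocks` and `chunkKeepingCode` are hypothetical names.

```javascript
const FENCE = '`'.repeat(3); // built this way so the example contains no literal fence

// Pass 1: split on blank lines, but never inside a fenced code block
function splitBlocks(markdown) {
  const blocks = [];
  let current = [];
  let inCode = false;
  for (const line of markdown.split('\n')) {
    const isFence = line.startsWith(FENCE);
    if (isFence) inCode = !inCode;
    const closesCode = isFence && !inCode;
    if (!inCode && !isFence && line.trim() === '') {
      if (current.length) blocks.push(current.join('\n'));
      current = [];
    } else {
      current.push(line);
      if (closesCode) { blocks.push(current.join('\n')); current = []; }
    }
  }
  if (current.length) blocks.push(current.join('\n'));
  return blocks;
}

// Pass 2: attach each code block to the paragraph that precedes it
function chunkKeepingCode(markdown) {
  const chunks = [];
  for (const block of splitBlocks(markdown)) {
    if (block.startsWith(FENCE) && chunks.length > 0) {
      chunks[chunks.length - 1] += '\n\n' + block;
    } else {
      chunks.push(block);
    }
  }
  return chunks;
}

const doc = [
  'Install the package first.', '',
  'Then call init():', '',
  FENCE + 'js', 'await init();', '', 'const ready = true;', FENCE, '',
  'Unrelated paragraph.',
].join('\n');
const docChunks = chunkKeepingCode(doc);
console.log(docChunks.length);
```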
FAQ
Can RAG really work without any server?
The retrieval step runs entirely in the browser using altor-vec WASM. Embedding generation also runs locally with Transformers.js. For generation you can use WebLLM for full local execution, in which case nothing leaves the browser, or a cloud API call for just the generation step. With the cloud option, the question and the retrieved chunks are sent to the API provider; the rest of your corpus stays local.
What are the memory requirements?
The retrieval index for 1,000 chunks at 384 dimensions is about 1.5MB. The embedding model is 23MB. A local generation model via WebLLM is 700MB-1GB. The hybrid approach (local retrieval + cloud generation) needs only about 25MB in total, which is practical on most devices.
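The index figure follows from simple arithmetic on the raw float32 data, which a short calculation makes explicit. This sketch counts only the vectors themselves (any extra index structure adds to it) and uses 1MB = 1,000,000 bytes; `rawVectorBytes` is a hypothetical helper.

```javascript
// Raw storage for `count` float32 vectors of `dims` dimensions
function rawVectorBytes(count, dims, bytesPerFloat = 4) {
  return count * dims * bytesPerFloat;
}

const indexBytes = rawVectorBytes(1000, 384); // 1,000 × 384 × 4 bytes
const modelBytes = 23 * 1e6;                  // embedding model size quoted above
console.log((indexBytes / 1e6).toFixed(2) + ' MB of raw vectors');
console.log(((indexBytes + modelBytes) / 1e6).toFixed(1) + ' MB for the hybrid setup');
```

The same function shows how the index scales: ten times as many chunks means roughly ten times the vector data, while the embedding model size stays fixed.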
How do I handle questions the knowledge base can't answer?
Filter retrieved chunks by minimum similarity score (0.65 is a good starting point). If no chunks pass the threshold, return a standard "I don't have that information" response rather than passing empty or low-quality context to the generator.