RAG Without a Server — Complete Browser Implementation in JavaScript

Retrieval-Augmented Generation is usually presented as a server-side architecture: a Python backend calls an embedding API, queries a vector database, and feeds retrieved chunks to an LLM. But the retrieval step — finding relevant chunks from your knowledge base — can run entirely in the browser. This post shows a complete implementation where retrieval and embedding are 100% local, and generation is either local or an optional API call.

Install: npm install altor-vec @huggingface/transformers (the build script in Step 1 also imports glob: npm install --save-dev glob)

What "RAG without a server" actually means

RAG has two phases with different requirements:

Retrieval: given a user query, find the most relevant chunks from your knowledge base. This is a nearest-neighbor search over float vectors. It's computationally cheap and requires no external state — perfect for the browser.
Generation: given retrieved chunks as context, generate a natural language answer. This requires a language model. Running a capable LLM locally in the browser is possible but requires significant memory (700MB+). The alternative is a single API call to OpenAI or Claude — a server call, but only for generation, not for your data.
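
To make "nearest-neighbor search over float vectors" concrete, here is a minimal brute-force sketch of what the retrieval phase computes. It is not a description of how altor-vec works internally. The function name `topK` and the toy 3-dimensional vectors are made up for illustration, and the code assumes every vector is L2-normalized so that the dot product equals cosine similarity.

```typescript
// Brute-force nearest-neighbor search over a flat Float32Array of vectors.
// Assumes every vector (and the query) is L2-normalized, so dot product == cosine similarity.
function topK(
  vectors: Float32Array, // n * dims values, stored row by row
  dims: number,
  query: Float32Array,
  k: number
): { id: number; score: number }[] {
  const n = vectors.length / dims;
  const scored: { id: number; score: number }[] = [];
  for (let id = 0; id < n; id++) {
    let dot = 0;
    for (let d = 0; d < dims; d++) {
      dot += vectors[id * dims + d] * query[d];
    }
    scored.push({ id, score: dot });
  }
  // Highest similarity first, then keep the best k
  return scored.sort((a, b) => b.score - a.score).slice(0, k);
}

// Toy corpus: three unit vectors in 3 dimensions (the embeddings in this post have 384)
const corpus = new Float32Array([
  1, 0, 0, // id 0
  0, 1, 0, // id 1
  0.6, 0.8, 0, // id 2
]);
const hits = topK(corpus, 3, new Float32Array([0, 1, 0]), 2);
console.log(hits.map(h => h.id)); // ids of the two closest vectors
```

A linear scan like this costs one dot product per stored vector, which is why it is cheap enough to run client-side for small corpora.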

Approach	Retrieval	Generation	Data privacy	Latency
Fully local	Browser (altor-vec)	Browser (WebLLM)	Complete — nothing leaves browser	First query slow (model load); then fast
Hybrid (recommended)	Browser (altor-vec)	OpenAI/Claude API	Query + chunks go to API; corpus stays local	Fast — 0.5ms retrieval + 1-2s generation
Traditional server RAG	Server vector DB	Server LLM	Everything on server	Depends on backend

The hybrid approach is the most practical starting point: the full knowledge base stays in the user's browser (only the query and the few retrieved chunks are sent to the API), retrieval is fast, and you don't need to ship a 700MB model. This guide implements both options.

Architecture overview

Knowledge base (markdown/text files)
  └── Build time: embed each chunk → WasmSearchEngine.from_vectors()
      └── Write: rag-index.bin + rag-chunks.json → /public

Runtime (browser):
  User question
  └── 1. Embed question → Float32Array (Transformers.js, 23MB model, local)
  └── 2. Search index → top-5 chunks (altor-vec, <1ms, local)
  └── 3. Build prompt: [system] + [retrieved chunks] + [question]
  └── 4a. POST to OpenAI/Claude API → stream response back
       OR
  └── 4b. Generate locally → WebLLM (no server, needs 700MB+ model)
  └── 5. Stream answer to user
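
Step 3 of the runtime flow is plain string assembly. A minimal sketch, assuming a chat-style message format; the names `buildMessages`, `ContextChunk`, and `ChatMessage` are hypothetical and the system prompt wording is only an example:

```typescript
// Hypothetical shapes for illustration; the chunk metadata in this post carries more fields.
interface ContextChunk {
  title: string;
  text: string;
}

interface ChatMessage {
  role: 'system' | 'user';
  content: string;
}

// Step 3: [system] + [retrieved chunks] + [question]
function buildMessages(question: string, chunks: ContextChunk[]): ChatMessage[] {
  // Label each chunk with its title and separate chunks with a visible divider
  const context = chunks
    .map(c => `[${c.title}]\n${c.text}`)
    .join('\n\n---\n\n');
  return [
    {
      role: 'system',
      content: `Answer using only the provided context.\n\nContext:\n${context}`,
    },
    { role: 'user', content: question },
  ];
}

const messages = buildMessages('How do I install it?', [
  { title: 'Setup', text: 'Run npm install altor-vec.' },
]);
console.log(messages[0].content);
```

Keeping the retrieved text in the system message and the question in the user message means the same message array can be handed to either generation backend.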

Step 1: Chunk and index your knowledge base

RAG quality depends heavily on chunking strategy. For documentation, chunk by section (H2/H3). For long articles, chunk by paragraph with a small overlap between neighboring chunks (about 100 characters in the script below). Each chunk should be self-contained enough to answer a question on its own.

// scripts/build-rag-index.mjs
import fs from 'node:fs/promises';
import { glob } from 'glob';
import { pipeline } from '@huggingface/transformers';
import init, { WasmSearchEngine } from 'altor-vec';

await init();
const embed = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

const files = await glob('./docs/**/*.md');
const vectors = [];
const chunks = []; // metadata per chunk

for (const file of files) {
  const text = await fs.readFile(file, 'utf8');
  const title = text.match(/^#\s+(.+)/m)?.[1] ?? file;

  // Split into sections by H2 headings
  const sections = text.split(/(?=^##\s)/m).filter(s => s.trim().length > 50);

  for (const section of sections) {
    const heading = section.match(/^#+\s+(.+)/m)?.[1] ?? title;
    const body = section.replace(/^#+\s+.+\n/, '').trim();
    if (!body) continue;

    // For long sections, split into overlapping 500-char chunks
    const chunkSize = 500;
    const overlap = 100;
    for (let i = 0; i < body.length; i += chunkSize - overlap) {
      const chunk = body.slice(i, i + chunkSize);
      if (chunk.trim().length < 50) continue;

      const textToEmbed = `${heading}: ${chunk}`;
      const out = await embed(textToEmbed, { pooling: 'mean', normalize: true });
      vectors.push(...Array.from(out.data));

      chunks.push({
        id: chunks.length,
        title: heading,
        text: chunk,
        // glob may return 'docs/x.md' without a leading './', so strip either form
        url: '/' + file.replace(/^(\.\/)?docs\//, '').replace(/\.md$/, ''),
        pageTitle: title,
      });
    }
  }
}

const engine = WasmSearchEngine.from_vectors(new Float32Array(vectors), 384, 16, 200, 50);
await fs.writeFile('./public/rag-index.bin', Buffer.from(engine.to_bytes()));
await fs.writeFile('./public/rag-chunks.json', JSON.stringify(chunks));
console.log(`Indexed ${chunks.length} chunks from ${files.length} files`);

Step 2: Build the retrieval module

// lib/retrieval.ts
import type { WasmSearchEngine } from 'altor-vec';

interface Chunk {
  id: number;
  title: string;
  text: string;
  url: string;
  pageTitle: string;
}

export interface RetrievedChunk extends Chunk {
  score: number;
}

let engine: WasmSearchEngine | null = null;
let chunks: Chunk[] = [];
let initPromise: Promise<void> | null = null;

export async function initRetrieval(): Promise<void> {
  if (engine) return;
  if (initPromise) return initPromise;
  initPromise = (async () => {
    const { default: init, WasmSearchEngine } = await import('altor-vec');
    await init();
    const [indexBuf, chunkData] = await Promise.all([
      fetch('/rag-index.bin').then(r => r.arrayBuffer()),
      fetch('/rag-chunks.json').then(r => r.json()),
    ]);
    engine = new WasmSearchEngine(new Uint8Array(indexBuf));
    chunks = chunkData;
  })();
  return initPromise;
}

export async function retrieve(queryVector: Float32Array, topK = 5): Promise<RetrievedChunk[]> {
  await initRetrieval();
  if (!engine) return [];
  const hits = JSON.parse(engine.search(queryVector, topK)) as [number, number][];
  return hits
    .map(([id, distance]) => ({ ...chunks[id], score: 1 - distance }))
    .filter(c => c.score > 0.5); // filter low-relevance chunks
}

Step 3: Embed the user's query

// lib/embed.ts
import { pipeline } from '@huggingface/transformers';

type EmbedPipeline = Awaited<ReturnType<typeof pipeline>>;
let embedder: EmbedPipeline | null = null;

export async function embedQuery(text: string): Promise<Float32Array> {
  if (!embedder) {
    embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
  }
  const out = await embedder(text, { pooling: 'mean', normalize: true });
  // out.data is already a typed array of floats, not an ArrayBuffer; copy it
  return Float32Array.from(out.data as Float32Array);
}

Step 4a: Generate with OpenAI API (hybrid approach)

// lib/generate.ts
import type { RetrievedChunk } from './retrieval';

export async function generateAnswer(
  question: string,
  context: RetrievedChunk[],
  apiKey: string,
  onToken: (token: string) => void
): Promise<void> {
  if (context.length === 0) {
    onToken("I don't have relevant information to answer that question.");
    return;
  }

  const contextText = context
    .map(c => `[${c.pageTitle} — ${c.title}]\n${c.text}`)
    .join('\n\n---\n\n');

  const response = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model: 'gpt-4o-mini',
      stream: true,
      messages: [
        {
          role: 'system',
          content: `You are a helpful assistant. Answer questions using only the provided context. If the context doesn't contain enough information to answer, say so clearly.\n\nContext:\n${contextText}`,
        },
        { role: 'user', content: question },
      ],
    }),
  });

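  if (!response.ok || !response.body) {
    throw new Error(`Generation request failed with status ${response.status}`);
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = ''; // holds a partial SSE line split across network chunks

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // keep the incomplete last line for the next read
    for (const line of lines) {
      if (!line.startsWith('data: ') || line === 'data: [DONE]') continue;
      try {
        const data = JSON.parse(line.slice(6));
        const token = data.choices[0]?.delta?.content ?? '';
        if (token) onToken(token);
      } catch {
        // skip lines that are not valid JSON
      }
    }
  }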
}

Step 4b: Fully local generation with WebLLM

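For a completely server-free implementation, use WebLLM for in-browser generation (npm install @mlc-ai/web-llm). WebLLM runs on WebGPU, so it needs a browser with WebGPU support. The tradeoff is a large model download (700MB+) on first use.

import { CreateMLCEngine } from '@mlc-ai/web-llm';
import type { RetrievedChunk } from './retrieval';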

let llmEngine: Awaited<ReturnType<typeof CreateMLCEngine>> | null = null;

export async function generateLocalAnswer(
  question: string,
  context: RetrievedChunk[],
  onToken: (token: string) => void
): Promise<void> {
  if (!llmEngine) {
    llmEngine = await CreateMLCEngine('Llama-3.2-1B-Instruct-q4f32_1-MLC', {
      initProgressCallback: (p) => console.log(p.text),
    });
  }

  const contextText = context.map(c => c.text).join('\n\n');
  const chunks = await llmEngine.chat.completions.create({
    messages: [
      { role: 'system', content: `Answer using only this context:\n${contextText}` },
      { role: 'user', content: question },
    ],
    stream: true,
  });

  for await (const chunk of chunks) {
    const token = chunk.choices[0]?.delta?.content ?? '';
    if (token) onToken(token);
  }
}
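
Because the local path depends on WebGPU and a large download, it can help to decide per device which generation path to use. The sketch below is one possible heuristic, not part of WebLLM: `pickGenerationMode`, `DeviceCapabilities`, and the 4GB cutoff are assumptions made for illustration. It is written as a pure function so the decision can be tested outside a browser.

```typescript
type GenerationMode = 'local' | 'api';

// Hypothetical capability snapshot. In a browser you might fill it from
// `'gpu' in navigator` (WebGPU) and `navigator.deviceMemory` (Chromium-only, may be undefined).
interface DeviceCapabilities {
  hasWebGPU: boolean;
  deviceMemoryGB?: number;
}

// Assumed cutoff for illustration, not a WebLLM requirement
const MIN_MEMORY_GB = 4;

function pickGenerationMode(caps: DeviceCapabilities): GenerationMode {
  if (!caps.hasWebGPU) return 'api'; // WebLLM runs on WebGPU
  if (caps.deviceMemoryGB !== undefined && caps.deviceMemoryGB < MIN_MEMORY_GB) {
    return 'api'; // a 700MB+ model is a tight fit on low-memory devices
  }
  return 'local';
}

console.log(pickGenerationMode({ hasWebGPU: true, deviceMemoryGB: 8 }));
console.log(pickGenerationMode({ hasWebGPU: true, deviceMemoryGB: 2 }));
console.log(pickGenerationMode({ hasWebGPU: false }));
```

When the memory figure is unknown, this sketch optimistically picks the local path; flip that default if most of your users are on mobile.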

Step 5: Wire it all together in a React component

'use client';
import { useState, useEffect } from 'react';
import { initRetrieval, retrieve } from '@/lib/retrieval';
import { embedQuery } from '@/lib/embed';
import { generateAnswer } from '@/lib/generate';

export function RagChat({ apiKey }: { apiKey: string }) {
  const [question, setQuestion] = useState('');
  const [answer, setAnswer] = useState('');
  const [sources, setSources] = useState<string[]>([]);
  const [loading, setLoading] = useState(false);

  useEffect(() => { initRetrieval(); }, []);

  async function ask() {
    if (!question.trim() || loading) return;
    setLoading(true);
    setAnswer('');
    setSources([]);

    try {
      const queryVec = await embedQuery(question);
      const chunks = await retrieve(queryVec, 5);

      setSources([...new Set(chunks.map(c => c.url))]);

      await generateAnswer(question, chunks, apiKey, (token) => {
        setAnswer(prev => prev + token);
      });
    } catch (err) {
      console.error(err);
      setAnswer('Something went wrong while answering. Please try again.');
    } finally {
      setLoading(false); // always re-enable the input, even if a step failed
    }
  }

  return (
    <div>
      <div style={{ display: 'flex', gap: 8 }}>
        <input
          value={question}
          onChange={e => setQuestion(e.target.value)}
          onKeyDown={e => e.key === 'Enter' && ask()}
          placeholder="Ask a question..."
          disabled={loading}
        />
        <button onClick={ask} disabled={loading || !question.trim()}>
          {loading ? 'Thinking…' : 'Ask'}
        </button>
      </div>
      {answer && <div><p>{answer}</p></div>}
      {sources.length > 0 && (
        <div>
          <p>Sources:</p>
          <ul>{sources.map(s => <li key={s}><a href={s}>{s}</a></li>)}</ul>
        </div>
      )}
    </div>
  );
}

Handling "I don't know"

The most important quality feature in RAG is knowing when not to answer. If retrieval returns chunks with low similarity scores, the context is probably irrelevant to the question. Pass low-quality context to an LLM and it will hallucinate a confident-sounding answer.

// In retrieve() — filter by minimum score
return hits
  .map(([id, distance]) => ({ ...chunks[id], score: 1 - distance }))
  .filter(c => c.score > 0.65); // tune this threshold

// If retrieve() returns [], don't call generateAnswer()
// Instead show: "I don't have information on that in the current knowledge base"

A threshold of around 0.65 cosine similarity is a reasonable starting point for most documentation use cases. Tune it by testing with 20-30 representative queries: if you're getting hallucinations, raise the threshold; if you're getting too many "I don't know" responses, lower it.
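
One way to run that tuning pass is to label each test query by hand as answerable or not, record its top retrieval score, and count both kinds of mistake at several candidate thresholds. The sketch below assumes exactly that setup; `sweepThresholds` and the sample scores are invented for illustration.

```typescript
// One labeled test query: the top retrieval score it produced, and whether
// the knowledge base can actually answer it (judged by hand).
interface LabeledQuery {
  topScore: number;
  answerable: boolean;
}

interface ThresholdReport {
  threshold: number;
  wrongAnswers: number;  // unanswerable queries that still passed the threshold
  wrongRefusals: number; // answerable queries that were refused
}

function sweepThresholds(queries: LabeledQuery[], thresholds: number[]): ThresholdReport[] {
  return thresholds.map(threshold => {
    let wrongAnswers = 0;
    let wrongRefusals = 0;
    for (const q of queries) {
      const passes = q.topScore > threshold;
      if (passes && !q.answerable) wrongAnswers++;
      if (!passes && q.answerable) wrongRefusals++;
    }
    return { threshold, wrongAnswers, wrongRefusals };
  });
}

// Made-up scores for illustration only
const labeled: LabeledQuery[] = [
  { topScore: 0.82, answerable: true },
  { topScore: 0.71, answerable: true },
  { topScore: 0.62, answerable: true },
  { topScore: 0.68, answerable: false },
  { topScore: 0.41, answerable: false },
];
console.table(sweepThresholds(labeled, [0.55, 0.65, 0.75]));
```

Raising the threshold trades wrong answers for wrong refusals; pick the row whose balance matches how costly each mistake is for your users.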

Chunking strategy matters more than the model

The biggest quality lever in RAG is chunking, not the embedding model or the LLM. Chunks that are too long lose precision (the embedding averages too much). Chunks that are too short lose context (the LLM doesn't have enough to generate a good answer).
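
The "averages too much" effect can be illustrated with a toy model. This is a deliberate simplification: real embedding models pool contextual token vectors in hundreds of dimensions, whereas the sketch below treats a chunk as the plain average of two hand-picked 2-D "topic" directions. All names and vectors are invented for illustration.

```typescript
function normalize(v: number[]): number[] {
  const norm = Math.hypot(...v);
  return v.map(x => x / norm);
}

function cosine(a: number[], b: number[]): number {
  const na = normalize(a);
  const nb = normalize(b);
  return na.reduce((sum, x, i) => sum + x * nb[i], 0);
}

// Mean-pool a list of vectors into one (element-wise average)
function meanPool(vectors: number[][]): number[] {
  const dims = vectors[0].length;
  const out = new Array<number>(dims).fill(0);
  for (const v of vectors) {
    for (let d = 0; d < dims; d++) out[d] += v[d] / vectors.length;
  }
  return out;
}

// Toy 2-D "topic" directions, invented for illustration
const install = [1, 0];
const pricing = [0, 1];

const focusedChunk = meanPool([install, install]); // chunk about one topic
const mixedChunk = meanPool([install, pricing]);   // chunk covering both topics

// Similarity of an "install" query to each chunk
console.log(cosine(install, focusedChunk).toFixed(2));
console.log(cosine(install, mixedChunk).toFixed(2));
```

In this toy setup the mixed chunk scores noticeably lower against the install query than the focused chunk does, which is the precision loss the paragraph above describes.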

Starting recommendations:

Documentation pages: chunk by H2 section, maximum 500 characters per chunk
FAQ pages: one Q&A pair per chunk
Long articles: 400-character chunks with 80-character overlap
Code-heavy docs: keep code blocks together with their explanation paragraph
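
Two of these recommendations are easy to sketch in code. The helpers below are illustrative only (`chunkWithOverlap` and `chunkFaq` are hypothetical names, and the `Q:`/`A:` layout is an assumption): one produces fixed-size character windows with overlap, the other keeps each FAQ entry whole.

```typescript
// Split text into fixed-size character windows that overlap by `overlap` characters.
function chunkWithOverlap(text: string, size: number, overlap: number): string[] {
  if (overlap >= size) throw new Error('overlap must be smaller than size');
  const step = size - overlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // this window already reached the end
  }
  return chunks;
}

// Keep each FAQ entry whole: one chunk per question/answer pair.
function chunkFaq(pairs: { q: string; a: string }[]): string[] {
  return pairs.map(p => `Q: ${p.q}\nA: ${p.a}`);
}

// A 1,000-character stand-in article, chunked at 400 characters with 80 of overlap
const article = 'x'.repeat(1000);
const pieces = chunkWithOverlap(article, 400, 80);
console.log(pieces.map(p => p.length));
```

Character windows can cut through the middle of a word or sentence; splitting on paragraph or sentence boundaries first and then applying a size cap is a common refinement.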

FAQ

Can RAG really work without any server?

Yes for retrieval, and optionally for generation. The retrieval step runs entirely in the browser using altor-vec WASM, and embedding generation also runs locally with Transformers.js. For generation you can use WebLLM for fully local execution, in which case nothing leaves the browser, or a cloud API call for just the generation step, in which case the query and the retrieved chunks are sent to the API while the rest of your corpus stays local.

What are the memory requirements?

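The retrieval index for 1,000 chunks at 384 dimensions is about 1.5MB (1,000 × 384 × 4 bytes of float32 data). The embedding model (all-MiniLM-L6-v2) is 23MB. A local generation model via WebLLM (such as Llama 3.2 1B) is 700MB-1GB, so fully local RAG needs roughly 1-2GB in total: fine on desktop, marginal on mobile. The hybrid approach (local retrieval + cloud generation) uses only about 25MB in total, which is practical on any device.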
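
The index figure follows from simple arithmetic on the raw vectors. A small sketch of that calculation, assuming float32 storage and ignoring whatever index overhead altor-vec adds on top (`rawVectorBytes` is a hypothetical helper):

```typescript
// Raw size of the stored vectors: one 32-bit float (4 bytes) per dimension per vector.
// Any index overhead that altor-vec adds on top is not modeled here.
function rawVectorBytes(numVectors: number, dims: number): number {
  return numVectors * dims * Float32Array.BYTES_PER_ELEMENT;
}

const indexMB = rawVectorBytes(1000, 384) / 1e6; // 1,536,000 bytes
const embedModelMB = 23;                         // figure quoted in this post
const hybridTotalMB = indexMB + embedModelMB;

console.log(indexMB.toFixed(2));       // "1.54"
console.log(hybridTotalMB.toFixed(1)); // "24.5"
```

The same function gives a quick estimate for larger corpora: the raw vector size grows linearly with the number of chunks.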

How do I handle questions the knowledge base can't answer?

Filter retrieved chunks by minimum similarity score (0.65 is a good starting point). If no chunks pass the threshold, return a standard "I don't have that information" response rather than passing empty or low-quality context to the generator.

Start with retrieval: npm install altor-vec