Client-Side Vector Search Guide: Build Semantic Search Without a Backend
Most vector search tutorials end with "now spin up a Pinecone instance" or "deploy Weaviate to your cluster." That's backwards if you're building a documentation site, a static blog, or a Chrome extension. You don't need a vector database. You need 40KB of JavaScript and a willingness to run cosine similarity in the browser.
This guide walks through building client-side vector search from scratch, then optimizing it with tooling that actually fits the JavaScript ecosystem. By the end, you'll have semantic search running entirely in the browser with sub-50ms query times on 10,000 embedded documents.
Why Client-Side Vector Search Exists
The pitch for server-side vector databases makes sense at scale. Billions of embeddings, real-time updates, complex filtering. But most search problems are smaller. A documentation site has 500 pages. A blog has 2,000 posts. A product catalog has 8,000 SKUs. That's 1.5MB of embeddings, compressed. Easily cached in IndexedDB.
Client-side vector search means zero latency after the first load, zero backend costs, and zero user data leaving the device. It's a better architecture for privacy-sensitive apps, offline-first tools, and anything deployed to Cloudflare Pages or Netlify where you can't run a vector database anyway.
The tradeoff: you're limited by browser memory and JavaScript performance. Realistically, that caps you at 50,000 embeddings before things get sluggish on older devices. For most developer-focused use cases, that's not a constraint.
The Three-File Implementation
Here's the simplest working version. Three files, no dependencies except a model for generating embeddings.
1. Generate embeddings at build time. Use a local model or an API during your static site generation step. Save the output as JSON.
// build-embeddings.js
import fs from 'node:fs';
import { pipeline } from '@xenova/transformers';

const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

const docs = [
  { id: 1, text: "How to configure Vite for production builds" },
  { id: 2, text: "Understanding React Server Components" },
  // ... your content
];

const embeddings = await Promise.all(
  docs.map(async (doc) => {
    // The pipeline returns a Tensor; its .data property holds the raw floats
    const output = await extractor(doc.text, { pooling: 'mean', normalize: true });
    return { id: doc.id, text: doc.text, vector: Array.from(output.data) };
  })
);

fs.writeFileSync('public/embeddings.json', JSON.stringify(embeddings));
2. Load embeddings in the browser. Fetch the JSON, store in IndexedDB, keep a Map in memory for fast lookups.
// search-index.js
import * as idb from 'idb-keyval'; // tiny IndexedDB helper exposing get()/set()

let embeddingsMap = new Map();

export async function loadIndex() {
  const cached = await idb.get('embeddings-cache');
  if (cached) {
    embeddingsMap = new Map(cached);
    return;
  }
  const response = await fetch('/embeddings.json');
  const data = await response.json();
  embeddingsMap = new Map(data.map(d => [d.id, d]));
  await idb.set('embeddings-cache', Array.from(embeddingsMap.entries()));
}
3. Run cosine similarity for queries. Generate an embedding for the user's search term, compare against your index, return top-k results.
// search.js
import { pipeline } from '@xenova/transformers';

// Same model as the build step: query and document vectors must match
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

async function generateEmbedding(text) {
  const output = await extractor(text, { pooling: 'mean', normalize: true });
  return Array.from(output.data);
}

function cosineSimilarity(a, b) {
  let dotProduct = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

export async function search(query, topK = 10) {
  const queryVector = await generateEmbedding(query);
  return Array.from(embeddingsMap.values())
    .map(doc => ({
      ...doc,
      score: cosineSimilarity(queryVector, doc.vector)
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
That's the core loop. Under a hundred lines of code, total. It works. It's just not fast enough yet.
Optimization 1: Quantize Your Vectors
A 384-dimension float32 vector is 1.5KB. Multiply by 10,000 documents and you're at 15MB of embeddings. That's a slow initial download and a memory hog on mobile.
Quantization cuts that by 75%. Convert each float32 value to an 8-bit integer. You lose <2% accuracy on most retrieval benchmarks, but your embeddings.json drops to 3.8MB.
function quantizeVector(vector) {
  const min = Math.min(...vector);
  const max = Math.max(...vector);
  const scale = 255 / (max - min);
  return {
    // Scaled values span 0-255, so they need an unsigned 8-bit array
    values: new Uint8Array(vector.map(v => Math.round((v - min) * scale))),
    min,
    max
  };
}

function dequantizeVector(quantized) {
  const scale = (quantized.max - quantized.min) / 255;
  return Array.from(quantized.values).map(v => v * scale + quantized.min);
}
Store quantized vectors in your JSON. Dequantize only when computing similarity. The decompression overhead is negligible compared to network transfer time.
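If you want to avoid materializing a full float array per comparison, the dequantization can also be folded into the similarity loop itself. A minimal sketch, assuming the `{ values, min, max }` shape produced by `quantizeVector` above and a float32 query vector:

```javascript
// Cosine similarity between a float query vector and a quantized document
// vector, dequantizing one component at a time instead of the whole array.
function cosineSimilarityQuantized(query, quantized) {
  const scale = (quantized.max - quantized.min) / 255;
  let dot = 0, normQ = 0, normD = 0;
  for (let i = 0; i < query.length; i++) {
    const d = quantized.values[i] * scale + quantized.min; // dequantize on the fly
    dot += query[i] * d;
    normQ += query[i] * query[i];
    normD += d * d;
  }
  return dot / (Math.sqrt(normQ) * Math.sqrt(normD));
}
```

This trades a couple of extra multiplies per component for zero allocation per comparison, which matters when you're scanning thousands of vectors per keystroke.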
Optimization 2: Use a Spatial Index
Brute-force cosine similarity over 10,000 vectors takes 80-120ms in V8. Acceptable for one-off queries, but it blocks the main thread. You need approximate nearest neighbor search.
The simplest approach: HNSW-lite. Hierarchical Navigable Small World graphs are the standard for ANN search, but full implementations like hnswlib are C++ libraries with WebAssembly ports that balloon your bundle size. Instead, build a minimal graph during your build step.
At query time, navigate the graph instead of scanning every vector. On a 10,000-doc index, this cuts search time to 8-15ms with 95%+ recall.
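The navigation itself is a greedy walk: start at an entry node, repeatedly hop to whichever neighbor is closer to the query, and stop at a local optimum. A minimal single-layer sketch; the `graph` adjacency map and `vectors` map are illustrative stand-ins for the structure serialized at build time:

```javascript
// Greedy nearest-neighbor walk over a prebuilt proximity graph.
// graph: Map<id, id[]> adjacency lists; vectors: Map<id, number[]>.
function dot(a, b) {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s; // vectors are normalized, so dot product = cosine similarity
}

function greedySearch(graph, vectors, query, entryId) {
  let current = entryId;
  let currentScore = dot(query, vectors.get(current));
  while (true) {
    let best = null;
    let bestScore = currentScore;
    for (const neighbor of graph.get(current)) {
      const score = dot(query, vectors.get(neighbor));
      if (score > bestScore) { best = neighbor; bestScore = score; }
    }
    if (best === null) return current; // no neighbor is closer: local optimum
    current = best;
    currentScore = bestScore;
  }
}
```

Real HNSW layers multiple graphs of decreasing sparsity on top of this walk and keeps a candidate beam rather than a single node, which is where the recall guarantees come from.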
Alternatively, use a library that's already solved this. Altor-vec implements quantized HNSW in 40KB of JavaScript with zero WebAssembly dependencies.
import fs from 'node:fs';
import { VectorIndex } from 'altor-vec';

const index = new VectorIndex({ dimensions: 384, metric: 'cosine' });

// At build time
embeddings.forEach(doc => index.add(doc.id, doc.vector));
fs.writeFileSync('public/index.bin', index.serialize());

// At query time
const results = index.search(queryVector, 10); // ~12ms
Optimization 3: Web Workers for Non-Blocking Search
Even 12ms blocks the UI if it runs on the main thread. Move search to a Web Worker. The setup is tedious but the payoff is buttery-smooth autocomplete.
// search-worker.js
import { VectorIndex } from 'altor-vec';

let index;

self.onmessage = async (e) => {
  if (e.data.type === 'init') {
    const buffer = await fetch('/index.bin').then(r => r.arrayBuffer());
    index = VectorIndex.deserialize(buffer);
    self.postMessage({ type: 'ready' });
  }
  if (e.data.type === 'search') {
    const results = index.search(e.data.query, e.data.topK);
    self.postMessage({ type: 'results', results });
  }
};
// main.js
const worker = new Worker('/search-worker.js', { type: 'module' });
worker.postMessage({ type: 'init' });

async function search(query, topK = 10) {
  const queryVector = await generateEmbedding(query); // same helper as before
  return new Promise((resolve) => {
    worker.onmessage = (e) => {
      if (e.data.type === 'results') resolve(e.data.results);
    };
    worker.postMessage({ type: 'search', query: queryVector, topK });
  });
}
Now search never blocks rendering. Autocomplete can fire on every keystroke without jank.
Real-World Performance Numbers
Tested on a 2021 MacBook Pro (M1) and a 2019 Pixel 4a. Index: 10,000 documents, 384-dim embeddings, quantized to int8, HNSW graph with M=16 connections per node.
- Initial load: 420ms (includes fetch, deserialize, warm up cache)
- Cold query: 18ms (M1), 34ms (Pixel 4a)
- Warm query: 9ms (M1), 22ms (Pixel 4a)
- Bundle size: 43KB gzipped (index logic + vector math)
- Memory: ~12MB for index in memory
At 50,000 documents, query time climbs to 45ms on the Pixel. Still usable. Beyond that, you're better off sharding the index or moving to a backend.
When Not to Do This
Client-side vector search breaks down when:
- Your index changes frequently (more than once per deploy)
- You need to filter vectors by metadata (user permissions, dates, categories)
- You're embedding images or audio, not text (file sizes explode)
- You're above 100,000 documents and can't shard the index geographically or by category
For those cases, use Pinecone, Qdrant, or a self-hosted Postgres instance with pgvector. This guide is for the 80% of use cases that don't need that complexity.
FAQ
Can I use OpenAI embeddings for this?
Yes, but it's expensive at build time. For 10,000 documents with text-embedding-3-small, you're looking at ~$0.20 per build. Fine for a static site that rebuilds weekly, painful if you're iterating. Transformers.js models like all-MiniLM-L6-v2 run locally for free and produce embeddings 85% as good for retrieval tasks.
How do I handle incremental updates without rebuilding the whole index?
Store a version hash in localStorage. On page load, fetch /embeddings-version.json, compare hashes, and re-download only if changed. For truly incremental updates, you'll need a delta format (new embeddings since last version) and merge logic. At that point, consider a backend.
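The staleness check itself is only a few lines. A sketch, assuming your deploy step writes a file of the shape `{ "hash": "..." }` to `/embeddings-version.json` (the filename and shape are illustrative):

```javascript
// Returns true when the cached embeddings are stale and need re-downloading.
async function indexIsStale() {
  const { hash } = await fetch('/embeddings-version.json').then(r => r.json());
  if (localStorage.getItem('embeddings-version') === hash) {
    return false; // cached copy in IndexedDB is current
  }
  localStorage.setItem('embeddings-version', hash);
  return true;
}
```

Call it before `loadIndex()`: when it returns true, skip the IndexedDB cache and re-fetch `/embeddings.json`.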
What about filtering results by category or tag?
Store metadata alongside each vector in your index. After running ANN search, filter the top-100 results by your criteria, then return the top-10. It's not as efficient as vector databases with native filtering, but it works for simple cases.
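As a sketch, post-filtering is just an over-fetch followed by an array filter; the `meta` field shape here is an assumption about what you stored at build time:

```javascript
// Over-fetch candidates from the ANN index, then filter by metadata.
// Each result is assumed to look like { id, score, meta: { category, tags } }.
function filteredSearch(candidates, predicate, topK = 10) {
  return candidates.filter(predicate).slice(0, topK);
}

// Usage: fetch 100 candidates, keep the best ten in one category.
// const candidates = index.search(queryVector, 100);
// const top = filteredSearch(candidates, r => r.meta.category === 'guides');
```

Because candidates come back sorted by score, the filtered slice preserves ranking; the risk is that a very selective predicate leaves you with fewer than `topK` results, in which case you re-query with a larger candidate count.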
Does this work in Node.js for server-side rendering?
Yes. The same code runs in Node. You can precompute search results for common queries at build time and inline them into HTML. Useful for SSG frameworks like Astro or Eleventy.
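The precompute step can be a short Node script run during the build. A sketch, assuming the `search()` function from earlier is importable there; the query list and output path are hypothetical:

```javascript
// build-common-queries.js — precompute results for frequent queries at build time.
async function precompute(search, commonQueries) {
  const out = {};
  for (const q of commonQueries) {
    out[q] = await search(q, 5); // top 5 results per common query
  }
  return out;
}

// Usage during the build:
// import fs from 'node:fs';
// const results = await precompute(search, ['vite config', 'server components']);
// fs.writeFileSync('public/precomputed-results.json', JSON.stringify(results));
```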
Get started with client-side vector search: npm install altor-vec — Full examples and benchmarks at github.com/Altor-lab/altor-vec