Document Search in Node.js with altor-vec
Use altor-vec to add document search to your Node.js project. The index is built with a Node.js script and queried entirely in the browser, with no search server, no API keys, and zero per-query cost. Search works by semantic meaning: it finds articles, docs, or notes that are conceptually related to the user's query, not just keyword matches.
npm install altor-vec @xenova/transformers
Implementation
Server-side indexing script (Node 18+, ESM). It uses a module-level variable for the engine.
// build-doc-index.mjs — Node.js build script to generate search index
// Run: node build-doc-index.mjs
// Output: public/doc-search-index.json
import { pipeline } from '@xenova/transformers';
import init, { WasmSearchEngine } from 'altor-vec';
import { readFileSync, writeFileSync, readdirSync } from 'fs';
import { join, extname, basename } from 'path';
// Load all markdown/text documents
const DOCS_DIR = './content';
const docs = readdirSync(DOCS_DIR)
  .filter(f => extname(f) === '.md')
  .map((f, id) => {
    const raw = readFileSync(join(DOCS_DIR, f), 'utf8');
    const title = raw.split('\n')[0].replace(/^#+\s*/, '');
    const content = raw.slice(0, 1000); // first 1000 chars
    return { id, title, content, filename: f };
  });
console.log(`Embedding ${docs.length} documents...`);
await init();
const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
const DIM = 384;
const vectors = new Float32Array(docs.length * DIM);
for (const [i, doc] of docs.entries()) {
  const out = await embedder(`${doc.title}. ${doc.content}`,
    { pooling: 'mean', normalize: true });
  vectors.set(out.data, i * DIM);
  if (i % 10 === 0) console.log(` ${i}/${docs.length}`);
}
const engine = WasmSearchEngine.from_vectors(vectors, DIM, 16, 200, 50);
writeFileSync('public/doc-search-index.json', engine.to_json());
writeFileSync('public/doc-search-metadata.json', JSON.stringify(docs));
console.log('Done. Files written to public/');
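The script packs every embedding into one flat Float32Array, with document i occupying the slots from i * DIM up to (i + 1) * DIM. As a rough illustration of that layout, the sketch below runs an exhaustive dot-product search over such an array. This is a plain linear scan written for this guide, not altor-vec's index or API; the names `bruteForceSearch`, `toyVectors`, and `toyHits` are hypothetical, and the toy data uses 2 dimensions instead of 384. Because the embeddings are normalized, the dot product equals cosine similarity.

```javascript
// Exhaustive top-k search over a flat Float32Array of unit-length vectors.
// Layout matches the build script: vector i occupies [i * dim, (i + 1) * dim).
// Hypothetical helper for illustration; not part of altor-vec.
function bruteForceSearch(vectors, dim, query, k) {
  const count = vectors.length / dim;
  const scored = [];
  for (let i = 0; i < count; i++) {
    let dot = 0;
    for (let j = 0; j < dim; j++) {
      dot += vectors[i * dim + j] * query[j];
    }
    scored.push({ id: i, score: dot });
  }
  // Highest similarity first.
  scored.sort((a, b) => b.score - a.score);
  return scored.slice(0, k);
}

// Toy data: three 2-dimensional unit vectors standing in for embeddings.
const toyVectors = new Float32Array([1, 0, 0, 1, 0.6, 0.8]);
const toyHits = bruteForceSearch(toyVectors, 2, new Float32Array([1, 0]), 2);
console.log(toyHits.map(h => h.id));
```

A scan like this can serve as a correctness check on a small corpus: embed a query, compare its top results against what the real index returns, and investigate any large disagreement.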
Performance
For 10,000 documents at 384 dimensions, expect ~17MB of memory and <1ms per query, measured on an M2 MacBook Pro in Chrome 124. Mobile is typically 2–4× slower, so test on target devices before deploying.
| Index size | Dimensions | Query p50 | Memory |
|---|---|---|---|
| 1,000 vectors | 384 | ~0.1ms | ~2MB |
| 10,000 vectors | 384 | ~0.4ms | ~17MB |
| 50,000 vectors | 384 | ~0.9ms | ~85MB |
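The memory column can be sanity-checked with simple arithmetic: each vector stores one 4-byte float per dimension. The sketch below computes only that raw vector storage, taking 1MB as 10^6 bytes (an assumption). The function name `rawVectorMegabytes` is hypothetical. The table's figures are somewhat higher than the raw numbers; the difference is presumably the index's own bookkeeping, which is also an assumption rather than something the source states.

```javascript
// Raw storage for `count` Float32 vectors of `dim` dimensions, in MB (10^6 bytes).
// Hypothetical helper; it ignores any overhead the index structure adds.
function rawVectorMegabytes(count, dim) {
  const BYTES_PER_FLOAT32 = 4;
  return (count * dim * BYTES_PER_FLOAT32) / 1e6;
}

for (const count of [1000, 10000, 50000]) {
  console.log(`${count} vectors: ${rawVectorMegabytes(count, 384).toFixed(2)} MB raw`);
}
```

For 10,000 vectors this gives 15.36MB of raw floats against the ~17MB in the table, so the vectors themselves account for most of the footprint.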
When this approach works best
- Documentation sites and knowledge bases with 500–50K pages
- Blog or article archives where keyword search misses conceptual queries
- Offline-first apps that need search to work without a network connection
Limitations
- Index must be rebuilt on every content update (no real-time sync)
- Requires pre-computed embeddings — you need an embedding step at build time
Frequently asked questions
How do I update the document index when content changes?
Rebuild the index at deploy time using a Node.js build script. Call WasmSearchEngine.from_vectors() with the updated embeddings and write the result to public/doc-search-index.json. The browser loads the new index on the next page load.
Can I search PDF or Word documents with altor-vec?
Yes, but you need to extract the text first. Use pdf-parse or mammoth.js to extract plain text, then embed the text chunks with your embedding model, and index the embeddings with altor-vec.
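Once the text is extracted, it has to be cut into chunks before embedding. A minimal sketch of one way to do that is shown here, using fixed-size character windows with some overlap so a sentence split at a boundary still appears whole in one chunk. The function name `chunkText` is hypothetical, the 1000-character size mirrors the build script's truncation, and the 200-character overlap is an arbitrary assumption to tune for your content.

```javascript
// Split text into overlapping chunks of at most `size` characters.
// Hypothetical helper; size and overlap defaults are illustrative assumptions.
function chunkText(text, size = 1000, overlap = 200) {
  if (overlap >= size) throw new RangeError('overlap must be smaller than size');
  const chunks = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    // Stop once this chunk has reached the end of the text.
    if (start + size >= text.length) break;
  }
  return chunks;
}

// Stand-in for text extracted from a PDF or Word file.
const sample = 'a'.repeat(2500);
const pieces = chunkText(sample, 1000, 200);
console.log(pieces.map(p => p.length));
```

Each chunk then becomes its own entry in the vectors array, so the metadata file needs to record which source document a chunk came from.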
How many documents can I search before performance degrades?
altor-vec handles up to ~100K documents comfortably in modern browsers. A 10K-document index at 384 dimensions uses ~17MB RAM and searches in under 1ms. For 100K documents, expect ~170MB and ~1.2ms — test on mobile before deploying.