Document Search in Node.js with altor-vec

Use altor-vec to add semantic document search to your Node.js app. The index is built once by a Node.js script at deploy time, and queries then run entirely in the browser, with no search server, no API keys, and zero per-query cost. Users search the collection by meaning and find articles, docs, or notes that are conceptually related to their query, not just keyword matches.

Install: npm install altor-vec @xenova/transformers

Implementation

Build-time indexing script (Node 18+, ESM). It uses top-level await and keeps the engine in a module-level variable.

// build-doc-index.mjs — Node.js build script to generate search index
// Run: node build-doc-index.mjs
// Output: public/doc-search-index.json
import { pipeline } from '@xenova/transformers';
import init, { WasmSearchEngine } from 'altor-vec';
import { readFileSync, writeFileSync, readdirSync } from 'fs';
import { join, extname, basename } from 'path';

// Load all markdown/text documents
const DOCS_DIR = './content';
const docs = readdirSync(DOCS_DIR)
  .filter(f => extname(f) === '.md')
  .map((f, id) => {
    const raw = readFileSync(join(DOCS_DIR, f), 'utf8');
    const title = raw.split('\n')[0].replace(/^#+\s*/, '');
    const content = raw.slice(0, 1000); // first 1000 chars
    return { id, title, content, filename: f };
  });

console.log(`Embedding ${docs.length} documents...`);
await init();
const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
const DIM = 384;
const vectors = new Float32Array(docs.length * DIM);

// Embed each document and pack its vector into the flat array at offset i * DIM
for (const [i, doc] of docs.entries()) {
  const out = await embedder(`${doc.title}. ${doc.content}`,
    { pooling: 'mean', normalize: true });
  vectors.set(out.data, i * DIM);
  if (i % 10 === 0) console.log(`  ${i}/${docs.length}`);
}

const engine = WasmSearchEngine.from_vectors(vectors, DIM, 16, 200, 50);
writeFileSync('public/doc-search-index.json', engine.to_json());
writeFileSync('public/doc-search-metadata.json', JSON.stringify(docs));
console.log('Done. Files written to public/');
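
The script gives each document an id equal to its position in the flat vector array, so a search hit's id can index straight into the array stored in doc-search-metadata.json. The sketch below joins hits back to their metadata. It assumes hits arrive as [id, distance] pairs; that shape is an assumption, so check the altor-vec documentation for what its search call actually returns. attachMetadata is a hypothetical helper, not part of altor-vec.

```javascript
// Hypothetical helper: join search hits to the metadata written by the build script.
// ASSUMPTION: each hit is an [id, distance] pair; verify against the altor-vec docs.
function attachMetadata(hits, docs) {
  const results = [];
  for (const [id, distance] of hits) {
    const doc = docs[id];
    if (!doc) continue; // skip ids that fall outside the metadata array
    results.push({ title: doc.title, filename: doc.filename, distance });
  }
  return results;
}

// Usage with made-up data in the same shape as doc-search-metadata.json
const sampleDocs = [
  { id: 0, title: 'Getting started', content: '...', filename: 'start.md' },
  { id: 1, title: 'Deploying', content: '...', filename: 'deploy.md' },
];
const sampleHits = [[1, 0.12], [0, 0.4], [7, 0.9]]; // id 7 has no metadata
console.log(attachMetadata(sampleHits, sampleDocs));
```

Dropping unknown ids instead of throwing keeps the page usable if the index and metadata files ever come from different builds.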

Performance

10,000 documents at 384 dimensions: ~17MB memory, <1ms per query. Measured on M2 MacBook Pro, Chrome 124. Mobile is typically 2–4× slower — test on target devices before deploying.

Index size	Dimensions	Query p50	Memory
1,000 vectors	384	~0.1ms	~2MB
10,000 vectors	384	~0.4ms	~17MB
50,000 vectors	384	~0.9ms	~85MB
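
As a sanity check on the memory column, the raw vector payload can be computed directly: each vector holds one 32-bit float per dimension. The sketch below covers the vectors only. The gap between its output and the measured figures in the table is presumably index overhead such as graph links; that attribution is an assumption, not something measured here. Both function names are hypothetical.

```javascript
// Raw size of the packed Float32Array the build script hands to the index.
function rawVectorBytes(count, dim) {
  return count * dim * Float32Array.BYTES_PER_ELEMENT; // 4 bytes per float
}

// Convert bytes to decimal megabytes (1 MB = 1,000,000 bytes).
function toMB(bytes) {
  return bytes / 1e6;
}

for (const count of [1000, 10000, 50000]) {
  const mb = toMB(rawVectorBytes(count, 384));
  console.log(`${count} vectors: ${mb.toFixed(2)} MB of raw vector data`);
}
```

For 10,000 vectors at 384 dimensions this gives 15.36 MB of raw data, against the ~17MB measured for the full index.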

When this approach works best

Documentation sites and knowledge bases with 500–50K pages
Blog or article archives where keyword search misses conceptual queries
Offline-first apps that need search to work without a network connection

Limitations

Index must be rebuilt on every content update (no real-time sync)
Requires pre-computed embeddings — you need an embedding step at build time

Frequently asked questions

How do I update the document index when content changes?

Rebuild the index at deploy time with the Node.js build script. Call WasmSearchEngine.from_vectors() with the updated embeddings, write the serialized index to public/doc-search-index.json, and regenerate public/doc-search-metadata.json alongside it so the ids stay in sync. The browser loads the new index on the next page load.
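
Re-embedding every document is the slow part of a rebuild, so it can be worth skipping the rebuild when nothing changed. One possible approach is to fingerprint the content and compare it with the value saved by the previous build. The sketch below is an illustration, not part of altor-vec: contentFingerprint is a hypothetical helper, and it uses an FNV-1a-style 32-bit hash over UTF-16 code units only to stay dependency-free. SHA-256 from node:crypto would be a more robust choice.

```javascript
// Hypothetical helper: order-independent fingerprint of a set of documents.
// files: array of { name, text }. Returns an 8-character hex string.
function contentFingerprint(files) {
  // Sort by name so directory listing order does not affect the result
  const sorted = [...files].sort((a, b) =>
    a.name < b.name ? -1 : a.name > b.name ? 1 : 0);
  let hash = 0x811c9dc5; // FNV offset basis
  for (const { name, text } of sorted) {
    const record = `${name}\0${text}\0`; // NUL separators keep fields distinct
    for (let i = 0; i < record.length; i++) {
      hash ^= record.charCodeAt(i);
      hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep 32 bits
    }
  }
  return hash.toString(16).padStart(8, '0');
}

// Usage: rebuild only when the fingerprint differs from the stored one
const previous = contentFingerprint([{ name: 'a.md', text: '# A' }]);
const current = contentFingerprint([{ name: 'a.md', text: '# B' }]);
console.log(previous === current ? 'index is up to date' : 'rebuild needed');
```

In a build script the previous fingerprint would be read from a small file written next to the index.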

Can I search PDF or Word documents with altor-vec?

Yes, but you need to extract the text first. Use pdf-parse for PDFs or mammoth.js for .docx files to get plain text, split that text into chunks, embed each chunk with your embedding model, and index the embeddings with altor-vec.
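
The build script above embeds only the first 1000 characters of each file, which is rarely enough for text extracted from a long PDF or Word document. A minimal sketch of splitting extracted text into overlapping chunks follows. chunkText is a hypothetical helper, and the 200-character overlap is an assumed default rather than a recommendation from altor-vec.

```javascript
// Hypothetical helper: split text into fixed-size chunks that overlap,
// so a sentence cut at one chunk boundary still appears whole in a neighbor.
function chunkText(text, size = 1000, overlap = 200) {
  if (overlap >= size) {
    throw new RangeError('overlap must be smaller than size');
  }
  const chunks = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}

// Usage: each chunk would be embedded and indexed as its own vector
console.log(chunkText('abcdefghij', 4, 2));
```

Each chunk then needs its own metadata entry (for example the source filename plus a chunk number) so a hit can be traced back to the document it came from.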

How many documents can I search before performance degrades?

altor-vec handles up to ~100K documents comfortably in modern browsers. A 10K-document index at 384 dimensions uses ~17MB RAM and searches in under 1ms. For 100K documents, expect ~170MB and ~1.2ms — test on mobile before deploying.
