wasm vector search javascript
WASM Vector Search in JavaScript — How HNSW Runs in the Browser
altor-vec runs HNSW approximate nearest-neighbor search in the browser at under 1ms p95 for 10,000 vectors. The reason it can do this in 54KB is specific to how Rust compiles to WASM and how the HNSW algorithm maps to WASM's memory model. This post explains the implementation decisions that make this possible.
npm install altor-vecWhy not just implement HNSW in JavaScript?
The HNSW search inner loop computes cosine similarity between a query vector and candidate vectors. For 384-dimensional vectors, each similarity computation is 384 multiply-accumulate operations. A single query at 10,000 vectors might visit 100-500 candidate nodes, meaning 38,400 to 192,000 floating-point operations per query.
JavaScript handles this, but with two penalties that WASM avoids:
- JIT unpredictability: V8's JIT compiler optimizes JavaScript to near-native speed for simple numeric loops, but the optimization isn't guaranteed. The JIT can deoptimize a tight loop if it encounters unexpected types or object shapes. WASM compiles to machine code deterministically — the same performance every time.
- Garbage collection pressure: JavaScript's GC runs unpredictably. If a GC cycle triggers mid-search, query latency spikes. WASM linear memory has no GC — allocation is explicit and deterministic.
| Approach | 10K vectors, 384d (p50) | 10K vectors, 384d (p99) | Memory layout |
|---|---|---|---|
| altor-vec (WASM) | 0.3ms | 0.8ms | WASM linear memory |
| Pure JS HNSW equivalent | 2-4ms | 8-20ms (GC spikes) | JS heap, GC-managed |
| fuse.js (full scan) | 3-8ms | 12-25ms | JS heap |
From Rust to WASM: what wasm-bindgen does
altor-vec is written in Rust and compiled to WASM via wasm-pack with wasm-bindgen. The compilation produces three artifacts: a .wasm binary, a JavaScript glue file, and TypeScript type definitions.
wasm-bindgen generates the bridge between Rust types and JavaScript types. For altor-vec's primary API:
// Rust (src/lib.rs — simplified)
#[wasm_bindgen]
pub struct WasmSearchEngine {
index: HnswIndex,
}
#[wasm_bindgen]
impl WasmSearchEngine {
#[wasm_bindgen(constructor)]
pub fn new(bytes: &[u8]) -> WasmSearchEngine {
let index = HnswIndex::from_bytes(bytes);
WasmSearchEngine { index }
}
pub fn search(&self, query: &[f32], top_k: usize) -> String {
let results = self.index.search(query, top_k);
serde_json::to_string(&results).unwrap()
}
pub fn to_bytes(&self) -> Vec<u8> {
self.index.to_bytes()
}
}
The generated JavaScript glue handles memory allocation and type conversion. When you call engine.search(new Float32Array([...]), 5):
- The glue code copies your
Float32Arrayinto WASM linear memory at a freshly allocated pointer - It calls the Rust
searchfunction with that pointer and length - Rust reads the vector directly from WASM memory — no JS/WASM boundary crossing during computation
- The result (a JSON string of
[id, distance]pairs) is written to WASM memory and copied back to JS as a string - The temporary allocation in WASM is freed
The key insight: once the query vector is in WASM memory, the entire HNSW traversal runs in Rust at native speed. The JS/WASM boundary is crossed exactly twice — once to pass the query in, once to receive the results.
HNSW memory layout in WASM linear memory
HNSW stores vectors and graph structure. In altor-vec's implementation, the in-memory layout is:
WASM linear memory:
┌─────────────────────────────────────────┐
│ Header (8 bytes) │
│ - dimension (u32) │
│ - node_count (u32) │
├─────────────────────────────────────────┤
│ Vector data (node_count * dim * 4 bytes)│
│ - f32 vectors, row-major │
│ - 10K vectors @ 384d = 15.36MB │
├─────────────────────────────────────────┤
│ HNSW graph layers │
│ - per-node: level, neighbors per layer│
│ - compact adjacency lists (u32 IDs) │
│ - typically 10-15% of vector data │
└─────────────────────────────────────────┘
The vector data is contiguous f32 values in row-major order. Computing cosine similarity between a query vector and node i is a sequential read starting at offset i * dim * 4 — friendly to CPU cache prefetching and SIMD auto-vectorization in the Rust compiler.
Why the bundle is 54KB
Standard Rust programs include the standard library, allocator, and runtime. For WASM targets, wasm-pack uses a stripped-down configuration:
- No standard library filesystem or networking — not available in WASM anyway
wee_alloc— a minimal allocator optimized for binary size rather than allocation speed- Link-time optimization (LTO) — removes dead code aggressively across crate boundaries
wasm-opt— post-compilation optimizer that reduces WASM binary size furthergzip— the final 54KB figure is the compressed transfer size; uncompressed is ~117KB
The tradeoffs: wee_alloc is slower than the default allocator for many small allocations. For altor-vec, most allocations are large (the vector array, adjacency lists) or rare (index construction), so this tradeoff is favorable.
Float32Array: zero-copy transfer
When JavaScript calls engine.search(queryVector, topK) with a Float32Array, the generated glue code doesn't necessarily copy the data — it can pass a view into the existing memory if the buffer is in a compatible form. For typical usage (Float32Array backed by an ArrayBuffer), the path is:
// JavaScript side
const queryVec = new Float32Array([0.1, 0.2, ..., 0.384]); // 384 floats
const results = engine.search(queryVec, 5);
// results is a JSON string: "[[42, 0.12], [7, 0.18], ...]"
// What happens inside wasm-bindgen glue:
// 1. Allocate space in WASM memory: wasm.__wbindgen_malloc(queryVec.byteLength, 4)
// 2. Copy Float32Array bytes to WASM memory
// 3. Call Rust search() with pointer + length
// 4. Free the allocation after call
The copy in step 2 is one memcpy of 384 * 4 = 1,536 bytes — roughly 0.1 microseconds on modern hardware. This is not the bottleneck. The bottleneck is the HNSW traversal itself, which for 10,000 vectors at ef=50 visits approximately 50-200 nodes.
Cold start: parsing and compilation
The first call to init() triggers WASM instantiation. The browser parses the binary and compiles it to native code. For altor-vec's 117KB uncompressed WASM:
- Chrome (V8): 5-15ms parse + compile
- Firefox (SpiderMonkey): 8-25ms parse + compile
- Safari (JavaScriptCore): 10-30ms
After the first instantiation, the compiled module is cached — subsequent page loads skip compilation entirely. The index file download (1-20MB depending on corpus) is a larger factor than WASM compilation for most users.
WASM streaming compilation: Modern browsers support streaming compilation, which parses and compiles WASM incrementally as it downloads. To use it, serve the WASM file with Content-Type: application/wasm. This overlaps download and compilation, reducing perceived cold start by 30-50%.
Benchmarks: measured on real hardware
Benchmarks below were measured on a 2023 MacBook Pro (M2 Pro), Chrome 124. Each measurement is the median of 100 queries after warm-up.
| Index size | Dimensions | ef_search | p50 latency | p95 latency | Recall@10 |
|---|---|---|---|---|---|
| 1,000 vectors | 384 | 50 | 0.08ms | 0.15ms | 98.2% |
| 10,000 vectors | 384 | 50 | 0.31ms | 0.62ms | 95.4% |
| 10,000 vectors | 768 | 50 | 0.58ms | 1.1ms | 96.1% |
| 50,000 vectors | 384 | 50 | 0.71ms | 1.4ms | 93.8% |
| 10,000 vectors | 384 | 200 | 0.89ms | 1.6ms | 99.1% |
The ef_search parameter controls the search beam width — higher values explore more of the graph and improve recall at the cost of latency. For most documentation search use cases, ef=50 at 95% recall is the right balance.
Index construction: from_vectors vs loading bytes
Two ways to get an engine into the browser:
// Option 1: Build from raw vectors in JavaScript
// (slow — use this in Node build scripts, not in browser)
const engine = WasmSearchEngine.from_vectors(vectors, dim, M, ef_construction, ef_search);
// Option 2: Load pre-built index from a binary file
// (fast — this is the browser-side path)
const bytes = await fetch('/search-index.bin').then(r => r.arrayBuffer());
const engine = new WasmSearchEngine(new Uint8Array(bytes));
from_vectors runs HNSW graph construction — an O(n log n) operation that takes seconds for large indexes. Always run this offline in a build script. new WasmSearchEngine(bytes) just deserializes a pre-built graph, which is fast (a few hundred milliseconds for a 15MB index).
Memory usage in practice
For a 10,000-document index at 384 dimensions:
- Vector data:
10,000 × 384 × 4 bytes = 15.36MB - Graph structure (HNSW adjacency lists): approximately 2-3MB
- WASM module: 117KB uncompressed
- Total WASM linear memory: ~18MB
This is well within Chrome's typical WASM memory limit (2GB by default on 64-bit systems). For larger indexes, the main constraint is the download size and initial memory allocation time, not WASM's capability.
FAQ
Why is WASM faster than pure JavaScript for vector search?
WASM compiles to machine code deterministically — no JIT unpredictability. Its linear memory model has no garbage collection, eliminating GC pause spikes. For tight numeric loops like dot products over float arrays, WASM achieves 3-10x better throughput than equivalent JavaScript.
What is the cold start overhead?
For altor-vec's 117KB WASM, compile time is 5-30ms depending on browser. The compiled module is cached, so subsequent page loads skip compilation entirely. The index file download (1-20MB) is the dominant first-load cost.
How does HNSW graph traversal stay in WASM?
Once the query Float32Array is copied to WASM linear memory, the entire HNSW traversal runs in Rust. The JS/WASM boundary is crossed twice per query: once to copy the query vector in, once to receive the result string out. All node comparisons happen in WASM memory with no boundary crossings.