Semantic Autocomplete With Client-Side Vector Search
Semantic Autocomplete is one of the clearest places where browser-native vector retrieval can be either a legitimate competitive advantage or a complete architectural mistake. The difference comes down to data boundaries, update cadence, and who should own the search experience. altor-vec is useful here because it turns nearest-neighbor search into a local runtime primitive rather than a network dependency. That means the main implementation questions move from “which vector database should we provision?” to “can the browser safely hold the corpus and produce a query vector fast enough?”
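A quick way to answer the sizing half of that question is to estimate the raw vector payload before committing to client-side retrieval. The sketch below is plain arithmetic; the observation that graph and serialization overhead push the total above the raw figure is based on the published reference numbers later in this article, not on a formula.

```javascript
// Rough client-side footprint estimate for a flat Float32 vector corpus.
// This covers raw vector data only; HNSW graph structure adds overhead on top.
function estimateVectorBytes(numVectors, dim) {
  return numVectors * dim * 4; // 4 bytes per float32 component
}

const raw = estimateVectorBytes(10_000, 384);
console.log((raw / 1024 / 1024).toFixed(1), 'MB raw vector data');
// prints: 14.6 MB raw vector data
```

That ~14.6MB of raw data is consistent with the ~17MB published reference index size once graph structure and serialization overhead are included, which is a useful sanity check before you ship an index to every browser session.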
Install the library first:

```shell
npm install altor-vec
```

Vague claims are useless, so this guide stays concrete. The example code uses tiny four-dimensional vectors to remain runnable without external APIs. In production, you would swap those vectors for real embeddings from a text, image, or multimodal model. The mechanics are identical: build or load an HNSW index, add any new vectors you explicitly want to append, and search locally with a Float32Array.
Architecture diagram
The architecture is intentionally split into offline and online phases. Offline is where you chunk content, generate embeddings, and serialize the index. Online is where the browser loads static assets, turns user input into a query vector, and runs ANN search. Keeping those phases separate prevents you from doing expensive, repeated work at request time and gives you a deployment model that looks more like shipping images or JSON than operating a search cluster.
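As a concrete illustration of the offline phase, the helper below packs chunked content into the two static assets the online phase needs: one flat Float32Array of vectors and one metadata array whose indices match vector IDs. This is a sketch, and the `embed` argument is a hypothetical stand-in for whatever embedding model you run at build time.

```javascript
// Build-time sketch: turn content chunks into (vectors, metadata) assets.
// `embed` is a placeholder for a real embedding call, injected so this
// stays runnable without any model dependency.
function packCorpus(chunks, embed, dim) {
  const vectors = new Float32Array(chunks.length * dim);
  const meta = [];
  chunks.forEach((chunk, i) => {
    const v = embed(chunk.text); // must return exactly `dim` numbers
    vectors.set(v, i * dim);
    meta.push({ id: i, text: chunk.text }); // index i doubles as the vector ID
  });
  return { vectors, meta };
}
```

At deploy time you would serialize `vectors` and `meta` to static files; in the browser, `vectors` goes straight into `WasmSearchEngine.from_vectors()` and `meta` resolves result IDs back to display text.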
Full code example
The example below is small, but it is using the real altor-vec API: init(), WasmSearchEngine.from_vectors(), add_vectors(), and search(). That matters because the adaptation path to a real application is mechanical rather than conceptual.
```javascript
import init, { WasmSearchEngine } from 'altor-vec';

const suggestions = [
  { text: 'refund policy', vector: [1, 0, 0, 0] },
  { text: 'pricing plans', vector: [0, 1, 0, 0] },
  { text: 'embedding models', vector: [0, 0, 1, 0] },
];

// Initialize the WASM module before constructing the engine.
await init();

const dim = 4;
// Flatten all vectors into a single Float32Array; the trailing arguments
// are HNSW tuning parameters.
const engine = WasmSearchEngine.from_vectors(
  new Float32Array(suggestions.flatMap((s) => s.vector)),
  dim,
  16,
  200,
  50,
);

// Incremental update: keep the metadata array and the index in sync so
// vector IDs keep mapping back to the right suggestion.
suggestions.push({ text: 'return window', vector: [0.96, 0.04, 0, 0] });
engine.add_vectors(new Float32Array([0.96, 0.04, 0, 0]), dim);

export function semanticAutocomplete(queryVector) {
  // search() returns a JSON string of [id, distance] pairs.
  const hits = JSON.parse(engine.search(new Float32Array(queryVector), 5));
  return hits.map(([id, distance]) => ({ text: suggestions[id].text, distance }));
}
```

Step-by-step implementation notes
- Install: add `altor-vec` to the project.
- Import: load `init` and `WasmSearchEngine`.
- Create index: flatten vectors into one `Float32Array` and build the HNSW graph.
- Add vectors: append carefully when the runtime genuinely needs local incremental updates.
- Query: pass a query vector into `search()`, parse the JSON result, and map vector IDs back to metadata.
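The last step in that list, mapping IDs back to metadata, is where ID-drift bugs surface, so it is worth isolating. A minimal helper, assuming the `[[id, distance], …]` JSON shape that `search()` returns in the example above:

```javascript
// Resolve raw [id, distance] pairs from the engine into display-ready hits.
// Out-of-range IDs are dropped rather than crashing the suggestion list.
function resolveHits(hitsJson, metadata) {
  const pairs = JSON.parse(hitsJson);
  return pairs
    .filter(([id]) => id >= 0 && id < metadata.length)
    .map(([id, distance]) => ({ text: metadata[id].text, distance }));
}
```

Filtering out-of-range IDs instead of throwing is a deliberate choice: a stale index shipped alongside newer metadata should degrade to fewer suggestions, not a broken input field.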
The biggest conceptual shift is that search becomes a local user-experience feature. That reduces latency and cost, but it also means the browser is now responsible for cache management, memory usage, and any degradation strategy when embeddings or static assets fail to load.
Performance benchmarks
| Metric | Published baseline or operational note |
|---|---|
| HNSW retrieval | altor-vec publishes roughly 0.6ms p95 retrieval in Chrome on 10K vectors with 384 dimensions. |
| Index load | The same published benchmark reports around 19ms to instantiate the engine from serialized bytes. |
| WASM payload | The WebAssembly artifact is around 54KB gzipped, which is small enough that your index and metadata usually dominate transfer size. |
| Reference index size | Published reference size for a 10K / 384d index is about 17MB. Always profile your own metadata, compression, and cache behavior. |
Those numbers are useful because they stop teams from optimizing the wrong thing. In most real interfaces, the slowest operation is not engine.search(); it is building the query embedding, rendering a long result list, hydrating a framework page, or downloading too much metadata. If you want better UX, move embedding generation to a worker, lazy-load large metadata payloads, and keep your result cards light.
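One of those suggestions, lazy-loading large metadata payloads, can be as simple as a memoized fetch so each metadata shard is paid for at most once per session. This is a sketch under assumptions: the `/search-meta/` URL layout is illustrative, and `fetchImpl` is injectable only to keep the helper testable.

```javascript
// Memoized metadata loader: each shard is fetched at most once per session.
const metadataCache = new Map();

function loadMetadataShard(shard, fetchImpl = fetch) {
  if (!metadataCache.has(shard)) {
    // Illustrative URL layout; adapt to however your static assets are deployed.
    const promise = fetchImpl(`/search-meta/${shard}.json`).then((r) => r.json());
    metadataCache.set(shard, promise);
  }
  return metadataCache.get(shard);
}
```

Caching the promise rather than the resolved value means concurrent keystrokes that need the same shard share a single in-flight request instead of racing.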
When this approach works vs when you need a server
Works well: This works when the candidate set is mostly static and semantic intent matters more than live backend analytics.
Needs a server: Use a server when suggestions depend on private history, fast-changing inventory, or tenant-aware ranking.
The honest architectural test is simple: if every browser session is allowed to have the relevant vectors and metadata, local retrieval is a real option. If the answer is no, then client-side search should only be a cache or preview layer. That distinction prevents a lot of “vector search in the browser” experiments from turning into security incidents or disappointing scale stories.
Developer checklist
- Version the index and metadata together so IDs never drift.
- Keep a lexical fallback for exact IDs, filenames, and very short queries.
- Use worker-based embedding or build-time embeddings whenever possible.
- Measure memory on mid-range mobile devices instead of only profiling on desktop Chrome.
- Plan a graceful fallback state when the model or index cannot load.
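The lexical-fallback item in that checklist can be wired as a simple guard in front of the semantic path. This is a sketch, not the library's API: `semanticSearch` stands in for your embedding-plus-`search()` pipeline, and the "looks exact" heuristic is an assumption you should tune against real queries.

```javascript
// Route very short or exact-looking queries (IDs, filenames) to cheap
// prefix matching, and reserve the semantic path for natural-language input.
function suggest(query, corpus, semanticSearch) {
  const q = query.trim().toLowerCase();
  // Heuristic: short strings and tokens with digits, dots, slashes, or
  // hyphens rarely benefit from embeddings.
  const looksExact = q.length < 3 || /[\d_.\/-]/.test(q);
  if (looksExact) {
    return corpus.filter((item) => item.text.toLowerCase().startsWith(q));
  }
  return semanticSearch(q); // embedding + engine.search() path
}
```

The guard also doubles as part of the graceful-fallback plan: if the model or index fails to load, `semanticSearch` can itself fall back to the same lexical filter.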
Conclusion
Semantic Autocomplete is a strong fit for browser-native vector retrieval when the browser is the correct ownership boundary for the data and for the search experience. altor-vec removes the infrastructure layer, but it does not remove the need for good chunking, thoughtful ranking, or realistic evaluation. Used in the right place, though, it is one of the shortest paths from idea to a shipped feature that genuinely feels fast.