Running ONNX models in the browser without losing your weekend
A working recipe for shipping image segmentation in a tab — Web Workers, WASM, pre-encoded embeddings, and the small things that decide whether the demo is fast or felt-fast.
Most of the AI demos I’ve seen on the web in 2026 punt: they POST an image to a server, render a spinner, then paint the mask. That works, but it costs you per call, leaks user data, and feels like 2018. The alternative — running the model directly in the user’s tab — is shockingly close to boring now. Here’s the recipe I’ve been using.
The shape of the runtime
Three rules that have served me well:
- Run inference in a Web Worker. Otherwise a 60ms decoder pass freezes your animation loop. Workers are cheap, and `onnxruntime-web` ships an ESM-friendly worker entry.
- Pre-encode whatever you can at build time. For interactive segmentation (think Meta’s SAM family), the heavy work is the encoder. Run it once on a server (or your laptop), serialise the resulting tensors to disk, and ship them next to the image. The browser only ever runs the lightweight decoder.
- Use WASM unless you really need WebGPU. WASM is universally supported, debugs cleanly, and is fast enough for decoder-only inference. WebGPU is a nice optimisation when the model is bigger.
Loading the runtime in a worker
```js
import * as ort from "https://cdn.jsdelivr.net/npm/onnxruntime-web@1.21.0/dist/ort.wasm.bundle.min.mjs";

// Tell the runtime where to fetch its .wasm binaries.
ort.env.wasm.wasmPaths =
  "https://cdn.jsdelivr.net/npm/onnxruntime-web@1.21.0/dist/";
ort.env.wasm.numThreads = navigator.hardwareConcurrency || 4;

let session = null;

self.onmessage = async (e) => {
  const { type, data } = e.data;
  if (type === "load") {
    // Fetch the ONNX model and create a WASM-backed inference session.
    const buf = await fetch(data.modelUrl).then((r) => r.arrayBuffer());
    session = await ort.InferenceSession.create(buf, {
      executionProviders: ["wasm"],
    });
    self.postMessage({ type: "ready" });
  }
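
  // Sketch of the inference half: run the decoder when the page posts a click.
  // The input names below are placeholders; they depend on how your SAM
  // decoder was exported, so check session.inputNames against your model.
  // It also assumes the embeddings arrive as Float32Arrays.
  if (type === "point" && session) {
    const feeds = {
      image_embed: new ort.Tensor("float32", data.imageEmbed, [1, 256, 64, 64]),
      high_res_feats_0: new ort.Tensor("float32", data.f0, [1, 32, 256, 256]),
      high_res_feats_1: new ort.Tensor("float32", data.f1, [1, 64, 128, 128]),
      point_coords: new ort.Tensor("float32", Float32Array.from([data.x, data.y]), [1, 1, 2]),
      point_labels: new ort.Tensor("float32", Float32Array.from([1]), [1, 1]),
    };
    const results = await session.run(feeds);
    // Output names vary by export too; ship the first output's raw data back.
    const mask = results[session.outputNames[0]];
    self.postMessage({ type: "mask", data: mask.data, dims: mask.dims });
  }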
};
```

A few things worth noting:
- The CDN import keeps your main bundle tiny — none of `onnxruntime-web` reaches your app code until the worker boots.
- `numThreads` will quietly fall back to 1 unless your page is cross-origin-isolated (the two response headers sketched below). Don’t fight it; single-threaded is fine for a decoder.
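
If you decide threads are worth it, cross-origin isolation is two response headers rather than a code change. A minimal sketch, assuming a Next.js host; any server that can set response headers works the same way:

```js
// next.config.js (assumes a Next.js host). These two headers are the standard
// requirement for crossOriginIsolated pages, nothing ONNX-specific.
module.exports = {
  async headers() {
    return [
      {
        source: "/(.*)",
        headers: [
          { key: "Cross-Origin-Opener-Policy", value: "same-origin" },
          { key: "Cross-Origin-Embedder-Policy", value: "require-corp" },
        ],
      },
    ];
  },
};
```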
Talking to the worker from React
Keep the worker outside React state and lazy-load it via `IntersectionObserver`. The model file is ~3–20MB; you don’t want to fetch it before the section is on screen.
"use client";
import { useEffect, useRef, useState } from "react";
export default function Sam2Tile() {
const [active, setActive] = useState(false);
const wrap = useRef<HTMLDivElement>(null);
useEffect(() => {
if (!wrap.current || active) return;
const io = new IntersectionObserver(([e]) => {
if (e.isIntersecting) {
setActive(true);
io.disconnect();
}
}, { rootMargin: "200px" });
io.observe(wrap.current);
return () => io.disconnect();
}, [active]);
useEffect(() => {
if (!active) return;
const w = new Worker("/workers/sam2-decoder-worker.js", { type: "module" });
w.postMessage({ type: "load", data: { modelUrl: "/models/decoder.onnx" } });
return () => w.terminate();
}, [active]);
return <div ref={wrap}>{active ? "loading…" : "scroll into view"}</div>;
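
In a real tile you also keep the worker in a ref and surface its messages as a status string the JSX can render. A sketch of that wiring, reusing the message names from the worker above; the `status` state and the exact strings are illustrative, not anything the runtime provides:

```tsx
// Inside Sam2Tile, replacing the second effect above. The "ready" / "mask"
// message names match the worker sketch; the status strings are my own.
const workerRef = useRef<Worker | null>(null);
const [status, setStatus] = useState("idle");

useEffect(() => {
  if (!active) return;
  const w = new Worker("/workers/sam2-decoder-worker.js", { type: "module" });
  workerRef.current = w;
  setStatus("Loading SAM2 decoder…");
  w.onmessage = (e) => {
    if (e.data.type === "ready") setStatus("Click anywhere on the image");
    if (e.data.type === "mask") setStatus("Mask ready"); // composite onto the canvas here
  };
  w.postMessage({ type: "load", data: { modelUrl: "/models/decoder.onnx" } });
  return () => w.terminate();
}, [active]);
```

On click, `workerRef.current?.postMessage({ type: "point", data: { x, y, ...tensors } })` is all it takes to kick off a decoder pass.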
Pre-encoding embeddings
For SAM, the trick that makes click-to-segment feel snappy is the encoder / decoder split. The encoder produces three tensors — `image_embed`, `high_res_feats_0`, `high_res_feats_1` — that the decoder consumes alongside your click points. Encode once, serialise as MessagePack, ship next to the image:
```js
import { encode } from "@msgpack/msgpack";
import fs from "node:fs/promises";

// After running the encoder server-side once, pack the raw tensor data
// and shapes into a single binary blob.
const buf = encode({
  tensors: {
    image_embed: { data: imageEmbed, shape: [1, 256, 64, 64] },
    high_res_feats_0: { data: f0, shape: [1, 32, 256, 256] },
    high_res_feats_1: { data: f1, shape: [1, 64, 128, 128] },
  },
  original_size: [imageHeight, imageWidth],
});

await fs.writeFile("public/demo/portrait/embeddings.bin", buf);
```

That `embeddings.bin` is now a static asset. Decoding it in the worker is two lines:
```js
const buf = await fetch(url).then((r) => r.arrayBuffer());
const decoded = decode(new Uint8Array(buf)); // `decode` comes from @msgpack/msgpack
```

The small things that decide whether the demo feels fast
- Lazy-load via `IntersectionObserver`. If the section never enters the viewport, the user never paid for it.
- Show a status string. "Loading SAM2 decoder…", "Loading embeddings…", "Click anywhere on the image". People wait calmly when they know what they’re waiting for.
- Render the image to canvas first. Then composite the mask on top via `globalAlpha`. Avoids the “mask appears, image flashes” flicker; see the sketch after this list.
- Terminate the worker on unmount. Otherwise it keeps a 50MB heap alive in the background while the user reads the next section.
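
A minimal sketch of that draw order; the function name, the 0.5 overlay alpha, and the assumption that the mask has already been rendered into its own canvas are mine:

```ts
// Photo first, mask second, drawn in the same frame so nothing flashes.
function drawComposite(
  ctx: CanvasRenderingContext2D,
  img: HTMLImageElement,
  maskCanvas: HTMLCanvasElement,
) {
  ctx.drawImage(img, 0, 0, ctx.canvas.width, ctx.canvas.height);
  ctx.save();
  ctx.globalAlpha = 0.5; // translucent overlay on top of the photo
  ctx.drawImage(maskCanvas, 0, 0, ctx.canvas.width, ctx.canvas.height);
  ctx.restore();
}
```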
Where to go next
- Swap WASM for WebGPU when you have the model and the user’s device for it. Rendering inside `<canvas>` while inference runs on `webgpu` is the cleanest way to keep frames smooth.
- Use `transformers.js` for pure encoder-side work — it’s the simplest API I’ve found for image classification, embeddings and small text models.
- Cache `ort.InferenceSession` instances when you have multiple tiles. The WASM load is the expensive part; a small cache is sketched after this list.
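
A minimal sketch of that cache: a module-level map keyed by model URL that memoises the creation promise, so concurrent tiles share one load. The names are mine:

```ts
import * as ort from "onnxruntime-web";

// One session per model URL, shared by every tile on the page.
const sessions = new Map<string, Promise<ort.InferenceSession>>();

export function getSession(modelUrl: string): Promise<ort.InferenceSession> {
  let s = sessions.get(modelUrl);
  if (!s) {
    // Memoise the promise, not the resolved session, so concurrent callers
    // don't start duplicate fetches while the first load is in flight.
    s = fetch(modelUrl)
      .then((r) => r.arrayBuffer())
      .then((buf) =>
        ort.InferenceSession.create(buf, { executionProviders: ["wasm"] }),
      );
    sessions.set(modelUrl, s);
  }
  return s;
}
```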
The segmentation tile on my home page follows this exact recipe. Click the portrait — that’s ONNX, in your tab, no server.