Running ONNX models in the browser without losing your weekend
A working recipe for shipping image segmentation in a tab — Web Workers, WASM, pre-encoded embeddings, and the small things that decide whether the demo is fast or felt-fast.
Most of the AI demos I’ve seen on the web in 2026 punt: they POST an image to a server, render a spinner, then paint the mask. That works, but it costs you per call, leaks user data, and feels like 2018. The alternative — running the model directly in the user’s tab — is shockingly close to boring now. Here’s the recipe I’ve been using.
The shape of the runtime
Three rules that have served me well:
- Run inference in a Web Worker. Otherwise a 60ms decoder pass freezes your animation loop. Workers are cheap, and `onnxruntime-web` ships an ESM-friendly worker entry.
- Pre-encode whatever you can at build time. For interactive segmentation (think Meta’s SAM family), the heavy work is the encoder. Run it once on a server (or your laptop), serialise the resulting tensors to disk, and ship them next to the image. The browser only ever runs the lightweight decoder.
- Use WASM unless you really need WebGPU. WASM is universally supported, debugs cleanly, and is fast enough for decoder-only inference. WebGPU is a nice optimisation when the model is bigger.
Loading the runtime in a worker
```js
import * as ort from "https://cdn.jsdelivr.net/npm/onnxruntime-web@1.21.0/dist/ort.wasm.bundle.min.mjs";

// Tell the runtime where to fetch its .wasm binaries.
ort.env.wasm.wasmPaths =
  "https://cdn.jsdelivr.net/npm/onnxruntime-web@1.21.0/dist/";
ort.env.wasm.numThreads = navigator.hardwareConcurrency || 4;

let session = null;

self.onmessage = async (e) => {
  const { type, data } = e.data;
  if (type === "load") {
    // Fetch the ONNX model and create a WASM-backed inference session.
    const buf = await fetch(data.modelUrl).then((r) => r.arrayBuffer());
    session = await ort.InferenceSession.create(buf, {
      executionProviders: ["wasm"],
    });
    self.postMessage({ type: "ready" });
  }
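
  // Sketch of the inference half: run the decoder when the page posts a click.
  // The input names below are placeholders; they depend on how your SAM
  // decoder was exported, so check session.inputNames against your model.
  // It also assumes the embeddings arrive as Float32Arrays.
  if (type === "point" && session) {
    const feeds = {
      image_embed: new ort.Tensor("float32", data.imageEmbed, [1, 256, 64, 64]),
      high_res_feats_0: new ort.Tensor("float32", data.f0, [1, 32, 256, 256]),
      high_res_feats_1: new ort.Tensor("float32", data.f1, [1, 64, 128, 128]),
      point_coords: new ort.Tensor("float32", Float32Array.from([data.x, data.y]), [1, 1, 2]),
      point_labels: new ort.Tensor("float32", Float32Array.from([1]), [1, 1]),
    };
    const results = await session.run(feeds);
    // Output names vary by export too; ship the first output's raw data back.
    const mask = results[session.outputNames[0]];
    self.postMessage({ type: "mask", data: mask.data, dims: mask.dims });
  }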
};
```

A few things worth noting:
- The CDN import keeps your main bundle tiny — none of `onnxruntime-web` reaches your app code until the worker boots.
- `numThreads` will quietly fall back to 1 unless your page is cross-origin-isolated (the two response headers sketched below). Don’t fight it; single-threaded is fine for a decoder.
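
If you decide threads are worth it, cross-origin isolation is two response headers rather than a code change. A minimal sketch, assuming a Next.js host; any server that can set response headers works the same way:

```js
// next.config.js (assumes a Next.js host). These two headers are the standard
// requirement for crossOriginIsolated pages, nothing ONNX-specific.
module.exports = {
  async headers() {
    return [
      {
        source: "/(.*)",
        headers: [
          { key: "Cross-Origin-Opener-Policy", value: "same-origin" },
          { key: "Cross-Origin-Embedder-Policy", value: "require-corp" },
        ],
      },
    ];
  },
};
```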
Talking to the worker from React
Keep the worker outside React state and lazy-load it via `IntersectionObserver`. The model file is ~3–20MB; you don’t want to fetch it before the section is on screen.
"use client";
import { useEffect, useRef, useState } from "react";
export default function Sam2Tile() {
const [active, setActive] = useState(false);
const wrap = useRef<HTMLDivElement>(null);
useEffect(() => {
if (!wrap.current || active) return;
const io = new IntersectionObserver(([e]) => {
if (e.isIntersecting) {
setActive(true);
io.disconnect();
}
}, { rootMargin: "200px" });
io.observe(wrap.current);
return () => io.disconnect();
}, [active]);
useEffect(() => {
if (!active) return;
const w = new Worker("/workers/sam2-decoder-worker.js", { type: "module" });
w.postMessage({ type: "load", data: { modelUrl: "/models/decoder.onnx" } });
return () => w.terminate();
}, [active]);
return <div ref={wrap}>{active ? "loading…" : "scroll into view"}</div>;
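
In a real tile you also keep the worker in a ref and surface its messages as a status string the JSX can render. A sketch of that wiring, reusing the message names from the worker above; the `status` state and the exact strings are illustrative, not anything the runtime provides:

```tsx
// Inside Sam2Tile, replacing the second effect above. The "ready" / "mask"
// message names match the worker sketch; the status strings are my own.
const workerRef = useRef<Worker | null>(null);
const [status, setStatus] = useState("idle");

useEffect(() => {
  if (!active) return;
  const w = new Worker("/workers/sam2-decoder-worker.js", { type: "module" });
  workerRef.current = w;
  setStatus("Loading SAM2 decoder…");
  w.onmessage = (e) => {
    if (e.data.type === "ready") setStatus("Click anywhere on the image");
    if (e.data.type === "mask") setStatus("Mask ready"); // composite onto the canvas here
  };
  w.postMessage({ type: "load", data: { modelUrl: "/models/decoder.onnx" } });
  return () => w.terminate();
}, [active]);
```

On click, `workerRef.current?.postMessage({ type: "point", data: { x, y, ...tensors } })` is all it takes to kick off a decoder pass.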
Pre-encoding embeddings
For SAM, the trick that makes click-to-segment feel snappy is the encoder / decoder split. The encoder produces three tensors — `image_embed`, `high_res_feats_0`, `high_res_feats_1` — that the decoder consumes alongside your click points. Encode once, serialise as MessagePack, ship next to the image:
```js
import { encode } from "@msgpack/msgpack";
import fs from "node:fs/promises";

// After running the encoder server-side once, pack the raw tensor data
// and shapes into a single binary blob.
const buf = encode({
  tensors: {
    image_embed: { data: imageEmbed, shape: [1, 256, 64, 64] },
    high_res_feats_0: { data: f0, shape: [1, 32, 256, 256] },
    high_res_feats_1: { data: f1, shape: [1, 64, 128, 128] },
  },
  original_size: [imageHeight, imageWidth],
});

await fs.writeFile("public/demo/portrait/embeddings.bin", buf);
```

That `embeddings.bin` is now a static asset. Decoding it in the worker is two lines:
```js
const buf = await fetch(url).then((r) => r.arrayBuffer());
const decoded = decode(new Uint8Array(buf)); // `decode` comes from @msgpack/msgpack
```

The small things that decide whether the demo feels fast
- Lazy-load via `IntersectionObserver`. If the section never enters the viewport, the user never paid for it.
- Show a status string. "Loading SAM2 decoder…", "Loading embeddings…", "Click anywhere on the image". People wait calmly when they know what they’re waiting for.
- Render the image to canvas first. Then composite the mask on top via `globalAlpha`. Avoids the “mask appears, image flashes” flicker; see the sketch after this list.
- Terminate the worker on unmount. Otherwise it keeps a 50MB heap alive in the background while the user reads the next section.
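
A minimal sketch of that draw order; the function name, the 0.5 overlay alpha, and the assumption that the mask has already been rendered into its own canvas are mine:

```ts
// Photo first, mask second, drawn in the same frame so nothing flashes.
function drawComposite(
  ctx: CanvasRenderingContext2D,
  img: HTMLImageElement,
  maskCanvas: HTMLCanvasElement,
) {
  ctx.drawImage(img, 0, 0, ctx.canvas.width, ctx.canvas.height);
  ctx.save();
  ctx.globalAlpha = 0.5; // translucent overlay on top of the photo
  ctx.drawImage(maskCanvas, 0, 0, ctx.canvas.width, ctx.canvas.height);
  ctx.restore();
}
```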
Where to go next
- Swap WASM for WebGPU when you have the model and the user’s device for it. Rendering inside `<canvas>` while inference runs on `webgpu` is the cleanest way to keep frames smooth.
- Use `transformers.js` for pure encoder-side work — it’s the simplest API I’ve found for image classification, embeddings and small text models.
- Cache `ort.InferenceSession` instances when you have multiple tiles. The WASM load is the expensive part; a small cache is sketched after this list.
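
A minimal sketch of that cache: a module-level map keyed by model URL that memoises the creation promise, so concurrent tiles share one load. The names are mine:

```ts
import * as ort from "onnxruntime-web";

// One session per model URL, shared by every tile on the page.
const sessions = new Map<string, Promise<ort.InferenceSession>>();

export function getSession(modelUrl: string): Promise<ort.InferenceSession> {
  let s = sessions.get(modelUrl);
  if (!s) {
    // Memoise the promise, not the resolved session, so concurrent callers
    // don't start duplicate fetches while the first load is in flight.
    s = fetch(modelUrl)
      .then((r) => r.arrayBuffer())
      .then((buf) =>
        ort.InferenceSession.create(buf, { executionProviders: ["wasm"] }),
      );
    sessions.set(modelUrl, s);
  }
  return s;
}
```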
The segmentation tile on my home page follows this exact recipe. Click the portrait — that’s ONNX, in your tab, no server.