---
title: "Making AI feel realtime with hybrid segmentation"
description: "Segmentation is the substrate for nearly every AI photo workflow worth shipping in 2026 — inpainting, object swaps, controlled generation. Here is how to make it feel instant on the web by splitting SAM2 across a notebook on the user's hardware and a decoder in their browser."
date: "2026-05-05"
updated: "2026-05-05"
cover: "/blog/splitting-sam2-encoder-decoder/cover.png"
coverAlt: "Three frames showing interactive segmentation: a click on a penguin photo, then bounding-box and click prompts, then the final magenta mask."
coverCredit: ""
tags:
  - sam2
  - onnx
  - webgpu
  - segmentation
  - image-generation
  - next.js
  - vercel
  - replicate
  - huggingface
author: "Jean Rojas"
---

Most "AI on the web" demos punt the hard part to a server: upload an image,
spin a loader, pray the queue isn't deep, paint the result. That's fine for a
proof of concept and miserable as a product. The real win &mdash; the thing
that turns a model into a tool people use without thinking &mdash; is when the
loop closes in under 100&thinsp;ms and the model feels like a brush, not an
API.

This article is about how to get there for **interactive segmentation**, the
substrate of every modern photo workflow worth shipping: inpainting,
background swap, object replacement, controlled image generation, smart
selection. We'll do it with the best model in its class
(`sam2.1_hiera_large`, 224M parameters), without compromising on quality, and
without paying a GPU bill per click.

The trick is a deliberately weird deployment: the **encoder** runs on the
user's own machine via a notebook, the **decoder** runs in their browser tab.
Two halves of the same model living on opposite sides of the network, joined
by a 16&thinsp;MB binary blob.

<LaunchBar
  liveDemo="https://sam2-hybrid.vercel.app/"
  github="https://github.com/jeanc18rlos/sam2-hybrid"
  colab="https://colab.research.google.com/github/jeanc18rlos/sam2-hybrid/blob/main/notebooks/sam2_encode.ipynb"
/>

## What "segmentation" actually is, and why image generation needs it

Segmentation is the task of assigning a label to every pixel of an image
&mdash; "this pixel is part of the dog, that one is grass, this one is sky."
It's the difference between *knowing* there's a dog (classification),
*knowing roughly where* the dog is (detection, a bounding box), and **knowing
the exact silhouette** of the dog (segmentation, a mask).

<figure className="my-8 not-prose overflow-hidden rounded-2xl border border-stone-200 dark:border-stone-800">
  {/* eslint-disable-next-line @next/next/no-img-element */}
  <img
    src="https://user-images.githubusercontent.com/115161827/229972031-fdf2d0a4-b919-4bd8-88b1-99e284c82e26.gif"
    alt="A user clicking on objects in an image and watching colored masks appear pixel-perfect around each click."
    className="w-full"
  />
  <figcaption className="px-4 py-2 text-[11px] text-stone-500 bg-stone-50 dark:bg-stone-900/60">
    A pixel-level mask is what makes "AI photo editing" actually feel like
    editing. Source: ClickSEG (Apache 2.0).
  </figcaption>
</figure>

That mask is the load-bearing primitive for almost every modern image
workflow:

- **Inpainting & outpainting**: a diffusion model needs to know which pixels
  to regenerate. The mask is the input.
- **Object replacement / removal**: lift the masked region out, run a
  generator on the hole, composite back. Whole product categories
  (Photoroom, Magic Editor, every "Remove background" feature) are this
  pipeline.
- **Controlled image generation**: ControlNet and its descendants take a
  mask as the spatial prior. "Generate a dragon *here*" only works if you
  have a *here*.
- **Annotation pipelines**: training data for any pixel-level model
  (medical imaging, autonomous driving, satellite analysis) is built by
  humans clicking, refining, and exporting masks. Faster clicks = more data.

There are two flavors worth distinguishing. **Semantic segmentation** outputs
one mask per category for the whole image &mdash; "here's all the road,
here's all the sky." **Interactive segmentation** is the one we want today:
the user clicks a point (or drags a box), and the model returns the mask of
*that* specific object, refining as more clicks come in. Add a positive
click to grow the selection; add a negative click to fix a region the model
got wrong.

<figure className="my-8 not-prose overflow-hidden rounded-2xl border border-stone-200 dark:border-stone-800">
  {/* eslint-disable-next-line @next/next/no-img-element */}
  <img
    src="https://user-images.githubusercontent.com/119248312/229991240-9afc6fc9-fc94-45b0-bf96-40d1dda82ba0.jpg"
    alt="A diagram showing positive (green) and negative (red) clicks correcting a mask: false-negative regions are added with green, false-positive regions are removed with red."
    className="w-full"
  />
  <figcaption className="px-4 py-2 text-[11px] text-stone-500 bg-stone-50 dark:bg-stone-900/60">
    Positive (green) clicks recover false negatives; negative (red) clicks
    cut false positives. SAM2's decoder takes both natively. Source:
    ClickSEG (Apache 2.0).
  </figcaption>
</figure>

The progenitor of the modern interactive segmentation lineage is Meta's SAM
(2023), and the direct descendant we're using is **SAM2**, which adds video,
better quality on small objects, and a feature pyramid that's friendly to
the split deployment we're about to build. The broader line of open
interactive segmentation research &mdash; CDNet, FocalClick, RITM,
SegFormer-based clickers, ClickSEG &mdash; is well worth reading if you want
to understand the design space; SAM2 is the current top of that tree.

## What ONNX is, and why it lets us do this at all

The other half of the story is **ONNX** (Open Neural Network Exchange): a
file format and a runtime ecosystem that lets you take a model trained in
PyTorch, serialize its compute graph plus weights to a single artifact, and
run it *anywhere there's an ONNX runtime* &mdash; which now means basically
everywhere:

| Surface              | Runtime                    | Backend                       |
| -------------------- | -------------------------- | ----------------------------- |
| Server / Linux x86   | `onnxruntime` (Py / C++)   | CUDA, TensorRT, OpenVINO, CPU |
| macOS                | `onnxruntime` + CoreML EP  | Apple Neural Engine, Metal    |
| Browser              | `onnxruntime-web`          | WebGPU, WASM, WebGL           |
| iOS / Android native | `onnxruntime-mobile`       | NNAPI, CoreML, XNNPACK        |
| React Native         | `onnxruntime-react-native` | wraps the mobile runtimes     |
| Edge / IoT           | `onnxruntime` ARM builds   | CPU, NPU                      |

The point is not that ONNX makes models faster &mdash; it doesn't, by
itself. The point is that ONNX **decouples** the model from the framework,
so the same `decoder.onnx` file can run via WebGPU in Chrome, via CoreML on
a Mac, via CUDA on a server, and via NNAPI on Android, with no change to the
weights. For a hybrid deployment like the one in this article, that's what
makes the whole shape possible: serialize once, ship the artifact across the
network, let the user's local runtime pick the fastest backend it has.
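
To make that concrete, here's a minimal sketch (TypeScript, using
`onnxruntime-web`, with the `/models/decoder.onnx` path from later in this
article) that treats the ONNX file as the self-describing contract it is:
load it, list its input and output names, and you've verified the export
before writing any inference code.

```ts
// Minimal sketch: an ONNX file is a self-describing contract.
// Loading it and listing its input/output names is the quickest way to
// confirm the export matches what the rest of this article expects.
import * as ort from "onnxruntime-web";

async function describeModel(url: string): Promise<ort.InferenceSession> {
  const session = await ort.InferenceSession.create(url);
  console.log("inputs:", session.inputNames);   // expect image_embed, point_coords, ...
  console.log("outputs:", session.outputNames); // expect masks, iou_predictions
  return session;
}

describeModel("/models/decoder.onnx");
```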

## Try it — right now, in your browser

Before any architecture diagrams &mdash; here's the actual decoder running
locally against a pre-encoded portrait. Left-click adds an include point,
right-click adds an exclude point. The `View` toggle switches between mask
overlay, cutout (transparent background), and erase.

<SegmentationTile />

That's `sam2.1_hiera_large.decoder.onnx` (~16&thinsp;MB) running in your
tab, fed by an embedding bundle that was produced once on a separate
machine. No server is being hit by your clicks; the only network requests
are the decoder weights and the embedding bundle, both fetched once at page
load. Every click after that is local inference.

Stop reading and play with it for a moment. Notice the latency. That's the
entire point.

## The architecture

<Mermaid
  caption="Encoder runs once on the user's hardware, decoder runs once per click in their browser. The 16 MB embedding bundle is the only thing on the wire."
  chart={`flowchart LR
    subgraph U["💻  User machine — Colab or local"]
      direction TB
      I["image.jpg"] --> E["sam2.1 encoder<br/>(~850 MB ONNX)"]
      E --> P["embedding.bin<br/>+ manifest.json<br/>(~16 MB float16)"]
    end
    subgraph V["🌐  Vercel app — Next.js"]
      direction TB
      D["sam2.1 decoder<br/>(~16 MB ONNX)"]
      C["Click / drag prompt"]
      D --> M["mask"]
      C --> D
    end
    P -- "drag-drop / upload / pre-bake" --> D
    style U fill:transparent,stroke-dasharray:4
    style V fill:transparent,stroke-dasharray:4
`}
/>

SAM2 is unusually friendly to this kind of split, and that's not an accident
&mdash; Meta designed the original SAM with interactive use in mind, and
SAM2 inherited the same two-stage shape. Conceptually:

<Mermaid
  chart={`flowchart LR
    A[Image] --> B[Image Encoder]
    B --> E[Image embedding<br/>+ feature pyramid]
    P[Click / box / mask] --> D[Mask Decoder]
    E --> D
    D --> M[Mask<br/>+ IoU score]
`}
/>

The encoder is a heavy Hiera vision transformer that produces a fixed-size
feature pyramid for the whole image. The decoder is a small transformer
that takes those features plus a prompt and produces a mask. Crucially,
**the encoder's output does not depend on the prompt**. Once an image is
encoded, the decoder can run thousands of times with different clicks and
never touch the encoder again. That's the entire asymmetry the architecture
exploits:

| Operation        | Cost (sam2.1 large)              | Frequency           |
| ---------------- | -------------------------------- | ------------------- |
| Encode image     | 8&ndash;10s on Apple Silicon GPU | Once per image      |
| Decode w/ prompt | 30&ndash;60ms on WebGPU          | Once per click/drag |
| Embedding size   | ~16 MB float16 (compressed)      | Transferred once    |
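
To see the asymmetry from the browser's side, here's a hedged sketch that
reuses a single embedding across many prompts and times each decode. It
leans on the `loadDecoder` / `loadEmbedding` / `segment` helpers defined in
Step 4 below; the `"penguin"` slug and the import alias are illustrative.

```ts
// Hedged sketch: one embedding, many decodes. The encoder never runs here.
// Helpers come from Step 4 (lib/sam2-decoder.ts); "@/..." is an assumed path alias.
import { loadDecoder, loadEmbedding, segment } from "@/lib/sam2-decoder";

async function timeDecodes() {
  const session = await loadDecoder();          // decoder weights, fetched once
  const emb = await loadEmbedding("penguin");   // embedding bundle, fetched once

  for (let i = 0; i < 20; i++) {
    const t0 = performance.now();
    await segment(
      session,
      emb,
      [{ x: 380 + i * 10, y: 290, positive: true }],  // a different click each time
      800,   // canvas width the click coordinates are expressed in
      600,   // canvas height
    );
    console.log(`decode ${i}: ${(performance.now() - t0).toFixed(1)} ms`);
  }
}

timeDecodes();
```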

For consumer products, hybrid deployments &mdash; encoder on a backend,
decoder client-side &mdash; are the production-grade pattern that companies
like Labelbox have settled on. For a developer-facing demo, a research
tool, or anything where the audience is technical, we can go one step
further:

> Let the user run the encoder. The image never leaves their machine, the
> GPU bill is theirs (or zero if they use Colab), and we get to ship the
> *large* model on both sides without pushing hundreds of megabytes of
> encoder weights down to the browser.

## Step 1 — Export the models

Pretrained SAM2 checkpoints are PyTorch. We need ONNX, and we need the
encoder/decoder pre-split &mdash; you don't get this from a naïve
`torch.onnx.export` because the official model is a single `Predictor` that
wraps both stages. The `samexporter` package handles the surgery.

<LaunchBar
  liveDemo="https://sam2-hybrid.vercel.app/"
  colab="https://colab.research.google.com/github/jeanc18rlos/sam2-hybrid/blob/main/notebooks/sam2_encode.ipynb"
  hfSpace="https://huggingface.co/spaces/jrojastechnology/sam2-encoder"
  notebook="https://github.com/jeanc18rlos/sam2-hybrid/raw/main/notebooks/sam2_encode.ipynb"
/>

The notebook is a single file with five cells. Pick a tab to read each one;
they run in order top-to-bottom and are all you need to get from a JPEG to
the deployable embedding bundle:

<CodeTabs
  caption="The full encoder notebook — install, download checkpoint, export to ONNX, encode an image, package for the browser."
  files={[
    {
      name: "1_install.sh",
      language: "bash",
      code: `# Cell 1 — pin a working torch + the ONNX toolchain.
pip install --quiet \\
    torch==2.4.0 \\
    torchvision==0.19.0 \\
    onnx onnxscript onnxsim onnxruntime samexporter

pip install --quiet git+https://github.com/facebookresearch/segment-anything-2.git`,
    },
    {
      name: "2_download.py",
      language: "python",
      code: `# Cell 2 — pull the SAM2.1 Hiera Large checkpoint (~900 MB).
import urllib.request
from pathlib import Path

CHECKPOINT_URL = "https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt"
ckpt_path = Path("original_models/sam2.1_hiera_large.pt")
ckpt_path.parent.mkdir(exist_ok=True)

if not ckpt_path.exists():
    print("Downloading sam2.1_hiera_large.pt (~900 MB)...")
    urllib.request.urlretrieve(CHECKPOINT_URL, ckpt_path)
print(f"Checkpoint at {ckpt_path}")`,
    },
    {
      name: "3_export.sh",
      language: "bash",
      code: `# Cell 3 — split into encoder.onnx + decoder.onnx via samexporter.
# The encoder exceeds the 2 GB ONNX limit, so external .onnx.data weights
# are emitted alongside it; both files must travel together.
python -m samexporter.export_sam2 \\
    --checkpoint original_models/sam2.1_hiera_large.pt \\
    --output_encoder output_models/sam2.1_hiera_large.encoder.onnx \\
    --output_decoder output_models/sam2.1_hiera_large.decoder.onnx \\
    --model_type sam2.1_hiera_large`,
    },
    {
      name: "4_encode.py",
      language: "python",
      code: `# Cell 4 — run the encoder on an image; get the SAM2 feature pyramid.
import numpy as np
import onnxruntime as ort
from PIL import Image

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD  = np.array([0.229, 0.224, 0.225], dtype=np.float32)
INPUT_SIZE = 1024

def preprocess(image_path):
    img = Image.open(image_path).convert("RGB")
    original_size = img.size  # (width, height)
    img = img.resize((INPUT_SIZE, INPUT_SIZE), Image.BILINEAR)
    arr = np.array(img, dtype=np.float32) / 255.0
    arr = (arr - MEAN) / STD
    arr = arr.transpose(2, 0, 1)[None]  # (1, 3, 1024, 1024)
    return arr.astype(np.float32), original_size

# onnxruntime picks the fastest provider available — CUDA / CoreML / CPU.
encoder = ort.InferenceSession(
    "output_models/sam2.1_hiera_large.encoder.onnx",
    providers=ort.get_available_providers(),
)

input_tensor, original_size = preprocess("my_photo.jpg")
high_res_0, high_res_1, image_embed = encoder.run(None, {"image": input_tensor})

# image_embed       (1, 256,  64,  64)  — main embedding
# high_res_feats_0  (1,  32, 256, 256)  — high-res pyramid level 0
# high_res_feats_1  (1,  64, 128, 128)  — high-res pyramid level 1`,
    },
    {
      name: "5_bundle.py",
      language: "python",
      code: `# Cell 5 — pack as float16 + raw binary + JSON manifest for the browser.
import json
from PIL import Image

tensors = {
    "image_embed":      image_embed.astype(np.float16),
    "high_res_feats_0": high_res_0.astype(np.float16),
    "high_res_feats_1": high_res_1.astype(np.float16),
}

manifest = {
    "preview": "preview.jpg",
    "originalWidth":  original_size[0],
    "originalHeight": original_size[1],
    "tensors": {},
}

with open("embedding.bin", "wb") as f:
    offset = 0
    for name, arr in tensors.items():
        manifest["tensors"][name] = {
            "offset": offset,
            "shape":  list(arr.shape),
            "dtype":  "float16",
        }
        f.write(arr.tobytes())
        offset += arr.nbytes
    manifest["totalBytes"] = offset

with open("manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)

# Save a downscaled preview so the browser has something to render under the mask.
preview = Image.open("my_photo.jpg").convert("RGB")
preview.thumbnail((1600, 1600))
preview.save("preview.jpg", quality=85)

print("Upload these three files: embedding.bin, manifest.json, preview.jpg")`,
    },
  ]}
/>

A few things worth knowing about this step that the official docs gloss
over:

- **The 2&thinsp;GB ONNX size limit kicks in here.** The encoder exceeds it,
  so the exporter automatically produces an `.onnx` file plus an external
  `.onnx.data` file containing the weights. Both must live in the same
  directory at load time. Combined size lands around 850&thinsp;MB &mdash;
  manageable on disk, untenable as a browser download. This is one of the
  reasons we keep the encoder off the client.
- **`samexporter` is doing real work**, not just a re-export. It splits the
  graph at the right boundary, sets up dynamic axes for variable point
  counts in the decoder, and configures `multimask_output=True` so the
  decoder returns three candidate masks plus IoU predictions per call.
- **The decoder is tiny.** ~16&thinsp;MB after export, slots comfortably
  into a Vercel deployment's static assets. Vercel's CDN gzips it on the
  way down to ~12&thinsp;MB.

## Step 2 — Encoding details, in plain English

The encoder takes a fixed `1024×1024` RGB tensor (normalized with ImageNet
statistics) and returns three outputs &mdash; the SAM2 feature pyramid:

- `image_embed`: shape `(1, 256, 64, 64)` &mdash; the main embedding.
- `high_res_feats_0`: shape `(1, 32, 256, 256)` &mdash; high-resolution
  features for the decoder.
- `high_res_feats_1`: shape `(1, 64, 128, 128)` &mdash; another resolution
  level.

If you've worked with SAM1 before, this catches you off guard &mdash; SAM1
ships a single embedding tensor; SAM2 plumbs three. All three are needed by
the decoder.

The naïve packaging is to save them as float32, which gives you a
~50&thinsp;MB file per image. We do better with two tricks: cast to float16
(SAM2's decoder is robust to it; size halves), and use a flat binary layout
with a JSON manifest beside it (avoids a numpy reader in the browser).
Combined, the bundle lands at ~12&ndash;18&thinsp;MB depending on the
image's high-frequency content.
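
Before handing a bundle to the decoder it's worth a cheap sanity check. A
minimal sketch, assuming the manifest fields written by cell 5 of the
notebook (`tensors`, `offset`, `shape`, `dtype`, `totalBytes`):

```ts
// Hedged sketch: validate a downloaded bundle against its manifest before decoding.
// Field names match what cell 5 of the notebook writes.
interface TensorEntry { offset: number; shape: number[]; dtype: string }
interface Manifest {
  originalWidth: number;
  originalHeight: number;
  totalBytes: number;
  tensors: Record<string, TensorEntry>;
}

function validateBundle(manifest: Manifest, buffer: ArrayBuffer): void {
  const expected = ["image_embed", "high_res_feats_0", "high_res_feats_1"];
  for (const name of expected) {
    const t = manifest.tensors[name];
    if (!t) throw new Error(`manifest is missing tensor "${name}"`);
    if (t.dtype !== "float16") throw new Error(`${name}: expected float16, got ${t.dtype}`);
    const bytes = t.shape.reduce((a, b) => a * b, 1) * 2;  // float16 = 2 bytes per element
    if (t.offset + bytes > buffer.byteLength) {
      throw new Error(`${name}: offset + size overruns embedding.bin`);
    }
  }
  if (manifest.totalBytes !== buffer.byteLength) {
    throw new Error(`manifest says ${manifest.totalBytes} bytes, file is ${buffer.byteLength}`);
  }
}
```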

## Step 3 — ONNX details that shape the JS code

The decoder's interface is the contract you'll be coding against in the
browser. **Inputs:**

| Name              | Shape                  | Notes                                  |
| ----------------- | ---------------------- | -------------------------------------- |
| `image_embed`     | `(1, 256, 64, 64)`     | from the encoder                       |
| `high_res_feats_0`| `(1, 32, 256, 256)`    | from the encoder                       |
| `high_res_feats_1`| `(1, 64, 128, 128)`    | from the encoder                       |
| `point_coords`    | `(1, N, 2)`            | in 1024×1024 model space               |
| `point_labels`    | `(1, N)`               | `1`=fg, `0`=bg, `-1`=padding            |
| `mask_input`      | `(1, 1, 256, 256)`     | previous mask, or zeros                |
| `has_mask_input`  | `(1,)`                 | `0.0` ignore, `1.0` use                |

**Outputs:** `masks` `(1, 3, 256, 256)` float32 logits (threshold at 0,
upsample), and `iou_predictions` `(1, 3)` float32 (pick the top one).

> **Resolution upgrade we ship in the template.** The default
> samexporter export emits 256×256 mask logits — fine for line work but
> visibly stair-stepped on a phone-camera selfie at 1500+ display px.
> The repo ships `scripts/export_decoder_hires.py`, a one-file
> subclass that adds an in-graph `F.interpolate(size=512)` on the mask
> output. Same weights, same encoder bundle (`image_embed` /
> `high_res_feats_*` are unchanged), but `decoder.onnx` now returns
> `(1, 3, 512, 512)`. Halving the upscale factor to display kills the
> source-grid checker that SAM2's clamped logits would otherwise
> produce in textured regions.

**Coordinate space.** Points must be in the **1024×1024 model input
space**, not the original image's space and not the canvas's pixel space.
This catches everyone the first time:

```ts
const x = (cx / canvasWidth)  * 1024;
const y = (cy / canvasHeight) * 1024;
```

**Encoder vs decoder optimization.** For the encoder we'd convert the
`.onnx` to ONNX Runtime's `.ort` format with
`python -m onnxruntime.tools.convert_onnx_models_to_ort` for faster load.
**For the decoder this conversion sometimes fails on a `Concat` node**
&mdash; the plain `.onnx` works in the browser, so we leave it alone.
This is a community-known quirk that the official docs don't really mention.

## Step 4 — The Vercel app

The web app is a Next.js project deployed to Vercel. Structure:

```text
/app/page.tsx                # main UI: image picker + canvas
/components/Segmenter.tsx    # canvas + click handling
/lib/sam2-decoder.ts         # ONNX session + inference
/public/models/decoder.onnx  # ~16 MB, ships with the deploy
/public/demos/<slug>/        # preview.jpg + embedding.bin + manifest.json
/public/workers/sam2-decoder-worker.js
```

A note on Vercel limits: 100&thinsp;MB hard cap on serverless function
bundles; static assets are looser. The 16&thinsp;MB decoder is fine. Each
demo embedding is ~16&thinsp;MB; you can ship 3&ndash;5 of these comfortably.
For more, push to R2/S3 and serve via CDN.
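
One deployment detail worth adding, sketched here under the assumption that
your Next.js version accepts a TypeScript config and that your host honors
`next.config` headers for `/public` assets: mark the decoder and demo
bundles as immutable so returning visitors never re-download them.

```ts
// next.config.ts — hedged sketch: cache the model and demo bundles aggressively.
// Assumes a Next.js version with TypeScript config support; the same headers()
// block works in a plain next.config.js otherwise.
import type { NextConfig } from "next";

const config: NextConfig = {
  async headers() {
    return [
      {
        source: "/models/:path*",   // decoder.onnx
        headers: [{ key: "Cache-Control", value: "public, max-age=31536000, immutable" }],
      },
      {
        source: "/demos/:path*",    // embedding.bin, manifest.json, preview.jpg
        headers: [{ key: "Cache-Control", value: "public, max-age=31536000, immutable" }],
      },
    ];
  },
};

export default config;
```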

<LaunchBar
  liveDemo="https://sam2-hybrid.vercel.app/"
  vercelDeploy="https://vercel.com/new/clone?repository-url=https%3A%2F%2Fgithub.com%2Fjeanc18rlos%2Fsam2-hybrid&project-name=sam2-hybrid&repository-name=sam2-hybrid&env=NEXT_PUBLIC_REPLICATE_MODEL,REPLICATE_API_TOKEN&envDescription=Optional%20Replicate%20fallback%20for%20users%20who%20do%20not%20want%20to%20run%20the%20notebook"
  github="https://github.com/jeanc18rlos/sam2-hybrid"
/>

The Deploy-to-Vercel button clones the template repo, prompts you for two
optional env vars (`REPLICATE_API_TOKEN` and `NEXT_PUBLIC_REPLICATE_MODEL`),
and you get a working deployment with the demo embeddings baked in. If you
skip the Replicate vars, the app still works for pre-baked demos and
notebook-uploaded files.

### Loading the decoder on WebGPU

In the browser we use `onnxruntime-web` with the WebGPU execution provider.
The whole decoder lifecycle is small enough to fit in one file:

```ts title="lib/sam2-decoder.ts" showLineNumbers
import * as ort from "onnxruntime-web/webgpu";

export interface Embedding {
  imageEmbed: ort.Tensor;
  highResFeats0: ort.Tensor;
  highResFeats1: ort.Tensor;
  originalWidth: number;
  originalHeight: number;
}

export async function loadDecoder(): Promise<ort.InferenceSession> {
  return ort.InferenceSession.create("/models/decoder.onnx", {
    executionProviders: ["webgpu"],
    graphOptimizationLevel: "all",
  });
}

export async function loadEmbedding(slug: string): Promise<Embedding> {
  const [manifest, buffer] = await Promise.all([
    fetch(`/demos/${slug}/manifest.json`).then((r) => r.json()),
    fetch(`/demos/${slug}/embedding.bin`).then((r) => r.arrayBuffer()),
  ]);

  const toTensor = (name: string) => {
    const t = manifest.tensors[name];
    const f16 = new Uint16Array(buffer, t.offset, prod(t.shape));
    // ORT-web doesn't accept fp16 ArrayBuffers directly for all ops;
    // expand to fp32 once on load (~50ms for a large embedding).
    const f32 = new Float32Array(f16.length);
    for (let i = 0; i < f16.length; i++) f32[i] = f16ToF32(f16[i]);
    return new ort.Tensor("float32", f32, t.shape);
  };

  return {
    imageEmbed:    toTensor("image_embed"),
    highResFeats0: toTensor("high_res_feats_0"),
    highResFeats1: toTensor("high_res_feats_1"),
    originalWidth: manifest.originalWidth,
    originalHeight: manifest.originalHeight,
  };
}

export async function segment(
  session: ort.InferenceSession,
  emb: Embedding,
  clicks: { x: number; y: number; positive: boolean }[],
  canvasWidth: number,
  canvasHeight: number,
): Promise<Float32Array> {
  const coords = new Float32Array(clicks.length * 2);
  const labels = new Float32Array(clicks.length);
  clicks.forEach((c, i) => {
    coords[i * 2]     = (c.x / canvasWidth)  * 1024;
    coords[i * 2 + 1] = (c.y / canvasHeight) * 1024;
    labels[i]         = c.positive ? 1 : 0;
  });

  const out = await session.run({
    image_embed:      emb.imageEmbed,
    high_res_feats_0: emb.highResFeats0,
    high_res_feats_1: emb.highResFeats1,
    point_coords:     new ort.Tensor("float32", coords, [1, clicks.length, 2]),
    point_labels:     new ort.Tensor("float32", labels, [1, clicks.length]),
    mask_input:       new ort.Tensor("float32", new Float32Array(256 * 256), [1, 1, 256, 256]),
    has_mask_input:   new ort.Tensor("float32", new Float32Array([0]), [1]),
  });

  const masks = out.masks.data as Float32Array;            // (1, 3, 256, 256)
  const iou   = out.iou_predictions.data as Float32Array;  // (1, 3)
  let best = 0;
  for (let i = 1; i < 3; i++) if (iou[i] > iou[best]) best = i;
  return masks.slice(best * 256 * 256, (best + 1) * 256 * 256);
}

const prod = (arr: number[]) => arr.reduce((a, b) => a * b, 1);

function f16ToF32(h: number): number {
  const s = (h & 0x8000) >> 15;
  const e = (h & 0x7c00) >> 10;
  const f = h & 0x03ff;
  if (e === 0) return (s ? -1 : 1) * Math.pow(2, -14) * (f / 1024);
  if (e === 0x1f) return f ? NaN : (s ? -Infinity : Infinity);
  return (s ? -1 : 1) * Math.pow(2, e - 15) * (1 + f / 1024);
}
```

Three operational notes:

- **Run the decoder in a Web Worker.** A 60&thinsp;ms forward pass on the
  main thread eats your animation budget. The
  [previous post](/blog/onnx-in-the-browser) covers that pattern.
- **Throttle drag interactions with `requestAnimationFrame`.** Calling on
  every `mousemove` queues dozens of inference jobs and stutters.
- **Upsample the mask via `<canvas>` `drawImage`.** The browser's bilinear
  resize is good enough and free; a minimal sketch follows below.
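
Here's what that last bullet looks like in practice: a minimal sketch that
thresholds the 256×256 logits into a tinted overlay and lets `drawImage` do
the upscale (the magenta tint and alpha value are arbitrary choices).

```ts
// Hedged sketch: paint a 256×256 logit mask onto a full-size canvas.
// The browser's built-in smoothing in drawImage does the upsampling.
function drawMask(
  logits: Float32Array,             // 256*256 logits returned by segment()
  target: HTMLCanvasElement,        // display canvas, already showing the image
  color: [number, number, number] = [236, 72, 153],  // overlay tint
): void {
  const SIZE = 256;
  const off = document.createElement("canvas");
  off.width = SIZE;
  off.height = SIZE;
  const offCtx = off.getContext("2d")!;

  const img = offCtx.createImageData(SIZE, SIZE);
  for (let i = 0; i < SIZE * SIZE; i++) {
    const inside = logits[i] > 0;           // threshold logits at 0
    img.data[i * 4 + 0] = color[0];
    img.data[i * 4 + 1] = color[1];
    img.data[i * 4 + 2] = color[2];
    img.data[i * 4 + 3] = inside ? 140 : 0; // alpha only where the mask is on
  }
  offCtx.putImageData(img, 0, 0);

  const ctx = target.getContext("2d")!;
  ctx.imageSmoothingEnabled = true;         // smooth the 256 -> display-size upscale
  ctx.drawImage(off, 0, 0, target.width, target.height);
}
```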

## Step 5 — The interactivity loop

The promise the architecture makes is that the click-to-mask loop is
**instantaneous**. Here's what happens, end-to-end:

<InteractivityFlow />

There's no spinner, no loading state, no API roundtrip. The latency of
"round-trip to a server, run inference, send back a PNG" &mdash; what most
cloud SAM deployments give you &mdash; is replaced with sub-100&thinsp;ms
local inference. *That* is what makes it feel like a tool instead of a demo.
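
Wired up, the loop is just a pointer handler that keeps at most one decode
in flight and always processes the latest position. A hedged sketch,
assuming `session` and `emb` from Step 4 are in scope and `drawMask` is the
overlay helper sketched above:

```ts
// Hedged sketch: keep one decode in flight, always serve the newest pointer position.
// `session` and `emb` come from Step 4; `segment` and `drawMask` are defined above.
let pending: { x: number; y: number } | null = null;
let busy = false;

async function flush(canvas: HTMLCanvasElement) {
  if (busy || !pending) return;
  busy = true;
  const { x, y } = pending;
  pending = null;
  const mask = await segment(session, emb, [{ x, y, positive: true }], canvas.width, canvas.height);
  drawMask(mask, canvas);
  busy = false;
  if (pending) requestAnimationFrame(() => flush(canvas));  // catch up with the latest move
}

function onPointerMove(e: PointerEvent, canvas: HTMLCanvasElement) {
  const rect = canvas.getBoundingClientRect();
  pending = { x: e.clientX - rect.left, y: e.clientY - rect.top };
  requestAnimationFrame(() => flush(canvas));
}
```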

### Picking the right execution provider

`onnxruntime-web` ships several backends, and the choice matters:

| Provider | Cost on the large decoder    | Where it works                          |
| -------- | ---------------------------- | --------------------------------------- |
| WebGPU   | 30&ndash;60&thinsp;ms       | Chrome, Edge, Safari TP, modern Firefox |
| WASM     | 200&ndash;400&thinsp;ms     | Everywhere; SIMD makes it tolerable     |
| WebGL    | (deprecated for this use)    | Skip                                    |

The pragmatic pattern is WebGPU first, WASM fallback:

```ts
async function createSession(modelUrl: string) {
  try {
    return await ort.InferenceSession.create(modelUrl, {
      executionProviders: ["webgpu"],
    });
  } catch (e) {
    console.warn("WebGPU unavailable, falling back to WASM:", e);
    return await ort.InferenceSession.create(modelUrl, {
      executionProviders: ["wasm"],
    });
  }
}
```

For the article's demo I'd gate the experience behind a "WebGPU recommended"
banner whenever the session falls back to WASM. The decoder is small enough
to run on CPU, but drag interactions feel laggy and it's worth telling the
user why.
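
The detection itself is a single check on `navigator.gpu` before creating
the session; a minimal sketch (the banner state setter is illustrative):

```ts
// Hedged sketch: detect whether WebGPU is actually usable before creating the
// ORT session, so the UI can explain why drags will feel slower on WASM.
async function webGpuAvailable(): Promise<boolean> {
  const gpu = (navigator as any).gpu;       // cast: lib.dom may not ship WebGPU types
  if (!gpu) return false;                   // API not present at all
  try {
    const adapter = await gpu.requestAdapter();
    return adapter !== null;                // present but blocked / software-only => null
  } catch {
    return false;
  }
}

// Usage (illustrative): show the "WebGPU recommended" banner when falling back.
// if (!(await webGpuAvailable())) setShowWasmBanner(true);
```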

## Step 6 — Replicate as an alternative to the notebook

Some users won't run a notebook, period. For them, the encoder needs to live
behind an API call. The cleanest pattern is **bring-your-own Replicate
token**: we ship a tiny encoder model on Replicate that takes an image and
returns the bundle, the user pastes their own API token in the web app, and
the upload-image flow becomes one HTTP call instead of "open Colab."

Three reasons this is the right pattern:

1. **Cost stays with the user.** No GPU bill on our side. Replicate charges
   the user's account directly.
2. **Privacy is honest.** Their image goes to Replicate, not to us. Our
   server never sees the bytes.
3. **Trivial to swap.** Same `embedding.bin` + `manifest.json` shape as the
   notebook produces. The decoder code in the browser doesn't change.

<LaunchBar
  liveDemo="https://sam2-hybrid.vercel.app/"
  replicate="https://replicate.com/jrojastechnology/sam2-encoder"
  hfSpace="https://huggingface.co/spaces/jrojastechnology/sam2-encoder"
/>

`cog` is Replicate's tool for packaging a model. Here's the whole encoder
service plus the browser-side call that consumes it &mdash; four files, one
push command:

<CodeTabs
  caption="Encoder microservice for Replicate. Same predict() works as a Hugging Face Space if you wrap it in a Gradio app."
  files={[
    {
      name: "cog.yaml",
      language: "yaml",
      code: `build:
  gpu: true
  cuda: "12.1"
  python_version: "3.11"
  python_packages:
    - "torch==2.4.0"
    - "torchvision==0.19.0"
    - "onnxruntime-gpu==1.20.0"
    - "Pillow==10.4.0"
    - "numpy==1.26.4"
predict: "predict.py:Predictor"`,
    },
    {
      name: "predict.py",
      language: "python",
      code: `from cog import BasePredictor, Input, Path
from PIL import Image
import numpy as np
import onnxruntime as ort
import json, io, zipfile

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD  = np.array([0.229, 0.224, 0.225], dtype=np.float32)
INPUT_SIZE = 1024

class Predictor(BasePredictor):
    def setup(self):
        self.session = ort.InferenceSession(
            "weights/sam2.1_hiera_large.encoder.onnx",
            providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
        )

    def predict(self, image: Path = Input(description="Image to encode")) -> Path:
        img = Image.open(image).convert("RGB")
        ow, oh = img.size

        x = img.resize((INPUT_SIZE, INPUT_SIZE), Image.BILINEAR)
        arr = (np.array(x, dtype=np.float32) / 255.0 - MEAN) / STD
        arr = arr.transpose(2, 0, 1)[None].astype(np.float32)
        h0, h1, emb = self.session.run(None, {"image": arr})

        tensors = {
            "image_embed":      emb.astype(np.float16),
            "high_res_feats_0": h0.astype(np.float16),
            "high_res_feats_1": h1.astype(np.float16),
        }

        manifest = {"originalWidth": ow, "originalHeight": oh, "tensors": {}}
        bin_buf, offset = io.BytesIO(), 0
        for name, t in tensors.items():
            manifest["tensors"][name] = {
                "offset": offset, "shape": list(t.shape), "dtype": "float16",
            }
            bin_buf.write(t.tobytes())
            offset += t.nbytes
        manifest["totalBytes"] = offset

        out = Path("/tmp/bundle.zip")
        with zipfile.ZipFile(out, "w") as zf:
            zf.writestr("manifest.json", json.dumps(manifest))
            zf.writestr("embedding.bin", bin_buf.getvalue())
        return out`,
    },
    {
      name: "push.sh",
      language: "bash",
      code: `# One-time setup, then push to Replicate.
cog login
cog push r8.im/jrojastechnology/sam2-encoder

# After the push completes, copy the version hash that's printed
# and put it in your web app's NEXT_PUBLIC_REPLICATE_MODEL env var.`,
    },
    {
      name: "encodeViaReplicate.ts",
      language: "typescript",
      code: `// In the browser: encode an uploaded image with the user's own token.
async function encodeViaReplicate(file: File, token: string) {
  // 1. Kick off the prediction with the user's token.
  const start = await fetch("https://api.replicate.com/v1/predictions", {
    method: "POST",
    headers: {
      Authorization: \`Token \${token}\`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      version: process.env.NEXT_PUBLIC_REPLICATE_MODEL,
      input: { image: await fileToDataUrl(file) },
    }),
  }).then((r) => r.json());

  // 2. Poll until done. (Or use webhooks; polling is simpler for a demo.)
  let prediction = start;
  while (prediction.status !== "succeeded" && prediction.status !== "failed") {
    await new Promise((r) => setTimeout(r, 800));
    prediction = await fetch(prediction.urls.get, {
      headers: { Authorization: \`Token \${token}\` },
    }).then((r) => r.json());
  }
  if (prediction.status === "failed") throw new Error(prediction.error);

  // 3. Download the bundle, unpack, feed to decoder. Same shape as the notebook.
  const zipBuf = await fetch(prediction.output).then((r) => r.arrayBuffer());
  return unpackBundle(zipBuf);
}`,
    },
  ]}
/>

For a Hugging Face Space alternative, the same `predict()` function ships
inside a Gradio app exposing `/api/predict`, and `huggingface.js` hits it
the same way. Pick whichever your audience already has accounts with.

## Where this goes from here

The split-inference pattern generalizes well beyond SAM2. Any model with an
expensive encoder and a cheap promptable head fits this shape:

- **CLIP-style retrieval.** Encode a corpus offline, ship embeddings, do
  nearest-neighbor in the browser (see the sketch after this list). Same pattern.
- **Depth-anything / monocular depth.** Encode once on a server, refine
  client-side.
- **SAM3 (when it lands).** Three-model pipeline (image + language +
  decoder) is a natural extension &mdash; text encoder and image encoder
  server-side, decoder in the browser.
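
The CLIP variant in particular needs no ONNX at all on the client; a minimal
sketch of the browser half, with illustrative field names and
unit-normalized embeddings assumed:

```ts
// Hedged sketch of the CLIP-style variant: embeddings computed offline,
// nearest-neighbor search done entirely in the browser. Names are illustrative.
interface Item { id: string; embedding: Float32Array }  // unit-normalized offline

function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot;  // vectors are pre-normalized, so the dot product is the cosine similarity
}

function topK(query: Float32Array, corpus: Item[], k = 5): Item[] {
  return [...corpus]
    .sort((x, y) => cosine(query, y.embedding) - cosine(query, x.embedding))
    .slice(0, k);
}
```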

The notebook-as-encoder pattern is particularly nice for research and
developer tools, where the audience is comfortable running Python and the
privacy/cost benefits matter. For consumer products, Replicate or an HF
Space is the friction-free swap. The browser side stays identical.

The point &mdash; worth stating explicitly because it's easy to miss
&mdash; is that **interactive ML on the web doesn't have to mean small,
fast, low-quality models**. You can ship the best model in its category by
being thoughtful about which parts run where. The encoder is heavy and runs
once. The decoder is light and runs constantly. Put each one in the place
where its cost profile makes sense, and the user gets an experience that
wasn't possible five years ago: a 224M-parameter vision transformer
responding to their clicks in real time, in a browser tab, with their image
never leaving their machine.

## Appendix — file checklist

What ships where:

| File                                   | Where it lives                             | Size      |
| -------------------------------------- | ------------------------------------------ | --------- |
| `sam2.1_hiera_large.encoder.onnx`      | User's machine (notebook only)             | ~850&thinsp;MB |
| `sam2.1_hiera_large.encoder.onnx.data` | User's machine (notebook only)             | varies    |
| `sam2.1_hiera_large.decoder.onnx`      | Vercel `/public/models/`                   | ~16&thinsp;MB  |
| `embedding.bin` (per image)            | Vercel `/public/demos/<slug>/` or uploaded | ~12&ndash;18&thinsp;MB |
| `manifest.json` (per image)            | Vercel `/public/demos/<slug>/` or uploaded | &lt;1&thinsp;KB |
| `preview.jpg` (per image)              | Vercel `/public/demos/<slug>/` or uploaded | varies    |

## Appendix — sources, reading, distribution

<LaunchBar
  liveDemo="https://sam2-hybrid.vercel.app/"
  github="https://github.com/jeanc18rlos/sam2-hybrid"
  colab="https://colab.research.google.com/github/jeanc18rlos/sam2-hybrid/blob/main/notebooks/sam2_encode.ipynb"
  hfSpace="https://huggingface.co/spaces/jrojastechnology/sam2-encoder"
  replicate="https://replicate.com/jrojastechnology/sam2-encoder"
  vercelDeploy="https://vercel.com/new/clone?repository-url=https%3A%2F%2Fgithub.com%2Fjeanc18rlos%2Fsam2-hybrid"
  llmRead="/blog/splitting-sam2-encoder-decoder/raw"
  shareTitle="Making AI feel realtime with hybrid segmentation — sam2 split across a notebook and the browser"
  shareUrl="https://jeanrojas.com/blog/splitting-sam2-encoder-decoder"
/>

**Models & code referenced:**

- [`sam2.1_hiera_large` checkpoint](https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt)
  &mdash; Meta's official Hiera-Large weights.
- [`samexporter`](https://github.com/vietanhdev/samexporter) &mdash; the
  ONNX export tool that handles the encoder/decoder split.
- [`onnxruntime-web`](https://www.npmjs.com/package/onnxruntime-web) &mdash;
  the in-browser runtime, with WASM and WebGPU backends.
- [`onnxruntime-react-native`](https://www.npmjs.com/package/onnxruntime-react-native)
  &mdash; same model on phones, via Expo's prebuild plugin.
- [Pre-exported community models](https://huggingface.co/vietanhdev/segment-anything-2.1-onnx-models)
  &mdash; if you want to skip the export step entirely.
- [Cog (Replicate's packaging tool)](https://github.com/replicate/cog).
- [Hugging Face Spaces docs](https://huggingface.co/docs/hub/spaces).

**Background reading on interactive segmentation:**

- [Segment Anything (SAM, 2023)](https://arxiv.org/abs/2304.02643).
- [SAM2: Segment Anything in Images and Videos (2024)](https://arxiv.org/abs/2408.00714).
- [ClickSEG codebase](https://github.com/XavierCHEN34/ClickSEG) &mdash;
  CDNet / FocalClick / efficient baselines, the lineage that informed SAM's
  click-driven interface.
- [The earlier post on this site about ONNX in the browser](/blog/onnx-in-the-browser).

**Distribution checklist.** The share buttons above pre-fill the title and
canonical URL for LinkedIn and X. For Medium, publish the canonical version
here and use Medium's "Import a story" flow with this URL &mdash; it
preserves the canonical link tag for SEO. For the LinkedIn long-form post,
paste the body and link back to this page; LinkedIn's algorithm tolerates
plain text plus one strong outbound link. The "Read with an LLM" button in
the launch bar above points at the raw markdown source &mdash; drop that URL
into ChatGPT or Claude and it has the whole article in one shot.
