March 24, 2026 | Browser infrastructure | Trust systems | ChessIQ

Why I Stopped Trusting Browser Multithreading Checks

Why production startup needs probing, fallback, and recovery.

When I started building browser-native chess analysis, I assumed multithreading support was a straightforward feature-detection problem.

If the browser exposed the right APIs, multithreaded Stockfish should work.

That assumption was clean, common, and wrong.

The original mental model

Most browser code follows a familiar rule: if the API exists, use it.

if ("IntersectionObserver" in window) {
  // safe to use
}

I treated multithreaded engine startup the same way. If these checks passed, I expected startup to be reliable:

SharedArrayBuffer exists
crossOriginIsolated is true
Worker support is available

In theory, that stack should be enough.

In production, it was not.

What actually happened

Even with all capability signals present, startup could still fail.

Sometimes failure was obvious. Other times it was subtle:

Workers initialized inconsistently
The engine stalled after launch
Behavior changed between environments with the same apparent support
Certain startup and caching states produced hard-to-reproduce failures

From the outside, the browser looked fully compatible. At runtime, compatibility was conditional.

The shift: startup is a system, not a check

I stopped treating startup as a single yes/no decision and started treating it as a layered system:

Capability detection
Runtime probing
Fallback behavior
Failure classification
Recovery logic

Each layer handled a different failure mode.

1) Capability detection is necessary, not sufficient

The baseline checks still matter. Without SharedArrayBuffer, cross-origin isolation, and workers, multithreading is not possible.

But those checks only answer: Could this work?

They do not answer: Will this work reliably right now?
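
A minimal sketch of that baseline check follows. The function name `detectThreadCapability` and its return shape are illustrative assumptions, not the code used in production. The `env` parameter defaults to `globalThis` so the check can also run outside a browser.

```javascript
// Hypothetical helper: reports whether a multithreaded start may be attempted.
// It does NOT prove that the engine will run reliably once started.
function detectThreadCapability(env = globalThis) {
  const missing = [];
  if (typeof env.SharedArrayBuffer !== "function") missing.push("SharedArrayBuffer");
  if (env.crossOriginIsolated !== true) missing.push("crossOriginIsolated");
  if (typeof env.Worker !== "function") missing.push("Worker");
  // "possible" only means none of the prerequisites is absent.
  return { possible: missing.length === 0, missing };
}
```

A `possible: true` result answers the first question only. The `missing` array is there so a later layer can report which prerequisite was absent.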

2) Probe the engine, do not just trust the browser

After startup, the engine runs a short validation search.

If the probe behaves correctly, multithreaded mode is accepted. If the probe fails or looks unstable, the system falls back.

This validates runtime behavior directly instead of trusting capability signals as proof.
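
One way to sketch such a probe is below. It assumes a hypothetical engine wrapper with `send(line)` and `onLine(callback)` (returning an unsubscribe function). The search depth and timeout are placeholder values, not the ones used in production. The `position`, `go depth`, and `bestmove` strings are standard UCI commands.

```javascript
// Hypothetical probe: asks for a shallow search from the start position and
// accepts the engine only if a well-formed bestmove arrives before the timeout.
function probeEngine(engine, { depth = 8, timeoutMs = 3000 } = {}) {
  return new Promise((resolve) => {
    let settled = false;
    const finish = (ok, reason) => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      unsubscribe();
      resolve({ ok, reason });
    };
    // No answer in time is treated as a failed probe.
    const timer = setTimeout(() => finish(false, "timeout"), timeoutMs);
    const unsubscribe = engine.onLine((line) => {
      if (!line.startsWith("bestmove")) return;
      const move = line.split(/\s+/)[1] || "";
      // From the start position a legal move such as "e2e4" must come back.
      const valid = /^[a-h][1-8][a-h][1-8][qrbn]?$/.test(move);
      finish(valid, valid ? "ok" : "invalid-bestmove");
    });
    engine.send("position startpos");
    engine.send(`go depth ${depth}`);
  });
}
```

The probe resolves rather than rejects on failure, so the caller can branch on `ok` without a `try`/`catch`.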

3) Fallback is part of the design, not an edge case

Even in capable environments, startup can fail for reasons outside normal feature checks.

So startup always includes a deterministic fallback path:

try multithreaded startup
        ↓
run probe validation
        ↓
success -> continue in multithread mode
failure -> switch to single-thread mode

Single-thread analysis is slower, but it preserves reliability and continuity.
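
The flow above can be sketched as a single async function. The three callbacks (`startMultithreaded`, `startSingleThreaded`, `probe`) and the `terminate()` method are hypothetical names supplied by the caller. This is an illustration of the shape, not the production implementation.

```javascript
// Hypothetical orchestration: try multithreaded startup, validate it with a
// probe, and fall back to single-thread mode on any failure.
async function startWithFallback({ startMultithreaded, startSingleThreaded, probe }) {
  let engine = null;
  let fallbackReason;
  try {
    engine = await startMultithreaded();
    const result = await probe(engine);
    if (result.ok) return { mode: "multi", engine };
    fallbackReason = "probe-failed";
  } catch (err) {
    // Either startup or the probe itself threw.
    fallbackReason = "multithread-error";
  }
  // Shut down a half-working multithreaded engine before replacing it.
  if (engine && typeof engine.terminate === "function") engine.terminate();
  const fallback = await startSingleThreaded();
  return { mode: "single", engine: fallback, fallbackReason };
}
```

Returning the mode along with the engine lets the rest of the application stay indifferent to which path was taken, while still recording why a fallback happened.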

4) Failure modes need distinct handling

Different failures deserve different messages.

In practice, startup improved once failures were classified explicitly:

Offline and engine assets not cached
Engine runtime script failed to load
Multithread probe failed
Generic initialization failure

This improved both user messaging and operational debugging.
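
A sketch of how the four categories might be told apart is below. The input flags (`online`, `assetsCached`, `scriptLoaded`, `probeOk`) and the category strings are assumptions made for illustration. How each flag is actually derived is not shown here.

```javascript
// Hypothetical failure categories, mirroring the four cases listed above.
const StartupFailure = Object.freeze({
  OFFLINE_UNCACHED: "offline-uncached",
  SCRIPT_LOAD: "script-load-failed",
  PROBE: "multithread-probe-failed",
  GENERIC: "init-failed",
});

// Maps observed startup state to exactly one category, most specific first.
function classifyStartupFailure({ online, assetsCached, scriptLoaded, probeOk }) {
  if (!online && !assetsCached) return StartupFailure.OFFLINE_UNCACHED;
  if (!scriptLoaded) return StartupFailure.SCRIPT_LOAD;
  if (probeOk === false) return StartupFailure.PROBE;
  // Anything that matched none of the specific cases.
  return StartupFailure.GENERIC;
}
```

Keeping the categories in one frozen object gives user-facing messages and logs a shared vocabulary to key off.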

5) Long-running analysis needs recovery

Startup reliability is only part of the problem.

Long-running browser workloads can stall. Background tab throttling, worker hangs, and position-specific lockups all happen in real sessions.

So analysis runs with watchdog logic: if updates stop for too long, the run is restarted automatically.

To users, analysis continues. Internally, the system has recovered from a stalled engine.
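
A watchdog of this kind can be sketched in a few lines. The names `createWatchdog`, `notifyUpdate`, and `check`, and the `stallMs` threshold, are hypothetical. The real threshold and restart procedure are not specified here.

```javascript
// Hypothetical watchdog: restarts the run when no update has arrived for
// stallMs milliseconds. `now` is injectable so the logic can be tested.
function createWatchdog({ stallMs, restart, now = Date.now }) {
  let lastUpdate = now();
  let restarts = 0;
  return {
    // Call whenever the engine emits an analysis update.
    notifyUpdate() {
      lastUpdate = now();
    },
    // Call periodically; returns true if a restart was triggered.
    check() {
      if (now() - lastUpdate < stallMs) return false;
      restarts += 1;
      lastUpdate = now(); // give the restarted run a fresh window
      restart();
      return true;
    },
    get restarts() {
      return restarts;
    },
  };
}
```

In a page this might be driven by something like `setInterval(() => watchdog.check(), 1000)`. Comparing timestamps rather than counting ticks keeps the check meaningful even when the browser throttles timers in a background tab.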

What surprised me most

Getting Stockfish running in the browser was not the hard part.

Making browser-native analysis dependable under real-world conditions required significantly more system design than I expected.

The broader lesson

This changed how I think about browser capability checks.

Feature detection is necessary. It is not always sufficient.

For browser-native systems doing serious work, validate behavior directly, provide a graceful fallback, and recover automatically when runtime conditions degrade.

Capability signals are permission, not proof.