Why We Chose SignalR as the Primary Transport for Thresh

April 9, 2026 · 10 min read

thresh Development Team

When you're managing a fleet of development environments across a network, the transport layer isn't just plumbing — it's the nervous system. Every heartbeat, every command dispatch, every metrics payload depends on it. We needed something that was real-time, resilient, and could degrade gracefully when the network couldn't cooperate.

We chose ASP.NET Core SignalR as the primary transport, with REST polling as automatic failover. Here's why, and how it works under the hood.

The Problem: Fleet Communication at Scale

A thresh deployment looks like this: agents running on dev machines, optionally connecting through a mid-tier relay, all reporting back to a central Hub. The communication needs are:

Real-time command dispatch — push a command to an agent and get results back without polling
Live metrics streaming — CPU, RAM, disk, GPU every 60 seconds from every node
Instant status changes — node goes offline, dashboard updates immediately
Bidirectional — the Hub needs to push commands down to agents, not just receive data

HTTP polling can do all of this, but at the cost of latency and unnecessary traffic. We needed push semantics.

Why SignalR Over Raw WebSockets

Raw WebSockets would give us push semantics, but SignalR adds more on top of them.

The killer feature is transport negotiation. SignalR automatically tries WebSocket first, falls back to Server-Sent Events, then to long-polling — all transparently. This matters because agents can be behind corporate firewalls, NATs, or proxies that block WebSocket upgrades. The agent doesn't need to know or care.
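As a simplified model of that negotiation (not SignalR's actual client code), the choice comes down to walking a preference list and taking the first transport that both the server offers and the network path allows. The function name `negotiate` and its parameters are hypothetical; the sketch is in Python for brevity.

```python
from typing import Iterable, Optional

# Preference order described above: best transport first.
PREFERENCE = ["WebSockets", "ServerSentEvents", "LongPolling"]


def negotiate(server_offers: Iterable[str], client_usable: Iterable[str]) -> Optional[str]:
    """Return the most preferred transport both sides can use, or None."""
    offered = set(server_offers)
    usable = set(client_usable)
    for transport in PREFERENCE:
        if transport in offered and transport in usable:
            return transport
    return None  # nothing in common: the caller must try another tier


# A proxy that blocks WebSocket upgrades leaves SSE as the best option.
chosen = negotiate(PREFERENCE, ["ServerSentEvents", "LongPolling"])
```

Here `chosen` ends up as `"ServerSentEvents"`, which is the behavior an agent behind a WebSocket-blocking proxy would see.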

The Full Connection Architecture

The transport layer spans all three tiers: agents on dev machines, the optional mid-tier relay, and the central Hub.

Agent Connection: The Tiered Fallback

When an agent starts up, it doesn't just try one connection method. It walks through a tiered fallback chain, trying the best option first and degrading gracefully.

Each tier has a transport label that's recorded in the database and visible on the dashboard:

Tier	Transport Label	Capabilities
Primary SignalR	`SignalR` or `SignalR-MidTier`	Full real-time push, instant commands
Primary REST	`REST` or `REST-MidTier`	Polling-based, 30-sec command check
Cloud SignalR	`SignalR` (via fallback URL)	Full real-time through cloud DR
Cloud REST	`REST` (via fallback URL)	Polling through cloud DR
Offline	—	Local cache, forward when reconnected
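The chain in the table can be sketched as a loop over tiers that stops at the first one that connects. This is a minimal illustration, not the agent's real code: the `Tier` type, the `connect` callables, and `connect_with_fallback` are hypothetical names, and a real agent would also apply the timeouts described later.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Tier:
    name: str
    label: Optional[str]         # transport label shown on the dashboard
    connect: Callable[[], bool]  # hypothetical: True when the tier connects


def connect_with_fallback(tiers: List[Tier]) -> Tier:
    """Try each tier in order and return the first that connects."""
    for tier in tiers:
        try:
            if tier.connect():
                return tier
        except OSError:
            # Network-level failure (refused, reset, timed out): try the next tier.
            continue
    # Every tier failed: fall back to the local cache until reconnected.
    return Tier("Offline", None, lambda: True)


def blocked() -> bool:
    raise ConnectionError("WebSocket upgrade blocked")


active = connect_with_fallback([
    Tier("Primary SignalR", "SignalR", blocked),
    Tier("Primary REST", "REST", lambda: True),
])
```

In this example the first tier raises, so `active.label` is `"REST"`, the label that would be recorded for the dashboard.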

Reconnection & Exponential Backoff

SignalR's built-in reconnection is good, but we layer our own retry policies on top for the initial connection and for mid-tier-to-hub resilience.

Mid-Tier → Hub Retry Policy

The mid-tier's HubClient uses a custom IRetryPolicy with escalating delays.

Before the reconnect policy even kicks in, the initial connection has its own retry loop: 5 attempts with linearly increasing delays of attempt × 5 seconds (5s, 10s, 15s, 20s, 25s). This handles the common case where the Hub hasn't finished booting yet when the mid-tier starts.
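The initial retry loop is simple enough to sketch. The version below is an illustration in Python rather than the mid-tier's actual code; `connect` and `sleep` are injected placeholders, and waiting after every failed attempt (including the last) is an assumption drawn from the five delays listed.

```python
from typing import Callable, List


def initial_connect_delays(max_attempts: int = 5, step_seconds: int = 5) -> List[int]:
    """Delay after each failed attempt: attempt number times the step."""
    return [attempt * step_seconds for attempt in range(1, max_attempts + 1)]


def connect_with_retry(
    connect: Callable[[], bool],
    sleep: Callable[[float], None],
    max_attempts: int = 5,
    step_seconds: int = 5,
) -> bool:
    """Try to connect up to max_attempts times, waiting longer after each failure."""
    for attempt in range(1, max_attempts + 1):
        if connect():
            return True
        sleep(attempt * step_seconds)
    return False


# Simulate a Hub that becomes reachable on the third attempt.
results = iter([False, False, True])
waits: List[float] = []
connected = connect_with_retry(lambda: next(results), waits.append)
```

With the Hub coming up on the third attempt, `waits` records the 5-second and 10-second pauses and `connected` is `True`. Passing `time.sleep` as `sleep` would make the loop wait for real.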

Agent → Hub/MidTier Reconnection

Agents use SignalR's WithAutomaticReconnect with delays of [0s, 2s, 5s, 10s, ...]. If the automatic reconnect exhausts its retries, the agent falls through to the next tier in the fallback chain.
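The "exhausts its retries" behavior can be modeled as a function that returns the next delay, or nothing when the list runs out. In the .NET client, a retry policy signals "stop reconnecting" by returning null; the Python sketch below mirrors that with `None`. The delay list passed in the example contains only the four values quoted above; the rest of the list is not reproduced here.

```python
from typing import Optional, Sequence


def next_retry_delay(previous_retry_count: int, delays: Sequence[float]) -> Optional[float]:
    """Return the delay in seconds before the next reconnect attempt.

    None means the policy is exhausted and the caller should move on
    to the next tier in the fallback chain.
    """
    if previous_retry_count < len(delays):
        return delays[previous_retry_count]
    return None


quoted_delays = [0, 2, 5, 10]  # only the values quoted above; not the full list
first = next_retry_delay(0, quoted_delays)
exhausted = next_retry_delay(len(quoted_delays), quoted_delays)
```

Once `next_retry_delay` returns `None`, the agent would hand control back to the tiered fallback logic rather than keep retrying the same endpoint.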

Command Dispatch: The Real-Time Advantage

This is where SignalR truly shines. When you dispatch a command to an agent, it's pushed immediately — no waiting for the next poll cycle.

With SignalR, command dispatch is sub-second. With REST failover, there's up to a 30-second delay waiting for the next poll. Both work — but SignalR is the experience we want.
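To put a number on the REST delay: if polls happen on a fixed 30-second cycle, the wait for a command depends only on where in the cycle it was queued. The helper below is a back-of-the-envelope model with a hypothetical name, assuming polls fire at exact multiples of the interval.

```python
def rest_dispatch_delay(command_time: float, poll_interval: float = 30.0) -> float:
    """Seconds until the next poll picks up a command queued at command_time."""
    elapsed = command_time % poll_interval
    return 0.0 if elapsed == 0 else poll_interval - elapsed


just_missed = rest_dispatch_delay(1.0)   # queued right after a poll
mid_cycle = rest_dispatch_delay(45.0)    # queued halfway through a cycle
```

A command queued just after a poll waits almost the full interval, and one queued at a random moment waits about half of it on average, which is the gap SignalR push removes.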

Heartbeat & Liveness Detection

Keeping track of which nodes are actually alive requires two layers:

Layer 1 is SignalR's built-in keep-alive: periodic ping messages that let each side detect a dropped connection. Layer 2 is our application-level heartbeat: every agent sends metrics every 60 seconds, and a background service sweeps every 30 seconds looking for agents that haven't reported in 90 seconds. Because of the sweep interval, a silent agent is flagged between 90 and 120 seconds after its last report.

If an agent goes silent for 90 seconds, it's marked offline and the dashboard is notified in real time (via SignalR, naturally). If it's been gone for 14 days, the stale record is purged entirely.
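The sweep itself reduces to comparing each agent's last report time against two thresholds. The sketch below is illustrative only: the `sweep` function and the in-memory `last_seen` mapping are hypothetical stand-ins for the background service and its database, and treating the thresholds as inclusive is an assumption.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

OFFLINE_AFTER = timedelta(seconds=90)
PURGE_AFTER = timedelta(days=14)


def sweep(last_seen: Dict[str, datetime], now: datetime) -> Tuple[List[str], List[str]]:
    """Return (agents to mark offline, agents whose stale records to purge)."""
    offline: List[str] = []
    purge: List[str] = []
    for agent_id, seen in last_seen.items():
        silence = now - seen
        if silence >= PURGE_AFTER:
            purge.append(agent_id)
        elif silence >= OFFLINE_AFTER:
            offline.append(agent_id)
    return offline, purge


now = datetime(2026, 4, 9, 12, 0, 0)
offline, purge = sweep(
    {
        "healthy": now - timedelta(seconds=45),
        "silent": now - timedelta(seconds=100),
        "abandoned": now - timedelta(days=15),
    },
    now,
)
```

In this run `offline` contains only `"silent"` and `purge` contains only `"abandoned"`. A real service would run this on its 30-second timer and push the resulting status changes to the dashboard.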

Metrics Flow: Batching at the Mid-Tier

Raw metrics from every agent every 60 seconds would overwhelm the Hub in a large fleet. The mid-tier solves this with batched aggregation:

Individual agents send metrics every 60 seconds to their local mid-tier via SignalR (or REST if SignalR is unavailable). The MetricsBatcher service aggregates all local agent metrics and sends a single SendAggregatedMetrics() call to the Hub every 5 minutes. Each agent would otherwise send 5 reports in that window, so the number of Hub-bound metrics messages drops by a factor of roughly N × 5 (where N is agents per mid-tier).
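A minimal sketch of the batching idea follows. It is not the real MetricsBatcher: the class below tracks a single hypothetical `cpu_percent` value per report, and averaging is an assumed aggregation, since the exact shape of the aggregated payload isn't described here.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


class MetricsBatcher:
    """Collects per-agent samples and emits one aggregated batch on flush."""

    def __init__(self) -> None:
        self._samples: Dict[str, List[float]] = defaultdict(list)

    def add(self, agent_id: str, cpu_percent: float) -> None:
        # Called once per agent report (every 60 seconds in the text).
        self._samples[agent_id].append(cpu_percent)

    def flush(self) -> Dict[str, float]:
        # Called on the 5-minute timer; the result would go out in one call.
        batch = {agent: mean(values) for agent, values in self._samples.items()}
        self._samples.clear()
        return batch


def message_reduction_factor(agents: int, report_s: int = 60, batch_s: int = 300) -> int:
    """How many individual reports one aggregated call replaces."""
    return agents * (batch_s // report_s)


batcher = MetricsBatcher()
batcher.add("agent-a", 10.0)
batcher.add("agent-a", 30.0)
batcher.add("agent-b", 50.0)
batch = batcher.flush()
```

Here `batch` holds one averaged value per agent, and for a mid-tier with 20 agents `message_reduction_factor(20)` gives 100: one call to the Hub in place of 100 individual reports.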

Failover & Failback: Self-Healing Connectivity

The agent doesn't just fail over. It automatically fails back when the primary becomes available again.

Key configuration that controls this behavior:

Setting	Default	What It Does
`FailoverEnabled`	`true`	Allow falling through to cloud DR tiers
`FailoverTimeoutSeconds`	`30`	How long to wait before declaring a tier dead
`FailbackEnabled`	`true`	Automatically try to return to primary
`FailbackDelaySeconds`	`300` (5 min)	Wait before attempting failback to avoid flapping
`ConnectTimeoutSeconds`	`10`	Per-connection attempt timeout
`OfflineCacheEnabled`	`true`	Cache data locally when fully disconnected

The 5-minute failback delay prevents connection flapping — if the primary is unstable, the agent stays on cloud DR long enough to confirm the primary is actually healthy before switching back.
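The settings above map naturally onto a small config object plus a guard that decides when a failback attempt is allowed. This is a sketch under assumptions: the Python field names are transliterations of the settings in the table, `should_attempt_failback` is a hypothetical helper, and measuring the delay from the moment the agent moved to a fallback tier is a guess at how the timer works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransportSettings:
    # Defaults taken from the settings table.
    failover_enabled: bool = True
    failover_timeout_seconds: int = 30
    failback_enabled: bool = True
    failback_delay_seconds: int = 300
    connect_timeout_seconds: int = 10
    offline_cache_enabled: bool = True


def should_attempt_failback(
    settings: TransportSettings, on_primary: bool, seconds_on_fallback: float
) -> bool:
    """True once the agent has stayed on a fallback tier long enough to retry primary."""
    if on_primary or not settings.failback_enabled:
        return False
    return seconds_on_fallback >= settings.failback_delay_seconds


defaults = TransportSettings()
too_early = should_attempt_failback(defaults, on_primary=False, seconds_on_fallback=120)
ready = should_attempt_failback(defaults, on_primary=False, seconds_on_fallback=300)
```

With the defaults, an agent two minutes into a cloud DR session stays put, and one that has been there for the full five minutes is allowed to probe the primary again.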

The Tradeoff: Why Not Just REST?

REST-only would be simpler. No WebSocket upgrade negotiation, no connection state management, no reconnection logic. But the costs are real:

	SignalR (Primary)	REST (Failover)
Command latency	Sub-second push	Up to 30s (poll interval)
Dashboard updates	Instant via push	Would require polling
Network overhead	One persistent connection	New TCP + TLS handshake per request
Metrics delivery	Stream as available	Batch on timer
Bidirectional	Native	Simulated (client polls)
Firewall compat	WebSocket → SSE → long-poll	Always works

We get the best of both: SignalR for the real-time experience when the network allows it, REST as a reliable fallback when it doesn't. The agent handles the transition transparently — the user never has to configure or think about it.

Summary

SignalR gives thresh three things that matter for fleet management:

Real-time push — commands and status changes arrive instantly, not on the next poll
Transport negotiation — works behind firewalls that block WebSockets by automatically falling back to SSE or long-polling
Built-in reconnection — combined with our tiered failover chain, agents self-heal through network disruptions without operator intervention

REST is always there as the safety net. The system is designed so that no single transport failure takes a node offline — it just degrades gracefully until connectivity recovers.
