How to Handle WebSocket Reconnections Without Losing Events
The worst RPC bug is the one that doesn't throw. Your service connects over WebSocket, subscribes to contract logs with eth_subscribe, and processes events for hours. Then the connection drops — an idle timeout, a load balancer cycling a backend, the provider shipping a deploy — and your code keeps running against a dead socket. No exception, no log line, just silence. Twenty minutes later the socket reconnects (or your wrapper reconnects it), the events start flowing again, and everything looks healthy.
Except you lost every event that fired during those twenty minutes. They were never queued anywhere; eth_subscribe is fire-and-forget, and a subscription does not survive the connection that created it. For an indexer, an accounting service, or a bot that acts on transfers, a silent gap is a correctness bug that surfaces days later as "why is our balance off."
This post is the pattern we use to make WebSocket consumers actually reliable. There are four jobs, and most reconnect code only does the first two.
Why WebSocket connections drop (all the time)
Long-lived WebSockets are not a stable resource. In production you will see disconnects from:
- Idle timeouts. Many providers and proxies close a socket that hasn't sent a frame in 30–120 seconds. A subscription that happens to be quiet looks idle.
- Load balancer recycling. Behind any real RPC endpoint there's a fleet. Backends get drained for deploys and health-check failures; your socket goes with them.
- Network blips. NAT rebinding, Wi-Fi handoff, a container migration — the TCP connection just dies.
- Server-side resource caps. Hit a per-connection subscription limit or a memory ceiling and the server hangs up.
The takeaway: design for the socket dying every few minutes, not as a rare event. If your reconnect path is well-worn, drops become a non-event.
The four jobs of a reliable consumer
- Detect the drop quickly (don't trust onclose alone).
- Reconnect with backoff so you don't hammer a struggling endpoint.
- Re-subscribe to everything you were watching.
- Backfill the events you missed while disconnected — and de-duplicate.
Job 4 is the one almost everyone skips, and it's the only one that prevents data loss.
Job 1: detect the drop with a heartbeat
onclose and onerror fire eventually, but a half-open socket — TCP alive, no data flowing — can sit silent for a long time. Add a heartbeat: send a cheap request on an interval and reset a watchdog whenever any data arrives. If the watchdog expires, treat the socket as dead and tear it down yourself.
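let lastData = Date.now();
ws.on("message", () => { lastData = Date.now(); }); // any inbound frame proves the socket is alive
const heartbeat = setInterval(() => {
  // no data for 30s (three missed pings): assume a half-open socket
  if (Date.now() - lastData > 30_000) return ws.terminate(); // force onclose -> reconnect
  // any cheap call works as a liveness ping
  if (ws.readyState === ws.OPEN) {
    ws.send(JSON.stringify({ jsonrpc: "2.0", id: "ping", method: "net_version", params: [] }));
  }
}, 10_000);
ws.on("close", () => clearInterval(heartbeat)); // don't leak the timer across reconnects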
Job 2: reconnect with exponential backoff and jitter
When an endpoint is having a bad minute, fifty clients reconnecting in a tight loop make it worse. Back off, cap the delay, and add jitter so a fleet of your own workers doesn't reconnect in lockstep.
| Attempt | Base delay | With jitter (±30%) |
|---|---|---|
| 1 | 1s | 0.7–1.3s |
| 2 | 2s | 1.4–2.6s |
| 3 | 4s | 2.8–5.2s |
| 4 | 8s | 5.6–10.4s |
| 5 | 16s | 11.2–20.8s |
| 6+ | 30s (cap) | 21–39s |
function backoff(attempt) {
  // attempt is 1-based, matching the table: 1 -> 1s, 2 -> 2s, ... capped at 30s
  const base = Math.min(30_000, 1000 * 2 ** (attempt - 1));
  return base * (0.7 + Math.random() * 0.6); // ±30% jitter, applied after the cap
}
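As a rough illustration of how a delay function like this might drive a reconnect loop, here is a minimal sketch. `connectOnce`, `sleep`, and `maxAttempts` are hypothetical names introduced for the example, not part of any library, and the attempt counter in this sketch is zero-based so the block stands on its own.

```javascript
// Sketch only: connectOnce and sleep are hypothetical helpers you supply.
// attempt is zero-based here: 0 -> ~1s, 1 -> ~2s, ... capped at a 30s base.
function jitteredDelay(attempt, random = Math.random) {
  const base = Math.min(30_000, 1000 * 2 ** attempt);
  return base * (0.7 + random() * 0.6); // ±30% jitter
}

const realSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Keep calling connectOnce() until it resolves, waiting a jittered,
// exponentially growing delay after each failure.
async function reconnectWithBackoff(
  connectOnce,
  { sleep = realSleep, maxAttempts = Infinity } = {}
) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      // Success: return the connection. A later drop starts a fresh call,
      // so the delay begins again from attempt 0.
      return await connectOnce();
    } catch (err) {
      await sleep(jitteredDelay(attempt));
    }
  }
  throw new Error(`gave up after ${maxAttempts} attempts`);
}
```

Injecting `sleep` keeps the loop testable without real timers. Starting from zero on each new call means a healthy connection that later drops is retried quickly instead of inheriting the capped delay.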
Jobs 3 + 4: re-subscribe, then backfill the gap
This is the heart of it. On reconnect you re-create your subscriptions — but a fresh subscription only delivers events from now. The window between your last received event and the new subscription is a hole. Fill it with eth_getLogs.
The trick is to track the last block you fully processed. On reconnect, query logs from that block forward to the current head, replay them, then let the live subscription take over. Because the boundary overlaps, you must de-duplicate on a stable key: blockHash + logIndex (or transactionHash + logIndex).
import { createPublicClient, webSocket, http } from "viem";
import { mainnet } from "viem/chains";

const WSS = "wss://rpc.swiftnodes.io/ws/ethereum?key=YOUR_API_KEY";
const HTTPS = "https://rpc.swiftnodes.io/rpc/ethereum?key=YOUR_API_KEY";

// viem's webSocket transport reconnects on its own; we add the backfill.
const wsClient = createPublicClient({
  chain: mainnet,
  transport: webSocket(WSS, { reconnect: { attempts: 10, delay: 1_000 } }),
});
const httpClient = createPublicClient({ chain: mainnet, transport: http(HTTPS) });

const seen = new Set(); // `${blockHash}:${logIndex}` for the overlap window
let lastProcessed = 0n; // highest block we have handled a log from
const FILTER = { address: "0xYourContract", event: /* parseAbiItem(...) */ undefined };

function handle(log) {
  const id = `${log.blockHash}:${log.logIndex}`;
  if (seen.has(id)) return; // already handled via the other path
  seen.add(id);
  lastProcessed = log.blockNumber > lastProcessed ? log.blockNumber : lastProcessed;
  // ... your event handler ...
}

async function backfill() {
  if (lastProcessed === 0n) return; // nothing to catch up on yet
  const head = await httpClient.getBlockNumber();
  // lag the tip by a few blocks so a reorg doesn't replay logs you'll un-see
  const safeHead = head - 3n;
  if (safeHead < lastProcessed) return;
  // start AT lastProcessed, not after it: the drop may have landed mid-block,
  // so re-read that block and let the dedup in handle() discard repeats
  const logs = await httpClient.getLogs({ ...FILTER, fromBlock: lastProcessed, toBlock: safeHead });
  for (const log of logs) handle(log);
}

// Live tail. Note that this snippet does not wire backfill() to the reconnect:
// call it from your own (re)connect signal before trusting the live tail again.
wsClient.watchEvent({
  ...FILTER,
  onLogs: (logs) => logs.forEach(handle),
  onError: (err) => console.error("subscription error:", err), // transport will try to reconnect
});
A few things make this robust:
- Backfill over HTTP, not WS. A range query is a request/response — it belongs on HTTP. Keep WS for the live tail. (And mind the provider's range cap; if the gap is large, page it. We covered those limits in eth_getLogs range caps.)
- Lag the tip by a few blocks. The very head reorgs. If you backfill all the way to head, a reorg can make you replay or act on logs that get orphaned. Stay a handful of confirmations back for anything you act on irreversibly.
- Bound the dedup set. Don't let seen grow forever; clear entries older than the overlap window (e.g. anything below lastProcessed - 50).
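Paging a large gap and bounding the dedup set could look roughly like the sketch below. The helper names (`blockRanges`, `markSeen`, `pruneSeen`, `seenAt`) and the 2,000-block chunk size are assumptions made for illustration. The real range cap depends on your provider, so check its documented limit.

```javascript
// Split an inclusive block range into chunks of at most `size` blocks, so each
// eth_getLogs call stays under the provider's range cap.
// The default size is an assumption, not a documented limit.
function* blockRanges(fromBlock, toBlock, size = 2_000n) {
  for (let start = fromBlock; start <= toBlock; start += size) {
    const end = start + size - 1n < toBlock ? start + size - 1n : toBlock;
    yield { fromBlock: start, toBlock: end };
  }
}

// Dedup map keyed by `${blockHash}:${logIndex}` that also remembers each key's
// block number, so old entries can be dropped once they leave the overlap window.
const seenAt = new Map(); // key -> blockNumber (bigint)

// Returns true the first time a log is seen, false for a duplicate.
function markSeen(log) {
  const id = `${log.blockHash}:${log.logIndex}`;
  if (seenAt.has(id)) return false;
  seenAt.set(id, log.blockNumber);
  return true;
}

// Drop every entry more than `window` blocks behind the watermark.
function pruneSeen(lastProcessed, window = 50n) {
  for (const [id, blockNumber] of seenAt) {
    if (blockNumber < lastProcessed - window) seenAt.delete(id);
  }
}
```

In the backfill, each value yielded by `blockRanges(lastProcessed, safeHead)` can be spread into the log query in place of the single fromBlock/toBlock pair, and `pruneSeen(lastProcessed)` can run after each batch.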
The ethers v6 version
ethers v6's WebSocketProvider does not reconnect itself, so you wrap it: recreate the provider on close, re-attach listeners, and run the same backfill.
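import { WebSocketProvider, JsonRpcProvider } from "ethers";

// HTTP provider for the backfill: port backfill() above to http.getLogs({ address, fromBlock, toBlock })
const http = new JsonRpcProvider("https://rpc.swiftnodes.io/rpc/ethereum?key=YOUR_API_KEY");
let attempt = 0; // consecutive failed connects; reset once a socket opens

function connect() {
  const ws = new WebSocketProvider("wss://rpc.swiftnodes.io/ws/ethereum?key=YOUR_API_KEY");
  // addEventListener rather than assigning onopen/onclose: the provider sets its
  // own onopen handler to start itself, and assigning would replace it
  ws.websocket.addEventListener("open", async () => {
    attempt = 0;
    await backfill().catch(console.error); // close the gap first...
    subscribe(ws); // ...then resume the live tail
  });
  ws.websocket.addEventListener("close", () => {
    const delay = Math.min(30_000, 1000 * 2 ** attempt) * (0.7 + Math.random() * 0.6);
    attempt += 1;
    setTimeout(connect, delay);
  });
}

function subscribe(ws) {
  // re-attach on every reconnect; ethers v6 logs expose `index` and a numeric
  // blockNumber, so normalise to the shape handle() expects
  ws.on({ address: "0xYourContract" }, (log) =>
    handle({ ...log, logIndex: log.index, blockNumber: BigInt(log.blockNumber) }));
}

connect();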
Same shape in web3.py: catch the ConnectionClosed, reconnect in a loop with backoff, re-create the filter, and run an eth_getLogs backfill from your last stored block.
Persist the watermark
One last piece: lastProcessed has to survive a process restart, not just a reconnect. If your service crashes and comes back, the in-memory watermark is gone and you'll either re-process from genesis or skip the gap. Write the last fully-processed block to Redis or your database after each batch, and load it on boot. Then the same backfill that recovers from a dropped socket also recovers from a deploy.
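A minimal sketch of that persistence step, assuming a generic async key-value `store` with `get(key)` and `set(key, value)` methods. That interface, the key name, and the function names are hypothetical stand-ins for whatever Redis or database client you use.

```javascript
// Sketch only: `store` is any async key-value client exposing get(key) and
// set(key, value). It is a hypothetical interface, not a specific library's API.
const WATERMARK_KEY = "indexer:lastProcessed"; // hypothetical key name

// Load the watermark on boot; 0n means "no watermark stored yet".
async function loadWatermark(store) {
  const raw = await store.get(WATERMARK_KEY);
  return raw == null ? 0n : BigInt(raw);
}

async function saveWatermark(store, blockNumber) {
  // BigInt is not JSON-serialisable, so persist it as a decimal string.
  await store.set(WATERMARK_KEY, blockNumber.toString());
}

// Process one batch, then advance the stored watermark. Saving only after the
// handlers finish means a crash mid-batch replays the batch instead of skipping it.
async function processBatch(store, logs, handle) {
  let highest = await loadWatermark(store);
  for (const log of logs) {
    await handle(log);
    if (log.blockNumber > highest) highest = log.blockNumber;
  }
  await saveWatermark(store, highest);
  return highest;
}
```

Because a crash between handling and saving replays the batch, the handlers should be idempotent, or guarded by the same dedup key used for the overlap window.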
The mental model
A WebSocket subscription is a best-effort live tail, not a delivery guarantee. Treat it as one half of a pair: WS for low-latency live events, eth_getLogs for the authoritative gap-fill. With a heartbeat to detect drops, jittered backoff to reconnect politely, re-subscription, and a watermark-driven backfill with dedup, your consumer can lose its connection a hundred times a day and never lose an event.
SwiftNodes runs flat-rate WebSocket and HTTP endpoints across 50+ chains — same ?key= on both transports, no compute-unit surprises when a reconnect storm makes you fire a burst of eth_getLogs backfills. Spin up a free key at swiftnodes.io and point both your wss:// tail and your https:// backfill at our Ethereum RPC or any chain you build on. For the WS quirks that differ by network, see Arbitrum WebSocket gotchas.
