WebSocket keeps disconnecting
Symptom: Connects, drops every few seconds
Causes:
- Proxy or load balancer idle timeout is too low. ScaiWave pings every 30s; if a middlebox closes idle connections at 20s, you get drops. Raise the LB / nginx proxy_read_timeout past 60s (recommend 75–90s).
- Sticky sessions not configured. Some LBs round-robin upgrade requests, so the WS gets routed to a pod that doesn't have the session. Configure sticky sessions.
- Network conditions (mobile network, VPN). Less fixable; the client should auto-reconnect via the /v1/sync long-poll.
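The timeout arithmetic in the first cause can be sanity-checked with a few lines of Python. This is only a sketch: the 30s ping interval comes from the text above, while the function name and the "fragile" band (a timeout that survives only if no ping is ever delayed) are illustrative assumptions, not part of ScaiWave.

```python
PING_INTERVAL_S = 30  # ScaiWave ping interval, as stated in the text above


def idle_timeout_verdict(idle_timeout_s: float,
                         ping_interval_s: float = PING_INTERVAL_S) -> str:
    """Classify a proxy/LB idle timeout against the ping interval.

    Hypothetical helper: the three labels are this sketch's own wording.
    """
    if idle_timeout_s <= ping_interval_s:
        # The middlebox closes the connection before the next ping arrives.
        return "drops"
    if idle_timeout_s <= 2 * ping_interval_s:
        # Works only as long as no ping is delayed or lost.
        return "fragile"
    # Tolerates one missed ping; 75-90s falls in this band.
    return "ok"


for timeout in (20, 60, 75):
    print(timeout, idle_timeout_verdict(timeout))
```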
Symptom: Drops without close frame
Connection just goes silent: no close frame, no error. Usually a misconfigured intermediary. Test: open a connection with any WebSocket client and listen for 60+ seconds. If you get nothing → silent drop somewhere in the path.
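If you want a client to detect this condition itself, one option is a small watchdog that records when the last frame arrived. The sketch below is an assumption-laden illustration, not ScaiWave client code: the class name, the injectable clock, and the choice to count every frame (data, ping, or pong) as activity are all hypothetical. Only the 60-second window comes from the text above.

```python
import time
from typing import Callable


class SilenceWatchdog:
    """Tracks how long it has been since any frame was seen on a connection."""

    def __init__(self, limit_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._limit_s = limit_s
        self._clock = clock          # injectable so the logic can be tested
        self._last_seen = clock()    # treat "connected" as the first activity

    def frame_received(self) -> None:
        """Call this from the receive loop for every incoming frame."""
        self._last_seen = self._clock()

    def silent_for(self) -> float:
        """Seconds elapsed since the last frame."""
        return self._clock() - self._last_seen

    def is_silent_drop(self) -> bool:
        """True once nothing has arrived for at least limit_s seconds."""
        return self.silent_for() >= self._limit_s
```

A receive loop would call frame_received() on each frame and check is_silent_drop() on a timer; when it returns True, close the socket locally and reconnect.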
Symptom: Close frame with code 4001
SW_AUTH_INVALID_TOKEN. Token expired or invalid. The web client
should refresh and reconnect; if it doesn't, check the auth
refresh path.
Symptom: Close frame with code 4003
Server-side abort. Look for ws.disconnect in logs:
- reason = "ping_timeout" → client missed too many pongs.
- reason = "duplicate_connection" → another client signed in with the same token; the older one is closed.
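The close codes covered so far can be turned into a small decision table. The sketch below is illustrative only: the function name and the returned action strings are hypothetical, and it assumes the reason string is read from the ws.disconnect log entry (the text above does not say whether the reason is also carried in the close frame). The advice for duplicate_connection is an inference from the described behavior, not documented guidance.

```python
from typing import Optional


def next_action(close_code: Optional[int], reason: str = "") -> str:
    """Map a WebSocket close code (and optional logged reason) to a next step."""
    if close_code is None:
        # No close frame at all: matches the "silent drop" symptom.
        return "suspect an intermediary; test for a silent drop"
    if close_code == 4001:
        # SW_AUTH_INVALID_TOKEN: token expired or invalid.
        return "refresh the token, then reconnect"
    if close_code == 4003:
        if reason == "ping_timeout":
            return "client missed pongs; check client responsiveness, then reconnect"
        if reason == "duplicate_connection":
            # Assumption: reconnecting blindly would close the newer connection in turn.
            return "another client uses this token; do not auto-reconnect"
        return "server-side abort; look for ws.disconnect in the server logs"
    return "reconnect with backoff"


print(next_action(4001))
print(next_action(4003, "ping_timeout"))
```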
Symptom: Connects fine but no events arrive
You see the hello frame, but then nothing, even when you know messages are happening.
- Wrong tenant scope: are you signed in as a different tenant than the one with traffic? Check the hello frame's tenant_id.
- Rate limit: you're sending events that fail rate-limit checks; the events never enter the stream.
- WS-side bug: rare, but check the server logs for ws.fanout_failed. If there are many, restart the API pod.
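The tenant check in the first bullet is easy to automate in a test client. The sketch below assumes the hello frame is a JSON object with a top-level tenant_id field; the text above names the field but not the wire format, so the JSON shape, the function name, and the sample tenant IDs are all hypothetical.

```python
import json


def tenant_mismatch(hello_frame_json: str, expected_tenant_id: str) -> bool:
    """Return True when the hello frame reports a different tenant than expected."""
    frame = json.loads(hello_frame_json)
    # A missing tenant_id is also treated as a mismatch.
    return frame.get("tenant_id") != expected_tenant_id


# Hypothetical frame contents, for illustration only.
hello = '{"type": "hello", "tenant_id": "t_other"}'
if tenant_mismatch(hello, "t_acme"):
    print("Signed in to a different tenant than the one with traffic")
```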
Reconnect strategy
The web client uses:
- First disconnect → reconnect immediately.
- Successive failures → exponential backoff (1s, 2s, 4s, 8s, max 30s).
- On reconnect → /v1/sync?since=<last_stream_position> to bridge the gap, then resume the WS.
If you're writing your own client, copy that pattern. Don't try to keep WS open across long network outages — fall back to sync.
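For a custom client, the delay schedule above can be sketched as a generator. This is a minimal illustration rather than the web client's actual code: the function name and parameters are hypothetical, and it assumes the doubling continues through 16s before hitting the 30s cap. The text does not mention jitter, so none is added here.

```python
import itertools
from typing import Iterator


def reconnect_delays(base_s: float = 1.0, cap_s: float = 30.0) -> Iterator[float]:
    """Yield the wait before each reconnect attempt, in seconds."""
    yield 0.0                      # first disconnect: reconnect immediately
    delay = base_s
    while True:
        yield min(delay, cap_s)    # successive failures: 1s, 2s, 4s, 8s, ...
        delay = min(delay * 2, cap_s)


print(list(itertools.islice(reconnect_delays(), 8)))
# [0.0, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

Create a fresh generator after each successful connection so the next disconnect starts again at zero delay.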
Stream-position tracking
Persist last_stream_position (the highest stream_position you've
processed) somewhere durable on the client. On reconnect, query
/v1/sync?since=<that> first. Without this, you miss events that
happened during the disconnect.
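One way to persist the position durably is an atomic file write, sketched below. Everything here beyond the last_stream_position name and the /v1/sync?since= query is an assumption: the class and function names are hypothetical, stream positions are assumed to be integers, and a JSON file stands in for whatever durable storage your client has (a browser client would use its own storage instead).

```python
import json
import os
import tempfile
from typing import Optional
from urllib.parse import urlencode


class StreamPositionStore:
    """Keeps the highest processed stream_position in a small JSON file."""

    def __init__(self, path: str) -> None:
        self._path = path

    def load(self) -> Optional[int]:
        """Return the stored position, or None if nothing usable is stored."""
        try:
            with open(self._path, "r", encoding="utf-8") as f:
                return int(json.load(f)["last_stream_position"])
        except (FileNotFoundError, ValueError, KeyError, TypeError):
            return None

    def advance(self, stream_position: int) -> int:
        """Record a processed position; never move backwards."""
        current = self.load()
        if current is not None and stream_position <= current:
            return current
        directory = os.path.dirname(os.path.abspath(self._path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump({"last_stream_position": stream_position}, f)
            f.flush()
            os.fsync(f.fileno())         # make sure the bytes reach the disk
        os.replace(tmp_path, self._path)  # atomic swap: no half-written file
        return stream_position


def sync_path(last_stream_position: int) -> str:
    """Build the catch-up request path to issue before resuming the WS."""
    return "/v1/sync?" + urlencode({"since": last_stream_position})
```

Call advance() only after an event has been fully processed, so a crash mid-event causes a replay rather than a gap.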
Where to look (admin)
- Logs: ws.connect, ws.disconnect, ws.fanout_failed.
- Metric: scaiwave_ws_connections{tenant} should be stable during normal operation.
- Metric: scaiwave_ws_messages_dropped_total should stay at zero; non-zero is bad.
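The two metric checks can be scripted if the metrics are scraped as text. The sketch below assumes a Prometheus-style text exposition, which the {tenant} label notation suggests but the text does not confirm. The function name, the sample values, and the tenant names are hypothetical, and the parser is deliberately simplified: it ignores optional timestamps and label values containing spaces.

```python
def metric_total(exposition_text: str, name: str) -> float:
    """Sum a metric's samples across all label sets in exposition text."""
    total = 0.0
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                               # skip blanks, HELP and TYPE lines
        series, _, value = line.rpartition(" ")    # "name{labels}" and "value"
        if series.split("{", 1)[0] == name:
            total += float(value)
    return total


# Made-up sample data, for illustration only.
SAMPLE = """\
# TYPE scaiwave_ws_connections gauge
scaiwave_ws_connections{tenant="acme"} 12
scaiwave_ws_connections{tenant="globex"} 3
# TYPE scaiwave_ws_messages_dropped_total counter
scaiwave_ws_messages_dropped_total 5
"""

if metric_total(SAMPLE, "scaiwave_ws_messages_dropped_total") > 0:
    print("messages are being dropped; check ws.fanout_failed in the logs")
```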