WebSocket keeps disconnecting

Symptom: Connects, drops every few seconds

Causes:

Proxy or load balancer has too low an idle timeout. ScaiWave pings every 30s; if a middlebox closes idle connections at 20s, you get drops. Increase the LB / nginx proxy_read_timeout past 60s (recommend 75–90s).
Sticky sessions not configured. Some LBs round-robin upgrade requests, so the WS gets routed to a pod that doesn't have the session. Configure sticky sessions on the LB.
Network conditions (mobile network, VPN). Less fixable; the client should recover automatically by falling back to the /v1/sync long-poll.
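The timing relationship in the first cause can be checked with a few lines of arithmetic. The sketch below is only illustrative: `classify_idle_timeout` and its "marginal" band are assumptions, derived from the 30s ping interval and the "past 60s" guidance above, and are not part of ScaiWave.

```python
def classify_idle_timeout(idle_timeout_s: float, ping_interval_s: float = 30.0) -> str:
    """Classify a proxy idle timeout against the WebSocket ping interval.

    Hypothetical helper. The thresholds are assumptions based on the
    guidance above: pings every 30s, and a timeout raised past 60s.
    """
    if idle_timeout_s <= ping_interval_s:
        # The proxy can close the connection before the next ping arrives.
        return "too_low"
    if idle_timeout_s <= 2 * ping_interval_s:
        # One delayed or lost ping can be enough to reach the timeout.
        return "marginal"
    return "ok"


for timeout in (20, 60, 75):
    print(timeout, classify_idle_timeout(timeout))
```

With the defaults, a 20s timeout is classified as too low, 60s as marginal, and 75s as ok, which matches the recommendation to aim for 75–90s.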

Symptom: Drops without close frame

The connection just goes silent: no close frame, no error. This usually points to a misconfigured intermediary. To test:

```bash
websocat -v "wss://your-host/v1/stream?token=$TOKEN"
```

Listen for at least 60 seconds. If nothing arrives in that window, something in the path is dropping the connection silently.

Symptom: Close frame with code 4001

SW_AUTH_INVALID_TOKEN. Token expired or invalid. The web client should refresh and reconnect; if it doesn't, check the auth refresh path.

Symptom: Close frame with code 4003

Server-side abort. Look for ws.disconnect in logs:

reason = "ping_timeout" → the client failed to answer too many pings with a pong.
reason = "duplicate_connection" → another client signed in with the same token; the older one is closed.
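If you are writing your own client, the close-code symptoms above can be folded into one dispatch function. This is a minimal sketch, not the web client's actual logic: the function and the action names it returns are hypothetical. The only codes taken from this guide are 4001 and 4003; 1006 is the standard WebSocket code that client libraries report when a connection ends without a close frame.

```python
from typing import Optional

# 4001 and 4003 come from this guide. 1006 is the standard code that client
# libraries report when the connection ended without a close frame (RFC 6455).
AUTH_INVALID_TOKEN = 4001
SERVER_ABORT = 4003
ABNORMAL_CLOSURE = 1006


def next_action(close_code: Optional[int]) -> str:
    """Map a close code to a next step. The action names are hypothetical."""
    if close_code == AUTH_INVALID_TOKEN:
        # SW_AUTH_INVALID_TOKEN: get a fresh token before reconnecting.
        return "refresh_token_then_reconnect"
    if close_code == SERVER_ABORT:
        # Server-side abort: the reason is in the ws.disconnect log entry.
        return "inspect_ws_disconnect_reason"
    if close_code is None or close_code == ABNORMAL_CLOSURE:
        # No close frame at all: suspect a proxy or load balancer in the path.
        return "check_intermediaries"
    return "reconnect_with_backoff"


for code in (4001, 4003, 1006, None, 1001):
    print(code, next_action(code))
```

Keeping the mapping in one place makes it easy to log which branch was taken, which helps when correlating client behavior with the server's ws.disconnect entries.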

Symptom: Connects fine but no events arrive

You see the hello frame, but then nothing, even when you know messages are being sent.

Wrong tenant scope: are you signed in as a different tenant than the one with traffic? Check the hello frame's tenant_id.
Rate limit: the events you're sending fail rate-limit checks, so they never enter the stream.
WS-side bug: rare, but check the server logs for ws.fanout_failed. If there are many entries, restart the API pod.
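The tenant check is easy to automate once you have captured the hello frame. The sketch below assumes the hello frame is a JSON object; only the tenant_id field is taken from this guide, and the helper name and the other fields in the sample payload are made up for illustration.

```python
import json


def tenant_mismatch(hello_frame: str, expected_tenant_id: str) -> bool:
    """Return True when the hello frame belongs to a different tenant.

    Assumption: the hello frame is a JSON object with a tenant_id field.
    """
    frame = json.loads(hello_frame)
    return frame.get("tenant_id") != expected_tenant_id


# Example payload; every field other than tenant_id is invented.
sample = '{"type": "hello", "tenant_id": "tenant-a"}'
print(tenant_mismatch(sample, "tenant-a"))
print(tenant_mismatch(sample, "tenant-b"))
```

A missing tenant_id is also reported as a mismatch, which is the safer default when debugging.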

Reconnect strategy

The web client uses:

First disconnect → reconnect immediately.
Successive failures → exponential backoff (1s, 2s, 4s, 8s, max 30s).
On reconnect → /v1/sync?since=<last_stream_position> to bridge the gap, then resume the WS.

If you're writing your own client, copy that pattern. Don't try to keep the WS open across long network outages; fall back to sync.
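The delay schedule above can be sketched as a generator. This is an illustration rather than the web client's code: the function name and parameters are hypothetical, and the 16s step is an assumption that follows from continuing the doubling between 8s and the 30s cap.

```python
import itertools
from typing import Iterator


def reconnect_delays(base_s: float = 1.0, cap_s: float = 30.0) -> Iterator[float]:
    """Yield the wait before each reconnect attempt, in seconds.

    The first attempt is immediate. Later attempts double from base_s and
    are capped at cap_s.
    """
    yield 0.0
    delay = base_s
    while True:
        yield min(delay, cap_s)
        delay = min(delay * 2, cap_s)


first_eight = list(itertools.islice(reconnect_delays(), 8))
print(first_eight)
```

A client would create a fresh generator after each successful connection, so the next disconnect starts again with an immediate retry.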

Stream-position tracking

Persist last_stream_position (the highest stream_position you've processed) somewhere durable on the client. On reconnect, query /v1/sync?since=<last_stream_position> first. Without this, you miss events that happened during the disconnect.
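One way to do this on a client with local disk is a small file-backed store. The sketch below is a minimal example under stated assumptions: stream_position is treated as an integer, 0 stands for "nothing processed yet", and the class and function names are hypothetical. Only the /v1/sync path and the since parameter come from this guide.

```python
import os
import tempfile
from urllib.parse import urlencode


class StreamPositionStore:
    """Persist the highest processed stream_position in a local file."""

    def __init__(self, path: str) -> None:
        self.path = path

    def load(self) -> int:
        try:
            with open(self.path, "r", encoding="utf-8") as f:
                return int(f.read().strip())
        except (FileNotFoundError, ValueError):
            return 0  # Assumption: 0 means "nothing processed yet".

    def advance(self, position: int) -> None:
        # Never move backwards: replayed events must not lower the mark.
        if position <= self.load():
            return
        # Write to a temp file, then rename, so a crash cannot leave a
        # partially written value behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(str(position))
        os.replace(tmp_path, self.path)


def sync_path(last_stream_position: int) -> str:
    """Build the catch-up request path used before resuming the WS."""
    return "/v1/sync?" + urlencode({"since": last_stream_position})


store = StreamPositionStore(os.path.join(tempfile.mkdtemp(), "last_stream_position"))
store.advance(41)
store.advance(42)
store.advance(40)  # Ignored: lower than the stored position.
print(sync_path(store.load()))
```

Call advance only after an event has been fully processed; advancing on receipt would skip events if the client crashes mid-processing.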

Where to look (admin)

Logs: ws.connect, ws.disconnect, ws.fanout_failed.
Metric: scaiwave_ws_connections{tenant}: should be stable during normal operation.
Metric: scaiwave_ws_messages_dropped_total: any non-zero value is bad and worth investigating.
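If these metrics are exposed in the Prometheus text format (an assumption; this guide does not say how they are scraped), the dropped-message check can be scripted. The helper below is a hypothetical sketch that sums the counter across all label sets and assumes samples carry no trailing timestamp. The sample text is invented.

```python
def dropped_messages(metrics_text: str) -> float:
    """Sum all samples of scaiwave_ws_messages_dropped_total in a scrape."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # Skip blank lines and HELP/TYPE comments.
        # Assumption: each sample is "name{labels} value" with no timestamp.
        name_and_labels, _, value = line.rpartition(" ")
        metric_name = name_and_labels.split("{", 1)[0]
        if metric_name == "scaiwave_ws_messages_dropped_total":
            total += float(value)
    return total


# Invented sample scrape for illustration.
sample_scrape = """\
# TYPE scaiwave_ws_connections gauge
scaiwave_ws_connections{tenant="tenant-a"} 12
# TYPE scaiwave_ws_messages_dropped_total counter
scaiwave_ws_messages_dropped_total 3
"""

print(dropped_messages(sample_scrape))
```

A result above zero is the signal to look at the ws.fanout_failed log entries next.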