Architecture
ScaiLink is a single ScaiGrid module that runs two complementary surfaces: an inbound WebSocket bridge for desktop MCP clients, and an outbound HTTP client for hosted MCP servers the user has registered. Both surfaces feed into the same audit pipeline and the same /mcp aggregation an agent talks to.
Components
There is no separate ScaiLink deployment. The module lives in the same FastAPI process as the rest of ScaiGrid, with state in the shared MariaDB instance and Redis used for session-pool and capability-catalog caching.
Desktop bridge flow
- The user's desktop client opens `wss://scaigrid.scailabs.ai/v1/scailink/ws` with their JWT.
- The handler authenticates the JWT and waits for a `scailink/session_init` frame containing device name, platform, capability catalog, and audit settings.
- A session id (`slink_...`) is minted in Redis under `scailink:session:{user_id}:{device_id}`. The capability catalog is mirrored into the catalog store.
- The WebSocket joins the per-process `ConnectionPool` so REST callers can find it by `(user_id, device_id)`.
- REST callers (`POST /users/{user_id}/tools/{tool_name}/invoke`) look up the device, push a `scailink/tool_invoke` frame, and wait for the response. Consent flows interleave with `scailink/consent_request` frames the client surfaces to the user.
- Heartbeats every 30 seconds; on disconnect a 120-second grace period lets a flaky connection reconnect without losing the catalog.
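The session lifecycle above (mint or resume, then a 120-second grace period after a disconnect) can be sketched with an in-memory dictionary standing in for Redis. This is an illustrative sketch, not ScaiLink's code: `SessionStoreSketch`, `SessionRow`, and the injectable `clock` are hypothetical names, heartbeats are not modeled, and only the key layout, the `slink_` prefix, and the 120-second window come from the description above.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional

GRACE_SECONDS = 120  # reconnect window described above


def session_key(user_id: str, device_id: str) -> str:
    """Key layout described above: scailink:session:{user_id}:{device_id}."""
    return f"scailink:session:{user_id}:{device_id}"


@dataclass
class SessionRow:
    session_id: str
    catalog: dict
    disconnected_at: Optional[float] = None  # None while the WebSocket is open


class SessionStoreSketch:
    """Hypothetical in-memory stand-in for the Redis session rows."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._rows: Dict[str, SessionRow] = {}
        self._clock = clock

    def _expired(self, row: SessionRow) -> bool:
        return (
            row.disconnected_at is not None
            and self._clock() - row.disconnected_at > GRACE_SECONDS
        )

    def open(self, user_id: str, device_id: str, catalog: Optional[dict]) -> SessionRow:
        """Handle session_init: resume inside the grace window, else mint a new slink_ id."""
        key = session_key(user_id, device_id)
        row = self._rows.get(key)
        if row is not None and not self._expired(row):
            row.disconnected_at = None
            if catalog is not None:  # keep the old catalog unless a new one arrives
                row.catalog = catalog
            return row
        row = SessionRow(session_id="slink_" + secrets.token_hex(8), catalog=catalog or {})
        self._rows[key] = row
        return row

    def disconnect(self, user_id: str, device_id: str) -> None:
        """Start the grace period instead of deleting the row."""
        row = self._rows.get(session_key(user_id, device_id))
        if row is not None:
            row.disconnected_at = self._clock()

    def reap(self) -> int:
        """Drop rows (and their catalogs) whose grace period has expired."""
        expired = [k for k, row in self._rows.items() if self._expired(row)]
        for key in expired:
            del self._rows[key]
        return len(expired)
```

A reconnect inside the window returns the same `slink_` id and catalog; after the window, the row is reaped and the next `session_init` mints a fresh session.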
Cloud registry flow
- A caller with the required permissions calls `POST /v1/modules/scailink/remote-servers`. The service writes the server row, encrypts credentials field-by-field with AES-256-GCM (a per-credential DEK wrapped by the platform KEK), then runs a first discovery.
- Discovery opens the endpoint over the chosen transport (`streamable_http` by default, `sse` for legacy servers), runs `tools/list`, `resources/list`, and `prompts/list`, and writes one capability row per item under `(server_id, kind, name)`.
- The capabilities then appear in the platform `/mcp` catalog under `remote.{user_id}.{slug}.{tool_name}` (personal) or `remote.tenant.{slug}.{tool_name}` (tenant-shared).
- When ScaiMCP receives `tools/call` on a namespaced name, it resolves the registered server, asks `RemoteSessionPool` for a session (`MAX_LIVE_SESSIONS=50` per worker, idle TTL 5 minutes), and forwards the call.
- The refresh cron runs every 15 minutes with a per-tenant budget of 10 servers; three consecutive failures flip the server to `status='error'` and remove its tools from the aggregated catalog until a successful refresh recovers it.
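The catalog names above can be built and parsed with plain string handling. A minimal sketch, assuming user IDs and slugs never contain dots and no user ID is literally `tenant` (neither rule is stated on this page); `namespaced_name`, `parse_namespaced`, and `ToolRef` are hypothetical helpers, not ScaiMCP's API.

```python
from typing import NamedTuple, Optional


class ToolRef(NamedTuple):
    scope: str              # "personal" or "tenant"
    user_id: Optional[str]  # set only for personal registrations
    slug: str
    tool_name: str


def namespaced_name(slug: str, tool_name: str, user_id: Optional[str] = None) -> str:
    """remote.{user_id}.{slug}.{tool_name} (personal) or remote.tenant.{slug}.{tool_name}."""
    owner = "tenant" if user_id is None else user_id
    return f"remote.{owner}.{slug}.{tool_name}"


def parse_namespaced(name: str) -> ToolRef:
    # maxsplit=3 leaves any dots inside the tool name intact
    parts = name.split(".", 3)
    if len(parts) != 4 or parts[0] != "remote":
        raise ValueError(f"not a remote tool name: {name!r}")
    _, owner, slug, tool_name = parts
    if owner == "tenant":
        return ToolRef("tenant", None, slug, tool_name)
    return ToolRef("personal", owner, slug, tool_name)
```

Resolving a `tools/call` then amounts to parsing the name and looking up the registered server by owner and slug; two servers exposing the same tool name never collide because their slugs differ.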
State
- Sessions, heartbeats, grace periods: Redis, keyed by `(user_id, device_id)`.
- Capability catalogs from desktop clients: Redis, keyed the same way.
- Audit events: the `mod_scailink_audit_log` table; retained by tenant policy.
- Remote servers, credentials, capabilities: the `mod_scailink_remote_server`, `mod_scailink_remote_credential`, and `mod_scailink_remote_capability` tables.
- Live outbound MCP sessions: in-process memory in `RemoteSessionPool`; LRU-capped, idle-swept.
Where the trust boundary is
The desktop client controls what's exposed. ScaiGrid never reaches into a user's machine on its own: every tool call goes through the open WebSocket, which the user can close at any time. Consent prompts on first use are explicit; auto-approval requires the user to configure a consent policy in advance.
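The consent rule in the paragraph above reduces to a small decision function. A minimal sketch: representing the consent policy as a set of tool names, and remembering a user's approval after the first prompt, are assumptions made for illustration; `decide_consent` and `ConsentDecision` are hypothetical names.

```python
from enum import Enum
from typing import AbstractSet


class ConsentDecision(Enum):
    ALLOW = "allow"    # run the tool without asking
    PROMPT = "prompt"  # send scailink/consent_request and wait for the user


def decide_consent(
    tool_name: str,
    approved_by_user: AbstractSet[str],
    auto_approve_policy: AbstractSet[str] = frozenset(),
) -> ConsentDecision:
    """Hypothetical rule: auto-approve only what the user configured in advance."""
    if tool_name in auto_approve_policy:
        return ConsentDecision.ALLOW
    if tool_name in approved_by_user:  # assumed: approval is remembered after first use
        return ConsentDecision.ALLOW
    return ConsentDecision.PROMPT  # first use with no policy: ask explicitly
```

The point of the shape is that nothing reaches `ALLOW` without either an earlier explicit approval or a policy the user wrote before the call arrived.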
The cloud registry is the inverse: ScaiGrid holds the credentials and calls out. The credential write path is one-way — values go in via POST /remote-servers or PUT /credentials/{field}, are encrypted, and never come back out through the API. Only the outbound runtime can decrypt to make a call.
User-ID forwarding to hosted servers is opt-in. By default, the third party doesn't see internal user IDs. Setting `forward_user_id` on the registration adds an `X-ScaiGrid-User: {user_id}` header to outbound requests, which is useful when the third party needs per-user attribution.
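The opt-in header can be applied where outbound requests are assembled. A sketch under assumptions: `build_outbound_headers` is a hypothetical helper, and stripping any caller-supplied copy of the header (in any casing) is a defensive choice added here, not documented behavior; only `forward_user_id` and the header name come from the text.

```python
from typing import Dict, Mapping

USER_HEADER = "X-ScaiGrid-User"


def build_outbound_headers(
    base: Mapping[str, str], user_id: str, forward_user_id: bool
) -> Dict[str, str]:
    """Add X-ScaiGrid-User only when the registration opted in."""
    # Drop any pre-existing copy, in any casing, so the header can't be spoofed
    headers = {k: v for k, v in base.items() if k.lower() != USER_HEADER.lower()}
    if forward_user_id:
        headers[USER_HEADER] = user_id
    return headers
```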
How it differs from raw MCP
A raw MCP client talks to one server. ScaiLink is the multi-tenant, audited, credential-managing layer in front of many servers and many users:
| Concern | Raw MCP | ScaiLink |
|---|---|---|
| Auth | Per-call header | Stored, encrypted, rotatable |
| Aggregation | Per-app | Platform-wide via ScaiMCP |
| Audit | You instrument it | Built-in for both surfaces |
| Health checks | You write a cron | Built-in every 15 min |
| Naming collisions | You handle them | Namespaced slug per server |
| Consent UI | You build it | Built into the desktop bridge |
| Session reuse | You manage it | Per-(user, server) pool with idle TTL |
Per-process and shared state
A single ScaiGrid worker holds two pieces of in-memory state that other workers don't see:
- The desktop `ConnectionPool`: the open WebSocket objects. A REST call that lands on a worker that doesn't own the target WebSocket is forwarded internally over Redis pub/sub by the gateway logic, so this isn't a correctness issue, just a transparent indirection.
- The `RemoteSessionPool`: warm outbound MCP sessions. Each worker keeps its own, so with `uvicorn --workers N` the cache hit rate scales roughly as 1/N. Correctness is unaffected: a miss just pays the handshake.
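A bounded, idle-swept pool like `RemoteSessionPool` can be sketched with an `OrderedDict` used as an LRU. This is not the module's implementation: `SessionPoolSketch`, the injected `connect`/`close` callables, and the synchronous interface are assumptions (a real MCP client would be async); the 50-session cap, the 5-minute idle TTL, and per-`(user, server)` keys come from this page.

```python
import time
from collections import OrderedDict
from typing import Callable, Hashable, Tuple

MAX_LIVE_SESSIONS = 50     # per worker, as documented
IDLE_TTL_SECONDS = 5 * 60  # documented idle TTL


class SessionPoolSketch:
    """Hypothetical per-worker pool keyed by (user_id, server_id)."""

    def __init__(
        self,
        connect: Callable[[Hashable], object],
        close: Callable[[object], None],
        clock: Callable[[], float] = time.monotonic,
        max_live: int = MAX_LIVE_SESSIONS,
        idle_ttl: float = IDLE_TTL_SECONDS,
    ) -> None:
        self._connect, self._close, self._clock = connect, close, clock
        self._max_live, self._idle_ttl = max_live, idle_ttl
        # key -> (session, last_used); insertion order doubles as LRU order
        self._sessions: "OrderedDict[Hashable, Tuple[object, float]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable) -> object:
        now = self._clock()
        if key in self._sessions:
            session, _ = self._sessions[key]
            self._sessions[key] = (session, now)
            self._sessions.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return session
        self.misses += 1
        session = self._connect(key)  # a miss pays the MCP handshake
        self._sessions[key] = (session, now)
        while len(self._sessions) > self._max_live:
            _, (evicted, _) = self._sessions.popitem(last=False)  # least recently used
            self._close(evicted)
        return session

    def sweep(self) -> int:
        """Close sessions idle longer than the TTL; returns how many were closed."""
        now = self._clock()
        idle = [k for k, (_, used) in self._sessions.items() if now - used > self._idle_ttl]
        for key in idle:
            session, _ = self._sessions.pop(key)
            self._close(session)
        return len(idle)
```

Because each worker owns one such object, a request routed to a different worker simply misses and reconnects, which is why the per-worker split costs hit rate but not correctness.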
Everything else is shared: Redis for session, catalog, and grace state; MariaDB for the registry rows; the platform KEK (via ScaiVault in production, settings in dev) for unwrapping credential DEKs at invocation time.
What runs when
- At process boot. The module's `initialize` creates the `ConnectionPool` on `app.state`.
- On every WebSocket connection. The handler authenticates, accepts `session_init`, mints or resumes a session in Redis, stores the capability catalog, and joins the pool.
- Every 15 minutes. The `refresh_remote_servers` cron walks `status='active'` rows in stale-first order, with a 10-server-per-tenant budget, re-running discovery and upserting capability rows.
- Every five minutes (per process). The session-pool sweeper closes outbound MCP sessions idle past their TTL.
- Every minute. `cleanup_stale_sessions` reaps Redis session rows whose grace period has expired.
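The stale-first, per-tenant budget used by `refresh_remote_servers` can be sketched as one sort plus a counter per tenant. Assumed for illustration: staleness is measured by a `last_refreshed_at` timestamp, and `ServerRow` and `select_for_refresh` are hypothetical names; the active-only filter and the 10-server budget come from the list above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List

PER_TENANT_BUDGET = 10


@dataclass
class ServerRow:
    server_id: str
    tenant_id: str
    status: str
    last_refreshed_at: float  # hypothetical column: epoch seconds of the last discovery


def select_for_refresh(rows: Iterable[ServerRow], budget: int = PER_TENANT_BUDGET) -> List[ServerRow]:
    """Stale-first walk over active rows, capped at `budget` servers per tenant."""
    taken: Dict[str, int] = defaultdict(int)
    picked: List[ServerRow] = []
    active = (r for r in rows if r.status == "active")
    for row in sorted(active, key=lambda r: r.last_refreshed_at):  # oldest first
        if taken[row.tenant_id] < budget:
            taken[row.tenant_id] += 1
            picked.append(row)
    return picked
```

The budget keeps one tenant with many registrations from starving the others within a single tick; rows that miss the cut are the freshest ones and move up the queue next time.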
What it does not do
- ScaiLink does not host or run agents. Agents live in ScaiCore / ScaiBot / external MCP clients and consume ScaiLink's catalog through ScaiMCP.
- ScaiLink does not store the content of tool results unless `audit_detail_level=full` was selected at `session_init`; by default, only metadata (action, target name, arguments outline, status, duration) is retained.
- ScaiLink does not yet support OAuth2 refresh-token flows for the cloud registry; OAuth2 is on the v1.2 roadmap. JWT auth from ScaiKey is not used outbound because most third-party MCP servers don't trust that issuer.
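The default metadata-only audit record can be sketched as a dictionary builder. Hypothetical throughout: the field names, reading "arguments outline" as the sorted argument names, and `build_audit_event` itself are assumptions; only the `full` level and the list of retained metadata come from the text.

```python
from typing import Any, Dict, Mapping, Optional


def build_audit_event(
    action: str,
    target_name: str,
    arguments: Mapping[str, Any],
    status: str,
    duration_ms: int,
    result: Any = None,
    audit_detail_level: Optional[str] = None,
) -> Dict[str, Any]:
    event: Dict[str, Any] = {
        "action": action,
        "target_name": target_name,
        # assumed meaning of "arguments outline": argument names only, no values
        "arguments_outline": sorted(arguments),
        "status": status,
        "duration_ms": duration_ms,
    }
    if audit_detail_level == "full":
        event["result"] = result  # tool-result content is kept only at "full"
    return event
```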
Background tasks
ScaiLink contributes two cron jobs to the platform's worker:
- `cleanup_stale_sessions`: runs every five minutes. Scans Redis for session rows whose grace period has expired, removes them, and drops their capability catalog so the data doesn't outlive the user's intent.
- `refresh_remote_servers`: runs every fifteen minutes. Walks `status='active'` rows in stale-first order with a per-tenant budget, re-running discovery and upserting capability rows. Three consecutive failures flip a row to `status='error'`; a successful refresh restores it.
Both tasks are intentionally resilient to single-server failures — one bad endpoint doesn't break the cron tick.
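The per-server isolation and the three-strikes rule described above fit in a few lines. A sketch, assuming a `consecutive_failures` counter on each row; `RemoteServer`, `record_result`, `refresh_tick`, and the `discover` callable are hypothetical names. Catching `Exception` per server is what keeps one bad endpoint from breaking the tick.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

FAILURE_THRESHOLD = 3


@dataclass
class RemoteServer:
    server_id: str
    status: str = "active"
    consecutive_failures: int = 0  # hypothetical counter column


def record_result(server: RemoteServer, ok: bool) -> None:
    if ok:
        server.consecutive_failures = 0
        server.status = "active"  # a successful refresh restores the row
    else:
        server.consecutive_failures += 1
        if server.consecutive_failures >= FAILURE_THRESHOLD:
            server.status = "error"


def refresh_tick(servers: Iterable[RemoteServer], discover: Callable[[RemoteServer], None]) -> None:
    """One cron tick: each server's failure is recorded, never allowed to abort the loop."""
    for server in servers:
        try:
            discover(server)
        except Exception:
            record_result(server, ok=False)
        else:
            record_result(server, ok=True)
```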
Inside each worker, a 30-second sweeper closes outbound MCP sessions idle past their 5-minute TTL, so the pool stays bounded without depending on the platform cron.
Failure modes worth knowing
- Desktop disconnects mid-invocation. The pending REST call surfaces a `CLIENT_DISCONNECTED` (-32006) error. The session enters its 120-second grace period; if the client reconnects in time the catalog survives, otherwise the session is reaped on the next `cleanup_stale_sessions` tick.
- Cloud server returns garbage. Discovery fails with `RemoteClientError`; the row is committed anyway with `status='error'` so credentials don't need re-entering. Fix the upstream server and trigger a refresh.
- KEK is missing. Every cloud-registry call returns 503 with `SCAILINK_REGISTRY_DISABLED`. Set `encryption_local_kek` in settings (production wires ScaiVault instead).
- Worker restart. Open WebSockets drop; clients reconnect within their grace period and resume. Warm outbound MCP sessions are lost; the next call pays the handshake.
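The first failure mode can be sketched with one `asyncio.Future` per in-flight `scailink/tool_invoke` frame: when the socket drops, every pending future fails with the `-32006` code. `PendingInvocations`, `ClientDisconnected`, and the JSON-RPC error body shape are hypothetical; only the error name and code come from the list above.

```python
import asyncio
import itertools
from typing import Any, Dict, Tuple

CLIENT_DISCONNECTED = -32006


class ClientDisconnected(Exception):
    code = CLIENT_DISCONNECTED


class PendingInvocations:
    """Hypothetical per-connection table of tool_invoke frames awaiting a reply."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self._pending: Dict[int, "asyncio.Future[Any]"] = {}

    def start(self) -> Tuple[int, "asyncio.Future[Any]"]:
        request_id = next(self._ids)
        future = asyncio.get_running_loop().create_future()
        self._pending[request_id] = future
        return request_id, future

    def resolve(self, request_id: int, result: Any) -> None:
        """Called when the client's response frame arrives."""
        future = self._pending.pop(request_id, None)
        if future is not None and not future.done():
            future.set_result(result)

    def fail_all(self) -> None:
        """Called on WebSocket disconnect: every waiting REST caller gets the error."""
        for future in self._pending.values():
            if not future.done():
                future.set_exception(ClientDisconnected("desktop client disconnected"))
        self._pending.clear()


def jsonrpc_error(request_id: int, exc: ClientDisconnected) -> Dict[str, Any]:
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "error": {"code": exc.code, "message": "CLIENT_DISCONNECTED"},
    }
```

Failing the pending calls is independent of the grace period: the REST caller gets its error immediately, while the session row survives for up to 120 seconds in case the client comes back.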