Architecture

ScaiLink is a single ScaiGrid module that runs two complementary surfaces: an inbound WebSocket bridge for desktop MCP clients, and an outbound HTTP client for hosted MCP servers the user has registered. Both surfaces feed into the same audit pipeline and the same /mcp aggregation an agent talks to.

Components

flowchart LR
    DC["Desktop client<br/>local MCP servers<br/>+ consent UI"]
    HMS["Hosted MCP server<br/>(third-party MCP endpoint)"]
    subgraph SG ["ScaiGrid FastAPI process"]
        WS["/v1/scailink/ws"]
        CP["ConnPool"]
        SESS["/v1/modules/scailink/sessions"]
        CAP["/v1/modules/scailink/capabilities"]
        AUD["/v1/modules/scailink/audit"]
        RS["/v1/modules/scailink/remote-servers"]
        REG["registry"]
        RSS["RemoteServerService<br/>RemoteCredentialStore<br/>RemoteSessionPool"]
        MCP["ScaiMCP /v1/modules/scaimcp/<br/>aggregates both"]
    end
    DC <-- "WebSocket JSON-RPC 2.0<br/>session_init, invoke, catalog_update" --> WS
    WS --> CP
    HMS <-- "streamable_http / sse<br/>tools/list, tools/call, resources/list" --> RSS
    RS --> REG
    REG --> RSS
    RSS --> MCP
    CP --> MCP

There is no separate ScaiLink deployment. The module lives in the same FastAPI process as the rest of ScaiGrid, with state in the shared MariaDB instance and Redis used for session-pool and capability-catalog caching.

Desktop bridge flow

  1. The user's desktop client opens wss://scaigrid.scailabs.ai/v1/scailink/ws with their JWT.
  2. The handler authenticates the JWT and waits for a scailink/session_init frame containing device name, platform, capability catalog, and audit settings.
  3. A session id (slink_...) is minted in Redis under scailink:session:{user_id}:{device_id}. The capability catalog is mirrored into the catalog store.
  4. The WebSocket joins the per-process ConnectionPool so REST callers can find it by (user_id, device_id).
  5. REST callers (POST /users/{user_id}/tools/{tool_name}/invoke) look up the device, push a scailink/tool_invoke frame, and wait for the response. Consent flows interleave with scailink/consent_request frames the client surfaces to the user.
  6. Heartbeats run every 30 seconds. On disconnect, a 120-second grace period lets a flaky connection reconnect without losing the catalog.
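
As a rough sketch of steps 2 and 3, the following builds a scailink/session_init frame and the Redis key the session is stored under. The method name and key pattern come from the flow above; the parameter field names (device_name, platform, capabilities, audit), the request id, the "metadata" default, and both helper names are assumptions for illustration, not the module's actual schema.

```python
import json


def session_key(user_id: str, device_id: str) -> str:
    """Redis key for a desktop session, following the pattern in step 3."""
    return f"scailink:session:{user_id}:{device_id}"


def build_session_init(device_name: str, platform: str, capabilities: list[dict],
                       audit_detail_level: str = "metadata") -> str:
    """Serialize a JSON-RPC 2.0 session_init frame.

    The params layout is hypothetical; only the method name is from the docs.
    """
    frame = {
        "jsonrpc": "2.0",
        "id": 1,  # assumed: session_init is sent as a request, not a notification
        "method": "scailink/session_init",
        "params": {
            "device_name": device_name,
            "platform": platform,
            "capabilities": capabilities,
            "audit": {"detail_level": audit_detail_level},
        },
    }
    return json.dumps(frame)


frame = build_session_init("laptop", "linux", [{"kind": "tool", "name": "read_file"}])
print(session_key("u-123", "d-456"))
print(frame)
```

A desktop client would send the serialized frame as the first message after the WebSocket opens; the server side would derive the key from the authenticated user and the device it announces.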

Cloud registry flow

  1. A caller with the right permissions sends POST /v1/modules/scailink/remote-servers. The service writes the server row, encrypts credentials field-by-field with AES-256-GCM (per-credential DEK wrapped by the platform KEK), then runs a first discovery.
  2. Discovery opens the endpoint over the chosen transport (streamable_http by default, sse for legacy servers), runs tools/list plus resources/list plus prompts/list, and writes one capability row per item under (server_id, kind, name).
  3. The capabilities then appear in the platform /mcp catalog under remote.{user_id}.{slug}.{tool_name} (personal) or remote.tenant.{slug}.{tool_name} (tenant-shared).
  4. When ScaiMCP receives tools/call on a namespaced name, it resolves the registered server, asks RemoteSessionPool for a session (MAX_LIVE_SESSIONS=50 per worker, idle TTL 5 minutes), and forwards the call.
  5. The refresh cron runs every 15 minutes with a per-tenant budget of 10 servers; three consecutive failures flip the server to status='error' and remove its tools from the aggregated catalog until a successful refresh recovers it.
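
The naming scheme in step 3 can be illustrated with a pair of small helpers. This is a minimal sketch, assuming that neither the scope segment nor the slug contains a dot and that no user id is literally "tenant"; the function names are hypothetical and are not part of ScaiLink's API.

```python
from typing import Optional


def namespaced_tool_name(slug: str, tool_name: str, user_id: Optional[str] = None) -> str:
    """Personal servers are scoped by user id; tenant-shared ones use the literal 'tenant'."""
    scope = user_id if user_id is not None else "tenant"
    return f"remote.{scope}.{slug}.{tool_name}"


def parse_namespaced_tool_name(name: str) -> tuple[str, str, str]:
    """Split a namespaced name into (scope, slug, tool_name).

    Assumes scope and slug contain no dots; the tool name itself may.
    Raises ValueError for names that do not have the remote.* shape.
    """
    prefix, scope, slug, tool_name = name.split(".", 3)
    if prefix != "remote":
        raise ValueError(f"not a remote tool name: {name!r}")
    return scope, slug, tool_name


print(namespaced_tool_name("github", "create_issue", user_id="u42"))
print(parse_namespaced_tool_name("remote.tenant.github.repo.search"))
```

Splitting with a maximum of three separators keeps any dots inside the upstream tool name intact, which matters because the third party chooses those names.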

State

  • Sessions, heartbeats, grace periods — Redis, keyed by (user_id, device_id).
  • Capability catalogs from desktop clients — Redis, keyed the same way.
  • Audit events — mod_scailink_audit_log table; retained by tenant policy.
  • Remote servers, credentials, capabilities — mod_scailink_remote_server, mod_scailink_remote_credential, mod_scailink_remote_capability tables.
  • Live outbound MCP sessions — in-process memory in RemoteSessionPool; LRU-capped, idle-swept.

Where the trust boundary is

The desktop client controls what's exposed. ScaiGrid never reaches into a user's machine on its own — every tool call goes through the open WebSocket, which the user can close at any time. Consent prompts on first-use are explicit; auto-approval requires the user to configure a consent policy in advance.

The cloud registry is the inverse: ScaiGrid holds the credentials and calls out. The credential write path is one-way — values go in via POST /remote-servers or PUT /credentials/{field}, are encrypted, and never come back out through the API. Only the outbound runtime can decrypt to make a call.

User-id forwarding to hosted servers is opt-in. By default, the third party doesn't see internal user IDs; flip forward_user_id on the registration to add X-ScaiGrid-User: {user_id} to outbound headers, useful when the third party needs per-user attribution.
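
The opt-in behaviour can be sketched as a header builder. The flag name forward_user_id and the X-ScaiGrid-User header come from the paragraph above; the function itself and its signature are assumptions made for illustration.

```python
def outbound_headers(base_headers: dict[str, str], user_id: str,
                     forward_user_id: bool = False) -> dict[str, str]:
    """Return headers for a hosted-server call; the user id is added only on opt-in."""
    headers = dict(base_headers)  # copy so the caller's dict is not mutated
    if forward_user_id:
        headers["X-ScaiGrid-User"] = user_id
    return headers


print(outbound_headers({"Authorization": "Bearer <token>"}, "u42"))
print(outbound_headers({"Authorization": "Bearer <token>"}, "u42", forward_user_id=True))
```

Defaulting the flag to False mirrors the privacy-preserving default: a registration has to ask for attribution explicitly.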

How it differs from raw MCP

A raw MCP client talks to one server. ScaiLink is the multi-tenant, audited, credential-managing layer in front of many servers and many users:

| Concern | Raw MCP | ScaiLink |
| --- | --- | --- |
| Auth | Per-call header | Stored, encrypted, rotatable |
| Aggregation | Per-app | Platform-wide via ScaiMCP |
| Audit | You instrument it | Built-in for both surfaces |
| Health checks | You write a cron | Built-in every 15 min |
| Naming collisions | You handle them | Namespaced slug per server |
| Consent UI | You build it | Built into the desktop bridge |
| Session reuse | You manage it | Per-(user, server) pool with idle TTL |

Per-process and shared state

A single ScaiGrid worker holds two pieces of in-memory state that other workers don't see:

  • The desktop ConnectionPool — the open WebSocket objects. A REST call that lands on a worker that doesn't own the target WebSocket is forwarded internally over Redis pub/sub by the gateway logic, so this is a transparent indirection rather than a correctness issue.
  • The RemoteSessionPool — warm outbound MCP sessions. Each worker keeps its own, so with uvicorn --workers N the chance that a call lands on the worker holding its warm session falls to roughly 1/N. Correctness is unaffected; a miss just pays the handshake.

Everything else is shared: Redis for session, catalog, and grace state; MariaDB for the registry rows; the platform KEK (via ScaiVault in production, settings in dev) for unwrapping credential DEKs at invocation time.

What runs when

  • At process boot. The module's initialize creates the ConnectionPool on app.state.
  • On every WebSocket connection. The handler authenticates, accepts session_init, mints or resumes a session in Redis, stores the capability catalog, and joins the pool.
  • Every 15 minutes. The refresh_remote_servers cron walks status='active' rows in stale-first order, with a 10-server-per-tenant budget, re-running discovery and upserting capability rows.
  • Every 30 seconds (per process). The session-pool sweeper closes outbound MCP sessions idle past their 5-minute TTL.
  • Every minute. cleanup_stale_sessions reaps Redis session rows whose grace period has expired.
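
The stale-first walk with a per-tenant budget can be sketched as a selection function. The budget of 10 and the status='active' filter come from the list above; the ServerRow shape, the last_refreshed_at field, and the function name are hypothetical stand-ins for whatever the registry table actually stores.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ServerRow:
    server_id: str
    tenant_id: str
    status: str
    last_refreshed_at: float  # epoch seconds; a smaller value means a staler row


def pick_servers_to_refresh(rows: list[ServerRow],
                            per_tenant_budget: int = 10) -> list[ServerRow]:
    """Select active rows in stale-first order, capped per tenant."""
    picked: list[ServerRow] = []
    used: dict[str, int] = defaultdict(int)
    active = [r for r in rows if r.status == "active"]
    for row in sorted(active, key=lambda r: r.last_refreshed_at):
        if used[row.tenant_id] < per_tenant_budget:
            used[row.tenant_id] += 1
            picked.append(row)
    return picked


rows = [ServerRow(f"s{i}", "t1", "active", float(i)) for i in range(12)]
rows.append(ServerRow("x1", "t2", "active", 100.0))
print([r.server_id for r in pick_servers_to_refresh(rows)])
```

With a cap per tenant, one tenant with many registrations cannot crowd out the others in a single tick; its remaining servers simply wait for a later tick, by which time they are the stalest.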

What it does not do

  • ScaiLink does not host or run agents. Agents live in ScaiCore / ScaiBot / external MCP clients and consume ScaiLink's catalog through ScaiMCP.
  • ScaiLink does not store the content of tool results unless audit_detail_level=full was selected at session_init — by default, only metadata (action, target name, arguments outline, status, duration) is retained.
  • ScaiLink does not yet support OAuth2 refresh-token flows for the cloud registry; OAuth2 is on the v1.2 roadmap. JWT auth from ScaiKey is not used outbound because most third-party MCP servers don't trust that issuer.

Background tasks

ScaiLink contributes two cron jobs to the platform's worker:

  • cleanup_stale_sessions — runs every minute. Scans Redis for session rows whose grace period has expired, removes them, and drops their capability catalog so the data doesn't outlive the user's intent.
  • refresh_remote_servers — runs every fifteen minutes. Walks status='active' rows in stale-first order with a per-tenant budget, re-running discovery and upserting capability rows. Three consecutive failures flip a row to status='error'; a successful refresh restores it.

Both tasks are intentionally resilient to single-server failures — one bad endpoint doesn't break the cron tick.
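
The failure handling described above might look like the following sketch. The threshold of three and the status values come from the text; RefreshState, record_refresh, run_tick, and the discover callback are hypothetical names, and a real implementation would persist the counter in the registry row rather than in memory.

```python
from dataclasses import dataclass

FAILURE_THRESHOLD = 3  # from the text: three consecutive failures


@dataclass
class RefreshState:
    status: str = "active"
    consecutive_failures: int = 0


def record_refresh(state: RefreshState, succeeded: bool) -> RefreshState:
    """Update a server's state after one refresh attempt."""
    if succeeded:
        state.consecutive_failures = 0
        state.status = "active"  # a successful refresh restores the row
    else:
        state.consecutive_failures += 1
        if state.consecutive_failures >= FAILURE_THRESHOLD:
            state.status = "error"
    return state


def run_tick(states: dict[str, RefreshState], discover) -> None:
    """Refresh every server; an exception from one does not stop the tick."""
    for server_id, state in states.items():
        try:
            discover(server_id)
            record_refresh(state, True)
        except Exception:
            record_refresh(state, False)


def flaky_discover(server_id: str) -> None:
    # Stand-in for real discovery: one endpoint always fails.
    if server_id == "bad":
        raise RuntimeError("upstream returned garbage")


states = {"bad": RefreshState(), "good": RefreshState()}
for _ in range(3):
    run_tick(states, flaky_discover)
print(states)
```

Catching the exception per server is what keeps one bad endpoint from breaking the tick; the counter resets on any success, so only consecutive failures count toward the threshold.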

Inside each worker, a 30-second sweeper closes outbound MCP sessions idle past their 5-minute TTL, so the pool stays bounded without depending on the platform cron.
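
A pool that is both LRU-capped and idle-swept can be sketched with an OrderedDict. The cap of 50 and the 5-minute TTL come from this page; the class name, its methods, and the injectable clock are assumptions, and this is not the actual RemoteSessionPool. Sessions are treated as opaque objects, so the sketch drops them where a real pool would also close them.

```python
import time
from collections import OrderedDict

MAX_LIVE_SESSIONS = 50     # per worker, from the text
IDLE_TTL_SECONDS = 5 * 60  # idle TTL, from the text


class SessionPoolSketch:
    """LRU-capped pool keyed by (user_id, server_id)."""

    def __init__(self, max_live=MAX_LIVE_SESSIONS, idle_ttl=IDLE_TTL_SECONDS,
                 clock=time.monotonic):
        self._entries = OrderedDict()  # key -> (session, last_used)
        self._max_live = max_live
        self._idle_ttl = idle_ttl
        self._clock = clock

    def get_or_open(self, key, open_session):
        if key in self._entries:
            session, _ = self._entries.pop(key)  # hit: reuse the warm session
        else:
            session = open_session()  # miss: pay the handshake
        self._entries[key] = (session, self._clock())  # most recently used goes last
        while len(self._entries) > self._max_live:
            self._entries.popitem(last=False)  # evict the least recently used
        return session

    def sweep(self) -> int:
        """Drop sessions idle longer than the TTL; return how many were dropped."""
        now = self._clock()
        stale = [k for k, (_, used) in self._entries.items()
                 if now - used > self._idle_ttl]
        for k in stale:
            del self._entries[k]
        return len(stale)

    def __len__(self) -> int:
        return len(self._entries)


pool = SessionPoolSketch()
pool.get_or_open(("u1", "s1"), lambda: object())
print(len(pool), pool.sweep())
```

Passing the clock in makes the idle logic testable without sleeping; a worker would call sweep() from its periodic sweeper task.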

Failure modes worth knowing

  • Desktop disconnects mid-invocation. The pending REST call surfaces a CLIENT_DISCONNECTED (-32006) error. The session enters its 120-second grace period; if the client reconnects in time the catalog survives, otherwise the session is reaped on the next minute-cron tick.
  • Cloud server returns garbage. Discovery fails with RemoteClientError; the row is committed anyway with status='error' so credentials don't need re-entering. Fix the upstream and call refresh.
  • KEK is missing. Every cloud-registry call returns 503 with SCAILINK_REGISTRY_DISABLED. Set encryption_local_kek in settings (production wires ScaiVault).
  • Worker restart. Open WebSockets drop; clients reconnect into their grace period and resume. Warm outbound MCP sessions are lost; the next call pays the handshake.
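
To make the first failure mode concrete, here is a minimal sketch of the error a pending call might receive and of the grace-period decision. The code -32006, the name CLIENT_DISCONNECTED, and the 120-second window come from the list above; the assumption that the error is shaped as a JSON-RPC 2.0 error object, the inclusive boundary, and both function names are illustrative guesses.

```python
import json
from typing import Optional

CLIENT_DISCONNECTED = -32006  # error code from the text
GRACE_PERIOD_SECONDS = 120    # grace period from the text


def client_disconnected_error(request_id: int) -> str:
    """JSON-RPC 2.0 error response for a call whose desktop client went away."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "error": {"code": CLIENT_DISCONNECTED, "message": "CLIENT_DISCONNECTED"},
    })


def catalog_survives(disconnected_at: float, reconnected_at: Optional[float]) -> bool:
    """True if the client came back within the grace period (boundary assumed inclusive)."""
    if reconnected_at is None:
        return False  # never reconnected: the session is reaped by the cleanup cron
    return reconnected_at - disconnected_at <= GRACE_PERIOD_SECONDS


print(client_disconnected_error(7))
print(catalog_survives(0.0, 90.0), catalog_survives(0.0, 121.0))
```
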
Updated 2026-05-18 15:01:30 (rev 12)