---
summary: How the desktop bridge, the cloud MCP registry, and ScaiGrid's MCP aggregation
  fit together inside one module.
title: Architecture
path: concepts/architecture
status: published
---

ScaiLink is a single ScaiGrid module that runs two complementary surfaces: an inbound WebSocket bridge for desktop MCP clients, and an outbound HTTP client for hosted MCP servers the user has registered. Both surfaces feed into the same audit pipeline and the same `/mcp` aggregation an agent talks to.

## Components

```mermaid
flowchart LR
    DC["Desktop client<br/>local MCP servers<br/>+ consent UI"]
    HMS["Hosted MCP server<br/>(third-party MCP endpoint)"]
    subgraph SG ["ScaiGrid FastAPI process"]
        WS["/v1/scailink/ws"]
        CP["ConnectionPool"]
        SESS["/v1/modules/scailink/sessions"]
        CAP["/v1/modules/scailink/capabilities"]
        AUD["/v1/modules/scailink/audit"]
        RS["/v1/modules/scailink/remote-servers"]
        REG["registry"]
        RSS["RemoteServerService<br/>RemoteCredentialStore<br/>RemoteSessionPool"]
        MCP["ScaiMCP /v1/modules/scaimcp/<br/>aggregates both"]
    end
    DC <-- "WebSocket JSON-RPC 2.0<br/>session_init, invoke, catalog_update" --> WS
    WS --> CP
    HMS <-- "streamable_http / sse<br/>tools/list, tools/call, resources/list" --> RSS
    RS --> REG
    REG --> RSS
    RSS --> MCP
    CP --> MCP
```

There is no separate ScaiLink deployment. The module lives in the same FastAPI process as the rest of ScaiGrid, with state in the shared MariaDB instance and Redis used for session-pool and capability-catalog caching.

## Desktop bridge flow

1. The user's desktop client opens `wss://scaigrid.scailabs.ai/v1/scailink/ws` with their JWT.
2. The handler authenticates the JWT and waits for a `scailink/session_init` frame containing device name, platform, capability catalog, and audit settings.
3. A session id (`slink_...`) is minted in Redis under `scailink:session:{user_id}:{device_id}`. The capability catalog is mirrored into the catalog store.
4. The WebSocket joins the per-process `ConnectionPool` so REST callers can find it by `(user_id, device_id)`.
5. REST callers (`POST /users/{user_id}/tools/{tool_name}/invoke`) look up the device, push a `scailink/tool_invoke` frame, and wait for the response. Consent flows interleave with `scailink/consent_request` frames the client surfaces to the user.
6. Heartbeats run every 30 seconds. On disconnect, a 120-second grace period lets a flaky connection reconnect without losing the catalog.
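
As a rough illustration of steps 2 and 3, the sketch below builds a `scailink/session_init` frame and the Redis key the session is stored under. The method name and key pattern come from the flow above; the field names inside `params` (`device_name`, `platform`, `capabilities`, `audit`) and the `"metadata"` default are assumptions made for the example, not the documented wire format.

```python
import json


def build_session_init(device_name, platform, catalog,
                       audit_detail_level="metadata", request_id=1):
    """Build a JSON-RPC 2.0 `scailink/session_init` frame as a JSON string.

    The keys inside `params` are hypothetical; only the method name and the
    kinds of data carried (device name, platform, capability catalog, audit
    settings) are taken from the flow description.
    """
    frame = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "scailink/session_init",
        "params": {
            "device_name": device_name,
            "platform": platform,
            "capabilities": catalog,
            "audit": {"detail_level": audit_detail_level},
        },
    }
    return json.dumps(frame)


def session_key(user_id, device_id):
    """Redis key for a desktop session, following the documented pattern."""
    return f"scailink:session:{user_id}:{device_id}"
```

A client would send the string returned by `build_session_init(...)` as the first frame after the WebSocket opens; `session_key("u_42", "dev_7")` yields `scailink:session:u_42:dev_7`.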

## Cloud registry flow

1. A caller with the required permissions sends `POST /v1/modules/scailink/remote-servers`. The service writes the server row, encrypts credentials field-by-field with AES-256-GCM (per-credential DEK wrapped by the platform KEK), then runs a first discovery.
2. Discovery opens the endpoint over the chosen transport (`streamable_http` by default, `sse` for legacy servers), runs `tools/list` plus `resources/list` plus `prompts/list`, and writes one capability row per item under `(server_id, kind, name)`.
3. The capabilities then appear in the platform `/mcp` catalog under `remote.{user_id}.{slug}.{tool_name}` (personal) or `remote.tenant.{slug}.{tool_name}` (tenant-shared).
4. When ScaiMCP receives `tools/call` on a namespaced name, it resolves the registered server, asks `RemoteSessionPool` for a session (`MAX_LIVE_SESSIONS=50` per worker, idle TTL 5 minutes), and forwards the call.
5. The refresh cron runs every 15 minutes with a per-tenant budget of 10 servers; three consecutive failures flip the server to `status='error'` and remove its tools from the aggregated catalog until a successful refresh recovers it.
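
The naming scheme in step 3 can be sketched as a pair of helpers that build and parse namespaced tool names. This is a minimal illustration, assuming that user ids and slugs contain no dots and that no user id is literally `tenant`; the `RemoteToolRef` type and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RemoteToolRef:
    slug: str
    tool_name: str
    user_id: Optional[str]  # None means the server is tenant-shared


def namespaced_name(slug, tool_name, user_id=None):
    """Build `remote.{user_id}.{slug}.{tool_name}` or the tenant-shared form."""
    owner = "tenant" if user_id is None else user_id
    return f"remote.{owner}.{slug}.{tool_name}"


def parse_namespaced_name(name):
    """Split a namespaced tool name back into its parts.

    Splits at most three times so a tool name that itself contains dots
    stays intact (assumption: user ids and slugs never contain dots).
    """
    parts = name.split(".", 3)
    if len(parts) != 4 or parts[0] != "remote":
        raise ValueError(f"not a remote tool name: {name!r}")
    _, owner, slug, tool_name = parts
    user_id = None if owner == "tenant" else owner
    return RemoteToolRef(slug=slug, tool_name=tool_name, user_id=user_id)
```

Resolving a `tools/call` would start with `parse_namespaced_name(...)` to find the owning registration before asking the session pool for a session.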

## State

- **Sessions, heartbeats, grace periods** — Redis, keyed by `(user_id, device_id)`.
- **Capability catalogs from desktop clients** — Redis, keyed the same way.
- **Audit events** — `mod_scailink_audit_log` table; retained by tenant policy.
- **Remote servers, credentials, capabilities** — `mod_scailink_remote_server`, `mod_scailink_remote_credential`, `mod_scailink_remote_capability` tables.
- **Live outbound MCP sessions** — in-process memory in `RemoteSessionPool`; LRU-capped, idle-swept.

## Where the trust boundary is

The desktop client controls what's exposed. ScaiGrid never reaches into a user's machine on its own: every tool call goes through the open WebSocket, which the user can close at any time. Consent prompts on first use are explicit; auto-approval requires the user to configure a consent policy in advance.

The cloud registry is the inverse: ScaiGrid holds the credentials and calls out. The credential write path is one-way — values go in via `POST /remote-servers` or `PUT /credentials/{field}`, are encrypted, and never come back out through the API. Only the outbound runtime can decrypt to make a call.

User-id forwarding to hosted servers is opt-in. By default, the third party doesn't see internal user IDs; flip `forward_user_id` on the registration to add `X-ScaiGrid-User: {user_id}` to outbound headers, useful when the third party needs per-user attribution.
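
The opt-in behaviour can be pictured as a small header-building step on the outbound path. The sketch below is illustrative only: `outbound_headers` is a hypothetical helper, while the `forward_user_id` flag and the `X-ScaiGrid-User` header name come from the paragraph above.

```python
def outbound_headers(base_headers, user_id, forward_user_id=False):
    """Return headers for a call to a hosted MCP server.

    The user id is attached only when the registration opted in; the input
    mapping is copied so the caller's headers are never mutated.
    """
    headers = dict(base_headers)
    if forward_user_id:
        headers["X-ScaiGrid-User"] = str(user_id)
    return headers
```

With the default `forward_user_id=False`, the returned headers are an unchanged copy, so the third party receives nothing that identifies the internal user.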

## How it differs from raw MCP

A raw MCP client talks to one server. ScaiLink is the multi-tenant, audited, credential-managing layer in front of many servers and many users:

| Concern | Raw MCP | ScaiLink |
|---|---|---|
| Auth | Per-call header | Stored, encrypted, rotatable |
| Aggregation | Per-app | Platform-wide via ScaiMCP |
| Audit | You instrument it | Built-in for both surfaces |
| Health checks | You write a cron | Built-in every 15 min |
| Naming collisions | You handle them | Namespaced slug per server |
| Consent UI | You build it | Built into the desktop bridge |
| Session reuse | You manage it | Per-(user, server) pool with idle TTL |

## Per-process and shared state

A single ScaiGrid worker holds two pieces of in-memory state that other workers don't see:

- The desktop `ConnectionPool`: the open WebSocket objects. A REST call that lands on a worker that doesn't own the target WebSocket is forwarded internally over Redis pub/sub by the gateway logic, so this is a transparent indirection rather than a correctness issue.
- The `RemoteSessionPool`: warm outbound MCP sessions. Each worker keeps its own, so with `uvicorn --workers N` the cache hit rate scales roughly as 1/N. Correctness is unaffected; a miss just pays the handshake.

Everything else is shared: Redis for session, catalog, and grace state; MariaDB for the registry rows; the platform KEK (via ScaiVault in production, settings in dev) for unwrapping credential DEKs at invocation time.

## What runs when

- **At process boot.** The module's `initialize` creates the `ConnectionPool` on `app.state`.
- **On every WebSocket connection.** The handler authenticates, accepts `session_init`, mints or resumes a session in Redis, stores the capability catalog, and joins the pool.
- **Every 15 minutes.** The `refresh_remote_servers` cron walks `status='active'` rows in stale-first order, with a 10-server-per-tenant budget, re-running discovery and upserting capability rows.
- **Every 30 seconds (per process).** The session-pool sweeper closes outbound MCP sessions idle past their 5-minute TTL.
- **Every minute.** `cleanup_stale_sessions` reaps Redis session rows whose grace period has expired.

## What it does not do

- ScaiLink does not host or run agents. Agents live in ScaiCore / ScaiBot / external MCP clients and consume ScaiLink's catalog through ScaiMCP.
- ScaiLink does not store the *content* of tool results unless `audit_detail_level=full` was selected at session_init — by default, only metadata (action, target name, arguments outline, status, duration) is retained.
- ScaiLink does not yet support OAuth2 refresh-token flows for the cloud registry; OAuth2 is on the v1.2 roadmap. JWT auth from ScaiKey is not used outbound because most third-party MCP servers don't trust that issuer.

## Background tasks

ScaiLink contributes two cron jobs to the platform's worker:

- `cleanup_stale_sessions` runs every minute. It scans Redis for session rows whose grace period has expired, removes them, and drops their capability catalog so the data doesn't outlive the user's intent.
- `refresh_remote_servers` runs every fifteen minutes. It walks `status='active'` rows in stale-first order with a per-tenant budget, re-running discovery and upserting capability rows. Three consecutive failures flip a row to `status='error'`; a successful refresh restores it.

Both tasks are intentionally resilient to single-server failures — one bad endpoint doesn't break the cron tick.
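
One way to picture the refresh tick is the sketch below: pick active rows stale-first under a per-tenant budget, run discovery on each, and count consecutive failures. The `ServerRow` fields, the function names, and the choice to stamp `last_refreshed_at` on failed attempts are all assumptions for illustration; the budget of 10 and the three-failure threshold come from the text.

```python
from dataclasses import dataclass

MAX_CONSECUTIVE_FAILURES = 3
PER_TENANT_BUDGET = 10


@dataclass
class ServerRow:  # hypothetical stand-in for a registry row
    server_id: str
    tenant_id: str
    last_refreshed_at: float
    status: str = "active"
    consecutive_failures: int = 0


def select_for_refresh(rows, budget=PER_TENANT_BUDGET):
    """Pick active rows, stalest first, at most `budget` per tenant."""
    picked, taken = [], {}
    active = [r for r in rows if r.status == "active"]
    for row in sorted(active, key=lambda r: r.last_refreshed_at):
        if taken.get(row.tenant_id, 0) < budget:
            picked.append(row)
            taken[row.tenant_id] = taken.get(row.tenant_id, 0) + 1
    return picked


def record_result(row, ok):
    """Update failure bookkeeping after one discovery attempt."""
    if ok:
        row.consecutive_failures = 0
        row.status = "active"
    else:
        row.consecutive_failures += 1
        if row.consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
            row.status = "error"


def refresh_tick(rows, discover, now):
    """Run one tick; a failing server never aborts the rest of the loop."""
    for row in select_for_refresh(rows):
        try:
            discover(row)
        except Exception:
            record_result(row, ok=False)
        else:
            record_result(row, ok=True)
        row.last_refreshed_at = now  # assumption: failed attempts count too
```

Because the selection only considers active rows, a row in `status='error'` would be restored by calling `record_result(row, ok=True)` from whatever path triggers a successful refresh, such as a manual refresh call.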

Inside each worker, a 30-second sweeper closes outbound MCP sessions idle past their 5-minute TTL, so the pool stays bounded without depending on the platform cron.
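
The pool's two bounds, the LRU cap and the idle sweep, can be sketched with an `OrderedDict`. This is a synchronous, simplified illustration rather than the actual `RemoteSessionPool`: the class name, the `open_session` callback, and the requirement that sessions expose `close()` are assumptions, and a real pool in an async FastAPI process would use awaitable operations.

```python
import time
from collections import OrderedDict

MAX_LIVE_SESSIONS = 50
IDLE_TTL_SECONDS = 300  # 5 minutes


class SessionPoolSketch:
    """LRU-capped, idle-swept pool keyed by (user_id, server_id)."""

    def __init__(self, open_session, max_live=MAX_LIVE_SESSIONS,
                 idle_ttl=IDLE_TTL_SECONDS, clock=time.monotonic):
        self._open = open_session      # hypothetical factory: (user, server) -> session
        self._max = max_live
        self._ttl = idle_ttl
        self._clock = clock
        self._entries = OrderedDict()  # key -> (session, last_used)

    def acquire(self, user_id, server_id):
        """Return a warm session or open a new one, then mark it most recent."""
        key = (user_id, server_id)
        entry = self._entries.pop(key, None)
        session = entry[0] if entry else self._open(user_id, server_id)
        self._entries[key] = (session, self._clock())
        # Enforce the cap by closing the least recently used sessions.
        while len(self._entries) > self._max:
            _, (evicted, _) = self._entries.popitem(last=False)
            evicted.close()
        return session

    def sweep(self):
        """Close sessions idle past the TTL; return how many were closed."""
        cutoff = self._clock() - self._ttl
        stale = [k for k, (_, used) in self._entries.items() if used < cutoff]
        for key in stale:
            session, _ = self._entries.pop(key)
            session.close()
        return len(stale)
```

A per-worker timer would call `sweep()` on the sweeper interval. Injecting `clock` keeps the sketch testable without sleeping.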

## Failure modes worth knowing

- **Desktop disconnects mid-invocation.** The pending REST call surfaces a `CLIENT_DISCONNECTED` (-32006) error. The session enters its 120-second grace period; if the client reconnects in time the catalog survives, otherwise the session is reaped on the next minute-cron tick.
- **Cloud server returns garbage.** Discovery fails with `RemoteClientError`; the row is committed anyway with `status='error'` so credentials don't need re-entering. Fix the upstream and call refresh.
- **KEK is missing.** Every cloud-registry call returns 503 with `SCAILINK_REGISTRY_DISABLED`. Set `encryption_local_kek` in settings (production wires ScaiVault).
- **Worker restart.** Open WebSockets drop; clients reconnect into their grace period and resume. Warm outbound MCP sessions are lost; the next call pays the handshake.
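
The disconnect case can be sketched as two small helpers: one builds the error a pending call would surface, the other decides whether a reconnect falls inside the grace period. The error code and the 120-second window come from the list above; the JSON-RPC shape of the error body, the message text, and both function names are assumptions, and the sketch treats the deadline strictly even though actual reaping happens on a later cron tick.

```python
GRACE_PERIOD_SECONDS = 120
CLIENT_DISCONNECTED = -32006


def disconnect_error(request_id):
    """Build a JSON-RPC 2.0 style error for a call whose client vanished.

    Assumption: the error is shaped as a JSON-RPC error object; only the
    code and its symbolic name are taken from the documentation.
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "error": {"code": CLIENT_DISCONNECTED, "message": "CLIENT_DISCONNECTED"},
    }


def session_survives(disconnected_at, reconnected_at, grace=GRACE_PERIOD_SECONDS):
    """True if the client reconnected within the grace period.

    Timestamps are seconds on any monotonic scale; `reconnected_at` is None
    when the client never came back.
    """
    if reconnected_at is None:
        return False
    return reconnected_at - disconnected_at <= grace
```

A caller that receives this error can retry after a short delay; if the session survived, the catalog is still in place and the retry goes through the reconnected WebSocket.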
