Troubleshooting
A short list of things that go wrong and how to fix them. If none of these match, check the request id in the response envelope and grep the ScaiGrid logs.
Bunker stays in pending or provisioning
The worker hasn't acknowledged the create yet, or the scheduler can't find a worker that fits.
- WORKER_UNAVAILABLE on create. No worker has enough free CPU / memory / disk / GPU. Either reduce the request, wait for other bunkers to terminate, or add a worker.
- NO_SUITABLE_WORKER on create. No worker has the image cached in the ready state. Either set lazy_pull: true on the image, force a POST /images/{id}/warm, or wait for fan-out to complete.
- Stuck in provisioning for more than a minute. The worker took the assignment but never reported back. Check the worker's status with GET /workers/{id}. If last_heartbeat is stale, the heartbeat monitor will fail the bunker out automatically within ~30 seconds.
BUNKER_QUOTA_EXCEEDED
The request would push the caller's resolved quota over a cap.
- Read GET /quota-profiles/usage. You'll see one row per profile applying to your user, with the bucket, current usage, cap, and headroom.
- Terminate idle bunkers; quota is decremented across every bucket they contributed to.
- Ask a tenant admin to assign a more generous profile or raise the existing one.
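
As a rough illustration, the rows returned by GET /quota-profiles/usage could be scanned for the buckets that have no room left. The field names (profile, bucket, usage, cap) and the sample rows are assumptions made for this sketch, not the documented response schema.

```python
from typing import Dict, List


def exhausted_buckets(rows: List[Dict], requested: int = 1) -> List[str]:
    """Return "profile/bucket" labels whose cap cannot absorb `requested` more units."""
    blocked = []
    for row in rows:
        headroom = row["cap"] - row["usage"]
        if headroom < requested:
            blocked.append(f'{row["profile"]}/{row["bucket"]}')
    return blocked


# Hypothetical rows; real profile and bucket names will differ.
rows = [
    {"profile": "team-default", "bucket": "max_concurrent_bunkers", "usage": 5, "cap": 5},
    {"profile": "team-default", "bucket": "cpu", "usage": 6, "cap": 16},
]
print(exhausted_buckets(rows))
```

Any bucket the function reports is one where terminating an idle bunker, or a more generous profile, is needed before the create can succeed.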
NETWORK_PROFILE_DENIED or LIFECYCLE_MODE_DENIED
You picked a profile or lifecycle that needs a permission you don't have.
- registry needs scaibunker:network:registry, allowlisted needs scaibunker:network:allowlisted, etc.
- session needs scaibunker:create:session, persistent needs scaibunker:create:persistent.
- Ask a tenant admin to grant the specific key; these are deliberately granular.
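
A minimal sketch of a pre-flight check, assuming you already have the caller's granted permission keys as a set of strings. Only the four keys named above are included; the function name and its parameters are hypothetical, and profiles or modes not listed here are treated as needing no extra key, which may not match the platform.

```python
from typing import List, Set

# Keys taken from the list above; other profiles and modes are deliberately omitted.
NETWORK_PERMISSIONS = {
    "registry": "scaibunker:network:registry",
    "allowlisted": "scaibunker:network:allowlisted",
}
LIFECYCLE_PERMISSIONS = {
    "session": "scaibunker:create:session",
    "persistent": "scaibunker:create:persistent",
}


def missing_permissions(network_profile: str, lifecycle_mode: str, granted: Set[str]) -> List[str]:
    """Return the permission keys the caller still needs for this combination."""
    needed = [
        NETWORK_PERMISSIONS.get(network_profile),
        LIFECYCLE_PERMISSIONS.get(lifecycle_mode),
    ]
    return [key for key in needed if key is not None and key not in granted]


print(missing_permissions("registry", "persistent", {"scaibunker:network:registry"}))
```

The returned list is what to send to a tenant admin when asking for access.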
INTERFACES_NOT_ALLOWED / TRANSIT_MISSING_INTERFACES
The interfaces[] array is only valid with network_profile: "transit", and transit requires at least one interface.
- For transit bunkers, name a bridge_name for each interface; the bridge must already exist on a worker via POST /bridges.
- For everything else, drop the interfaces field.
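
The rule can be checked client-side before sending the create request. The sketch below is an assumption about how the two error codes map onto the two failure cases, and it treats any present interfaces field on a non-transit request as an error; the function name is hypothetical.

```python
from typing import Optional


def check_interfaces(request: dict) -> Optional[str]:
    """Return the error code the request would likely trigger, or None if it looks valid."""
    profile = request.get("network_profile")
    interfaces = request.get("interfaces")
    if profile == "transit":
        # Transit needs at least one interface.
        if not interfaces:
            return "TRANSIT_MISSING_INTERFACES"
        return None
    # Every other profile must omit the field entirely.
    if interfaces is not None:
        return "INTERFACES_NOT_ALLOWED"
    return None


print(check_interfaces({"network_profile": "transit"}))
print(check_interfaces({"network_profile": "isolated", "interfaces": [{"bridge_name": "br0"}]}))
```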
exec returns a 0 exit code but stdout is empty
Two common causes:
- Output went to a file inside the bunker. Read it back with GET /files/....
- Output was truncated to S3. Check truncated: true and full_output_ref on the response; fetch via GET /storage/output/{bunker_id}/{name}.
exec times out mid-command
The default timeout_s is 60. For builds, installs, or long-running scripts:
- Bump timeout_s (no hard ceiling at the API; the bunker's own max_lifetime_s is the upper bound).
- Use "stream": true so you see progress as it happens and can decide when to abort.
- Move long work into a snapshot-able session bunker so a partial result survives a controller restart.
Files PUT returns 413 or fails on a large file
The inline PUT path is fine for files under ~10 MB. For larger:
- Call POST /files/upload for a pre-signed S3 URL and key.
- Upload the file directly to S3.
- Call POST /files/commit with {key, dest_path}; the worker injects the file into the bunker.
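
The three steps above can be sketched as one function. The HTTP calls are passed in as callables so the flow stays independent of any particular client library. The empty request body for /files/upload and the url field in its response are assumptions; key and dest_path come from the steps above.

```python
from typing import Callable


def upload_large_file(
    api_post: Callable[[str, dict], dict],
    s3_put: Callable[[str, bytes], None],
    data: bytes,
    dest_path: str,
) -> dict:
    """Upload `data` through the pre-signed path and commit it into the bunker."""
    # Step 1: ask for a pre-signed URL and the storage key.
    ticket = api_post("/files/upload", {})
    # Step 2: send the bytes straight to S3, bypassing the inline PUT path.
    s3_put(ticket["url"], data)
    # Step 3: ask the API to inject the uploaded object at dest_path.
    return api_post("/files/commit", {"key": ticket["key"], "dest_path": dest_path})
```

In practice api_post would wrap your authenticated client for the bunker's files endpoint, and s3_put would issue a plain HTTP PUT to the pre-signed URL.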
Image registered but no bunkers can use it
- Check GET /images/{id}/cache. If every row is pending or failed, the fan-out didn't reach the workers.
- Check the image is in an availability group containing your workers (POST /availability-groups/{group_id}/images).
- Re-trigger with POST /images/{id}/warm (idempotent).
- Inspect a failing worker row's error field. It is usually a registry auth failure, an OOM during mkfs.ext4, or a size cap that was set too low.
Image scan stuck on pending
- The scanner runs every 2 minutes as a background task. Wait a couple of minutes.
- If still pending after 10 minutes, check that Trivy is installed on the controller (trivy --version in the controller container). Missing binary → status flips to failed with a "scanner not available" message.
Worker shows offline but its host is up
- The heartbeat is what makes a worker online. Default cadence is 10 seconds; WORKER_STALE_THRESHOLD_MULTIPLIER (default 3) means missing 30 seconds flips it offline.
- Check the worker's own logs for heartbeat send failures (wrong SCAIBUNKER_WORKER_TOKEN, controller URL, network reachability).
- The auto-detected status is in Redis; if Redis was wiped, the next heartbeat re-populates it.
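
The staleness arithmetic is cadence times multiplier: 10 s × 3 = 30 s. A minimal sketch of that check follows. HEARTBEAT_INTERVAL_S is a hypothetical name for the cadence setting, and treating exactly 30 seconds as still online is an assumption about the boundary.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_INTERVAL_S = 10  # default cadence (hypothetical setting name)
WORKER_STALE_THRESHOLD_MULTIPLIER = 3  # default multiplier


def is_worker_stale(last_heartbeat: datetime, now: datetime) -> bool:
    """True when the last heartbeat is older than cadence * multiplier."""
    threshold = timedelta(seconds=HEARTBEAT_INTERVAL_S * WORKER_STALE_THRESHOLD_MULTIPLIER)
    return now - last_heartbeat > threshold


now = datetime.now(timezone.utc)
print(is_worker_stale(now - timedelta(seconds=45), now))
```

Comparing the worker's last_heartbeat against the current time this way tells you whether an offline status is expected or whether the status itself is out of date.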
Snapshots filling up S3
- Snapshots default to a 7-day retention (DEFAULT_SNAPSHOT_RETENTION_DAYS). The cleanup background task runs every 5 minutes.
- Anything older than the retention with expires_at set will be deleted automatically.
- Manual snapshots without expires_at are kept indefinitely. Set lifecycle on the S3 bucket if you want hard caps.
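
A sketch of the retention rule, assuming expires_at is derived as creation time plus the retention period and that the cleanup task simply compares it with the current time. Both function names are hypothetical, and the real cleanup logic may differ.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_SNAPSHOT_RETENTION_DAYS = 7


def default_expiry(created_at: datetime) -> datetime:
    """Expiry a snapshot would get under the default retention."""
    return created_at + timedelta(days=DEFAULT_SNAPSHOT_RETENTION_DAYS)


def is_due_for_cleanup(expires_at: Optional[datetime], now: datetime) -> bool:
    """Snapshots without expires_at are never auto-deleted; others go once expired."""
    if expires_at is None:
        return False
    return now >= expires_at


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(default_expiry(created).isoformat())
```

Snapshots for which is_due_for_cleanup never returns True are the ones an S3 lifecycle rule would have to cover.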
Storage proxy 401s
- The proxy verifies a constant SCAIBUNKER_WORKER_TOKEN plus, optionally, capability tokens minted via POST /storage/capabilities.
- Mismatch between the worker's token and the controller's is the most common cause; they have to share the same string.
- For capability tokens, scaibunker_capability_secret must be set on the proxy as well as the controller.
Bunker can't reach hosts you expected
- On isolated, it can't reach anything. On registry, only the platform's package mirrors. On allowlisted, only what you listed.
- Allowlist entries are plain hostnames or *.domain.com. URLs, paths, and double wildcards are rejected at create time.
- For one-off "I just need to curl this URL" cases, use unrestricted plus enable the egress audit so the run is logged.
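
To catch bad allowlist entries before create time, a local check along these lines may help. It is an approximation of the rules described above, not the platform's validator; in particular, requiring at least one dot in the hostname is an assumption.

```python
import re

# One DNS label: letters, digits, and inner hyphens, up to 63 characters.
_LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
# Optional single leading "*." followed by two or more dot-separated labels.
_ENTRY = re.compile(rf"(?:\*\.)?{_LABEL}(?:\.{_LABEL})+")


def is_valid_allowlist_entry(entry: str) -> bool:
    """Accept plain hostnames and single leading wildcards; reject URLs and paths."""
    return _ENTRY.fullmatch(entry) is not None


for candidate in ["pypi.org", "*.domain.com", "https://pypi.org", "*.*.domain.com"]:
    print(candidate, is_valid_allowlist_entry(candidate))
```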
Conversations and bunkers double-charged
- A persistent bunker holds quota for its entire lifetime, even when paused. Pause counts toward max_concurrent_bunkers, not toward CPU usage on Redis.
- If you want to give back resources, terminate (optionally with snapshot); pause doesn't release.