---
summary: Common symptoms and what they usually mean.
title: Troubleshooting
path: troubleshooting
status: published
---

A short list of things that go wrong and how to fix them. If none of these match, take the request id from the response envelope and grep the ScaiGrid logs for it.

## Bunker stays in `pending` or `provisioning`

The worker hasn't acknowledged the create yet, or the scheduler can't find a worker that fits.

- **`WORKER_UNAVAILABLE` on create.** No worker has enough free CPU / memory / disk / GPU. Either reduce the request, wait for other bunkers to terminate, or add a worker.
- **`NO_SUITABLE_WORKER` on create.** No worker has the image cached `ready`. Either set `lazy_pull: true` on the image, force a `POST /images/{id}/warm`, or wait for fan-out to complete.
- **Stuck in `provisioning` for more than a minute.** The worker took the assignment but never reported back. Check the worker's status with `GET /workers/{id}` — if `last_heartbeat` is stale, the heartbeat monitor will fail the bunker out automatically within ~30 seconds.

## `BUNKER_QUOTA_EXCEEDED`

The create would push the caller's usage over a cap in one of their resolved quota profiles.

- Read `GET /quota-profiles/usage` — you'll see one row per profile applying to your user, with the bucket, current usage, cap, and headroom.
- Terminate idle bunkers; quota is decremented across every bucket they contributed to.
- Ask a tenant admin to assign a more generous profile or raise the existing one.
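
To see which bucket is actually blocking a create, the usage rows can be compared against what the new bunker would add. This is a minimal sketch, assuming each row from `GET /quota-profiles/usage` is a JSON object with `profile`, `bucket`, `usage`, `cap`, and `headroom` keys; those key names and the `cpu` bucket name are guesses based on the description above, not a confirmed schema.

```python
def blocking_buckets(rows, requested):
    """Return usage rows that cannot absorb the requested amounts.

    rows      -- parsed rows from GET /quota-profiles/usage (key names assumed)
    requested -- maps bucket name -> amount the new bunker would add;
                 buckets the request does not touch are ignored
    """
    tight = [r for r in rows if r["headroom"] < requested.get(r["bucket"], 0)]
    # Tightest bucket first, so the most urgent cap is at the top.
    return sorted(tight, key=lambda r: r["headroom"])


rows = [
    {"profile": "team-default", "bucket": "max_concurrent_bunkers",
     "usage": 5, "cap": 5, "headroom": 0},
    {"profile": "team-default", "bucket": "cpu",
     "usage": 6, "cap": 16, "headroom": 10},
]
for row in blocking_buckets(rows, {"max_concurrent_bunkers": 1, "cpu": 4}):
    print(f'{row["profile"]} / {row["bucket"]}: {row["usage"]} of {row["cap"]} used')
```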

## `NETWORK_PROFILE_DENIED` or `LIFECYCLE_MODE_DENIED`

You picked a profile or lifecycle that needs a permission you don't have.

- `registry` needs `scaibunker:network:registry`, `allowlisted` needs `scaibunker:network:allowlisted`, etc.
- `session` needs `scaibunker:create:session`, `persistent` needs `scaibunker:create:persistent`.
- Ask a tenant admin to grant the specific key — these are deliberately granular.
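
When asking an admin for access, it helps to name the exact keys that are missing. The sketch below assumes every network profile and lifecycle mode follows the `scaibunker:network:<profile>` and `scaibunker:create:<mode>` pattern; the text above only confirms that pattern for the four keys it lists, so treat the rest as an assumption. The `missing_permissions` helper is hypothetical.

```python
def missing_permissions(network_profile, lifecycle_mode, granted):
    """Return the permission keys a create would need that are not granted.

    Assumes the key naming pattern holds for every profile and mode.
    """
    required = {
        f"scaibunker:network:{network_profile}",
        f"scaibunker:create:{lifecycle_mode}",
    }
    return sorted(required - set(granted))


print(missing_permissions(
    "allowlisted", "persistent", granted=["scaibunker:create:persistent"]
))
```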

## `INTERFACES_NOT_ALLOWED` / `TRANSIT_MISSING_INTERFACES`

The `interfaces[]` array is only valid with `network_profile: "transit"`, and transit requires at least one interface.

- For transit bunkers, name a `bridge_name` for each interface; the bridge must already have been created on a worker via `POST /bridges`.
- For everything else, drop the `interfaces` field.
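
Both errors can be caught client-side before the create request is sent. The following is a rough pre-flight check, not the server's actual validation: `check_interfaces` is a hypothetical helper, and treating an empty `interfaces` list on a non-transit bunker as an error is an assumption.

```python
def check_interfaces(body):
    """Return a list of problems with the interfaces field of a create body."""
    profile = body.get("network_profile")
    interfaces = body.get("interfaces")

    if profile != "transit":
        # Assumption: any present interfaces field is rejected, even an empty one.
        return ["INTERFACES_NOT_ALLOWED"] if interfaces is not None else []

    if not interfaces:
        return ["TRANSIT_MISSING_INTERFACES"]

    # Every transit interface needs a bridge_name.
    return [
        f"interfaces[{i}] has no bridge_name"
        for i, iface in enumerate(interfaces)
        if not iface.get("bridge_name")
    ]


print(check_interfaces({"network_profile": "isolated", "interfaces": []}))
print(check_interfaces({"network_profile": "transit"}))
print(check_interfaces({"network_profile": "transit",
                        "interfaces": [{"bridge_name": "br0"}, {}]}))
```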

## `exec` returns a 0 exit code but `stdout` is empty

Two common causes:

- **Output went to a file inside the bunker.** Read it back with `GET /files/...`.
- **Output was truncated and offloaded to S3.** Check for `truncated: true` and `full_output_ref` on the response, then fetch the full output via `GET /storage/output/{bunker_id}/{name}`.
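
A caller can branch on these two cases when handling an `exec` response. This is a sketch under assumptions: the response is taken to be a dict with `stdout`, `truncated`, and `full_output_ref` keys, and `locate_output` is a hypothetical helper. How `full_output_ref` maps onto the `{name}` segment of the storage path is not specified here, so the sketch returns the reference unchanged.

```python
def locate_output(resp):
    """Decide where the output of an exec call should be read from.

    Returns a (source, value) pair:
      ("storage", ref)  -- output was truncated; fetch the full copy by ref
      ("inline", text)  -- stdout is present on the response
      ("file", None)    -- nothing on stdout; the command likely wrote to a file
    """
    if resp.get("truncated") and resp.get("full_output_ref"):
        return ("storage", resp["full_output_ref"])
    if resp.get("stdout"):
        return ("inline", resp["stdout"])
    return ("file", None)


print(locate_output({"stdout": "", "truncated": False}))
print(locate_output({"stdout": "", "truncated": True, "full_output_ref": "out-1"}))
```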

## `exec` times out mid-command

The default `timeout_s` is 60. For builds, installs, or long-running scripts:

- Bump `timeout_s` (no hard ceiling at the API; the bunker's own `max_lifetime_s` is the upper bound).
- Use `"stream": true` so you see progress as it happens and can decide when to abort.
- Move long work into a snapshot-able session bunker so a partial result survives a controller restart.
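
As an illustration of the first two points, a request body for a long-running command might be built like this. The `command` field name and the `exec_body` helper are assumptions; only `timeout_s`, `stream`, and `max_lifetime_s` come from the text above.

```python
DEFAULT_TIMEOUT_S = 60  # default stated above


def exec_body(command, timeout_s=DEFAULT_TIMEOUT_S, max_lifetime_s=None, stream=False):
    """Build an exec request body, clamping the timeout to the bunker lifetime."""
    if max_lifetime_s is not None:
        # The bunker's own lifetime is the effective upper bound.
        timeout_s = min(timeout_s, max_lifetime_s)
    body = {"command": command, "timeout_s": timeout_s}  # "command" is an assumed field name
    if stream:
        body["stream"] = True
    return body


print(exec_body("make all", timeout_s=3600, max_lifetime_s=1800, stream=True))
```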

## Files PUT returns 413 or fails on a large file

The inline PUT path is fine for files under ~10 MB. For larger files, use the pre-signed upload flow, in this order:

- Call `POST /files/upload` for a pre-signed S3 URL and `key`.
- Upload the file directly to S3.
- Call `POST /files/commit` with `{key, dest_path}` — the worker injects the file into the bunker.
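
The three steps can be wrapped in one helper that picks the path by size. This is a sketch, not a client library: the transport callables (`api_post`, `api_put_inline`, `s3_put`) are injected by the caller and are hypothetical, the `url` response key and the empty request body for `POST /files/upload` are assumptions, and the 10 MB threshold mirrors the rough guidance above rather than a documented limit.

```python
INLINE_LIMIT_BYTES = 10 * 1024 * 1024  # ~10 MB, per the guidance above


def upload_file(data, dest_path, api_post, api_put_inline, s3_put):
    """Upload bytes into a bunker, inline if small, via pre-signed URL if large.

    api_post(path, body) -> dict   sends a POST to the API and returns parsed JSON
    api_put_inline(dest_path, data) performs the inline files PUT
    s3_put(url, data)              uploads the bytes to the pre-signed URL
    """
    if len(data) < INLINE_LIMIT_BYTES:
        api_put_inline(dest_path, data)
        return "inline"

    grant = api_post("/files/upload", {})          # step 1: get URL and key
    s3_put(grant["url"], data)                     # step 2: upload directly to S3
    api_post("/files/commit",                      # step 3: worker injects the file
             {"key": grant["key"], "dest_path": dest_path})
    return "presigned"
```

Injecting the transport keeps the sketch independent of any particular HTTP library and makes the step ordering easy to verify with fakes.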

## Image registered but no bunkers can use it

- Check `GET /images/{id}/cache` — if every row is `pending` or `failed`, the fan-out didn't reach the workers.
- Check the image is in an availability group containing your workers (`POST /availability-groups/{group_id}/images`).
- Re-trigger with `POST /images/{id}/warm` (idempotent).
- Inspect a failing worker row's `error` field — usually a registry auth failure, an OOM during `mkfs.ext4`, or a size cap that was set too low.
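
A quick summary of the cache rows makes it obvious whether any worker can serve the image. The sketch assumes each row from `GET /images/{id}/cache` carries `worker_id`, `status`, and `error` keys; only `error` and the status values `pending`, `failed`, and `ready` appear in this page, so the rest is an assumption.

```python
from collections import Counter


def summarize_cache(rows):
    """Summarize per-worker cache rows for one image."""
    counts = Counter(row["status"] for row in rows)
    errors = {
        row["worker_id"]: row.get("error")
        for row in rows
        if row["status"] == "failed"
    }
    return {
        "usable": counts["ready"] > 0,  # at least one worker has the image ready
        "counts": dict(counts),
        "errors": errors,
    }


rows = [
    {"worker_id": "w1", "status": "failed", "error": "registry auth failure"},
    {"worker_id": "w2", "status": "pending"},
]
print(summarize_cache(rows))
```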

## Image scan stuck on `pending`

- The scanner runs every 2 minutes as a background task. Wait a couple of minutes.
- If still `pending` after 10 minutes, check that Trivy is installed on the controller (`trivy --version` in the controller container). Missing binary → status flips to `failed` with a "scanner not available" message.

## Worker shows `offline` but its host is up

- The heartbeat is what makes a worker `online`. The default cadence is 10 seconds, and `WORKER_STALE_THRESHOLD_MULTIPLIER` (default 3) means that about 30 seconds without a heartbeat flips it to `offline`.
- Check the worker's own logs for heartbeat send failures (wrong `SCAIBUNKER_WORKER_TOKEN`, controller URL, network reachability).
- The auto-detected status is in Redis; if Redis was wiped, the next heartbeat re-populates it.
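
The staleness rule is simply cadence times multiplier. The sketch below applies it to a `last_heartbeat` value, assuming that field is an ISO 8601 timestamp with an explicit UTC offset and that the comparison is strict; neither detail is confirmed here, and `is_stale` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_INTERVAL_S = 10              # default cadence from the text above
WORKER_STALE_THRESHOLD_MULTIPLIER = 3  # default multiplier from the text above


def is_stale(last_heartbeat, now=None,
             interval_s=HEARTBEAT_INTERVAL_S,
             multiplier=WORKER_STALE_THRESHOLD_MULTIPLIER):
    """Return True if the last heartbeat is older than interval * multiplier."""
    if now is None:
        now = datetime.now(timezone.utc)
    seen = datetime.fromisoformat(last_heartbeat)
    return now - seen > timedelta(seconds=interval_s * multiplier)


now = datetime(2024, 1, 1, 12, 0, 45, tzinfo=timezone.utc)
print(is_stale("2024-01-01T12:00:00+00:00", now))  # 45 s of silence
print(is_stale("2024-01-01T12:00:30+00:00", now))  # 15 s of silence
```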

## Snapshots filling up S3

- Snapshots default to a 7-day retention (`DEFAULT_SNAPSHOT_RETENTION_DAYS`). The cleanup background task runs every 5 minutes.
- Anything older than the retention with `expires_at` set will be deleted automatically.
- Manual snapshots without `expires_at` are kept indefinitely. Set a lifecycle rule on the S3 bucket if you want a hard cap.
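
Since only snapshots with `expires_at` are cleaned up, the ones without it are the usual suspects when storage keeps growing. A small sketch for picking them out of a snapshot listing follows; the `id` and `size_bytes` keys and the `never_expiring` helper are assumptions, and how the listing is fetched is left out.

```python
def never_expiring(snapshots):
    """Return snapshots with no expires_at, largest first."""
    keep = [s for s in snapshots if s.get("expires_at") is None]
    return sorted(keep, key=lambda s: s.get("size_bytes", 0), reverse=True)


snapshots = [
    {"id": "a", "expires_at": "2024-01-08T00:00:00+00:00", "size_bytes": 900},
    {"id": "b", "expires_at": None, "size_bytes": 100},
    {"id": "c", "size_bytes": 500},
]
print([s["id"] for s in never_expiring(snapshots)])
```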

## Storage proxy 401s

- The proxy verifies a constant `SCAIBUNKER_WORKER_TOKEN` plus, optionally, capability tokens minted via `POST /storage/capabilities`.
- Mismatch between the worker's token and the controller's is the most common cause — they have to share the same string.
- For capability tokens, `scaibunker_capability_secret` must be set on the proxy as well as the controller.

## Bunker can't reach hosts you expected

- On `isolated`, it can't reach anything. On `registry`, only the platform's package mirrors. On `allowlisted`, only what you listed.
- Allowlist entries are plain hostnames or `*.domain.com`. URLs, paths, and double wildcards are rejected at create time.
- For one-off "I just need to curl this URL" cases, use `unrestricted` and enable the egress audit so the run is logged.
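
Allowlist entries can be checked before the create call to avoid a round trip. The validator below is an approximation of the rule described above (plain hostname, or a single leading `*.` wildcard), not the server's actual implementation; the label-length limits follow the usual DNS conventions and are an assumption.

```python
import re

# One DNS label: alphanumeric at both ends, hyphens allowed inside, max 63 chars.
_LABEL = r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?"
# Optional single leading "*." followed by one or more dot-separated labels.
_ENTRY = re.compile(rf"(?:\*\.)?{_LABEL}(?:\.{_LABEL})*", re.IGNORECASE)


def is_valid_allowlist_entry(entry):
    """Return True for a plain hostname or a single-wildcard pattern."""
    return _ENTRY.fullmatch(entry) is not None


for entry in ["example.com", "*.example.com", "https://example.com",
              "example.com/path", "*.*.example.com"]:
    print(entry, is_valid_allowlist_entry(entry))
```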

## Conversations and bunkers double-charged

- A persistent bunker holds quota for its entire lifetime, even when paused. A paused bunker counts toward `max_concurrent_bunkers`, not toward the CPU usage tracked in Redis.
- If you want to give back resources, terminate (optionally with snapshot) — pause doesn't release.
