Troubleshooting

A short list of things that go wrong and how to fix them. If none of these match, take the request id from the response envelope and search the ScaiGrid logs for it.

Bunker stays in pending or provisioning

The worker hasn't acknowledged the create yet, or the scheduler can't find a worker that fits.

  • WORKER_UNAVAILABLE on create. No worker has enough free CPU / memory / disk / GPU. Either reduce the request, wait for other bunkers to terminate, or add a worker.
  • NO_SUITABLE_WORKER on create. No worker has the image cached and ready. Either set lazy_pull: true on the image, force a warm-up with POST /images/{id}/warm, or wait for fan-out to complete.
  • Stuck in provisioning for more than a minute. The worker took the assignment but never reported back. Check the worker's status with GET /workers/{id} — if last_heartbeat is stale, the heartbeat monitor will fail the bunker out automatically within ~30 seconds.

BUNKER_QUOTA_EXCEEDED

Creating the bunker would push the caller's resolved quota over a cap.

  • Read GET /quota-profiles/usage — you'll see one row per profile applying to your user, with the bucket, current usage, cap, and headroom.
  • Terminate idle bunkers; quota is decremented across every bucket they contributed to.
  • Ask a tenant admin to assign a more generous profile or raise the existing one.
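
To find which bucket is blocking a create, the usage rows can be compared against what the new bunker would consume. The following is a minimal sketch, assuming each row from GET /quota-profiles/usage is a dict with `profile`, `bucket`, `usage`, `cap`, and `headroom` keys; those key names, and the `cpu_cores` bucket, are assumptions for illustration and may not match the real response.

```python
from typing import Dict, Iterable, List


def blocking_rows(rows: Iterable[dict], requested: Dict[str, float]) -> List[dict]:
    """Return the usage rows whose headroom is smaller than the request.

    `rows` mirrors the per-profile rows described above (key names assumed).
    `requested` maps a bucket name to the amount the new bunker would add.
    """
    blocking = []
    for row in rows:
        need = requested.get(row["bucket"], 0)
        if need > row["headroom"]:
            blocking.append(row)
    return blocking


# Hypothetical sample data, not real API output.
sample_rows = [
    {"profile": "team-default", "bucket": "max_concurrent_bunkers",
     "usage": 5, "cap": 5, "headroom": 0},
    {"profile": "team-default", "bucket": "cpu_cores",
     "usage": 4, "cap": 16, "headroom": 12},
]
blocked = blocking_rows(sample_rows, {"max_concurrent_bunkers": 1, "cpu_cores": 2})
```

Any row returned here is a bucket where terminating idle bunkers or a more generous profile would be needed before the create can succeed.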

NETWORK_PROFILE_DENIED or LIFECYCLE_MODE_DENIED

You picked a profile or lifecycle that needs a permission you don't have.

  • registry needs scaibunker:network:registry, allowlisted needs scaibunker:network:allowlisted, etc.
  • session needs scaibunker:create:session, persistent needs scaibunker:create:persistent.
  • Ask a tenant admin to grant the specific key — these are deliberately granular.
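
A client can check for the missing key before asking an admin. This sketch covers only the four permission keys named above; other profiles and lifecycle modes are deliberately left out rather than guessed, and the function name and the `granted` set are hypothetical.

```python
from typing import List, Set

# Only the keys listed in this section; anything else is unknown here.
NETWORK_PERMISSIONS = {
    "registry": "scaibunker:network:registry",
    "allowlisted": "scaibunker:network:allowlisted",
}
LIFECYCLE_PERMISSIONS = {
    "session": "scaibunker:create:session",
    "persistent": "scaibunker:create:persistent",
}


def missing_permissions(network_profile: str, lifecycle_mode: str,
                        granted: Set[str]) -> List[str]:
    """List the known permission keys the caller lacks for this request."""
    missing = []
    for table, value in ((NETWORK_PERMISSIONS, network_profile),
                         (LIFECYCLE_PERMISSIONS, lifecycle_mode)):
        key = table.get(value)  # None when the value is not in the table
        if key is not None and key not in granted:
            missing.append(key)
    return missing


need = missing_permissions("registry", "persistent",
                           {"scaibunker:network:registry"})
```

The returned list is exactly what to put in the request to the tenant admin, since the keys are granted individually.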

INTERFACES_NOT_ALLOWED / TRANSIT_MISSING_INTERFACES

The interfaces[] array is only valid with network_profile: "transit", and transit requires at least one interface.

  • For transit bunkers, name a bridge_name for each interface; the bridge must already exist on a worker via POST /bridges.
  • For everything else, drop the interfaces field.
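
The rule can be checked client-side before the create call. This is a sketch of the rule as stated here, not the server's validation; treating an empty `interfaces` list on a non-transit bunker as an error is an assumption that follows the "drop the field" advice.

```python
from typing import Optional


def check_interfaces(body: dict) -> Optional[str]:
    """Return the error code the create body would likely trigger, or None."""
    is_transit = body.get("network_profile") == "transit"
    if is_transit and not body.get("interfaces"):
        # transit needs at least one interface
        return "TRANSIT_MISSING_INTERFACES"
    if not is_transit and "interfaces" in body:
        # interfaces is only valid together with the transit profile
        return "INTERFACES_NOT_ALLOWED"
    return None


# "br-lab" is a hypothetical bridge name.
ok = check_interfaces({"network_profile": "transit",
                       "interfaces": [{"bridge_name": "br-lab"}]})
```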

exec returns a 0 exit code but stdout is empty

Two common causes:

  • Output went to a file inside the bunker. Read it back with GET /files/....
  • Output was truncated and offloaded to S3. Check for truncated: true and full_output_ref on the response, then fetch the full output via GET /storage/output/{bunker_id}/{name}.

exec times out mid-command

The default timeout_s is 60. For builds, installs, or long-running scripts:

  • Bump timeout_s (no hard ceiling at the API; the bunker's own max_lifetime_s is the upper bound).
  • Use "stream": true so you see progress as it happens and can decide when to abort.
  • Move long work into a snapshot-able session bunker so a partial result survives a controller restart.
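
As an illustration of the first two options, an exec request body for a long build might be assembled as below. Only `timeout_s` and `stream` come from this page; the `command` field name and the helper itself are assumptions. The clamp reflects the note that the bunker's max_lifetime_s is the effective upper bound.

```python
from typing import Optional


def build_exec_body(command: str, timeout_s: int = 60,
                    max_lifetime_s: Optional[int] = None,
                    stream: bool = False) -> dict:
    """Build an exec request body with a raised timeout.

    "command" is a hypothetical field name; timeout_s and stream are the
    options named in this section.
    """
    if timeout_s <= 0:
        raise ValueError("timeout_s must be positive")
    if max_lifetime_s is not None:
        # A timeout longer than the bunker's lifetime cannot take effect.
        timeout_s = min(timeout_s, max_lifetime_s)
    return {"command": command, "timeout_s": timeout_s, "stream": stream}


body = build_exec_body("make -j4", timeout_s=1800,
                       max_lifetime_s=900, stream=True)
```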

Files PUT returns 413 or fails on a large file

The inline PUT path is fine for files under ~10 MB. For larger files, use the three-step upload, in order:

  • Call POST /files/upload for a pre-signed S3 URL and key.
  • Upload the file directly to S3.
  • Call POST /files/commit with {key, dest_path} — the worker injects the file into the bunker.
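
The three steps can be sketched with the HTTP calls injected as plain callables, so the flow stays independent of any particular client library. The `url` and `key` response fields, the empty request body for /files/upload, and the exact 10 MB cutoff are assumptions; `api_post` is expected to add whatever base URL and authentication the deployment needs.

```python
from typing import Callable

# "~10 MB" in the text; the precise cutoff is an assumption.
INLINE_LIMIT_BYTES = 10 * 1024 * 1024


def needs_presigned_upload(size_bytes: int) -> bool:
    """True when the file is too large for the inline PUT path."""
    return size_bytes > INLINE_LIMIT_BYTES


def upload_large_file(data: bytes, dest_path: str,
                      api_post: Callable[[str, dict], dict],
                      s3_put: Callable[[str, bytes], None]) -> dict:
    """Run the upload, S3 PUT, commit sequence and return the commit reply."""
    # 1. Ask for a pre-signed S3 URL and key (response field names assumed).
    grant = api_post("/files/upload", {})
    # 2. Send the bytes directly to S3 using the pre-signed URL.
    s3_put(grant["url"], data)
    # 3. Tell the worker to inject the uploaded object into the bunker.
    return api_post("/files/commit",
                    {"key": grant["key"], "dest_path": dest_path})
```

Passing the transports in as arguments also makes the sequence easy to exercise with fakes before pointing it at a real controller.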

Image registered but no bunkers can use it

  • Check GET /images/{id}/cache — if every row is pending or failed, the fan-out didn't reach the workers.
  • Check that the image is in an availability group that contains your workers (add it with POST /availability-groups/{group_id}/images if it is not).
  • Re-trigger with POST /images/{id}/warm (idempotent).
  • Inspect a failing worker row's error field — usually a registry auth failure, an OOM during mkfs.ext4, or a size cap that was set too low.
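
The first and last checks can be combined into one pass over the cache rows. This is a sketch that assumes each row from GET /images/{id}/cache has `worker_id`, `status`, and `error` keys; only the `error` field and the pending and failed states are named on this page, so the other names (including the "ready" status in the sample) are hypothetical.

```python
from collections import Counter
from typing import Iterable


def summarize_cache(rows: Iterable[dict]) -> dict:
    """Count cache rows by status and collect errors from failed workers."""
    rows = list(rows)
    counts = Counter(row["status"] for row in rows)
    # Usable means at least one worker is past pending/failed.
    usable = any(status not in ("pending", "failed") for status in counts)
    errors = {row["worker_id"]: row.get("error")
              for row in rows if row["status"] == "failed"}
    return {"counts": dict(counts), "usable": usable, "errors": errors}


# Hypothetical sample rows, not real API output.
summary = summarize_cache([
    {"worker_id": "w1", "status": "failed", "error": "registry auth failure"},
    {"worker_id": "w2", "status": "pending", "error": None},
])
```

When `usable` is false, the fan-out has not reached any worker, and the `errors` map shows where to look first.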

Image scan stuck on pending

  • The scanner runs every 2 minutes as a background task. Wait a couple of minutes.
  • If still pending after 10 minutes, check that Trivy is installed on the controller (trivy --version in the controller container). Missing binary → status flips to failed with a "scanner not available" message.

Worker shows offline but its host is up

  • The heartbeat is what makes a worker online. The default cadence is 10 seconds, and WORKER_STALE_THRESHOLD_MULTIPLIER (default 3) means 30 seconds without a heartbeat flips the worker offline.
  • Check the worker's own logs for heartbeat send failures (wrong SCAIBUNKER_WORKER_TOKEN, controller URL, network reachability).
  • The auto-detected status is in Redis; if Redis was wiped, the next heartbeat re-populates it.
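
The staleness threshold is simply cadence times multiplier, 10 s × 3 = 30 s with the defaults. A minimal sketch of that check follows; the function name is hypothetical, and whether the real monitor uses a strict or inclusive comparison at exactly 30 seconds is an assumption.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_INTERVAL_S = 10               # default cadence from the text
WORKER_STALE_THRESHOLD_MULTIPLIER = 3   # default from the text


def is_stale(last_heartbeat: datetime, now: datetime,
             interval_s: int = HEARTBEAT_INTERVAL_S,
             multiplier: int = WORKER_STALE_THRESHOLD_MULTIPLIER) -> bool:
    """True when the last heartbeat is older than interval * multiplier."""
    threshold = timedelta(seconds=interval_s * multiplier)
    return (now - last_heartbeat) > threshold


now = datetime(2026, 5, 18, 15, 0, 0, tzinfo=timezone.utc)
recent = is_stale(now - timedelta(seconds=12), now)
old = is_stale(now - timedelta(seconds=45), now)
```

Comparing the last_heartbeat value from GET /workers/{id} against the current time this way shows whether a worker is about to be marked offline.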

Snapshots filling up S3

  • Snapshots default to a 7-day retention (DEFAULT_SNAPSHOT_RETENTION_DAYS). The cleanup background task runs every 5 minutes.
  • Any snapshot with expires_at set that is older than the retention period is deleted automatically.
  • Manual snapshots without expires_at are kept indefinitely. Set a lifecycle rule on the S3 bucket if you want hard caps.
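
The retention rule can be pictured as follows. This sketch assumes that expires_at is derived as the creation time plus the retention period and that cleanup removes a snapshot once expires_at has passed; the page states the defaults but not the exact derivation, so treat both functions as illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_SNAPSHOT_RETENTION_DAYS = 7  # default from the text


def default_expiry(created_at: datetime,
                   retention_days: int = DEFAULT_SNAPSHOT_RETENTION_DAYS) -> datetime:
    """Assumed derivation: creation time plus the retention period."""
    return created_at + timedelta(days=retention_days)


def is_cleanup_candidate(expires_at: Optional[datetime], now: datetime) -> bool:
    """Snapshots without expires_at are never candidates."""
    return expires_at is not None and now >= expires_at


created = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
check_time = datetime(2026, 5, 10, 12, 0, tzinfo=timezone.utc)
auto = is_cleanup_candidate(default_expiry(created), check_time)
manual = is_cleanup_candidate(None, check_time)
```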

Storage proxy 401s

  • The proxy verifies a constant SCAIBUNKER_WORKER_TOKEN plus, optionally, capability tokens minted via POST /storage/capabilities.
  • Mismatch between the worker's token and the controller's is the most common cause — they have to share the same string.
  • For capability tokens, scaibunker_capability_secret must be set on the proxy as well as the controller.
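
One way to confirm a token mismatch without pasting secrets into logs or chat is to compare short fingerprints of the value configured on each side. This is a generic diagnostic sketch using only the Python standard library, not a ScaiBunker feature; a trailing newline or stray whitespace in an environment file is a typical cause of differing fingerprints.

```python
import hashlib
import hmac


def token_fingerprint(token: str) -> str:
    """Short SHA-256 prefix that is safe to display and compare."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()[:12]


def tokens_match(worker_token: str, controller_token: str) -> bool:
    """Constant-time comparison of the two configured token strings."""
    return hmac.compare_digest(worker_token.encode("utf-8"),
                               controller_token.encode("utf-8"))


# Hypothetical values: the second carries a trailing newline.
same = tokens_match("s3cr3t-token", "s3cr3t-token")
differs = tokens_match("s3cr3t-token", "s3cr3t-token\n")
```

Run `token_fingerprint` on the worker's and the controller's SCAIBUNKER_WORKER_TOKEN separately and compare the two short strings.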

Bunker can't reach hosts you expected

  • On isolated, it can't reach anything. On registry, only the platform's package mirrors. On allowlisted, only what you listed.
  • Allowlist entries are plain hostnames or *.domain.com. URLs, paths, and double wildcards are rejected at create time.
  • For one-off "I just need to curl this URL" cases, use unrestricted and enable the egress audit so the run is logged.
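
Allowlist entries can be pre-checked before the create call. The pattern below is a sketch of the stated rule (a plain hostname, or a single leading `*.` wildcard); the server's actual validation may be stricter or looser, for example about single-label names or label lengths.

```python
import re

# One DNS label: letters, digits, inner hyphens, at most 63 characters.
_LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
# Optional single leading "*." followed by dot-separated labels.
_ENTRY = re.compile(r"(?:\*\.)?" + _LABEL + r"(?:\." + _LABEL + r")*")


def is_valid_allowlist_entry(entry: str) -> bool:
    """True for plain hostnames and *.domain.com style entries."""
    return _ENTRY.fullmatch(entry) is not None


results = {entry: is_valid_allowlist_entry(entry) for entry in [
    "pypi.org",             # plain hostname
    "*.github.com",         # single leading wildcard
    "https://pypi.org",     # URL: rejected
    "pypi.org/simple",      # path: rejected
    "*.*.example.com",      # double wildcard: rejected
]}
```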

Conversations and bunkers double-charged

  • A persistent bunker holds quota for its entire lifetime, even when paused. A paused bunker counts toward max_concurrent_bunkers, not toward CPU usage on Redis.
  • If you want to give back resources, terminate the bunker (optionally with a snapshot). Pausing does not release them.
Updated 2026-05-18 15:01:27, rev 12