Troubleshooting

A short list of things that go wrong and how to fix them. ScaiMind is a bridge — most "problems" are coordinator-side and surface as gRPC error codes mapped onto HTTP. If none of these match, capture the request id from the response envelope and grep the ScaiGrid logs for scaimind_coordinator_error.
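
As a quick triage aid, the status codes covered on this page can be mapped back to the coordinator's gRPC codes. The lookup below is only a sketch built from the sections that follow; `HTTP_TO_GRPC` and `triage` are illustrative names, not part of ScaiMind.

```python
# Illustrative lookup built only from the cases described on this page.
# HTTP_TO_GRPC and triage() are hypothetical helpers, not ScaiMind APIs.
HTTP_TO_GRPC = {
    400: ("INVALID_ARGUMENT",),
    500: ("INTERNAL", "UNKNOWN"),
    503: ("UNAVAILABLE",),
    504: ("DEADLINE_EXCEEDED",),
}


def triage(http_status: int) -> str:
    """Describe which coordinator gRPC code(s) an HTTP status points to."""
    codes = HTTP_TO_GRPC.get(http_status)
    if codes is None:
        return f"HTTP {http_status}: no coordinator mapping listed here; check the logs by request id"
    return f"HTTP {http_status}: coordinator returned {' or '.join(codes)}"
```

For example, `triage(504)` points at DEADLINE_EXCEEDED, which is covered under job submission timeouts.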

All ScaiMind requests fail with 503

The coordinator returned UNAVAILABLE. Either the MindCoordinator process is down, the network path from ScaiGrid to the coordinator is broken, or the gRPC channel has been stuck in a failed state for a long time.

Check the coordinator process is running and listening on the configured scaimind_grpc_host:scaimind_grpc_port.
Confirm DNS and routing from the ScaiGrid host to the coordinator.
If TLS is enabled (scaimind_tls_enabled=true), verify cert paths exist and certs haven't expired.
Restart ScaiGrid if the channel itself looks wedged — the client is created at module init and held for the process lifetime.
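
The first three checks can be scripted from the ScaiGrid host. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and the host and port you pass in are whatever scaimind_grpc_host and scaimind_grpc_port are set to in your deployment.

```python
import socket
import ssl
from datetime import datetime, timezone
from typing import Optional


def port_is_open(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    A failure covers DNS errors, routing problems, and a coordinator
    that is not listening; it does not tell them apart.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def days_until_expiry(not_after: str, now: Optional[datetime] = None) -> float:
    """Days left on a certificate, given its notAfter string.

    The string must use the format found in the dict returned by
    ssl.SSLSocket.getpeercert(), e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    current = now or datetime.now(timezone.utc)
    return (expires - current).total_seconds() / 86400
```

A negative result from `days_until_expiry` means the certificate has already expired.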

All ScaiMind requests fail with 500 and "internal error"

INTERNAL or UNKNOWN from the coordinator. The raw detail is intentionally hidden from the caller; look at the ScaiGrid server logs for the scaimind_coordinator_error structured log line — it will have the original raw_detail.
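
Finding the matching log line can be scripted. The sketch below assumes the ScaiGrid logs are written one record per line and that the request id appears somewhere in the record; both are assumptions about your logging setup, and `find_coordinator_errors` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def find_coordinator_errors(lines: Iterable[str], request_id: str) -> Iterator[dict]:
    """Yield scaimind_coordinator_error records that mention request_id.

    Lines that are not valid JSON are still returned, wrapped in a dict,
    so that plain-text log formats are not silently dropped.
    """
    for line in lines:
        if "scaimind_coordinator_error" not in line or request_id not in line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            record = {"unparsed": line.rstrip("\n")}
        yield record
```

Typical use would be `for rec in find_coordinator_errors(open(path), request_id): print(rec.get("raw_detail"))`, where `path` is wherever your deployment writes its server log.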

Job submission returns 400 with "INVALID_ARGUMENT"

The coordinator rejected the body. The detail is passed through; common causes:

base_model.model_id empty or not resolvable.
gpu_count is 0, or gpu_type is not in the node inventory.
training_type is one of the seven valid values, but the selected framework doesn't support it.
data_config.sources empty.
Hyperparameter dict contains a key the framework doesn't recognise — most frameworks validate names.
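
Several of these causes can be caught on the client before submitting. The sketch below checks only the fields named above; the nesting of gpu_count and gpu_type under resource_config follows the field paths used elsewhere on this page, and `preflight` and `known_gpu_types` are hypothetical names. Framework support and hyperparameter names are not checked because they depend on the coordinator.

```python
from typing import Any, Dict, List, Set


def preflight(body: Dict[str, Any], known_gpu_types: Set[str]) -> List[str]:
    """Return a list of likely INVALID_ARGUMENT causes in a submission body."""
    problems: List[str] = []

    if not body.get("base_model", {}).get("model_id"):
        problems.append("base_model.model_id is empty")

    resources = body.get("resource_config", {})
    if int(resources.get("gpu_count", 0)) < 1:
        problems.append("resource_config.gpu_count must be at least 1")
    if resources.get("gpu_type") not in known_gpu_types:
        problems.append("resource_config.gpu_type is not in the node inventory")

    if not body.get("data_config", {}).get("sources"):
        problems.append("data_config.sources is empty")

    return problems
```

An empty list does not guarantee acceptance; it only rules out the client-checkable causes.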

Job submission returns 504

DEADLINE_EXCEEDED. The coordinator took longer than scaimind_grpc_timeout_s to accept the submission. Submission should be cheap — if it's timing out, the coordinator is overloaded. Check queue depth (GET /queue) and cluster utilisation (GET /cluster).

Job stays in `PENDING` or `QUEUED` indefinitely

No capacity matches the resource request.

Compare resource_config.gpu_count and gpu_type to GET /cluster and GET /cluster/nodes. If you asked for H100 and there are only A100s, the job will queue forever.
Check scheduling_config.queue matches a queue the coordinator actually services.
Watch the queue: GET /queue shows position and an estimated wait.
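
The comparison in the first check can be sketched as follows. The node shape (a list of dicts with `gpu_type` and `gpu_count`) is an assumption for illustration and may not match what GET /cluster/nodes actually returns; `capacity_for` is a hypothetical helper.

```python
from typing import Any, Dict, List


def capacity_for(resource_config: Dict[str, Any], nodes: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarise how many GPUs of the requested type the cluster has at all."""
    wanted = resource_config["gpu_type"]
    per_node = [n["gpu_count"] for n in nodes if n["gpu_type"] == wanted]
    return {
        "gpu_type": wanted,
        "requested": resource_config["gpu_count"],
        "total": sum(per_node),                      # across every matching node
        "largest_node": max(per_node, default=0),    # relevant if jobs cannot span nodes
    }
```

If `total` is 0, no node of that type exists and the job cannot be placed until the request or the inventory changes. Whether a request larger than `largest_node` can still run depends on whether the coordinator schedules across nodes, which this sketch does not know.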

Job reaches `FAILED` immediately

Look at the job detail — error_type and error_message are populated. Most early failures:

Data fetch failed. ScaiDrive / ScaiAtlas token missing or expired, or the path doesn't exist. Pre-validate with POST /data/validate.
Model not found. base_model.model_id couldn't be downloaded. For private HF repos you need to supply a hub token; repos that ship custom model code may also need base_model.trust_remote_code set.
Framework misconfig. Distributed config inconsistent with world_size or gpu_count.

Job fails mid-training with OOM

Common, not always recoverable. Strategies in order:

Drop data_config.batch_size, raise gradient_accumulation_steps.
Lower max_seq_length.
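For LoRA, lower lora_r. lora_alpha only scales the adapter update and does not change memory use, so adjust it only to keep the lora_alpha / lora_r ratio you want.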
For full SFT, switch to LORA or QLORA.
Retry with modify_resources and bump to a bigger GPU type.
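
The first strategy trades memory for steps while keeping the effective batch size (batch size times accumulation steps) unchanged. A minimal sketch of that arithmetic is below; `halve_batch` is a hypothetical helper, and it assumes batch_size is the per-step batch that drives activation memory.

```python
from typing import Tuple


def halve_batch(batch_size: int, gradient_accumulation_steps: int) -> Tuple[int, int]:
    """Halve batch_size and raise accumulation so the effective batch does not shrink."""
    if batch_size < 2:
        raise ValueError("batch_size is already 1; move on to the next strategy")
    effective = batch_size * gradient_accumulation_steps
    new_batch = batch_size // 2
    # Ceiling division: with an odd batch_size the effective batch may grow slightly.
    new_accum = -(-effective // new_batch)
    return new_batch, new_accum
```

For example, a batch size of 8 with 2 accumulation steps becomes 4 with 4 steps, which still processes 16 samples per optimizer update.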

Streams (logs or metrics) close immediately

The endpoint emits an error event with {"error": "stream ended"} whenever the upstream gRPC stream raises. Usually:

Job is in a terminal state — COMPLETED, FAILED, CANCELLED, PREEMPTED. Streams don't follow past the end.
Coordinator restarted mid-stream — reconnect.
Stream timeout (scaimind_grpc_stream_timeout_s) elapsed.

Reopen the stream. A closed stream does not by itself mean data was lost; for logs, use tail= to backfill on reconnect.
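
A client-side reconnect loop for the cases above might look like the sketch below. It is transport-agnostic: `open_stream` and `get_state` are hypothetical callables you would implement against the log stream and job detail endpoints, events are assumed to be already parsed into dicts, and the `backfill` value of 200 is an arbitrary example for tail=.

```python
import time
from typing import Callable, Iterable, Iterator, Optional

TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "PREEMPTED"}


def follow(
    open_stream: Callable[[Optional[int]], Iterable[dict]],
    get_state: Callable[[], str],
    backfill: int = 200,
    max_reconnects: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield stream events, reconnecting until the job reaches a terminal state."""
    tail: Optional[int] = None  # first connection: no backfill requested
    attempt = 0
    while True:
        for event in open_stream(tail):
            if "error" in event:  # e.g. {"error": "stream ended"}
                break
            yield event
        # Stop if the job is finished or we have retried enough.
        if get_state() in TERMINAL_STATES or attempt >= max_reconnects:
            return
        sleep(min(2 ** attempt, 30))  # exponential backoff, capped at 30 s
        attempt += 1
        tail = backfill  # ask for recent lines again on reconnect
```

Because the reconnect asks for a backfill, the caller may see some lines twice and should deduplicate if that matters.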

SSE stream works locally but not in production

Reverse proxies and load balancers often buffer responses. To prevent buffering:

Ensure your proxy forwards text/event-stream without buffering (Nginx: proxy_buffering off; proxy_cache off;).
Ensure your proxy doesn't enforce a short upstream read timeout — log streams can be quiet for minutes between entries.
For browser clients, use the EventSource API rather than fetch for built-in reconnect handling.

`GET /data/cache` or `POST /data/validate` return permission errors

These two endpoints forward ScaiDrive and ScaiAtlas tokens minted via TokenExchangeDep. If the caller's identity doesn't have the scopes those services require, the coordinator-side fetch will fail with a 403. Check the user has the right downstream permissions in ScaiKey, not just scaimind:manage.

Local dashboard shows stale job state

The mod_scaimind_jobs table is refreshed best-effort during GET /jobs list calls — not by GETs of individual jobs and not by lifecycle calls. A dashboard reading directly from the local cache may lag.

The admin UI's Training Monitor reads the coordinator directly for the open-job view; the cache only powers the list.
If you've built a custom dashboard hitting the DB, trigger a list call (or wait for the next one) to refresh.

Drain command returned 200 but jobs still running on the node

force: false is the default — the node stops accepting new jobs but lets in-flight work finish. If you need an immediate stop, drain with force: true. Be aware that forcibly aborting in-flight training jobs marks them FAILED or PREEMPTED and may waste GPU time.

Token exchange dependency raises 500 on submission

The TokenExchangeDep couldn't mint a downstream token — usually because the caller authenticated via an API key path that doesn't carry the scopes needed for token exchange, or because ScaiKey is unreachable. Check the audit log for the dependency error and the ScaiKey health.