Troubleshooting
A short list of things that go wrong and how to fix them. ScaiMind is a bridge — most "problems" are coordinator-side and surface as gRPC error codes mapped onto HTTP. If none of these match, capture the request id from the response envelope and grep the ScaiGrid logs for scaimind_coordinator_error.
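The code-to-status pairs named in the sections of this page can be collected into a small lookup table for client-side triage. This is a minimal sketch built only from what this page states. The names GRPC_TO_HTTP and http_status_for are illustrative and not part of ScaiMind, the real bridge may map more codes than these, and the fallback of 500 for unlisted codes is an assumption.

```python
# gRPC status name -> HTTP status, as described on this page only.
GRPC_TO_HTTP = {
    "UNAVAILABLE": 503,        # coordinator down or unreachable
    "INTERNAL": 500,           # raw detail hidden from the caller
    "UNKNOWN": 500,
    "INVALID_ARGUMENT": 400,   # coordinator rejected the body
    "DEADLINE_EXCEEDED": 504,  # slower than scaimind_grpc_timeout_s
}


def http_status_for(grpc_code: str, default: int = 500) -> int:
    """Return the HTTP status for a gRPC code name.

    The default for codes not listed above is an assumption.
    """
    return GRPC_TO_HTTP.get(grpc_code.upper(), default)
```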
All ScaiMind requests fail with 503
UNAVAILABLE from the coordinator. Either the MindCoordinator process is down, the network path from ScaiGrid to the coordinator is broken, or the channel has been wedged for a long time.
- Check the coordinator process is running and listening on the configured scaimind_grpc_host:scaimind_grpc_port.
- Confirm DNS and routing from the ScaiGrid host to the coordinator.
- If TLS is enabled (scaimind_tls_enabled=true), verify cert paths exist and certs haven't expired.
- Restart ScaiGrid if the channel itself looks wedged. The client is created at module init and held for the process lifetime.
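The first three checks can be partly scripted from the ScaiGrid host. The sketch below uses only the Python standard library. The helper names are hypothetical. It tests plain TCP reachability, not a gRPC health check, so a True result only shows that something is listening. The expiry helper expects the notAfter string as printed by a tool such as openssl x509 -enddate.

```python
import socket
import ssl
import time
from typing import Optional


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers DNS failure, connection refused, and timeouts.
        return False


def seconds_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Seconds until a certificate's notAfter time; negative if expired.

    not_after uses the certificate text format, e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    return expires_at - (time.time() if now is None else now)
```

Pass the values of scaimind_grpc_host and scaimind_grpc_port from your own configuration to can_connect.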
All ScaiMind requests fail with 500 and "internal error"
INTERNAL or UNKNOWN from the coordinator. The raw detail is intentionally hidden from the caller; look at the ScaiGrid server logs for the scaimind_coordinator_error structured log line — it will have the original raw_detail.
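Finding the matching log line can be scripted. The sketch below assumes the ScaiGrid logs are written as one JSON object per line and that each record carries a request_id field. Both are assumptions about your logging setup, and coordinator_errors is a hypothetical helper. Only the scaimind_coordinator_error marker and the raw_detail field come from this page.

```python
import json
from typing import Iterable, Iterator


def coordinator_errors(lines: Iterable[str], request_id: str) -> Iterator[dict]:
    """Yield parsed scaimind_coordinator_error records for one request id."""
    for line in lines:
        # Cheap substring filter before paying for JSON parsing.
        if "scaimind_coordinator_error" not in line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON log line; skip it
        # "request_id" is an assumed field name.
        if isinstance(record, dict) and record.get("request_id") == request_id:
            yield record
```

Iterating over an open log file and printing record.get("raw_detail") for each hit shows the detail hidden from the caller.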
Job submission returns 400 with "INVALID_ARGUMENT"
The coordinator rejected the body. The detail is passed through; common causes:
- base_model.model_id empty or not resolvable.
- gpu_count 0 or gpu_type not in node inventory.
- training_type is one of the seven valid values but the framework doesn't support it.
- data_config.sources empty.
- Hyperparameter dict contains a key the framework doesn't recognise. Most frameworks validate names.
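Some of these causes can be caught before the request leaves the client. The sketch below checks only the conditions that are visible locally. It assumes the submission body is a nested dict with gpu_count and gpu_type under resource_config, as the queueing section of this page suggests. precheck_job and its known_gpu_types parameter are hypothetical, and the coordinator stays the authority on what is valid.

```python
from typing import Iterable, List, Optional


def precheck_job(body: dict, known_gpu_types: Optional[Iterable[str]] = None) -> List[str]:
    """Return a list of locally detectable problems in a job submission body."""
    problems = []

    if not body.get("base_model", {}).get("model_id"):
        problems.append("base_model.model_id is empty")

    resources = body.get("resource_config", {})
    gpu_count = resources.get("gpu_count", 0)
    if not isinstance(gpu_count, int) or gpu_count < 1:
        problems.append("resource_config.gpu_count must be at least 1")

    # Only checked when the caller supplies an inventory, e.g. from GET /cluster/nodes.
    if known_gpu_types is not None and resources.get("gpu_type") not in set(known_gpu_types):
        problems.append("resource_config.gpu_type is not in the node inventory")

    if not body.get("data_config", {}).get("sources"):
        problems.append("data_config.sources is empty")

    return problems
```

An empty list means none of these checks failed. Whether the model resolves, and whether the framework supports the training type and hyperparameter names, can only be decided by the coordinator.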
Job submission returns 504
DEADLINE_EXCEEDED. The coordinator took longer than scaimind_grpc_timeout_s to accept the submission. Submission should be cheap — if it's timing out, the coordinator is overloaded. Check queue depth (GET /queue) and cluster utilisation (GET /cluster).
Job stays in PENDING or QUEUED indefinitely
No capacity matches the resource request.
- Compare resource_config.gpu_count and gpu_type to GET /cluster and GET /cluster/nodes. If you asked for H100 and there are only A100s, the job will queue forever.
- Check scheduling_config.queue matches a queue the coordinator actually services.
- Watch the queue: GET /queue shows position and an estimated wait.
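The first comparison can be written as a quick check. The response shape of GET /cluster/nodes is not described here, so the sketch below assumes each node is a dict with gpu_type and gpu_count fields. Those field names and the helper names are hypothetical. It only answers whether the cluster holds enough GPUs of the requested type in total, and ignores per-node placement and current utilisation.

```python
from typing import Iterable, Mapping


def total_gpus(nodes: Iterable[Mapping], gpu_type: str) -> int:
    """Sum the GPUs of one type across all nodes (assumed field names)."""
    return sum(n.get("gpu_count", 0) for n in nodes if n.get("gpu_type") == gpu_type)


def can_ever_schedule(nodes: Iterable[Mapping], gpu_type: str, gpu_count: int) -> bool:
    """True if the inventory contains at least gpu_count GPUs of gpu_type."""
    return total_gpus(nodes, gpu_type) >= gpu_count
```

A False result means the request cannot be satisfied by the current inventory, which matches the "queue forever" case above.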
Job reaches FAILED immediately
Look at the job detail — error_type and error_message are populated. Most early failures:
- Data fetch failed. ScaiDrive / ScaiAtlas token missing or expired, or the path doesn't exist. Pre-validate with POST /data/validate.
- Model not found. base_model.model_id couldn't be downloaded. For private HF repos you need to set base_model.trust_remote_code and/or supply a hub token.
- Framework misconfig. Distributed config inconsistent with world_size or gpu_count.
Job fails mid-training with OOM
Common, not always recoverable. Strategies in order:
- Drop data_config.batch_size, raise gradient_accumulation_steps.
- Lower max_seq_length.
- For LoRA, lower lora_r / lora_alpha.
- For full SFT, switch to LORA or QLORA.
- Retry with modify_resources and bump to a bigger GPU type.
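The first strategy works because the effective batch size is the per-device batch size multiplied by the accumulation steps. Halving one and doubling the other lowers peak activation memory while keeping the number of samples per optimizer step the same. A minimal sketch of that arithmetic follows. shrink_batch is a hypothetical helper, not a ScaiMind function.

```python
from typing import Tuple


def shrink_batch(batch_size: int, grad_accum_steps: int) -> Tuple[int, int]:
    """Halve the batch size and double accumulation, keeping their product fixed."""
    if batch_size < 2 or batch_size % 2 != 0:
        raise ValueError("batch_size must be an even number >= 2 to halve exactly")
    return batch_size // 2, grad_accum_steps * 2
```

For example, a job at batch size 8 with 4 accumulation steps becomes batch size 4 with 8 steps, and both settings process 32 samples per optimizer step on each device.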
Streams (logs or metrics) close immediately
The endpoint emits an error event with {"error": "stream ended"} whenever the upstream gRPC stream raises. Usually:
- Job is in a terminal state: COMPLETED, FAILED, CANCELLED, PREEMPTED. Streams don't follow past the end.
- Coordinator restarted mid-stream. Reconnect.
- Stream timeout (scaimind_grpc_stream_timeout_s) elapsed.
Reopen the stream. A closed stream does not by itself mean data was lost: for logs, use tail= to backfill on reconnect.
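A client can recognise the end-of-stream event and decide whether to reconnect. The sketch below parses Server-Sent Events text lines and detects the {"error": "stream ended"} payload described above. It assumes the payload arrives as the data of an event named error. The event name log in the usage example and both function names are hypothetical.

```python
import json
from typing import Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event, data) pairs from Server-Sent Events text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment / keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # Per the SSE format, drop a single leading space after the colon.
            data.append(value[1:] if value.startswith(" ") else value)


def stream_ended(event: str, data: str) -> bool:
    """True if this event is the 'stream ended' error described on this page."""
    if event != "error":
        return False
    try:
        payload = json.loads(data)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("error") == "stream ended"
```

When stream_ended returns True, fetch the job detail first. If the job is in a terminal state there is nothing more to follow. Otherwise reopen the stream, using tail= for logs.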
SSE stream works locally but not in production
Reverse proxies and load balancers often buffer responses. To prevent buffering:
- Ensure your proxy forwards text/event-stream without buffering (Nginx: proxy_buffering off; proxy_cache off;).
- Ensure your proxy doesn't enforce a short upstream read timeout. Log streams can be quiet for minutes between entries.
- For browser clients, use the EventSource API rather than fetch for built-in reconnect handling.
GET /data/cache or POST /data/validate returns permission errors
These two endpoints forward ScaiDrive and ScaiAtlas tokens minted via TokenExchangeDep. If the caller's identity doesn't have the scopes those services require, the coordinator-side fetch will fail with a 403. Check the user has the right downstream permissions in ScaiKey, not just scaimind:manage.
Local dashboard shows stale job state
The mod_scaimind_jobs table is refreshed best-effort during GET /jobs list calls — not by GETs of individual jobs and not by lifecycle calls. A dashboard reading directly from the local cache may lag.
- The admin UI's Training Monitor reads the coordinator directly for the open-job view; the cache only powers the list.
- If you've built a custom dashboard hitting the DB, trigger a list call (or wait for the next one) to refresh.
Drain command returned 200 but jobs still running on the node
force: false is the default — the node stops accepting new jobs but lets in-flight work finish. If you need an immediate stop, drain with force: true. Be aware that forcibly aborting in-flight training jobs marks them FAILED or PREEMPTED and may waste GPU time.
Token exchange dependency raises 500 on submission
The TokenExchangeDep couldn't mint a downstream token — usually because the caller authenticated via an API key path that doesn't carry the scopes needed for token exchange, or because ScaiKey is unreachable. Check the audit log for the dependency error and the ScaiKey health.