---
summary: Common symptoms when ScaiMind requests fail or jobs misbehave, and what they
  usually mean.
title: Troubleshooting
path: troubleshooting
status: published
---

A short list of things that go wrong and how to fix them. ScaiMind is a bridge — most "problems" are coordinator-side and surface as gRPC error codes mapped onto HTTP. If none of these match, capture the request id from the response envelope and grep the ScaiGrid logs for `scaimind_coordinator_error`.
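
As a quick orientation, the gRPC-to-HTTP pairs described in the sections of this page can be collected into one lookup. This is a minimal sketch: the table name and `explain_status` helper are hypothetical, and only the pairs mentioned on this page are listed (a 500 can also come from the token exchange dependency rather than the coordinator).

```python
# HTTP status -> (gRPC code(s) from the coordinator, what it usually means).
# Only the pairs described on this page are listed; other mappings may exist.
GRPC_BY_HTTP_STATUS = {
    400: (("INVALID_ARGUMENT",), "request body rejected by the coordinator"),
    500: (("INTERNAL", "UNKNOWN"), "see raw_detail in the ScaiGrid logs"),
    503: (("UNAVAILABLE",), "coordinator down, unreachable, or channel wedged"),
    504: (("DEADLINE_EXCEEDED",), "slower than scaimind_grpc_timeout_s"),
}


def explain_status(status: int) -> str:
    """Return a one-line hint for an HTTP status seen on a ScaiMind request."""
    codes, hint = GRPC_BY_HTTP_STATUS.get(status, ((), "not covered here"))
    label = "/".join(codes) if codes else "unknown"
    return f"{status}: {label} ({hint})"
```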

## All ScaiMind requests fail with 503

`UNAVAILABLE` from the coordinator. Either the MindCoordinator process is down, the network path from ScaiGrid to the coordinator is broken, or the channel has been wedged for a long time.

- Check the coordinator process is running and listening on the configured `scaimind_grpc_host:scaimind_grpc_port`.
- Confirm DNS and routing from the ScaiGrid host to the coordinator.
- If TLS is enabled (`scaimind_tls_enabled=true`), verify cert paths exist and certs haven't expired.
- Restart ScaiGrid if the channel itself looks wedged — the client is created at module init and held for the process lifetime.
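
The first two checks above can be approximated from the ScaiGrid host with a plain TCP connect. This is a sketch only: `coordinator_reachable` is a hypothetical helper, and a successful connect proves name resolution and routing but says nothing about TLS or gRPC health.

```python
import socket


def coordinator_reachable(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        # create_connection resolves the hostname and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Call it with the values configured as `scaimind_grpc_host` and `scaimind_grpc_port`. A `False` result points at the process or the network path rather than at ScaiGrid's channel.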

## All ScaiMind requests fail with 500 and "internal error"

`INTERNAL` or `UNKNOWN` from the coordinator. The raw detail is intentionally hidden from the caller; look at the ScaiGrid server logs for the `scaimind_coordinator_error` structured log line — it will have the original `raw_detail`.
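
If the server logs are available as JSON lines, pulling out the record for one request could look like the sketch below. The field names `event` and `request_id` are assumptions about the log schema; only the `scaimind_coordinator_error` name and the `raw_detail` field come from this page.

```python
import json
from typing import Iterable, Iterator


def coordinator_errors(lines: Iterable[str], request_id: str) -> Iterator[dict]:
    """Yield parsed scaimind_coordinator_error records for one request id."""
    for line in lines:
        if "scaimind_coordinator_error" not in line:
            continue  # cheap substring pre-filter before parsing JSON
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON log line; skip it
        if not isinstance(record, dict):
            continue
        # "event" and "request_id" are assumed field names, not confirmed ones.
        if (record.get("event") == "scaimind_coordinator_error"
                and record.get("request_id") == request_id):
            yield record
```

Passing an open log file as `lines` works because file objects iterate line by line; read `raw_detail` from each yielded record.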

## Job submission returns 400 with "INVALID_ARGUMENT"

The coordinator rejected the body. The detail is passed through; common causes:

- `base_model.model_id` empty or not resolvable.
- `gpu_count` is 0, or `gpu_type` is not in the node inventory.
- `training_type` is one of the seven valid values but the framework doesn't support it.
- `data_config.sources` empty.
- Hyperparameter dict contains a key the framework doesn't recognise — most frameworks validate names.
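
Several of these causes can be caught client-side before submitting. The sketch below assumes the body is a nested dict and that `gpu_count` and `gpu_type` live under `resource_config`; `preflight_errors` is a hypothetical helper, and it deliberately skips the framework-specific checks (`training_type` support, hyperparameter names), which only the coordinator can answer.

```python
from typing import Any


def preflight_errors(body: dict[str, Any], known_gpu_types: set[str]) -> list[str]:
    """Return human-readable problems found in a job submission body."""
    problems = []
    if not body.get("base_model", {}).get("model_id"):
        problems.append("base_model.model_id is empty")
    # Assumed location of the GPU fields: body["resource_config"].
    resources = body.get("resource_config", {})
    if int(resources.get("gpu_count") or 0) < 1:
        problems.append("gpu_count must be at least 1")
    if resources.get("gpu_type") not in known_gpu_types:
        problems.append("gpu_type is not in the node inventory")
    if not body.get("data_config", {}).get("sources"):
        problems.append("data_config.sources is empty")
    return problems
```

An empty list means none of these particular checks failed; it does not guarantee the coordinator will accept the body.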

## Job submission returns 504

`DEADLINE_EXCEEDED`. The coordinator took longer than `scaimind_grpc_timeout_s` to accept the submission. Submission should be cheap — if it's timing out, the coordinator is overloaded. Check queue depth (`GET /queue`) and cluster utilisation (`GET /cluster`).

## Job stays in `PENDING` or `QUEUED` indefinitely

No capacity matches the resource request.

- Compare `resource_config.gpu_count` and `gpu_type` to `GET /cluster` and `GET /cluster/nodes`. If you asked for `H100` and there are only `A100`s, the job will queue forever.
- Check `scheduling_config.queue` matches a queue the coordinator actually services.
- Watch the queue: `GET /queue` shows position and an estimated wait.
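
The first comparison can be scripted once the node list is in hand. The sketch below assumes each node is a dict with `gpu_type` and `gpu_count` keys, which is a guess at the shape of the `GET /cluster/nodes` response, and `inventory_could_fit` is a hypothetical name. It checks a necessary condition only: enough GPUs of the requested type exist somewhere in the cluster.

```python
from collections import Counter


def inventory_could_fit(gpu_type: str, gpu_count: int, nodes: list[dict]) -> bool:
    """Return True if the cluster owns at least gpu_count GPUs of gpu_type."""
    totals = Counter()
    for node in nodes:
        # Assumed node shape: {"gpu_type": "A100", "gpu_count": 8, ...}
        totals[node["gpu_type"]] += node["gpu_count"]
    # A missing type counts as zero, so an H100 request against an
    # A100-only inventory returns False.
    return totals[gpu_type] >= gpu_count
```

A `False` result means the job can never leave the queue with the current inventory. A `True` result still leaves room for other blockers, such as busy nodes or a queue the coordinator does not service.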

## Job reaches `FAILED` immediately

Look at the job detail — `error_type` and `error_message` are populated. Most early failures:

- **Data fetch failed.** ScaiDrive / ScaiAtlas token missing or expired, or the path doesn't exist. Pre-validate with `POST /data/validate`.
- **Model not found.** `base_model.model_id` couldn't be downloaded. Private HF repos need a hub token; repos that ship custom model code may also need `base_model.trust_remote_code`.
- **Framework misconfig.** Distributed config inconsistent with `world_size` or `gpu_count`.

## Job fails mid-training with OOM

Common, not always recoverable. Strategies in order:

- Drop `data_config.batch_size`, raise `gradient_accumulation_steps`.
- Lower `max_seq_length`.
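- For LoRA, lower `lora_r`. `lora_alpha` only scales the adapter output and doesn't change memory use; adjust it alongside `lora_r` if you want to keep the same ratio.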
- For full SFT, switch to LORA or QLORA.
- Retry with `modify_resources` and bump to a bigger GPU type.
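
The first strategy works because the number of samples per optimizer step is the product of the batch size and the accumulation steps (times the number of data-parallel workers), so halving one and doubling the other leaves it unchanged while cutting activation memory. A minimal sketch, assuming `data_config.batch_size` is the per-device batch size; `halve_batch` is a hypothetical helper:

```python
def halve_batch(batch_size: int, grad_accum_steps: int) -> tuple[int, int]:
    """Halve the batch size and double accumulation, keeping their product."""
    if batch_size < 2 or batch_size % 2:
        raise ValueError("batch_size must be an even number >= 2")
    return batch_size // 2, grad_accum_steps * 2


# Example: batch 8 with 4 accumulation steps becomes batch 4 with 8 steps;
# both settings process 32 samples per optimizer step on each worker.
new_batch, new_accum = halve_batch(8, 4)
```

Training takes more forward and backward passes per step after the change, so expect it to be slower in exchange for the lower peak memory.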

## Streams (logs or metrics) close immediately

The endpoint emits an `error` event with `{"error": "stream ended"}` whenever the upstream gRPC stream raises. Usually:

- Job is in a terminal state — `COMPLETED`, `FAILED`, `CANCELLED`, `PREEMPTED`. Streams don't follow past the end.
- Coordinator restarted mid-stream — reconnect.
- Stream timeout (`scaimind_grpc_stream_timeout_s`) elapsed.

Reopen the stream, and don't assume it resumes exactly where it left off: for logs, use `tail=` to backfill on reconnect.
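
A client can tell this case apart from ordinary events by parsing the SSE framing and checking for the `error` event. The sketch below is a simplified parser (it handles only `event:` and `data:` fields), and both function names are hypothetical; only the event name and the `{"error": "stream ended"}` payload come from this page.

```python
import json
from typing import Iterable, Iterator


def sse_events(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (event_name, data) pairs from an iterable of SSE text lines."""
    name, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":  # a blank line terminates one event
            if data:
                yield name, "\n".join(data)
            name, data = "message", []
        elif line.startswith("event:"):
            name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format drops a single leading space after the colon.
            data.append(value[1:] if value.startswith(" ") else value)
        # Comment lines and other fields are ignored in this sketch.


def stream_ended(event_name: str, data: str) -> bool:
    """Return True for the error event that signals the upstream stream ended."""
    if event_name != "error":
        return False
    try:
        return json.loads(data).get("error") == "stream ended"
    except (json.JSONDecodeError, AttributeError):
        return False
```

When `stream_ended` returns `True`, check the job state before reconnecting: reopening a stream for a job in a terminal state will only close again.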

## SSE stream works locally but not in production

Reverse proxies and load balancers often buffer responses. To prevent buffering:

- Ensure your proxy forwards `text/event-stream` without buffering (Nginx: `proxy_buffering off; proxy_cache off;`).
- Ensure your proxy doesn't enforce a short upstream read timeout — log streams can be quiet for minutes between entries.
- For browser clients, use the EventSource API rather than `fetch` for built-in reconnect handling.

## `GET /data/cache` or `POST /data/validate` return permission errors

These two endpoints forward ScaiDrive and ScaiAtlas tokens minted via `TokenExchangeDep`. If the caller's identity doesn't have the scopes those services require, the coordinator-side fetch will fail with a 403. Check the user has the right downstream permissions in ScaiKey, not just `scaimind:manage`.

## Local dashboard shows stale job state

The `mod_scaimind_jobs` table is refreshed best-effort during `GET /jobs` list calls — not by GETs of individual jobs and not by lifecycle calls. A dashboard reading directly from the local cache may lag.

- The admin UI's Training Monitor reads the coordinator directly for the open-job view; the cache only powers the list.
- If you've built a custom dashboard hitting the DB, trigger a list call (or wait for the next one) to refresh.

## Drain command returned 200 but jobs still running on the node

`force: false` is the default — the node stops accepting new jobs but lets in-flight work finish. If you need an immediate stop, drain with `force: true`. Be aware that forcibly aborting in-flight training jobs marks them `FAILED` or `PREEMPTED` and may waste GPU time.

## Token exchange dependency raises 500 on submission

`TokenExchangeDep` couldn't mint a downstream token, usually because the caller authenticated via an API key path that doesn't carry the scopes needed for token exchange, or because ScaiKey is unreachable. Check the audit log for the dependency error, and check that ScaiKey is healthy.
