---
title: Rate Limiting
path: advanced/rate-limiting
status: published
---

# Rate Limiting

ScaiGrid enforces rate limits at four independent levels. This page covers how they work, how to interpret 429 responses, and how to design around them.

## The four levels

| Level | Default | Purpose |
|-------|---------|---------|
| API key | 60 req/min | Isolate runaway keys |
| User | 120 req/min | Prevent one user from monopolizing tenant capacity |
| Tenant | 1000 req/min | Fair sharing across tenants |
| Partner | 5000 req/min | Protect partner-level capacity |

All four limits are enforced independently. A request succeeds only if it fits under every applicable limit, so the most restrictive limit wins.

Default values are configurable per tenant by a partner admin or super admin.
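As a rough illustration of "most restrictive wins", the sketch below checks one request against all four levels using the default values from the table. The function name and the shape of `current_counts` are assumptions made for this example, not part of the ScaiGrid API.

```python
# Default per-minute limits, copied from the table above.
DEFAULT_LIMITS = {"api_key": 60, "user": 120, "tenant": 1000, "partner": 5000}


def first_exceeded_level(current_counts, limits=DEFAULT_LIMITS):
    """Return the first level that one more request would push over its limit.

    `current_counts` maps level name -> requests already counted in the window
    (hypothetical input shape). Returns None when the request fits everywhere.
    """
    for level, limit in limits.items():
        if current_counts.get(level, 0) + 1 > limit:
            return level
    return None


# A key at its 60 req/min cap blocks the request even though the
# user, tenant, and partner levels still have headroom.
print(first_exceeded_level({"api_key": 60, "user": 70, "tenant": 300, "partner": 900}))
```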

## Identifier resolution

The rate limiter resolves one identifier per level for each request:

- **API key** — exact key
- **User** — user ID (resolved from token)
- **Tenant** — tenant ID
- **Partner** — partner ID

For unauthenticated requests (rare — only public assets like model avatars), IP-based limiting applies.

## Window algorithm

Rate limits use sliding-window counters stored in Redis. Each level has its own counter and a 60-second rolling window (configurable).

At request time:

1. Compute (level, identifier, current_second).
2. `ZINCRBY` the counter.
3. Prune entries older than the window.
4. `ZCARD` the counter — if above the limit, reject.

This gives smooth enforcement without the "burst at window edge" problem of fixed-window limiters.
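The idea behind a rolling window can be sketched in memory, without Redis. This is a simplified analogue under stated assumptions, not ScaiGrid's implementation: the class and method names are hypothetical, timestamps are kept in a per-identifier deque, and rejected requests are not recorded, which may differ from the real limiter.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """In-memory sliding-window limiter (illustrative only)."""

    def __init__(self, limit, window_seconds=60.0):
        self.limit = limit
        self.window = window_seconds
        # (level, identifier) -> timestamps of accepted requests, oldest first
        self._events = defaultdict(deque)

    def allow(self, level, identifier, now=None):
        now = time.monotonic() if now is None else now
        events = self._events[(level, identifier)]
        # Prune timestamps that have fallen out of the rolling window.
        while events and events[0] <= now - self.window:
            events.popleft()
        # Reject if the window is already full.
        if len(events) >= self.limit:
            return False
        events.append(now)
        return True


limiter = SlidingWindowLimiter(limit=2, window_seconds=60)
print(limiter.allow("api_key", "k1", now=0.0))   # accepted
print(limiter.allow("api_key", "k1", now=1.0))   # accepted
print(limiter.allow("api_key", "k1", now=2.0))   # rejected: window is full
```

Because old entries expire one at a time rather than all at once, capacity frees up gradually instead of resetting at a fixed boundary.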

## 429 response

When a level's limit is crossed:

```http
HTTP/1.1 429 Too Many Requests
Retry-After: 15
X-Scaigrid-Ratelimit-Limit: 60
X-Scaigrid-Ratelimit-Remaining: 0
X-Scaigrid-Ratelimit-Reset: 1713888715

{
  "status": "error",
  "error": {
    "code": "RATE_LIMITED",
    "message": "Rate limit exceeded",
    "retry_after": 15
  }
}
```

Honor `Retry-After`: it gives the number of seconds to wait until enough of the window has passed for another request to succeed.

Headers on successful requests also expose rate-limit state, so clients can slow down before hitting the wall:

```
X-Scaigrid-Ratelimit-Limit: 60
X-Scaigrid-Ratelimit-Remaining: 12
X-Scaigrid-Ratelimit-Reset: 1713888720
```
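One way a client might use these headers is to pause once the remaining quota runs low. The sketch below assumes `X-Scaigrid-Ratelimit-Reset` is a Unix timestamp in seconds (the example values look like one, but that is an assumption) and that `headers` is a plain mapping with these exact key names. The function name and the `min_remaining` threshold are hypothetical.

```python
import time


def pause_before_next_request(headers, now=None, min_remaining=1):
    """Return how many seconds to wait before sending the next request."""
    now = time.time() if now is None else now
    try:
        remaining = int(headers["X-Scaigrid-Ratelimit-Remaining"])
        reset_at = int(headers["X-Scaigrid-Ratelimit-Reset"])
    except (KeyError, ValueError):
        return 0.0  # headers missing or malformed: do not throttle
    if remaining > min_remaining:
        return 0.0  # plenty of quota left
    # Quota nearly exhausted: wait until the advertised reset time.
    return max(0.0, reset_at - now)


headers = {
    "X-Scaigrid-Ratelimit-Limit": "60",
    "X-Scaigrid-Ratelimit-Remaining": "0",
    "X-Scaigrid-Ratelimit-Reset": "1713888720",
}
print(pause_before_next_request(headers, now=1713888700))
```

Most HTTP client libraries expose headers through a case-insensitive mapping, so the exact key casing is usually not a concern in practice.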

## Designing around limits

### Back off on 429

A simple exponential backoff that respects `Retry-After`:

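```python
import random
import time

class RateLimited(Exception):
    """Placeholder for your client's 429 error; carries the Retry-After value."""
    def __init__(self, retry_after=None):
        super().__init__("Rate limit exceeded")
        self.retry_after = retry_after

def request_with_backoff(func, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimited as e:
            if attempt == max_attempts - 1:
                raise
            # Prefer the server's Retry-After; otherwise back off exponentially.
            base = e.retry_after if e.retry_after is not None else 2 ** attempt
            # Jitter applies in both cases so clients don't retry in lockstep.
            time.sleep(base + random.uniform(0, 1))
```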

The random jitter keeps clients from retrying in lockstep (a thundering herd) when many of them back off at the same time.

### Spread load over time

If you have 500 embedding requests to run, don't fire them all at once:

```python
import asyncio

# BAD: no pacing; requests go out as fast as the loop can issue them
for text in texts_500:
    embed(text)  # slams the limiter

# GOOD: cap the number of in-flight requests (assumes `embed` is a coroutine here)
async def throttled_batch(texts, concurrency=5):
    sem = asyncio.Semaphore(concurrency)

    async def one(t):
        async with sem:
            return await embed(t)

    return await asyncio.gather(*(one(t) for t in texts))
```

Or better, use `/v1/inference/embed` with `input` as an array — one request, many embeddings.
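A minimal sketch of that batching approach: split the texts into chunks and build one request body per chunk. Only the `input` array comes from the description above. The helper names and the `batch_size` of 100 are assumptions for illustration, and the real endpoint may require other fields or enforce a different maximum array length.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_embed_payloads(texts, batch_size=100):
    """Build one request body per batch; `batch_size` is a hypothetical value."""
    return [{"input": batch} for batch in chunked(texts, batch_size)]


texts = [f"document {i}" for i in range(500)]
payloads = build_embed_payloads(texts)
print(len(payloads))  # 500 texts become 5 requests instead of 500
```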

### Batch inference for large workloads

For workloads of more than 1000 requests, use [Batch Inference](../04-api-guides/06-batch-inference.md), which has its own, larger rate-limit category and runs asynchronously. Don't hammer the sync API for bulk work.

### Separate keys for separate services

When three services share one key, a spike in any one of them rate-limits all three. Create per-service keys to isolate them:

```
Service A: sgk_key_A  (its own 60 req/min bucket)
Service B: sgk_key_B  (its own bucket)
Service C: sgk_key_C  (its own bucket)
```

All three still count against the user's and tenant's limits, but one service's burst can't starve another.

## Budget vs rate limit

Both return 429 but for different reasons:

- `RATE_LIMITED` — short-term burst protection (per-minute).
- `BUDGET_EXCEEDED` — long-term cost cap (per day/week/month).

Treat them differently. Rate-limited requests can be retried after `Retry-After`; budget-exceeded requests need admin intervention (raise the budget or wait for the period to roll over).
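A client can branch on the error code to decide whether a retry makes sense. The sketch below assumes that `BUDGET_EXCEEDED` responses use the same JSON envelope as the `RATE_LIMITED` example earlier on this page, which the page does not confirm. The function name is hypothetical.

```python
def should_retry(status_code, body):
    """Return (retry, delay_seconds) for a parsed response body."""
    if status_code != 429:
        return (False, None)
    error = (body or {}).get("error", {})
    if error.get("code") == "RATE_LIMITED":
        # Short-term limit: retry after the suggested delay.
        return (True, error.get("retry_after"))
    # BUDGET_EXCEEDED (or an unknown code): retrying will not help,
    # so surface the error to an operator instead.
    return (False, None)


rate_limited = {"status": "error", "error": {"code": "RATE_LIMITED", "retry_after": 15}}
over_budget = {"status": "error", "error": {"code": "BUDGET_EXCEEDED"}}
print(should_retry(429, rate_limited))
print(should_retry(429, over_budget))
```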

## Configuration

Partner admins can tune per-tenant limits:

```bash
curl -X PUT https://scaigrid.scailabs.ai/v1/tenants/{tenant_id} \
  -H "Authorization: Bearer $PARTNER_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "settings": {
      "rate_limit_per_user": 200,
      "rate_limit_per_tenant": 2000
    }
  }'
```

Super admins can set platform-wide defaults via environment config.

## Redis failure mode

ScaiGrid uses Redis for rate-limit counters. What happens if Redis is unreachable?

Two modes, configurable via `RATE_LIMIT_REDIS_FAILURE_MODE`:

- **`reject`** (default, safe) — fail requests with 503. No inference during Redis outages. Conservative, ensures limits are always enforced.
- **`allow`** (permissive) — allow requests through; no limit enforcement until Redis recovers. Prioritizes availability over enforcement.

Production deployments should pick based on cost tolerance — high-cost tenants favor `reject`; high-availability tenants favor `allow` paired with alerting.
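The difference between the two modes can be sketched as a small decision function. This is an illustration only: the function name, the returned action strings, and the use of `ConnectionError` to signal an unreachable Redis are assumptions, not ScaiGrid internals.

```python
def enforce(check_limit, failure_mode="reject"):
    """Decide what to do with a request.

    `check_limit` is a callable that returns True if the request is within
    limits and raises ConnectionError when the counter store is unreachable.
    Returns "proceed", "reject_429", or "reject_503".
    """
    if failure_mode not in ("reject", "allow"):
        raise ValueError(f"unknown failure mode: {failure_mode!r}")
    try:
        within_limits = check_limit()
    except ConnectionError:
        # Counter store is down: fail closed or fail open, per configuration.
        return "reject_503" if failure_mode == "reject" else "proceed"
    return "proceed" if within_limits else "reject_429"


def redis_down():
    raise ConnectionError("redis unreachable")


print(enforce(redis_down, failure_mode="reject"))
print(enforce(redis_down, failure_mode="allow"))
```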

## Monitoring

Tenant admins see live rate-limit metrics in the admin UI: current request rate per level, proximity to limits, 429 count in the last hour.

Platform admins see the full Redis counter state via `/v1/admin/rate-limit-state`.

## Related

- [Accounting and Budgets](../03-core-concepts/04-accounting-and-budgets.md) — cost-based enforcement
- [Errors](../03-core-concepts/07-errors.md) — 429 handling patterns
- [Your First Integration](../02-getting-started/03-your-first-integration.md) — retry code examples