Rate Limiting

ScaiGrid enforces rate limits at four independent levels. This page covers how they work, how to interpret 429 responses, and how to design around them.

The four levels

Level	Default	Purpose
API key	60 req/min	Isolate runaway keys
User	120 req/min	Prevent one user from monopolizing tenant capacity
Tenant	1000 req/min	Fair sharing across tenants
Partner	5000 req/min	Protect partner-level capacity

All four levels are enforced independently. A request succeeds only if it fits under every applicable limit, so the most restrictive limit wins.

Default values are configurable per tenant by a partner admin or super admin.
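
The "every applicable limit" rule can be sketched in a few lines. This is only an illustration using the default values from the table; the function names and the `counts` input are hypothetical, and the real counters live server-side.

```python
# Default per-minute limits from the table above.
DEFAULT_LIMITS = {"api_key": 60, "user": 120, "tenant": 1000, "partner": 5000}

def blocking_levels(counts, limits=DEFAULT_LIMITS):
    """Return the levels whose limit one more request would exceed.

    `counts` maps each level to the requests already seen in its window
    (hypothetical input for illustration).
    """
    return [level for level, limit in limits.items()
            if counts.get(level, 0) + 1 > limit]

def is_allowed(counts, limits=DEFAULT_LIMITS):
    # The request passes only if no level blocks it.
    return not blocking_levels(counts, limits)

# The key is well under its own limit, but the user is already at 120.
usage = {"api_key": 10, "user": 120, "tenant": 400, "partner": 900}
print(blocking_levels(usage))  # ['user']
print(is_allowed(usage))       # False
```

Here the request is rejected even though the API key has used only 10 of its 60 requests, because the user-level limit is the most restrictive one in play.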

Identifier resolution

For each level, the rate limiter resolves the identifier that the request counts against:

API key — exact key
User — user ID (resolved from token)
Tenant — tenant ID
Partner — partner ID

For unauthenticated requests (rare — only public assets like model avatars), IP-based limiting applies.

Window algorithm

Rate limits use sliding-window counters implemented in Redis. Each level has its own counter and a 60-second rolling window (configurable).

At request time:

Compute (level, identifier, current_second).
ZINCRBY the counter.
Prune entries older than the window.
ZCARD the counter — if above the limit, reject.

This gives smooth enforcement without the "burst at window edge" problem of fixed-window limiters.
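
The steps above can be approximated with a small in-memory sketch. It is not ScaiGrid's implementation: a plain dictionary stands in for Redis, the class and method names are hypothetical, and it assumes one plausible reading of the steps, namely per-second buckets whose counts are summed to get the window total.

```python
import time
from collections import defaultdict

class SlidingWindowCounter:
    """In-memory stand-in for the Redis counters (illustration only)."""

    def __init__(self, limit, window=60):
        self.limit = limit
        self.window = window
        # (level, identifier) -> {second: requests seen in that second}
        self.buckets = defaultdict(dict)

    def allow(self, level, identifier, now=None):
        second = int(time.time() if now is None else now)
        counts = self.buckets[(level, identifier)]
        # 1. Increment the bucket for the current second.
        counts[second] = counts.get(second, 0) + 1
        # 2. Prune buckets that have fallen out of the window.
        for old in [s for s in counts if s <= second - self.window]:
            del counts[old]
        # 3. Total the window; reject if it is above the limit.
        return sum(counts.values()) <= self.limit

limiter = SlidingWindowCounter(limit=3, window=60)
results = [limiter.allow("api_key", "sgk_key_A", now=100 + i) for i in range(5)]
print(results)  # [True, True, True, False, False]
print(limiter.allow("api_key", "sgk_key_A", now=200))  # True
```

Because the sketch increments before it checks, as the steps above do, rejected requests still count toward the window. A client that keeps retrying immediately therefore stays blocked until it actually slows down.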

429 response

When a level's limit is crossed:

http
HTTP/1.1 429 Too Many Requests
Retry-After: 15
X-Scaigrid-Ratelimit-Limit: 60
X-Scaigrid-Ratelimit-Remaining: 0
X-Scaigrid-Ratelimit-Reset: 1713888715

{
  "status": "error",
  "error": {
    "code": "RATE_LIMITED",
    "message": "Rate limit exceeded",
    "retry_after": 15
  }
}

Honor Retry-After: it is the number of seconds to wait until enough of the window has passed for another request to succeed.

Headers on successful requests also expose rate-limit state, so clients can slow down before hitting the wall:

text
X-Scaigrid-Ratelimit-Limit: 60
X-Scaigrid-Ratelimit-Remaining: 12
X-Scaigrid-Ratelimit-Reset: 1713888720
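
A client can use these headers to pace itself. The sketch below is one possible approach, not an official client: `pause_before_next` is a hypothetical helper, and it assumes that X-Scaigrid-Ratelimit-Reset is a Unix timestamp in seconds, as the example values suggest.

```python
import time

def pause_before_next(headers, now=None, floor=0.0):
    """Seconds to wait before the next request, from rate-limit headers.

    Assumes X-Scaigrid-Ratelimit-Reset is a Unix timestamp in seconds.
    `headers` is a plain dict here; real HTTP clients usually expose a
    case-insensitive mapping.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-Scaigrid-Ratelimit-Remaining", 1))
    reset_at = float(headers.get("X-Scaigrid-Ratelimit-Reset", now))
    until_reset = max(reset_at - now, 0.0)
    if remaining <= 0:
        return max(until_reset, floor)  # nothing left: wait for the reset
    # Spread the remaining budget evenly over the time left.
    return max(until_reset / remaining, floor)

headers = {
    "X-Scaigrid-Ratelimit-Limit": "60",
    "X-Scaigrid-Ratelimit-Remaining": "12",
    "X-Scaigrid-Ratelimit-Reset": "1713888720",
}
print(pause_before_next(headers, now=1713888696))  # 2.0 (24 s left, 12 requests)
```

Sleeping for the returned value between requests spreads the remaining 12 requests over the 24 seconds left, instead of spending them at once and then waiting on a 429.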

Designing around limits

Back off on 429

A simple exponential backoff with respect for Retry-After:

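python
import random
import time

class RateLimited(Exception):
    """Placeholder for whatever your client raises on a RATE_LIMITED 429."""
    def __init__(self, retry_after=None):
        super().__init__("Rate limit exceeded")
        self.retry_after = retry_after  # seconds from Retry-After, if any

def request_with_backoff(func, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimited as e:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # Prefer the server's Retry-After; otherwise back off exponentially.
            base = e.retry_after or 2 ** attempt
            # Always add jitter, even on top of Retry-After.
            time.sleep(base + random.uniform(0, 1))

The random jitter is added to every delay, including server-provided Retry-After values, so that clients backing off at the same moment do not all retry in lockstep (the thundering-herd problem).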

Spread load over time

If you have 500 embedding requests to run, don't fire them all at once:

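python
import asyncio

# BAD: one request per text, issued as fast as the loop can run
for text in texts_500:
    embed(text)  # slams the limiter

# GOOD: cap the number of in-flight requests (embed is a coroutine here)
async def throttled_batch(texts, concurrency=5):
    sem = asyncio.Semaphore(concurrency)

    async def one(t):
        async with sem:  # at most `concurrency` requests at a time
            return await embed(t)

    return await asyncio.gather(*(one(t) for t in texts))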

Or better, use /v1/inference/embed with input as an array — one request, many embeddings.
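
A minimal sketch of the array approach: split the texts into chunks and build one request body per chunk. Only the `input` array comes from this page. The batch size of 100 and the helper names are assumptions, since no maximum array length is stated here.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def build_embed_payloads(texts, batch_size=100):
    # One request body per chunk. batch_size=100 is an assumed value;
    # check the endpoint's documented maximum before relying on it.
    return [{"input": chunk} for chunk in chunked(texts, batch_size)]

texts = [f"document {i}" for i in range(500)]
payloads = build_embed_payloads(texts)
print(len(payloads))              # 5
print(len(payloads[0]["input"]))  # 100
```

With these numbers, 500 embeddings cost 5 requests against the limiter instead of 500.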

Batch inference for large workloads

For workloads of more than 1000 requests, use Batch Inference, which has its own rate-limit category (larger limits, asynchronous processing). Don't hammer the synchronous API for bulk work.

Separate keys for separate services

A key shared between three services is rate-limited for all of them as soon as one service spikes. Create per-service keys to isolate them:

text
Service A: sgk_key_A  (its own 60 req/min bucket)
Service B: sgk_key_B  (its own bucket)
Service C: sgk_key_C  (its own bucket)

All three still count against the user's and tenant's limits, but one service's burst can't starve another.

Budget vs rate limit

Budget enforcement and rate limiting both return 429, but for different reasons:

RATE_LIMITED — short-term burst protection (per-minute).
BUDGET_EXCEEDED — long-term cost cap (per day/week/month).

Treat them differently. Rate-limited requests can be retried after Retry-After. Budget-exceeded requests keep failing until an admin raises the budget or the budget period rolls over, so retrying after a few seconds does not help.
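
One way to branch on the two codes in client code is sketched below. It assumes that a BUDGET_EXCEEDED response uses the same JSON envelope as the RATE_LIMITED example earlier on this page. `classify_429` and the action labels are hypothetical names.

```python
import json

def classify_429(body_text):
    """Map a 429 response body to an action for the caller."""
    error = json.loads(body_text).get("error", {})
    code = error.get("code")
    if code == "RATE_LIMITED":
        # Transient: wait for retry_after seconds, then retry.
        return ("retry", error.get("retry_after"))
    if code == "BUDGET_EXCEEDED":
        # Waiting a few seconds will not help: alert an admin instead.
        return ("escalate", None)
    return ("fail", None)

body = json.dumps({
    "status": "error",
    "error": {"code": "RATE_LIMITED", "message": "Rate limit exceeded", "retry_after": 15},
})
print(classify_429(body))  # ('retry', 15)
```

A retry wrapper can then loop only on the "retry" action and surface "escalate" immediately, rather than burning attempts on a request that cannot succeed until the budget changes.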

Configuration

Partner admins can tune per-tenant limits:

bash
curl -X PUT https://scaigrid.scailabs.ai/v1/tenants/{tenant_id} \
  -H "Authorization: Bearer $PARTNER_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "settings": {
      "rate_limit_per_user": 200,
      "rate_limit_per_tenant": 2000
    }
  }'

Super admins can set platform-wide defaults via environment config.

Redis failure mode

ScaiGrid uses Redis for rate-limit counters. What happens if Redis is unreachable?

Two modes, configurable via RATE_LIMIT_REDIS_FAILURE_MODE:

reject (default, safe) — fail requests with 503. No inference during Redis outages. Conservative, ensures limits are always enforced.
allow (permissive) — allow requests through; no limit enforcement until Redis recovers. Prioritizes availability over enforcement.

Production deployments should pick based on cost tolerance — high-cost tenants favor reject; high-availability tenants favor allow paired with alerting.
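
The difference between the two modes can be shown with a small sketch. It is illustrative only: `check_rate_limit` and the `check` callable are hypothetical, and a ConnectionError stands in for whatever error the Redis client raises when the server is unreachable.

```python
def check_rate_limit(check, failure_mode="reject"):
    """Return the HTTP status a request would get.

    `check` is a hypothetical callable that returns True when the request
    is under its limits and raises ConnectionError if Redis is unreachable.
    """
    try:
        return 200 if check() else 429
    except ConnectionError:
        if failure_mode == "allow":
            return 200  # fail open: availability over enforcement
        return 503      # fail closed: enforcement over availability

def redis_down():
    raise ConnectionError("redis unreachable")

print(check_rate_limit(redis_down, "reject"))  # 503
print(check_rate_limit(redis_down, "allow"))   # 200
print(check_rate_limit(lambda: False))         # 429
```

In allow mode, nothing bounds traffic during the outage, which is why that mode is best paired with alerting.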

Monitoring

Tenant admins see live rate-limit metrics in the admin UI: current request rate per level, proximity to limits, 429 count in the last hour.

Platform admins see the full Redis counter state via /v1/admin/rate-limit-state.

Accounting and Budgets — cost-based enforcement
Errors — 429 handling patterns
Your First Integration — retry code examples