Rate Limiting
ScaiGrid enforces rate limits at four independent levels. This page covers how they work, how to interpret 429 responses, and how to design around them.
The four levels
| Level | Default | Purpose |
|---|---|---|
| API key | 60 req/min | Isolate runaway keys |
| User | 120 req/min | Prevent one user from monopolizing tenant capacity |
| Tenant | 1000 req/min | Fair sharing across tenants |
| Partner | 5000 req/min | Protect partner-level capacity |
All four levels are enforced independently. A request succeeds only if it fits under every applicable limit, so the most restrictive limit wins.
Default values are configurable per tenant by a partner admin or super admin.
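To illustrate "most restrictive wins", here is a minimal sketch that checks a request against all four levels using the defaults from the table. The function name, the level keys, and the shape of `counts` are assumptions made for illustration; they are not part of the ScaiGrid API.

```python
from typing import Mapping, Optional

# Default per-minute limits, taken from the table above.
DEFAULT_LIMITS = {"api_key": 60, "user": 120, "tenant": 1000, "partner": 5000}
LEVELS = ("api_key", "user", "tenant", "partner")


def first_exceeded_level(
    counts: Mapping[str, int],
    limits: Mapping[str, int] = DEFAULT_LIMITS,
) -> Optional[str]:
    """Return the first level that has no room for another request, else None.

    `counts` holds the number of requests already seen in the current
    window for each level (hypothetical input shape).
    """
    for level in LEVELS:
        if counts.get(level, 0) >= limits[level]:
            return level
    return None
```

A request is admitted only when this returns `None`. A key that has used its 60 requests is blocked even if the user, tenant, and partner levels still have headroom.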
Identifier resolution
The rate limiter resolves one identifier per level for each request:
- API key — exact key
- User — user ID (resolved from token)
- Tenant — tenant ID
- Partner — partner ID
For unauthenticated requests (rare — only public assets like model avatars), IP-based limiting applies.
Window algorithm
Rate limits use sliding-window counters implemented in Redis. Each level has a 60-second rolling window (configurable) and its own counter.
At request time:
- Compute (level, identifier, current_second).
- ZINCRBY the counter.
- Prune entries older than the window.
- ZCARD the counter; if above the limit, reject.
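The steps above can be sketched without Redis as an in-memory stand-in. This is an illustration of a sliding-window counter with per-second buckets, not ScaiGrid's implementation: the class name and method signature are hypothetical, and the check sums the per-bucket counts over the window. As in the steps above, the request is recorded before the check, so rejected requests still count toward the window.

```python
import time
from collections import defaultdict
from typing import Dict, Optional, Tuple


class SlidingWindowLimiter:
    """In-memory sketch: per-second buckets inside a rolling window."""

    def __init__(self, limit: int, window_seconds: int = 60) -> None:
        self.limit = limit
        self.window = window_seconds
        # (level, identifier) -> {second: requests seen in that second}
        self.buckets: Dict[Tuple[str, str], Dict[int, int]] = defaultdict(dict)

    def allow(self, level: str, identifier: str, now: Optional[float] = None) -> bool:
        current_second = int(time.time() if now is None else now)
        counter = self.buckets[(level, identifier)]
        # Record the request in the bucket for the current second.
        counter[current_second] = counter.get(current_second, 0) + 1
        # Prune buckets that have fallen out of the rolling window.
        cutoff = current_second - self.window
        for second in [s for s in counter if s <= cutoff]:
            del counter[second]
        # Reject when the total across the window is above the limit.
        return sum(counter.values()) <= self.limit
```

Passing `now` explicitly makes the behavior easy to test; in normal use it falls back to the wall clock.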
This gives smooth enforcement without the "burst at window edge" problem of fixed-window limiters.
429 response
When a level's limit is crossed, the request is rejected with HTTP 429.
Honor Retry-After: it gives the number of seconds until enough of the window has passed for another request to succeed.
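A minimal sketch of reading that value on the client side. The helper name is hypothetical, and it assumes the header carries a plain number of seconds, as described above; anything else is treated as "no hint".

```python
from typing import Mapping, Optional


def retry_after_seconds(status: int, headers: Mapping[str, str]) -> Optional[float]:
    """Return how long to wait after a 429, or None if no usable hint exists."""
    if status != 429:
        return None
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("retry-after")
    if raw is None:
        return None
    try:
        return max(0.0, float(raw))
    except ValueError:
        # Not a number of seconds; let the caller fall back to its own delay.
        return None
```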
Headers on successful requests also expose rate-limit state, so clients can slow down before hitting the wall.
Designing around limits
Back off on 429
Use a simple exponential backoff that respects Retry-After.
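One way such a loop might look, as a sketch rather than an official client. `send` is a hypothetical callable that performs one request and returns `(status, headers, body)`; the header lookup assumes the mapping uses the exact key `Retry-After` (some HTTP libraries provide case-insensitive header mappings).

```python
import random
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], object]  # (status, headers, body)


def call_with_backoff(
    send: Callable[[], Response],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Retry on 429, preferring Retry-After over the exponential delay."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, headers, body
        if attempt == max_attempts - 1:
            break  # out of attempts; return the last 429 to the caller
        retry_after = headers.get("Retry-After")
        if retry_after is not None:
            delay = float(retry_after)
        else:
            delay = base_delay * (2 ** attempt)
        # Jitter: add up to one second so clients do not retry in lockstep.
        sleep(delay + random.random())
    return status, headers, body
```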
The + random term (jitter) prevents a thundering herd when multiple clients back off at the same time.
Spread load over time
If you have 500 embedding requests to run, don't fire them all at once. Pace them so the request rate stays under your limits.
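A minimal pacing sketch, assuming a hypothetical `handler` that sends one request. The default of 50 requests per minute is an illustrative choice that stays under the 60 req/min API key default; adjust it to your configured limits.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_paced(
    items: Iterable[T],
    handler: Callable[[T], R],
    per_minute: int = 50,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[R]:
    """Call handler on each item, starting at most per_minute calls per minute."""
    interval = 60.0 / per_minute
    results: List[R] = []
    next_start = clock()
    for item in items:
        # Wait until this item's scheduled start time, if we are ahead of it.
        wait = next_start - clock()
        if wait > 0:
            sleep(wait)
        results.append(handler(item))
        next_start += interval
    return results
```

Scheduling against fixed start times, rather than sleeping a fixed amount after each call, keeps the average rate steady even when individual requests are slow.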
Or better, use /v1/inference/embed with input as an array — one request, many embeddings.
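A sketch of grouping texts into array inputs before sending them. Only the `input` field comes from the text above; the batch size of 100 is an assumption, as is the omission of any other fields (such as a model name) that a real request may need.

```python
from typing import Dict, Iterator, List, Sequence


def chunked(texts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


def build_embed_payloads(texts: Sequence[str], batch_size: int = 100) -> List[Dict[str, List[str]]]:
    """Build one request body per batch, with `input` as an array."""
    return [{"input": batch} for batch in chunked(texts, batch_size)]
```

With these assumptions, 500 texts become 5 requests instead of 500.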
Batch inference for large workloads
For more than 1000 requests, use Batch Inference, which has its own rate-limit category (larger, async). Don't hammer the sync API for bulk work.
Separate keys for separate services
A key shared between three services gets rate-limited as soon as any one of them spikes. Create a separate key per service to isolate them.
All three still count against the user's and tenant's limits, but one service's burst can't starve another.
Budget vs rate limit
Both return 429, but for different reasons:
- RATE_LIMITED: short-term burst protection (per minute).
- BUDGET_EXCEEDED: long-term cost cap (per day/week/month).
Treat them differently. Rate-limited requests can retry after Retry-After; budget-exceeded requests need admin intervention (raise the budget or wait for the period to roll over).
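A minimal sketch of that branching on the client. The two error codes come from the list above; where the code appears in the response is not specified here, so the caller is assumed to have extracted it already. The `Action` names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"        # wait for Retry-After, then try again
    ESCALATE = "escalate"  # needs an admin or the next budget period
    FAIL = "fail"          # unknown 429 reason; do not retry blindly


def action_for_429(error_code: str) -> Action:
    """Map the error code of a 429 response to a client-side action."""
    if error_code == "RATE_LIMITED":
        return Action.RETRY
    if error_code == "BUDGET_EXCEEDED":
        return Action.ESCALATE
    return Action.FAIL
```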
Configuration
Partner admins can tune per-tenant limits.
Super admins can set platform-wide defaults via environment config.
Redis failure mode
ScaiGrid uses Redis for rate-limit counters. What happens if Redis is unreachable?
Two modes, configurable via RATE_LIMIT_REDIS_FAILURE_MODE:
- reject (default, safe): fail requests with 503. No inference during Redis outages. Conservative; ensures limits are always enforced.
- allow (permissive): let requests through with no limit enforcement until Redis recovers. Prioritizes availability over enforcement.
Production deployments should pick based on cost tolerance: tenants with high cost exposure favor reject; tenants that prioritize availability favor allow, paired with alerting.
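The two modes amount to a fail-closed versus fail-open decision. The sketch below illustrates that logic only: `check_limit` is a hypothetical callable, the built-in `ConnectionError` stands in for whatever error the Redis client raises, and the returned strings are illustrative labels rather than ScaiGrid values.

```python
import os
from typing import Callable, Optional


def decide(check_limit: Callable[[], bool], failure_mode: Optional[str] = None) -> str:
    """Return 'allow', 'reject_429', or 'reject_503' for one request.

    check_limit returns True when the request fits under the limit and
    raises ConnectionError when the counter store is unreachable.
    """
    mode = failure_mode or os.environ.get("RATE_LIMIT_REDIS_FAILURE_MODE", "reject")
    try:
        within_limit = check_limit()
    except ConnectionError:
        # Fail open only when explicitly configured; otherwise fail closed.
        return "allow" if mode == "allow" else "reject_503"
    return "allow" if within_limit else "reject_429"
```

Treating any value other than `allow` as `reject` keeps a typo in the setting from silently disabling enforcement.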
Monitoring
Tenant admins see live rate-limit metrics in the admin UI: current request rate per level, proximity to limits, 429 count in the last hour.
Platform admins see the full Redis counter state via /v1/admin/rate-limit-state.
Related
- Accounting and Budgets — cost-based enforcement
- Errors — 429 handling patterns
- Your First Integration — retry code examples