Rate Limiting

ScaiGrid enforces rate limits at four independent levels. This page covers how they work, how to interpret 429 responses, and how to design around them.

The four levels

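| Level   | Default      | Purpose                                           |
|---------|--------------|---------------------------------------------------|
| API key | 60 req/min   | Isolate runaway keys                              |
| User    | 120 req/min  | Prevent one user from monopolizing tenant capacity |
| Tenant  | 1000 req/min | Fair sharing across tenants                       |
| Partner | 5000 req/min | Protect partner-level capacity                    |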

All four levels are enforced independently. A request succeeds only if it fits under every applicable limit, so the most restrictive limit wins.
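The "most restrictive wins" rule can be sketched as a check across all four levels. This is a minimal illustration, not ScaiGrid's implementation: `DEFAULT_LIMITS` mirrors the defaults listed above, while `first_blocking_level` and the shape of the `usage` mapping are hypothetical names chosen for the example.

```python
# Default per-minute limits listed above; real deployments may override them.
DEFAULT_LIMITS = {"api_key": 60, "user": 120, "tenant": 1000, "partner": 5000}


def first_blocking_level(usage, limits=DEFAULT_LIMITS):
    """Return the first level whose window is already full, or None if the request fits.

    `usage` maps each level to the requests already counted in its current
    window (a hypothetical structure used only for this illustration).
    """
    for level, limit in limits.items():
        if usage.get(level, 0) >= limit:
            return level
    return None


# The API key is at its limit even though every other level has headroom.
print(first_blocking_level({"api_key": 60, "user": 75, "tenant": 400, "partner": 900}))    # api_key
# The key has room, but the tenant as a whole is saturated.
print(first_blocking_level({"api_key": 10, "user": 30, "tenant": 1000, "partner": 2500}))  # tenant
```

A request passes only when the function returns `None`, which is why a quiet key can still see 429s when its tenant is busy.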

Default values are configurable per tenant by a partner admin or super admin.

Identifier resolution

For each request, the rate limiter resolves one identifier per level:

  • API key — exact key
  • User — user ID (resolved from token)
  • Tenant — tenant ID
  • Partner — partner ID

For unauthenticated requests (rare — only public assets like model avatars), IP-based limiting applies.

Window algorithm

Rate limits use sliding-window counters implemented in Redis. Each level has its own counter and a 60-second rolling window (configurable).

At request time:

  1. Compute (level, identifier, current_second).
  2. ZINCRBY the counter.
  3. Prune entries older than the window.
  4. ZCARD the counter — if above the limit, reject.

This gives smooth enforcement without the "burst at window edge" problem of fixed-window limiters.
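The steps above can be illustrated with an in-memory Python stand-in. It does not use Redis and is not ScaiGrid's code: the class name `SlidingWindowLimiter` and its structure are assumptions, and the sketch keeps one count per second and totals those counts to decide. Because the increment happens before the check, as in the steps above, rejected requests also count toward the window.

```python
from collections import defaultdict


class SlidingWindowLimiter:
    """In-memory stand-in for the Redis counters: one count per (key, second)."""

    def __init__(self, limit, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.buckets = defaultdict(dict)  # (level, identifier) -> {second: count}

    def allow(self, level, identifier, now):
        second = int(now)
        counts = self.buckets[(level, identifier)]
        # Steps 1-2: record this request in the bucket for the current second.
        counts[second] = counts.get(second, 0) + 1
        # Step 3: prune buckets that have slid out of the window.
        for old in [s for s in counts if s <= second - self.window]:
            del counts[old]
        # Step 4: reject if the total across the window is above the limit.
        return sum(counts.values()) <= self.limit


limiter = SlidingWindowLimiter(limit=3, window_seconds=60)
print([limiter.allow("api_key", "sgk_key_A", now=t) for t in (0, 1, 2, 3)])
# [True, True, True, False]
print(limiter.allow("api_key", "sgk_key_A", now=62))
# True: the requests from seconds 0-2 have left the window
```

Capacity returns gradually as old seconds drop out of the window, rather than all at once at a fixed boundary.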

429 response

When a level's limit is crossed:

```http
HTTP/1.1 429 Too Many Requests
Retry-After: 15
X-Scaigrid-Ratelimit-Limit: 60
X-Scaigrid-Ratelimit-Remaining: 0
X-Scaigrid-Ratelimit-Reset: 1713888715

{
  "status": "error",
  "error": {
    "code": "RATE_LIMITED",
    "message": "Rate limit exceeded",
    "retry_after": 15
  }
}
```

Honor Retry-After: it gives the number of seconds until enough of the window has passed for another request to succeed.
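A client might extract the wait time from a 429 response along these lines. This is a sketch under assumptions: `retry_delay_seconds` is a hypothetical helper, `headers` is taken to be a plain dict with the header casing shown above, and only the integer-seconds form of Retry-After is handled.

```python
import json


def retry_delay_seconds(headers, body, default=1.0):
    """Pick a wait time from a 429 response: header first, then JSON body, then default."""
    value = headers.get("Retry-After")
    if value is None:
        try:
            # Fall back to error.retry_after in the JSON body shown above.
            value = json.loads(body).get("error", {}).get("retry_after")
        except (TypeError, ValueError, AttributeError):
            value = None  # body missing, not JSON, or not the expected shape
    try:
        return max(0.0, float(value))
    except (TypeError, ValueError):
        return default


body = '{"status": "error", "error": {"code": "RATE_LIMITED", "retry_after": 15}}'
print(retry_delay_seconds({"Retry-After": "15"}, body))  # 15.0
print(retry_delay_seconds({}, body))                     # 15.0
print(retry_delay_seconds({}, "not json"))               # 1.0
```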

Headers on successful requests also expose rate-limit state, so clients can slow down before hitting the wall:

```text
X-Scaigrid-Ratelimit-Limit: 60
X-Scaigrid-Ratelimit-Remaining: 12
X-Scaigrid-Ratelimit-Reset: 1713888720
```
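One way a client could use these headers to slow down early is sketched below. The helper name `pause_before_next_request` and the `min_remaining` threshold are hypothetical, and the sketch assumes X-Scaigrid-Ratelimit-Reset is a Unix timestamp in seconds, which the example values suggest but this page does not state.

```python
import time


def pause_before_next_request(headers, now=None, min_remaining=5):
    """Return seconds to sleep so the client slows down before reaching the limit.

    Assumes X-Scaigrid-Ratelimit-Reset is a Unix timestamp in seconds.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-Scaigrid-Ratelimit-Remaining", min_remaining + 1))
    reset_at = float(headers.get("X-Scaigrid-Ratelimit-Reset", now))
    if remaining > min_remaining:
        return 0.0  # plenty of headroom, no need to wait
    # Spread the remaining budget evenly over the time left until the reset.
    seconds_left = max(0.0, reset_at - now)
    return seconds_left / max(remaining, 1)


headers = {
    "X-Scaigrid-Ratelimit-Limit": "60",
    "X-Scaigrid-Ratelimit-Remaining": "2",
    "X-Scaigrid-Ratelimit-Reset": "1713888720",
}
print(pause_before_next_request(headers, now=1713888700))  # 10.0
```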

Designing around limits

Back off on 429

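A simple exponential backoff that honors Retry-After:

```python
import random
import time

class RateLimited(Exception):
    """Stand-in for your client's 429 exception; retry_after is seconds or None."""
    def __init__(self, retry_after=None):
        super().__init__("Rate limit exceeded")
        self.retry_after = retry_after

def request_with_backoff(func, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimited as e:
            if attempt == max_attempts - 1:
                raise
            # Prefer the server's Retry-After; otherwise back off exponentially.
            base = e.retry_after if e.retry_after is not None else 2 ** attempt
            # Add jitter in both cases so clients don't retry in lockstep.
            time.sleep(base + random.uniform(0, 1))
```

The random jitter prevents a thundering herd when many clients back off at the same time.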

Spread load over time

If you have 500 embedding requests to run, don't fire them all at once:

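```python
import asyncio

# embed() stands for your client's call that embeds one text.

# BAD: a tight loop sends requests as fast as the API answers
for text in texts_500:
    embed(text)  # slams the limiter

# GOOD: cap the number of requests in flight (here embed() must be async)
async def throttled_batch(texts, concurrency=5):
    sem = asyncio.Semaphore(concurrency)

    async def one(t):
        async with sem:
            return await embed(t)

    return await asyncio.gather(*[one(t) for t in texts])
```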

Or better, use /v1/inference/embed with input as an array — one request, many embeddings.
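Batching the texts before sending might look like the sketch below. It only builds request bodies: `embedding_payloads` is a hypothetical helper, the `batch_size` of 100 is an illustrative guess rather than a documented ScaiGrid limit, and any other fields the endpoint requires are left out.

```python
def embedding_payloads(texts, batch_size=100):
    """Yield one request body per batch; `input` carries the array of texts.

    batch_size=100 is an illustrative guess, not a documented ScaiGrid limit.
    """
    for start in range(0, len(texts), batch_size):
        yield {"input": texts[start:start + batch_size]}


texts = [f"document {i}" for i in range(500)]
payloads = list(embedding_payloads(texts))
print(len(payloads))              # 5 requests instead of 500
print(len(payloads[0]["input"]))  # 100 texts in the first request
```

Each payload would be sent as the body of one POST to /v1/inference/embed, so the 500 texts cost five requests against the limiter.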

Batch inference for large workloads

For workloads of more than 1000 requests, use Batch Inference, which has its own rate-limit category (larger, asynchronous). Don't hammer the sync API for bulk work.

Separate keys for separate services

A key shared between three services is rate-limited as soon as any one of them spikes. Create per-service keys to isolate them:

```text
Service A: sgk_key_A  (its own 60 req/min bucket)
Service B: sgk_key_B  (its own bucket)
Service C: sgk_key_C  (its own bucket)
```

All three still count against the user's and tenant's limits, but one service's burst can't starve another.

Budget vs rate limit

Both return 429 but for different reasons:

  • RATE_LIMITED — short-term burst protection (per-minute).
  • BUDGET_EXCEEDED — long-term cost cap (per day/week/month).

Treat them differently. Rate-limited requests can be retried after Retry-After; budget-exceeded requests keep failing until an admin raises the budget or the budget period rolls over.
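In client code the distinction could be expressed as a small decision function. This is a sketch: `next_action` and its return values are hypothetical, and only the two error codes named above are taken from this page.

```python
def next_action(status_code, error_code, retry_after=None):
    """Decide how a client might react to a response (hypothetical helper)."""
    if status_code != 429:
        return ("not_throttled", None)
    if error_code == "RATE_LIMITED":
        # Short-term: wait for the window to slide, then retry.
        return ("retry", retry_after if retry_after is not None else 1)
    # BUDGET_EXCEEDED, or an unknown 429 cause: retrying will not help,
    # so stop and surface the problem to an operator.
    return ("stop_and_alert", None)


print(next_action(429, "RATE_LIMITED", retry_after=15))  # ('retry', 15)
print(next_action(429, "BUDGET_EXCEEDED"))               # ('stop_and_alert', None)
```

Treating unknown 429 causes like a budget stop is a deliberate fail-safe choice here; a retry loop on a budget error would only burn attempts.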

Configuration

Partner admins can tune per-tenant limits:

```bash
curl -X PUT https://scaigrid.scailabs.ai/v1/tenants/{tenant_id} \
  -H "Authorization: Bearer $PARTNER_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "settings": {
      "rate_limit_per_user": 200,
      "rate_limit_per_tenant": 2000
    }
  }'
```
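For scripted changes, the same call could be built from Python with the standard library. The sketch below mirrors the curl request above and only constructs it, without sending; `build_limit_update` is a hypothetical helper, and `tenant-123` and `example-token` are placeholders.

```python
import json
import urllib.request


def build_limit_update(tenant_id, token, per_user, per_tenant,
                       base_url="https://scaigrid.scailabs.ai"):
    """Build (but do not send) the PUT request shown in the curl example."""
    body = {"settings": {"rate_limit_per_user": per_user,
                         "rate_limit_per_tenant": per_tenant}}
    return urllib.request.Request(
        url=f"{base_url}/v1/tenants/{tenant_id}",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )


# Placeholder tenant ID and token, for illustration only.
req = build_limit_update("tenant-123", "example-token", per_user=200, per_tenant=2000)
print(req.get_method(), req.full_url)
# PUT https://scaigrid.scailabs.ai/v1/tenants/tenant-123
# To send it: urllib.request.urlopen(req)
```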

Super admins can set platform-wide defaults via environment config.

Redis failure mode

ScaiGrid uses Redis for rate-limit counters. What happens if Redis is unreachable?

Two modes, configurable via RATE_LIMIT_REDIS_FAILURE_MODE:

  • reject (default, safe) — fail requests with 503. No inference during Redis outages. Conservative, ensures limits are always enforced.
  • allow (permissive) — allow requests through; no limit enforcement until Redis recovers. Prioritizes availability over enforcement.

Production deployments should pick based on cost tolerance — high-cost tenants favor reject; high-availability tenants favor allow paired with alerting.
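The difference between the two modes comes down to what happens when the counter store raises an error. The following sketch shows that decision in isolation; `check_with_failure_mode` and the `check` callable are hypothetical, and a built-in `ConnectionError` stands in for whatever error a real Redis client would raise.

```python
def check_with_failure_mode(check, failure_mode="reject"):
    """Run a limiter check and apply the configured policy if the store is down.

    `check` is a hypothetical callable that returns True/False and raises
    ConnectionError when Redis cannot be reached.
    Returns an HTTP status code: 200 (allowed), 429 (limited) or 503 (store down).
    """
    try:
        return 200 if check() else 429
    except ConnectionError:
        # reject: fail closed with 503; allow: fail open and skip enforcement.
        return 503 if failure_mode == "reject" else 200


def redis_down():
    raise ConnectionError("redis unreachable")


print(check_with_failure_mode(redis_down, "reject"))  # 503
print(check_with_failure_mode(redis_down, "allow"))   # 200
```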

Monitoring

Tenant admins see live rate-limit metrics in the admin UI: current request rate per level, proximity to limits, 429 count in the last hour.

Platform admins see the full Redis counter state via /v1/admin/rate-limit-state.

Updated 2026-05-18 15:01:28 (rev 17)