Accounting and Budgets
ScaiGrid counts every completion. Usage rolls up the tenancy hierarchy. Budgets enforce at the gateway, before upstream spend.
What gets tracked
For every inference call (streaming or not), ScaiGrid records:
- Frontend model — what the caller asked for
- Backend model — where the request actually went
- Prompt tokens — input count
- Completion tokens — output count
- Latency — end-to-end milliseconds
- Cost — computed from per-model pricing (input + output rates per million tokens)
- User, tenant, partner — the full scope chain
- Request ID — for trace-back
These flow through a two-stage pipeline: Redis counters are incremented immediately (fast, low-latency), and a background worker flushes to MariaDB every 30 seconds (durable, queryable). You get near-real-time usage visibility with a small commit delay.
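The two-stage flow can be sketched in a few lines. This is a minimal illustration, not ScaiGrid's implementation: an in-memory `Counter` stands in for Redis, a second `Counter` stands in for the MariaDB table, and the `UsagePipeline` class and its method names are hypothetical.

```python
from collections import Counter


class UsagePipeline:
    """Stage 1 keeps hot counters (stand-in for Redis);
    stage 2 flushes them to a durable store (stand-in for MariaDB)."""

    def __init__(self):
        self.hot = Counter()      # updated on every request
        self.durable = Counter()  # updated only on flush

    def record(self, tenant, prompt_tokens, completion_tokens):
        # Stage 1: increment immediately so usage is visible right away.
        self.hot[(tenant, "prompt_tokens")] += prompt_tokens
        self.hot[(tenant, "completion_tokens")] += completion_tokens

    def flush(self):
        # Stage 2: a background worker would call this on a timer
        # (every 30 seconds, per the description above).
        pending, self.hot = self.hot, Counter()
        self.durable.update(pending)  # Counter.update adds the counts
        return len(pending)


pipeline = UsagePipeline()
pipeline.record("acme", 1200, 400)
pipeline.record("acme", 300, 100)
flushed = pipeline.flush()
```

Between flushes, the hot counters hold the freshest numbers and the durable store lags by at most one flush interval, which is the "small commit delay" mentioned above.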
Usage queries
The /v1/accounting/usage endpoint supports slicing by any combination of scope, model, user, and time window.
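As a rough sketch of combining filters, the helper below builds a query URL for the endpoint. Only the /v1/accounting/usage path comes from this page; the host name and the query parameter names (`model`, `tenant`, `user`) are assumptions for illustration, so check the Accounting Reference for the real ones.

```python
from urllib.parse import urlencode

BASE_URL = "https://scaigrid.example.com"  # hypothetical host


def usage_url(**filters):
    # Drop unset filters so only the requested slices appear in the query.
    params = {k: v for k, v in filters.items() if v is not None}
    query = urlencode(sorted(params.items()))
    return f"{BASE_URL}/v1/accounting/usage" + (f"?{query}" if query else "")


# Parameter names here are illustrative assumptions, not the documented API.
url = usage_url(model="example-model", tenant="acme", user=None)
```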
The summary form returns aggregated totals. In the response, total_cost is what the caller paid you (frontend pricing) and backend_cost is what you paid the upstream provider. The difference is your margin.
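The margin calculation is a single subtraction. The sketch below assumes the summary arrives as a dict with total_cost and backend_cost encoded as decimal strings; that encoding, the `margin` helper, and the sample figures are all made up for illustration.

```python
from decimal import Decimal


def margin(summary):
    # total_cost: what the caller paid you; backend_cost: what you paid upstream.
    total = Decimal(summary["total_cost"])
    backend = Decimal(summary["backend_cost"])
    return total - backend


example = {"total_cost": "12.50", "backend_cost": "9.10"}  # made-up figures
m = margin(example)
```

Using `Decimal` rather than `float` keeps the subtraction exact, which matters once these figures feed an invoice.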
Permission requirements:
- accounting:view_own — your own usage only
- accounting:view_tenant — full tenant usage
- accounting:view_partner — full partner usage across tenants
Pricing models
Pricing lives on frontend models as input_price_per_mtok and output_price_per_mtok (decimals, currency-agnostic; the currency is set via the currency_code setting). Backends have their own cost_input_per_mtok and cost_output_per_mtok for internal cost tracking.
Each rate applies per million tokens, so a prompt of 1,200 tokens and a completion of 400 tokens costs (1,200 / 1,000,000) × input_price_per_mtok + (400 / 1,000,000) × output_price_per_mtok.
Cost is recorded per request in decimal form, with no rounding.
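The per-million-token arithmetic can be worked through as follows. The prices used here (3.00 input, 15.00 output) are hypothetical placeholders, not ScaiGrid defaults, and `request_cost` is an illustrative helper rather than part of any SDK.

```python
from decimal import Decimal

MTOK = Decimal(1_000_000)


def request_cost(prompt_tokens, completion_tokens,
                 input_price_per_mtok, output_price_per_mtok):
    # Each side is tokens / 1,000,000 multiplied by its per-million rate.
    input_cost = Decimal(prompt_tokens) / MTOK * input_price_per_mtok
    output_cost = Decimal(completion_tokens) / MTOK * output_price_per_mtok
    return input_cost + output_cost  # left unrounded


# Hypothetical prices, chosen only to make the arithmetic visible.
cost = request_cost(1200, 400, Decimal("3.00"), Decimal("15.00"))
```

With these placeholder rates, the input side contributes 0.0036 and the output side 0.0060, for a total of 0.0096 in whatever currency currency_code names.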
Budgets
Budgets cap spend or token count across a scope. Hitting a budget blocks new requests with BUDGET_EXCEEDED (HTTP 429) — existing in-flight requests complete.
- scope — partner, tenant, user, or group.
- period — daily, weekly, monthly, total (lifetime).
- cost_limit — max spend in the period (decimal).
- token_limit — or limit by tokens instead.
- request_limit — or limit by raw request count.
- soft_limit_pct — at this fraction of the hard limit, trigger webhooks / alerts (no blocking yet).
- hard_action — block (return 429), notify (only warn), throttle (reduce rate limits).
Budgets can stack. With a user-level budget under a tenant-level budget under a partner-level budget, all three are enforced simultaneously, and the most restrictive one wins.
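One way to picture "most restrictive wins" is to check the request against every budget in the scope chain and block on the one with the least room. The sketch below is an assumption about how such a check could look, not ScaiGrid's gateway code; the `Budget` class, the `check_request` function, and the idea of comparing against an estimated request cost are illustrative.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Budget:
    scope: str
    cost_limit: Decimal
    spent: Decimal

    def remaining(self):
        return self.cost_limit - self.spent


def check_request(budgets, estimated_cost):
    """Return (allowed, blocking_scope); every budget in the chain must have room."""
    tightest = min(budgets, key=lambda b: b.remaining())
    if tightest.remaining() < estimated_cost:
        return False, tightest.scope
    return True, None


chain = [
    Budget("partner", Decimal("1000"), Decimal("400")),
    Budget("tenant", Decimal("200"), Decimal("199.50")),
    Budget("user", Decimal("50"), Decimal("10")),
]
blocked = check_request(chain, Decimal("1.00"))
allowed = check_request(chain, Decimal("0.25"))
```

Here the tenant budget has only 0.50 left, so it blocks a 1.00 request even though the partner and user budgets still have plenty of room.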
Soft limits and alerts
When usage crosses soft_limit_pct, ScaiGrid fires a budget.soft_limit_reached event on the event bus. Subscribe to it via a webhook (for example, one that posts to Slack) so your operations team is alerted before the hard block kicks in.
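The relationship between the soft and hard thresholds can be sketched as a small classifier. It assumes soft_limit_pct is expressed as a fraction between 0 and 1 (for example 0.8), following the "fraction of the hard limit" wording above; the `budget_state` function and its return labels are hypothetical.

```python
from decimal import Decimal


def budget_state(spent, cost_limit, soft_limit_pct):
    # Assumes soft_limit_pct is a fraction such as Decimal("0.8").
    if spent >= cost_limit:
        return "hard_limit_reached"
    if spent >= cost_limit * soft_limit_pct:
        return "soft_limit_reached"  # alert only, requests still pass
    return "ok"


limit = Decimal("100")
soft = Decimal("0.8")
states = [budget_state(Decimal(s), limit, soft) for s in ("79", "80", "100")]
```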
Accounting modes
ScaiGrid supports two failure modes for the Redis counter pipeline:
- reject (default, safer) — if Redis is unreachable when checking budget, reject the request. No free inference during Redis outages.
- allow (available, looser) — if Redis is unreachable, allow the request. Useful if you value availability over exact cost enforcement.
Set via ACCOUNTING_REDIS_FAILURE_MODE env var.
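The difference between the two modes comes down to what happens in the error path of the budget check. The sketch below illustrates that fail-closed versus fail-open choice; `CounterStoreDown`, `failure_mode`, and `admit` are hypothetical names, and falling back to reject on an unrecognized value is an assumption, not documented behaviour.

```python
import os


class CounterStoreDown(Exception):
    """Raised by the stand-in counter store when it cannot be reached."""


def failure_mode():
    mode = os.environ.get("ACCOUNTING_REDIS_FAILURE_MODE", "reject").lower()
    # Assumption: unknown values fall back to the safer default.
    return mode if mode in ("reject", "allow") else "reject"


def admit(check_budget):
    """check_budget() returns True or False, or raises CounterStoreDown."""
    try:
        return check_budget()
    except CounterStoreDown:
        # Fail closed under "reject", fail open under "allow".
        return failure_mode() == "allow"


def unreachable():
    raise CounterStoreDown("counter store unreachable")
```

Calling `admit(unreachable)` returns False under reject and True under allow, while a healthy check is unaffected by the setting.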
Exporting usage
For external billing, export raw usage records. Supported formats are csv, json, and ndjson, which makes the export useful for feeding into QuickBooks, Stripe metered billing, or your own data warehouse.
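Once you have an ndjson export, rolling it up for billing is a short script. The sketch below assumes each line is a JSON object with `tenant` and `cost` fields; those field names and the sample records are guesses for illustration, so adjust them to the actual export schema.

```python
import json
from collections import defaultdict
from decimal import Decimal


def cost_by_tenant(ndjson_text):
    totals = defaultdict(Decimal)  # Decimal() starts each tenant at 0
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line, parse_float=Decimal)
        # Field names "tenant" and "cost" are assumptions about the export.
        totals[record["tenant"]] += Decimal(str(record["cost"]))
    return dict(totals)


sample = (
    '{"tenant": "acme", "cost": "0.0096"}\n'
    '{"tenant": "acme", "cost": "0.0004"}\n'
    '{"tenant": "beta", "cost": "0.5"}\n'
)
totals = cost_by_tenant(sample)
```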
Streaming reservations
Streaming completions don't know their final token count until they finish. To avoid over-committing budget, ScaiGrid reserves tokens up-front based on the request's max_tokens, then settles to the actual count when the stream completes. If the reservation exceeds budget, the stream is rejected before it starts.
This is transparent: streaming requests need no special handling on your side.
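The reserve-then-settle pattern can be illustrated with a toy token budget. This is a sketch of the general technique under simple assumptions (a single token_limit, no concurrency control), not ScaiGrid's internal logic; the `TokenBudget` class and its methods are hypothetical.

```python
class TokenBudget:
    def __init__(self, token_limit):
        self.token_limit = token_limit
        self.used = 0      # settled tokens
        self.reserved = 0  # held for in-flight streams

    def reserve(self, max_tokens):
        # Reject before the stream starts if the worst case would not fit.
        if self.used + self.reserved + max_tokens > self.token_limit:
            return False
        self.reserved += max_tokens
        return True

    def settle(self, max_tokens, actual_tokens):
        # Release the hold and record what the stream really consumed.
        self.reserved -= max_tokens
        self.used += actual_tokens


budget = TokenBudget(token_limit=1000)
ok_first = budget.reserve(800)    # fits: 800 <= 1000
ok_second = budget.reserve(300)   # rejected: 800 + 300 > 1000
budget.settle(800, 250)           # stream finished after 250 tokens
ok_third = budget.reserve(300)    # fits now: 250 + 300 <= 1000
```

Reserving against max_tokens is deliberately pessimistic: a second stream can be turned away while the first is in flight, then admitted once the first settles to its real count.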
What's next
- Webhooks — subscribe to budget events.
- Rate Limiting — complementary to budgets, protects against bursty abuse.
- Accounting Reference — full endpoint list.