Embeddings
Convert text into dense vectors for semantic search, clustering, recommendation, and anything else that benefits from distance-based similarity.
Endpoint: POST /v1/inference/embed
Basic request
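A minimal sketch of a request to this endpoint, using only the Python standard library. The `input` field comes from this page; the base URL, the bearer-token header, and the `model` field name are assumptions for illustration, so check them against your tenant's configuration.

```python
import json
import urllib.request

# Assumed values for illustration; replace with your tenant's base URL and key.
BASE_URL = "https://api.example.com"
API_KEY = "YOUR_API_KEY"


def build_embed_request(text_or_texts, model):
    """Build (but do not send) a POST request to /v1/inference/embed."""
    # "model" is an assumed field name; "input" is the documented one.
    body = json.dumps({"model": model, "input": text_or_texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/inference/embed",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {API_KEY}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


def embed(text_or_texts, model):
    """Send the request and return the decoded JSON response as a dict."""
    with urllib.request.urlopen(build_embed_request(text_or_texts, model)) as resp:
        return json.load(resp)
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test without touching the network.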
Input formats
The input field accepts either a single string or an array of strings.
Batching is strongly preferred when you have multiple texts — one API call for 100 texts is dramatically cheaper (in latency and cost) than 100 separate calls. Most embedding models accept 1000+ texts per request.
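The two input shapes, and the idea of grouping many texts into few calls, can be sketched as follows. `make_payloads` is a hypothetical helper, the `model` field name is an assumption, and the default batch size of 100 is only an example value.

```python
def make_payloads(texts, model, batch_size=100):
    """Return a list of request bodies for the embed endpoint.

    A single string is sent as-is; a list is split into batches of at most
    batch_size texts so that each batch needs only one API call.
    """
    if isinstance(texts, str):
        return [{"model": model, "input": texts}]
    return [
        {"model": model, "input": texts[start:start + batch_size]}
        for start in range(0, len(texts), batch_size)
    ]
```

With 250 texts and the default batch size, this produces three request bodies instead of 250 separate calls.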
Response shape
Vector dimensions depend on the model:
| Model | Dimensions |
|---|---|
| openai/text-embedding-3-small | 1536 |
| openai/text-embedding-3-large | 3072 |
| openai/text-embedding-ada-002 | 1536 |
| google/text-embedding-004 | 768 |
| mistral/mistral-embed | 1024 |
Check your tenant's model list (GET /v1/models?modality=embedding) to see what's available.
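Dimension count drives storage cost, so a rough estimate can help when picking a model. The sketch below assumes each value is stored as a 4-byte float32 and ignores any index or metadata overhead; `index_size_bytes` is a hypothetical helper, and the dimension values are copied from the table above.

```python
# Dimensions per model, as listed in the table above.
DIMENSIONS = {
    "openai/text-embedding-3-small": 1536,
    "openai/text-embedding-3-large": 3072,
    "openai/text-embedding-ada-002": 1536,
    "google/text-embedding-004": 768,
    "mistral/mistral-embed": 1024,
}


def index_size_bytes(model, num_vectors, bytes_per_value=4):
    """Estimate raw vector storage: dimensions x bytes per value x vector count."""
    return DIMENSIONS[model] * bytes_per_value * num_vectors
```

Under these assumptions, one million vectors from a 3072-dimension model take about 12.3 GB of raw vector data, while the same count at 768 dimensions takes about 3.1 GB.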
Computing cosine similarity
Embeddings are typically compared with cosine similarity: the dot product of the two vectors divided by the product of their lengths.
Some embedding models normalize vectors to unit length — for those, cosine similarity equals the dot product.
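A dependency-free sketch of the computation, along with a `normalize` helper that shows why the dot product is enough for unit-length vectors. The function names are illustrative; in practice a library such as NumPy would usually be faster for large vectors.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    denominator = norm(a) * norm(b)
    if denominator == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / denominator


def normalize(a):
    """Scale a non-zero vector to unit length."""
    length = norm(a)
    return [x / length for x in a]
```

For vectors that are already unit length, `norm` returns 1 for both, so `cosine_similarity(a, b)` reduces to `dot(a, b)` and the division can be skipped.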
Building a search index
For production workloads, use ScaiMatrix: it runs embeddings, stores vectors in Weaviate, and exposes a search API, so you don't have to implement indexing yourself.
If you want to manage your own index (pgvector, Qdrant, Faiss, etc.), the flow is:
- Split documents into chunks (typically 200–800 tokens each).
- Embed each chunk with POST /v1/inference/embed.
- Store (vector, chunk_text, metadata) rows in your vector store.
- At query time, embed the user's query and search for nearest neighbors.
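The flow above can be sketched with a small in-memory index. This is an illustration under stated assumptions, not a replacement for a real vector store: chunks are split by word count as a rough stand-in for tokens, search is a brute-force scan, and `embed_fn` is a hypothetical callable that takes a list of strings and returns one vector per string (for example, a wrapper around the embed endpoint).

```python
import math


def chunk_words(text, max_words=300):
    """Split text into chunks of at most max_words words (approximates tokens)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def _cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryIndex:
    """Stores (vector, chunk_text, metadata) rows and scans them at query time."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # assumed signature: list[str] -> list[list[float]]
        self.rows = []

    def add_document(self, text, metadata, max_words=300):
        chunks = chunk_words(text, max_words)
        if not chunks:
            return
        # One batched embedding call per document rather than one per chunk.
        vectors = self.embed_fn(chunks)
        for vector, chunk in zip(vectors, chunks):
            self.rows.append((vector, chunk, metadata))

    def search(self, query, k=3):
        """Return the k highest-scoring (score, chunk_text, metadata) tuples."""
        query_vector = self.embed_fn([query])[0]
        scored = [(_cosine(query_vector, v), chunk, meta) for v, chunk, meta in self.rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:k]
```

A dedicated store such as pgvector, Qdrant, or Faiss replaces the linear scan with an index built for nearest-neighbor search, which matters once the row count grows beyond what a scan can handle.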
Rate limits and batching strategy
Embedding requests are rate-limited like any other inference call — per API key, per user, per tenant. For large ingestion jobs, batch aggressively (up to a few hundred texts per request) and add a small delay between batches to stay within the per-minute rate limit. See Rate Limiting.
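A paced ingestion loop along these lines might look like the sketch below. `embed_fn` is a hypothetical callable that embeds one batch of strings, and the batch size and delay are placeholder values to tune against your actual per-minute limit.

```python
import time


def embed_in_batches(texts, embed_fn, batch_size=200, delay_seconds=0.5, sleep=time.sleep):
    """Embed texts batch by batch, pausing between batches.

    embed_fn is assumed to take a list of strings and return one vector per
    string. The sleep argument is injectable so the loop can be tested
    without real waiting.
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        if start > 0:
            sleep(delay_seconds)  # no pause before the first batch
        vectors.extend(embed_fn(texts[start:start + batch_size]))
    return vectors
```

A fixed delay is the simplest option; a production job would likely also retry with backoff when a request is rejected for exceeding the limit.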
If you're indexing large corpora (> 100K documents), consider Batch Inference — async jobs with higher throughput and lower cost.
Dimensional reduction
Some models can return reduced-dimension embeddings when the request includes a dimensions parameter.
Support is provider-dependent: today only the openai/text-embedding-3-* models accept it. The response contains embeddings with the requested number of dimensions, trading some quality for smaller storage and faster search.
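A sketch of a request body that asks for reduced dimensions. The `dimensions` and `input` fields come from this page; the `model` field name and the client-side validation are assumptions, since how the server responds to an unsupported model or an oversized value is not described here.

```python
# Native sizes of the models this page lists as supporting `dimensions`.
NATIVE_DIMENSIONS = {
    "openai/text-embedding-3-small": 1536,
    "openai/text-embedding-3-large": 3072,
}


def build_reduced_payload(texts, model, dimensions):
    """Build a request body asking for embeddings of a reduced size."""
    if model not in NATIVE_DIMENSIONS:
        raise ValueError(f"{model} is not known to support the dimensions parameter")
    if not 1 <= dimensions <= NATIVE_DIMENSIONS[model]:
        raise ValueError(
            f"dimensions must be between 1 and {NATIVE_DIMENSIONS[model]} for {model}"
        )
    return {"model": model, "input": texts, "dimensions": dimensions}
```

For example, requesting 256 dimensions from a 1536-dimension model cuts raw vector storage by a factor of six, at some cost in retrieval quality.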
What's next
- ScaiMatrix: full-stack search (embeddings + vector store + query API).
- Batch Inference: efficient bulk processing.
- OpenAI Compatibility: /oai/v1/embeddings works identically.