Embeddings

Convert text into dense vectors for semantic search, clustering, recommendation, and anything else that benefits from distance-based similarity.

Endpoint: POST /v1/inference/embed

Basic request

bash
curl -X POST https://scaigrid.scailabs.ai/v1/inference/embed \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/text-embedding-3-small",
    "input": ["The quick brown fox", "jumps over the lazy dog"]
  }'

python
import os

import httpx

resp = httpx.post(
    "https://scaigrid.scailabs.ai/v1/inference/embed",
    headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
    json={
        "model": "openai/text-embedding-3-small",
        "input": ["The quick brown fox", "jumps over the lazy dog"],
    },
)
resp.raise_for_status()  # stop here on an HTTP error instead of parsing an error body

# The outer "data" key holds the result object; its own "data" key holds one item per input.
data = resp.json()["data"]
for item in data["data"]:
    print(f"Index {item['index']}: {len(item['embedding'])} dimensions")

typescript
const resp = await fetch("https://scaigrid.scailabs.ai/v1/inference/embed", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "openai/text-embedding-3-small",
    input: ["The quick brown fox", "jumps over the lazy dog"],
  }),
});
const { data } = await resp.json();
for (const item of data.data) {
  console.log(`Index ${item.index}: ${item.embedding.length} dimensions`);
}

Input formats

The input field accepts either a string or an array of strings:

json
{"model": "...", "input": "single string"}
{"model": "...", "input": ["first", "second", "third"]}

Batching is strongly preferred when you have multiple texts — one API call for 100 texts is dramatically cheaper (in latency and cost) than 100 separate calls. Most embedding models accept 1000+ texts per request.
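
To make the batching advice concrete, here is a minimal sketch that splits a long list of texts into fixed-size batches and embeds them one batch per call. The names `batched` and `embed_all`, the `embed_batch` callable (anything that takes a list of strings and returns one vector per string, for example a thin wrapper around POST /v1/inference/embed), and the default batch size of 100 are assumptions for illustration, not part of the API.

```python
from typing import Callable, Iterator, List, Sequence


def batched(texts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,  # hypothetical default; tune to your model and rate limits
) -> List[List[float]]:
    """Embed every text using one call to `embed_batch` per batch, preserving order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors
```

With 250 texts and a batch size of 100, this makes three calls (100, 100, and 50 texts) instead of 250.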

Response shape

json
{
  "status": "ok",
  "data": {
    "model": "openai/text-embedding-3-small",
    "data": [
      {"index": 0, "embedding": [0.023, -0.101, ...]},
      {"index": 1, "embedding": [-0.045, 0.082, ...]}
    ],
    "usage": {"prompt_tokens": 8, "total_tokens": 8}
  }
}
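
A minimal sketch of unpacking this envelope into a plain list of vectors. It assumes that any `status` other than "ok" signals a failure and that each item's `index` is the position of the corresponding input; neither is stated above, and `extract_embeddings` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def extract_embeddings(payload: Dict[str, Any]) -> List[List[float]]:
    """Return the embedding vectors from a parsed response body, ordered by `index`."""
    # Assumption: a status other than "ok" means the request did not succeed.
    if payload.get("status") != "ok":
        raise ValueError(f"unexpected status: {payload.get('status')!r}")
    # Outer "data" is the result object; inner "data" is the list of items.
    items = sorted(payload["data"]["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

Sorting by `index` keeps each vector aligned with its input text even if the items arrive in a different order.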

Vector dimensions depend on the model:

Model	Dimensions
`openai/text-embedding-3-small`	1536
`openai/text-embedding-3-large`	3072
`openai/text-embedding-ada-002`	1536
`google/text-embedding-004`	768
`mistral/mistral-embed`	1024

Check your tenant's model list (GET /v1/models?modality=embedding) to see what's available.

Computing cosine similarity

Embeddings are typically compared with cosine similarity:

python
import numpy as np

def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

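# `data` is the result object from the basic request: resp.json()["data"]
emb_a = data["data"][0]["embedding"]
emb_b = data["data"][1]["embedding"]
print(cosine_similarity(emb_a, emb_b))  # ranges from -1.0 (opposite direction) to 1.0 (same direction)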

typescript
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

Some embedding models normalize vectors to unit length — for those, cosine similarity equals the dot product.
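
The equivalence is easy to check. A small standard-library sketch (helper names are illustrative): once both vectors are scaled to unit length, their dot product matches the cosine similarity of the originals.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Sequence[float]) -> List[float]:
    """Scale `v` to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between `a` and `b`."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
# Both expressions give 24/25 = 0.96, up to floating-point rounding.
print(cosine_similarity(a, b), dot(normalize(a), normalize(b)))
```

If your vector store only offers dot-product search, normalizing once at insert time gives you cosine ranking without extra work at query time.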

Building a search index

For production workloads, use ScaiMatrix: it generates embeddings, stores vectors in Weaviate, and exposes a search API, so you don't have to reimplement indexing yourself.

If you want to manage your own index (pgvector, Qdrant, Faiss, etc.), the flow is:

1. Split documents into chunks (typically 200–800 tokens each).
2. Embed each chunk with POST /v1/inference/embed.
3. Store (vector, chunk_text, metadata) rows in your vector store.
4. At query time, embed the user's query and search for nearest neighbors.
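
The first step leaves the chunking method open. As a rough sketch, assuming whitespace-separated words are an acceptable stand-in for tokens (a real tokenizer will count differently), a fixed-size splitter could look like this. `chunk_words` and its `max_words` default are hypothetical.

```python
from typing import List


def chunk_words(text: str, max_words: int = 300) -> List[str]:
    """Split `text` into consecutive chunks of at most `max_words` words each."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()  # assumption: words approximate tokens
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]
```

Splitting on paragraph or sentence boundaries usually gives more coherent chunks; this version only shows the size constraint.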

python
# Bulk-embed the chunks of one document in a single request
import os

import httpx

API_KEY = os.environ["SCAIGRID_API_KEY"]
chunks = ["First paragraph...", "Second paragraph...", "Third paragraph..."]

resp = httpx.post(
    "https://scaigrid.scailabs.ai/v1/inference/embed",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"model": "openai/text-embedding-3-small", "input": chunks},
)
resp.raise_for_status()
result = resp.json()["data"]

vectors = [item["embedding"] for item in result["data"]]
# Insert into your vector store with the corresponding chunks
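
For the query-time step (embed the query, then search for nearest neighbors), a dedicated vector store does the search for you. To show what that search computes, here is a brute-force in-memory sketch over (vector, chunk_text, metadata) rows. The `top_k` helper is hypothetical and only practical for small collections.

```python
import math
from typing import Any, Dict, List, Sequence, Tuple

Row = Tuple[Sequence[float], str, Dict[str, Any]]  # (vector, chunk_text, metadata)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float], rows: List[Row], k: int = 3
) -> List[Tuple[float, str, Dict[str, Any]]]:
    """Return the `k` rows most similar to `query_vec` as (score, chunk_text, metadata)."""
    scored = [(cosine(query_vec, vec), text, meta) for vec, text, meta in rows]
    scored.sort(key=lambda row: row[0], reverse=True)  # highest similarity first
    return scored[:k]
```

Here `query_vec` would be the embedding of the user's query, produced with the same model used for the chunks.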

Rate limits and batching strategy

Embedding requests are rate-limited like any other inference call — per API key, per user, per tenant. For large ingestion jobs, batch aggressively (up to a few hundred texts per request) and add a small delay between batches to stay within the per-minute rate limit. See Rate Limiting.
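
A minimal sketch of the "batch, then pause" pattern described above. The `embed_batch` callable (list of texts in, list of vectors out) and the half-second default delay are assumptions; choose the delay from your actual per-minute limit. The `sleep` parameter exists only so the pause can be swapped out in tests.

```python
import time
from typing import Callable, Iterable, List


def embed_paced(
    batches: Iterable[List[str]],
    embed_batch: Callable[[List[str]], List[List[float]]],
    delay_seconds: float = 0.5,  # hypothetical value; derive it from your rate limit
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[float]]:
    """Embed each batch in turn, pausing between consecutive requests."""
    vectors: List[List[float]] = []
    for i, batch in enumerate(batches):
        if i > 0:
            sleep(delay_seconds)  # no pause before the first request
        vectors.extend(embed_batch(batch))
    return vectors
```

A fixed delay is the simplest option; it does not react to rate-limit errors, so a production job would typically add retries as well.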

If you're indexing large corpora (> 100K documents), consider Batch Inference — async jobs with higher throughput and lower cost.

Dimensional reduction

Some models support returning reduced-dimension embeddings via a dimensions parameter:

python
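import os

import httpx

API_KEY = os.environ["SCAIGRID_API_KEY"]

resp = httpx.post(
    "https://scaigrid.scailabs.ai/v1/inference/embed",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "openai/text-embedding-3-large",
        "input": "hello world",
        "dimensions": 512,  # reduce from 3072 to 512
    },
)
resp.raise_for_status()
embedding = resp.json()["data"]["data"][0]["embedding"]
print(len(embedding))  # length should equal the requested dimensions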

This parameter is provider-dependent: only openai/text-embedding-3-* supports it today. The response contains embeddings with the requested number of dimensions, trading some quality for smaller storage and faster search.
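
To put the storage side of that trade-off in numbers, here is a back-of-the-envelope calculation. It assumes vectors are stored as 32-bit floats (4 bytes per value) and ignores index overhead and metadata; your vector store may use a different layout.

```python
BYTES_PER_FLOAT32 = 4  # assumption: raw float32 storage, no compression


def index_size_bytes(
    num_vectors: int, dimensions: int, bytes_per_value: int = BYTES_PER_FLOAT32
) -> int:
    """Raw size of `num_vectors` embeddings with `dimensions` values each."""
    return num_vectors * dimensions * bytes_per_value


full = index_size_bytes(1_000_000, 3072)    # 12,288,000,000 bytes
reduced = index_size_bytes(1_000_000, 512)  # 2,048,000,000 bytes
print(f"{full / 1e9:.2f} GB vs {reduced / 1e9:.2f} GB")  # 12.29 GB vs 2.05 GB
```

Going from 3072 to 512 dimensions cuts raw vector storage by a factor of six for one million vectors.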

What's next

ScaiMatrix — full-stack search (embeddings + vector store + query API).
Batch Inference — efficient bulk processing.
OpenAI Compatibility — /oai/v1/embeddings works identically.