---
title: Embeddings
path: api-guides/embeddings
status: published
---

# Embeddings

Convert text into dense vectors for semantic search, clustering, recommendation, and anything else that benefits from distance-based similarity.

**Endpoint:** `POST /v1/inference/embed`

## Basic request

```bash
curl -X POST https://scaigrid.scailabs.ai/v1/inference/embed \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/text-embedding-3-small",
    "input": ["The quick brown fox", "jumps over the lazy dog"]
  }'
```

```python
import httpx, os

resp = httpx.post(
    "https://scaigrid.scailabs.ai/v1/inference/embed",
    headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
    json={
        "model": "openai/text-embedding-3-small",
        "input": ["The quick brown fox", "jumps over the lazy dog"],
    },
)
data = resp.json()["data"]
for item in data["data"]:
    print(f"Index {item['index']}: {len(item['embedding'])} dimensions")
```

```typescript
const resp = await fetch("https://scaigrid.scailabs.ai/v1/inference/embed", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "openai/text-embedding-3-small",
    input: ["The quick brown fox", "jumps over the lazy dog"],
  }),
});
const { data } = await resp.json();
for (const item of data.data) {
  console.log(`Index ${item.index}: ${item.embedding.length} dimensions`);
}
```

## Input formats

The `input` field accepts either a string or an array of strings:

```json
{"model": "...", "input": "single string"}
{"model": "...", "input": ["first", "second", "third"]}
```

Batch whenever you have multiple texts: one API call for 100 texts is far cheaper (in latency and cost) than 100 separate calls. Most embedding models accept 1000+ texts per request.
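As a minimal sketch of batching, the helper below splits a list of texts into request payloads. The names `batched` and `build_payloads` and the batch size of 100 are illustrative assumptions, not part of the API; pick a size that fits your model's limits.

```python
from typing import Dict, Iterator, List, Union


def batched(texts: List[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def build_payloads(
    texts: List[str], model: str, batch_size: int = 100
) -> List[Dict[str, Union[str, List[str]]]]:
    """Build one JSON body per batch, ready to POST to /v1/inference/embed."""
    return [{"model": model, "input": batch} for batch in batched(texts, batch_size)]


payloads = build_payloads(
    [f"doc {i}" for i in range(250)], "openai/text-embedding-3-small"
)
print([len(p["input"]) for p in payloads])  # [100, 100, 50]
```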

## Response shape

```json
{
  "status": "ok",
  "data": {
    "model": "openai/text-embedding-3-small",
    "data": [
      {"index": 0, "embedding": [0.023, -0.101, ...]},
      {"index": 1, "embedding": [-0.045, 0.082, ...]}
    ],
    "usage": {"prompt_tokens": 8, "total_tokens": 8}
  }
}
```
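The sketch below shows one way to unpack this shape into a plain list of vectors. It assumes that any `status` other than `"ok"` means the request failed, and it sorts by `index` as a precaution rather than relying on response order; `extract_embeddings` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def extract_embeddings(body: Dict[str, Any]) -> List[List[float]]:
    """Return the embeddings from a response body, ordered by `index`."""
    if body.get("status") != "ok":
        # Assumption: a non-"ok" status signals an error response.
        raise RuntimeError(f"unexpected status: {body.get('status')!r}")
    items = sorted(body["data"]["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


# A hand-written body in the documented shape, with items out of order.
sample = {
    "status": "ok",
    "data": {
        "model": "openai/text-embedding-3-small",
        "data": [
            {"index": 1, "embedding": [-0.045, 0.082]},
            {"index": 0, "embedding": [0.023, -0.101]},
        ],
        "usage": {"prompt_tokens": 8, "total_tokens": 8},
    },
}
print(extract_embeddings(sample)[0])  # [0.023, -0.101]
```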

Vector dimensions depend on the model:

| Model | Dimensions |
|-------|-----------:|
| `openai/text-embedding-3-small` | 1536 |
| `openai/text-embedding-3-large` | 3072 |
| `openai/text-embedding-ada-002` | 1536 |
| `google/text-embedding-004` | 768 |
| `mistral/mistral-embed` | 1024 |

Check your tenant's model list (`GET /v1/models?modality=embedding`) to see what's available.

## Computing cosine similarity

Embeddings are typically compared with cosine similarity:

```python
import numpy as np

def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# `data` is the parsed response body from the basic request above
emb_a = data["data"][0]["embedding"]
emb_b = data["data"][1]["embedding"]
print(cosine_similarity(emb_a, emb_b))  # between -1.0 (opposite direction) and 1.0 (same direction)
```

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```

Some embedding models normalize vectors to unit length — for those, cosine similarity equals the dot product.
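To illustrate the point, here is a small standard-library check that normalizing two vectors first makes their dot product equal their cosine similarity. The vectors are toy values, not real embeddings, and the helper names are illustrative.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: List[float]) -> List[float]:
    """Scale `v` to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
# After normalization, the dot product alone gives the cosine similarity.
print(math.isclose(cosine(a, b), dot(normalize(a), normalize(b))))  # True
```

If you are unsure whether a model returns unit-length vectors, checking `math.sqrt(dot(v, v))` on a few embeddings is a quick way to find out.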

## Building a search index

For production workloads, use [ScaiMatrix](/docs/scaigrid/scaimatrix): it runs embeddings, stores vectors in Weaviate, and exposes a search API, so you don't have to implement indexing yourself.

If you want to manage your own index (pgvector, Qdrant, Faiss, etc.), the flow is:

1. Split documents into chunks (typically 200–800 tokens each).
2. Embed each chunk with `POST /v1/inference/embed`.
3. Store `(vector, chunk_text, metadata)` rows in your vector store.
4. At query time, embed the user's query and search for nearest neighbors.
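Step 1 can be sketched roughly as follows. This version counts whitespace-separated words as a stand-in for tokens and overlaps neighboring chunks slightly; both choices are assumptions made for illustration, and a real pipeline would likely use the tokenizer that matches the embedding model. `chunk_words` is a hypothetical helper.

```python
from typing import List


def chunk_words(text: str, max_words: int = 300, overlap: int = 30) -> List[str]:
    """Split `text` into chunks of at most `max_words` words.

    Consecutive chunks share `overlap` words so that sentences cut at a
    boundary still appear whole in at least one chunk.
    """
    if max_words < 1 or not 0 <= overlap < max_words:
        raise ValueError("need max_words >= 1 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


text = " ".join(f"w{i}" for i in range(10))
print(chunk_words(text, max_words=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```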

```python
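# Bulk embed a document
import os

import httpx

API_KEY = os.environ["SCAIGRID_API_KEY"]
chunks = ["First paragraph...", "Second paragraph...", "Third paragraph..."]

resp = httpx.post(
    "https://scaigrid.scailabs.ai/v1/inference/embed",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"model": "openai/text-embedding-3-small", "input": chunks},
)
resp.raise_for_status()  # fail loudly on HTTP errors
body = resp.json()["data"]

# Sort by index as a precaution so vectors[i] corresponds to chunks[i]
items = sorted(body["data"], key=lambda item: item["index"])
vectors = [item["embedding"] for item in items]
# Insert into your vector store with the corresponding chunks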
```
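For step 4, the following is a brute-force nearest-neighbor sketch over an in-memory list. It is meant only to show the idea; a vector store such as pgvector, Qdrant, or Faiss would replace it in practice. The `top_k` helper and the `(chunk_text, vector)` row layout are assumptions, and the 2-D vectors are toy values.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    rows: List[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Score every row against `query` and return the `k` best matches."""
    scored = [(cosine_similarity(query, vec), text) for text, vec in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


rows = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
print([text for _, text in top_k([1.0, 0.1], rows, k=2)])  # ['a', 'c']
```

The scan is linear in the number of rows, which is fine for a few thousand chunks but is the part a dedicated index speeds up.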

## Rate limits and batching strategy

Embedding requests are rate-limited like any other inference call — per API key, per user, per tenant. For large ingestion jobs, batch aggressively (up to a few hundred texts per request) and add a small delay between batches to stay within the per-minute rate limit. See [Rate Limiting](../07-advanced/05-rate-limiting.md).
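A paced ingestion loop might look like the sketch below. The `embed_paced` function, the `embed_batch` callable (which would wrap the `POST /v1/inference/embed` call), and the `requests_per_minute` value are all assumptions for illustration; use the limits that actually apply to your key and tenant.

```python
import time
from typing import Callable, List, Sequence

Vector = List[float]


def embed_paced(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],
    batch_size: int = 200,
    requests_per_minute: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Vector]:
    """Embed `texts` batch by batch, pausing between requests."""
    delay = 60.0 / requests_per_minute  # seconds between consecutive requests
    vectors: List[Vector] = []
    for start in range(0, len(texts), batch_size):
        if start > 0:
            sleep(delay)  # no pause needed before the first request
        vectors.extend(embed_batch(list(texts[start:start + batch_size])))
    return vectors


# Usage with a stub in place of the real HTTP call and a recorded sleep.
pauses: List[float] = []
stub = lambda batch: [[float(len(t))] for t in batch]
out = embed_paced(["a", "bb", "ccc"], stub, batch_size=2,
                  requests_per_minute=120, sleep=pauses.append)
print(out, pauses)  # [[1.0], [2.0], [3.0]] [0.5]
```

Passing `sleep` as a parameter keeps the loop testable without real delays.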

If you're indexing large corpora (> 100K documents), consider [Batch Inference](./06-batch-inference.md) — async jobs with higher throughput and lower cost.

## Dimensional reduction

Some models support returning reduced-dimension embeddings via a `dimensions` parameter:

```python
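import os

import httpx

API_KEY = os.environ["SCAIGRID_API_KEY"]

resp = httpx.post(
    "https://scaigrid.scailabs.ai/v1/inference/embed",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "openai/text-embedding-3-large",
        "input": "hello world",
        "dimensions": 512,  # reduce from 3072 to 512
    },
)
embedding = resp.json()["data"]["data"][0]["embedding"]
print(len(embedding))  # should match the requested dimensions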
```

This is provider-dependent: today only `openai/text-embedding-3-*` models support it. The response contains embeddings with the requested number of dimensions, trading some quality for smaller storage and faster search.
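To put the storage side of that trade-off in numbers, here is a back-of-the-envelope calculation. It assumes vectors are stored as 32-bit floats (4 bytes per value) and counts raw vector data only, ignoring index overhead and metadata; the corpus size of one million vectors is an arbitrary example.

```python
def index_size_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage for `num_vectors` vectors, assuming float32 values."""
    return num_vectors * dimensions * bytes_per_value


full = index_size_bytes(1_000_000, 3072)
reduced = index_size_bytes(1_000_000, 512)
print(full / 1e9, reduced / 1e9)  # 12.288 2.048 (gigabytes)
print(full // reduced)            # 6
```

Under these assumptions, dropping from 3072 to 512 dimensions cuts raw vector storage by a factor of six.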

## What's next

- [ScaiMatrix](/docs/scaigrid/scaimatrix) — full-stack search (embeddings + vector store + query API).
- [Batch Inference](./06-batch-inference.md) — efficient bulk processing.
- [OpenAI Compatibility](./07-openai-compatibility.md) — `/oai/v1/embeddings` works identically.
