---
title: Vectorization providers and policies
path: administration/vectorization-providers
status: published
---

Semantic search lets users find files by meaning, not just keywords. For example, a search for "Q4 revenue forecast" can match a document titled `2026 Annual Plan.xlsx` if its contents are about revenue. Two components power it: a vector embedding model that runs over your file content, and a Weaviate vector database that stores the resulting embeddings.

Both pieces are pluggable. ScaiDrive ships with provider integrations for common embedding APIs, and lets you scope which content gets indexed via vectorization policies.

## Providers

A **provider** is the embedding API. To add one, go to AI → Vectorization Providers → **New provider** and choose a provider type:

| Provider type | What it is |
|---|---|
| **ScaiGrid** | Your in-house ScaiLabs embedding API (default; usually pre-configured). |
| **OpenAI** | `text-embedding-3-large` / `text-embedding-3-small`. |
| **Cohere** | `embed-english-v3.0` and the multilingual variant. |
| **Bedrock** | AWS-hosted Titan or Cohere via Bedrock. |
| **Hugging Face Inference** | Any embedding model hosted on HF Inference. |
| **Custom OpenAI-compatible** | Any endpoint speaking the OpenAI `/v1/embeddings` shape (local models behind a vLLM proxy, Azure OpenAI, etc.). |

For each provider, supply the API endpoint, API key (encrypted at rest with ScaiDrive's secret key), model name, embedding dimension, and max-tokens-per-request. **Test** sends a sample embedding request to confirm that the credentials work.
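As an illustration of what a **Custom OpenAI-compatible** endpoint has to accept and return, here is a minimal sketch of the OpenAI `/v1/embeddings` request and response shape. It is not ScaiDrive's own **Test** implementation. The function names, the model name, and the canned response are hypothetical, and the dimension check only mirrors the idea of validating the configured embedding dimension.

```python
import json


def build_embedding_request(model: str, texts: list[str]) -> str:
    """Serialize a request body in the OpenAI /v1/embeddings shape."""
    return json.dumps({"model": model, "input": texts})


def parse_embedding_response(body: str, expected_dim: int) -> list[list[float]]:
    """Return the vectors ordered by index, checking each one's dimension."""
    payload = json.loads(body)
    items = sorted(payload["data"], key=lambda item: item["index"])
    vectors = [item["embedding"] for item in items]
    for vector in vectors:
        if len(vector) != expected_dim:
            raise ValueError(
                f"expected dimension {expected_dim}, got {len(vector)}"
            )
    return vectors


# Canned response so the sketch runs without a network call.
# "example-embedding-model" is a placeholder, not a real model name.
sample_response = json.dumps({
    "object": "list",
    "data": [{"object": "embedding", "index": 0, "embedding": [0.1, 0.2, 0.3]}],
    "model": "example-embedding-model",
    "usage": {"prompt_tokens": 2, "total_tokens": 2},
})

print(build_embedding_request("example-embedding-model", ["hello world"]))
print(parse_embedding_response(sample_response, expected_dim=3))
```

A mismatch between the configured dimension and what the endpoint returns raises an error here, which is the kind of misconfiguration worth catching before any content is indexed.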

You can configure multiple providers; one is the default, used unless a policy overrides it. Useful patterns:

- **Default**: ScaiGrid for low-latency, low-cost, in-house embedding.
- **Override**: OpenAI for a specific share that needs higher-quality recall.

## Health check

Each provider's detail page shows:

- **Status** — last health check pass/fail.
- **Latency** — p50/p95 of recent embedding calls.
- **Throughput** — embeddings per minute.
- **Cost estimate** — if the provider has known per-token pricing.

The health check runs every 5 minutes and whenever you save the provider. If a provider fails its health check, the system falls back to the next provider (if any) and logs a `SECURITY` event.
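The fallback rule above can be pictured as "first healthy provider in order". The sketch below assumes providers are kept in an ordered list with the default first; that ordering, the `Provider` class, and `pick_provider` are illustrative assumptions, not ScaiDrive internals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Provider:
    name: str
    healthy: bool  # outcome of the most recent health check


def pick_provider(ordered: list[Provider]) -> Optional[Provider]:
    """Return the first healthy provider in order, or None if all are failing."""
    for provider in ordered:
        if provider.healthy:
            return provider
    return None


# Hypothetical state: the default provider is failing, the next one is healthy.
providers = [Provider("scaigrid", healthy=False), Provider("openai", healthy=True)]
chosen = pick_provider(providers)
print(chosen.name if chosen else "no healthy provider")
```

Keep in mind that a fallback provider with a different embedding dimension would produce vectors that are not comparable with the existing ones, so fallbacks are easiest to reason about when the providers share a model family.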

## Policies

A **vectorization policy** scopes which files are indexed:

- **Scope** — shares, path patterns, or specific sensitivity labels.
- **File type filter** — MIME prefixes (`text/`, `application/pdf`, …). Defaults to all extractable types.
- **Chunking strategy** — `semantic` (preferred), `fixed` (size-based), or `whole` (one chunk per file; for short docs).
- **Chunk size** / **chunk overlap** — bytes per chunk and overlap between adjacent chunks. Defaults are sensible.
- **Provider override** — use a non-default provider for this scope.
- **Enabled** — toggle on/off.
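To make **chunk size** and **chunk overlap** concrete, here is a minimal sketch of size-based (`fixed`) chunking over raw bytes. It is an illustration under stated assumptions, not ScaiDrive's chunker: `fixed_chunks` is a hypothetical name, and a real implementation would likely avoid splitting in the middle of a multi-byte character.

```python
def fixed_chunks(data: bytes, chunk_size: int, overlap: int) -> list[bytes]:
    """Split data into chunks of chunk_size bytes, each sharing
    `overlap` bytes with the chunk before it."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(data):
        chunks.append(data[start:start + chunk_size])
        if start + chunk_size >= len(data):
            break  # this chunk already reached the end of the data
        start += step
    return chunks


print(fixed_chunks(b"abcdefghij", chunk_size=4, overlap=1))
# [b'abcd', b'defg', b'ghij']
```

The overlap means that a sentence straddling a chunk boundary still appears intact in at least one chunk, at the cost of embedding some bytes twice.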

To create one, go to AI → Vectorization Policies → **New policy**. You can define many policies; when several match a file, the most specific one wins (file path > share > tenant default).
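The precedence rule (file path > share > tenant default) can be sketched as a ranking over the policies that match a file. The `Policy` fields, the glob-style path matching, and the `resolve` function below are assumptions made for illustration; ScaiDrive's actual matching logic may differ.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional


@dataclass
class Policy:
    name: str
    share: Optional[str] = None         # None: applies to every share
    path_pattern: Optional[str] = None  # glob pattern; None: no path condition


def specificity(policy: Policy) -> int:
    """Rank a policy: file path (2) > share (1) > tenant default (0)."""
    if policy.path_pattern is not None:
        return 2
    if policy.share is not None:
        return 1
    return 0


def resolve(policies: list[Policy], share: str, path: str) -> Optional[Policy]:
    """Return the most specific policy that matches the file, if any."""
    matches = [
        p for p in policies
        if (p.share is None or p.share == share)
        and (p.path_pattern is None or fnmatchcase(path, p.path_pattern))
    ]
    return max(matches, key=specificity, default=None)


policies = [
    Policy("tenant-default"),
    Policy("legal-share", share="legal"),
    Policy("legal-contracts", share="legal", path_pattern="contracts/*"),
]

print(resolve(policies, "legal", "contracts/nda.pdf").name)  # legal-contracts
print(resolve(policies, "legal", "memos/notes.txt").name)    # legal-share
print(resolve(policies, "engineering", "README.md").name)    # tenant-default
```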

### When to create a policy

You don't *need* policies — the default behavior indexes everything indexable. Create a policy when you want to:

- **Exclude** a share (e.g., a noisy build-artifact share that would pollute search results).
- **Use a beefier model** for a high-value share (e.g., `text-embedding-3-large` for the legal team's documents).
- **Tune chunking** for a specific content type (e.g., longer chunks for prose, shorter for code).

## Storage

Vectors live in Weaviate, in a multi-tenant collection. Each ScaiDrive tenant maps to a Weaviate tenant — so even with shared infrastructure, no two ScaiDrive tenants can read each other's chunks.

The Weaviate connection is configured at System → Settings → Vectorization → Weaviate URL. ScaiDrive creates the collection schema the first time it connects.

## Indexing backlog

When you onboard a large share for the first time, the indexer works through the backlog in batches: every 5 minutes, a worker processes a batch of up to 100 pending files. Backlog size is visible at AI → Vectorization → **Queue status**.
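The batch figures above give a rough upper bound on how long a backlog takes to drain. The sketch below assumes exactly one batch of up to 100 files every 5 minutes and that every batch is full; `drain_time_hours` is a hypothetical helper, and real throughput also depends on file size and provider latency.

```python
import math

BATCH_SIZE = 100          # files per batch, from the figures above
BATCH_INTERVAL_MIN = 5    # minutes between batches, from the figures above


def drain_time_hours(pending_files: int) -> float:
    """Estimate the hours needed to index pending_files at one batch per interval."""
    batches = math.ceil(pending_files / BATCH_SIZE)
    return batches * BATCH_INTERVAL_MIN / 60


files_per_hour = BATCH_SIZE * 60 // BATCH_INTERVAL_MIN
print(files_per_hour)                       # 1200
print(round(drain_time_hours(250_000), 1))  # 208.3 hours, roughly 8.7 days
```

At this rate a share with 250,000 files takes more than a week to become fully searchable, which is worth factoring into onboarding plans.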

For files without extractable text (videos, images, archives, and similar), ScaiDrive falls back to indexing **just the filename**, so a video named `2026-Q1-allhands-recording.mp4` is still findable with a query like "all hands Q1". Files larger than 200 MB that do have extractable text are indexed via a **head sample** (the first ~5 MB); results from those chunks are tagged `truncated: true` so callers know they are partial. See [Search](/docs/scaidrive/api-guides/search) for the developer-facing detail.
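The indexing rules above amount to a small decision function. This sketch assumes "200 MB" means 200 × 10^6 bytes and uses hypothetical mode names; it only restates the rules, it does not reflect ScaiDrive's internal code.

```python
LARGE_FILE_BYTES = 200 * 1000 * 1000  # assumption: "200 MB" read as decimal megabytes


def index_mode(size_bytes: int, has_extractable_text: bool) -> str:
    """Decide how a file is indexed according to the rules described above."""
    if not has_extractable_text:
        return "filename_only"  # videos, images, archives, ...
    if size_bytes > LARGE_FILE_BYTES:
        return "head_sample"    # results from these chunks carry truncated: true
    return "full"


print(index_mode(900_000_000, has_extractable_text=False))  # filename_only
print(index_mode(350_000_000, has_extractable_text=True))   # head_sample
print(index_mode(2_000_000, has_extractable_text=True))     # full
```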

## Cost control

Embedding APIs charge per token. If you use a paid provider:

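- Set **chunk size** near the upper end of the model's input window to reduce the number of embedding calls. Chunk size is configured in bytes while model limits are expressed in tokens, so leave some headroom.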
- Use a policy to **scope semantic search to specific shares** rather than everything.
- Monitor **cost estimate** on the provider page and set up a webhook for `compliance.budget_threshold_reached` if your provider supports budgets.
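A back-of-the-envelope calculation shows how these levers interact. The sketch assumes one embedding call per chunk, ignores chunk overlap, and uses a placeholder price that is not any provider's real rate; both function names are hypothetical.

```python
import math


def embed_calls(total_tokens: int, tokens_per_chunk: int) -> int:
    """Number of embedding calls, assuming one call per chunk and no overlap."""
    return math.ceil(total_tokens / tokens_per_chunk)


def estimated_cost(total_tokens: int, price_per_million_tokens: float) -> float:
    """Token-based cost; the chunk size does not change the token total."""
    return total_tokens / 1_000_000 * price_per_million_tokens


PLACEHOLDER_PRICE = 0.10  # hypothetical price per million tokens

# Larger chunks mean fewer calls for the same content.
print(embed_calls(1_000_000, tokens_per_chunk=500))   # 2000
print(embed_calls(1_000_000, tokens_per_chunk=8000))  # 125

# Scoping to one share instead of the whole tenant cuts the token total.
print(estimated_cost(50_000_000, PLACEHOLDER_PRICE))  # whole tenant
print(estimated_cost(5_000_000, PLACEHOLDER_PRICE))   # one share
```

Under these assumptions, chunk size mainly affects call count and latency, while scoping is what reduces the token bill itself.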

## Disabling semantic search

To turn off semantic search entirely, go to System → Settings → Features and uncheck **Semantic search**. Keyword (BM25) search continues to work; semantic-search API endpoints return `503` while the feature is disabled.

Existing embeddings stay in Weaviate while the feature is disabled; re-enabling it resumes indexing where it left off, without re-indexing existing content.
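API clients can treat the `503` as a signal to fall back to keyword search. The sketch below injects the two search calls as plain functions so it runs without a server; the function names, the `(status, results)` return shape, and the stub results are hypothetical and do not describe the actual ScaiDrive search API.

```python
from typing import Callable

SemanticCall = Callable[[str], tuple[int, list[str]]]
KeywordCall = Callable[[str], list[str]]


def search_with_fallback(
    query: str, semantic_search: SemanticCall, keyword_search: KeywordCall
) -> tuple[str, list[str]]:
    """Try semantic search first; fall back to keyword search on a 503."""
    status, results = semantic_search(query)
    if status == 503:
        return "keyword", keyword_search(query)
    return "semantic", results


# Stubs standing in for real HTTP calls.
def semantic_disabled(query: str) -> tuple[int, list[str]]:
    return 503, []


def keyword_stub(query: str) -> list[str]:
    return ["2026 Annual Plan.xlsx"]


print(search_with_fallback("Q4 revenue forecast", semantic_disabled, keyword_stub))
```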

## What's next

- [Search](/docs/scaidrive/api-guides/search) — using the search API.
- [Compliance policies](/docs/scaidrive/administration/compliance-policies) — sensitivity labels can scope policies.