Vectorization providers and policies
Semantic search lets users find files by meaning, not just keywords. "Q4 revenue forecast" can match a document titled 2026 Annual Plan.xlsx if the contents are about revenue. It's powered by a vector embedding model that runs over your file content and a Weaviate vector database that stores the embeddings.
Both pieces are pluggable. ScaiDrive ships with provider integrations for common embedding APIs, and lets you scope which content gets indexed via vectorization policies.
Providers
A provider is the embedding API that ScaiDrive calls to turn file content into vectors. To add one, go to AI → Vectorization Providers → New provider and choose a provider type:
| Provider type | What it is |
|---|---|
| ScaiGrid | Your in-house ScaiLabs embedding API (default; usually pre-configured). |
| OpenAI | text-embedding-3-large / text-embedding-3-small. |
| Cohere | embed-english-v3.0 and the multilingual variant. |
| Bedrock | AWS-hosted Titan or Cohere via Bedrock. |
| Hugging Face Inference | Any embedding model hosted on HF Inference. |
| Custom OpenAI-compatible | Any endpoint speaking the OpenAI /v1/embeddings shape (local models behind a vLLM proxy, Azure OpenAI, etc.). |
For each provider, supply the API endpoint, API key (encrypted at rest with ScaiDrive's secret key), model name, embedding dimension, and max-tokens-per-request. Test issues a sample embedding request to confirm that the credentials work.
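As a rough illustration of what a Custom OpenAI-compatible provider is expected to accept, the sketch below builds a request in the OpenAI `/v1/embeddings` shape and checks that returned vectors match the configured embedding dimension. The endpoint URL, the API key, and the helper names `build_embedding_request` and `check_dimension` are hypothetical; this is not ScaiDrive's own test routine, only a way to sanity-check an endpoint by hand.

```python
import json


def build_embedding_request(endpoint, api_key, model, texts):
    """Build (url, headers, body) for an OpenAI-style /v1/embeddings call."""
    url = endpoint.rstrip("/") + "/v1/embeddings"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"model": model, "input": texts})
    return url, headers, body


def check_dimension(response_json, expected_dim):
    """Return True if every returned vector has the configured dimension."""
    vectors = [item["embedding"] for item in response_json["data"]]
    return bool(vectors) and all(len(v) == expected_dim for v in vectors)


# Hypothetical endpoint and key, for illustration only.
url, headers, body = build_embedding_request(
    "https://embeddings.example.internal/",
    "sk-placeholder",
    "text-embedding-3-small",
    ["hello"],
)

# A hand-written response in the OpenAI response shape (not real model output).
sample_response = {"data": [{"index": 0, "embedding": [0.1, 0.2, 0.3]}]}
dimension_ok = check_dimension(sample_response, expected_dim=3)
```

If the dimension check fails, the embedding dimension entered for the provider probably does not match the model.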
You can have multiple providers configured; one is the default (used unless a policy overrides). Useful patterns:
- Default: ScaiGrid for low-latency, cheap, in-house.
- Override: OpenAI for a specific share that needs higher-quality recall.
Health check
Each provider's detail page shows:
- Status — last health check pass/fail.
- Latency — p50/p95 of recent embedding calls.
- Throughput — embeddings per minute.
- Cost estimate — if the provider has known per-token pricing.
The health check runs every 5 minutes and on save. If a provider fails health, the system falls back to the next provider (if any) and logs a SECURITY event.
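The fallback behavior can be pictured as walking an ordered provider list and taking the first one whose last health check passed. The sketch below assumes the default provider comes first and that fallback order is simply list order; how ScaiDrive actually orders fallback providers is not stated here, and `pick_provider` is a hypothetical name.

```python
def pick_provider(ordered_providers, last_check_passed):
    """Return the first provider whose last health check passed, or None.

    ordered_providers: provider names, default first (assumed ordering).
    last_check_passed: dict mapping provider name to True/False.
    """
    for name in ordered_providers:
        # A provider with no recorded check is treated as unhealthy.
        if last_check_passed.get(name, False):
            return name
    return None


providers = ["ScaiGrid", "OpenAI"]
chosen = pick_provider(providers, {"ScaiGrid": False, "OpenAI": True})
```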
Policies
A vectorization policy scopes which files are indexed:
- Scope: shares, path patterns, or specific sensitivity labels.
- File type filter: MIME prefixes (`text/`, `application/pdf`, …). Defaults to all extractable types.
- Chunking strategy: `semantic` (preferred), `fixed` (size-based), or `whole` (one chunk per file; for short docs).
- Chunk size / chunk overlap: bytes per chunk and overlap between adjacent chunks. Defaults are sensible.
- Provider override: use a non-default provider for this scope.
- Enabled: toggle on/off.
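To make the chunk size and chunk overlap settings concrete, here is a minimal sketch of size-based chunking with overlap, measured in bytes. It illustrates the general technique only; ScaiDrive's own `fixed` strategy may differ, for example in how it treats multi-byte characters that straddle a chunk boundary. The function name `chunk_fixed` is hypothetical.

```python
def chunk_fixed(data: bytes, chunk_size: int, overlap: int) -> list:
    """Split data into chunks of chunk_size bytes.

    Each chunk starts (chunk_size - overlap) bytes after the previous one,
    so adjacent chunks share `overlap` bytes.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(data):
        chunks.append(data[start:start + chunk_size])
        if start + chunk_size >= len(data):
            break  # this chunk already reaches the end of the data
        start += step
    return chunks


example_chunks = chunk_fixed(b"abcdefghij", chunk_size=4, overlap=1)
```

Here each chunk repeats the last byte of the previous one, which is the overlap a policy would configure at a much larger scale.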
To create a policy, go to AI → Vectorization Policies → New policy. You can have many policies; the most-specific match wins (file path > share > tenant default).
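The precedence rule (file path > share > tenant default) can be sketched as a small resolver. The `Policy` fields, the glob matching via `fnmatch`, and the function names below are assumptions made for illustration; they are not ScaiDrive's actual data model, and sensitivity-label scopes are left out.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Optional


@dataclass
class Policy:
    name: str
    share: Optional[str] = None         # None means "any share"
    path_pattern: Optional[str] = None  # glob on the path within the share


def specificity(policy):
    """Higher number means more specific: path (2) > share (1) > default (0)."""
    if policy.path_pattern is not None:
        return 2
    if policy.share is not None:
        return 1
    return 0


def matches(policy, share, path):
    if policy.share is not None and policy.share != share:
        return False
    if policy.path_pattern is not None and not fnmatch(path, policy.path_pattern):
        return False
    return True


def resolve(policies, share, path):
    """Return the most specific matching policy, or None if nothing matches."""
    candidates = [p for p in policies if matches(p, share, path)]
    return max(candidates, key=specificity, default=None)


policies = [
    Policy("tenant-default"),
    Policy("legal-share", share="legal"),
    Policy("legal-contracts", share="legal", path_pattern="contracts/*"),
]
winner = resolve(policies, "legal", "contracts/nda.pdf")
```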
When to create a policy
You don't need policies — the default behavior indexes everything indexable. Create a policy when you want to:
- Exclude a share (e.g., a noisy build-artifact share that would pollute search results).
- Use a beefier model for a high-value share (e.g., `text-embedding-3-large` for the legal team's documents).
- Tune chunking for a specific content type (e.g., longer chunks for prose, shorter for code).
Storage
Vectors live in Weaviate, in a multi-tenant collection. Each ScaiDrive tenant maps to a Weaviate tenant — so even with shared infrastructure, no two ScaiDrive tenants can read each other's chunks.
The Weaviate connection is configured at System → Settings → Vectorization → Weaviate URL. ScaiDrive will create the collection schema on first connect.
Indexing backlog
When you onboard a large share for the first time, the indexer works through the backlog in batches: every 5 minutes, a worker processes up to 100 pending files. Backlog size is visible at AI → Vectorization → Queue status.
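The batch size and interval above give a quick best-case estimate of how long a backlog takes to drain. The sketch assumes a single worker, full batches of 100 files, and no new files arriving in the meantime; `drain_minutes` is a hypothetical helper, not a ScaiDrive API.

```python
import math

BATCH_SIZE = 100       # files per batch, from the text above
INTERVAL_MINUTES = 5   # minutes between batches, from the text above


def drain_minutes(pending_files):
    """Best-case minutes to clear the backlog with one worker and full batches."""
    batches = math.ceil(pending_files / BATCH_SIZE)
    return batches * INTERVAL_MINUTES


minutes = drain_minutes(50_000)   # 500 batches
hours = minutes / 60              # roughly 41.7 hours
```

A share with 50,000 indexable files would therefore take on the order of two days to become fully searchable under these assumptions.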
For files that can't be fully indexed (videos, images, archives, anything that doesn't have extractable text), ScaiDrive falls back to indexing just the filename — so a video named 2026-Q1-allhands-recording.mp4 is still findable by "all hands Q1." Files larger than 200 MB with extractable text are indexed via a head sample (first ~5 MB); results from those chunks are tagged truncated: true so callers know they're partial. See Search for the developer-facing detail.
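The three indexing outcomes described above (filename only, head sample, full content) can be summarized as a decision function. This is a sketch under assumptions: the names `IndexPlan` and `plan_indexing` are hypothetical, 1 MB is taken as 1,000,000 bytes, and the "~5 MB" head sample is treated as exactly 5 MB.

```python
from dataclasses import dataclass

MB = 1000 * 1000               # assumption: decimal megabytes
LARGE_FILE_BYTES = 200 * MB    # threshold from the text above
HEAD_SAMPLE_BYTES = 5 * MB     # "~5 MB" in the text, treated as exact here


@dataclass
class IndexPlan:
    mode: str            # "filename_only", "head_sample", or "full"
    bytes_to_read: int   # how much content would be embedded
    truncated: bool      # mirrors the truncated: true tag on results


def plan_indexing(size_bytes, has_extractable_text):
    if not has_extractable_text:
        # Videos, images, archives: only the filename is indexed.
        return IndexPlan("filename_only", 0, False)
    if size_bytes > LARGE_FILE_BYTES:
        # Large text files: only the head sample is indexed.
        return IndexPlan("head_sample", HEAD_SAMPLE_BYTES, True)
    return IndexPlan("full", size_bytes, False)


video_plan = plan_indexing(750 * MB, has_extractable_text=False)
big_log_plan = plan_indexing(300 * MB, has_extractable_text=True)
```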
Cost control
Embedding APIs charge per token. If you use a paid provider:
- Set chunk size to the upper end of the model's input window so you minimize embed-call overhead.
- Use a policy to scope semantic search to specific shares rather than everything.
- Monitor cost estimate on the provider page and set up a webhook for `compliance.budget_threshold_reached` if your provider supports budgets.
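A back-of-the-envelope calculation shows why chunk size and scope matter for cost. The price used below is a made-up placeholder, not any provider's real rate, and the helper names are hypothetical. The sketch assumes one chunk per embedding call, which overstates call counts for providers that accept batched input.

```python
import math


def estimate_cost(total_tokens, price_per_million_tokens):
    """Approximate embedding cost for a given token volume."""
    return total_tokens / 1_000_000 * price_per_million_tokens


def estimate_calls(total_tokens, tokens_per_chunk):
    """Number of embed calls, assuming one chunk per call (simplification)."""
    return math.ceil(total_tokens / tokens_per_chunk)


SHARE_TOKENS = 10_000_000     # hypothetical size of one share
PLACEHOLDER_PRICE = 0.10      # hypothetical price per million tokens

cost = estimate_cost(SHARE_TOKENS, PLACEHOLDER_PRICE)
calls_large_chunks = estimate_calls(SHARE_TOKENS, tokens_per_chunk=8000)
calls_small_chunks = estimate_calls(SHARE_TOKENS, tokens_per_chunk=500)
```

The token cost is the same either way; larger chunks reduce the number of calls, while scoping policies to fewer shares reduces the token volume itself.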
Disabling semantic search
To turn off semantic search entirely, go to System → Settings → Features and uncheck Semantic search. Keyword (BM25) search continues to work; semantic-search API endpoints return 503 while the feature is disabled.
Existing embeddings stay in Weaviate while disabled; re-enabling resumes from where you stopped without re-indexing.
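Client code that calls the semantic-search endpoints may want to degrade gracefully when it sees a 503. The sketch below takes the two search functions as parameters because the actual endpoint names and response formats are not covered on this page; `search_with_fallback` and both stub functions are hypothetical.

```python
def search_with_fallback(query, semantic_search, keyword_search):
    """Try semantic search; fall back to keyword search on HTTP 503.

    semantic_search(query) is assumed to return (status_code, results).
    keyword_search(query) is assumed to return a list of results.
    """
    status, results = semantic_search(query)
    if status == 503:
        # Semantic search is disabled; keyword (BM25) search still works.
        return "keyword", keyword_search(query)
    return "semantic", results


# Stubs standing in for real HTTP calls.
def disabled_semantic(query):
    return 503, []


def working_keyword(query):
    return [f"keyword hit for {query!r}"]


mode, hits = search_with_fallback("Q4 revenue forecast", disabled_semantic, working_keyword)
```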
What's next
- Search — using the search API.
- Compliance policies — sensitivity labels can scope policies.