Search
ScaiDrive supports three search modes: keyword (BM25 over filenames and content), semantic (vector similarity over embeddings), and hybrid (a blend of the two). A RAG context endpoint also packages results for an LLM.
Base path: /api/v1/search/
All search results respect the caller's permissions — you only see files you can read.
Keyword search
Full-text search over filenames (always) and extracted content (where available).
| curl -G $SCAIDRIVE_URL/api/v1/search \
    -H "Authorization: Bearer $SCAIDRIVE_TOKEN" \
    --data-urlencode "q=annual report 2025" \
    --data-urlencode "share_id=shr_01J3H" \
    --data-urlencode "file_type=application/pdf" \
    --data-urlencode "limit=20"
|
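| import httpx
# url and token hold your ScaiDrive base URL and API token.
resp = httpx.get(
    f"{url}/api/v1/search",
    headers={"Authorization": f"Bearer {token}"},
    params={"q": "annual report 2025", "share_id": "shr_01J3H", "limit": 20},
)
resp.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
for hit in resp.json()["results"]:
    print(hit["score"], hit["path"])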
|
| const params = new URLSearchParams({
  q: "annual report 2025",
  share_id: "shr_01J3H",
  limit: "20",
});
const resp = await fetch(`${url}/api/v1/search?${params}`, {
  headers: { Authorization: `Bearer ${token}` },
});
for (const hit of (await resp.json()).results) {
  console.log(hit.score, hit.path);
}
|
Response:
| {
  "query": "annual report 2025",
  "results": [
    {
      "id": "fil_01J3K",
      "name": "annual-report-2025.pdf",
      "type": "file",
      "share_id": "shr_01J3H",
      "folder_id": "fld_01J3I",
      "path": "/Finance/annual-report-2025.pdf",
      "mime_type": "application/pdf",
      "size": 5820193,
      "modified_at": "2026-04-20T10:00:00Z",
      "score": 8.42
    }
  ],
  "total": 1,
  "has_more": false
}
|
Parameters:
| Param | Notes |
| q | Search query |
| share_id | Scope to one share |
| file_type | MIME pattern: application/pdf, image/* |
| recursive | Default true |
| limit | 1–100, default 50 |
| offset | For pagination |
A POST form is also available for longer queries. Note that the JSON body uses query where the GET form uses q:
| curl -X POST $SCAIDRIVE_URL/api/v1/search \
    -H "Authorization: Bearer $SCAIDRIVE_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"query": "annual report 2025", "share_id": "shr_01J3H", "limit": 20}'
|
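The limit and offset parameters, together with has_more in the response, are enough to walk a large result set page by page. The sketch below shows one way to do that. The names iter_keyword_hits and fetch_page are hypothetical, and the code assumes has_more is true whenever another page exists at the next offset.

```python
from typing import Callable, Dict, Iterator


def iter_keyword_hits(
    fetch_page: Callable[[Dict[str, object]], dict],
    query: str,
    page_size: int = 50,
    max_hits: int = 500,
) -> Iterator[dict]:
    """Yield keyword-search hits page by page until has_more is false.

    fetch_page is a hypothetical callable: it sends the given query
    parameters to GET /api/v1/search and returns the decoded JSON body.
    """
    offset = 0
    yielded = 0
    while yielded < max_hits:
        page = fetch_page({"q": query, "limit": page_size, "offset": offset})
        results = page.get("results", [])
        for hit in results:
            yield hit
            yielded += 1
            if yielded >= max_hits:
                return
        # Stop on the last page, or on an empty page to avoid looping forever.
        if not page.get("has_more") or not results:
            return
        offset += len(results)
```

With httpx, fetch_page could be as small as `lambda params: httpx.get(f"{url}/api/v1/search", headers={"Authorization": f"Bearer {token}"}, params=params).json()`. The max_hits cap is a design choice to keep a broad query from pulling the whole index.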
Semantic search
Vector search over indexed chunks. Requires the tenant to have a vectorization provider configured.
| curl -G $SCAIDRIVE_URL/api/v1/search/semantic \
    -H "Authorization: Bearer $SCAIDRIVE_TOKEN" \
    --data-urlencode "q=what were our Q4 bookings last year?" \
    --data-urlencode "share_id=shr_01J3H" \
    --data-urlencode "limit=10"
|
| import httpx
# url and token hold your ScaiDrive base URL and API token.
resp = httpx.get(
    f"{url}/api/v1/search/semantic",
    headers={"Authorization": f"Bearer {token}"},
    params={"q": "what were our Q4 bookings last year?", "share_id": "shr_01J3H", "limit": 10},
)
resp.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
for hit in resp.json()["semantic_results"]:
    print(hit["score"], hit["file_name"], hit["chunk_content"][:100])
|
| const params = new URLSearchParams({
  q: "what were our Q4 bookings last year?",
  share_id: "shr_01J3H",
  limit: "10",
});
const resp = await fetch(`${url}/api/v1/search/semantic?${params}`, {
  headers: { Authorization: `Bearer ${token}` },
});
for (const hit of (await resp.json()).semantic_results) {
  console.log(hit.score, hit.file_name, hit.chunk_content.slice(0, 100));
}
|
Response:
| {
  "semantic_results": [
    {
      "file_id": "fil_01J3K",
      "file_name": "annual-report-2025.pdf",
      "share_id": "shr_01J3H",
      "path": "/Finance/annual-report-2025.pdf",
      "chunk_content": "Q4 2024 bookings totaled $42.3M, representing a 28% increase year-over-year...",
      "chunk_index": 17,
      "page": 12,
      "section": "Financial Highlights",
      "score": 0.89,
      "distance": 0.11
    }
  ]
}
|
Results are chunks, not files — the same file can produce multiple hits if different sections match. chunk_content is the matching passage (typically a paragraph); score is semantic similarity (higher is better); distance is the raw vector distance (lower is better).
POST form with richer filters:
| curl -X POST $SCAIDRIVE_URL/api/v1/search/semantic \
    -H "Authorization: Bearer $SCAIDRIVE_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
      "query": "what were our Q4 bookings?",
      "share_id": "shr_01J3H",
      "file_types": ["application/pdf", "text/markdown"],
      "path_prefix": "/Finance",
      "limit": 10
    }'
|
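Because semantic results are chunks rather than files, a file-oriented UI usually needs to collapse them. The sketch below keeps the highest-scoring chunk per file_id; best_chunk_per_file is a hypothetical helper, not part of the API, and it relies only on the file_id and score fields shown in the response above.

```python
from typing import Dict, Iterable, List


def best_chunk_per_file(hits: Iterable[dict]) -> List[dict]:
    """Collapse chunk-level hits to one entry per file, keeping the top-scoring chunk."""
    best: Dict[str, dict] = {}
    for hit in hits:
        current = best.get(hit["file_id"])
        if current is None or hit["score"] > current["score"]:
            best[hit["file_id"]] = hit
    # Highest score first, matching the "higher is better" meaning of score.
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)
```

Pass it the semantic_results array from a response. If you would rather show every matching passage under its file, group into lists instead of keeping only the best chunk.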
Hybrid search
Hybrid search blends BM25 and vector scores. The alpha parameter controls the blend: 0.0 is pure BM25, 1.0 is pure vector, and 0.7 is the default.
| curl -X POST $SCAIDRIVE_URL/api/v1/search/hybrid \
    -H "Authorization: Bearer $SCAIDRIVE_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
      "query": "Q4 bookings",
      "share_id": "shr_01J3H",
      "alpha": 0.7,
      "limit": 20
    }'
|
Use hybrid when your users' queries mix specific terms ("Q4") with conceptual phrasing ("bookings performance"). BM25 alone misses the concept; pure vector misses the exact term.
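To build intuition for alpha, the sketch below blends two score sets by linear interpolation after min-max normalization. This is an illustration only: the exact fusion formula ScaiDrive uses is not specified here, and both the normalization step and the function names min_max and blend are assumptions. It does reproduce the documented endpoints of the range (0.0 gives pure BM25 ranking, 1.0 gives pure vector ranking).

```python
from typing import Dict


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def blend(bm25: Dict[str, float], vector: Dict[str, float], alpha: float = 0.7) -> Dict[str, float]:
    """Illustrative blend: alpha * vector + (1 - alpha) * BM25 (assumed formula)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0.0 and 1.0")
    b, v = min_max(bm25), min_max(vector)
    # A document missing from one result set contributes 0 for that side.
    return {
        doc: alpha * v.get(doc, 0.0) + (1.0 - alpha) * b.get(doc, 0.0)
        for doc in set(b) | set(v)
    }
```

Normalizing first matters because raw BM25 scores (such as 8.42) and similarity scores (such as 0.89) live on different scales, so an unnormalized weighted sum would be dominated by BM25.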
RAG context
For LLM workflows, /api/v1/search/context returns search results already formatted as a context string with citations, plus a token estimate.
| curl -X POST $SCAIDRIVE_URL/api/v1/search/context \
    -H "Authorization: Bearer $SCAIDRIVE_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
      "query": "what were our Q4 bookings last year?",
      "share_id": "shr_01J3H",
      "max_tokens": 2000,
      "max_chunks": 10
    }'
|
Response:
| {
  "context": "[1] From annual-report-2025.pdf, page 12: Q4 2024 bookings totaled $42.3M...\n\n[2] From q4-narrative.docx: Strong enterprise momentum drove...",
  "chunks": [
    {"content": "Q4 2024 bookings...", "file_id": "fil_01J3K", "file_name": "annual-report-2025.pdf", "path": "/Finance/annual-report-2025.pdf", "page": 12, "section": "Financial Highlights", "score": 0.89}
  ],
  "estimated_tokens": 1234
}
|
Pass the context string into your LLM prompt. The [N] markers and the chunks array let your UI link "source 1" back to a specific file.
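A minimal sketch of that flow follows. build_prompt and cited_sources are hypothetical helpers, the prompt wording is only an example, and the code assumes that marker [N] in the context string corresponds to chunks[N - 1], which the response format suggests but this page does not state outright.

```python
import re
from typing import Dict


def build_prompt(question: str, context_response: dict) -> str:
    """Wrap the returned context string in a simple citation-aware prompt."""
    return (
        "Answer the question using only the numbered sources below. "
        "Cite sources as [N].\n\n"
        f"Sources:\n{context_response['context']}\n\n"
        f"Question: {question}"
    )


def cited_sources(answer: str, context_response: dict) -> Dict[int, dict]:
    """Map each [N] marker found in the model's answer to a chunk.

    Assumption: marker [N] refers to chunks[N - 1]. Markers without a
    matching chunk are skipped rather than raising.
    """
    chunks = context_response.get("chunks", [])
    sources: Dict[int, dict] = {}
    for match in re.finditer(r"\[(\d+)\]", answer):
        n = int(match.group(1))
        if 1 <= n <= len(chunks):
            sources[n] = chunks[n - 1]
    return sources
```

A UI can then render each entry of cited_sources as a link using the chunk's file_id, path, and page. Skipping unmatched markers guards against a model inventing a citation number that was never in the context.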
Checking index status
To check whether a file has been indexed:
| curl -H "Authorization: Bearer $SCAIDRIVE_TOKEN" \
    $SCAIDRIVE_URL/api/v1/search/index-status/fil_01J3K
|
| {
  "is_indexed": true,
  "chunk_count": 42,
  "last_indexed": "2026-04-20T10:30:00Z",
  "error": null
}
|
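If your application uploads a file and then needs to search it semantically, it can poll this endpoint until is_indexed turns true. The sketch below is one way to do that. wait_until_indexed and get_status are hypothetical names, the timeout and interval defaults are arbitrary, and treating a non-null error as a terminal failure is an assumption.

```python
import time
from typing import Callable


def wait_until_indexed(
    get_status: Callable[[], dict],
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll the index-status endpoint until the file is indexed.

    get_status is a hypothetical callable that returns the decoded JSON
    from GET /api/v1/search/index-status/{file_id}.
    """
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status.get("error"):
            # Assumption: a non-null error means indexing will not succeed.
            raise RuntimeError(f"indexing failed: {status['error']}")
        if status.get("is_indexed"):
            return status
        if clock() >= deadline:
            raise TimeoutError("file was not indexed before the timeout")
        sleep(interval)
```

The sleep and clock parameters exist only so the loop can be tested without real waiting; callers would normally leave them at their defaults.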
Tenant-wide statistics:
| curl -H "Authorization: Bearer $SCAIDRIVE_TOKEN" \
    $SCAIDRIVE_URL/api/v1/search/stats
|
Queue status:
| curl -H "Authorization: Bearer $SCAIDRIVE_TOKEN" \
    "$SCAIDRIVE_URL/api/v1/search/queue?limit=20"
|
Configuring vectorization
Semantic and hybrid search require a vectorization policy to tell ScaiDrive what to index, how to chunk it, and which embedding model to use. Policies are tenant-level admin objects.
Minimal setup:
- Configure an embedding provider (OpenAI, Cohere, Bedrock, Hugging Face, or a local model).
- Create a policy scoping what to index.
- ScaiDrive then indexes existing files in the background and indexes new uploads automatically.
See Enterprise Reference for the policy API.
Permissions
Search only returns files the caller can read. A file you lack permission on never appears, even if it matches semantically. Filtering applies at chunk granularity: even when a share contains information relevant to your query, chunks from files you can't read are excluded.
Health check
Before your application depends on semantic search, verify the vectorization stack is up:
| curl $SCAIDRIVE_URL/api/v1/search/health
|
| {
  "weaviate_connected": true,
  "embedding_service_available": true,
  "status": "healthy",
  "provider_name": "openai"
}
|
When unhealthy, semantic endpoints return 503 SERVICE_UNAVAILABLE while keyword search continues to work.
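Since keyword search keeps working when the vectorization stack is down, a client can degrade gracefully instead of failing. The sketch below picks an endpoint from the health response. pick_search_endpoint is a hypothetical helper, and requiring all three fields to look healthy is a conservative assumption rather than documented behavior.

```python
def pick_search_endpoint(health: dict) -> str:
    """Return the semantic endpoint when the stack is healthy, else keyword search."""
    semantic_ready = (
        health.get("status") == "healthy"
        and health.get("weaviate_connected") is True
        and health.get("embedding_service_available") is True
    )
    return "/api/v1/search/semantic" if semantic_ready else "/api/v1/search"
```

Remember that the two endpoints return different shapes (semantic_results with chunk fields versus results with file fields), so the calling code needs a branch for each. Handling a 503 on the semantic call the same way covers the case where the stack goes down between the health check and the query.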
What's next