Search

ScaiDrive offers three search modes: keyword (BM25 over filenames and content), semantic (vector similarity over embeddings), and hybrid (a blend of the two). A RAG context endpoint also packages results for an LLM.

Base path: /api/v1/search/

All search results respect the caller's permissions — you only see files you can read.

Keyword search

Full-text search over filenames (always) and extracted content (where available).

bash
curl -G $SCAIDRIVE_URL/api/v1/search \
  -H "Authorization: Bearer $SCAIDRIVE_TOKEN" \
  --data-urlencode "q=annual report 2025" \
  --data-urlencode "share_id=shr_01J3H" \
  --data-urlencode "file_type=application/pdf" \
  --data-urlencode "limit=20"

python
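import httpx
# url and token hold the ScaiDrive base URL and a bearer token.
resp = httpx.get(
    f"{url}/api/v1/search",
    headers={"Authorization": f"Bearer {token}"},
    params={"q": "annual report 2025", "share_id": "shr_01J3H", "limit": 20},
)
resp.raise_for_status()  # surface 4xx/5xx instead of failing on a missing key
for hit in resp.json()["results"]:
    print(hit["score"], hit["path"])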

typescript
const params = new URLSearchParams({
  q: "annual report 2025",
  share_id: "shr_01J3H",
  limit: "20",
});
const resp = await fetch(`${url}/api/v1/search?${params}`, {
  headers: { Authorization: `Bearer ${token}` },
});
for (const hit of (await resp.json()).results) {
  console.log(hit.score, hit.path);
}

Response:

json
{
  "query": "annual report 2025",
  "results": [
    {
      "id": "fil_01J3K",
      "name": "annual-report-2025.pdf",
      "type": "file",
      "share_id": "shr_01J3H",
      "folder_id": "fld_01J3I",
      "path": "/Finance/annual-report-2025.pdf",
      "mime_type": "application/pdf",
      "size": 5820193,
      "modified_at": "2026-04-20T10:00:00Z",
      "score": 8.42
    }
  ],
  "total": 1,
  "has_more": false
}

Parameters:

Param	Notes
`q`	Search query
`share_id`	Scope to one share
`file_type`	MIME pattern: `application/pdf`, `image/*`
`recursive`	Default true
`limit`	1–100, default 50
`offset`	For pagination
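To illustrate how limit, offset, and has_more fit together, here is a minimal pagination sketch. It assumes offset counts results to skip and that has_more turns false on the last page; neither is spelled out above. The fetch_page callable and the fake_fetch stand-in are hypothetical names used so the example runs without a server.

```python
from typing import Callable, Iterator


def iter_search_results(
    fetch_page: Callable[[int, int], dict],
    limit: int = 50,
) -> Iterator[dict]:
    """Yield every hit by walking offset/limit pages until has_more is false."""
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        results = page.get("results", [])
        yield from results
        # Stop when the server reports no further pages, or when a page comes
        # back empty (guards against looping forever on a bad has_more flag).
        if not page.get("has_more") or not results:
            break
        offset += len(results)


# Hypothetical stand-in for the HTTP call, so the sketch runs offline.
def fake_fetch(offset: int, limit: int) -> dict:
    data = [{"path": f"/doc-{i}.pdf", "score": 10 - i} for i in range(5)]
    chunk = data[offset : offset + limit]
    return {"results": chunk, "total": len(data), "has_more": offset + limit < len(data)}


paths = [hit["path"] for hit in iter_search_results(fake_fetch, limit=2)]
print(paths)
```

In a real client, fetch_page would issue the GET request shown earlier with the given offset and limit and return the parsed JSON body.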

A POST form is also available for longer queries. Note that the JSON body uses query where the GET form uses q:

bash
curl -X POST $SCAIDRIVE_URL/api/v1/search \
  -H "Authorization: Bearer $SCAIDRIVE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"query": "annual report 2025", "share_id": "shr_01J3H", "limit": 20}'

Semantic search

Vector search over indexed chunks. Requires the tenant to have a vectorization provider configured.

bash
curl -G $SCAIDRIVE_URL/api/v1/search/semantic \
  -H "Authorization: Bearer $SCAIDRIVE_TOKEN" \
  --data-urlencode "q=what were our Q4 bookings last year?" \
  --data-urlencode "share_id=shr_01J3H" \
  --data-urlencode "limit=10"

python
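import httpx
# url and token hold the ScaiDrive base URL and a bearer token.
resp = httpx.get(
    f"{url}/api/v1/search/semantic",
    headers={"Authorization": f"Bearer {token}"},
    params={"q": "what were our Q4 bookings last year?", "share_id": "shr_01J3H", "limit": 10},
)
resp.raise_for_status()  # semantic endpoints can return 503 when vectorization is down
for hit in resp.json()["semantic_results"]:
    print(hit["score"], hit["file_name"], hit["chunk_content"][:100])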

typescript
const params = new URLSearchParams({
  q: "what were our Q4 bookings last year?",
  share_id: "shr_01J3H",
  limit: "10",
});
const resp = await fetch(`${url}/api/v1/search/semantic?${params}`, {
  headers: { Authorization: `Bearer ${token}` },
});
for (const hit of (await resp.json()).semantic_results) {
  console.log(hit.score, hit.file_name, hit.chunk_content.slice(0, 100));
}

Response:

json
{
  "semantic_results": [
    {
      "file_id": "fil_01J3K",
      "file_name": "annual-report-2025.pdf",
      "share_id": "shr_01J3H",
      "path": "/Finance/annual-report-2025.pdf",
      "chunk_content": "Q4 2024 bookings totaled $42.3M, representing a 28% increase year-over-year...",
      "chunk_index": 17,
      "page": 12,
      "section": "Financial Highlights",
      "score": 0.89,
      "distance": 0.11
    }
  ]
}

Results are chunks, not files — the same file can produce multiple hits if different sections match. chunk_content is the matching passage (typically a paragraph); score is semantic similarity (higher is better); distance is the raw vector distance (lower is better).
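Because hits arrive per chunk, a file-oriented UI may want to collapse them. The sketch below shows one possible way to keep the best-scoring chunk per file. The helper name best_hit_per_file and the shape of its output are illustrative assumptions, not part of the API; the input fields are the ones shown in the response above.

```python
from collections import defaultdict


def best_hit_per_file(semantic_results: list[dict]) -> list[dict]:
    """Collapse chunk-level hits to one entry per file, keeping the top chunk."""
    by_file: dict[str, list[dict]] = defaultdict(list)
    for hit in semantic_results:
        by_file[hit["file_id"]].append(hit)

    collapsed = []
    for file_id, hits in by_file.items():
        top = max(hits, key=lambda h: h["score"])
        collapsed.append({
            "file_id": file_id,
            "file_name": top["file_name"],
            "best_score": top["score"],
            "best_chunk": top["chunk_content"],
            "matching_chunks": len(hits),
        })
    # Highest-scoring file first.
    return sorted(collapsed, key=lambda f: f["best_score"], reverse=True)


# Made-up sample data in the shape of semantic_results.
sample = [
    {"file_id": "fil_a", "file_name": "a.pdf", "chunk_content": "alpha", "score": 0.71},
    {"file_id": "fil_b", "file_name": "b.pdf", "chunk_content": "beta", "score": 0.80},
    {"file_id": "fil_a", "file_name": "a.pdf", "chunk_content": "gamma", "score": 0.89},
]
files = best_hit_per_file(sample)
for f in files:
    print(f["best_score"], f["file_name"], f["matching_chunks"])
```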

The POST form accepts richer filters, such as file_types and path_prefix:

bash
curl -X POST $SCAIDRIVE_URL/api/v1/search/semantic \
  -H "Authorization: Bearer $SCAIDRIVE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "what were our Q4 bookings?",
    "share_id": "shr_01J3H",
    "file_types": ["application/pdf", "text/markdown"],
    "path_prefix": "/Finance",
    "limit": 10
  }'

Hybrid search

Hybrid search blends BM25 and vector scores. The alpha parameter controls the blend: 0.0 is pure BM25, 1.0 is pure vector, and the default is 0.7.

bash
curl -X POST $SCAIDRIVE_URL/api/v1/search/hybrid \
  -H "Authorization: Bearer $SCAIDRIVE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Q4 bookings",
    "share_id": "shr_01J3H",
    "alpha": 0.7,
    "limit": 20
  }'

Use hybrid when your users' queries mix specific terms ("Q4") with conceptual phrasing ("bookings performance"). BM25 alone misses the concept; pure vector misses the exact term.
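To build intuition for alpha, the following sketch blends two score lists client-side. This page does not say how ScaiDrive normalizes or combines scores, so the min-max normalization and the linear formula alpha * vector + (1 - alpha) * bm25 are assumptions chosen for illustration. The file names and scores are made up.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; a constant list maps to 1.0 for every item."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def blend(bm25: dict[str, float], vector: dict[str, float], alpha: float = 0.7) -> dict[str, float]:
    """Combine scores as alpha * vector + (1 - alpha) * bm25 after normalizing."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    b, v = min_max(bm25), min_max(vector)
    # A document missing from one list contributes 0 from that side.
    return {
        doc: alpha * v.get(doc, 0.0) + (1 - alpha) * b.get(doc, 0.0)
        for doc in set(b) | set(v)
    }


bm25_scores = {"q4-table.xlsx": 9.0, "annual-report.pdf": 6.0, "memo.docx": 3.0}
vector_scores = {"annual-report.pdf": 0.90, "memo.docx": 0.80, "q4-table.xlsx": 0.40}

for a in (0.0, 0.7, 1.0):
    ranked = sorted(blend(bm25_scores, vector_scores, a).items(), key=lambda kv: -kv[1])
    print(a, [doc for doc, _ in ranked])
```

In this toy data the exact-term match (q4-table.xlsx) ranks first at alpha 0.0 and last at alpha 1.0, which is the trade-off the parameter is meant to expose.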

RAG context

For LLM workflows, /api/v1/search/context returns search results already formatted as a context string with citations, plus a token estimate.

bash
curl -X POST $SCAIDRIVE_URL/api/v1/search/context \
  -H "Authorization: Bearer $SCAIDRIVE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "what were our Q4 bookings last year?",
    "share_id": "shr_01J3H",
    "max_tokens": 2000,
    "max_chunks": 10
  }'

Response:

json
{
  "context": "[1] From annual-report-2025.pdf, page 12: Q4 2024 bookings totaled $42.3M...\n\n[2] From q4-narrative.docx: Strong enterprise momentum drove...",
  "chunks": [
    {"content": "Q4 2024 bookings...", "file_id": "fil_01J3K", "file_name": "annual-report-2025.pdf", "path": "/Finance/annual-report-2025.pdf", "page": 12, "section": "Financial Highlights", "score": 0.89}
  ],
  "estimated_tokens": 1234
}

Pass the context string into your LLM prompt. The [N] markers and the chunks array let your UI link "source 1" back to a specific file.
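As a rough sketch of that flow, the code below wraps the context string in a prompt and then maps [N] markers in a model's answer back to chunks. It assumes marker [N] corresponds to chunks[N - 1], which the response above suggests but does not guarantee. The prompt wording and both helper names are illustrative only.

```python
import re


def build_prompt(question: str, context_response: dict) -> str:
    """Wrap the pre-formatted context string in a simple grounded-answer prompt."""
    return (
        "Answer the question using only the numbered sources below. "
        "Cite sources as [N].\n\n"
        f"Sources:\n{context_response['context']}\n\n"
        f"Question: {question}"
    )


def resolve_citations(answer: str, context_response: dict) -> dict[int, dict]:
    """Map each [N] cited in an LLM answer to its chunk.

    Assumes marker [N] refers to chunks[N - 1]; markers with no matching
    chunk are skipped.
    """
    chunks = context_response["chunks"]
    cited = {}
    for match in re.finditer(r"\[(\d+)\]", answer):
        n = int(match.group(1))
        if 1 <= n <= len(chunks):
            cited[n] = chunks[n - 1]
    return cited


# Trimmed copy of the example response above.
response = {
    "context": "[1] From annual-report-2025.pdf, page 12: Q4 2024 bookings totaled $42.3M...",
    "chunks": [{"file_id": "fil_01J3K", "file_name": "annual-report-2025.pdf", "page": 12}],
    "estimated_tokens": 1234,
}
prompt = build_prompt("What were our Q4 bookings last year?", response)
# The answer string stands in for LLM output; [7] has no matching chunk.
sources = resolve_citations("Bookings were $42.3M [1], up 28% [7].", response)
print(sources[1]["file_name"], sources[1]["page"])
```

Skipping unmatched markers keeps a hallucinated citation such as [7] from turning into a broken link in the UI.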

Checking index status

To check whether a file has been indexed:

bash
curl -H "Authorization: Bearer $SCAIDRIVE_TOKEN" \
     $SCAIDRIVE_URL/api/v1/search/index-status/fil_01J3K

json
{
  "is_indexed": true,
  "chunk_count": 42,
  "last_indexed": "2026-04-20T10:30:00Z",
  "error": null
}
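After an upload, a client may want to wait for indexing before running a semantic query. The polling sketch below assumes that a file not yet indexed reports is_indexed as false and that a non-null error means indexing failed; this page does not state either. The get_status callable is a hypothetical wrapper around the index-status request, and the timeout and interval defaults are arbitrary.

```python
import time
from typing import Callable


def wait_until_indexed(
    get_status: Callable[[], dict],
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll an index-status callable until is_indexed is true or an error appears."""
    waited = 0.0
    while True:
        status = get_status()
        if status.get("error"):
            raise RuntimeError(f"indexing failed: {status['error']}")
        if status.get("is_indexed"):
            return status
        if waited >= timeout:
            raise TimeoutError(f"file not indexed after {waited:.0f}s")
        sleep(interval)
        waited += interval


# Offline stand-in: reports "not indexed" twice, then success.
replies = iter([
    {"is_indexed": False, "chunk_count": 0, "last_indexed": None, "error": None},
    {"is_indexed": False, "chunk_count": 0, "last_indexed": None, "error": None},
    {"is_indexed": True, "chunk_count": 42, "last_indexed": "2026-04-20T10:30:00Z", "error": None},
])
naps = []  # records requested sleeps instead of actually sleeping
final = wait_until_indexed(lambda: next(replies), sleep=naps.append)
print(final["chunk_count"], len(naps))
```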

Tenant-wide statistics:

bash
curl -H "Authorization: Bearer $SCAIDRIVE_TOKEN" \
     $SCAIDRIVE_URL/api/v1/search/stats

Queue status:

bash
curl -H "Authorization: Bearer $SCAIDRIVE_TOKEN" \
     "$SCAIDRIVE_URL/api/v1/search/queue?limit=20"

Configuring vectorization

Semantic and hybrid search require a vectorization policy to tell ScaiDrive what to index, how to chunk it, and which embedding model to use. Policies are tenant-level admin objects.

Minimal setup:

1. Configure an embedding provider (OpenAI, Cohere, Bedrock, Hugging Face, or a local model).
2. Create a policy scoping what to index.
3. ScaiDrive then indexes existing files in the background and indexes new uploads automatically.

See Enterprise Reference for the policy API.

Permissions

Search only returns files the caller can read. If you don't have permission on a file, it doesn't appear, even if it semantically matches. Filtering works at chunk granularity: when a share contains content relevant to your query, chunks from files in that share that you can't read are still excluded.

Health check

Before your application depends on semantic search, verify the vectorization stack is up:

bash
curl $SCAIDRIVE_URL/api/v1/search/health

json
{
  "weaviate_connected": true,
  "embedding_service_available": true,
  "status": "healthy",
  "provider_name": "openai"
}

When unhealthy, semantic endpoints return 503 SERVICE_UNAVAILABLE while keyword search continues to work.
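Since keyword search keeps working during an outage, an application can degrade gracefully. The sketch below shows one possible fallback pattern. The SemanticUnavailable exception and the two callables are hypothetical; in a real client the semantic callable would raise it on seeing a 503 status.

```python
from typing import Callable


class SemanticUnavailable(Exception):
    """Raised by the caller-supplied semantic function on a 503 response."""


def search_with_fallback(
    query: str,
    semantic: Callable[[str], list[dict]],
    keyword: Callable[[str], list[dict]],
) -> tuple[str, list[dict]]:
    """Try semantic search first; fall back to keyword search if it is down."""
    try:
        return "semantic", semantic(query)
    except SemanticUnavailable:
        return "keyword", keyword(query)


# Offline stand-ins for the two HTTP calls.
def semantic_down(query: str) -> list[dict]:
    raise SemanticUnavailable("503 SERVICE_UNAVAILABLE")


def keyword_ok(query: str) -> list[dict]:
    return [{"path": "/Finance/annual-report-2025.pdf", "score": 8.42}]


mode, hits = search_with_fallback("Q4 bookings", semantic_down, keyword_ok)
print(mode, hits[0]["path"])
```

Returning the mode alongside the hits lets the UI tell users that results came from keyword matching only.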

What's next

Search Reference — all endpoints.
Enterprise Reference — vectorization policy management.
MCP Server — expose search to Claude.