Crawl a docs site on a schedule
Most knowledge bases need to stay in sync with a website — your product docs, an internal wiki, a partner's release notes. A crawl config tells ScaiMatrix where to start, how far to walk, and when to run.
You can also expose a webhook URL so a CI pipeline or a CMS can fire a crawl when content changes.
1. Create a target collection
If you don't already have one:
| curl -X POST "$SCAIGRID_HOST/v1/modules/scaimatrix/collections" \
-H "Authorization: Bearer $SCAIGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Acme Docs (live)",
"embedding_model": "openai/text-embedding-3-small",
"chunking_strategy": "markdown",
"default_access": "tenant"
}'
|
markdown chunking is a good default for crawled HTML — the crawler converts pages to Markdown before chunking so headings become semantic boundaries.
2. Define the crawl config
| curl -X POST "$SCAIGRID_HOST/v1/modules/scaimatrix/collections/$COLLECTION_ID/crawls" \
-H "Authorization: Bearer $SCAIGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Nightly docs sync",
"seed_url": "https://docs.acme.example",
"max_depth": 3,
"max_pages": 500,
"max_total_bytes": 52428800,
"follow_external": false,
"schedule": {
"type": "daily",
"time": "03:00"
},
"webhook": {"enabled": true}
}'
|
| import httpx

# Assumes HOST, COLLECTION_ID, and H (the auth headers dict) are already defined.
cfg = httpx.post(
    f"{HOST}/v1/modules/scaimatrix/collections/{COLLECTION_ID}/crawls",
    headers=H,
    json={
        "name": "Nightly docs sync",
        "seed_url": "https://docs.acme.example",
        "max_depth": 3,
        "max_pages": 500,
        "max_total_bytes": 50 * 1024 * 1024,
        "follow_external": False,
        "schedule": {"type": "daily", "time": "03:00"},
        "webhook": {"enabled": True},
    },
).json()["data"]
# The webhook secret is only returned in this response, so store it now.
print(cfg["id"], cfg["webhook_secret"])
|
| const r = await fetch(
  `${HOST}/v1/modules/scaimatrix/collections/${COLLECTION_ID}/crawls`,
  {
    method: "POST",
    headers: H,
    body: JSON.stringify({
      name: "Nightly docs sync",
      seed_url: "https://docs.acme.example",
      max_depth: 3,
      max_pages: 500,
      max_total_bytes: 50 * 1024 * 1024,
      follow_external: false,
      schedule: { type: "daily", time: "03:00" },
      webhook: { enabled: true },
    }),
  },
);
const { data: cfg } = await r.json();
|
Schedule types: hourly (with interval_hours), daily (with time as HH:MM), weekly (with day_of_week 0-6), or custom (with cron). All times are UTC.
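As a rough illustration of the four schedule shapes, the sketch below builds each one as a Python dict and checks the field it requires before you send it. The helper name make_schedule and the cron expression are hypothetical. The field names come from the list above. Whether the server applies the same checks, or accepts extra fields such as a time on weekly schedules, is not stated here.

```python
import re

# Field each schedule type requires, per the list above.
REQUIRED_FIELD = {
    "hourly": "interval_hours",
    "daily": "time",
    "weekly": "day_of_week",
    "custom": "cron",
}


def make_schedule(kind, **fields):
    """Build a schedule dict for a crawl config (hypothetical client-side helper)."""
    if kind not in REQUIRED_FIELD:
        raise ValueError(f"unknown schedule type: {kind}")
    key = REQUIRED_FIELD[kind]
    if key not in fields:
        raise ValueError(f"{kind} schedule needs {key}")
    # Client-side sanity checks only; the API may validate differently.
    if kind == "daily" and not re.fullmatch(r"([01]\d|2[0-3]):[0-5]\d", fields["time"]):
        raise ValueError("time must be HH:MM (UTC)")
    if kind == "weekly" and fields["day_of_week"] not in range(7):
        raise ValueError("day_of_week must be 0-6")
    return {"type": kind, **fields}


examples = [
    make_schedule("hourly", interval_hours=6),
    make_schedule("daily", time="03:00"),
    make_schedule("weekly", day_of_week=1),
    make_schedule("custom", cron="0 */4 * * *"),  # illustrative cron string
]
for schedule in examples:
    print(schedule)
```

Any of these dicts can be passed as the "schedule" value in the crawl config body from step 2.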
If you enable the webhook, the response includes a webhook_secret. Copy it now; it is returned only once.
3. Run it once to verify
| curl -X POST "$SCAIGRID_HOST/v1/modules/scaimatrix/collections/$COLLECTION_ID/crawls/$CONFIG_ID/run" \
-H "Authorization: Bearer $SCAIGRID_API_KEY"
|
You get back a CrawlJobRead with status: pending. The job usually transitions to running within a few seconds.
4. Watch the live stream
| curl -N "$SCAIGRID_HOST/v1/modules/scaimatrix/collections/$COLLECTION_ID/crawl/$JOB_ID/stream" \
-H "Authorization: Bearer $SCAIGRID_API_KEY"
|
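| import httpx

# Assumes HOST, COLLECTION_ID, H, and job_id are already defined.
with httpx.stream(
    "GET",
    f"{HOST}/v1/modules/scaimatrix/collections/{COLLECTION_ID}/crawl/{job_id}/stream",
    headers=H,
    timeout=600,
) as r:
    for line in r.iter_lines():
        # SSE payload lines start with "data: "; strip the prefix.
        if line.startswith("data: "):
            print(line[6:])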
|
| const r = await fetch(
  `${HOST}/v1/modules/scaimatrix/collections/${COLLECTION_ID}/crawl/${jobId}/stream`,
  { headers: H },
);
const reader = r.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters intact across chunk boundaries.
  console.log(decoder.decode(value, { stream: true }));
}
|
You'll get a progress event every two seconds with pages_crawled, pages_failed, and total_bytes_fetched, followed by a final done event when the job reaches a terminal status.
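If you want structured events rather than raw lines, a small parser along these lines may help. It is a sketch that assumes the stream follows the standard server-sent events layout (an optional event: line, one or more data: lines carrying JSON, and a blank line between events). The sample payloads, including the "completed" status value, are made up for illustration. Only the three counter names come from the description above.

```python
import json


def parse_sse(lines):
    """Yield (event_name, payload) pairs from an iterable of SSE text lines."""
    event, data = "message", []
    for line in lines:
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip())
    # Flush a trailing event if the stream ended without a blank line.
    if data:
        yield event, json.loads("\n".join(data))


# Hypothetical sample of what the stream might look like.
sample = [
    "event: progress",
    'data: {"pages_crawled": 12, "pages_failed": 0, "total_bytes_fetched": 48211}',
    "",
    "event: done",
    'data: {"status": "completed"}',
    "",
]

for name, payload in parse_sse(sample):
    if name == "done":
        print("finished:", payload)
        break
    print("progress:", payload.get("pages_crawled"), "pages")
```

In the httpx example above, passing r.iter_lines() to parse_sse in place of sample should work the same way, provided the server names its events in an event: line. If the event name lives inside the JSON payload instead, read it from the payload.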
5. List the documents the crawl produced
| curl "$SCAIGRID_HOST/v1/modules/scaimatrix/crawl-jobs/$JOB_ID/documents?limit=50" \
-H "Authorization: Bearer $SCAIGRID_API_KEY"
|
Each row is a Document with source_type: "crawl". Each document then goes through the normal ingestion pipeline (chunking, embedding, optional graph extraction).
6. Wire up a webhook trigger from your CI
If you enabled the webhook in step 2, your CI can fire a crawl when docs change. Sign the request with HMAC-SHA256 over the body and a timestamp:
| import hashlib, hmac, time, httpx, json

# Assumes HOST, COLLECTION_ID, and cfg (the config created in step 2) are defined.
WEBHOOK_SECRET = "<webhook_secret from step 2>"
payload = json.dumps({"reason": "docs-deploy", "commit": "abcd1234"}).encode()
ts = str(int(time.time()))
# The signed message is "<timestamp>.<raw body>".
signature = hmac.new(
    WEBHOOK_SECRET.encode(),
    ts.encode() + b"." + payload,
    hashlib.sha256,
).hexdigest()
r = httpx.post(
    f"{HOST}/v1/modules/scaimatrix/collections/{COLLECTION_ID}/crawls/{cfg['id']}/trigger",
    headers={
        "X-Crawl-Signature": signature,
        "X-Crawl-Timestamp": ts,
        "Content-Type": "application/json",
    },
    # Send the exact bytes that were signed.
    content=payload,
)
print(r.status_code, r.json())
|
The endpoint takes no JWT; HMAC verification is the only auth, so you can call it from any process holding the secret. If the signature is missing or malformed, or the timestamp is older than the verifier's allowed clock skew, the call returns 401. If a crawl is already running for the collection, you get 409.
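To see why a stale or tampered request is rejected, it can help to look at what a verifier for this scheme might do. The sketch below is not ScaiMatrix's implementation: the function names and the 300-second tolerance are assumptions. Only the signed message layout (timestamp, a dot, then the raw body) comes from the signing example above.

```python
import hashlib
import hmac
import time

MAX_SKEW_SECONDS = 300  # assumed tolerance; the real value is not documented here


def sign(secret: str, ts: str, body: bytes) -> str:
    """HMAC-SHA256 over "<timestamp>.<body>", hex-encoded, as in the client example."""
    return hmac.new(secret.encode(), ts.encode() + b"." + body, hashlib.sha256).hexdigest()


def verify(secret: str, ts: str, body: bytes, signature: str, now=None) -> bool:
    """Return True only if the timestamp is fresh and the signature matches."""
    now = time.time() if now is None else now
    try:
        sent_at = int(ts)
    except (TypeError, ValueError):
        return False  # malformed timestamp
    if abs(now - sent_at) > MAX_SKEW_SECONDS:
        return False  # too old (or too far in the future)
    expected = sign(secret, ts, body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), (signature or "").encode())


secret = "example-secret"
body = b'{"reason": "docs-deploy"}'
ts = "1700000000"
sig = sign(secret, ts, body)

print(verify(secret, ts, body, sig, now=1700000010))         # fresh and untampered
print(verify(secret, ts, body, sig, now=1700000900))         # outside the assumed window
print(verify(secret, ts, body + b" ", sig, now=1700000010))  # body changed after signing
```

Because the timestamp is part of the signed message, a captured request cannot be replayed with a newer timestamp. Because the raw body is signed, the request must carry exactly the bytes that were hashed.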
7. Edit or pause the config
| # Pause without deleting
curl -X PUT "$SCAIGRID_HOST/v1/modules/scaimatrix/collections/$COLLECTION_ID/crawls/$CONFIG_ID" \
-H "Authorization: Bearer $SCAIGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{"enabled": false}'
# Tighten the crawl scope
curl -X PUT "$SCAIGRID_HOST/v1/modules/scaimatrix/collections/$COLLECTION_ID/crawls/$CONFIG_ID" \
-H "Authorization: Bearer $SCAIGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{"max_pages": 250, "max_depth": 2}'
|
enabled: false stops the scheduler but keeps the webhook live. Pass "clear_schedule": true to drop the schedule entirely, or "clear_webhook": true to disable the webhook.
8. Inspect run history
| curl "$SCAIGRID_HOST/v1/modules/scaimatrix/collections/$COLLECTION_ID/crawls/$CONFIG_ID/jobs?limit=20" \
-H "Authorization: Bearer $SCAIGRID_API_KEY"
|
Every run — scheduled, manual, or webhook — produces a job row tagged with the config id. The Crawl Manager admin page surfaces the same data as a timeline.
Tips
- Set max_total_bytes even when max_pages is the headline limit: a single huge media file can blow the budget faster than you'd expect.
- follow_external: false is the safe default. Turn it on only when you control every linked-out domain.
- Re-running a crawl is additive: existing documents are updated in place when their URL is rediscovered; new pages get new document rows.
- To rotate a webhook secret, send a PUT with clear_webhook: true, then another PUT with webhook: {enabled: true}. The old secret is invalidated and a new one is returned.
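To make the first two tips concrete, here is a toy model of how the limits might interact during a breadth-first walk. It runs over an in-memory site map rather than real HTTP, and the function name plan_crawl, the depth convention (the seed is depth 0), and the choice to stop once the byte budget would be exceeded are all assumptions. The actual crawler may order or enforce things differently.

```python
from collections import deque
from urllib.parse import urlparse


def plan_crawl(seed_url, pages, max_depth, max_pages, max_total_bytes, follow_external=False):
    """Breadth-first walk honouring the crawl limits.

    `pages` maps URL -> (size_in_bytes, [linked URLs]) and stands in for real fetches.
    """
    seed_host = urlparse(seed_url).netloc
    queue = deque([(seed_url, 0)])
    seen = {seed_url}
    fetched, total_bytes = [], 0
    while queue and len(fetched) < max_pages:
        url, depth = queue.popleft()
        if url not in pages:
            continue  # unknown URL in this toy site map
        size, links = pages[url]
        if total_bytes + size > max_total_bytes:
            break  # byte budget would be exceeded
        fetched.append(url)
        total_bytes += size
        if depth >= max_depth:
            continue  # do not follow links past the depth limit
        for link in links:
            if link in seen:
                continue
            if not follow_external and urlparse(link).netloc != seed_host:
                continue  # stay on the seed's host
            seen.add(link)
            queue.append((link, depth + 1))
    return fetched, total_bytes


site = {
    "https://docs.acme.example/": (1000, ["https://docs.acme.example/a", "https://other.example/x"]),
    "https://docs.acme.example/a": (2000, ["https://docs.acme.example/video"]),
    "https://docs.acme.example/video": (50_000_000, []),
    "https://other.example/x": (500, []),
}

fetched, used = plan_crawl(
    "https://docs.acme.example/", site,
    max_depth=3, max_pages=500, max_total_bytes=1_000_000,
)
print(fetched, used)
```

In this run the page limit is never approached: the walk ends after two pages because the third is a 50 MB file that exceeds the byte budget, and the external link is never queued.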