Crawl a docs site on a schedule

Most knowledge bases need to stay in sync with a website — your product docs, an internal wiki, a partner's release notes. A crawl config tells ScaiMatrix where to start, how far to walk, and when to run.

You can also expose a webhook URL so a CI pipeline or a CMS can fire a crawl when content changes.

1. Create a target collection

If you don't already have one:

bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimatrix/collections" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Acme Docs (live)",
    "embedding_model": "openai/text-embedding-3-small",
    "chunking_strategy": "markdown",
    "default_access": "tenant"
  }'

The markdown chunking strategy is a good default for crawled HTML: the crawler converts pages to Markdown before chunking, so headings become semantic boundaries.
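To see why headings make useful boundaries, here is a minimal sketch of heading-based splitting. It is not ScaiMatrix's chunker; the function name `split_on_headings` and the sample text are hypothetical, and the sketch ignores edge cases such as `#` lines inside code blocks.

```python
import re

# Matches ATX-style Markdown headings such as "# Title" or "### Sub-section".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_on_headings(markdown: str) -> list[dict]:
    """Split Markdown into chunks, starting a new chunk at every heading."""
    chunks: list[dict] = []
    heading, lines = None, []

    def flush() -> None:
        # Keep a chunk only if it has a heading or some non-blank text.
        text = "\n".join(lines).strip()
        if heading is not None or text:
            chunks.append({"heading": heading, "text": text})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


sample = "# Install\nRun the installer.\n\n## Linux\nUse apt.\n\n# Upgrade\nRun the upgrader."
for chunk in split_on_headings(sample):
    print(chunk["heading"], "->", chunk["text"])
```

Each chunk keeps its heading next to the text it introduces, which is the property that makes heading boundaries useful for retrieval.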

2. Define the crawl config

bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimatrix/collections/$COLLECTION_ID/crawls" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Nightly docs sync",
    "seed_url": "https://docs.acme.example",
    "max_depth": 3,
    "max_pages": 500,
    "max_total_bytes": 52428800,
    "follow_external": false,
    "schedule": {
      "type": "daily",
      "time": "03:00"
    },
    "webhook": {"enabled": true}
  }'

python
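import os

import httpx

# Same environment variables as the curl examples; COLLECTION_ID comes from step 1.
HOST = os.environ["SCAIGRID_HOST"]
COLLECTION_ID = os.environ["COLLECTION_ID"]
H = {"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"}

cfg = httpx.post(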
    f"{HOST}/v1/modules/scaimatrix/collections/{COLLECTION_ID}/crawls",
    headers=H,
    json={
        "name": "Nightly docs sync",
        "seed_url": "https://docs.acme.example",
        "max_depth": 3,
        "max_pages": 500,
        "max_total_bytes": 50 * 1024 * 1024,
        "follow_external": False,
        "schedule": {"type": "daily", "time": "03:00"},
        "webhook": {"enabled": True},
    },
).json()["data"]
print(cfg["id"], cfg["webhook_secret"])

javascript
const r = await fetch(
  `${HOST}/v1/modules/scaimatrix/collections/${COLLECTION_ID}/crawls`,
  {
    method: "POST",
    headers: H,
    body: JSON.stringify({
      name: "Nightly docs sync",
      seed_url: "https://docs.acme.example",
      max_depth: 3,
      max_pages: 500,
      max_total_bytes: 50 * 1024 * 1024,
      follow_external: false,
      schedule: { type: "daily", time: "03:00" },
      webhook: { enabled: true },
    }),
  },
);
const { data: cfg } = await r.json();

Schedule types: hourly (with interval_hours), daily (with time as HH:MM), weekly (with day_of_week 0-6), or custom (with cron). All times are UTC.
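As a sketch of how the four schedule shapes might be built, the hypothetical helper below pairs each type with the one companion field named above. The local validation and the example values (a 6-hour interval, the cron string) are illustrative assumptions; the server may accept additional fields or apply different rules.

```python
import re

# Companion field for each documented schedule type.
REQUIRED_FIELD = {
    "hourly": "interval_hours",
    "daily": "time",
    "weekly": "day_of_week",
    "custom": "cron",
}


def make_schedule(kind: str, **fields) -> dict:
    """Build a schedule object for one of the four documented types."""
    if kind not in REQUIRED_FIELD:
        raise ValueError(f"unknown schedule type: {kind}")
    key = REQUIRED_FIELD[kind]
    if key not in fields:
        raise ValueError(f"{kind} schedule needs {key}")
    value = fields[key]
    # Client-side sanity checks (assumed, not taken from the API reference).
    if key == "time" and not re.fullmatch(r"([01]\d|2[0-3]):[0-5]\d", str(value)):
        raise ValueError("time must be HH:MM (UTC)")
    if key == "day_of_week" and value not in range(7):
        raise ValueError("day_of_week must be 0-6")
    return {"type": kind, key: value}


print(make_schedule("hourly", interval_hours=6))
print(make_schedule("daily", time="03:00"))
print(make_schedule("weekly", day_of_week=1))
print(make_schedule("custom", cron="0 */4 * * *"))
```

The returned dict can be passed as the `schedule` value in the create or update payload.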

If you enable the webhook, the response includes a webhook_secret. Copy it now, because it is returned only once.

3. Run it once to verify

bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimatrix/collections/$COLLECTION_ID/crawls/$CONFIG_ID/run" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

You get back a CrawlJobRead with status: pending. The job usually moves to running within a few seconds.

4. Watch the live stream

bash
curl -N "$SCAIGRID_HOST/v1/modules/scaimatrix/collections/$COLLECTION_ID/crawl/$JOB_ID/stream" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

python
with httpx.stream(
    "GET",
    f"{HOST}/v1/modules/scaimatrix/collections/{COLLECTION_ID}/crawl/{job_id}/stream",
    headers=H,
    timeout=600,
) as r:
    for line in r.iter_lines():
        if line.startswith("data: "):
            print(line[6:])

javascript
const r = await fetch(
  `${HOST}/v1/modules/scaimatrix/collections/${COLLECTION_ID}/crawl/${jobId}/stream`,
  { headers: H },
);
const reader = r.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters intact when they span two chunks.
  console.log(decoder.decode(value, { stream: true }));
}

The stream emits a progress event every two seconds carrying pages_crawled, pages_failed, and total_bytes_fetched, followed by a final done event when the job reaches a terminal status.
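To act on individual events rather than print raw lines, a small parser helps. The sketch below assumes the stream uses standard Server-Sent Events framing (an `event:` line, a `data:` line, then a blank line) with JSON payloads. The function name `parse_sse` and the `"status": "completed"` value in the sample are hypothetical.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[tuple[str, dict]]:
    """Yield (event_name, payload) pairs from an iterable of SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):
            continue  # SSE comment, often sent as a keep-alive
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            # Drop the single optional space after the colon.
            data.append(line[len("data:"):].removeprefix(" "))


sample = [
    "event: progress",
    'data: {"pages_crawled": 12, "pages_failed": 1, "total_bytes_fetched": 40960}',
    "",
    "event: done",
    'data: {"status": "completed"}',  # hypothetical payload shape
    "",
]
for name, payload in parse_sse(sample):
    print(name, payload)
```

In the Python streaming example, `r.iter_lines()` could be passed to `parse_sse` directly, stopping the loop once a `done` event arrives.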

5. List the documents the crawl produced

bash
curl "$SCAIGRID_HOST/v1/modules/scaimatrix/crawl-jobs/$JOB_ID/documents?limit=50" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

Each row is a Document with source_type: "crawl". After the crawl, each document goes through the normal ingestion pipeline (chunking, embedding, optional graph extraction).

6. Wire up a webhook trigger from your CI

If you enabled the webhook in step 2, your CI can fire a crawl when docs change. Sign the request with HMAC-SHA256, keyed with the webhook secret, over the timestamp, a period, and the raw request body:

python
import hashlib, hmac, time, httpx, json

WEBHOOK_SECRET = "<webhook_secret from step 2>"
payload = json.dumps({"reason": "docs-deploy", "commit": "abcd1234"}).encode()
ts = str(int(time.time()))
signature = hmac.new(
    WEBHOOK_SECRET.encode(),
    ts.encode() + b"." + payload,
    hashlib.sha256,
).hexdigest()

r = httpx.post(
    f"{HOST}/v1/modules/scaimatrix/collections/{COLLECTION_ID}/crawls/{cfg['id']}/trigger",
    headers={
        "X-Crawl-Signature": signature,
        "X-Crawl-Timestamp": ts,
        "Content-Type": "application/json",
    },
    content=payload,
)
print(r.status_code, r.json())

The endpoint takes no JWT. HMAC verification is the only auth, so you can call it from any process that holds the secret. If the signature is missing or malformed, or the timestamp falls outside the verifier's allowed clock skew, the call returns 401. If a crawl is already running for the collection, you get 409.
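For intuition about the 401 cases, here is a sketch of what the receiving side's check might look like. It is an assumption, not ScaiMatrix's actual verifier: the function name `verify` and the 300-second tolerance are hypothetical, and only the signing scheme (timestamp, period, body) comes from the client example.

```python
import hashlib
import hmac
import time
from typing import Optional

MAX_SKEW_SECONDS = 300  # hypothetical tolerance; the real value is not documented here


def verify(secret: str, timestamp: str, body: bytes, signature: str,
           now: Optional[float] = None) -> bool:
    """Return True when the signature matches and the timestamp is fresh."""
    if not signature:
        return False  # missing signature
    try:
        sent_at = int(timestamp)
    except (TypeError, ValueError):
        return False  # malformed timestamp
    now = time.time() if now is None else now
    if abs(now - sent_at) > MAX_SKEW_SECONDS:
        return False  # too old (or too far in the future)
    expected = hmac.new(
        secret.encode(), timestamp.encode() + b"." + body, hashlib.sha256
    ).hexdigest()
    # compare_digest avoids leaking how many characters matched through timing.
    return hmac.compare_digest(expected.encode(), signature.encode())


secret = "example-secret"
body = b'{"reason": "docs-deploy"}'
ts = str(int(time.time()))
sig = hmac.new(secret.encode(), ts.encode() + b"." + body, hashlib.sha256).hexdigest()

print(verify(secret, ts, body, sig))                       # fresh and untampered
print(verify(secret, ts, body + b" ", sig))                # body changed after signing
print(verify(secret, ts, body, sig, now=int(ts) + 3600))   # replayed an hour later
```

Because the signature covers the exact bytes sent, serialize the payload once and send that same byte string, as the client example does with `content=payload`.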

7. Edit or pause the config

bash
# Pause without deleting
curl -X PUT "$SCAIGRID_HOST/v1/modules/scaimatrix/collections/$COLLECTION_ID/crawls/$CONFIG_ID" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"enabled": false}'

# Tighten the crawl scope
curl -X PUT "$SCAIGRID_HOST/v1/modules/scaimatrix/collections/$COLLECTION_ID/crawls/$CONFIG_ID" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"max_pages": 250, "max_depth": 2}'

enabled: false stops the scheduler but keeps the webhook live. Pass "clear_schedule": true to drop the schedule entirely, or "clear_webhook": true to disable the webhook.

8. Inspect run history

bash
curl "$SCAIGRID_HOST/v1/modules/scaimatrix/collections/$COLLECTION_ID/crawls/$CONFIG_ID/jobs?limit=20" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

Every run — scheduled, manual, or webhook — produces a job row tagged with the config id. The Crawl Manager admin page surfaces the same data as a timeline.

Tips

Set max_total_bytes even when max_pages is the headline limit — a single huge media file can blow the budget faster than you'd expect.
follow_external: false is the safe default. Turn it on only when you control every linked-out domain.
Re-running a crawl is additive: existing documents are updated in place when their URL is rediscovered; new pages get new document rows.
To rotate a webhook secret, send a PUT with clear_webhook: true, then a second PUT with webhook: {enabled: true}. The old secret is invalidated and a new one is returned.
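The secret rotation from the last tip can be sketched as two sequential updates. Here `put` is a hypothetical helper that sends a JSON body to the crawl config URL from step 7 and returns the response's data object. The sketch also assumes the update response carries `webhook_secret` under the same key as the create response. A stand-in replaces the network call so the example runs offline.

```python
from typing import Callable


def rotate_webhook_secret(put: Callable[[dict], dict]) -> str:
    """Rotate the secret with two PUTs against the crawl config."""
    put({"clear_webhook": True})  # disable the webhook, invalidating the old secret
    updated = put({"webhook": {"enabled": True}})  # re-enable to get a fresh secret
    return updated["webhook_secret"]


# Offline stand-in for the API; it records each body it receives.
sent = []


def fake_put(body: dict) -> dict:
    sent.append(body)
    return {"webhook_secret": "new-secret"} if "webhook" in body else {}


print(rotate_webhook_secret(fake_put))
print(sent)
```

Store the new secret in your CI's secret manager before the next deploy, since triggers signed with the old one will be rejected.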