---
summary: "Crawl configs \u2014 define a seed, depth, schedule, and webhook trigger\
  \ so a collection stays in sync with a website."
title: Crawl a docs site on a schedule
path: tutorials/crawl-on-a-schedule
status: published
---

# Crawl a docs site on a schedule

Most knowledge bases need to stay in sync with a website — your product docs, an internal wiki, a partner's release notes. A crawl config tells ScaiMatrix where to start, how far to walk, and when to run.

You can also expose a webhook URL so a CI pipeline or a CMS can fire a crawl when content changes.

## 1. Create a target collection

If you don't already have one:

```bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimatrix/collections" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Acme Docs (live)",
    "embedding_model": "openai/text-embedding-3-small",
    "chunking_strategy": "markdown",
    "default_access": "tenant"
  }'
```

`markdown` chunking is a good default for crawled HTML — the crawler converts pages to Markdown before chunking so headings become semantic boundaries.
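
To make "headings become semantic boundaries" concrete, here is a minimal sketch of heading-based splitting. It is an illustration only, not ScaiMatrix's actual chunker: the function name `split_on_headings` is hypothetical, and real chunkers usually also enforce size limits and overlap.

```python
import re

# An ATX heading: one to six '#' characters, whitespace, then text.
HEADING = re.compile(r"^#{1,6}\s+\S")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def split_on_headings(markdown: str) -> list[str]:
    """Split Markdown into chunks, starting a new chunk at each heading."""
    chunks: list[str] = []
    current: list[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(current).strip()
        if text:
            chunks.append(text)
        current.clear()

    for line in markdown.splitlines():
        # Track code fences so '# comment' lines inside code are not
        # mistaken for headings.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        if not in_fence and HEADING.match(line):
            flush()
        current.append(line)
    flush()
    return chunks
```

Each chunk then begins with the heading that introduces it, so a retrieved chunk carries its own section title as context.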

## 2. Define the crawl config

```bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimatrix/collections/$COLLECTION_ID/crawls" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Nightly docs sync",
    "seed_url": "https://docs.acme.example",
    "max_depth": 3,
    "max_pages": 500,
    "max_total_bytes": 52428800,
    "follow_external": false,
    "schedule": {
      "type": "daily",
      "time": "03:00"
    },
    "webhook": {"enabled": true}
  }'
```

```python
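import os

import httpx

# Setup assumed by the Python snippets in this tutorial; it mirrors the
# environment variables used by the curl examples.
HOST = os.environ["SCAIGRID_HOST"]
H = {"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"}
COLLECTION_ID = os.environ["COLLECTION_ID"]  # id returned in step 1

cfg = httpx.post(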
    f"{HOST}/v1/modules/scaimatrix/collections/{COLLECTION_ID}/crawls",
    headers=H,
    json={
        "name": "Nightly docs sync",
        "seed_url": "https://docs.acme.example",
        "max_depth": 3,
        "max_pages": 500,
        "max_total_bytes": 50 * 1024 * 1024,
        "follow_external": False,
        "schedule": {"type": "daily", "time": "03:00"},
        "webhook": {"enabled": True},
    },
).json()["data"]
print(cfg["id"], cfg["webhook_secret"])
```

```javascript
const r = await fetch(
  `${HOST}/v1/modules/scaimatrix/collections/${COLLECTION_ID}/crawls`,
  {
    method: "POST",
    headers: H,
    body: JSON.stringify({
      name: "Nightly docs sync",
      seed_url: "https://docs.acme.example",
      max_depth: 3,
      max_pages: 500,
      max_total_bytes: 50 * 1024 * 1024,
      follow_external: false,
      schedule: { type: "daily", time: "03:00" },
      webhook: { enabled: true },
    }),
  },
);
const { data: cfg } = await r.json();
```

Schedule types: `hourly` (with `interval_hours`), `daily` (with `time` as `HH:MM`), `weekly` (with `day_of_week` 0-6), or `custom` (with `cron`). All times are UTC.
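
The four schedule shapes can be sketched as a small builder that pairs each type with the one field it needs. The helper name `build_schedule` is hypothetical, and the checks only mirror what is stated above (`HH:MM` for `daily`, 0-6 for `weekly`); the server may validate more strictly.

```python
import re

# 24-hour HH:MM, interpreted as UTC by the scheduler.
HH_MM = re.compile(r"([01]\d|2[0-3]):[0-5]\d")


def build_schedule(kind: str, **fields) -> dict:
    """Return the `schedule` object for a crawl config of the given type."""
    if kind == "hourly":
        return {"type": "hourly", "interval_hours": int(fields["interval_hours"])}
    if kind == "daily":
        if not HH_MM.fullmatch(fields["time"]):
            raise ValueError("time must be HH:MM (24-hour, UTC)")
        return {"type": "daily", "time": fields["time"]}
    if kind == "weekly":
        day = int(fields["day_of_week"])
        if not 0 <= day <= 6:
            raise ValueError("day_of_week must be between 0 and 6")
        return {"type": "weekly", "day_of_week": day}
    if kind == "custom":
        # The cron string is passed through untouched; which cron dialect
        # the scheduler accepts is not covered here.
        return {"type": "custom", "cron": fields["cron"]}
    raise ValueError(f"unknown schedule type: {kind!r}")


print(build_schedule("hourly", interval_hours=6))
print(build_schedule("daily", time="03:00"))
print(build_schedule("weekly", day_of_week=1))
```

The returned dict drops straight into the `"schedule"` key of the create or update payload.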

If you enable the webhook, the response includes a `webhook_secret`. Copy it now: it is returned only once.

## 3. Run it once to verify

```bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimatrix/collections/$COLLECTION_ID/crawls/$CONFIG_ID/run" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"
```

You get back a `CrawlJobRead` with `status: pending`. The first transition to `running` is usually within a few seconds.

## 4. Watch the live stream

```bash
curl -N "$SCAIGRID_HOST/v1/modules/scaimatrix/collections/$COLLECTION_ID/crawl/$JOB_ID/stream" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"
```

```python
with httpx.stream(
    "GET",
    f"{HOST}/v1/modules/scaimatrix/collections/{COLLECTION_ID}/crawl/{job_id}/stream",
    headers=H,
    timeout=600,
) as r:
    for line in r.iter_lines():
        if line.startswith("data: "):
            print(line[6:])
```

```javascript
const r = await fetch(
  `${HOST}/v1/modules/scaimatrix/collections/${COLLECTION_ID}/crawl/${jobId}/stream`,
  { headers: H },
);
const reader = r.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters intact across chunk boundaries
  console.log(decoder.decode(value, { stream: true }));
}
```

The stream emits a `progress` event every two seconds carrying `pages_crawled`, `pages_failed`, and `total_bytes_fetched`, followed by a final `done` event when the job reaches a terminal status.
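
To act on those events rather than print raw lines, the stream can be parsed into `(event, payload)` pairs. The sketch below assumes standard Server-Sent Events framing (`event:` and `data:` lines, with a blank line ending each event) and JSON-encoded `data` payloads; neither is confirmed here beyond the `data: ` prefix used in the Python snippet above. `parse_sse` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[tuple[str, dict]]:
    """Yield (event_name, payload) pairs from Server-Sent Events lines."""
    event, data = "message", []  # "message" is the SSE default event name
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # SSE strips a single leading space after the colon.
            data.append(value[1:] if value.startswith(" ") else value)
    # Lenient: also emit a trailing event that was not closed by a blank line.
    if data:
        yield event, json.loads("\n".join(data))
```

With the `httpx.stream(...)` call shown earlier, `for name, payload in parse_sse(r.iter_lines()):` lets you read `payload["pages_crawled"]` on `progress` events and stop once `name == "done"`.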

## 5. List the documents the crawl produced

```bash
curl "$SCAIGRID_HOST/v1/modules/scaimatrix/crawl-jobs/$JOB_ID/documents?limit=50" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"
```

Each row is a `Document` with `source_type: "crawl"`. After the crawl, each document runs through the normal ingestion pipeline to completion (chunking, embedding, optional graph extraction).

## 6. Wire up a webhook trigger from your CI

If you enabled the webhook in step 2, your CI can fire a crawl when docs change. Sign the request with HMAC-SHA256, computed over the timestamp and the raw request body joined by a `.`:

```python
import hashlib, hmac, time, httpx, json

WEBHOOK_SECRET = "<webhook_secret from step 2>"
payload = json.dumps({"reason": "docs-deploy", "commit": "abcd1234"}).encode()
ts = str(int(time.time()))
signature = hmac.new(
    WEBHOOK_SECRET.encode(),
    ts.encode() + b"." + payload,
    hashlib.sha256,
).hexdigest()

r = httpx.post(
    f"{HOST}/v1/modules/scaimatrix/collections/{COLLECTION_ID}/crawls/{cfg['id']}/trigger",
    headers={
        "X-Crawl-Signature": signature,
        "X-Crawl-Timestamp": ts,
        "Content-Type": "application/json",
    },
    content=payload,
)
print(r.status_code, r.json())
```

The endpoint takes no JWT; HMAC verification is the only auth, so you can call it from any process that holds the secret. If the signature is missing or malformed, or the timestamp falls outside the verifier's allowed clock skew, the call returns 401. If a crawl is already running for the collection, you get 409.

## 7. Edit or pause the config

```bash
# Pause without deleting
curl -X PUT "$SCAIGRID_HOST/v1/modules/scaimatrix/collections/$COLLECTION_ID/crawls/$CONFIG_ID" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"enabled": false}'

# Tighten the crawl scope
curl -X PUT "$SCAIGRID_HOST/v1/modules/scaimatrix/collections/$COLLECTION_ID/crawls/$CONFIG_ID" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"max_pages": 250, "max_depth": 2}'
```

`enabled: false` stops the scheduler but keeps the webhook live. Pass `"clear_schedule": true` to drop the schedule entirely, or `"clear_webhook": true` to disable the webhook.

## 8. Inspect run history

```bash
curl "$SCAIGRID_HOST/v1/modules/scaimatrix/collections/$COLLECTION_ID/crawls/$CONFIG_ID/jobs?limit=20" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"
```

Every run — scheduled, manual, or webhook — produces a job row tagged with the config id. The Crawl Manager admin page surfaces the same data as a timeline.

## Tips

- Set `max_total_bytes` even when `max_pages` is the headline limit — a single huge media file can blow the budget faster than you'd expect.
- `follow_external: false` is the safe default. Turn it on only when you have control of every linked-out domain.
- Re-running a crawl is additive: existing documents are updated in place when their URL is rediscovered; new pages get new document rows.
- To rotate a webhook secret, send a PUT with `"clear_webhook": true`, then a second PUT with `"webhook": {"enabled": true}`. The old secret is invalidated and a new one is returned.
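
The rotation tip above amounts to two PUT calls in sequence, which can be sketched as follows. `rotate_webhook_secret` and its `put` parameter are hypothetical: `put` is any callable that sends a JSON body to the config's PUT endpoint and returns the parsed response. Reading the new secret from `data.webhook_secret` is an assumption based on the shape of the create response in step 2.

```python
from typing import Callable


def rotate_webhook_secret(put: Callable[[dict], dict]) -> str:
    """Disable the webhook, re-enable it, and return the newly issued secret."""
    # Step 1: clearing the webhook invalidates the old secret.
    put({"clear_webhook": True})
    # Step 2: re-enabling issues a fresh secret, returned once in the response.
    response = put({"webhook": {"enabled": True}})
    return response["data"]["webhook_secret"]
```

With `httpx`, `put` could be `lambda body: httpx.put(url, headers=H, json=body).json()`, where `url` is the config URL from step 7. Between the two calls the webhook is disabled, so pause any CI jobs that trigger crawls while rotating.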
