CRAWLING

Crawl whole sites. In one job.

The /v1/crawl endpoint walks the link graph of a site, obeys robots.txt by default, unrolls the sitemap, and delivers clean results to your webhook once the job finishes. Depth, page budget, and concurrency are all yours to set.


Crawling is one family in the full 145-endpoint surface. See the scraping bundle for the endpoints that pair with it, or follow the full crawl recipe end to end.

Depth and budget caps

Every crawl carries a max_pages budget (default 500) and a link-depth ceiling (default 3). Raise both for large knowledge bases or lower them for a quick scoped pass. The crawler stops the moment either cap is hit, so a run can never silently balloon past what you asked for.
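To make the interaction of the two caps concrete, here is a minimal sketch of a bounded breadth-first walk over an in-memory link graph. It is an illustration only, not the service's implementation: the function name `bounded_crawl`, the `links` dictionary, and the choice to count the seed page as depth 0 are all assumptions made for this example.

```python
from collections import deque


def bounded_crawl(links, seed, max_pages=500, max_depth=3):
    """Breadth-first walk over a link graph that honors both caps.

    links maps each URL to the URLs it links to (a stand-in for real fetching).
    The seed is treated as depth 0, which is an assumption of this sketch.
    """
    visited = []                 # pages "crawled", in visit order
    seen = {seed}                # URLs already queued, so nothing is visited twice
    queue = deque([(seed, 0)])
    while queue:
        if len(visited) >= max_pages:
            break                # page budget exhausted: stop immediately
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue             # depth ceiling: keep the page, do not follow its links
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited
```

With a small graph, lowering `max_depth` trims the deep pages while lowering `max_pages` cuts the run short in visit order, so whichever cap binds first decides the size of the result.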

Robots-aware by default

respect_robots is true unless you turn it off, and you should only turn it off for sites you own or have permission to crawl. The walker reads the target site's robots.txt before it starts and stays polite: single-server sites are automatically slowed so a crawl never looks like an attack.

Sitemap unrolling

Pair the crawl with the sitemap and robots endpoints to discover every declared URL up front. Seed the run from the sitemap instead of blindly following links, so you crawl exactly the pages you mean to and skip the ones the site already told you to leave alone.

Webhook delivery

Pass a webhook_url and the job runs detached — your process is not holding a socket open for an hour. When the crawl finishes we POST the full result and the job id to your endpoint. Prefer polling? Hit the jobs endpoint with the returned job_id instead.
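A receiver for that callback can be very small. The sketch below uses only the Python standard library and assumes the POST body is JSON carrying `job_id` and `result` fields, mirroring the polling response shown on this page; the exact callback schema is not documented here, so treat those field names, the handler class, and the port as assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_callback(body):
    """Return (job_id, result) from a callback body, or raise ValueError.

    Assumes a JSON object with "job_id" and "result" keys (an assumption
    based on the polling response, not a documented schema).
    """
    try:
        payload = json.loads(body)
    except ValueError as exc:  # covers invalid JSON and undecodable bytes
        raise ValueError("callback body is not valid JSON") from exc
    if not isinstance(payload, dict) or "job_id" not in payload:
        raise ValueError("callback body has no job_id")
    return payload["job_id"], payload.get("result", {})


class CrawlCallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            job_id, result = parse_callback(self.rfile.read(length))
        except ValueError:
            self.send_response(400)   # reject bodies we cannot understand
            self.end_headers()
            return
        # Placeholder: hand the result to your own storage or queue here.
        print(f"crawl {job_id} finished: {result.get('pages_crawled')} pages")
        self.send_response(204)       # acknowledge quickly, process later
        self.end_headers()


def serve(port=8000):
    """Block forever, answering callbacks on the given port."""
    HTTPServer(("", port), CrawlCallbackHandler).serve_forever()
```

Calling `serve()` starts the listener. In production you would put this behind HTTPS and verify that the request really came from the crawler before trusting the body.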

Examples that work today.

Start a crawl, then collect the result by webhook or by polling.

Start a crawl: POST /v1/crawl
export OLLAGRAPH_API_KEY="osk_..."

curl -X POST https://api.ollagraph.com/v1/crawl \
  -H "Authorization: Bearer $OLLAGRAPH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "max_pages": 500,
    "depth": 3,
    "concurrency": 10,
    "respect_robots": true,
    "webhook_url": "https://yourapp.com/ollagraph-callback"
  }'

# Returns: { "status": "queued", "job_id": "..." }
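
If you would rather start the crawl from Python than from curl, the same request can be sketched with the standard library. The helper names `build_crawl_request` and `start_crawl` are hypothetical; the field names and the 500 and 3 defaults come from this page, while the `concurrency` default of 10 is simply copied from the curl example and is not a documented default.

```python
import json
import urllib.request

API_BASE = "https://api.ollagraph.com"


def build_crawl_request(api_key, url, max_pages=500, depth=3, concurrency=10,
                        respect_robots=True, webhook_url=None):
    """Build (but do not send) the POST /v1/crawl request."""
    payload = {
        "url": url,
        "max_pages": max_pages,
        "depth": depth,
        "concurrency": concurrency,  # 10 mirrors the curl example above
        "respect_robots": respect_robots,
    }
    if webhook_url:
        payload["webhook_url"] = webhook_url  # omit it to poll instead
    return urllib.request.Request(
        f"{API_BASE}/v1/crawl",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def start_crawl(request):
    """Send the request and return the job_id from the response."""
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)["job_id"]
```

Keeping construction separate from sending lets you inspect or log the payload before any credit is spent.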
Poll the job: GET /v1/jobs/{job_id}
export OLLAGRAPH_API_KEY="osk_..."

# No webhook? Poll the job by id until it is done.
JOB_ID="..."  # the job_id returned by POST /v1/crawl
curl "https://api.ollagraph.com/v1/jobs/$JOB_ID" \
  -H "Authorization: Bearer $OLLAGRAPH_API_KEY"

# Response when ready:
# { "job_id": "...", "status": "completed",
#   "result": { "pages_crawled": 487, "urls": [...] } }
Seed from sitemap + robots: POST /v1/intel/sitemap
export OLLAGRAPH_API_KEY="osk_..."

# Read the sitemap and robots policy first, then seed the crawl
# only with the paths you actually want.
curl -X POST https://api.ollagraph.com/v1/intel/sitemap \
  -H "Authorization: Bearer $OLLAGRAPH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com"}'

curl -X POST https://api.ollagraph.com/v1/intel/robots \
  -H "Authorization: Bearer $OLLAGRAPH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com"}'

Watching a crawl as it runs is part of the per-URL observability surface.

Crawler questions

What does /v1/crawl actually do?

It walks the link graph of a site starting from one seed URL, returns clean content for every page it discovers, and respects depth, page-budget, concurrency, and robots controls along the way. It is asynchronous: the call returns a job_id immediately and the result arrives by webhook or by polling the jobs endpoint.

How big can a crawl be?

The defaults are 500 pages and depth 3. You can raise both; production customers regularly crawl tens of thousands of pages per job. For very large or scheduled crawls, talk to us about capacity via the enterprise page.

Does the crawler respect robots.txt?

Yes, by default. Each crawl honors the robots.txt of the target site. You can override that with respect_robots set to false, but only do so where you have explicit permission to crawl, such as your own site or a partner's. Check what a site declares first with the robots endpoint.

How do I seed a crawl from a sitemap?

Call the sitemap endpoint to pull the site's declared URLs and the robots endpoint to read its policy, then start the crawl from a seed URL that covers the paths you want. Reading both up front lets you scope the run deliberately rather than discovering pages by blindly following links.
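
As a rough illustration of that scoping step, the sketch below narrows a list of sitemap URLs to one section of a site and drops anything robots.txt disallows. It assumes you have already extracted a plain list of URL strings and the raw robots.txt text from the two endpoints; their response shapes are not documented on this page, and `select_seed_urls` is a hypothetical helper. The robots check uses Python's standard `urllib.robotparser`.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def select_seed_urls(sitemap_urls, robots_txt, path_prefix, user_agent="*"):
    """Keep sitemap URLs under path_prefix that robots.txt allows.

    sitemap_urls: list of absolute URL strings (assumed input shape).
    robots_txt:   raw robots.txt text for the same site (assumed input shape).
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return [
        url
        for url in sitemap_urls
        if urlparse(url).path.startswith(path_prefix)
        and rules.can_fetch(user_agent, url)
    ]
```

The surviving URLs tell you which seed to start from and roughly how large a max_pages budget the section needs.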

How does webhook delivery work?

Provide a webhook_url with the request. When the job completes, we send a single POST to your URL with the full result body and the job id. The job runs detached so your application is never holding a long-lived connection open while the crawl runs.

What if I do not want to set up a webhook?

Skip the webhook_url and poll instead. The crawl call returns a job_id; pass it to GET /v1/jobs/{job_id} and you will get the status, and the full result once the status is completed. This is the simplest pattern for a one-off run from a notebook or a terminal.
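
For a notebook, the polling pattern might look like the sketch below. The "completed" status and the result field come from the example response on this page; the "failed" status, the polling interval, and the timeout are assumptions, and `fetch_job` and `wait_for_job` are hypothetical helper names.

```python
import json
import time
import urllib.request


def fetch_job(job_id, api_key):
    """GET /v1/jobs/{job_id} and return the decoded JSON body."""
    req = urllib.request.Request(
        f"https://api.ollagraph.com/v1/jobs/{job_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def wait_for_job(job_id, fetch, interval=5.0, timeout=3600.0, sleep=time.sleep):
    """Poll fetch(job_id) until the job completes, then return its result.

    fetch is any callable returning the job as a dict, which keeps the loop
    testable without network access.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch(job_id)
        status = job.get("status")
        if status == "completed":
            return job["result"]
        if status == "failed":  # assumed status name, not confirmed by this page
            raise RuntimeError(f"crawl {job_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"crawl {job_id} not done after {timeout} seconds")
        sleep(interval)
```

A typical call would be `wait_for_job(job_id, lambda j: fetch_job(j, api_key))`. Keep the interval at a few seconds or more so polling stays light.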

How is a crawl billed?

Crawling is metered per call like the rest of the surface — one credit per call, and failed calls are auto-refunded so a crawl that errors out never costs you a credit. New accounts start with 1,000 free credits to prove the value first.

Start your first crawl free.

1,000 free credits, one bearer token, failed calls auto-refund.
