The /v1/crawl endpoint walks the link graph of a site, obeys robots.txt by default, unrolls the sitemap, and posts clean results to your webhook when the job finishes. Depth, page budget, and concurrency are all yours to set.
The crawl endpoint is one family in the full 145-endpoint surface. See the scraping bundle for the endpoints that pair with it, or follow the full crawl recipe end to end.
Every crawl carries a max_pages budget (default 500) and a link-depth ceiling (default 3). Raise both for large knowledge bases or lower them for a quick scoped pass. The crawler stops the moment either cap is hit, so a run can never silently balloon past what you asked for.
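As a rough illustration of setting both caps, the sketch below builds a request body for a larger run. The helper name crawl_payload and the values 5000 and 5 are illustrative assumptions; the field names url, max_pages, and depth are the ones used in the request example on this page.

```shell
# crawl_payload URL MAX_PAGES DEPTH
# Hypothetical helper: prints a JSON body for POST /v1/crawl.
# Assumes URL contains no double quotes or backslashes (no JSON escaping is done).
crawl_payload() {
  printf '{"url": "%s", "max_pages": %d, "depth": %d}\n' "$1" "$2" "$3"
}

# Example (not executed here): a larger knowledge-base crawl.
# crawl_payload "https://docs.example.com" 5000 5 |
#   curl -X POST https://api.ollagraph.com/v1/crawl \
#     -H "Authorization: Bearer $OLLAGRAPH_API_KEY" \
#     -H "Content-Type: application/json" \
#     -d @-
```

Keeping the caps as explicit arguments makes it harder to launch a run without deciding on a budget first.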
respect_robots is true unless you turn it off, and you should only turn it off for sites you own or have permission to crawl. The walker reads the target's robots.txt before it starts and stays polite: single-server sites are automatically slowed so a crawl never looks like an attack.
Pair the crawl with the sitemap and robots endpoints to discover every declared URL up front. Seed the run from the sitemap instead of blindly following links, so you crawl exactly the pages you mean to and skip the ones the site already told you to leave alone.
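To make the scoping step concrete, here is a minimal sketch that filters a list of declared URLs down to the paths you want. It assumes you have already extracted the sitemap URLs into plain text, one per line; the response format of the sitemap endpoint is not shown on this page, so that extraction step is left out. The function name scope_urls and the example prefixes are hypothetical.

```shell
# scope_urls KEEP_PREFIX [SKIP_PREFIX]
# Reads URLs from stdin (one per line) and prints those that start with
# KEEP_PREFIX, dropping any that start with SKIP_PREFIX when one is given.
# Both prefixes are matched as literal strings, not patterns.
scope_urls() {
  awk -v keep="$1" -v skip="${2:-}" '
    index($0, keep) == 1 && (skip == "" || index($0, skip) != 1)
  '
}

# Example (not executed here), with urls.txt holding one declared URL per line:
# scope_urls "https://docs.example.com/guides/" \
#            "https://docs.example.com/guides/archive/" < urls.txt
```

The skip prefix is a convenient place to put a path that the site's robots policy asks crawlers to leave alone.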
Pass a webhook_url and the job runs detached — your process is not holding a socket open for an hour. When the crawl finishes we POST the full result and the job id to your endpoint. Prefer polling? Hit the jobs endpoint with the returned job_id instead.
Start a crawl, then collect the result by webhook or by polling.
export OLLAGRAPH_API_KEY="osk_..."
curl -X POST https://api.ollagraph.com/v1/crawl \
  -H "Authorization: Bearer $OLLAGRAPH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "max_pages": 500,
    "depth": 3,
    "concurrency": 10,
    "respect_robots": true,
    "webhook_url": "https://yourapp.com/ollagraph-callback"
  }'
# Returns: { "status": "queued", "job_id": "..." }

# No webhook? Poll the job by id until it is done.
JOB_ID="..."   # the job_id returned by the crawl call
curl "https://api.ollagraph.com/v1/jobs/$JOB_ID" \
  -H "Authorization: Bearer $OLLAGRAPH_API_KEY"
# Response when ready:
# { "job_id": "...", "status": "completed",
# "result": { "pages_crawled": 487, "urls": [...] } }

# Read the sitemap and robots policy first, then seed the crawl
# only with the paths you actually want.
curl -X POST https://api.ollagraph.com/v1/intel/sitemap \
  -H "Authorization: Bearer $OLLAGRAPH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com"}'
curl -X POST https://api.ollagraph.com/v1/intel/robots \
  -H "Authorization: Bearer $OLLAGRAPH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com"}'

Watching a crawl as it runs is part of the per-URL observability surface.
The crawl endpoint walks the link graph of a site starting from one seed URL, returns clean content for every page it discovers, and respects depth, page-budget, concurrency, and robots controls along the way. It is asynchronous: the call returns a job_id immediately and the result arrives by webhook or by polling the jobs endpoint.
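Because the call only returns a job handle, the first thing a script usually does is capture the job_id. The sketch below pulls a string field out of a flat JSON response with sed. The helper name json_field is hypothetical, and it assumes the value is a simple quoted string with no escaped quotes; a real JSON parser such as jq is more robust if you have one available.

```shell
# json_field NAME
# Reads JSON from stdin and prints the string value stored under NAME
# (the last occurrence, if NAME appears more than once).
json_field() {
  tr -d '\n' | sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Example (not executed here): keep the job id from the crawl response.
# JOB_ID=$(curl -sS -X POST https://api.ollagraph.com/v1/crawl \
#   -H "Authorization: Bearer $OLLAGRAPH_API_KEY" \
#   -H "Content-Type: application/json" \
#   -d '{"url": "https://docs.example.com"}' | json_field job_id)
```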
The defaults are 500 pages and depth 3. You can raise both — production customers regularly crawl tens of thousands of pages per job. For very large or scheduled crawls, talk to us about enterprise capacity on the enterprise page.
Robots.txt is respected by default. Each crawl honors the robots.txt of the target site. You can override that by setting respect_robots to false, but only do so where you have explicit permission to crawl, such as your own site or a partner's. Check what a site declares first with the robots endpoint.
Call the sitemap endpoint to pull the site's declared URLs, then start the crawl from the seed URL. Reading the sitemap and robots policy up front lets you scope the run to the paths you actually want rather than discovering them by depth-first link following.
Provide a webhook_url with the request. When the job completes, we send a single POST to your URL with the full result body and the job id. The job runs detached so your application is never holding a long-lived connection open while the crawl runs.
Skip the webhook_url and poll instead. The crawl call returns a job_id; pass it to GET /v1/jobs/{job_id} and you will get the status, and the full result once the status is completed. This is the simplest pattern for a one-off run from a notebook or a terminal.
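A polling loop along these lines is one way to script that pattern. It is a sketch under assumptions: the function names, the attempt limit, and the delay are illustrative, and the FETCH variable exists only so the fetch step can be swapped out. This page only shows the queued and completed statuses, so the loop treats anything other than completed as still running; if the jobs endpoint also reports a failure status, you would want to stop on it.

```shell
# fetch_job JOB_ID
# Default fetcher: GET /v1/jobs/{job_id} with the bearer token from the environment.
fetch_job() {
  curl -sS "https://api.ollagraph.com/v1/jobs/$1" \
    -H "Authorization: Bearer $OLLAGRAPH_API_KEY"
}

# poll_job JOB_ID [MAX_ATTEMPTS] [DELAY_SECONDS]
# Calls $FETCH (default: fetch_job) until the response contains
# "status": "completed", then prints that response and returns 0.
# Returns 1 if a fetch fails or the attempt limit is reached first.
poll_job() {
  job_id=$1
  max_attempts=${2:-60}
  delay=${3:-5}
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    body=$("${FETCH:-fetch_job}" "$job_id") || return 1
    if printf '%s' "$body" |
        grep -Eq '"status"[[:space:]]*:[[:space:]]*"completed"'; then
      printf '%s\n' "$body"
      return 0
    fi
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  echo "job $job_id not completed after $max_attempts attempts" >&2
  return 1
}

# Example (not executed here): poll every 10 seconds, up to 30 times.
# poll_job "$JOB_ID" 30 10
```

Capping the number of attempts keeps a forgotten terminal session from polling indefinitely.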
Crawling is metered per call like the rest of the surface — one credit per call, and failed calls are auto-refunded so a crawl that errors out never costs you a credit. New accounts start with 1,000 free credits to prove the value first.