2026-06-18 · 10 min read

How to OCR images and scanned documents via API in 2026

Extract text from images, screenshots, and scanned PDFs with one API call — bounding boxes included — for search, RAG, and data entry.

OCR, documents, data entry

What OCR via API actually means in 2026

OCR — optical character recognition — turns the text trapped inside pixels into characters your code can read. In 2026, the practical way to do it at scale is a managed API: you send a base64-encoded image, you get the recognized text back, optionally with bounding boxes and per-region confidence scores. Ollagraph exposes exactly that at POST /v1/convert/ocr.

That paragraph is the short answer. The rest of this page is for the engineers who have to wire it into a real pipeline (search indexing, RAG ingestion, or data-entry automation) and want to know the request shape, the failure modes, and where OCR stops and document conversion begins.

OCR is not document conversion — know which one you need

This is the distinction that saves teams a wasted afternoon. There are two different problems hiding under the words "extract the text from this document," and they have two different endpoints.

If your input is a born-digital file (a real PDF, a Word doc, a slide deck where the text is already encoded as text), you do not need OCR at all. You need format conversion, which preserves headings, lists, and tables. That is the job of the markdown converters; the "convert documents to markdown" recipe covers that path end to end, and the markdown actor handles HTML pages.

If your input is pixels — a photo of a receipt, a screenshot of an error message, a fax, a flatbed scan of a signed contract, a TIFF from a records system — there is no text to extract, only an image of text. That is OCR, and that is this page. The tell is simple: if you can select the text with your cursor in the original, it is not OCR work. If you cannot, it is.

Scanned PDFs sit on the boundary, and we handle that case explicitly below.

The problem you are actually trying to solve

Nobody wants OCR for its own sake. You want the text so you can do something with it. The reader of this page usually falls into one of three buckets.

The first is search and indexing. You have a corpus of scanned documents — invoices, legal filings, archived correspondence — that is completely invisible to search because it is all images. OCR makes it findable.

The second is RAG ingestion. Your retrieval pipeline can only embed text it can read, and a stack of image-only PDFs is a hole in the corpus. OCR fills the hole so the recovered text gets chunked and embedded like any other source.

The third is data entry automation. Receipts, forms, ID documents, packing slips — a human used to retype these into a system. OCR plus bounding boxes lets you map recognized regions to fields and route only the uncertain ones to a person.

All three share a hidden requirement: the pipeline has to run without babysitting and the bill has to be predictable. That is the actual product.

The recipe, step by step

Here is the working playbook. Real curl commands, real request shapes. Drop these into a shell and you have a pipeline.

Step 1. Get an API key

Sign up on the pricing page, grab an API key from the dashboard, and export it for the rest of this session. Keys start with the prefix osk_ and authenticate every call.

export OLLAGRAPH_API_KEY="osk_xxxxxxxxxxxx"

Step 2. Base64 your image

The OCR endpoint takes the image bytes inline, base64-encoded, in the image_b64 field. It accepts PNG, JPEG, WebP, BMP, and TIFF, and the decoded image must be under 15 MB. Encode the file on your side first.

B64=$(base64 -w0 receipt.png)

On macOS the flag differs — use base64 -i receipt.png | tr -d '\n'. The point is the same: one string, no line breaks, ready to drop into the JSON body.
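If the encoding step lives in application code rather than a shell, the same idea can be sketched in Python. This is a minimal sketch, not an official client: the function name encode_image is made up for illustration, and reading "15 MB" as 15 * 1024 * 1024 bytes is an assumption, since the limit is only stated as "under 15 MB".

```python
import base64

# Assumption: "15 MB" is read as 15 * 1024 * 1024 bytes; the article only says "under 15 MB".
MAX_DECODED_BYTES = 15 * 1024 * 1024


def encode_image(data: bytes) -> str:
    """Return the image bytes as one unbroken base64 string, or raise if too large."""
    if len(data) >= MAX_DECODED_BYTES:
        raise ValueError(f"image is {len(data)} bytes; the limit is under {MAX_DECODED_BYTES}")
    # b64encode never inserts line breaks, so no equivalent of `-w0` is needed.
    return base64.b64encode(data).decode("ascii")
```

Typical use would be encode_image(pathlib.Path("receipt.png").read_bytes()). Checking the size before encoding means an oversized scan fails locally instead of costing a round trip.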

Step 3. OCR a single image

Start with one image to verify the pipeline end to end. The minimal call sends just the base64 string and gets the recognized text back.

curl -X POST https://api.ollagraph.com/v1/convert/ocr \
  -H "Authorization: Bearer $OLLAGRAPH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "image_b64": "'"$B64"'"
  }'

The response carries the recognized text. The shape of the response object is documented in the live spec in the docs; read the field names there rather than trusting any hand-typed copy on a blog, including this one. Land the text in your store and you have OCR working in one call.
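For a worker process, the curl call above translates fairly directly to Python's standard library. The sketch below only builds the request and does not send it; the helper name build_ocr_request is hypothetical, while the URL, headers, and the image_b64 and return_boxes fields are the ones used on this page. Building the body in code also avoids passing a very large base64 string as a shell argument, which can run into the operating system's argument-length limit.

```python
import json
import urllib.request

OCR_URL = "https://api.ollagraph.com/v1/convert/ocr"


def build_ocr_request(image_b64: str, api_key: str, return_boxes: bool = False) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    body = {"image_b64": image_b64}
    if return_boxes:
        # Optional flag from this page: asks for per-region boxes and confidence scores.
        body["return_boxes"] = True
    return urllib.request.Request(
        OCR_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the request to urllib.request.urlopen with a timeout and parse the response with json.load. The response field names are deliberately not shown here; take them from the live spec.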

Step 4. Ask for bounding boxes

Plain text is enough for search and RAG. Data entry needs more: you need to know where each piece of text sat on the page so you can map it to a field. Set return_boxes to true and the response adds per-region bounding boxes and a confidence score for each region.

curl -X POST https://api.ollagraph.com/v1/convert/ocr \
  -H "Authorization: Bearer $OLLAGRAPH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "image_b64": "'"$B64"'",
    "return_boxes": true
  }'

Boxes unlock the workflows that plain text cannot. Map a region's coordinates to a known form layout to auto-fill fields. Use the box geometry to draw highlight overlays in a review UI. Use the coordinates to redact a region before you persist anything. And use the confidence score as a routing signal — which brings us to the part most teams get wrong.
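The form-layout mapping can be sketched as a point-in-rectangle test. Everything about the data shape here is an assumption made for illustration: the region dictionaries with "text" and "box" keys, the [x, y, width, height] box order, and the function name map_regions_to_fields are hypothetical, so adapt them to the field names in the live spec.

```python
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def map_regions_to_fields(regions: List[dict], layout: Dict[str, Rect]) -> Dict[str, str]:
    """Assign each region's text to the layout field whose rectangle contains the region's center."""
    fields: Dict[str, List[str]] = {name: [] for name in layout}
    for region in regions:
        # Hypothetical shape: {"text": str, "box": [x, y, width, height]}.
        x, y, w, h = region["box"]
        cx, cy = x + w / 2, y + h / 2
        for name, (left, top, right, bottom) in layout.items():
            if left <= cx <= right and top <= cy <= bottom:
                fields[name].append(region["text"])
                break  # a region belongs to at most one field
    # Fields with no matching region are omitted from the result.
    return {name: " ".join(parts) for name, parts in fields.items() if parts}
```

Matching on the region's center rather than requiring the whole box to fit tolerates small shifts between scans of the same form. Regions that fall outside every rectangle are simply ignored.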

Step 5. Route on confidence, not on faith

This is the step that separates a demo from a production data-entry system. OCR accuracy is a function of input quality. A crisp 300 DPI scan of a printed invoice reads almost perfectly. A blurry phone photo of a crumpled, faded thermal receipt reads badly, and no engine on earth changes that. That is why we return a confidence score per region instead of pretending every result is equally trustworthy.

The pattern that works: set a confidence threshold per field type, auto-accept regions above it, and queue regions below it for human review. One person reviewing the low-confidence fields, say 5 percent of the total, is far cheaper than a person checking 100 percent of them, and far safer than a pipeline that trusts everything. Tune the threshold per use case: a dollar amount on an invoice deserves a higher bar than a free-text note.
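The threshold pattern can be sketched in a few lines. The region shape (a "field" label, "text", and a "confidence" between 0 and 1), the threshold values, and the name route_regions are all assumptions for illustration; the actual confidence scale and field names come from the live spec.

```python
from typing import Dict, List, Tuple


def route_regions(
    regions: List[dict],
    thresholds: Dict[str, float],
    default_threshold: float = 0.90,
) -> Tuple[List[dict], List[dict]]:
    """Split regions into (auto_accept, needs_review) using a per-field threshold."""
    accepted, review = [], []
    for region in regions:
        # Hypothetical shape: {"field": str, "text": str, "confidence": float in [0, 1]}.
        bar = thresholds.get(region["field"], default_threshold)
        (accepted if region["confidence"] >= bar else review).append(region)
    return accepted, review
```

With thresholds such as {"amount": 0.98, "note": 0.80}, an amount read at 0.97 goes to the review queue while a note read at 0.85 is accepted, which is the "higher bar for dollar amounts" idea in practice.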

Scanned PDFs: the boundary case

A PDF is a document, not an image, so it does not go to the OCR endpoint directly. There are two clean paths depending on what is inside the file.

For most PDFs — including image-only and scanned ones — start with POST /v1/convert/pdf-to-markdown. It extracts the document page by page and flags image-only or scanned pages for OCR, so it both handles born-digital PDFs and tells you where the pixels are. Supply the PDF as base64 in the pdf_base64 field, and cap the work with max_pages if you only need the front matter.

curl -X POST https://api.ollagraph.com/v1/convert/pdf-to-markdown \
  -H "Authorization: Bearer $OLLAGRAPH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_base64": "'"$(base64 -w0 contract.pdf)"'",
    "max_pages": 50
  }'

The body has a 1 MB cap, which works out to roughly 750 KB of raw PDF — so for large scans, split the file or render the pages to images first. That leads to the second path: if you have already rasterized the pages (a common output of any scanning workflow), send each page image to /v1/convert/ocr individually and stitch the recovered text back together in page order on your side. For a long archival backlog, the page-image fan-out is often the more reliable route because each call is small, independent, and individually retryable.
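The 750 KB figure follows from base64 arithmetic: every 3 raw bytes become 4 characters, so the encoded payload is about 4/3 the size of the file. Below is a small sketch that checks a PDF against the cap before sending. Reading "1 MB" as 1,000,000 bytes is an assumption (the conservative one), and the helper names are hypothetical.

```python
import base64
import json

# Assumption: the "1 MB" body cap is read as 1,000,000 bytes (the conservative reading).
MAX_BODY_BYTES = 1_000_000


def base64_length(raw_bytes: int) -> int:
    """Length of the padded base64 encoding: 4 characters per 3-byte group."""
    return 4 * ((raw_bytes + 2) // 3)


def pdf_request_body(pdf: bytes, max_pages: int) -> bytes:
    """Build the JSON body for pdf-to-markdown, or raise if it would exceed the cap."""
    body = json.dumps(
        {"pdf_base64": base64.b64encode(pdf).decode("ascii"), "max_pages": max_pages}
    ).encode("utf-8")
    if len(body) > MAX_BODY_BYTES:
        raise ValueError(f"body is {len(body)} bytes; split the PDF or rasterize its pages")
    return body
```

A 750,000-byte PDF encodes to exactly 1,000,000 characters, so the surrounding JSON pushes it just over the cap. In practice the usable raw size is slightly under 750 KB.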

A realistic scenario

Consider an operations team at a logistics company drowning in proof-of-delivery slips. Drivers photograph signed slips on their phones; the images land in a bucket. Today a contractor retypes the reference number, date, and signature presence into the system by hand. It is slow, it is expensive, and it is error-prone at exactly the fields that matter for billing disputes.

The rebuild is straightforward. A worker pulls each new slip image, base64-encodes it, and calls /v1/convert/ocr with return_boxes set to true. The reference number and date sit in known regions, so the worker maps those boxes to fields and checks the confidence score. Above the threshold, the record auto-posts. Below it, the slip lands in a review queue with the image and the candidate values pre-filled, so the human confirms rather than retypes. The contractor's eight-hour day becomes a one-hour exception-handling shift, and the billing-dispute fields are now backed by a confidence number instead of a hope.

Feeding OCR output into RAG

Search indexing and RAG are the same shape of problem: text-only pipelines are blind to image-only documents until you OCR them. The recovered text from /v1/convert/ocr is just text — once you have it, it chunks and embeds exactly like a web page or a markdown file.

The one wrinkle worth planning for is quality gating before ingestion. OCR text from a poor scan can be noisy, and noisy chunks pollute a vector store and surface as bad retrievals later. Use the per-region confidence scores to drop or flag low-quality pages before they reach the embedder, the same way you would gate any other source. The "build a RAG knowledge base" recipe covers the chunk-and-embed side; treat OCR as the front door that lets scanned material in at all.
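One simple way to gate a page is to average its per-region confidence scores and compare the mean against a floor. This is a sketch under stated assumptions: the scores are taken to be floats between 0 and 1, the 0.80 floor is an arbitrary starting point, and page_passes_gate is a hypothetical name.

```python
from statistics import fmean
from typing import List


def page_passes_gate(confidences: List[float], min_mean: float = 0.80, min_regions: int = 1) -> bool:
    """Return True when a page's OCR output looks clean enough to embed."""
    if len(confidences) < min_regions:
        # Nothing recognized: flag the page rather than embed an empty chunk.
        return False
    return fmean(confidences) >= min_mean
```

A mean is a blunt instrument, since one long clean paragraph can mask a garbled table. If that matters for your corpus, gating on the lowest score or on the share of regions below a threshold are reasonable alternatives.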

What can go wrong, and how to handle it

A few failure modes are worth designing around up front.

Oversized images. The decoded image must be under 15 MB. High-resolution flatbed scans blow past that quickly. Downscale to a sane DPI before encoding — 300 DPI is plenty for text, and going higher mostly inflates the payload without improving recognition.

Wrong format for the input. Sending a PDF's bytes to the OCR endpoint will not work; it expects an image format. Route PDFs to /v1/convert/pdf-to-markdown or rasterize them first. Keep a small dispatcher in your pipeline that picks the endpoint by file type so this never happens at runtime.
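A dispatcher of that kind can be as small as an extension lookup. The sketch below uses the two endpoint paths and the image formats named on this page; the function name pick_endpoint and the specific extension spellings are assumptions.

```python
from pathlib import PurePath

OCR_ENDPOINT = "/v1/convert/ocr"
PDF_ENDPOINT = "/v1/convert/pdf-to-markdown"

# Common extensions for the formats the article lists: PNG, JPEG, WebP, BMP, TIFF.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp", ".bmp", ".tif", ".tiff"}


def pick_endpoint(filename: str) -> str:
    """Choose the endpoint path for a file based on its extension."""
    ext = PurePath(filename).suffix.lower()
    if ext in IMAGE_EXTENSIONS:
        return OCR_ENDPOINT
    if ext == ".pdf":
        return PDF_ENDPOINT
    # Fail loudly so an unsupported file never reaches either endpoint.
    raise ValueError(f"unsupported file type: {filename!r}")
```

Extensions can lie, so a stricter version would inspect the first few bytes of the file instead (a PDF starts with %PDF, for example). The extension check is usually enough when you control the upload path.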

Low-quality source. Skew, glare, low contrast, and motion blur all degrade recognition. The fix lives upstream of the API: deskew, threshold, and crop before encoding if you control the capture step. When you do not control it, lean on the confidence scores and the human review queue.

Transient failures. If a call fails, the credit is auto-refunded, so a retry costs you nothing but latency. Treat OCR calls as idempotent and safe to retry. A failed call returns a structured error payload your pipeline can branch on, rather than a silent bad result written into your database.
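Because retries are safe, a small backoff wrapper covers most transient failures. This is a generic sketch, not part of any Ollagraph client: call_with_retry is a hypothetical helper, and it retries on any exception for brevity. A production version should retry only on errors you know to be transient (timeouts, for example), based on the error payload described in the docs.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Wait base_delay, then 2x, then 4x, ... between attempts.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

The sleep parameter exists so tests can inject a fake and run instantly. In a pipeline you would wrap the function that sends the OCR request, for example call_with_retry(lambda: send(req)) where send is your own HTTP helper.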

The pricing math

Each OCR call is metered at one credit, and the exact cost is echoed back in every response. Failed calls auto-refund, so you only ever pay for text you actually received. New accounts start with a free credit grant, and pay-as-you-go is available with no monthly commitment — useful when your OCR volume is bursty, like a quarterly archival backlog rather than a steady daily stream. See the pricing page for the current credit packs.

Compared with standing up your own OCR stack, the managed path skips the GPU-or-CPU sizing question, the model updates, and the per-language tuning. You send pixels and get text. For the teams reading this page, the engineering time avoided is usually the larger line in the buy-versus-build math.

What to do next

Sign up for a key, base64 one image, and paste the curl command from Step 3 — you should have recognized text back in the next five minutes. Then add return_boxes and decide on a confidence threshold for your highest-stakes field. From there, pick your downstream: a search index, a RAG corpus, or an auto-entry queue, and wire the OCR call into it.

Browse the full capability list, read the docs for the exact response shape, and ship something.

Common questions

What is OCR via API?

OCR via API means you send an image — a photo, a screenshot, a scanned page — to a managed endpoint and get the text inside it back as plain characters your code can read. Optical character recognition turns pixels into strings. Ollagraph exposes this as POST /v1/convert/ocr: base64 the image, get text back, optionally with per-region bounding boxes and confidence scores.

What image formats does the OCR endpoint accept?

The endpoint accepts PNG, JPEG, WebP, BMP, and TIFF, supplied as base64-encoded bytes in the image_b64 field. The decoded image must be under 15 MB. Screenshots, photos of receipts and signage, and exported scan pages all fit cleanly into one of those formats.

How do I OCR a scanned PDF?

A PDF is not an image, so it does not go to the OCR endpoint directly. For scanned PDFs the right tool is POST /v1/convert/pdf-to-markdown, which extracts the document and flags image-only or scanned pages for OCR. If you have already rendered the pages to images, you can also send each page image to /v1/convert/ocr individually. Both paths are covered in the scanned PDFs section above.

Can I get bounding boxes for the text?

Yes. Set return_boxes to true in the request and the response includes per-region bounding boxes alongside the recognized text, plus a confidence score per region. Boxes are what you need for form-field mapping, redaction, highlight overlays, and any layout-aware data-entry workflow.

Is OCR accurate enough for production data entry?

Accuracy depends entirely on the input — a clean 300 DPI scan reads far better than a blurry phone photo of a crumpled receipt. That is why we return a confidence score per region when you ask for boxes: route low-confidence regions to a human review queue and auto-accept the high-confidence ones. We do not publish a single headline accuracy number because it would be meaningless across that range of inputs.

How much does an OCR call cost?

Each call is metered at one credit, and the exact cost is echoed back in every response. If a call fails, it is auto-refunded — you are never billed for an OCR request that did not return text. New accounts start with a free credit grant, and pay-as-you-go is available with no monthly commitment. See the pricing page for the current packs.

Do you store the images or text I send?

No. Ollagraph does not persist the content you send for OCR. The image is processed and the text is returned in the response; we do not keep a copy of either. Treat the response as the only copy and land it in your own store.

Can OCR feed a RAG pipeline?

That is one of the most common reasons teams reach for it. Scanned contracts, image-only PDFs, and screenshots are invisible to a text-only ingestion pipeline until you OCR them. Run the pixels through /v1/convert/ocr, then chunk and embed the recovered text like any other source. The RAG knowledge base recipe walks through the ingestion side.

Start with 1,000 free credits.

Every endpoint, one bearer token, no card. Build the pipeline above in an afternoon.
