What OCR via API actually means in 2026
OCR — optical character recognition — turns the text trapped inside pixels into characters your code can read. In 2026, the practical way to do it at scale is a managed API: you send a base64-encoded image, you get the recognized text back, optionally with bounding boxes and per-region confidence scores. Ollagraph exposes exactly that at POST /v1/convert/ocr.
That is the short answer. The rest of this page is for the engineers who have to wire it into a real pipeline (for search indexing, RAG ingestion, or data-entry automation) and want to know the request shape, the failure modes, and where OCR stops and document conversion begins.
OCR is not document conversion — know which one you need
This is the distinction that saves teams a wasted afternoon. There are two different problems hiding under the words "extract the text from this document," and they have two different endpoints.
If your input is a born-digital file — a real PDF, a Word doc, a slide deck where the text is already encoded as text — you do not need OCR at all. You need format conversion, which preserves headings, lists, and tables. That is the job of the markdown converters; the convert documents to markdown recipe covers that path end to end, and the markdown actor handles HTML pages.
If your input is pixels — a photo of a receipt, a screenshot of an error message, a fax, a flatbed scan of a signed contract, a TIFF from a records system — there is no text to extract, only an image of text. That is OCR, and that is this page. The tell is simple: if you can select the text with your cursor in the original, it is not OCR work. If you cannot, it is.
Scanned PDFs sit on the boundary, and we handle that case explicitly below.
The problem you are actually trying to solve
Nobody wants OCR for its own sake. You want the text so you can do something with it. The reader of this page usually falls into one of three buckets.
The first is search and indexing. You have a corpus of scanned documents — invoices, legal filings, archived correspondence — that is completely invisible to search because it is all images. OCR makes it findable.
The second is RAG ingestion. Your retrieval pipeline can only embed text it can read, and a stack of image-only PDFs is a hole in the corpus. OCR fills the hole so the recovered text gets chunked and embedded like any other source.
The third is data entry automation. Receipts, forms, ID documents, packing slips — a human used to retype these into a system. OCR plus bounding boxes lets you map recognized regions to fields and route only the uncertain ones to a person.
All three share a hidden requirement: the pipeline has to be quiet and the bill has to be predictable. That is the actual product.
The recipe, step by step
Here is the working playbook. Real curl commands, real request shapes. Drop these into a shell and you have a pipeline.
Step 1. Get an API key
Sign up on the pricing page, grab an API key from the dashboard, and export it for the rest of this session. Keys start with the prefix osk_ and authenticate every call.
export OLLAGRAPH_API_KEY="osk_xxxxxxxxxxxx"
Step 2. Base64 your image
The OCR endpoint takes the image bytes inline, base64-encoded, in the image_b64 field. It accepts PNG, JPEG, WebP, BMP, and TIFF, and the decoded image must be under 15 MB. Encode the file on your side first.
B64=$(base64 -w0 receipt.png)
On macOS the flag differs — use base64 -i receipt.png | tr -d '\n'. The point is the same: one string, no line breaks, ready to drop into the JSON body.
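If the encoding step lives in application code rather than a shell, a minimal Python sketch could look like the following. The helper name encode_image is hypothetical, the format list and size limit are copied from the paragraph above, and reading "15 MB" as 15 MiB is an assumption to check against the docs.

```python
import base64
from pathlib import Path

# Assumption: "15 MB" is read as 15 MiB; confirm the exact limit in the docs.
MAX_DECODED_BYTES = 15 * 1024 * 1024
# Extensions for the formats listed above: PNG, JPEG, WebP, BMP, TIFF.
ALLOWED_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp", ".bmp", ".tif", ".tiff"}


def encode_image(path: str) -> str:
    """Return the file's bytes as one base64 string with no line breaks."""
    p = Path(path)
    if p.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported image type: {p.suffix!r}")
    raw = p.read_bytes()
    if len(raw) >= MAX_DECODED_BYTES:
        raise ValueError(f"{p.name} is {len(raw)} bytes; limit is {MAX_DECODED_BYTES}")
    # b64encode never inserts newlines, so the result drops straight into JSON.
    return base64.b64encode(raw).decode("ascii")
```

Checking the size before encoding fails fast on oversized scans instead of spending a request on them.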
Step 3. OCR a single image
Start with one image to verify the pipeline end to end. The minimal call sends just the base64 string and gets the recognized text back.
curl -X POST https://api.ollagraph.com/v1/convert/ocr \
-H "Authorization: Bearer $OLLAGRAPH_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"image_b64": "'"$B64"'"
}'
The response carries the recognized text. The exact shape of the response object is documented in the live API spec in the docs; read the field names there rather than trusting a hand-typed copy on a blog, including this one. Store the text and you have OCR working in one call.
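The same call can be sketched in Python with only the standard library. The function names build_ocr_request and ocr_image are hypothetical. The URL, headers, and the image_b64 and return_boxes fields come from this page, and the response is returned as a plain dict without assuming any field names. Sending the body from code also sidesteps the shell argument-length limits that a large inline base64 string can hit.

```python
import json
import os
import urllib.request

OCR_URL = "https://api.ollagraph.com/v1/convert/ocr"


def build_ocr_request(
    image_b64: str, api_key: str, return_boxes: bool = False
) -> urllib.request.Request:
    """Build the POST request without sending it, so it is easy to inspect."""
    body = {"image_b64": image_b64}
    if return_boxes:
        body["return_boxes"] = True  # optional flag, covered in Step 4
    return urllib.request.Request(
        OCR_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def ocr_image(image_b64: str, timeout: float = 60.0) -> dict:
    """Send one image and return the parsed JSON response as-is."""
    req = build_ocr_request(image_b64, os.environ["OLLAGRAPH_API_KEY"])
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Splitting request construction from sending keeps the network call in one small function and lets you unit-test the body and headers offline.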
Step 4. Ask for bounding boxes
Plain text is enough for search and RAG. Data entry needs more: you need to know where each piece of text sat on the page so you can map it to a field. Set return_boxes to true and the response adds per-region bounding boxes and a confidence score for each region.
curl -X POST https://api.ollagraph.com/v1/convert/ocr \
-H "Authorization: Bearer $OLLAGRAPH_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"image_b64": "'"$B64"'",
"return_boxes": true
}'
Boxes unlock the workflows that plain text cannot. Map a region's coordinates to a known form layout to auto-fill fields. Use the box geometry to draw highlight overlays in a review UI. Use the coordinates to redact a region before you persist anything. And use the confidence score as a routing signal — which brings us to the part most teams get wrong.
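To make the field-mapping idea concrete, here is a minimal sketch. The region shape (a dict with text, confidence, and a box holding x, y, w, h in pixels) is an assumption for illustration only, as are the LAYOUT coordinates and the function name; take the real field names from the docs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x: float
    y: float
    w: float
    h: float

    def contains(self, px: float, py: float) -> bool:
        return self.x <= px <= self.x + self.w and self.y <= py <= self.y + self.h


# Hypothetical layout for one form type, in the same pixel space as the boxes.
LAYOUT = {
    "reference_number": Rect(40, 30, 300, 60),
    "date": Rect(400, 30, 200, 60),
}


def map_regions_to_fields(regions: list[dict], layout: dict[str, Rect]) -> dict[str, dict]:
    """Assign each region to the layout field that contains its center point."""
    fields: dict[str, dict] = {}
    for region in regions:
        box = region["box"]  # assumed shape: {"x", "y", "w", "h"}
        cx = box["x"] + box["w"] / 2
        cy = box["y"] + box["h"] / 2
        for name, rect in layout.items():
            if rect.contains(cx, cy):
                # Keep the highest-confidence region if several land in one field.
                best = fields.get(name)
                if best is None or region["confidence"] > best["confidence"]:
                    fields[name] = region
                break
    return fields
```

Matching on the box center rather than full containment tolerates small shifts between captures. Photos with heavy skew or varying scale would need alignment first.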
Step 5. Route on confidence, not on faith
This is the step that separates a demo from a production data-entry system. OCR accuracy is a function of input quality. A crisp 300 DPI scan of a printed invoice reads almost perfectly. A blurry phone photo of a crumpled, faded thermal receipt reads badly, and no engine on earth changes that. That is why we return a confidence score per region instead of pretending every result is equally trustworthy.
The pattern that works: set a confidence threshold per field type, auto-accept regions above it, and queue regions below it for human review. If, say, 5 percent of fields fall below the threshold, one person reviewing that 5 percent is far cheaper than a person checking every field, and far safer than a pipeline that trusts everything. Tune the threshold per use case: a dollar amount on an invoice deserves a higher bar than a free-text note.
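The routing pattern can be sketched in a few lines. The threshold values, the 0.0 to 1.0 confidence scale, and the per-field dict shape (text plus confidence) are all assumptions for illustration; tune the numbers against your own labeled samples.

```python
# Hypothetical per-field thresholds on an assumed 0.0-1.0 confidence scale.
THRESHOLDS = {"amount": 0.98, "reference_number": 0.95, "note": 0.80}
DEFAULT_THRESHOLD = 0.90


def route_fields(fields: dict[str, dict]) -> tuple[dict[str, str], dict[str, dict]]:
    """Split fields into auto-accepted values and a human review queue."""
    accepted: dict[str, str] = {}
    review: dict[str, dict] = {}
    for name, region in fields.items():
        threshold = THRESHOLDS.get(name, DEFAULT_THRESHOLD)
        if region["confidence"] >= threshold:
            accepted[name] = region["text"]
        else:
            # Keep the whole region so the review UI can pre-fill the candidate.
            review[name] = region
    return accepted, review
```

A record would auto-post only when the review dict comes back empty; otherwise it goes to the queue with the accepted values already filled in.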
Scanned PDFs: the boundary case
A PDF is a document, not an image, so it does not go to the OCR endpoint directly. There are two clean paths, depending on whether you start from the PDF itself or from page images you have already rendered.
For most PDFs — including image-only and scanned ones — start with POST /v1/convert/pdf-to-markdown. It extracts the document page by page and flags image-only or scanned pages for OCR, so it both handles born-digital PDFs and tells you where the pixels are. Supply the PDF as base64 in the pdf_base64 field, and cap the work with max_pages if you only need the front matter.
curl -X POST https://api.ollagraph.com/v1/convert/pdf-to-markdown \
-H "Authorization: Bearer $OLLAGRAPH_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"pdf_base64": "'"$(base64 -w0 contract.pdf)"'",
"max_pages": 50
}'
The request body has a 1 MB cap. Because base64 encoding inflates data by about a third, that works out to roughly 750 KB of raw PDF, so for large scans, split the file or render the pages to images first. That leads to the second path: if you have already rasterized the pages (a common output of any scanning workflow), send each page image to /v1/convert/ocr individually and stitch the recovered text back together in page order on your side. For a long archival backlog, the page-image fan-out is often the more reliable route because each call is small, independent, and individually retryable.
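The size arithmetic is worth checking in code before you send anything. In this sketch, reading "1 MB" as 1,000,000 bytes and reserving 64 bytes for the JSON keys and punctuation are both assumptions, and all function names are hypothetical. The base64 length formula itself is standard: every 3 raw bytes become 4 encoded characters, with padding.

```python
# Assumption: "1 MB" means 1,000,000 bytes. Confirm the exact cap in the docs.
BODY_CAP_BYTES = 1_000_000
# Assumption: bytes reserved for the JSON keys, quotes, and max_pages value.
ENVELOPE_BYTES = 64


def base64_len(raw_len: int) -> int:
    """Length of the padded base64 encoding of raw_len bytes."""
    return 4 * ((raw_len + 2) // 3)


def fits_in_body(raw_len: int, cap: int = BODY_CAP_BYTES) -> bool:
    """True if a raw file of this size fits in one request once encoded."""
    return base64_len(raw_len) + ENVELOPE_BYTES <= cap


def max_raw_bytes(cap: int = BODY_CAP_BYTES) -> int:
    """Largest raw file size that still fits once encoded."""
    return ((cap - ENVELOPE_BYTES) // 4) * 3


def stitch_pages(text_by_page: dict[int, str]) -> str:
    """Join per-page OCR text in page order, regardless of completion order."""
    return "\n\n".join(text_by_page[n] for n in sorted(text_by_page))
```

Under these assumptions max_raw_bytes() comes out just under 750 KB, which matches the rule of thumb above. Files that fail fits_in_body are candidates for the page-image fan-out, with stitch_pages reassembling the results.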
A realistic scenario
Consider an operations team at a logistics company drowning in proof-of-delivery slips. Drivers photograph signed slips on their phones; the images land in a bucket. Today a contractor retypes the reference number, date, and signature presence into the system by hand. It is slow, it is expensive, and it is error-prone at exactly the fields that matter for billing disputes.
The rebuild is straightforward. A worker pulls each new slip image, base64-encodes it, and calls /v1/convert/ocr with return_boxes set to true. The reference number and date sit in known regions, so the worker maps those boxes to fields and checks the confidence score. Above the threshold, the record auto-posts. Below it, the slip lands in a review queue with the image and the candidate values pre-filled, so the human confirms rather than retypes. The contractor's eight-hour day becomes a one-hour exception-handling shift, and the billing-dispute fields are now backed by a confidence number instead of a hope.
Feeding OCR output into RAG
Search indexing and RAG are the same shape of problem: text-only pipelines are blind to image-only documents until you OCR them. The recovered text from /v1/convert/ocr is just text — once you have it, it chunks and embeds exactly like a web page or a markdown file.
The one wrinkle worth planning for is quality gating before ingestion. OCR text from a poor scan can be noisy, and noisy chunks pollute a vector store and surface as bad retrievals later. Use the per-region confidence scores to drop or flag low-quality pages before they reach the embedder, the same way you would gate any other source. The build a RAG knowledge base recipe covers the chunk-and-embed side; treat OCR as the front door that lets scanned material in at all.
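One way to gate pages before they reach the embedder is a character-weighted average of region confidence per page. This is a sketch under assumptions: the region shape (text and confidence keys), the 0.0 to 1.0 scale, and the 0.85 cutoff are all hypothetical, and the weighting scheme is one reasonable choice rather than anything the API prescribes.

```python
def page_confidence(regions: list[dict]) -> float:
    """Character-weighted mean confidence; 0.0 for a page with no text."""
    total_chars = sum(len(r["text"]) for r in regions)
    if total_chars == 0:
        return 0.0
    weighted = sum(r["confidence"] * len(r["text"]) for r in regions)
    return weighted / total_chars


def gate_pages(
    pages: dict[int, list[dict]], min_confidence: float = 0.85
) -> tuple[dict[int, str], list[int]]:
    """Return (text of pages to ingest, page numbers flagged for review)."""
    keep: dict[int, str] = {}
    flagged: list[int] = []
    for page_no in sorted(pages):
        regions = pages[page_no]
        if page_confidence(regions) >= min_confidence:
            keep[page_no] = "\n".join(r["text"] for r in regions)
        else:
            flagged.append(page_no)
    return keep, flagged
```

Weighting by text length keeps one short, low-confidence stamp from sinking an otherwise clean page, while a page of mostly garbled body text still gets flagged.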
What can go wrong, and how to handle it
A few failure modes are worth designing around up front.
Oversized images. The decoded image must be under 15 MB. High-resolution flatbed scans blow past that quickly. Downscale to a sane DPI before encoding — 300 DPI is plenty for text, and going higher mostly inflates the payload without improving recognition.
Wrong format for the input. Sending a PDF's bytes to the OCR endpoint will not work; it expects an image format. Route PDFs to /v1/convert/pdf-to-markdown or rasterize them first. Keep a small dispatcher in your pipeline that picks the endpoint by file type so this never happens at runtime.
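A dispatcher like that can key off the file's leading bytes instead of trusting the extension. The sketch below uses the well-known signatures for PDF and the five image formats listed earlier; the function names are hypothetical, and the endpoint paths are the two from this page.

```python
OCR_PATH = "/v1/convert/ocr"
PDF_PATH = "/v1/convert/pdf-to-markdown"


def sniff_kind(head: bytes) -> str:
    """Identify a file from its leading bytes (magic numbers)."""
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if head.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if head[:4] == b"RIFF" and head[8:12] == b"WEBP":
        return "webp"
    if head.startswith((b"II*\x00", b"MM\x00*")):
        return "tiff"
    if head.startswith(b"BM"):
        return "bmp"  # two-byte signature, so this is the weakest check
    return "unknown"


def pick_endpoint(data: bytes) -> str:
    """Choose the endpoint path for a file, or raise if it is unrecognized."""
    kind = sniff_kind(data[:16])
    if kind == "pdf":
        return PDF_PATH
    if kind == "unknown":
        raise ValueError("unrecognized file type; refusing to guess an endpoint")
    return OCR_PATH
```

Raising on unknown input keeps a mislabeled upload from turning into a failed API call. Some PDFs in the wild carry a few junk bytes before the %PDF- marker, so a production version might scan a slightly larger prefix.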
Low-quality source. Skew, glare, low contrast, and motion blur all degrade recognition. The fix lives upstream of the API: deskew, threshold, and crop before encoding if you control the capture step. When you do not control it, lean on the confidence scores and the human review queue.
Transient failures. If a call fails, the credit is auto-refunded, so a retry costs you nothing but latency. Treat OCR calls as idempotent and safe to retry. A failed call returns a structured error payload your pipeline can branch on, rather than a silent bad result written into your database.
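A retry wrapper for the transient case might look like this. It is a generic sketch: TransientError is a hypothetical exception your own client code would raise for timeouts or retryable status codes, and the attempt count and delays are placeholder values, not recommendations from the API.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, such as timeouts."""


def with_retries(
    call: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on TransientError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Full jitter spreads retries out so workers do not retry in lockstep.
            sleep(random.uniform(0, base_delay * 2**attempt))
    raise AssertionError("unreachable")
```

Only errors you have classified as transient are retried; anything else, such as a rejected format or an oversized image, propagates immediately because retrying it would fail the same way.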
The pricing math
Each OCR call is metered at one credit, and the exact cost is echoed back in every response. Failed calls auto-refund, so you only ever pay for text you actually received. New accounts start with a free credit grant, and pay-as-you-go is available with no monthly commitment — useful when your OCR volume is bursty, like a quarterly archival backlog rather than a steady daily stream. See the pricing page for the current credit packs.
Compared with standing up your own OCR stack, the managed path skips the GPU-or-CPU sizing question, the model updates, and the per-language tuning. You send pixels and get text. For the teams reading this page, the engineering time avoided is usually the larger line in the buy-versus-build math.
What to do next
Sign up for a key, base64 one image, and paste the curl command from Step 3; you should have recognized text back within five minutes. Then add return_boxes and decide on a confidence threshold for your highest-stakes field. From there, pick your downstream: a search index, a RAG corpus, or an auto-entry queue, and wire the OCR call into it.
Browse the full capability list, read the docs for the exact response shape, and ship something.