Why AI Can't Read Your Scans — And How We Fixed It

⚡ Quick Summary

A scan is a photo. AI reads text, not images
Rotated pages, low quality, and merged text are the three main reasons indexing fails
We taught the service to automatically detect orientation and recognize text via Vision OCR
Now the client uploads any scan — the service figures it out on its own
But there's a limit: output quality depends on input quality

📚 Table of Contents

How it started: the client sent 20 pages
Why a scan is not a document for AI
Three problems we found in a single file
How we taught the service to read scans automatically
What this means for business
The limits of the technology: when even Vision OCR doesn't help
Conclusion: no guarantees, but honesty

How it started: the client sent a test package

A few weeks ago I was contacted by a lawyer specializing in construction law — he works with an archive of 10,000+ files and was looking for a solution to quickly search for information across documents. After a brief conversation he sent a test package: 21 pages from his archive.

I uploaded the file, started indexing, and prepared a list of 18 test questions — the kind whose answers were definitely in the document and which I could verify manually. The result was unexpected:

Answer Category	Count	%
Correct answers	3	17%
Inaccurate or partial	7	39%
Confident answers with non-existent facts	8	44%

An important clarification: these numbers are not a measure of the service, but of the input document quality. The same pipeline on a clean text PDF delivers 95–99% accuracy. The problem was not in the model or the search logic — but in the fact that most pages entered the index as unreadable garbage before AI even had a chance to answer. That is exactly what this case study is about.

But there was also a second problem — behavioral: the service never said "I don't know". It answered confidently, referenced a specific fragment — but 44% of answers contained numbers and facts that did not exist in the document. For legal documents where every number matters — this is critical.

I began investigating exactly where quality was being lost. The first suspicion — the model or search logic. But the problem turned out to be earlier in the pipeline:

Scanned PDF
  ↓
OCR                    // noise injection layer
  ↓
Chunking               // text already corrupted
  ↓
Embeddings             // semantic distortion
  ↓
Vector DB              // irreversible error propagation

Out of 21 pages, only 5–6 were indexed properly. The rest entered the database as garbage — and it was from this garbage that AI built confident answers. Below I break down exactly what went wrong and how we fixed it.

Why a scan is not a document for AI

The first thing I did after receiving the file was open it and try to select text with the cursor. It didn't work. The document turned out to be a scan: paper pages saved in a PDF wrapper with no text layer inside whatsoever.

Most people don't understand the difference between a "scanned PDF" and a "text PDF." They look the same. They open the same way. But for an AI system — these are fundamentally different things:

Characteristic	Text PDF	Scanned PDF
What's inside	Text layer — characters, words, sentences	Raster image — a set of pixels
Simple test	You can select a word with the cursor	Text cannot be selected
What AI sees	Text — reads it directly	An image — OCR is required
Indexing accuracy	95–99%	Depends on scan quality and OCR method
Who handles OCR	—	The service automatically, upon upload

For more detail on preparing different document types for indexing — see the article How to Prepare Documents for an AI Assistant 2026 and the overview OCR in Modern AI Systems: From Scanned Documents to RAG.

The solution seems simple: run OCR and upload the result. But that's when an unexpected problem emerged: most pages had been scanned at an angle — 90°, 180°, or 270° relative to the normal text orientation. This is a typical situation for old archives where documents were scanned in batches without checking the orientation of each sheet.

When standard OCR encounters a rotated page — it either returns an empty result, or produces unreadable garbage that looks like text from the outside. The system was receiving lines like:

аМЫМ "9a18 40 S¥3IAVT ONIHLY3HS N33ML3E

And was building answers based on them. AI didn't know this was garbage — it simply used whatever it received from the index.

This is not an edge case. According to analyst estimates, the intelligent document processing market is growing at 30%+ per year — precisely because most companies face this problem at scale: archives where a significant portion of files is either unreadable for AI, or read with critical errors due to incorrect orientation and poor scan quality.

Three problems we found in a single file

To understand exactly where quality was being lost, I ran an SQL query against the database and looked at the contents of the chunks that actually made it into the index. The picture was clear: the document contained three independent problems — and each one alone is enough to ruin the result.

Problem	What happens in the index	Impact on AI answers
Rotated pages (90°/180°/270°)	OCR returns unreadable garbage or an empty result	AI builds answers from non-existent data
Low scan quality (<300 DPI)	Characters are misread: "28.28" → "28.23" or "2B.28"	Wrong numbers in answers even when retrieval works correctly
Merged text without spaces	Vision OCR returns "[IMAGE - no text]", page drops out of the index	AI can't find answers or fabricates them from general knowledge

Problem 1: Rotated pages

Most pages were scanned at angles of 90°, 180°, and 270° — a typical situation for batch scanning of paper archives where the operator places sheets in a stack without checking the orientation of each one.

Standard OCR in this case produces garbage that looks like text from the outside:

аМЫМ "9a18 40 S¥3IAVT ONIHLY3HS N33ML3E

Vector search found these chunks as relevant — they formally existed in the database. AI received them as context and synthesized an answer. Result: confident answers with numbers that were never in the document.

Problem 2: Low scan quality

Some pages were scanned under poor lighting or with uneven contrast. According to OCR system technical standards, recognition accuracy at 300+ DPI is 98–99% — and drops significantly below this threshold. This applies to any OCR solution, including Vision AI models.

An error in a single character of a number completely changes the answer. If the document says "28.28 perms" but OCR read "28.23" — AI will answer incorrectly even if the search logic worked perfectly.

Problem 3: Merged text without spaces

Several pages contained a standard confidential notice where, due to printing or scanning artifacts, all words merged into a single continuous string:

containconfidentialinformation,proprietary,and/orprivilegedmaterial

GPT-4o-mini does not recognize such text as readable and returns "[IMAGE - no text]." The page drops out of the index entirely. These pages contained specifications and technical characteristics — exactly what the client wanted to search for.

Diagnostic summary:

Metric	Result
Total pages in document	21
Successfully indexed	5–6 (26–29%)
Indexed as garbage	~10 pages
Completely skipped	~5 pages
Answer accuracy on test questions	17%

These numbers characterize this specific problematic document, not the service. The same pipeline on a clean text PDF delivers 95–99%. But exactly these kinds of "difficult" archives are the norm for most companies that have been accumulating documents for years without scanning standards.

How we taught the service to read scans automatically

After the diagnosis it became clear: the problem had to be solved not at the level of document preparation by the client, but at the level of the service itself. The client shouldn't have to think about the angle at which their archive was scanned. They simply upload a file — and get a result.

I broke the solution down into four sequential steps. Here is what the updated pipeline looks like:

File upload
  ↓
Text extraction (Apache Tika)
  ↓
Garbage detector            // >40% ALL CAPS → scan
  ↓ (if garbage)
Vision OCR (GPT-4o-mini)   // reads the page as an image
  ↓ (if [IMAGE - no text])
Auto orientation correction // 90° → 180° → 270° → retry
  ↓
Chunking → Embeddings → Vector DB
  ↓
Response with strict prompt  // "I don't know" instead of hallucinations

Step 1: garbage detector

The standard parser (Apache Tika) extracted text from PDFs but could not distinguish normal text from scan garbage. Both looked like "there is text" — and both went into the index.

I added a simple detector: the service analyzes the first 1,000 characters of the extracted text and counts the share of words written entirely in uppercase.

Text type	Share of ALL CAPS words	Decision
Normal business text	10–20% (abbreviations, headings)	Standard indexing
Garbage from a rotated scan	50–70%+	→ Vision OCR pipeline

Trigger threshold: >40% ALL CAPS → the document is flagged as a scan and passed further along. This filters out the primary cause of hallucinations before indexing.

Step 2: Vision OCR via GPT-4o-mini

Standard OCR reads an image as a set of shapes — it doesn't understand context and cannot reconstruct a table structure if the scan is blurry. GPT-4o-mini receives a page as an image and understands it like a human: it sees a table as a table, a column of numbers as a column of numbers.

A prompt with specific rules for technical and legal documents:

- Extract all text without omissions
- Preserve table structure using the "|" separator
- Empty cell → dash
- Numbers exactly as written, without rounding
- If the page contains only an image → "[IMAGE - no text]"

Without clear rules the model could round numbers or merge columns — the answer would look plausible but would be inaccurate.

Step 3: automatic orientation correction

If on the first pass a page returned "[IMAGE - no text]" — the service automatically runs three additional attempts:

0° → [IMAGE - no text]
  ↓ auto-rotate
90° → [IMAGE - no text]
  ↓ auto-rotate
180° → readable text ✓ → saved

As soon as the model finds readable text — we stop. If none of the four angles produced a result — the page is marked as an image and skipped without an error.

Speed is the only real trade-off: each rotated page makes up to 4 API requests instead of one. For a 21-page document — about 5 minutes. But this is a one-time operation during upload — after indexing, answers come instantly.

Step 4: honesty instead of hallucinations

The technical steps addressed indexing quality. But a behavioral problem remained: when no relevant fragments were found — the model filled the gaps from general knowledge instead of saying "I don't know."

I updated the system prompt and added a strict rule:

If the retrieved fragments contain no clear answer →
  "No precise information on this question was found in the document.
   Try rephrasing or specify the section of the document."

Banned hallucination-indicator phrases:
  - "usually"
  - "as a rule"
  - "typical for such documents"

If these phrases appear in a response — it almost always means the model is answering from its training data, not from the document.

Results after all the changes

Metric	Before	After	What changed
Answer accuracy	17%	50%	Vision OCR + auto-rotation
Hallucinated answers	44%	~0%	Strict system prompt
Document processing time	~30 sec	~5 min	Price for quality, one-time
Pages indexed	5–6 out of 21	14–16 out of 21	Garbage detector + auto-rotation

A reminder of the context: these numbers are results on a low-quality problematic scan. The same pipeline on a clean text PDF delivers 95–99%. The main achievement here is not the accuracy increase from 17% to 50%, but the complete elimination of confidently wrong answers. For legal documents this matters more than any other metric.

What this means for business

After all the changes, the client no longer thinks about file format or page orientation. They simply upload a document — the service figures out what to do with it on its own: determines whether Vision OCR is needed, corrects orientation, indexes the result. This is a one-time operation. After that — standard document search with no additional costs.

Which industries this applies to

The problem of scanned archives is not unique to lawyers. It exists in any industry where documents have accumulated over the years in paper form — and fast search is now needed.

Industry	Typical documents	What Vision OCR provides	Without Vision OCR
Legal	Contracts, court rulings, orders, minutes	Search across scanned archives without manual conversion	17–30% accuracy on old archives
Medicine	Patient records, protocols, regulations	Indexing paper records without preparation	Missed pages, incorrect dosages in answers
Construction	Work completion acts, specifications, project documentation	Search for technical parameters from scans	Numbers in specifications read with errors
Distribution / logistics	Invoices, customs declarations, compliance certificates	Automated processing of incoming documents	Manual review of every scan before upload
Franchising	Standards, instructions, internal regulations	Single knowledge base instantly available to the entire network	Different document versions at different network locations

What all these scenarios have in common: the documents already exist, there are many of them, and they vary in quality. Nobody is going to rescan an archive just to implement AI. That's exactly why the service must be able to work with what's there — not demand perfect files as input.

How much Vision OCR costs: a real calculation

The key thing to understand: Vision OCR is a one-time cost at upload. Once a document is indexed, every question asked about it requires no additional OCR costs. You pay once for processing — and then search as many times as you want.

The cost is simple to calculate. GPT-4o-mini via OpenRouter costs $0.15 per million input tokens and $0.60 per million output tokens. One A4 page in Vision OCR is approximately 1,500–2,000 input tokens (image) and 300–600 output tokens (extracted text).

Cost of processing one page:

Scenario	API requests	Cost
Page reads on the first try	1	~$0.0003–0.0005
Page is rotated — auto-rotation needed	up to 4	~$0.001–0.002

Even in the worst case — when every page is rotated and requires four attempts — processing cost stays below half a cent per page.

Archive processing cost (calculated by pages):

Archive size	Normal scans	Problematic scans (auto-rotation)	Manual processing ($15/hr, ~3 min/page)
1,000 pages	$0.30–0.50	$1–2	~$750
10,000 pages	$3–5	$10–20	~$7,500
50,000 pages	$15–25	$50–100	~$37,500
100,000 pages	$30–50	$100–200	~$75,000

For the lawyer client whose story started all this — an archive of 10,000+ files where the average document is 15–20 pages — that's roughly 150,000–200,000 pages. Vision OCR processing cost: $150–400 depending on scan quality. Manual processing of the same volume at $15/hr would take years and cost hundreds of thousands of dollars.

An important note: these figures represent the cost of the OCR API call itself. The cost of the service subscription and data storage is added on top. But even accounting for that — automated processing is two orders of magnitude cheaper than manual.

What Vision OCR does not replace

Automated processing lowers the barrier to entry — but does not eliminate the need for human oversight where it is critical. For medical and legal documents where every number has legal consequences — Vision OCR speeds up the work, but does not replace human verification.

The right usage model: AI finds the needed fragment in seconds, the lawyer or doctor verifies the answer in the original document. This is not "AI instead of a human" — it's "AI instead of an hour of manual searching."

The limits of the technology: when even Vision OCR doesn't help

Vision OCR is not magic. It reads what it sees. Input image quality directly determines output quality — and no model can recover information that is physically impossible to read.

Even after all the improvements, pages remained in the test document that the service could not read properly. Here is the real picture by document type:

Document type	OCR accuracy	Practical consequence	Recommendation
Clear print, 300+ DPI	98–99%	Suitable for full automation	Upload without preparation
Standard scan, 200–300 DPI	90–97%	Acceptable for most tasks	Spot-check numerical data
Low quality, uneven contrast	60–70%	Critical for numerical data	Rescan or verify manually
Merged text without spaces	~0%	Page drops out of the index entirely	Manual entry only, or rescan
Handwritten text	50–80%	Unstable, depends on handwriting	Human verification mandatory

A 30% error rate in a technical or legal document is not "slightly inaccurate." It means wrong numbers in specifications, wrong amounts in contracts, wrong dosages in medical protocols. That's why input scan quality is not a technical detail, but a business decision.

GPT-4o-mini vs GPT-4o: when to upgrade

GPT-4o-mini is the model I use by default for Vision OCR. It's fast and economical, but has limits on complex documents. GPT-4o delivers noticeably better results — but costs 5–10 times more to run.

Scenario	GPT-4o-mini	GPT-4o
Standard business documents	✓ Sufficient	Overkill
Complex tables, technical specifications	Partial results	✓ Better
Old documents, non-standard fonts	Up to 10% character error rate	✓ Up to 3% character error rate
Processing cost (relative)	1×	5–10×

Character Error Rate (CER) — the share of characters the model read incorrectly. A CER of 10% means roughly every tenth word contains an error. For a contract stating "28.28 sq.m" — this could become "28.23 sq.m" or "2B.28 sq.m." Legally — different documents.

My approach: start with GPT-4o-mini across the entire archive. Pages where mini returned "[IMAGE - no text]" or clearly unreadable results — reprocess through GPT-4o. This balances quality and cost: 90–95% of pages are processed cheaply, difficult cases — accurately.

How to test an archive before scaling

The most costly mistake when deploying RAG on scanned archives is uploading all documents at once and discovering problems only afterward. The right approach: measure on a small sample first, then scale.

Here is the step-by-step process I recommend to clients before full deployment:

Step 1: a representative sample — 15–20 documents.
Don't pick your best files — pick typical ones. An archive usually contains several quality tiers: fresh scans from an office scanner, old scans from the 1990s, smartphone photos of documents, faxes. The sample should include 3–5 documents of each type. If the archive contains handwritten documents or documents with stamps — they should be in the sample too.

Step 2: questions with known answers — 20–30 of them.
Not general ones ("what is this document about"), but specific and verifiable:

"What is the total amount under the contract dated 12.03.2021?"
"What is the deadline for work completion stated in act No. 47?"
"What is the contractor's surname in the specification on page 4?"

These are the questions that reveal real accuracy — especially on numbers and proper names where OCR errors are most critical.

Step 3: measure three metrics, not two.
Most tests only count "correct / incorrect." That's not enough. It's important to distinguish three answer categories:

Category	What it means technically	What to do next
✓ Correct answer	Page indexed correctly	Scale this document type
○ "Information not found"	Page skipped or not indexed	Check scan quality, rescan, or upgrade to GPT-4o
✗ Confidently wrong answer	Garbage entered the index as text	Tune the garbage detector or system prompt

The third category is the most dangerous. "Not found" is an acceptable system response. A confidently wrong answer is a problem that must be solved before scaling.

Step 4: action based on results.
After testing, three scenarios are possible:

Test result	Recommendation
Accuracy satisfactory, no hallucinations	Scale to the full archive with GPT-4o-mini
Some pages are being skipped	Reprocess problematic pages through GPT-4o
Many confidently wrong answers	Rescan problematic documents at 300+ DPI before uploading

This is exactly what I advised the lawyer client whose story started all this. Not "upload all 10,000 files and we'll see" — but "test on 20 representative files and measure accuracy on questions that matter specifically for your practice." That's the only way to understand the real suitability of an archive before investing time and budget in full deployment.

For more detail on preparing documents of various formats for indexing — see the article How to Prepare Documents for an AI Assistant: Formats, OCR, Checklist 2026. On how OCR errors affect the entire RAG pipeline — see OCR in Modern AI Systems: From Scanned Documents to RAG.

Conclusion: no guarantees, but honesty

After all the changes were complete, I wrote the client an honest report: the documents are complex, some pages are rotated, some are skipped due to merged text. Here is what was achieved on that same problematic file:

Metric	Before	After	What changed
Answer accuracy	17%	50%	Vision OCR + auto-rotation
Hallucinated answers	44%	~0%	Strict system prompt
Pages indexed	5–6 out of 21	14–16 out of 21	Garbage detector + auto-rotation
Document processing time	~30 sec	~5 min	Price for quality, one-time

I didn't promise the client that everything would work perfectly. Context matters: 17% at the input is not a verdict on the service, it's a verdict on the quality of the input document. The same pipeline on a clean text PDF delivers 95–99%. But for a real archive of mixed-quality scans — 50% accuracy and zero hallucinations is already a tool you can actually use. And that kind of honesty, in my view, is exactly the right approach to working with AI in business.

Four takeaways from this case

1. A scan is not a document for AI by default.
Most companies discover this only after uploading their entire archive. Run a simple test: open a file and try to select text with the cursor. If you can't — it's a scan and it requires OCR before indexing. That's not a problem — it's a starting point.

2. Hallucinations are more dangerous than "I don't know."
A service that honestly says "no information found in the document" is far more useful than one that confidently delivers wrong numbers. Especially in legal, medical, and technical fields where accuracy is critical. A few banned phrases in the system prompt and a strict "if you don't know — say so" rule fundamentally change the model's behavior. It's not complicated — but most implementations don't do it.

3. Quality in determines quality out — and that's not a metaphor.
Companies invest months tuning the pipeline, choosing an embedding model, optimizing the chunking strategy — and get poor results because 40% of the archive consists of low-quality scans. Check the quality of your input documents before you start optimizing everything else.

4. A test with real questions is the only honest benchmark.
Not the number of indexed documents, not a demo on perfect files. 20–30 specific questions whose answers you know precisely — and three result categories: correct / "not found" / confidently wrong. This simple methodology tells you more than any marketing benchmark.

How I now start the conversation with a client

This case changed how I present the service. Before, I started with capabilities: what the system can do, which models it uses, what the architecture is. Now I start with one question:

"Show me 5 documents from your archive and tell me what questions you want to ask."

Five documents and twenty minutes — and it's already clear whether the technology fits the specific archive. If the documents are readable and the questions are concrete — we can get started. If not — it's better to say so honestly at the outset than after a month of work.

If you have an archive of scanned documents and want to understand whether AI search is right for your scenario — write on Telegram or try the live demo on the homepage. The first test is free.

📖 Read also:

Vision OCR for Business: Reading Rotated Scans with AI 2026