Question 1

Why does the page take a few seconds to start the first time?

Accepted Answer

Recognition uses Tesseract.js, an open-source OCR engine compiled to WebAssembly. The engine bundle is about 10 MB and loads only the first time you click Recognize; if you never use the tool, it never downloads. After the first run, the browser caches the engine, so subsequent files start much faster. We deliberately don't load it upfront because most visitors don't end up running OCR.
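The "load on first click" behaviour described above can be sketched as a small memoized loader. This is a minimal illustration, not the page's actual code: `createLazyLoader` and `getEngine` are hypothetical names, and the stub object stands in for whatever the real Tesseract.js import would return.

```typescript
// Hypothetical helper: wraps an expensive async loader so it runs at most once.
function createLazyLoader<T>(load: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (cached === undefined) {
      // First call: start the download and remember the in-flight promise.
      cached = load().catch((err) => {
        // On failure, clear the cache so a later click can retry.
        cached = undefined;
        throw err;
      });
    }
    return cached;
  };
}

// Stand-in for the real engine download (assumption: the actual page would
// import Tesseract.js here instead of returning a stub object).
let downloads = 0;
const getEngine = createLazyLoader(async () => {
  downloads += 1;
  return { name: "stub-ocr-engine" };
});
```

A click handler would call `await getEngine()` each time; only the first call triggers the load. This covers reuse within one page session. The faster start on later visits comes from the browser's own cache, which this sketch does not model.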

Question 2

What languages are supported?

Accepted Answer

Twelve languages cover most users: English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Chinese (Simplified), Japanese, Korean, and Arabic. Pick the language from the dropdown in the toolbar after you load a PDF. Two mixed packs are also available (English + Spanish and English + French) for documents that switch between languages on the same page. Each language pack is a separate Tesseract traineddata file (roughly 10–30 MB) that downloads the first time you select it and is then cached by your browser. The underlying Tesseract engine supports 100+ languages, so if you need one we don't list, drop us a note and we'll add it.
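For reference, the twelve languages correspond to the standard Tesseract traineddata codes below, and mixed packs can be written with the `+` syntax that the Tesseract command line uses. The mapping table and the `toTesseractLang` helper are illustrative assumptions; how this page wires the dropdown to the engine is not described here.

```typescript
// Standard Tesseract traineddata codes for the twelve listed languages.
const LANGUAGE_CODES: Record<string, string> = {
  "English": "eng",
  "Spanish": "spa",
  "French": "fra",
  "German": "deu",
  "Italian": "ita",
  "Portuguese": "por",
  "Dutch": "nld",
  "Russian": "rus",
  "Chinese (Simplified)": "chi_sim",
  "Japanese": "jpn",
  "Korean": "kor",
  "Arabic": "ara",
};

// Hypothetical helper: turns dropdown display names into a Tesseract
// language string, joining multiple languages with "+".
function toTesseractLang(names: string[]): string {
  if (names.length === 0) {
    throw new Error("Select at least one language");
  }
  return names
    .map((name) => {
      const code = LANGUAGE_CODES[name];
      if (code === undefined) {
        throw new Error(`Unsupported language: ${name}`);
      }
      return code;
    })
    .join("+");
}

// Example: the English + Spanish mixed pack.
const mixed = toTesseractLang(["English", "Spanish"]); // "eng+spa"
```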

Question 3

How accurate is OCR in the browser?

Accepted Answer

It depends almost entirely on the scan quality. A clean 300 DPI scan of typed text? Usually 95–99% accurate. A low-resolution photo of a printed page taken at an angle? Maybe 70–85%. Faded carbon copies, small fonts, multi-column layouts, and skewed pages all reduce accuracy. The engine is the same Tesseract LSTM model used by many commercial tools — quality is fundamentally limited by the input image, not by browser execution.
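If you want to check accuracy on your own documents, one common approach is character-level accuracy: compare the OCR output against a hand-typed reference and count the edits needed to turn one into the other. The sketch below assumes that definition (the percentages above are not tied to a specific metric), and both function names are illustrative.

```typescript
// Levenshtein edit distance: minimum number of single-character insertions,
// deletions, and substitutions needed to turn `a` into `b`.
function editDistance(a: string, b: string): number {
  // prev[j] = distance between the processed prefix of `a` and the first j chars of `b`.
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // delete a character from `a`
        curr[j - 1] + 1,    // insert a character into `a`
        prev[j - 1] + cost  // substitute (or match at zero cost)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Character accuracy in the range 0..1, relative to the reference text.
function characterAccuracy(reference: string, ocrOutput: string): number {
  if (reference.length === 0) {
    return ocrOutput.length === 0 ? 1 : 0;
  }
  return Math.max(0, 1 - editDistance(reference, ocrOutput) / reference.length);
}
```

For example, an 11-character reference with two misread characters scores 1 - 2/11, or about 82%.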

Question 4

Does it work on handwriting?

Accepted Answer

No. Tesseract is trained on printed and typeset text. Handwriting recognition needs a different class of model (typically a transformer trained on handwritten samples), which this tool doesn't yet include in a browser-runnable form. If you have mixed printed + handwritten content, the printed parts will be recognized fine and the handwritten parts will produce garbage that you should ignore.
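If you post-process the output yourself, one rough way to flag likely garbage is to drop words with a low recognition confidence. Tesseract reports a per-word confidence from 0 to 100; the `OcrWord` shape, the function name, and the threshold of 60 below are all assumptions for illustration, and handwriting can still slip through with a moderate score.

```typescript
// Assumed minimal shape of a recognized word (confidence is 0 to 100).
interface OcrWord {
  text: string;
  confidence: number;
}

// Hypothetical filter: keep only words at or above the confidence threshold.
// The default of 60 is an arbitrary starting point, not a tuned value.
function keepConfidentWords(words: OcrWord[], minConfidence = 60): OcrWord[] {
  return words.filter((word) => word.confidence >= minConfidence);
}

const sample: OcrWord[] = [
  { text: "Invoice", confidence: 96 },
  { text: "~jf|k", confidence: 21 },
  { text: "Total", confidence: 91 },
];
const kept = keepConfidentWords(sample).map((word) => word.text).join(" ");
```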

Question 5

What's the difference between the .txt output and the searchable PDF?

Accepted Answer

The .txt file is just the recognized text, which is useful if you want to copy it into a document, search it, or feed it to another tool. The searchable PDF keeps the original page images visually identical to the source, but adds an invisible text layer on top so the document is selectable, searchable in any PDF viewer, and indexable by search engines. Choose .txt for content extraction; choose searchable PDF for archival or for making a scanned document Ctrl-F-friendly.
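To make the invisible text layer line up with the page image, each recognized word's pixel box has to be mapped into PDF coordinates. The sketch below shows that conversion under stated assumptions: word boxes use image pixels with a top-left origin, the page was rendered at a known DPI, and PDF user space uses the default 72 units per inch with a bottom-left origin. The type and function names are illustrative, not this tool's internals.

```typescript
// Word box in image pixels, origin at the top-left corner (assumed shape).
interface PixelBox {
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}

// Rectangle in PDF points, origin at the bottom-left corner.
interface PdfRect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Hypothetical helper: converts a pixel box into PDF user-space units.
function pixelBoxToPdfRect(box: PixelBox, dpi: number, pageHeightPt: number): PdfRect {
  const ptPerPx = 72 / dpi; // default PDF user space: 72 units per inch
  return {
    x: box.x0 * ptPerPx,
    // Flip vertically: image y grows downward, PDF y grows upward,
    // so the bottom edge of the box (y1) becomes the rectangle's y.
    y: pageHeightPt - box.y1 * ptPerPx,
    width: (box.x1 - box.x0) * ptPerPx,
    height: (box.y1 - box.y0) * ptPerPx,
  };
}

// Example: a word on a US Letter page (792 pt tall) scanned at 300 DPI.
const rect = pixelBoxToPdfRect({ x0: 300, y0: 300, x1: 900, y1: 350 }, 300, 792);
```

A PDF writer would then draw the word's text inside that rectangle using an invisible rendering mode, which is what lets viewers select and search text that appears to be part of the image.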

Extract Text From a Scanned PDF

How It Works

When to Use This

Supported Languages

What This Tool Doesn’t Do

Honest Trade-offs vs Server-Side OCR

Pair With Other Tools

Privacy

Frequently Asked Questions