Extract Text From a Scanned PDF
No account. No upload. Just the tool.
You have a PDF that won’t let you select text. The pages are images — maybe a scan, maybe an export from an old fax, maybe a photo of a printed document someone emailed you. You need the text out: to quote it, search it, paste it, or just make the file searchable. Signegy runs OCR (optical character recognition) on every page, in your browser, and gives you back the text as a plain .txt file or as a searchable PDF that looks identical to the original but supports Ctrl-F.
How It Works
Each page of your PDF is rendered to a canvas at about 144 DPI (twice the 72 points per inch that PDF page sizes are measured in) inside the browser. That canvas is fed to Tesseract.js, the WebAssembly port of the open-source Tesseract OCR engine, which was originally developed at HP and later sponsored by Google for many years. Tesseract returns the recognized text along with bounding boxes for every word: a small JSON structure that says “the word Invoice is located at pixels (120, 80) to (240, 110) on this page.”
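To make the 144 DPI figure concrete: PDF page sizes are measured in points, 72 to the inch, so rendering at 144 DPI works out to a 2x scale. The sketch below shows only that arithmetic. The function and type names are illustrative assumptions, not taken from Signegy's actual code.

```typescript
// PDF page sizes are expressed in points; 1 point = 1/72 inch.
const POINTS_PER_INCH = 72;

// Hypothetical shapes for this sketch.
interface PageSizePt { widthPt: number; heightPt: number; }
interface CanvasSizePx { scale: number; widthPx: number; heightPx: number; }

// Compute the render scale and canvas size needed for a target DPI.
function canvasSizeForDpi(page: PageSizePt, dpi: number): CanvasSizePx {
  if (dpi <= 0) throw new RangeError("dpi must be positive");
  const scale = dpi / POINTS_PER_INCH;
  return {
    scale,
    // Round up so the canvas never clips a partial pixel at the edge.
    widthPx: Math.ceil(page.widthPt * scale),
    heightPx: Math.ceil(page.heightPt * scale),
  };
}

// US Letter is 612 x 792 points.
const letter = canvasSizeForDpi({ widthPt: 612, heightPt: 792 }, 144);
console.log(letter); // { scale: 2, widthPx: 1224, heightPx: 1584 }
```

A Letter page at this scale is about 1.9 megapixels, which is one reason each page takes several seconds to recognize.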
We use those boxes two ways:
- Plain text output: concatenate every word in reading order, separated by page markers, and download as .txt.
- Searchable PDF: embed each page image into a new PDF, then overlay an invisible text layer on top, positioned to match where the words actually appear in the image. Visually, the searchable PDF looks identical to the source. But text selection, copy-paste, and Ctrl-F all work because the invisible layer is real PDF text.
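Both uses of the word boxes can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the actual implementation: the `WordBox` shape, the page marker format, and the function names are all hypothetical. It assumes the words already arrive in reading order, and that pixel coordinates start at the top-left of the page image while PDF coordinates start at the bottom-left and are measured in points.

```typescript
// Hypothetical word box: pixel coordinates, origin at the top-left of the page image.
interface WordBox { text: string; x0: number; y0: number; x1: number; y1: number; }
// Rectangle in PDF space: points, origin at the bottom-left of the page.
interface PdfRect { x: number; y: number; width: number; height: number; }

// Plain text output: join each page's words and separate pages with a marker line.
function toPlainText(pages: WordBox[][]): string {
  return pages
    .map((words, i) => `--- Page ${i + 1} ---\n` + words.map((w) => w.text).join(" "))
    .join("\n\n");
}

// Searchable PDF: convert a pixel box into the rectangle where the
// invisible text for that word should be placed.
// `scale` is pixels per point (2 when rendering at 144 DPI).
function pixelBoxToPdfRect(box: WordBox, pageHeightPt: number, scale: number): PdfRect {
  return {
    x: box.x0 / scale,
    // Flip the y axis: the bottom edge of the box in pixel space (y1)
    // becomes the rectangle's lower-left y in PDF space.
    y: pageHeightPt - box.y1 / scale,
    width: (box.x1 - box.x0) / scale,
    height: (box.y1 - box.y0) / scale,
  };
}

// The "Invoice" example from above, on a US Letter page (792 points tall).
const invoice: WordBox = { text: "Invoice", x0: 120, y0: 80, x1: 240, y1: 110 };
console.log(pixelBoxToPdfRect(invoice, 792, 2));
// { x: 60, y: 737, width: 60, height: 15 }
```

The y-axis flip is the step that is easiest to get wrong: skip it and the text layer ends up mirrored vertically, so selecting a word near the top of the page highlights something near the bottom.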
The whole pipeline runs locally. The PDF you drop in never leaves your tab. The engine is downloaded once from the same CDN that hosts the rest of Signegy, then cached in your browser for next time.
When to Use This
- Scanned contracts or invoices — make the PDF searchable so you (or your filing system) can find clauses by keyword later.
- Old PDFs from copy machines — many older office scanners produce image-only PDFs that look fine but can’t be searched. Run them through OCR once and the searchable version is permanently usable.
- Research papers, court records, government filings — public documents are often scanned PDFs that need text extraction for citation, quoting, or analysis.
- Image-heavy PDFs with embedded text — even modern PDFs sometimes have key text rendered as raster images (logos, form labels). OCR pulls them out.
What This Tool Doesn’t Do
- Handwriting. Tesseract is for printed text. Handwriting models exist but aren’t viable in-browser.
- Languages other than English (v1). Adding a language picker means another 5–30MB download per language. We’ll add one if there’s demand — for now, English only.
- Tables and complex layouts. OCR returns text in roughly reading order, but multi-column layouts, tables, and forms may come out interleaved or out of order. The recognition itself is fine; it’s the layout reconstruction that’s hard. Plan to clean up table-like output by hand or with a follow-up tool.
- Math, code, or unusual fonts. Equations, programming snippets in monospace, and decorative or stylized fonts often misrecognize. Plain printed prose is the sweet spot.
- Speed. OCR takes about 5–15 seconds per page in the browser. A 50-page document is a coffee break. The progress indicator tells you which page it’s on so you know it hasn’t hung.
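The per-page speed figure above turns into a rough wait estimate with simple multiplication. The helper below is only a back-of-the-envelope sketch with a hypothetical name; real timings depend on your device and on how dense each page is.

```typescript
// Estimate a wait range from the 5-15 seconds-per-page figure quoted above.
function estimateOcrSeconds(
  pageCount: number,
  minPerPage = 5,
  maxPerPage = 15,
): { minSeconds: number; maxSeconds: number } {
  if (!Number.isInteger(pageCount) || pageCount < 0) {
    throw new RangeError("pageCount must be a non-negative integer");
  }
  return { minSeconds: pageCount * minPerPage, maxSeconds: pageCount * maxPerPage };
}

console.log(estimateOcrSeconds(50)); // { minSeconds: 250, maxSeconds: 750 }
```

For a 50-page document that is roughly 4 to 12.5 minutes.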
Honest Trade-offs vs Server-Side OCR
Cloud OCR services are faster (they have bigger machines and parallel processing) and they support more languages out of the box. They also see your document. For small jobs (under ~30 pages), running OCR in your browser is a perfectly fine trade: at 5–15 seconds per page you wait a few minutes instead of seconds, but you don’t upload a contract to a third party. For very large jobs (hundreds of pages), browser-side OCR is genuinely slow and you might be better off with a desktop tool.
We’re explicit about this on purpose. The point of Signegy isn’t to claim browser-side beats server-side at everything — it doesn’t. The point is to give you a private option for work that doesn’t need to be fast.
Pair With Other Tools
- Compress PDF — searchable PDFs from OCR include both the page images and the text layer, which can grow file size. Run compress after OCR to shrink.
- Repair PDF — if your scanned PDF won’t open or has structural errors, repair it first, then OCR.
- PDF to JPG — if you only want the page images and plan to OCR them in another tool.
- Sign PDF online — sign the searchable version to lock in both the original images and the recognized text.
Privacy
Your PDF is processed entirely in your browser. The Tesseract WebAssembly engine, the rendering canvas, and the output PDF builder all run client-side. No file bytes, no recognized text, no metadata is sent to any server. Network requests during OCR fetch only the engine itself (one-time, cached) and the English language data (also cached). Open your browser’s network tab while OCR runs to verify — you’ll see the engine load on the first run and zero outbound traffic carrying your document.
Signegy provides general information, not legal advice. Consult a qualified legal professional for advice specific to your situation and jurisdiction.
Frequently Asked Questions
Why does the page take a few seconds to start the first time?
Recognition uses Tesseract.js, an open-source OCR engine compiled to WebAssembly. The engine bundle is about 10MB and is only loaded the first time you click Recognize — it never downloads if you don't use the tool. After the first run, the browser caches the engine, so subsequent files start much faster. We deliberately don't load it up-front because most visitors don't end up running OCR.
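The load-on-first-click behavior is a standard lazy-initialization pattern. Here is a minimal sketch of it, with a hypothetical `lazyOnce` helper and a stand-in loader in place of the real engine download. It shows the general technique and is not a description of Signegy's own code.

```typescript
// Wrap an expensive async loader so it runs at most once, on first use.
function lazyOnce<T>(loader: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (pending === undefined) {
      pending = loader().catch((err) => {
        pending = undefined; // a failed download can be retried on the next click
        throw err;
      });
    }
    return pending;
  };
}

// Hypothetical stand-in for fetching and initializing the OCR engine.
let downloads = 0;
const getEngine = lazyOnce(async () => {
  downloads += 1;
  return { name: "ocr-engine" };
});

console.log(downloads); // 0: nothing is fetched until someone asks for the engine
void getEngine(); // first click on Recognize starts the load
void getEngine(); // later clicks reuse the same in-flight or finished load
console.log(downloads); // 1
```

In a real page the loader would be something like a dynamic `import()` of the OCR library, with the browser's HTTP cache making repeat visits fast.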
What languages are supported?
English only in v1. Tesseract supports 100+ languages, but each language pack is a separate download, so a language picker would mean pulling down another 5–30MB per language you choose. We're starting with English only to keep things simple. If you need another language, drop us a note: the underlying engine supports it, and we'll add a picker if there's demand.
How accurate is OCR in the browser?
It depends almost entirely on the scan quality. A clean 300 DPI scan of typed text? Usually 95–99% accurate. A low-resolution photo of a printed page taken at an angle? Maybe 70–85%. Faded carbon copies, small fonts, multi-column layouts, and skewed pages all reduce accuracy. The engine is the same Tesseract LSTM model used by many commercial tools — quality is fundamentally limited by the input image, not by browser execution.
Does it work on handwriting?
No. Tesseract is trained on printed and typeset text. Handwriting recognition needs a different class of model (typically a transformer trained on handwritten samples) which doesn't exist in a browser-runnable form yet. If you have mixed printed + handwritten content, the printed parts will recognize fine and the handwritten parts will produce garbage that you should ignore.
What's the difference between the .txt output and the searchable PDF?
The .txt file is just the recognized text — useful if you want to copy it into a document, search it, or feed it to another tool. The searchable PDF keeps the original page images visually identical to the source, but adds an invisible text layer on top so the document is selectable, searchable in any PDF viewer, and indexable by search engines. Choose .txt for content extraction, choose searchable PDF for archival or for making a scanned document Ctrl-F-friendly.