PDF Stacks OCR

12/11/2023

One of the few tasks I have not been able to do on Linux since I switched over from Windows more than a decade ago is optical character recognition (OCR) of PDF documents. Most of my PDFs were digital documents to begin with, so the text is readily selectable. However, the occasional need arises when I either have to scan something myself or I receive a document that is just an image with no selectable text. Up until now, I have kept a software package on a Windows virtual machine (in VirtualBox) specifically to OCR PDFs on the rare occasion when I need to do that. But I think I can safely move past that thanks to recent advances in OCR on Linux.

I recently scanned a chapter I wrote for a book. The scan looked good (especially after I used Scan Tailor’s Dewarping feature to flatten the pages). I then converted the TIF files from Scan Tailor into PDF files, put them in the correct order, and was ready to OCR them in the software I used in Windows. However, my virtual machine was giving me some issues and required me to install some updates that were going to take a while (’cause, Windows!). I got the updates started, then realized that I hadn’t checked to see if any progress had been made on OCR on Linux for quite a while (probably a couple of years). A quick Google search landed me on Stack Exchange (where I seem to spend a lot of time these days). There, I found two new options for OCR on Linux.

The first option was a command line program called “ocrmypdf.” That sounds like a dream! I quickly installed it on my Kubuntu machine:

$ sudo apt install ocrmypdf

A number of additional packages were installed as well. Once it was installed, I gave it a whirl. Navigate to the directory containing the PDF you want to have recognized, then type in the following:

$ ocrmypdf input.pdf output.pdf

My initial PDF (on the left below) was 14 MB in size and looked fine, but I couldn’t select the text – it was just an image. With a quick command, I ran it through the “ocrmypdf” program and got out a nearly identical PDF that was smaller (just 9 MB) and allowed me to select the text (image on the right below).

(Image: original PDF on the left, OCR PDF on the right.)

In the next image, you can see that I can select the text in the OCR’d image:

(Image: I can select the text in the OCR’d image.)

Finally, the real question is, how accurate is the OCR? The image below shows the OCR document next to the text:

(Image: PDF on the left, selected text copied and pasted on the right.)

As you can see, this isn’t award-winning software. But that is more than sufficient for what I need most of the time. And given the speed with which I can do this, I’ll absolutely be using this over the old software I was using on a virtual machine before.

On the same page where I found “ocrmypdf,” there was mention of another software package: “ocrfeeder.” Since I had been dreaming of this kind of software for ages, I figured I’d give it a spin as well. I downloaded “ocrfeeder” quickly:

$ sudo apt install ocrfeeder

I then tried to pull up the GUI… And… Nothing. “ocrfeeder” didn’t seem to be working on my install (Kubuntu 18.10) when I tried to run it.

With file recognition (OCR) features, you can scan images into searchable PDF files. You can stack, merge, or split PDF files. Here is sample code for finding text and highlighting with PyMuPDF: import fitz # READ IN PDF doc = fitz. … filter (lambda obj: obj … "char" and "Bold" in obj … ) print (clean_text. …
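If you have more than one scan to process, the “ocrmypdf input.pdf output.pdf” command from this post can be wrapped in a short script. What follows is only a minimal sketch, not anything that ships with ocrmypdf: the function names and the “_ocr.pdf” naming scheme are my own assumptions, and it simply calls the same command line program once per file.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(input_pdf, output_pdf):
    """Return the argument list for: ocrmypdf input.pdf output.pdf"""
    return ["ocrmypdf", str(input_pdf), str(output_pdf)]


def output_name(input_pdf):
    """Hypothetical naming scheme: scan.pdf -> scan_ocr.pdf, same folder."""
    path = Path(input_pdf)
    return path.with_name(path.stem + "_ocr.pdf")


def ocr_directory(directory):
    """Run ocrmypdf on each PDF in `directory` that has no *_ocr.pdf yet.

    Returns a list of (file name, exit code) pairs.
    """
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf not found; try: sudo apt install ocrmypdf")

    results = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        if pdf.stem.endswith("_ocr"):
            continue  # skip files this script produced earlier
        target = output_name(pdf)
        if target.exists():
            continue  # already processed
        # check=False so one failing file does not stop the whole batch
        completed = subprocess.run(build_ocr_command(pdf, target), check=False)
        results.append((pdf.name, completed.returncode))
    return results
```

Calling ocr_directory(".") from the folder that holds your scans would leave each original untouched and write a searchable copy next to it; an exit code of 0 in the returned list means ocrmypdf reported success for that file.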
