Is there a Windows program to OCR one PDF which is an IMAGE (text isn't selectable)?
It's about 200 pages but it's not worth buying OCR software for just one file.
Is there a way to upload the PDF to the net for others to see what it is?
I know nothing about it, but you may try
https://pdf.wondershare.net/ad/pdf-editor/ocr.html
In the past, I have used the camera function of Adobe Reader, pasted
the selection into IrfanView, and used the IrfanView plugin to OCR the information.
https://www.irfanview.info/plugins/kadmos/
On Thu, 15 Jun 2023 17:43:12 +0100, Peter wrote:
Is there a Windows program to OCR one PDF which is an IMAGE (text isn't
selectable).
It's about 200 pages
If it's 200 pages, don't you mean it's 200 images rather than one
image?
But that's a quibble. OneNote, part of the MS Office suite, can OCR
an image, and it does a fairly good job if the image is fairly clear.
Paste the image from clipboard into OneNote, then right-click on it
and select Copy Text from Picture. Then paste the text from clipboard
to whatever program you wish.
If you don't have Office, google for free OCR sites. There are quite
a few, but I've never used one because I use OneNote. Caution: If
what you're OCRing is sensitive, you wouldn't want to upload it to
some possibly sketchy website.
When you run "overlay OCR" on that 200-page scanned document,
each page is a separate OCR run. All the characters in one image are
"recognized", then PDF lines of text in a particular font
are added to the PDF code for that page. Each page is handled
individually.
On Fri, 16 Jun 2023 06:40:50 -0400, Paul wrote:
When you run "overlay OCR" on that 200-page scanned document,
each page is a separate OCR run. All the characters in one image are
"recognized", then PDF lines of text in a particular font
are added to the PDF code for that page. Each page is handled
individually.
What do you use to make the OCR overlay?
On 6/16/2023 11:01 AM, Stan Brown wrote:
What do you use to make the OCR overlay?
Since Linux is more likely to have a current Tesseract, I used
the Win10 Bash shell and an Ubuntu distro.
apt search ocrmypdf
sudo apt install ocrmypdf
You don't really need to do this step, but for test purposes,
I just wanted to run it on a single page. I fed it the image from
page 8.
mutool extract sony_srs-t1_t1pc_sm.pdf # collect image files for pages
Then, in Bash shell on Windows, I did (using the installed ocrmypdf)
for a PNG input to PDF output:
ocrmypdf -l eng --image-dpi 400 --output-type pdf image-0044.png image-0044.pdf
INFO - Input file is not a PDF, checking if it is an image...
INFO - Input file is an image
INFO - Image seems valid. Try converting to PDF...
INFO - Successfully converted to PDF, processing...
Scan: 100% 1/1 [00:00<00:00, 625.83page/s]
INFO - Using Tesseract OpenMP thread limit 3
OCR: 100% 1.0/1.0 [00:07<00:00, 7.01s/page]
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
INFO - Optimize ratio: 1.00 savings: 0.0%
To do the whole document, you'd likely need fewer options than that, as some
metadata is already inside the PDF. Something like this, maybe:
ocrmypdf --output-type pdf input.pdf output.pdf
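For anyone who wants to script the whole-document run, here is a minimal sketch rather than a tested recipe. The function name ocr_pdf and the DRY_RUN variable are made up for illustration. The options -l, --output-type and --skip-text are existing ocrmypdf options; --skip-text is an addition not used in the command above, and it should make ocrmypdf pass over pages that already carry text instead of aborting.

```shell
# Sketch: wrap the whole-document ocrmypdf call in a function.
# ocr_pdf and DRY_RUN are hypothetical names used only for this example.
ocr_pdf() {
    src=$1
    dst=$2
    lang=${3:-eng}    # Tesseract language code, English by default

    # Build the command as positional parameters so file names with spaces survive.
    set -- ocrmypdf -l "$lang" --output-type pdf --skip-text "$src" "$dst"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"     # show the command instead of running it
        return 0
    fi

    if ! command -v ocrmypdf >/dev/null 2>&1; then
        echo "ocrmypdf is not installed (try: sudo apt install ocrmypdf)" >&2
        return 127
    fi

    "$@"
}

# Example: DRY_RUN=1 ocr_pdf input.pdf output.pdf
```

Setting DRY_RUN=1 only prints the command, which is a cheap way to check the options before committing to a long run on 200 pages.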
The output from my page 8 image made this standalone PDF. The DPI declaration helped it pick a weird page size for the output.
image-0044.pdf
Swiping the mouse over that page selects text you can copy.
I didn't do quality analysis, or refine the command to do a better job.
I should be able to feed it the entire 10 page PDF intact, and
have it output a 10 page PDF with text overlay. Again, not tested.
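One way to get a rough quality check without extracting images first might be to limit the run to a page or two and write the recognized text to a separate file. The sketch below assumes an ocrmypdf version that has the --pages and --sidecar options (older packages may lack --pages). The names ocr_sample and DRY_RUN, and the "-sample" output file names, are hypothetical.

```shell
# Sketch: OCR only selected pages and keep the recognized text in a side file.
# ocr_sample and DRY_RUN are hypothetical names used only for this example.
ocr_sample() {
    src=$1
    pages=$2                 # a page number or range, e.g. 8 or 1-3
    base=${src%.pdf}         # strip the .pdf suffix for the output names

    set -- ocrmypdf -l eng --output-type pdf --pages "$pages" \
        --sidecar "$base-sample.txt" "$src" "$base-sample.pdf"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"            # show the command instead of running it
        return 0
    fi

    # Run the OCR, then report how many words ended up in the text file.
    "$@" && wc -w "$base-sample.txt"
}

# Example: DRY_RUN=1 ocr_sample manual.pdf 8
```

Reading the sidecar text file next to the page image gives a quick impression of how well Tesseract coped before the full document is processed.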
It's normal for these processes not to be able to overlay text
exactly on top of the bitmap characters underneath. The Adobe OCR
in their paid tool does do an exact job. Many other "hobby projects"
do not.
For a start, I was just happy to see Tesseract not fall over.
The Adobe tool (in the Acrobat editor in their distiller package)
first does layout analysis. On a three-column magazine layout,
it correctly removes the image content from consideration,
then it OCR-processes each column and precisely lays the text on top.
As has been previously described in this thread, if there is even
a bit of font and text in the document already, the OCR does not like that
and it bails. It expects "pristine" cut-sheet scan images to work on
and no fonts declared in the PDF. In the case of Adobe, it also expects
the scan to be done at 200DPI to 400DPI (based on page size declaration
and such). Many times, I was thwarted in Adobe by a "this image needs
to be between 200DPI and 400DPI" type of message. And then it takes
half the day to arrange a strict diet of noodles for the stupid thing :-)
Paul
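The DPI a tool sees usually follows from the image's pixel size and the page size declared in the PDF (72 points per inch). Below is a small sketch of that arithmetic for checking a scan against the 200 to 400 DPI window mentioned above. The helper names effective_dpi and dpi_ok are hypothetical, the window is treated as inclusive, and whole-number point sizes are assumed because the shell only does integer arithmetic.

```shell
# Sketch: estimate the effective DPI of a scanned page.
# effective_dpi and dpi_ok are hypothetical helper names.

# effective_dpi PIXELS POINTS
#   PIXELS: image width in pixels
#   POINTS: declared page width in PDF points (1 inch = 72 points)
effective_dpi() {
    echo $(( $1 * 72 / $2 ))
}

# dpi_ok DPI: succeeds when DPI lies in the 200-400 window (inclusive, assumed)
dpi_ok() {
    [ "$1" -ge 200 ] && [ "$1" -le 400 ]
}

# Example: a 3400-pixel-wide scan on a US Letter page (612 points wide)
dpi=$(effective_dpi 3400 612)
if dpi_ok "$dpi"; then
    echo "$dpi DPI: inside the 200-400 window"
else
    echo "$dpi DPI: outside the 200-400 window"
fi
```

For the example values this prints "400 DPI: inside the 200-400 window". A scan that lands outside the window would need a different declared page size or a rescan.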
On Fri, 16 Jun 2023 15:03:30 -0400, Paul wrote:
Since Linux is more likely to have a current Tesseract, I used
the Win10 Bash shell and an Ubuntu distro.
Thanks, Paul, for the detailed explanation. One eye-
opener for me was that the Win10 Bash shell can run
actual Linux programs.
--
Stan Brown, Tehachapi, California, USA
https://BrownMath.com/
Shikata ga nai...