Forum: >>> Magnum BBS <<<

Convert an imagebook to a textbook (perhaps using OCR?)

From Rudolph Rhein@21:1/5 to All on Sat Aug 26 04:05:18 2023

XPost: comp.text.pdf, rec.photo.digital

My sister's Great Books selection for next month is Noel Coward's comedy
"Private Lives" from the 1930s. She's almost blind from complications.

She is not technical and only has an iPad and an iPhone, whereas I have
Android & Windows, so she asked me to help her with iOS text-to-speech.

She sent me the link to the PDF because text-to-speech won't read it out:
<https://ia801404.us.archive.org/12/items/in.ernet.dli.2015.210130/2015.210130.Private-Lives.pdf>

Looking at that PDF, it seems to be not a "textpdf" (whatever you'd call
it) but just a set of scanned images of the book (with no actual text).

I tried converting that PDF with Calibre on Windows to an EPUB format,
but the EPUB was nothing more than a set of the same images in a file.

What's a good way for me to convert that "imagebook" (whatever you call it)
to a "textbook" so that I can send it to her to use TTS on her iPad?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Eli the Bearded@21:1/5 to RudolphRhein@nospam.net on Sat Aug 26 02:00:45 2023

XPost: comp.text.pdf, rec.photo.digital

Follow-ups set to comp.text.pdf.

In rec.photo.digital, Rudolph Rhein <RudolphRhein@nospam.net> wrote:

My sister's Great Books selection for next month is Noel Coward's comedy "Private Lives" from the 1930s. She's almost blind from complications.

...

<https://ia801404.us.archive.org/12/items/in.ernet.dli.2015.210130/2015.210130.Private-Lives.pdf>
Looking at that PDF, it seems to be not a "textpdf" (whatever you'd call
it) but just a set of scanned images of the book (with no actual text).

It's archive.org. They have documents in multiple formats already.

https://archive.org/details/in.ernet.dli.2015.210130

DOWNLOAD OPTIONS
* ABBYY GZ download
* DAISY download For print-disabled users
* EPUB download
* FULL TEXT download
* ITEM TILE download
* KINDLE download
* PDF download
* PDF WITH TEXT download
* SINGLE PAGE PROCESSED JP2 ZIP

Their FULL TEXT and PDF WITH TEXT are produced by their own OCR, so expect
typical OCR errors in them.
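
To get a feel for how noisy an OCRed text file is before handing it to TTS, a crude check like the following can help. This is only a sketch: the two patterns are illustrative assumptions (they are not what archive.org or ABBYY use), and the sample sentence is made up.

```python
import re

# Heuristic patterns for tokens that often signal OCR damage.
# Both rules are assumptions chosen for illustration only.
_LETTER_DIGIT_MIX = re.compile(r"(?=.*[A-Za-z])(?=.*\d)")
_ODD_SYMBOL = re.compile(r"[^A-Za-z0-9\s.,;:!?'\"()\-]")


def suspicious_tokens(text):
    """Return whitespace-separated tokens that look like OCR errors."""
    flagged = []
    for token in text.split():
        # Flag tokens mixing letters and digits, or containing unusual symbols.
        if _LETTER_DIGIT_MIX.search(token) or _ODD_SYMBOL.search(token):
            flagged.append(token)
    return flagged


# Made-up sample sentence with two typical OCR-style substitutions.
sample = "The b00k was sc@nned and converted automatically."
print(suspicious_tokens(sample))
```

The check is deliberately blunt: a legitimate token such as "1930s" mixes letters and digits and would be flagged too, so treat the result as a list of places to look at, not a list of errors.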

Elijah
------
does not know what all of the formats are


From Paul@21:1/5 to Rudolph Rhein on Fri Aug 25 23:29:06 2023

XPost: comp.text.pdf, rec.photo.digital

On 8/25/2023 9:05 PM, Rudolph Rhein wrote:

My sister's Great Books selection for next month is Noel Coward's comedy "Private Lives" from the 1930s. She's almost blind from complications.

She is not technical and only has an iPad and an iPhone, whereas I have Android & Windows, so she asked me to help her with iOS text-to-speech.

She sent me the link to the PDF because text-to-speech won't read it out:
<https://ia801404.us.archive.org/12/items/in.ernet.dli.2015.210130/2015.210130.Private-Lives.pdf>

Looking at that PDF, it seems to be not a "textpdf" (whatever you'd call
it) but just a set of scanned images of the book (with no actual text).

I tried converting that PDF with Calibre on Windows to an EPUB format,
but the EPUB was nothing more than a set of the same images in a file.

What's a good way for me to convert that "imagebook" (whatever you call it) to a "textbook" so that I can send it to her to use TTS on her iPad?

Noel Coward is a genius.

He picked the perfect font, to prevent OCR :-)

An italic font with rough edges. The scanning team did a great job, but maybe they should have tried OCR first, before cleanup.

*******

https://archive.org/stream/in.ernet.dli.2015.210130/2015.210130.Private-Lives_djvu.txt <=== try TTS on this

( https://archive.org/details/in.ernet.dli.2015.210130 )

Ocr ABBYY FineReader 11.0
Ppi 600 <=== Didn't look like 600 to me...

Each scanned page is 2800 x 4000 pixels, so whether 600 is accurate
depends on the size of the printed page.
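
The arithmetic behind that check can be sketched in a few lines of Python. The 2800 x 4000 pixel size and the 600 ppi claim come from the item metadata quoted above; the 5.0 inch page width is a hypothetical value for illustration, not a measurement of this book.

```python
def page_size_inches(width_px, height_px, ppi):
    """Physical page size implied by a pixel size at a claimed resolution."""
    return width_px / ppi, height_px / ppi


def implied_ppi(width_px, page_width_in):
    """Resolution implied by a pixel width and a measured page width."""
    return width_px / page_width_in


# If the 600 ppi figure is right, the printed page must be this big.
w_in, h_in = page_size_inches(2800, 4000, 600)
print(f"At 600 ppi the page would be {w_in:.2f} x {h_in:.2f} inches")

# 5.0 inches is a hypothetical page width, not a measurement of this book.
print(f"A 5.0 inch wide page would mean {implied_ppi(2800, 5.0):.0f} ppi")
```

So the 600 figure holds only if the printed page is roughly 4.67 x 6.67 inches; a larger page means a lower true resolution.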

Windows apparently has an OCR library. Fat lot of good that does me.

https://blogs.windows.com/windowsdeveloper/2016/02/08/optical-character-recognition-ocr-for-windows-10/

If you watch how the OCR in the old Acrobat Distiller package
used to work, it first does layout analysis: it recognizes the text columns
in, say, a three-column layout. Then it selects lines of text (pixmap
sections), runs OCR on them, and associates the text with its column.

The Microsoft OCR library, at a guess, does not do layout analysis. It
takes whatever pixmap section you feed it and makes a line of text
(with little or no punctuation or layout info). That would explain why
the sample image they fed it had only one line of text in it: with a
single line, the output looks the same whether or not a layout engine
is present. If the image had contained just two lines of text, you would
realize what its capabilities actually were.
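
For anyone curious what "selecting lines of text" can look like, here is a minimal sketch of one classic approach, a horizontal projection profile: rows that contain ink are grouped into bands, and each band is a candidate text line to hand to a recognizer. This is an assumption about a textbook technique, not a description of what Acrobat or the Microsoft library actually does, and the tiny bitmap is made up.

```python
def find_text_lines(bitmap, min_ink=1):
    """Return (top, bottom) row ranges of text lines; bottom is exclusive.

    bitmap is a list of rows, each a list of 0 (paper) or 1 (ink).
    """
    lines = []
    start = None
    for y, row in enumerate(bitmap):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                  # first inked row of a new band
        elif not has_ink and start is not None:
            lines.append((start, y))   # blank row closes the band
            start = None
    if start is not None:
        lines.append((start, len(bitmap)))  # band runs to the bottom edge
    return lines


# Made-up 7-row "page" with two bands of ink separated by blank rows.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0],
]
for top, bottom in find_text_lines(page):
    strip = page[top:bottom]  # the pixmap section a recognizer would receive
    print(f"rows {top}..{bottom - 1}: {len(strip)} rows to pass to a recognizer")
```

A real layout engine would first split the page into columns (for example with the same profile taken vertically) and would have to cope with skew and noise, which this sketch ignores.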

I could easily feed the sample through some package running
Tesseract, but we all know how that will turn out.

Paul


From Rudolph Rhein@21:1/5 to Eli the Bearded on Sat Aug 26 09:37:00 2023

XPost: comp.text.pdf, rec.photo.digital

Eli the Bearded <*@eli.users.panix.com> wrote:

It's archive.org. They have documents in multiple formats already.

How the heck did you know that?

https://archive.org/details/in.ernet.dli.2015.210130

That's a much better link (to send to the other Great Bookers!).

DOWNLOAD OPTIONS
* ABBYY GZ download
* DAISY download For print-disabled users
* EPUB download
* FULL TEXT download

Even though I was aiming for a PDF, "full text" seems like the most
natural input for a text-to-speech program, don't you think?

* ITEM TILE download
* KINDLE download
* PDF download
* PDF WITH TEXT download
* SINGLE PAGE PROCESSED JP2 ZIP

Usually I'm comfortable starting with an EPUB or Kindle for conversion.
But what's the difference between "PDF" and "PDF with text" anyway?

Their FULL TEXT and PDF WITH TEXT are produced by their own OCR, so expect
typical OCR errors in them.

How do you know that?
Are you saying the EPUB/Kindle are the most faithful then?

Elijah
------
does not know what all of the formats are

Kindle: <https://archive.org/download/in.ernet.dli.2015.210130/2015.210130.Private-Lives.mobi>

EPUB: <https://archive.org/download/in.ernet.dli.2015.210130/2015.210130.Private-Lives.epub>
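
Both links follow the same pattern, so they can be rebuilt from the item identifier and the base file name. A small sketch, assuming only the URL shapes that appear in this thread (the helper names are made up, and other formats may use different file names):

```python
from urllib.parse import quote

# Identifier and base file name as they appear in the links in this thread.
ITEM_ID = "in.ernet.dli.2015.210130"
BASE_NAME = "2015.210130.Private-Lives"


def details_url(item_id):
    """Item page that lists all the download options."""
    return f"https://archive.org/details/{quote(item_id)}"


def download_url(item_id, base_name, extension):
    """Direct download link, following the pattern of the two links above."""
    return f"https://archive.org/download/{quote(item_id)}/{quote(base_name)}.{extension}"


print(details_url(ITEM_ID))
for ext in ("epub", "mobi"):
    print(download_url(ITEM_ID, BASE_NAME, ext))
```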

I opened that EPUB file in the Windows Calibre program.
It had a mixture of mostly text, but some scanned pages.

The disclaimer at the beginning said:
"This book was produced in EPUB format by the Internet Archive. The book
pages were scanned and converted to EPUB format automatically. This process
relies on optical character recognition, and is somewhat susceptible to
errors. The book may not offer the correct reading sequence, and there may
be weird characters, nonwords, and incorrect guesses at structure. Some page
numbers and headers or footers may remain from the scanned page. The process
which identifies images might have found stray marks on the page which are
not actually images from the book. The hidden page numbering which may be
available to your ereader corresponds to the numbered pages in the print
edition, but is not an exact match; page numbers will increment at the same
rate as the corresponding print edition, but we may have started numbering
before the print book's visible page numbers. The Internet Archive is
working to improve the scanning process and resulting books, but in the
meantime, we hope that this book will be useful to you."

Using Calibre, I converted that 271KB EPUB into a 625KB PDF file instead. Unlike before, the font is a normal font now, and it seems to be PDF text.
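
The same conversion can be scripted with Calibre's command-line tool, ebook-convert, which takes an input file and an output file. A rough sketch, assuming Calibre is installed and ebook-convert is on the PATH; the helper names are made up, and the file names just mirror the archive.org download:

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(source, target):
    """Build the argument list for Calibre's ebook-convert tool."""
    return ["ebook-convert", str(source), str(target)]


def convert(source, target):
    """Run the conversion; return False if ebook-convert is not on PATH."""
    if shutil.which("ebook-convert") is None:
        return False
    subprocess.run(build_convert_command(source, target), check=True)
    return True


# File names follow the archive.org download; adjust the paths as needed.
epub = Path("2015.210130.Private-Lives.epub")
pdf = epub.with_suffix(".pdf")
print(build_convert_command(epub, pdf))
```

Calling convert(epub, pdf) would perform the actual conversion; the sketch only prints the command so that it runs without Calibre present.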

I think, thanks to you, that the mission was accomplished.
But I'll only know later when her iPad reads that PDF out as text.


From Stan Brown@21:1/5 to Rudolph Rhein on Sat Aug 26 08:50:35 2023

XPost: comp.text.pdf, rec.photo.digital

On Sat, 26 Aug 2023 09:37:00 +0300, Rudolph Rhein wrote:

Usually I'm comfortable starting with an EPUB or Kindle for conversion.
But what's the difference between "PDF" and "PDF with text" anyway?

In a "PDF with text", the text is a second "layer" on top of the page
images. PDF-Xchange, among others, can OCR the images and create that
layer. The accuracy of the recognized text is _highly_ dependent on the
quality of the images.

--
Stan Brown, Tehachapi, California, USA https://BrownMath.com/
Shikata ga nai...

