• Convert an imagebook to a textbook (perhaps using OCR?)

    From Rudolph Rhein@21:1/5 to All on Sat Aug 26 04:05:18 2023
    XPost: comp.text.pdf, rec.photo.digital

    My sister's Great Books selection for next month is Noel Coward's comedy
    "Private Lives" from the 1930s. She's almost blind from complications.

    She is not technical and she only has an iPad and an iPhone, but I have
    Android & Windows, so she asked me to help her with iOS text-to-speech.

    She sent me the link to the PDF because text-to-speech won't read it out:
    <https://ia801404.us.archive.org/12/items/in.ernet.dli.2015.210130/2015.210130.Private-Lives.pdf>

    Looking at that PDF, it seems to be not a "textpdf" (whatever you'd call
    it) but just a set of scanned images of the book (with no actual text).

    I tried converting that PDF with Calibre on Windows to an EPUB format,
    but the EPUB was nothing more than a set of the same images in a file.

    What's a good way for me to convert that "imagebook" (whatever you call it)
    to a "textbook" so that I can send it to her to use TTS on her iPad?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Eli the Bearded@21:1/5 to RudolphRhein@nospam.net on Sat Aug 26 02:00:45 2023
    XPost: comp.text.pdf, rec.photo.digital

    Follow-ups set to comp.text.pdf.

    In rec.photo.digital, Rudolph Rhein <RudolphRhein@nospam.net> wrote:
    > My sister's Great Books selection for next month is Noel Coward's comedy
    > "Private Lives" from the 1930s. She's almost blind from complications.
    > ...
    > <https://ia801404.us.archive.org/12/items/in.ernet.dli.2015.210130/2015.210130.Private-Lives.pdf>
    > Looking at that PDF, it seems to be not a "textpdf" (whatever you'd call
    > it) but just a set of scanned images of the book (with no actual text).

    It's archive.org. They have documents in multiple formats already.

    https://archive.org/details/in.ernet.dli.2015.210130

    DOWNLOAD OPTIONS
    * ABBYY GZ download
    * DAISY download For print-disabled users
    * EPUB download
    * FULL TEXT download
    * ITEM TILE download
    * KINDLE download
    * PDF download
    * PDF WITH TEXT download
    * SINGLE PAGE PROCESSED JP2 ZIP

    Their FULL TEXT and PDF WITH TEXT will be OCRed by them, so expect
    typical OCR errors in it.
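
    For anyone feeding the FULL TEXT file to a screen reader, a little cleanup
    first may help. The following is only a minimal sketch: the function name
    and the rules (rejoin hyphenated line breaks, drop lines that are just a
    page number, unwrap paragraphs) are my own assumptions, not anything
    archive.org provides, and it will not fix misrecognised words.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Tidy raw OCR text so a text-to-speech engine reads it more smoothly."""
    text = raw.replace("\r\n", "\n")

    # Rejoin words split across a line break: "conver-\nsation" -> "conversation".
    # Caveat: this also merges genuinely hyphenated words that happen to wrap.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    # Drop lines that contain nothing but a short number (likely a page number).
    lines = [ln for ln in text.split("\n")
             if not re.fullmatch(r"\s*\d{1,4}\s*", ln)]
    text = "\n".join(lines)

    # Blank lines separate paragraphs; single newlines inside one are unwrapped.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)


if __name__ == "__main__":
    sample = "The conver-\nsation went on.\n\n12\n\nNext paragraph here."
    print(clean_ocr_text(sample))
```

    The hyphen rule is the riskiest one; if the result sounds wrong, remove
    that line and let the TTS engine cope with the split words.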

    Elijah
    ------
    does not know what all of the formats are

  • From Paul@21:1/5 to Rudolph Rhein on Fri Aug 25 23:29:06 2023
    XPost: comp.text.pdf, rec.photo.digital

    On 8/25/2023 9:05 PM, Rudolph Rhein wrote:
    > My sister's Great Books selection for next month is Noel Coward's comedy
    > "Private Lives" from the 1930s. She's almost blind from complications.
    >
    > She is not technical and she only has an iPad and an iPhone, but I have
    > Android & Windows, so she asked me to help her with iOS text-to-speech.
    >
    > She sent me the link to the PDF because text-to-speech won't read it out:
    > <https://ia801404.us.archive.org/12/items/in.ernet.dli.2015.210130/2015.210130.Private-Lives.pdf>
    >
    > Looking at that PDF, it seems to be not a "textpdf" (whatever you'd call
    > it) but just a set of scanned images of the book (with no actual text).
    >
    > I tried converting that PDF with Calibre on Windows to an EPUB format,
    > but the EPUB was nothing more than a set of the same images in a file.
    >
    > What's a good way for me to convert that "imagebook" (whatever you call it)
    > to a "textbook" so that I can send it to her to use TTS on her iPad?


    Noel Coward is a genius.

    He picked the perfect font, to prevent OCR :-)

    An italic font, with rough edges. The scanning team did a great job, but
    maybe they should have tried OCR first, before cleanup.

    *******

    https://archive.org/stream/in.ernet.dli.2015.210130/2015.210130.Private-Lives_djvu.txt <=== try TTS on this

    ( https://archive.org/details/in.ernet.dli.2015.210130 )

    Ocr ABBYY FineReader 11.0
    Ppi 600 <=== Didn't look like 600 to me...

    Each scanned page is 2800 x 4000 pixels, so it would
    depend on the size of the printed page, as to whether
    600 is true or not.
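
    The arithmetic can be checked in a few lines. This is only a sketch: the
    pixel dimensions are the ones quoted above, the candidate PPI values are
    guesses to compare against, and the size of the printed page is not known
    here.

```python
def page_size_inches(width_px: int, height_px: int, ppi: int) -> tuple:
    """Physical page size implied by a scan's pixel dimensions and a claimed PPI."""
    return width_px / ppi, height_px / ppi


if __name__ == "__main__":
    # 2800 x 4000 is the scan size quoted above; the PPI values are candidates.
    for ppi in (300, 400, 600):
        w_in, h_in = page_size_inches(2800, 4000, ppi)
        print(f"{ppi} ppi -> {w_in:.2f} x {h_in:.2f} in "
              f"({w_in * 25.4:.0f} x {h_in * 25.4:.0f} mm)")
```

    At 600 ppi the page would be about 4.67 x 6.67 inches; at 300 ppi it would
    be about 9.33 x 13.33 inches. Measuring a printed copy would settle which
    figure is right.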

    Windows apparently has an OCR library. Fat lot of good that does me.

    https://blogs.windows.com/windowsdeveloper/2016/02/08/optical-character-recognition-ocr-for-windows-10/

    If you watch how the OCR in the old Acrobat Distiller package
    used to work, first it does layout analysis. It recognizes text columns
    in a three-column layout. Then, it selects lines of text (pixmap sections)
    and does OCR on them, and it associates the text with the column.

    The Microsoft OCR library, at a guess, does not do layout analysis. It
    takes whatever pixmap section you feed it, and makes a line of text
    (with little or no punctuation or layout info). That is probably why the
    sample image they fed it had only one line of text in it: with a single
    line, the output looks the same whether or not a layout engine is
    present. If the image had just two lines of text, you would see what
    its capabilities actually were.
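
    To make "layout analysis first, then OCR per line" concrete, here is a toy
    sketch using projection profiles on a binarised bitmap (1 = ink,
    0 = paper). It is an illustration of the general idea only. It is not how
    Acrobat, ABBYY, or the Microsoft library actually work, and all function
    names are made up for the example.

```python
from typing import List, Tuple


def find_runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) ranges of consecutive non-zero entries; end is exclusive."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of ink begins
        elif not value and start is not None:
            runs.append((start, i))        # the run ended at a blank entry
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_columns(bitmap: List[List[int]]) -> List[Tuple[int, int]]:
    """Find text columns: x-ranges separated by fully blank pixel columns."""
    col_ink = [sum(col) for col in zip(*bitmap)]
    return find_runs(col_ink)


def segment_lines(bitmap: List[List[int]]) -> List[Tuple[int, int]]:
    """Find text lines: y-ranges separated by fully blank pixel rows."""
    row_ink = [sum(row) for row in bitmap]
    return find_runs(row_ink)


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 0, 1, 1, 0],
        [1, 0, 1, 0, 1, 1, 0],
        [0, 0, 0, 0, 0, 0, 0],
        [1, 1, 0, 0, 0, 1, 1],
        [0, 0, 0, 0, 0, 0, 0],
    ]
    print("columns:", segment_columns(page))
    print("lines:  ", segment_lines(page))
```

    A real engine would run the line finder inside each detected column and
    hand each strip to the character recogniser, keeping track of which
    column the text came from.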

    I could easily feed the sample through some package running
    Tesseract, but we all know how that will turn out.

    Paul

  • From Rudolph Rhein@21:1/5 to Eli the Bearded on Sat Aug 26 09:37:00 2023
    XPost: comp.text.pdf, rec.photo.digital

    Eli the Bearded <*@eli.users.panix.com> wrote:

    > It's archive.org. They have documents in multiple formats already.

    How the heck did you know that?

    > https://archive.org/details/in.ernet.dli.2015.210130

    That's a much better link (to send to the other Great Bookers!).

    > DOWNLOAD OPTIONS
    > * ABBYY GZ download
    > * DAISY download For print-disabled users
    > * EPUB download
    > * FULL TEXT download

    Even though I was aiming for a PDF, "full text" seems like the most
    natural input for a text-to-speech program, wouldn't you think?

    > * ITEM TILE download
    > * KINDLE download
    > * PDF download
    > * PDF WITH TEXT download
    > * SINGLE PAGE PROCESSED JP2 ZIP

    Usually I'm comfortable starting with an EPUB or Kindle for conversion.
    But what's the difference between "PDF" and "PDF with text" anyway?

    > Their FULL TEXT and PDF WITH TEXT will be OCRed by them, so expect
    > typical OCR errors in it.

    How do you know that?
    Are you saying the EPUB/Kindle are the most faithful then?

    > Elijah
    > ------
    > does not know what all of the formats are

    Kindle: <https://archive.org/download/in.ernet.dli.2015.210130/2015.210130.Private-Lives.mobi>

    EPUB: <https://archive.org/download/in.ernet.dli.2015.210130/2015.210130.Private-Lives.epub>
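
    For sending links to the other readers, the URLs seen so far in this
    thread follow a pattern that can be written down as a small helper. This
    is a sketch based only on this one item: the function name is made up, and
    other archive.org items may name their files differently.

```python
def archive_urls(identifier: str, stem: str) -> dict:
    """Build the archive.org links seen in this thread for one item.

    identifier: the item id from the details page URL.
    stem: the shared file-name prefix of the item's derived files.
    """
    download = f"https://archive.org/download/{identifier}/{stem}"
    return {
        "details": f"https://archive.org/details/{identifier}",
        "epub": download + ".epub",
        "kindle": download + ".mobi",
        "full_text": f"https://archive.org/stream/{identifier}/{stem}_djvu.txt",
    }


if __name__ == "__main__":
    links = archive_urls("in.ernet.dli.2015.210130", "2015.210130.Private-Lives")
    for name, url in links.items():
        print(f"{name}: {url}")
```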

    I opened that EPUB file in the Windows Calibre program.
    It had a mixture of mostly text, but some scanned pages.

    The disclaimer at the beginning said:
    "This book was produced in EPUB format by the Internet Archive. The book
    pages were scanned and converted to EPUB format automatically. This
    process relies on optical character recognition, and is somewhat
    susceptible to errors. The book may not offer the correct reading
    sequence, and there may be weird characters, nonwords, and incorrect
    guesses at structure. Some page numbers and headers or footers may remain
    from the scanned page. The process which identifies images might have
    found stray marks on the page which are not actually images from the
    book. The hidden page numbering which may be available to your ereader
    corresponds to the numbered pages in the print edition, but is not an
    exact match; page numbers will increment at the same rate as the
    corresponding print edition, but we may have started numbering before
    the print book's visible page numbers. The Internet Archive is working to
    improve the scanning process and resulting books, but in the meantime, we
    hope that this book will be useful to you."

    Using Calibre, I converted that 271 KB EPUB into a 625 KB PDF file instead.
    Unlike before, the font is a normal font now, and it seems to be PDF text.
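
    The same conversion can be scripted, since Calibre ships a command-line
    tool called ebook-convert that takes an input file and an output file.
    A minimal sketch, assuming Calibre is installed and ebook-convert is on
    the PATH; the helper names and the file names in the usage line are
    hypothetical.

```python
import shutil
import subprocess


def build_convert_command(src: str, dst: str) -> list:
    """Command line for Calibre's converter; formats are inferred from extensions."""
    return ["ebook-convert", str(src), str(dst)]


def convert(src: str, dst: str) -> None:
    """Run the conversion, failing early if Calibre's CLI is not available."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found; is Calibre installed and on PATH?")
    subprocess.run(build_convert_command(src, dst), check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(build_convert_command("Private-Lives.epub", "Private-Lives.pdf")))
```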

    I think, thanks to you, that the mission was accomplished.
    But I'll only know later when her iPad reads that PDF out as text.

  • From Stan Brown@21:1/5 to Rudolph Rhein on Sat Aug 26 08:50:35 2023
    XPost: comp.text.pdf, rec.photo.digital

    On Sat, 26 Aug 2023 09:37:00 +0300, Rudolph Rhein wrote:
    > Usually I'm comfortable starting with an EPUB or Kindle for conversion.
    > But what's the difference between "PDF" and "PDF with text" anyway?


    The text is a second "layer". PDF-XChange, among others, can OCR the
    images and create that layer. The quality of the text rendering is
    _highly_ dependent on the quality of the images.
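
    One rough way to tell the two kinds of PDF apart is to try extracting
    text from each page and see whether anything comes back. The sketch below
    assumes the third-party pypdf package for the extraction step; the
    function names and the 50-character threshold are arbitrary choices for
    illustration, and a poor OCR layer will still pass this check.

```python
from typing import List


def looks_like_text_pdf(page_texts: List[str], min_chars_per_page: int = 50) -> bool:
    """Heuristic: a text layer is present if at least half the pages yield text."""
    if not page_texts:
        return False
    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars_per_page
    )
    return pages_with_text >= len(page_texts) / 2


def extract_page_texts(pdf_path: str) -> List[str]:
    """Extract text per page. Requires pypdf (assumed installed: pip install pypdf)."""
    from pypdf import PdfReader  # imported here so the heuristic works without it

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


if __name__ == "__main__":
    # Simulated extraction results: an image-only scan versus an OCRed copy.
    print(looks_like_text_pdf(["", "", ""]))
    print(looks_like_text_pdf(["x" * 200, "", "y" * 80]))
```

    An image-only scan returns empty strings for every page, which is why a
    screen reader has nothing to read from it.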

    --
    Stan Brown, Tehachapi, California, USA https://BrownMath.com/
    Shikata ga nai...
