• Adding invisible text layer ("manual ocr")

    From Philipp Klaus Krause@21:1/5 to All on Tue Nov 10 11:37:33 2020
    I have a .jpeg image that I want to turn into a searchable .pdf.
    Usually, I use tesseract for that. But in this case, the .pdf from
    tesseract is useless, as most words aren't recognized.

    Is there a way I can do the job of tesseract manually? I can read the
    text in the image, and thus type it. I want a .pdf with an invisible
    searchable text layer on top of the image.

    Philipp

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Peter Flynn@21:1/5 to Philipp Klaus Krause on Fri Nov 13 21:36:32 2020
    On 10/11/2020 10:37, Philipp Klaus Krause wrote:
    > I have a .jpeg image that I want to turn into a searchable .pdf.
    > Usually, I use tesseract for that. But in this case, the .pdf from
    > tesseract is useless, as most words aren't recognized.
    >
    > Is there a way I can do the job of tesseract manually? I can read
    > the text in the image, and thus type it. I want a .pdf with an
    > invisible searchable text layer on top of the image.

    pdflatex can do that, using the package "transparent". In a project I
    worked on, we experimented with doing this for old manuscripts, so
    that users could, in effect, copy and paste from what looks like a
    photograph of the manuscript page.
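
    As a rough illustration of the idea (a minimal sketch, not the setup
    used in that project): the image is placed in a picture environment
    and each typed line is put on top of it with \texttransparent{0}, so
    it stays selectable but is not drawn. The file name page.jpeg, the
    120mm x 80mm size, the coordinates, the sample text and the helper
    \hiddenline are all assumptions made up for this example; the size
    must match the aspect ratio of the real image, and the positions have
    to be measured or adjusted by eye.

    ```latex
    \documentclass{article}
    \usepackage{graphicx}     % \includegraphics, \resizebox
    \usepackage{transparent}  % \texttransparent (pdfTeX/LuaTeX, PDF mode)
    \pagestyle{empty}

    % Hypothetical helper: \hiddenline{x}{y}{width in mm}{text}
    % Puts fully transparent text with its baseline starting at (x,y),
    % stretched to the given width so it spans the printed line.
    \newcommand{\hiddenline}[4]{%
      \put(#1,#2){\texttransparent{0}{\resizebox{#3mm}{!}{#4}}}}

    \begin{document}
    \noindent
    \setlength{\unitlength}{1mm}% picture coordinates are in millimetres
    \begin{picture}(120,80)
      % The scan itself; page.jpeg is an assumed file name.
      \put(0,0){\includegraphics[width=120mm,height=80mm]{page.jpeg}}
      % One call per line of text, typed by hand (placeholder text).
      \hiddenline{10}{68}{96}{First line of the text in the image}
      \hiddenline{10}{62}{96}{Second line of the text in the image}
    \end{picture}
    \end{document}
    ```

    While adjusting positions it helps to change the 0 in
    \texttransparent{0} to something like 0.5, so the overlay is visible
    against the image, and set it back to 0 at the end.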

    There's a test page at
    http://xml.silmaril.ie/downloads/copytext-example.pdf; you should be
    able to select text from the image and paste it into another application.

    Obviously for something written by hand, every typeset letter has to be
    aligned manually so that it superimposes on the manuscript letter, which
    is amazingly time-consuming.

    But if you are dealing with modern type, where it's straight and
    regular, it would be fairly straightforward to do the superimposition by
    using the same font so the spacing is the same.

    Can you share the image that you're dealing with?

    Peter
