• Re: Text to audio file on Windows

    From Newyana2@21:1/5 to Joe Beanfish on Fri Jul 26 09:21:51 2024
    On 7/25/2024 10:29 AM, Joe Beanfish wrote:

    Even extracting the text can be tricky. Some PDFs are actually
    just storing images, so OCR is necessary first. Even when text
    is stored, it's not stored as character encoding but rather as
    vector images. Which is why even the best PDF text extractors
    will do things like converting u to ii or converting d to cl.

    Incorrect. Text is stored as characters (glyphs), not vector images.

    Glyph just means shape. The shape must be encoded somehow.
    Isn't it encoded as a vector image? There are only two methods
    I'm aware of. A raster image is a map of pixel values. A vector image
    is a math formula. The latter can be losslessly enlarged because
    it describes shapes rather than point data. My understanding is that
    PDFs use vector encoding, which is why they can be enlarged
    without losing definition.
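
    A rough sketch of the raster/vector difference described above, in
    Python. It says nothing about how PDFs actually store glyphs; the
    function names and the sample shapes are made up for illustration.
    Scaling an outline just multiplies its coordinates, while scaling a
    pixel map can only repeat the pixels it already has.

```python
def scale_outline(points, factor):
    """Scale a vector outline: every coordinate is multiplied exactly."""
    return [(x * factor, y * factor) for x, y in points]


def scale_raster(pixels, factor):
    """Scale a pixel map by nearest-neighbour repetition (blocky result)."""
    return [
        [px for px in row for _ in range(factor)]
        for row in pixels
        for _ in range(factor)
    ]


# Hypothetical sample data: a triangle outline and a 2x2 bitmap.
triangle = [(0, 0), (4, 0), (2, 3)]
bitmap = [[0, 1],
          [1, 0]]

big_triangle = scale_outline(triangle, 10)  # still three exact corner points
big_bitmap = scale_raster(bitmap, 2)        # each pixel becomes a 2x2 block
```

    The enlarged outline is still three exact points, so it can be drawn
    sharply at any size. The enlarged bitmap holds no more detail than
    the original four pixels did.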

    If that's not true then perhaps you could point to a link. I'd be
    curious to understand better how it works.

    This makes a difference because if it's a vector image shape, then
    OCR software might be the best way to extract the text. Stored text,
    on the other hand, is not shapes but rather numbers. For example, in
    plain ASCII, ANSI, or UTF-8 text, a byte value of 65 represents "A".
    Binary data that directly represents characters would translate
    perfectly to text. But I don't think PDFs are storing it that way.
    First, because fonts must be stored in the file. Second, because PDF
    converters often make visual/shape errors, like seeing "u" as "ii".
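
    To show what "stored text is numbers" means, here is a minimal
    Python sketch. It only illustrates plain character encoding and is
    not a claim about PDF internals; the variable names are arbitrary.

```python
# Three raw byte values, as they might sit in a plain text file.
data = bytes([65, 66, 67])

# Decoding maps each number straight to a character; no shapes are involved.
text = data.decode("ascii")

# The mapping works in both directions.
code_for_a = ord("A")       # character to number
char_for_65 = chr(65)       # number to character

print(text, code_for_a, char_for_65)
```

    With data like this there is nothing to misread: 65 is always "A",
    so a "u" could never come out as "ii".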
