On 7/25/2024 10:29 AM, Joe Beanfish wrote:
>> Even extracting the text can be tricky. Some PDFs are actually
>> just storing images, so OCR is necessary first. Even when text
>> is stored, it's not stored as a character encoding but rather as
>> vector images, which is why even the best PDF text extractors
>> will do things like converting "u" to "ii" or "d" to "cl".
> Incorrect. Text is stored as characters (glyphs), not vector images.
Glyph just means shape. The shape must be encoded somehow.
Isn't it encoded as a vector image? There are only two methods
I'm aware of: a raster image is a map of pixel values, and a vector
image is a math formula. The latter can be losslessly enlarged
because it's shapes rather than point data. My understanding is that
PDFs use vector encoding, which is why they can be enlarged
without losing definition.
If that's not true, then perhaps you could point me to a link. I'd be
curious to understand better how it works.
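To make the raster/vector distinction above concrete, here's a toy Python sketch. It's illustrative only: the function names are mine, and real PDF/font glyph outlines are Bezier curves rather than point lists, but the scaling behavior is the same in principle.

```python
def scale_vector(points, factor):
    """A 'vector' shape is just coordinates (a formula over numbers),
    so scaling is exact: multiply every coordinate by the factor."""
    return [(x * factor, y * factor) for (x, y) in points]


def scale_raster(pixels, factor):
    """A raster is a grid of pixel values. Enlarging has to invent
    pixels (nearest-neighbour duplication here), so the result is
    blocky and the original detail is not recoverable."""
    return [[row[i // factor] for i in range(len(row) * factor)]
            for row in pixels
            for _ in range(factor)]


# A triangle as vector data survives a 10x enlargement losslessly:
triangle = [(0, 0), (4, 0), (2, 3)]
print(scale_vector(triangle, 10))   # coordinates scale exactly

# A 1x2 raster enlarged 2x just repeats pixels into 2x2 blocks:
print(scale_raster([[1, 0]], 2))
```

This is why vector shapes zoom cleanly while scanned (raster) pages get blurry or blocky.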
This makes a difference: if it's a vector image shape, then OCR
software might be the best way to extract the text. Stored text, on
the other hand, is not shapes but numbers. For example, in plain
ASCII, ANSI, or UTF-8 text, a byte value of 65 represents "A".
Binary data that directly represents characters would translate
perfectly to text. But I don't think PDFs are storing it that way:
first, because fonts must be stored in the file, and second, because
PDF converters often make visual/shape errors, like seeing "u" as "ii".
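A quick Python illustration of the "stored as numbers" case above. When text exists as character codes, decoding is an exact table lookup with nothing to guess; a shape-recognition step like OCR has no such table, which is where visual confusions like "u" -> "ii" come from.

```python
# Byte value 65 is "A" in ASCII (and in UTF-8, which is a superset
# of ASCII for values 0-127). Decoding stored codes is lossless:
data = bytes([65, 66, 67])
print(data.decode("ascii"))        # exact round trip, no guessing

# And encoding goes back to the same numbers:
print("A".encode("utf-8")[0])      # the code point, not a shape
```

OCR, by contrast, classifies a picture of a glyph, so two shapes that look alike can be confused; with stored character codes that failure mode simply doesn't exist.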
--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)