On 7/25/2024 10:29 AM, Joe Beanfish wrote:
>> Even extracting the text can be tricky. Some PDFs are actually
>> just storing images, so OCR is necessary first. Even when text
>> is stored, it's not stored as a character encoding but rather as
>> vector images, which is why even the best PDF text extractors
>> will do things like converting "u" to "ii" or "d" to "cl".
> Incorrect. Text is stored as characters (glyphs), not vector images.
Glyph just means shape. The shape must be encoded somehow.
Isn't it encoded as a vector image? There are only two methods
I'm aware of: a raster image is a map of pixel values, and a vector
image is a math formula. The latter can be losslessly enlarged
because it's shapes rather than point data. My understanding is that
PDFs use vector encoding, which is why they can be enlarged
without losing definition.
If that's not true, then perhaps you could point me to a link. I'd be
curious to understand better how it works.
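To make the raster/vector distinction above concrete, here's a toy Python sketch. It's illustrative only: the function names are mine, and real PDF/font glyph outlines are Bezier curves rather than point lists, but the scaling behavior is the same in principle.

```python
def scale_vector(points, factor):
    """A 'vector' shape is just coordinates (a formula over numbers),
    so scaling is exact: multiply every coordinate by the factor."""
    return [(x * factor, y * factor) for (x, y) in points]


def scale_raster(pixels, factor):
    """A raster is a grid of pixel values. Enlarging has to invent
    pixels (nearest-neighbour duplication here), so the result is
    blocky and the original detail is not recoverable."""
    return [[row[i // factor] for i in range(len(row) * factor)]
            for row in pixels
            for _ in range(factor)]


# A triangle as vector data survives a 10x enlargement losslessly:
triangle = [(0, 0), (4, 0), (2, 3)]
print(scale_vector(triangle, 10))   # coordinates scale exactly

# A 1x2 raster enlarged 2x just repeats pixels into 2x2 blocks:
print(scale_raster([[1, 0]], 2))
```

This is why vector shapes zoom cleanly while scanned (raster) pages get blurry or blocky.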
This makes a difference: if it's a vector image shape, then OCR
software might be the best way to extract the text. Stored text, on
the other hand, is not shapes but numbers. For example, in plain
ASCII, ANSI, or UTF-8 text, a byte value of 65 represents "A".
Binary data that directly represents characters would translate
perfectly to text. But I don't think PDFs are storing it that way:
first, because fonts must be stored in the file, and second, because
PDF converters often make visual/shape errors, like seeing "u" as "ii".
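A quick Python illustration of the "stored as numbers" case above. When text exists as character codes, decoding is an exact table lookup with nothing to guess; a shape-recognition step like OCR has no such table, which is where visual confusions like "u" -> "ii" come from.

```python
# Byte value 65 is "A" in ASCII (and in UTF-8, which is a superset
# of ASCII for values 0-127). Decoding stored codes is lossless:
data = bytes([65, 66, 67])
print(data.decode("ascii"))        # exact round trip, no guessing

# And encoding goes back to the same numbers:
print("A".encode("utf-8")[0])      # the code point, not a shape
```

OCR, by contrast, classifies a picture of a glyph, so two shapes that look alike can be confused; with stored character codes that failure mode simply doesn't exist.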
--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)