Forum: >>> Magnum BBS <<<

[mutool] Save images as independent files + manage paragraphs?

From Heck Lennon@21:1/5 to All on Thu Apr 23 15:38:25 2020

Hello,

According to Artifex*, this newsgroup is one of the ways to ask questions.

I'm only getting started investigating how to turn a PDF into EPUB.

By default**, "mutool draw" saves pictures within the HTML files as base64, and breaks paragraphs into independent lines with ….

mutool draw -F html -o out.%d.html in.pdf

I was wondering if there were a way to…
1. Have it keep paragraphs together
2. Save pictures as external JPG/PNG files instead of including them in the HTML file.

Thank you.

* https://artifex.com/support/open-source/
** https://mupdf.com/docs/manual-mutool-draw.html

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From luser droog@21:1/5 to Heck Lennon on Thu Apr 23 21:32:34 2020

On Thursday, April 23, 2020 at 5:38:26 PM UTC-5, Heck Lennon wrote:

> Hello,
>
> According to Artifex*, this newsgroup is one of the ways to ask questions.

That's true, but we're more focused on PostScript rather than the whole document processing milieu.

> I'm only getting started investigating how to turn a PDF into EPUB.

PDF has its own group.

https://groups.google.com/forum/#!forum/comp.text.pdf

From ken@21:1/5 to All on Fri Apr 24 09:13:31 2020

In article <40460c51-4b58-4219-8331-d1e55fb55efd@googlegroups.com>, frdtheman@gmail.com says...

> According to Artifex*, this newsgroup is one of the ways to ask
> questions.

It is, but essentially for Ghostscript (which is a PostScript
interpreter) rather than MuPDF. You may find you get answers more
quickly (and indeed better informed ones) by using IRC and joining the
#mupdf channel on freenode.net

> By default**, "mutool draw" saves pictures within the HTML files as
> base64, and breaks paragraphs into independent lines with ….
>
> mutool draw -F html -o out.%d.html in.pdf
>
> I was wondering if there were a way to…
> 1. Have it keep paragraphs together

OK, you may need to do some more research on the structure of a PDF file.
I'm assuming you are more familiar with HTML than PDF, and it may come
as a surprise to you to discover that PDF does not have the same kind of
metadata that an HTML file would.

This is especially true of text: there is no concept of text structure
in a PDF file at all. No lines, no paragraphs, no sentences, nothing. All
there is in a PDF file is 'this text' and 'put it here on the page'.

The encoding used for the text may even be custom, and there may be no
possible method (other than OCR) for determining the actual text content
(e.g. the Unicode values).

Sentences don't even have to be contiguous. I could (and PDF files
sometimes do) write "The quick brown" at the top left of the page, then
drop to the bottom of the page and write "Copyright mother goose", then
jump back up to the top of the page, further along to the right, and
write "jumped over the lazy dog". Then I could move back to the left,
between the two existing pieces of text at the top, and write "fox".
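
To make that example concrete, here is a minimal sketch of recovering the reading order from position alone. The coordinates, the (text, x, y) tuple layout, the `reading_order` helper, and the assumption that y is measured downward from the top of the page are all invented for illustration; they are not what MuPDF emits.

```python
# Hypothetical fragments, listed in the order they were written to the page.
# Each entry is (text, x, y); y grows downward from the top (an assumption).
fragments = [
    ("The quick brown", 72, 72),
    ("Copyright mother goose", 72, 770),
    ("jumped over the lazy dog", 260, 72),
    ("fox", 200, 72),
]


def reading_order(frags, line_tolerance=3):
    """Group fragments into lines by similar y, then sort each line by x."""
    lines = []
    for text, x, y in sorted(frags, key=lambda f: f[2]):
        # Fragments whose y is within line_tolerance share a line.
        if lines and abs(lines[-1]["y"] - y) <= line_tolerance:
            lines[-1]["items"].append((x, text))
        else:
            lines.append({"y": y, "items": [(x, text)]})
    return [" ".join(t for _, t in sorted(line["items"])) for line in lines]


print(reading_order(fragments))
```

Sorting by position only works for simple single-column pages; multi-column layouts, rotated text and tables need more elaborate heuristics.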

So that's why you don't get the paragraphs you expect: there aren't any
to start with. So, by inference, no, you can't have MuPDF keep
paragraphs together.

If you just look at the text and the order it appears in the PDF file,
it won't reliably tell you much. There is positional information
available for the text though, so you can post-process the extracted
text and apply your own heuristics to try and decide where paragraphs,
columns, tables etc are.
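
One possible heuristic of this kind, sketched under stated assumptions: each extracted line is reduced to a (y, text) pair with y increasing down the page, and a vertical gap noticeably larger than the typical line spacing is taken to start a new paragraph. The `group_paragraphs` name, the input format and the 1.5 threshold are illustrative choices, not anything MuPDF provides.

```python
from statistics import median


def group_paragraphs(lines, gap_factor=1.5):
    """Join lines into paragraphs using vertical spacing.

    lines: list of (y, text) sorted top to bottom, y increasing downward.
    A gap larger than gap_factor times the median gap starts a new paragraph.
    """
    if len(lines) < 2:
        return [text for _, text in lines]
    typical_gap = median(b[0] - a[0] for a, b in zip(lines, lines[1:]))
    paragraphs = [[lines[0][1]]]
    for (prev_y, _), (y, text) in zip(lines, lines[1:]):
        if y - prev_y > gap_factor * typical_gap:
            paragraphs.append([text])      # large gap: new paragraph
        else:
            paragraphs[-1].append(text)    # normal gap: same paragraph
    return [" ".join(p) for p in paragraphs]


sample = [
    (100, "PDF stores glyphs"),
    (112, "and positions only."),
    (124, "Nothing marks a line."),
    (148, "A larger gap suggests"),
    (160, "a new paragraph."),
]
print(group_paragraphs(sample))
```

Documents that mark paragraphs by first-line indentation rather than extra spacing would need a different signal, such as the x position of each line start.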

> 2. Save pictures as external JPG/PNG files instead of including them in the HTML file.

No, currently there is no way to do that. Obviously the code could be
altered so that the image data is written to a series of files, and
links to those files inserted into the HTML in their place.

But it can't be done with the existing code by simply flipping a switch
or something.

Caveat: I am not one of the MuPDF developers. The information above
regarding image data was provided to me by one of the developers, though;
the text information is by me, so if it's wrong I can be blamed.

Regards,

Ken

From Heck Lennon@21:1/5 to All on Fri Apr 24 11:05:22 2020

Thanks very much for the info!

From news@zzo38computer.org.invalid@21:1/5 to ken on Fri Apr 24 22:50:32 2020

ken <ken@spamcop.net> wrote:

>> 2. Save pictures as external JPG/PNG files instead of including them in the HTML file.
>
> No, currently there is no way to do that. Obviously the code could be
> altered so that the image data is written to a series of files, and
> links to those files inserted into the HTML in their place.
>
> But it can't be done with the existing code by simply flipping a switch
> or something.

Of course, it would also be possible to post-process the HTML data with
an external program and copy the embedded pictures to external files. I
don't know if there is an existing program to do this, though. (Doing it
manually would also be possible, although this isn't ideal.)
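
A rough sketch of such a post-processing step, assuming the pictures appear in the HTML as `src="data:image/...;base64,..."` attributes (the exact markup written by mutool is not confirmed here and may differ between versions). The `externalize_images` function and the `img0001.png` naming scheme are made up for this example.

```python
import base64
import re
from pathlib import Path

# Assumed form of an embedded picture: src="data:image/<type>;base64,<payload>"
DATA_URI = re.compile(
    r'src="data:image/(png|jpeg|jpg|gif);base64,([A-Za-z0-9+/=\s]+)"'
)
EXTENSIONS = {"jpeg": "jpg"}


def externalize_images(html, out_dir, prefix="img"):
    """Write each embedded base64 image to out_dir and link to it instead."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0

    def replace(match):
        nonlocal count
        count += 1
        kind = match.group(1)
        name = f"{prefix}{count:04d}.{EXTENSIONS.get(kind, kind)}"
        # b64decode ignores the whitespace that may wrap long payloads.
        (out_dir / name).write_bytes(base64.b64decode(match.group(2)))
        return f'src="{name}"'

    return DATA_URI.sub(replace, html)
```

To use it, read one of the generated pages (for instance out.1.html) as text, pass it through `externalize_images` with the directory where the rewritten page will live, and write the returned string back out. The links are bare file names, so the page and the pictures need to sit in the same directory.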

Another question might be where the PDFs come from, and why you need
them converted to EPUB; depending on the answer, it might be possible to
do something else in order to do what is needed (including for converting
the paragraphs). However, that isn't the question being asked, so for now
we just answer the question about converting PDF to EPUB.

--
This signature intentionally left blank.
(But if it has these words, then actually it isn't blank, isn't it?)
