Hello,
According to Artifex*, this newsgroup is one of the ways to ask questions.
I'm only just getting started investigating how to turn a PDF into EPUB.
By default*, "mutool draw" saves pictures within the HTML files as base64, and breaks paragraphs into independent lines with <p>...</p>.
mutool draw -F html -o out.%d.html in.pdf
I was wondering if there were a way to:
1. Have it keep paragraphs together
2. Save pictures as external JPG/PNG files instead of including them in the HTML file.
In article <40460c51-4b58-4219-8331-d1e55fb55efd@googlegroups.com>, frdtheman@gmail.com says...
According to Artifex*, this newsgroup is one of the ways to ask questions.
It is, but essentially for Ghostscript (which is a PostScript
interpreter) rather than MuPDF. You may find you get answers more
quickly (and indeed better informed ones) by using IRC and joining the #mupdf channel on freenode.net
By default*, "mutool draw" saves pictures within the HTML files as base64, and breaks paragraphs into independent lines with <p>...</p>.
mutool draw -F html -o out.%d.html in.pdf
I was wondering if there were a way to:
1. Have it keep paragraphs together
OK you may need to do some more research on the structure of a PDF file.
I'm assuming you are more familiar with HTML than PDF, and it may come
as a surprise to you to discover that PDF does not have the same kind of metadata that an HTML file would.
This is especially true with text: there is no concept of text structure
in a PDF file at all, no lines, no paragraphs, sentences, nothing. All
there is in a PDF file is 'this text' and 'put it here on the page'.
The encoding used for the text may even be custom, and there may be no possible method (other than OCR) for determining the actual text content
(e.g. the Unicode values).
Sentences don't even have to be contiguous, I could (and PDF files
sometimes do) write at the top left of the page "The quick brown" then
drop to the bottom of the page, write "Copyright mother goose", then
jump back up to the top of the page, but moved along to the right, and
write "jumped over the lazy dog". Then move back to the left, between
the two existing pieces of text at the top, and write "fox".
So that's why you don't get the paragraphs you expect: there aren't any
to start with. So by inference no, you can't have MuPDF keep paragraphs together.
If you just look at the text and the order it appears in the PDF file,
it won't reliably tell you much. There is positional information
available for the text though, so you can post-process the extracted
text and apply your own heuristics to try and decide where paragraphs, columns, tables etc are.
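As a rough illustration of that kind of post-processing, here is a minimal Python sketch of my own; it is not anything MuPDF provides. It assumes you have already pulled each text fragment out along with its position, and the Fragment fields, the tolerance and the gap factor are all hypothetical choices you would need to tune. It sorts fragments by position rather than file order, joins fragments sharing a baseline into lines, and starts a new paragraph when the vertical gap is large.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical record for one piece of extracted text (not a MuPDF type)."""
    x: float     # left edge of the fragment
    y: float     # baseline, measured downwards from the top of the page
    size: float  # font size
    text: str


def group_lines(fragments, tol=2.0):
    """Group fragments whose baselines differ by at most `tol` into lines."""
    lines = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        if lines and abs(frag.y - lines[-1][0].y) <= tol:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within each line, order fragments left to right.
    return [sorted(line, key=lambda f: f.x) for line in lines]


def group_paragraphs(lines, gap_factor=1.5):
    """Join consecutive lines unless the vertical gap exceeds gap_factor * font size."""
    paragraphs = []
    prev = None
    for line in lines:
        first = line[0]
        text = " ".join(f.text for f in line)
        if prev is not None and (first.y - prev.y) <= gap_factor * first.size:
            paragraphs[-1] += " " + text
        else:
            paragraphs.append(text)
        prev = first
    return paragraphs
```

Fed the four out-of-order fragments from the example above, this puts "fox" back between the other two pieces on the top line and keeps the copyright notice as its own paragraph. Real pages with columns, tables or rotated text will need considerably more than this.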
2. Save pictures as external JPG/PNG files instead of including them in the HTML file.
No, currently there is no way to do that. Obviously the code could be altered so that the image data is written to a series of files, and
links to those files inserted into the HTML in their place.
But it can't be done with the existing code by simply flipping a switch
or something.
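What you can do is post-process the HTML yourself. The following is only a sketch in Python, and it makes an assumption I have not checked against the MuPDF source: that the images appear as src="data:image/png;base64,..." (or image/jpeg) attributes in double quotes. The function name, file prefix and numbering scheme are made up for illustration.

```python
import base64
import re
from pathlib import Path

# Assumed form of the embedded images; adjust if the real output differs.
DATA_URI = re.compile(r'src="data:image/(png|jpeg);base64,([^"]+)"')


def externalize_images(html, out_dir, prefix="img"):
    """Write each embedded image to out_dir and return HTML linking to the files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    counter = 0

    def replace(match):
        nonlocal counter
        counter += 1
        ext = "jpg" if match.group(1) == "jpeg" else "png"
        name = f"{prefix}{counter:04d}.{ext}"
        # Decode the base64 payload and save it as a separate file.
        (out_dir / name).write_bytes(base64.b64decode(match.group(2)))
        # The link is relative, so the HTML should be saved in out_dir too.
        return f'src="{name}"'

    return DATA_URI.sub(replace, html)
```

You would read each out.N.html file, pass its contents through externalize_images with a different prefix per page, and write the result back alongside the images.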
Caveat: I am not one of the MuPDF developers. The information above regarding image data was provided to me by one of the developers, though;
the text information is by me, so if it's wrong I can be blamed.
Regards,
Ken