• [mutool] Save images as independent files + manage paragraphs?

    From Heck Lennon@21:1/5 to All on Thu Apr 23 15:38:25 2020
    Hello,

    According to Artifex*, this newsgroup is one of the ways to ask questions.

    I'm only getting started investigating how to turn a PDF into EPUB.

    By default**, "mutool draw" saves pictures within the HTML files as base64, and breaks paragraphs into independent lines with <p>…</p>.

    mutool draw -F html -o out.%d.html in.pdf

    I was wondering if there were a way to…
    1. Have it keep paragraphs together
    2. Save pictures as external JPG/PNG files instead of including them in the HTML file.

    Thank you.

    * https://artifex.com/support/open-source/
    ** https://mupdf.com/docs/manual-mutool-draw.html

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From luser droog@21:1/5 to Heck Lennon on Thu Apr 23 21:32:34 2020
    On Thursday, April 23, 2020 at 5:38:26 PM UTC-5, Heck Lennon wrote:
    Hello,

    According to Artifex*, this newsgroup is one of the ways to ask questions.

    That's true, but we're focused on PostScript rather than the whole document processing milieu.


    I'm only getting started investigating how to turn a PDF into EPUB.

    PDF has its own group.

    https://groups.google.com/forum/#!forum/comp.text.pdf

  • From ken@21:1/5 to All on Fri Apr 24 09:13:31 2020
    In article <40460c51-4b58-4219-8331-d1e55fb55efd@googlegroups.com>, frdtheman@gmail.com says...

    According to Artifex*, this newsgroup is one of the ways to ask
    questions.

    It is, but essentially for Ghostscript (which is a PostScript
    interpreter) rather than MuPDF. You may find you get answers more
    quickly (and indeed better informed ones) by using IRC and joining the
    #mupdf channel on freenode.net


    By default**, "mutool draw" saves pictures within the HTML files as
    base64, and breaks paragraphs into independent lines with <p>…</p>.

    mutool draw -F html -o out.%d.html in.pdf

    I was wondering if there were a way to…
    1. Have it keep paragraphs together

    OK, you may need to do some more research on the structure of a PDF file.
    I'm assuming you are more familiar with HTML than PDF, and it may come
    as a surprise to you to discover that PDF does not have the same kind of
    metadata that an HTML file would.

    This is especially true of text: there is no concept of text structure
    in a PDF file at all. No lines, no paragraphs, no sentences, nothing. All
    there is in a PDF file is 'this text' and 'put it here on the page'.

    The encoding used for the text may even be custom, and there may be no
    possible method (other than OCR) for determining the actual text content
    (e.g. the Unicode values).

    Sentences don't even have to be contiguous. I could (and PDF files
    sometimes do) write at the top left of the page "The quick brown", then
    drop to the bottom of the page, write "Copyright mother goose", then
    jump back up to the top of the page, but moved along to the right, and
    write "jumped over the lazy dog". Then move back to the left, between
    the two existing pieces of text at the top, and write "fox".
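
    To make the example above concrete, here is a minimal Python sketch (not
    anything MuPDF does itself) that takes fragments in drawing order and
    rebuilds reading order from their positions. The Fragment type, the
    coordinates and the tolerance value are all made up for illustration,
    and y is assumed to grow downward from the top of the page.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float   # left edge of the fragment (hypothetical units)
    y: float   # baseline, assumed to be measured downward from the page top
    text: str


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by baseline, then sort each line left to right."""
    lines = []  # each entry is a list of fragments sharing roughly one baseline
    for frag in sorted(fragments, key=lambda f: f.y):
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [
        " ".join(f.text for f in sorted(line, key=lambda f: f.x))
        for line in lines
    ]


# Fragments in the order they were drawn, mirroring the example in the text.
drawn = [
    Fragment(50, 40, "The quick brown"),
    Fragment(50, 780, "Copyright mother goose"),
    Fragment(200, 40, "jumped over the lazy dog"),
    Fragment(150, 40, "fox"),
]

for line in reading_order(drawn):
    print(line)
```

    This only works for a single column of horizontal text; anything with
    columns, rotated text or overlapping lines needs more than a sort.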

    So that's why you don't get the paragraphs you expect: there aren't any
    to start with. By inference, no, you can't have MuPDF keep paragraphs
    together.

    If you just look at the text and the order it appears in the PDF file,
    it won't reliably tell you much. There is positional information
    available for the text though, so you can post-process the extracted
    text and apply your own heuristics to try to decide where paragraphs,
    columns, tables, etc. are.
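
    A rough sketch of one such heuristic, in Python: start a new paragraph
    whenever the vertical gap between consecutive lines is noticeably larger
    than the line height. The input format (y, height, text) and the 1.5
    threshold are assumptions for illustration; real documents will need
    tuning, and extra rules for columns and tables.

```python
def group_paragraphs(lines, gap_factor=1.5):
    """Join lines into paragraphs based on vertical spacing.

    lines: list of (y, height, text) tuples sorted top to bottom, where y is
    the baseline measured downward from the top of the page (an assumption).
    A new paragraph starts when the baseline distance to the previous line
    exceeds gap_factor * height.
    """
    paragraphs = []
    current = []
    prev_y = None
    for y, height, text in lines:
        if prev_y is not None and (y - prev_y) > gap_factor * height:
            paragraphs.append(" ".join(current))
            current = []
        current.append(text)
        prev_y = y
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


# Hypothetical page: 12-unit text, 14-unit leading, a larger gap between paragraphs.
page = [
    (100, 12, "First line of the"),
    (114, 12, "first paragraph."),
    (140, 12, "Second paragraph"),
    (154, 12, "continues here."),
]

for paragraph in group_paragraphs(page):
    print(paragraph)
```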


    2. Save pictures as external JPG/PNG files instead of including them in the HTML file.

    No, currently there is no way to do that. Obviously the code could be
    altered so that the image data is written to a series of files, and
    links to those files inserted into the HTML in their place.

    But it can't be done with the existing code by simply flipping a switch
    or something.


    Caveat: I am not one of the MuPDF developers. The information above
    regarding image data was provided to me by one of the developers; the
    text information is mine, so if it's wrong I can be blamed.


    Regards,

    Ken

  • From Heck Lennon@21:1/5 to All on Fri Apr 24 11:05:22 2020
    Thanks very much for the info!

  • From news@zzo38computer.org.invalid@21:1/5 to ken on Fri Apr 24 22:50:32 2020
    ken <ken@spamcop.net> wrote:

    2. Save pictures as external JPG/PNG files instead of including them in the HTML file.

    No, currently there is no way to do that. Obviously the code could be
    altered so that the image data is written to a series of files, and
    links to those files inserted into the HTML in their place.

    But it can't be done with the existing code by simply flipping a switch
    or something.

    Of course, it would also be possible to post-process the HTML data with an
    external program and copy the pictures to external files. I don't know if
    there is an existing program to do this, though. (Doing it manually would
    also be possible, although this isn't ideal.)
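
    As a minimal sketch of such a post-processing step, assuming the HTML
    embeds images as src="data:image/...;base64,..." attributes with double
    quotes (check your actual output first), something like the following
    Python would write each image to its own file and rewrite the link. The
    function name and the file naming scheme are made up for illustration.

```python
import base64
import re
from pathlib import Path

# Assumed shape of the embedded images; adjust if the real HTML differs.
DATA_URI = re.compile(r'src="data:image/(png|jpeg|jpg|gif);base64,([^"]+)"')


def externalize_images(html, out_dir, prefix="img"):
    """Write each base64 image to out_dir and return HTML that links to the files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    counter = 0

    def replace(match):
        nonlocal counter
        counter += 1
        kind = match.group(1)
        ext = "jpg" if kind in ("jpeg", "jpg") else kind
        name = f"{prefix}{counter:04d}.{ext}"
        # b64decode ignores embedded whitespace by default.
        (out_dir / name).write_bytes(base64.b64decode(match.group(2)))
        return f'src="{name}"'

    return DATA_URI.sub(replace, html)


if __name__ == "__main__":
    import tempfile

    # Tiny stand-in for real output; the payload is not a valid PNG.
    sample = '<p><img src="data:image/png;base64,aGVsbG8="></p>'
    with tempfile.TemporaryDirectory() as tmp:
        print(externalize_images(sample, tmp))
```

    The rewritten src is relative, so the HTML has to sit in the same
    directory as the extracted images (or the name has to be adjusted).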

    Another question might be where the PDFs come from, and why you need
    them converted to EPUB; depending on the answer, it might be possible to
    do something else in order to do what is needed (including for converting
    the paragraphs). However, that isn't the question being asked, so for now
    we just answer the question about converting PDF to EPUB.

    --
    This signature intentionally left blank.
    (But if it has these words, then actually it isn't blank, isn't it?)
