• Copying text from n2479.pdf

    From Keith Thompson@21:1/5 to All on Fri Sep 25 11:13:20 2020
    http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2479.pdf is a recent
    draft of C20.

    When I copy text from n2479.pdf, I get things like this:

    The
    :
    strdup function
    ::::::
    creates
    ::
    a
    :::: copy::: of::: the:::::: string::::::: pointed:: to::: by:: s:: in:: a ::::: space::::::::: allocated :: as:: if::: by :a:::: call
    ::
    to
    :::::::
    malloc.
    :

    (It varies slightly depending on which PDF viewer I use.)

    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    Working, but not speaking, for Philips Healthcare
    void Void(void) { Void(); } /* The recursive call of the void */

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Pankaj Jangid@21:1/5 to Keith Thompson on Sat Sep 26 08:48:54 2020
    On Fri, Sep 25 2020, Keith Thompson wrote:

    > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2479.pdf is a recent
    > draft of C20.
    >
    > When I copy text from n2479.pdf, I get things like this:
    >
    > The
    > :
    > strdup function
    > ::::::
    > creates
    > ::
    > a
    > :::: copy::: of::: the:::::: string::::::: pointed:: to::: by:: s::
    > in:: a ::::: space::::::::: allocated :: as:: if::: by :a:::: call
    > ::
    > to
    > :::::::
    > malloc.
    > :

    It is because of those wavy underlines.

  • From Keith Thompson@21:1/5 to Pankaj Jangid on Fri Sep 25 23:05:57 2020
    Pankaj Jangid <pankaj.jangid@gmail.com> writes:
    > On Fri, Sep 25 2020, Keith Thompson wrote:
    >> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2479.pdf is a recent
    >> draft of C20.
    >>
    >> When I copy text from n2479.pdf, I get things like this:
    >>
    >> The
    >> :
    >> strdup function
    >> ::::::
    >> creates
    >> ::
    >> a
    >> :::: copy::: of::: the:::::: string::::::: pointed:: to::: by:: s::
    >> in:: a ::::: space::::::::: allocated :: as:: if::: by :a:::: call
    >> ::
    >> to
    >> :::::::
    >> malloc.
    >> :
    >
    > It is because of those wavy underlines.

    Yes, that explains it, thanks. So I can copy-and-paste from N2478,
    which doesn't have the wavy underlining:

    The strndup function creates a string initialized with no more than
    size initial characters of the array pointed to by s and up to the
    first null character, whichever comes first, in a space allocated as
    if by a call to malloc .


  • From Paul@21:1/5 to Keith Thompson on Sat Sep 26 07:51:52 2020
    Keith Thompson wrote:
    > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2479.pdf is a recent
    > draft of C20.
    >
    > When I copy text from n2479.pdf, I get things like this:
    >
    > The
    > :
    > strdup function
    > ::::::
    > creates
    > ::
    > a
    > :::: copy::: of::: the:::::: string::::::: pointed:: to::: by:: s:: in:: a ::::: space::::::::: allocated :: as:: if::: by :a:::: call
    > ::
    > to
    > :::::::
    > malloc.
    > :
    >
    > (It varies slightly depending on which PDF viewer I use.)


    PDF files can be read into Office Word, but this only works
    when the author has generated a dual-representation type of
    PDF which holds info Office can use.

    LibreOffice Draw can read in PDF, but not likely with any
    good purpose in mind. Don't try it on this document!!!
    Use it on a single page test PDF just to see how it works.

    So far, nothing I have handy here looks immediately useful
    in the "pure GUI power tool" department.

    *******

    I tried this.

    mutool convert -F pdf -O decompress,clean -o n2479_out.pdf n2479.pdf # a mess

    The underline effect seems to be a font with a single character (sinewave)
    in it. In the document, where it underlines the word "underlining", the
    stanza looks like... ten sinewaves underneath an eleven character word.

    /F3 5.9776 Tf
    1 0 0 1 230.857 349.568 Tm
    [<0001000100010001000100010001000100010001>] TJ
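
To check the "ten sinewaves" count without eyeballing the hex, the TJ operand can be split into character codes. This is only a rough sketch: the helper name `glyph_codes` is made up here, and it assumes the font uses two-byte character codes, which is a guess based on how the string above is laid out and not something the PDF excerpt states.

```python
import re

def glyph_codes(tj_operand, bytes_per_code=2):
    """Return the character codes found in the hex strings of a TJ operand.

    Assumption: every code is bytes_per_code bytes wide (two by default).
    """
    codes = []
    width = 2 * bytes_per_code  # hex digits per character code
    for hex_string in re.findall(r"<([0-9A-Fa-f\s]*)>", tj_operand):
        digits = re.sub(r"\s+", "", hex_string)
        if len(digits) % 2:
            digits += "0"  # PDF treats a missing final hex digit as 0
        for i in range(0, len(digits), width):
            codes.append(int(digits[i:i + width], 16))
    return codes

operand = "[<0001000100010001000100010001000100010001>] TJ"
codes = glyph_codes(operand)
print(len(codes), "glyphs, distinct codes:", sorted(set(codes)))
```

If every code comes out the same, that fits the idea of a font holding a single sinewave glyph that is simply repeated under the word.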

    If converted to Postscript, the underline method looks like this.

    .895628 .7673 0 0 cmyk
    VWZQUL+LASY6*1 [5.9776 0 0 -5.9776 0 0 ]msf
    320.52 467.331 mo
    (::::::)
    [4.98111 4.98114 4.98111 4.98111 4.98114 0 ]xsh

    Neither method was of sufficient quality to be part of a workflow.
    The document does not convert cleanly enough for this.

    *******

    Converted to HTML, there were no complaints about font conversion.
    Loading the HTML into a browser sorta works OK. The spaghetti
    below shows what the HTML section with the "underlining" text looks like.
    The color is blue #0000ff.

    mutool convert -F html -o n2479.html n2479.pdf

    <p style="position:absolute;white-space:pre;margin:0;padding:0;top:480pt;left:111pt">
    <span style="font-family:URWPalladioL,serif;font-size:9.024493pt">
    text that has been deleted and
    </span>

    <i><span style="font-family:LASY6,serif;font-size:5.9776pt;color:#0000ff"> <=== to be
    ::::::::::</span></i> <=== removed



    <p style="position:absolute;white-space:pre;margin:0;padding:0;top:480pt;left:233pt">
    <span style="font-family:URWPalladioL,serif;font-size:8.9664pt;color:#0000ff">
    underlining
    </span>
    <span style="font-family:URWPalladioL,serif;font-size:9.024493pt">
    text that has been added. Pages that contain changes
    </span></p>

    The first expression below removed some of them, until I found
    a ">:: ::: :<" one, with spaces mixed in among the colons. The
    second expression may have got rid of more of them. What I'm
    doing is just removing the strings of colons between two tags,
    leaving a bare >< pair (an empty text string), rather than
    editing the whole span in front of it.

    sed 's/>:*</></g' n2479.html > n2479sed.html

    sed 's/>[: ]*</></g' n2479.html > n2479sed.html
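
For anyone without sed handy, the same two substitutions can be tried in Python. This is a sketch with made-up names (`strip_runs`, `COLONS_ONLY`, `COLONS_AND_SPACES`) and a hand-written sample string, not real mutool output. It shows why the first pattern misses a ">:: ::: :<" run and the second one catches it.

```python
import re

COLONS_ONLY = re.compile(r">:*<")           # mirrors the first sed expression
COLONS_AND_SPACES = re.compile(r">[: ]*<")  # mirrors the second sed expression

def strip_runs(html, pattern=COLONS_AND_SPACES):
    """Replace every run matched between '>' and '<' with an empty '><' pair."""
    return pattern.sub("><", html)

# Hand-made sample: one plain colon run, one run with embedded spaces,
# and one span of ordinary text that must survive.
sample = "<span>::::::</span> <span>:: ::: :</span> <span>underlining</span>"

print(strip_runs(sample, COLONS_ONLY))
print(strip_runs(sample))
```

One side effect to keep in mind: the second pattern also matches a gap between two tags that holds only spaces, so that whitespace is removed as well.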

    That's as far as I got.

    Still no good HTML to text function has shown up.
    I'd like to preserve some of the positioning so the
    file is human-readable.
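
One possible way to get plain text out of the HTML is to skip the underline font outright and not bother with the colons at all. The sketch below uses only Python's standard `html.parser`. The class and function names are made up, and it assumes that every underline span carries `LASY6` in its style attribute, as in the excerpt above. It starts a new output line per `<p>` and makes no attempt at real positioning.

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect text, skipping spans whose style names a given font."""

    def __init__(self, skip_font="LASY6"):
        super().__init__()
        self.skip_font = skip_font
        self.span_skips = []  # one True/False flag per currently open <span>
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = dict(attrs).get("style") or ""
            self.span_skips.append(self.skip_font in style)
        elif tag == "p":
            self.lines.append("")  # each positioned <p> starts a new line

    def handle_endtag(self, tag):
        if tag == "span" and self.span_skips:
            self.span_skips.pop()

    def handle_data(self, data):
        if any(self.span_skips):
            return  # inside an underline span: drop the colons
        text = " ".join(data.split())
        if text:
            if not self.lines:
                self.lines.append("")
            self.lines[-1] = (self.lines[-1] + " " + text).strip()

def html_to_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return "\n".join(line for line in parser.lines if line)
```

Usage would be something like `print(html_to_text(open("n2479.html").read()))`. It does nothing about the deleted-versus-added text problem, since both old and new strings still come through.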

    The colored text still has to be corrected. The HTML version
    did not preserve the strikethru effect, and if the file is
    converted to text, both old and new strings will be
    included. And not all red text is strikethru text, so
    finding red coloring and removing strings likely won't
    work right either.

    You can copy/paste out of Firefox after using

    firefox n2479sed.html

    That should be workable for small samples.

    Paul
