Keith Thompson wrote:
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2479.pdf is a recent
draft of C20.
When I copy text from n2479.pdf, I get things like this:
The
:
strdup function
::::::
creates
::
a
:::: copy::: of::: the:::::: string::::::: pointed:: to::: by:: s:: in:: a ::::: space::::::::: allocated :: as:: if::: by :a:::: call
::
to
:::::::
malloc.
:
(It varies slightly depending on which PDF viewer I use.)
PDF files can be read into Office Word, but this only works
when the author has generated a dual-representation type of
PDF which holds info Office can use.
LibreOffice Draw can read in PDF, but not likely with any
good purpose in mind. Don't try it on this document!!!
Use it on a single page test PDF just to see how it works.
So far, nothing I have handy here looks immediately useful
in the "pure GUI power tool" department.
*******
I tried this.
mutool convert -F pdf -O decompress,clean -o n2479_out.pdf n2479.pdf # a mess
The underline effect seems to come from a font with a single character
(a sine wave) in it. Where the document underlines the word "underlining",
the stanza looks like this: ten sine waves underneath an eleven-character word.
/F3 5.9776 Tf
1 0 0 1 230.857 349.568 Tm
[<0001000100010001000100010001000100010001>] TJ
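To check the "ten sine waves" count, the hex string in the TJ operand can be split into character codes. This is only a sketch: the function name glyph_codes is made up for illustration, and it assumes the font uses two-byte character codes, which the stanza suggests but does not prove.

```python
import re

def glyph_codes(tj_operand, bytes_per_code=2):
    """Split every <hex> string in a TJ operand into integer character codes.

    bytes_per_code=2 is an assumption about the font's encoding.
    """
    codes = []
    width = 2 * bytes_per_code  # hex digits per character code
    for hex_string in re.findall(r"<([0-9A-Fa-f\s]*)>", tj_operand):
        digits = "".join(hex_string.split())
        if len(digits) % 2:
            digits += "0"  # PDF treats a missing final hex digit as 0
        codes += [int(digits[i:i + width], 16)
                  for i in range(0, len(digits), width)]
    return codes

# Same content as the operand quoted above: "0001" repeated ten times.
operand = "[<" + "0001" * 10 + ">] TJ"
codes = glyph_codes(operand)
print(len(codes), set(codes))
```

Every code comes out as glyph 1, so the underline is one glyph drawn repeatedly, which is why it turns into a run of identical characters when copied as text.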
If converted to PostScript, the underline method looks like this.
.895628 .7673 0 0 cmyk
VWZQUL+LASY6*1 [5.9776 0 0 -5.9776 0 0 ]msf
320.52 467.331 mo
(::::::)
[4.98111 4.98114 4.98111 4.98111 4.98114 0 ]xsh
Neither method was of sufficient quality to be part of a workflow.
The document does not convert cleanly enough for this.
*******
Converted to HTML, there were no complaints about font conversion.
Loading the HTML into a browser sort of works. The spaghetti below
shows what the HTML section with the "underlining" text looks like.
The color is blue, #0000ff.
mutool convert -F html -o n2479.html n2479.pdf
<p style="position:absolute;white-space:pre;margin:0;padding:0;top:480pt;left:111pt">
<span style="font-family:URWPalladioL,serif;font-size:9.024493pt">
text that has been deleted and
</span>
<i><span style="font-family:LASY6,serif;font-size:5.9776pt;color:#0000ff"> <=== to be
::::::::::</span></i> <=== removed
<p style="position:absolute;white-space:pre;margin:0;padding:0;top:480pt;left:233pt">
<span style="font-family:URWPalladioL,serif;font-size:8.9664pt;color:#0000ff">
underlining
</span>
<span style="font-family:URWPalladioL,serif;font-size:9.024493pt">
text that has been added. Pages that contain changes
</span></p>
The first sed expression below removed some of them, until I found
a ">:: ::: :<" one. The second expression, which also matches spaces,
may have got rid of more of them. What I'm doing is just removing the
strings of colons and replacing them with a blank >< pair, an empty
text string, rather than editing the whole span in front of it.
sed 's/>:*</></g' n2479.html > n2479sed.html
sed 's/>[: ]*</></g' n2479.html > n2479sed.html
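The difference between the two expressions can be seen on a few made-up sample strings. This is a sketch using Python's re module with the same two patterns; the sample strings are invented for illustration and are not taken from n2479.html.

```python
import re

samples = [
    "<span>::::::::::</span>",   # unbroken run of colons
    "<span>:: ::: :</span>",     # run of colons broken up by spaces
    "<span>underlining</span>",  # real text, which must survive
    "<i>a</i> <i>b</i>",         # a lone space between two tags
]

colons_only = re.compile(r">:*<")          # same pattern as the first sed line
colons_or_spaces = re.compile(r">[: ]*<")  # same pattern as the second sed line

first = [colons_only.sub("><", s) for s in samples]
second = [colons_or_spaces.sub("><", s) for s in samples]

for before, a, b in zip(samples, first, second):
    print(f"{before!r:32} {a!r:32} {b!r}")
```

The first pattern leaves the ">:: ::: :<" case alone, and the second one empties it. One side effect worth knowing about: the second pattern also deletes a lone space sitting between two tags, as in the last sample, so adjacent words in separate spans could end up joined.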
That's as far as I got.
No good HTML-to-text function has shown up yet.
I'd like to preserve some of the positioning so the
file is human-readable.
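One possible approach, sketched here with Python's standard html.parser, is to read the top/left values from each absolutely positioned <p> and use them to place the text on a character grid. Everything in it is an assumption for illustration: the class and function names, the 5pt-per-column scale, and the idea that blocks with the same "top" value belong on one line. It has only been tried on a small sample shaped like the excerpt above, not on the full file.

```python
import re
from html.parser import HTMLParser

class PositionedText(HTMLParser):
    """Collect (top, left, text) for each absolutely positioned <p>."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished (top, left, text) tuples
        self._pos = None   # (top, left) of the <p> currently being read
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # the sample HTML does not always close <p>
            style = dict(attrs).get("style") or ""
            top = re.search(r"top:([\d.]+)pt", style)
            left = re.search(r"left:([\d.]+)pt", style)
            if top and left:
                self._pos = (float(top.group(1)), float(left.group(1)))

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._pos is not None:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()

    def _flush(self):
        if self._pos is not None:
            text = " ".join("".join(self._buf).split())  # collapse whitespace
            if text:
                self.blocks.append((*self._pos, text))
        self._pos, self._buf = None, []

def layout(blocks, pt_per_col=5.0):
    """Place blocks on a text grid: one line per 'top', indent from 'left'."""
    rows = {}
    for top, left, text in blocks:
        rows.setdefault(top, []).append((left, text))
    lines = []
    for top in sorted(rows):
        line = ""
        for left, text in sorted(rows[top]):
            col = int(left / pt_per_col)
            # Pad to the target column, keeping at least one space between blocks.
            line = line.ljust(max(col, len(line) + 1)) if line else " " * col
            line += text
        lines.append(line)
    return "\n".join(lines)

sample = """
<p style="position:absolute;white-space:pre;margin:0;padding:0;top:480pt;left:111pt">
<span style="font-size:9pt">
text that has been deleted and
</span>
<p style="position:absolute;white-space:pre;margin:0;padding:0;top:480pt;left:233pt">
<span style="color:#0000ff">
underlining
</span>
<span style="font-size:9pt">
text that has been added.
</span></p>
"""

parser = PositionedText()
parser.feed(sample)
parser.close()
text_out = layout(parser.blocks)
print(text_out)
```

This would be run on the sed-cleaned file, so the colon runs are already gone. It does nothing about the deleted-versus-added text problem.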
The colored text still has to be corrected. The HTML version
did not preserve the strikethru effect, and if the file is
converted to text, both old and new strings will be
included. And not all red text is strikethru text, so
finding red coloring and removing strings likely won't
work right either.
You can copy/paste out of Firefox after using
firefox n2479sed.html
That should be workable for small samples.
Paul