At <http://docplayer.org/12394012-An-halterwvesen-und-an-haltergefah-ren.html>
It turns out that the option to download the PDF as Word on the above site doesn't work (I gave up after Ms PacMan was still biting after nearly an hour), but the text in the PDF is selectable. It contains plenty of spelling errors, but those are easy to correct when looking at the PDF.
The current version can be found at <https://prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.html> and it's far from final. However, there are some items I would like to have suggestions on:
1) Font
Do I go for monospace, like the original report, or do I use something more(?) friendly on the eyes?
2) Footnotes
Obviously they don't make sense in HTML, so I'm thinking about using <details><summary> </summary> </details> tags to place them in-line, probably/possibly with underlining (on hover) of the "xx)" text.
3) Tables
They don't cut & paste, so I'll have to convert them, and here I've hit a snag. I can code myself around it, but it's ugly.
Explanation: If you look at the tables in the PDF, the first is on page 14 (26 in the PDF). It has a double outside border and a single inside one, but most cells don't have top or bottom borders.
And how do you create the inverted "L" shaped tables that are on PDF pages 83 and 117, to name just two?
Obviously I will ***not*** rotate any tables!
4) Images?
Not yet there, first one on PDF page 71 (just cut & paste) or the graphs on PDF page 106, where SVG would seem to be the logical option (having also converted many of the original PNGs in the Sylvain Viard document to that format).
Robert Prins wrote:
> The current version can be found at
> <https://prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.html> and
> it's far from final.
Starting from a PDF document in order to produce an HTML version is rather awkward, but in this case perhaps necessary (assuming you really want to create an HTML version). Making the officials find you another version might take a very long time and may well fail.
Doing the conversion basically by hand, just extracting the content as plain text and adding markup, is probably what I would do, too. There are various ways to try to automate the process, but there are many problems, and even if it were somehow successful, you would probably still need to do a lot of manual fixes (or program tuning).
> however, there are some items I would like to have suggestions on:
> 1) Font
> Do I go for monospace, like the original report, or do I use something more(?)
> friendly on the eyes?
This depends on your goals and limitations. If you are just creating a “facsimile” reproduction of the PDF document in HTML, you would try to preserve its visual appearance as far as possible. But how far could you go then, and what would then be the point of the whole process?
Since the use of a monospace font is probably a tradition from the typewriter era and since it makes the text more difficult to read, the simplest approach is to omit all font settings, letting each browser use its default font. Alternatively, set a reasonable proportional font.
Using justified text does not work well unless you use some hyphenation. German text has so many long words that, especially in a narrow viewing area, the appearance becomes poor. You might consider some hyphenation (like manually added &shy; in the longest compound words) even if you keep using justification.
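As an illustration of the hyphenation suggestion, here is a minimal sketch. It assumes the text is marked as German with lang="de"; the class name and the sample sentence are made up for this example. The "hyphens: auto" declaration is an addition not mentioned above, and it only has an effect in browsers that ship a hyphenation dictionary for the declared language.

```html
<!DOCTYPE html>
<html lang="de">
<head>
<meta charset="utf-8">
<title>Hyphenation sketch</title>
<style>
  /* Hypothetical class name, chosen for this example only. */
  .report p {
    text-align: justify;
    /* Optional: automatic hyphenation where the browser has a
       dictionary for the declared language. Manually placed &shy;
       break points are still honoured. */
    hyphens: auto;
  }
</style>
</head>
<body class="report">
<!-- &shy; marks a permitted break point; nothing is displayed
     unless the word is actually broken there. -->
<p>Das Anhalter&shy;wesen und die Anhalter&shy;gefahren werden hier in
einem schmalen Anzeige&shy;bereich umbrochen.</p>
</body>
</html>
```

Narrowing the browser window should show the long compounds breaking at the marked positions instead of leaving wide gaps in the justified lines.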
> 2) Footnotes
> Obviously they don't make sense in html, so I'm thinking about using
> <details><summary> </summary> </details> tags to place them in-line,
> probably/possibly underlining (on hover) of the "xx)" text.
You would run into the problem that <details> is a block element.
The simplest approach is probably to put the footnotes in a separate file and make the footnote references links to elements in that file. You might consider embedding that file in the main document with <iframe>, but you can do that later. (Well, an even simpler approach is to make the references links to elements at the end of the document, where you would put the footnote texts. But then you would probably need to have back-references, like on Wikipedia pages, so that after following a link to a footnote, the user can easily get back to the place where the reference is.)
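The end-of-document variant with back-references could look roughly like the sketch below. The id values ("ref-1", "note-1") and the placeholder texts are assumptions for illustration, not taken from the report.

```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Footnote sketch</title>
</head>
<body>
<!-- The reference in the running text links down to the note. -->
<p>Some statement from the report<sup><a id="ref-1" href="#note-1">1)</a></sup>
continues here.</p>

<!-- Footnote texts collected at the end of the document. Each note
     links back to the place where it was referenced. -->
<ol>
  <li id="note-1">Text of the footnote.
    <a href="#ref-1" title="Back to the text">&#8617;</a></li>
</ol>
</body>
</html>
```

For the separate-file variant, only the href would change, for example to a hypothetical "footnotes.html#note-1".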
> 3) Tables
> Don't cut & paste, so I'll have to convert them and here I've hit a snag, I
> can code myself around it, but it's ugly.
> Explanation: If you look at the tables in the PDF, the first is on page 14 (26
> in the PDF), it has a double outside border and a single inside one, but most
> cells don't have top or bottom borders.
I’m not sure I see what the problem is. Do you think you need to replicate such use of borders, instead of simply having a table with the correct data and a suitable rendering? Anyway, if you don’t want to have double borders between cells, set border-collapse: collapse on the table element.
> And how do you create the inverted "L" shaped tables that are on PDF pages 83
> and 117, to name just two?
I’m not sure what the structure there is. Just two different tables touching each other? But you can make them a single table by using empty cells (and making sure they don’t show: empty-cells: hide).
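A minimal sketch of the empty-cells idea follows, with a made-up class name and placeholder data; the real tables on PDF pages 83 and 117 may be shaped differently. Note that empty-cells only applies when borders are separate, so this sketch does not combine it with border-collapse: collapse.

```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>L-shaped table sketch</title>
<style>
  /* Hypothetical class name. empty-cells has an effect only in the
     separated borders model, hence border-collapse: separate. */
  table.l-shape {
    border-collapse: separate;
    border-spacing: 0;
    empty-cells: hide;
  }
  table.l-shape th,
  table.l-shape td {
    border: 1px solid black;
    padding: 0.25em;
  }
</style>
</head>
<body>
<table class="l-shape">
  <tr><th>A</th><th>B</th><th>C</th><th>D</th></tr>
  <tr><td>1</td><td>2</td><td>3</td><td>4</td></tr>
  <!-- The lower rows are shorter: the trailing cells are left
       empty, so their borders and backgrounds are not drawn. -->
  <tr><td>5</td><td>6</td><td></td><td></td></tr>
  <tr><td>7</td><td>8</td><td></td><td></td></tr>
</table>
</body>
</html>
```

With border-spacing: 0 the borders of neighbouring cells sit side by side, so inner lines appear twice as thick as the outer edge; that would need tuning if a closer match to the original is wanted.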
> 4) Images? Not yet there, first one on PDF page 71 (Just Cut & Paste) or the
> graphs on PDF page 106, where SVG would seem to be the logical option, having
> also converted many of the original PNG's in the Sylvain Viard document to
> that format)
Depends on the images of course, but normally PNG should be sufficient for tables that you find in administrative documents.
On 2021-01-19 08:43, Jukka K. Korpela wrote:
> Robert Prins wrote:
>> The current version can be found at
>> <https://prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.html>
>> and it's far from final.
> Starting from a PDF document in order to produce an HTML version is
> rather awkward, but in this case perhaps necessary (assuming you really
> want to create an HTML version). Making the officials find you another
> version might take a very long time and may well fail.
They've already told me that there is no Wordstar/WP/Word/etc. version available. There may be one at the university, but as you wrote, it may take ages to find it.
On Tue, 19 Jan 2021, Robert Prins wrote:
> They've already told me that there is no Wordstar/WP/Word/etc. version
> available. There may be one at the university, but as you wrote, it may take
> ages to find it.
Well, for a text from the 1980s, this may not be easy. I wonder whether the content is still interesting today, but that is not my problem.
If the purpose is to get a machine-readable text with fewer errors than the OCR-produced PDF has, that would be fine. You may consider an OCR-only tool like "tesseract" instead; I prefer that, but I am not sure whether this is just snake oil.
If the purpose is to get a formatted text with headlines and paragraphs, forget it. Whenever I had to translate a Word document into HTML, I first extracted the plain text and then added the markup. This is much less work than removing the Word-specific markup, which is there to ensure that the outcome looks exactly like the Word document that was the source. Moreover, you save 80 or 90 % of the markup, and the HTML text is correct HTML and human-readable.
>> Doing the conversion basically by hand, just extracting the content as plain
>> text and adding markup, is probably what I would do, too. There are various
>> ways to try to automate the process, but there are many problems and even if
>> it were somehow successful, you would probably still need to do manual fixes
>> (or program tuning) a lot.
There is a handy middle way: Make sure that headlines and paragraphs are at the right places in the plain-text file, and use a tool like "markdown" to actually insert HTML tags. This model is how Wikipedia works for the authors.
On 2021-01-19 12:04, Helmut Richter wrote:
> There is a handy middle way: Make sure that headlines and paragraphs are at
> the right places in the plain-text file, and use a tool like "markdown" to
> actually insert HTML tags. This model is how Wikipedia works for the authors.
Never heard of it, but even that might be overkill given the simplicity of the HTML.
Robert Prins wrote:
> At
> <http://docplayer.org/12394012-An-halterwvesen-und-an-haltergefah-ren.html>
> It turns out that the option to download the PDF as Word on the above site
> doesn't work (I gave up after Ms PacMan was still biting after nearly an
> hour),
WFM.
> but the text in the PDF is selectable, although with plenty of
> spelling errors, but those are easy to correct when looking at the PDF.
As PDF is based on PostScript, there are tools like ps2txt (alias for ps2ascii(1) which is an alias for gs(1), the GhostScript binary) which can extract text from PDF documents automatically. It appears to work quite well with the downloaded PDF document, in case you are still unable to download the Word document.
There are also tools called “pdf2html”. One is an npm package and requires a JRE [1], but there are others, both command-line tools and Web sites. Just google it.
> 1) Font
> Do I go for monospace, like the original report, or do I use something more(?)
> friendly on the eyes?
That depends on the degree to which you want to preserve the original document. If you are not doing this for archiving purposes, I suggest declaring a list of sans-serif variable-width font families instead, with the more preferable font family in front and ending the list with the generic “sans-serif”. A possible list that can be recommended is
  body {
    font-family: Verdana, Geneva, Arial, Helvetica, sans-serif;
  }
(YMMV. For example, typographers would probably frown at me for including “Arial” there, or because I put it before “Helvetica”.)
If you are not into typography, or do not have the time to educate yourself about it, simply declare only “sans-serif”.
You might need to set the font-family for some descendant elements as well. (Implementations are inconsistent.)
> 2) Footnotes
> Obviously they don't make sense in html,
They do, just not as page-end notes as, contrary to popular belief, there are no “_HTML_ pages”. They could be footnotes in the table footer, section-end notes, or text-end notes.
> so I'm thinking about using
> <details><summary> </summary> </details> tags to place them in-line,
> probably/possibly underlining (on hover) of the "xx)" text.
I do not think this is the correct HTML markup for footnotes. See also:
<https://developer.mozilla.org/en-US/docs/Web/HTML/Element/details>
Footnotes as small linked superscript text are working for me. I would suggest inspecting Wikipedia for how footnotes should be done (BTDT). You can also combine that with my Accessible Pure CSS Tooltips (license is GPLv3) that I am using on <http://PointedEars.de/es-matrix>.
> Don't cut & paste, so I'll have to convert them and here I've hit a snag,
> I can code myself around it, but it's ugly.
The problem may be solved now that you can download the Word document. However, if you cannot, then you may be able to make your life a little easier by changing the text (if still necessary) to the following (CSV) format (without indentation):
  td_content;td_content;td_content …
  td_content;td_content;td_content …
Then you can first apply the replacement
  ; → </td><td>
and then (e.g. using regular expressions)
  ^ (start of line) → <tr><td>
  $ (end of line) → </td></tr>
(Use another delimiter if it is obvious that the delimiter occurs in the data.)
Then surround all rows with
  <table>
and
  </table>
after which you can make adjustments like <td> → <th>, rowspan, colspan and accessibility attributes.
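To make the replacement steps concrete, here is what they would produce for two input lines. The cell contents are placeholders, not data from the report, and ";" is assumed not to occur inside any cell.

```html
<!-- Input lines, with ";" as the delimiter:
       Heading A;Heading B;Heading C
       a1;b1;c1
     Below is the result after replacing ";" with "</td><td>",
     the start of each line with "<tr><td>", the end of each line
     with "</td></tr>", and wrapping all rows in a table element. -->
<table>
<tr><td>Heading A</td><td>Heading B</td><td>Heading C</td></tr>
<tr><td>a1</td><td>b1</td><td>c1</td></tr>
</table>
```

The manual adjustments then start from this raw markup, for instance turning the cells of the first row into th elements.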
I also remember having seen a tool that can do this conversion from text rows to HTML tables automatically, but I do not remember its name and the circumstances.
> Explanation: If you look at the tables in the PDF, the first is on page 14
> (26 in the PDF), it has a double outside border and a single inside one,
> but most cells don't have top or bottom borders.
Although it may look old-fashioned, the latter is actually how *simple* *data* tables SHOULD be done. For example, it is a standing recommendation for LaTeX tables in scientific works: Only draw horizontal lines (“\hline” or “\midrule”) to separate *groups* of rows. (In HTML this can be achieved with a “thead” and one or more “tbody” elements.)
That the original table style may not be suitable for the Web does not mean that copy-and-paste is necessarily a bad idea. In my PDF reader “Okular” (version 1.3.2) at least, only the text from that table is copied then.
Once you have the text in the cells using proper table markup, the borders can be easily styled with CSS. For example, something like
  table { border-collapse: collapse; border: 2px double black; }
  thead tr { border-bottom: 2px solid black; }
  th, td { padding: 0.25em; border-right: 2px solid black; }
would come closest to the original table style. (Whether you want to do that depends on how much you want to preserve the original.)
I would put the table footnotes in the “tfoot” element (BTDT).
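A rough sketch of the row-group structure described above, with horizontal rules only between groups and the table footnote in “tfoot”. The class name and all cell contents are placeholders chosen for this example.

```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Row group sketch</title>
<style>
  /* Hypothetical class name. Borders on row groups are honoured
     only in the collapsing border model. */
  table.data { border-collapse: collapse; }
  table.data th,
  table.data td { padding: 0.25em 0.5em; }
  /* One horizontal rule below each group of rows, none inside. */
  table.data thead,
  table.data tbody,
  table.data tfoot { border-bottom: 1px solid black; }
</style>
</head>
<body>
<table class="data">
  <thead>
    <tr><th>Group</th><th>Value</th></tr>
  </thead>
  <tbody>
    <tr><td>a1</td><td>1</td></tr>
    <tr><td>a2</td><td>2</td></tr>
  </tbody>
  <tbody>
    <tr><td>b1</td><td>3 <sup>1)</sup></td></tr>
  </tbody>
  <tfoot>
    <tr><td colspan="2"><sup>1)</sup> Text of the table footnote.</td></tr>
  </tfoot>
</table>
</body>
</html>
```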
> And how do you create the inverted "L" shaped tables that are on PDF pages
> 83 and 117, to name just two?
In the case of the table on page 83 of the PDF document, simply omit the last 4 table cells in each row, or add empty cells but style them so that they are not visible.
> Obviously I will ***not*** rotate any tables!
I do not see the need for any rotation in the first place :)
> 4) Images? Not yet there, first one on PDF page 71 (Just Cut & Paste) or
> the graphs on PDF page 106, where SVG would seem to be the logical option,
Unless you want to do some fancy visualization, if you only want to link to further information about the area of the map, a simple image map (“map” and “img” element) will suffice (and will be most backwards-compatible). Since the map contours only have to be approximate, this will be a lot easier to do than to recreate the map exactly with SVG (unless you have an image editor that can convert bitmaps to SVG easily – let me know which one, then).
Otherwise only extract the image using e.g. The GIMP or ImageMagick convert(1), and add an “img” element.
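For the image map variant, a minimal sketch might look as follows. The file name, dimensions, coordinates and link targets are all invented for illustration; real coordinates would have to be read off the extracted image.

```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Image map sketch</title>
</head>
<body>
<!-- usemap refers to the name of the map element below. -->
<img src="map.png" alt="Map from the report" width="400" height="300"
     usemap="#areas">
<map name="areas">
  <!-- rect: left, top, right, bottom (in image pixels) -->
  <area shape="rect" coords="0,0,200,300"
        href="#area-west" alt="Western area">
  <!-- poly: x,y pairs of an approximate outline -->
  <area shape="poly" coords="200,0,400,0,400,300,200,300"
        href="#area-east" alt="Eastern area">
</map>
</body>
</html>
```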
On Fri, 22 Jan 2021, Robert Prins wrote:
> <http://docplayer.org/12394012-An-halterwvesen-und-an-haltergefah-ren.html>
> It turns out that the option to download the PDF as Word on the above site
> doesn't work (I gave up after Ms PacMan was still biting after nearly an
> hour),
Have you ever published a URL of the original PDF version (i.e. the one that is, relative to the other versions, the most original)?
Without it, one can hardly say anything about the quality of other versions produced from it.
On 2021-01-23 10:38, Helmut Richter wrote:
> Have you ever published an URL of the original (i.e. the relatively to other
> versions most original) PDF version?
> Without it, one can hardly say anything about the quality of other
> versions produced from it.
There's another copy of the same PDF around on the site of a HH friend @ <http://www.franknature.nl/anhalterwesen.pdf>, but those are as far as I know the only two, other than that it also shows up on Google on sites with, to say the least, "dodgy" names that I wouldn't touch with a bargepole.
There also seem to be some paper copies around, just Google the title.
The .RTF version can be (temporarily) found @ <https://prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.rtf>. It renders OK in Word (Word 2002), and very badly in LO Writer, and contains a horrible amount of "images", scan artifacts.
Robert
On 23/01/2021 14:24, Robert Prins wrote:
> There's another copy of the same PDF around on the site of a HH friend @ <http://www.franknature.nl/anhalterwesen.pdf>, but those are as far as I know the only two, other than that it also shows up on Google on sites with, to say the least, "dodgy" names that I wouldn't touch with a bargepole.
> There also seem to be some paper copies around, just Google the title.
> The .RTF version can be (temporarily) found @ <https://prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.rtf>. It renders OK in Word (Word 2002), and very badly in LO Writer, and contains a horrible amount of "images", scan artifacts.
> Robert
Can you not just embed the file in your HTML like so?
<https://technical.mytechsite.gq/docs/test.html>
For some reason, I missed this post a few days ago, not good...
[…]
I tried to read it with tesseract, and the outcome looks good at first
sight.