Forum: >>> Magnum BBS <<<

Converting a scanned PDF to html

From Robert Prins@21:1/5 to All on Sun Jan 17 19:14:31 2021

XPost: comp.infosystems.www.authoring.stylesheets

At <http://docplayer.org/12394012-An-halterwvesen-und-an-haltergefah-ren.html> there's a report by the German Bundeskriminalamt about hitchhiking. Access is, to say the least and sticking to German, not very "Anwenderfreundlich"; the text version is readable but lacks the tables.

As I link to this page from one of the pages on my site <https://prino.neocities.org/sylvain_viard/sylvain_viard.html>, and as I don't have anything better to do right now, thanks to those little spikey balls floating around, I decided that it might be useful to convert the PDF to html.

The BKA has given me permission to do so:

<quote>
Dear Mr. Prins,

thank you for contacting the German Federal Criminal Police Office (Bundeskriminalamt - BKA).

I can happily inform you, that you have the permission from the “Bundeskriminalamt” to create a html-version out of the pdf-version of the book
and use it on your website.

Unfortunately we do not have a copy that we can send you. [RP: I had asked if they might still have a copy lying around in pre-PDF format]

I hope I could help you.

Kind regards

by order

Dimitrakis

________________________
Bundeskriminalamt
Internet: https://www.bka.de
</quote>

It turns out that the option to download the PDF as Word on the above site doesn't work (I gave up after Ms PacMan was still biting after nearly an hour), but the text in the PDF is selectable, although with plenty of spelling errors, but those are easy to correct when looking at the PDF.

The current version can be found at <https://prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.html> and it's far from final. I'm doing a basic conversion of the text (even inserting <h3> tags with the page numbers), and at the rate I'm going, I might convert the whole PDF in two or three weeks.

However, there are some items I would like to have suggestions on:

1) Font

Do I go for monospace, like the original report, or do I use something more(?) friendly on the eyes?

2) Footnotes

Obviously they don't make sense in html, so I'm thinking about using <details><summary> </summary></details> tags to place them in-line, probably/possibly underlining (on hover) of the "xx)" text.

3) Tables

The tables don't cut & paste, so I'll have to convert them, and here I've hit a snag: I can code my way around it, but it's ugly.

Explanation: If you look at the tables in the PDF, the first is on page 14 (26 in the PDF), it has a double outside border and a single inside one, but most cells don't have top or bottom borders.

I've tried removing them with in-<td> styles, to no avail, so for table 3 (on page 16, PDF page 27) I've hacked my way around it by putting all per-column items in a single cell, separated by <BR> tags. You can find the original and hacked copies by doing a find on "Tab. 3: Straftaten durch Anhalter und an Anhaltern".

It works, except of course the inside borders are still double, but I can live with that, but as I wrote, it's ugly. Is there a better way?

And how do you create the inverted "L" shaped tables that are on PDF pages 83 and 117, to name just two?

Obviously I will ***not*** rotate any tables!

4) Images? Not there yet; the first one is on PDF page 71 (just Cut & Paste), and then there are the graphs on PDF page 106, where SVG would seem to be the logical option, having also converted many of the original PNGs in the Sylvain Viard document to that format.

Those are the questions for now, looking forward to your suggestions,

Thanks,

Robert
--
Robert AH Prins
robert(a)prino(d)org
The hitchhiking grandfather - https://prino.neocities.org/indez.html
Some REXX code for use on z/OS - https://prino.neocities.org/zOS/zOS-Tools.html

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas 'PointedEars' Lahn@21:1/5 to Robert Prins on Mon Jan 18 00:37:51 2021

XPost: comp.infosystems.www.authoring.stylesheets

Robert Prins wrote:

At <http://docplayer.org/12394012-An-halterwvesen-und-an-haltergefah-ren.html>

It turns out that the option to download the PDF as Word on the above site doesn't work (I gave up after Ms PacMan was still biting after nearly an hour),

WFM.

but the text in the PDF is selectable, although with plenty of
spelling errors, but those are easy to correct when looking at the PDF.

As PDF is based on PostScript, there are tools like ps2txt (alias for ps2ascii(1) which is an alias for gs(1), the GhostScript binary) which can extract text from PDF documents automatically. It appears to work quite
well with the downloaded PDF document, in case you are still unable to
download the Word document.

There are also tools called “pdf2html”. One is an npm package and requires a JRE [1], but there are others, both command-line tools and Web sites.
Just google it.

[1] <https://www.npmjs.com/package/pdf2html>

1) Font

Do I go for monospace, like the original report, or do I use something more(?) friendly on the eyes?

That depends on to which degree you want to preserve the original document.

If you are not doing this for archiving purposes, I suggest declaring a
list of sans-serif variable-width font families instead, with the more preferable font family in front and ending the list with the generic “sans-serif”. A possible list that can be recommended is

  body {
    font-family: Verdana, Geneva, Arial, Helvetica, sans-serif;
  }

(YMMV. For example, typographers would probably frown at me for including “Arial” there, or because I put it before “Helvetica”.)

If you are not into typography, or do not have the time to educate yourself about it, simply declare only “sans-serif”.

You might need to set the font-family for some descendant elements as well. (Implementations are inconsistent.)
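
As a minimal sketch of that last point: form controls are a common case where browsers apply their own default font instead of the inherited one. The selector list below is an illustrative assumption, not something taken from the page under discussion.

```html
<style>
  /* Font stack from the suggestion above, ending in the generic family. */
  body {
    font-family: Verdana, Geneva, Arial, Helvetica, sans-serif;
  }

  /* Form controls usually get a browser default font rather than the
     inherited one; "inherit" makes them follow the body declaration. */
  button, input, select, textarea {
    font-family: inherit;
  }
</style>
```

Other elements can be added to the second rule if a particular browser turns out not to follow the body font for them.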

2) Footnotes

Obviously they don't make sense in html,

They do, just not as page-end notes as, contrary to popular belief, there
are no “_HTML_ pages”. They could be footnotes in the table footer, section-end notes, or text-end notes.

so I'm thinking about using
<details><summary> </summary></details> tags to place them in-line, probably/possibly underlining (on hover) of the "xx)" text.

I do not think this is the correct HTML markup for footnotes. See also:

<https://developer.mozilla.org/en-US/docs/Web/HTML/Element/details>

Footnotes as small linked superscript text are working for me. I would
suggest to inspect Wikipedia for how footnotes should be done (BTDT). You
can also combine that with my Accessible Pure CSS Tooltips (license is
GPLv3) that I am using on <http://PointedEars.de/es-matrix>.

3) Tables

Don't cut & paste, so I'll have to convert them and here I've hit a snag,
I can code myself around it, but it's ugly.

The problem may be solved now that you can download the Word document.
However, if you cannot, then you may be able to make your life a little easier
by changing the text (if still necessary) to the following (CSV) format (without indentation):

td_content;td_content;td_content …
td_content;td_content;td_content …

Then you can first apply the replacement

; → </td><td>

and then (e.g. using regular expressions)

^ (start of line) → <tr><td>
$ (end of line) → </td></tr>

(Use another delimiter if it is obvious that the delimiter occurs in the
data.)

Then surround all rows with

<table>

and

</table>

after which you can make adjustments like <td> → <th>, rowspan, colspan and accessibility attributes.
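
To illustrate the steps above, this is roughly what two rows look like once the replacements have been applied, using the placeholder cell text from the example. It is only a sketch of the intermediate result, before any <th>, rowspan or colspan adjustments.

```html
<!-- Input, one table row per line:
       td_content;td_content;td_content
       td_content;td_content;td_content
     Result after replacing each ";" with "</td><td>", wrapping every
     line in "<tr><td>" and "</td></tr>", and adding the table tags: -->
<table>
<tr><td>td_content</td><td>td_content</td><td>td_content</td></tr>
<tr><td>td_content</td><td>td_content</td><td>td_content</td></tr>
</table>
```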

I also remember having seen a tool that can do this conversion from text
rows to HTML tables automatically, but I do not remember its name and the circumstances.

Explanation: If you look at the tables in the PDF, the first is on page 14 (26 in the PDF), it has a double outside border and a single inside one,
but most cells don't have top or bottom borders.

Although it may look old-fashioned, the latter is actually how *simple*
*data* tables SHOULD be done. For example, it is a standing recommendation
for LaTeX tables in scientific works: Only draw horizontal lines (“\hline” or “\midrule”) to separate *groups* of rows. (In HTML this can be achieved with a “thead” and one or more “tbody” elements.)

That the original table style may not be suitable for the Web does not mean that copy-and-paste is necessarily a bad idea. In my PDF reader “Okular” (version 1.3.2) at least, only the text from that table is copied then.
Once you have the text in the cells using proper table markup, the borders
can be easily styled with CSS. For example, something like

  table { border-collapse: collapse; border: 3px double black; }
  thead tr { border-bottom: 2px solid black; }
  th, td { padding: 0.25em; border-right: 2px solid black; }

would come closest to the original table style. (Whether you want to do
that depends on how much you want to preserve the original.)

I would put the table footnotes in the “tfoot” element (BTDT).
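
A minimal sketch of how these pieces could fit together: vertical rules on the cells, horizontal rules only between row groups, and the table footnote in “tfoot”. The class name “report”, the column headings and the cell contents are placeholders, not taken from the BKA tables.

```html
<style>
  /* Collapsed borders are needed for borders on row groups to render. */
  table.report { border-collapse: collapse; border: 3px double black; }

  /* Vertical rules only: cells get left/right borders, no top/bottom ones. */
  table.report th,
  table.report td {
    padding: 0.25em;
    border-left: 1px solid black;
    border-right: 1px solid black;
  }

  /* Horizontal rules only between groups of rows. */
  table.report thead,
  table.report tbody { border-bottom: 1px solid black; }
</style>

<table class="report">
  <thead>
    <tr><th>Column A</th><th>Column B</th></tr>
  </thead>
  <tbody>
    <tr><td>row 1</td><td>value</td></tr>
    <tr><td>row 2</td><td>value</td></tr>
  </tbody>
  <tbody>
    <tr><td>row 3</td><td>value<sup>1)</sup></td></tr>
  </tbody>
  <tfoot>
    <tr><td colspan="2">1) Table footnote text goes here.</td></tr>
  </tfoot>
</table>
```

Because the rules hang off the row groups rather than individual cells, no per-cell inline styles are needed to suppress top and bottom borders.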

And how do you create the inverted "L" shaped tables that are on PDF pages
83 and 117, to name just two?

In the case of the table on page 83 of the PDF document, simply omit the
last 4 table cells in each row, or add empty cells but style them so that
they are not visible.

Obviously I will ***not*** rotate any tables!

I do not see the need for any rotation in the first place :)

4) Images? Not yet there, first one on PDF page 71 (Just Cut & Paste) or
the graphs on PDF page 106, where SVG would seem to the logical option,

Unless you want to do some fancy visualization, if you only want to link to further information about the area of the map, a simple image map (“map” and
“img” element) will suffice (and will be most backwards-compatible). Since the map contours only have to be approximate, this will be a lot easier to
do than to recreate the map exactly with SVG (unless you have an image
editor that can convert bitmaps to SVG easily – let me know which one,
then).

Otherwise only extract the image using e.g. The GIMP or ImageMagick
convert(1), and add an “img” element.
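
For the image map variant, a sketch could look like the following. The file name, image size, coordinates, link targets and alternative texts are all made-up placeholders; only the “img”/“map”/“area” structure is the point.

```html
<!-- File name, dimensions, coordinates and link targets are placeholders. -->
<img src="map-page-71.png" alt="Map from page 71 of the report"
     width="600" height="400" usemap="#area-map">

<map name="area-map">
  <!-- A rough polygon is enough; coords are x,y pairs in image pixels. -->
  <area shape="poly" coords="120,80,260,60,300,180,150,220"
        href="#region-a" alt="Region A">
  <!-- For rect, coords are left,top,right,bottom. -->
  <area shape="rect" coords="320,200,480,340"
        href="#region-b" alt="Region B">
</map>
```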

--
PointedEars
<https://github.com/PointedEars> | <http://PointedEars.de/wsvn/>
Twitter: @PointedEars2
Please do not cc me. /Bitte keine Kopien per E-Mail.


From Jukka K. Korpela@21:1/5 to Robert Prins on Tue Jan 19 10:43:57 2021

XPost: comp.infosystems.www.authoring.stylesheets

Robert Prins wrote:

The current version can be found at <https://prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.html> and it's far from final.

Starting from a PDF document in order to produce an HTML version is
rather awkward, but in this case perhaps necessary (assuming you really
want to create an HTML version). Making the officials find you another
version might take a very long time and may well fail.

Doing the conversion basically by hand, just extracting the content as
plain text and adding markup, is probably what I would do, too. There
are various ways to try to automate the process, but there are many
problems and even if it were somehow successful, you would probably
still need to do manual fixes (or program tuning) a lot.

however, there are some items I would like to have suggestions on:

1) Font

Do I go for monospace, like the original report, or do I use something
more(?) friendly on the eyes?

This depends on your goals and limitations. If you are just creating a “facsimile” reproduction of the PDF document in HTML, you would try to preserve its visual appearance as far as possible. But how far could you
go then, and what would then be the point of the whole process?

Since the use of a monospace font is probably a tradition from the
typewriter era and since it makes the text more difficult to read, the
simplest approach is to omit all font settings, letting each browser use
its default font. Alternatively, set a reasonable proportional font.

Using justified text does not work well unless you use some hyphenation.
German text has so many long words that especially in a narrow viewing
area, the appearance becomes poor. You might consider some hyphenation
(like manually added &shy; in the longest compound words) even if you keep
using justification.
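
A small sketch of manual hyphenation in justified text, using the report's title words as the example. The class name and the column width are assumptions chosen for illustration.

```html
<style>
  /* Justified text in a narrow column. "manual" is the initial value of
     hyphens, so words break only at explicit soft hyphens. */
  p.report-text {
    max-width: 30em;
    text-align: justify;
    hyphens: manual;
  }
</style>

<!-- &shy; marks a permitted break point; it stays invisible unless the
     line is actually broken there. -->
<p class="report-text" lang="de">
  Anhalter&shy;wesen und Anhalter&shy;gefahren
</p>
```

Setting “hyphens: auto” instead would leave the break points to the browser, but that depends on the “lang” attribute being set and on the browser having hyphenation rules for German, so results may vary.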

2) Footnotes

Obviously they don't make sense in html, so I'm thinking about using <details><summary> </summary></details> tags to place them in-line, probably/possibly underlining (on hover) of the "xx)" text.

You would run into the problem that <details> is a block element.

The simplest approach is probably to put the footnotes in a separate
file and make the footnote references links to elements in that file.
You might consider embedding that file in the main document with
<iframe>, but you can do that later. (Well, an even simpler approach is
to make the references links to elements at the end of the document,
where you would put the footnote texts. But then you would probably need
to have back-references, like on Wikipedia pages, so that after
following a link to a footnote, the user can easily get back to the place
where the reference is.)
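
The end-of-document variant with back-references might be sketched like this. The id values and the “endnotes” class are hypothetical names; any unique ids would do.

```html
<p>
  Text of the report with a footnote
  reference<sup><a id="ref-1" href="#note-1">1)</a></sup>
  continuing after it.
</p>

<!-- At the end of the document: -->
<ol class="endnotes">
  <li id="note-1">
    Text of the footnote.
    <!-- Back-reference to the place where the note is cited. -->
    <a href="#ref-1" aria-label="Back to reference 1">&#8617;</a>
  </li>
</ol>
```

If the notes are moved to a separate file later, only the href values of the references need to gain the file name in front of the fragment.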

3) Tables

Don't cut & paste, so I'll have to convert them and here I've hit a
snag, I can code myself around it, but it's ugly.

Explanation: If you look at the tables in the PDF, the first is on page
14 (26 in the PDF), it has a double outside border and a single inside
one, but most cells don't have top or bottom borders.

I’m not sure I see what the problem is. Do you think you need to
replicate such use of borders, instead of simply having a table with the correct data and a suitable rendering? Anyway, if you don’t want to have double borders between cells, set border-collapse: collapse on the table element.

And how do you create the inverted "L" shaped tables that are on PDF
pages 83 and 117, to name just two?

I’m not sure what the structure there is. Just two different tables
touching each other? But you can make them a single table by using empty
cells (and making sure they don’t show: empty-cells: hide).
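
A sketch of that approach for an inverted-L shape, with made-up cell contents and a hypothetical class name. Note that empty-cells only has an effect in the separated borders model, so this example does not use border-collapse: collapse.

```html
<style>
  /* empty-cells applies only when borders are separated. */
  table.lshape {
    border-collapse: separate;
    border-spacing: 0;
    empty-cells: hide;
  }

  /* Borders go on the cells; a border on the table itself would still
     outline the full rectangle, including the hidden corner. */
  table.lshape th,
  table.lshape td { border: 1px solid black; padding: 0.25em; }
</style>

<table class="lshape">
  <tr><th>A</th><th>B</th><th>C</th><th>D</th></tr>
  <tr><td>1</td><td>2</td><td>3</td><td>4</td></tr>
  <tr><td>5</td><td>6</td><td></td><td></td></tr>
  <tr><td>7</td><td>8</td><td></td><td></td></tr>
</table>
```

The cells to be hidden must be really empty; a cell containing a no-break space counts as having content and keeps its border. With a spacing of 0, the borders of neighbouring cells sit side by side, so inner lines appear twice as thick as the outer ones.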

4) Images? Not yet there, first one on PDF page 71 (Just Cut & Paste) or
the graphs on PDF page 106, where SVG would seem to the logical option, having also converted many of the the original PNG's in the Sylvain
Viard document to that format)

Depends on the images of course, but normally PNG should be sufficient
for tables that one finds in administrative documents.


From Robert Prins@21:1/5 to Jukka K. Korpela on Tue Jan 19 11:51:18 2021

XPost: comp.infosystems.www.authoring.stylesheets

On 2021-01-19 08:43, Jukka K. Korpela wrote:

Robert Prins wrote:

The current version can be found at
<https://prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.html> and
it's far from final.

Starting from a PDF document in order to produce an HTML version is rather awkward, but in this case perhaps necessary (assuming you really want to create
an HTML version). Making the officials find you another version might take a very long time and may well fail.

They've already told me that there is no Wordstar/WP/Word/etc version available.
There may be one at the university, but as you wrote, it may take ages to find it.

Doing the conversion basically by hand, just extracting the content as plain text and adding markup, is probably what I would do, too. There are various ways
to try to automate the process, but there are many problems and even if it were
somehow successful, you would probably still need to do manual fixes (or program
tuning) a lot.

The PDF text is selectable, albeit with a not insignificant number of errors, but given that as a Dutchman I've had German at school, it's easy to (proof)read and spot obvious errors.

however, there are some items I would like to have suggestions on:

1) Font

Do I go for monospace, like the original report, or do I use something more(?) friendly on the eyes?

This depends on your goals and limitations. If you are just creating a “facsimile” reproduction of the PDF document in HTML, you would try to preserve
its visual appearance as far as possible. But how far could you go then, and what would then be the point of the whole process?

The points of the html conversion are:

1) Smaller size (but who cares nowadays, when actual visible page content might be as little as 1 or 2% of the page size). The 269-page PDF is 12.5 MB; I'm now at PDF page 63, and my html page is just 135 kB!

2) Access, the PDF is just a big file without any means of moving around to specific chapters.

Since the use of a monospace font is probably a tradition from the typewriter era
and since it makes the text more difficult to read, the simplest approach is to
omit all font settings, letting each browser use its default font. Alternatively, set a reasonable proportional font.

I've removed the monospace, no clue what font it now uses, but it most definitely looks better.

Using justified text does not work well unless you use some hyphenation. German
text has so many long words that especially in a narrow viewing area, the appearance becomes poor. You might consider some hyphenation (like manually added &shy; in the longest compound words) even if you keep using justification.

There are quite a few soft-hyphenated words in the text, I might add some soft hyphens in the longest words. What's your opinion on the current text-width (700px)? <file:///D:/01-lift/02-prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.html>
For me it seems to be a reasonable compromise.

2) Footnotes

Obviously they don't make sense in html, so I'm thinking about using
<details><summary> </summary></details> tags to place them in-line,
probably/possibly underlining (on hover) of the "xx)" text.

You would run into the problem that <details> is a block element.

As an earlier post here has already explained.

The simplest approach is probably to put the footnotes in a separate file and make the footnote references links to elements in that file. You might consider
embedding that file in the main document with <iframe>, but you can do that later. (Well, an even simpler approach is to make the references links to elements at the end of the document, where you would put the footnote texts. But
then you would probably need to have back-references, like on Wikipedia pages,
so that after following a link to a footnote, the user can easily get back to place where the reference is.)

I might go for a conversion to end-notes, with a link back.

And I've come across one site, which I didn't bookmark, that suggested that footnotes might be the one thing that tooltips are useful for, although accessibility of tooltips is not their strong point, to put it mildly.

3) Tables

Don't cut & paste, so I'll have to convert them and here I've hit a snag, I can code myself around it, but it's ugly.

Explanation: If you look at the tables in the PDF, the first is on page 14 (26
in the PDF), it has a double outside border and a single inside one, but most
cells don't have top or bottom borders.

I’m not sure I see what the problem is. Do you think you need to replicate such
use of borders, instead of simply having a table with the correct data and a suitable rendering? Anyway, if you don’t want to have double borders between
cells, set border-collapse: collapse on the table element.

I'll probably stick with what I have now, adding border-collapse to individual cells is just too much work. But how do you completely hide (top/bottom only) borders on individual cells? My approach of using <br> and putting everything into one cell works, although it's not very nice.

And how do you create the inverted "L" shaped tables that are on PDF pages 83
and 117, to name just two?

I’m not sure what the structure there is. Just two different tables touching
each other? But you can make them a single table by using empty cells (and making sure they don’t show: empty-cells: hide).

That would do the trick, didn't know about it.

4) Images? Not yet there, first one on PDF page 71 (Just Cut & Paste) or the graphs on PDF page 106, where SVG would seem to be the logical option, having also converted many of the original PNG's in the Sylvain Viard document to
that format)

Depends on the images of course, but normally PNG should be sufficient for tables that find in administrative documents.

Once I get to them, currently only on page 63 of the PDF, I'll make a decision, I'll probably go for PNG's first, but the plans of lines of public transport on the final pages should be (fairly) easy to convert to SVG. (Then again the PNG's
might actually be smaller...)

--
Robert AH Prins
robert(a)prino(d)org
The hitchhiking grandfather - https://prino.neocities.org/indez.html
Some REXX code for use on z/OS - https://prino.neocities.org/zOS/zOS-Tools.html


From Helmut Richter@21:1/5 to Robert Prins on Tue Jan 19 13:04:34 2021

XPost: comp.infosystems.www.authoring.stylesheets

On Tue, 19 Jan 2021, Robert Prins wrote:

They've already told me that there is no Wordstar/WP/Word/etc version available. There may be one at the university, but as you wrote, it may take ages to find it.

Well, for a text from the 1980s, this may not be easy. I wonder whether
the content is still interesting today but that is not my problem.

If the purpose is to get a machine-readable text with fewer errors than
the OCR-produced PDF has, that would be fine. You may consider an
OCR-only tool like "tesseract" instead; I prefer that but I am not sure
whether this is just snake oil.

If the purpose is to get a formatted text with headlines and paragraphs, forget it. Whenever I had to translate a Word document into HTML, I first extracted the plain text and then added the markup. This is much less work
than removing the Word-specific markup, which is there to ensure that the outcome looks exactly like the Word document that was the source. Moreover, you save 80 or 90 % of the markup, and the HTML text is correct HTML and human-readable.

Doing the conversion basically by hand, just extracting the content as plain
text and adding markup, is probably what I would do, too. There are various ways to try to automate the process, but there are many problems and even if
it were somehow successful, you would probably still need to do manual fixes
(or program tuning) a lot.

There is a handy middle way: Make sure that headlines and paragraphs are at
the right places in the plain-text file, and use a tool like "markdown" to actually insert HTML tags. This model is how Wikipedia works for the authors.

--
Helmut Richter


From Helmut Richter@21:1/5 to Robert Prins on Tue Jan 19 18:01:48 2021

XPost: comp.infosystems.www.authoring.stylesheets

On Tue, 19 Jan 2021, Robert Prins wrote:

On 2021-01-19 12:04, Helmut Richter wrote:
On Tue, 19 Jan 2021, Robert Prins wrote:

There is a handy middle way: Make sure that headlines and paragraphs are at the right places in the plain-text file, and use a tool like "markdown" to actually insert HTML tags. This model is how Wikipedia works for the authors.

Never heard of it, but even that might be overkill given the simplicity of the
html.

It does only as much as you would program yourself in some script language
for the same purpose. When I learnt about it, I already had such a script
for myself with much the same simple interface.

https://nl.wikipedia.org/wiki/Markdown

--
Helmut Richter


From Robert Prins@21:1/5 to Helmut Richter on Tue Jan 19 18:35:40 2021

XPost: comp.infosystems.www.authoring.stylesheets

On 2021-01-19 12:04, Helmut Richter wrote:
On Tue, 19 Jan 2021, Robert Prins wrote:

They've already told me that there is no Wordstar/WP/Word/etc version
available. There may be one at the university, but as you wrote, it may take
ages to find it.

Well, for a text from the 1980ies, this may not be easy. I wonder whether the content is still interesting today but that is not my problem.

It's one of the very few studies about the dangers of hitchhiking, and that makes it "kind of interesting".

If the purpose is to get a machine-readable text with less errors than
what has PDF produced by OCR, that would be fine. You may consider a OCR-only tool like "tesseract" instead; I prefer that but I am not sure whether this is just snake oil.

If the purpose is to get a formatted text with headlines and paragraphes, forget it. Whenever I had to translate a Word document into HTML, I have first
extracted the plain text and then added the markup. This is much less work than removing the Word-specific markup which is to ensure that the outcome looks exactly like the Word document that was the source. Moreover, you save 80
or 90 % of the markup, and the HTML text is correct HTML and human readable.

I've been given a converted-to-Word version, which is as good as useless, as it contains all the typos and, worse, the scan artifacts as images. So I just Cut & Paste one page of the PDF at a time, remove all the spelling errors (I hope), add basic html, i.e. <h2/3/4>, <p> and <sup> notes for the footnotes, slap a 16-character <hr> before any footnotes, and a full <hr> at the end of the page. A hitchhiking friend will proofread the thing again, and having the separators makes it a bit easier to see where you are. They will obviously be removed from the final version, and the footnotes will become end-notes, with back-links.

Tables take a bit more time, but all in all I process about 4-8 pages per hour, which is good enough. I haven't got much else to do right now: it's way too cold
in Vilnius to go onto the balcony and continue my sanding and painting work, and
there are sadly too many police checkpoints to go out and hitchhike. (And yes, even with Covid-19 on the rampage, people still stop for hitchhikers.)

Doing the conversion basically by hand, just extracting the content as plain
text and adding markup, is probably what I would do, too. There are various
ways to try to automate the process, but there are many problems and even if
it were somehow successful, you would probably still need to do manual fixes
(or program tuning) a lot.

On my PC I have a (Pascal) program that converts the output of my main statistics processing program into RTF, and on z/OS I've got a 5,000+ line REXX edit macro to do the same, and keeping those working while adding tables here, there, and everywhere is more than enough. Writing code for what's in essence a one-off task makes no sense, just convert it and be done with it!

There is a handy middle way: Make sure that headlines and paragraphs are at the right places in the plain-text file, and use a tool like "markdown" to actually insert HTML tags. This model is how Wikipedia works for the authors.

Never heard of it, but even that might be overkill given the simplicity of the html.

For the rest, thank you for your comments!

Robert
--
Robert AH Prins
robert(a)prino(d)org
The hitchhiking grandfather @ https://prino.neocities.org/
Some useful(?) REXX @ https://prino.neocities.org/zOS/zOS-Tools.html


From Robert Prins@21:1/5 to Jukka K. Korpela on Wed Jan 20 23:23:23 2021

XPost: comp.infosystems.www.authoring.stylesheets

On 2021-01-19 08:43, Jukka K. Korpela wrote:

Robert Prins wrote:

The current version can be found at
<https://prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.html> and
it's far from final.

<snip>

3) Tables

Don't cut & paste, so I'll have to convert them and here I've hit a snag, I can code myself around it, but it's ugly.

Explanation: If you look at the tables in the PDF, the first is on page 14 (26
in the PDF), it has a double outside border and a single inside one, but most
cells don't have top or bottom borders.

I’m not sure I see what the problem is. Do you think you need to replicate such
use of borders, instead of simply having a table with the correct data and a suitable rendering? Anyway, if you don’t want to have double borders between
cells, set border-collapse: collapse on the table element.

Just for "fun", I've been fiddling with the tables to see if I can get the same format as in the PDF, and while doing so, I found out that my "whole-of-site" "style.css" is just not very useful, so I cut it down to the basics that I need for this conversion.

I've managed to get one table to look like it "should" look, but in the process I've lost the outside border on all 'class="pdftab"' tables, and even Firebug'ing between the converted PDF and my <https://prino.neocities.org/sandbox.html> I have been unable to get the border around the second table, and I would really appreciate it if someone could explain what I'm missing.

Thanks,

Robert

PS: And yes, the in-line styling on the <tr> tags still needs to go to CSS.
--
Robert AH Prins
robert(a)prino(d)org
The hitchhiking grandfather - https://prino.neocities.org/indez.html
Some REXX code for use on z/OS - https://prino.neocities.org/zOS/zOS-Tools.html


From Thomas 'PointedEars' Lahn@21:1/5 to Robert Prins on Thu Jan 21 23:40:51 2021

XPost: comp.infosystems.www.authoring.stylesheets

Robert Prins wrote:

On 2021-01-19 08:43, Jukka K. Korpela wrote:

Robert Prins wrote:

The current version can be found at

<https://prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.html>

and it's far from final.

Starting from a PDF document in order to produce an HTML version is
rather awkward, but in this case perhaps necessary (assuming you really
want to create an HTML version). Making the officials find you another
version might take a very long time and may well fail.

They've already told me that there is no Wordstar/WP/Word/etc version available.

As I told you already, I downloaded it from the very source that you
specified. So it certainly *is* available, even if it is the result of a conversion.

--
PointedEars
FAQ: <http://PointedEars.de/faq> | <http://PointedEars.de/es-matrix> <https://github.com/PointedEars> | <http://PointedEars.de/wsvn/>
Twitter: @PointedEars2 | Please do not cc me./Bitte keine Kopien per E-Mail.


From Thomas 'PointedEars' Lahn@21:1/5 to Jukka K. Korpela on Thu Jan 21 23:37:10 2021

XPost: comp.infosystems.www.authoring.stylesheets

Jukka K. Korpela wrote:

Robert Prins wrote:

The current version can be found at
<https://prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.html>
and it's far from final.

Starting from a PDF document in order to produce an HTML version is
rather awkward, but in this case perhaps necessary (assuming you really
want to create an HTML version).

It is actually a common task in the industry as people (especially public offices) can easily produce PDF documents by scanning sheets of hardcopies
or with a word processor, but often still do not have the manpower or
technical skills to produce clean HTML (documents) for use on a Web site.
So it is good for an aspiring Web developer to know how to do that.

I for one was tasked about a year ago with converting PDF documents,
produced by the Swiss Federal Office of Public Health, to HTML, so that the information provided by them would be accessible. It basically still looks
the same as it did when I was finished:

<https://www.priminfo.admin.ch/de/zahlen-und-fakten>

(You can see there that some newer documents have not been converted to HTML yet.)

F’up2 comp.infosystems.www.authoring.html

--
PointedEars
FAQ: <http://PointedEars.de/faq> | <http://PointedEars.de/es-matrix> <https://github.com/PointedEars> | <http://PointedEars.de/wsvn/>
Twitter: @PointedEars2 | Please do not cc me./Bitte keine Kopien per E-Mail.


From Robert Prins@21:1/5 to Thomas 'PointedEars' Lahn on Fri Jan 22 22:39:40 2021

XPost: comp.infosystems.www.authoring.stylesheets

For some reason, I missed this post a few days ago, not good...

On 2021-01-17 23:37, Thomas 'PointedEars' Lahn wrote:

Robert Prins wrote:

At
<http://docplayer.org/12394012-An-halterwvesen-und-an-haltergefah-ren.html>

It turns out that the option to download the PDF as Word on the above site
doesn't work (I gave up after Ms PacMan was still biting after nearly an
hour),

WFM.

but the text in the PDF is selectable, although with plenty of
spelling errors, but those are easy to correct when looking at the PDF.

As PDF is based on PostScript, there are tools like ps2txt (alias for ps2ascii(1) which is an alias for gs(1), the GhostScript binary) which can extract text from PDF documents automatically. It appears to work quite
well with the downloaded PDF document, in case you are still unable to download the Word document.

There are also tools called “pdf2html”. One is an npm package and requires
a JRE [1], but there are others, both command-line tools and Web sites.
Just google it.

I've actually used the one on the Adobe site, and it's given me a 13Mb .RTF file, which is just as useful as the PDF, both allow me to cut&paste text, with the same errors that need fixing. "Zusammenhang" is a recurrent problem.

1) Font

Do I go for monospace, like the original report, or do I go for something more(?) friendly on the eyes?

That depends on to which degree you want to preserve the original document.

Nobody's going to print it, but if they want to, there's the PDF...

If you are not doing this for archiving purposes, I suggest declaring a
list of sans-serif variable-width font families instead, with the more preferable font family in front and ending the list with the generic “sans-serif”. A possible list that can be recommended is

body {
font-family: Verdana, Geneva, Arial, Helvetica, sans-serif;
}

(YMMV. For example, typographers would probably frown at me for including “Arial” there, or because I put it before “Helvetica”.)

If you are not into typography, or do not have the time to educate yourself about it, simply declare only “sans-serif”.

You might need to set the font-family for some descendant elements as well. (Implementations are inconsistent.)

I'm using "Georgia,serif", like everywhere else in my site. Verdana is for me one of those fonts that immediately provokes a "yuck" reaction. Tables use "Courier New",monospace

2) Footnotes

Obviously they don't make sense in html,

They do, just not as page-end notes as, contrary to popular belief, there
are no “_HTML_ pages”. They could be footnotes in the table footer, section-end notes, or text-end notes.

Section-end notes would be a nice compromise, section would for me be a Kapitel.

so I'm thinking about using
<details><summary> </summary> </details> tags to place them in-line,
probably/possibly underlining (on hover) of the "xx)" text.

I do not think this is the correct HTML markup for footnotes. See also:

<https://developer.mozilla.org/en-US/docs/Web/HTML/Element/details>

Footnotes as small linked superscript text are working for me. I would suggest inspecting Wikipedia for how footnotes should be done (BTDT). You can also combine that with my Accessible Pure CSS Tooltips (license is
GPLv3) that I am using on <http://PointedEars.de/es-matrix>.
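
A plain linked footnote, without any tooltip, needs only two anchors that point at each other: one in the running text and one in the notes list at the end of the section. A minimal sketch in Python that generates such markup follows; the function names and the "ref-N"/"note-N" id scheme are assumptions made up for illustration, not what Wikipedia or the es-matrix page actually use.

```python
import html


def footnote_ref(number):
    """Superscript link placed in the running text, pointing at the note."""
    # "ref-N" and "note-N" are hypothetical ids chosen for this sketch.
    return (f'<sup id="ref-{number}">'
            f'<a href="#note-{number}">{number})</a></sup>')


def footnote_entry(number, text):
    """List item for the notes section, with a link back to the text."""
    return (f'<li id="note-{number}">{html.escape(text)} '
            f'<a href="#ref-{number}">&#8617;</a></li>')


if __name__ == "__main__":
    print(footnote_ref(1))
    print(footnote_entry(1, "Text of the first note"))
```

The entries would go into an ordered list at the end of each section, so that the back link returns the reader to the spot where the note was referenced.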

Wikipedia footnotes together with the accessible tooltips are magic, I've looked
at them before, but I just couldn't figure out how they work, so, at least for now, I'll go for a simple link to a "Notes section" at the section-end, with a link back from there. Any suggestion on how you'd call this section in German?

3) Tables

Don't cut & paste, so I'll have to convert them and here I've hit a snag,
I can code myself around it, but it's ugly.

The problem may be solved now that you can download the Word document. However, if you cannot, then you may be able to make your life a little easier by changing the text (if still necessary) to the following (CSV) format (without indentation):

td_content;td_content;td_content …
td_content;td_content;td_content …

Then you can first apply the replacement

; → </td><td>

and then (e.g. using regular expressions)

^ (start of line) → <tr><td>
$ (end of line) → </td></tr>

(Use another delimiter if it is obvious that the delimiter occurs in the data.)

Then surround all rows with

<table>

and

</table>

after which you can make adjustments like <td> → <th>, rowspan, colspan and accessibility attributes.
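
The replacement steps above can also be scripted. A minimal Python sketch, assuming semicolon-delimited rows and no delimiter inside the data; the function name rows_to_table is made up for illustration, and unlike the manual search-and-replace it also escapes "<", ">" and "&" in the cell text.

```python
import html


def rows_to_table(text, delimiter=";"):
    """Turn delimiter-separated text rows into a plain HTML table."""
    lines = ["<table>"]
    for row in text.splitlines():
        if not row.strip():
            continue  # skip blank lines between rows
        # Escape each cell so stray markup characters cannot break the table.
        cells = [html.escape(cell.strip()) for cell in row.split(delimiter)]
        lines.append("<tr><td>" + "</td><td>".join(cells) + "</td></tr>")
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder cell contents, not data from the report.
    print(rows_to_table("a;b;c\nd;e;f"))
```

The output contains only "td" cells, so header cells, rowspan, colspan and accessibility attributes still have to be added by hand afterwards.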

I also remember having seen a tool that can do this conversion from text
rows to HTML tables automatically, but I do not remember its name and the circumstances.

I probably could create something in REXX, recently wrote something that can add
"profiling" code to my Pascal programs in Regina REXX and it takes just 7 seconds from pressing Enter to get the final output, processing just under 80,000 lines of Pascal, compiling the modified code, and running the nine programs.

As someone who's worked on IBM mainframes since 1985, I'm not very much into all
of the Windows/Unix/Linux tools, I know the basics of "grep" and "sed", but that's about it!

Explanation: If you look at the tables in the PDF, the first is on page 14 (26 in the PDF), it has a double outside border and a single inside one,
but most cells don't have top or bottom borders.

Although it may look old-fashioned, the latter is actually how *simple* *data* tables SHOULD be done. For example, it is a standing recommendation for LaTeX tables in scientific works: Only draw horizontal lines (“\hline”
or “\midrule”) to separate *groups* of rows. (In HTML this can be achieved
with a “thead” and one or more “tbody” elements.)

That the original table style may not be suitable for the Web does not mean that copy-and-paste is necessarily a bad idea. In my PDF reader “Okular” (version 1.3.2) at least, only the text from that table is copied then.
Once you have the text in the cells using proper table markup, the borders can be easily styled with CSS. For example, something like

table { border-collapse: collapse; border: 2px double black; }
thead tr { border-bottom: 2px solid black; }
th, td { padding: 0.25em; border-right: 2px solid black; }

would come closest to the original table style. (Whether you want to do
that depends on how much you want to preserve the original.)

I've simply created four styles

.b0 {
border-bottom: 0;
}

.t0 {
border-top: 0;
}

.l0 {
border-left: 0;
}

.r0 {
border-right: 0;
}

to remove borders from <td> elements that I don't want, and might combine them later into the likes of .bt0/.br0/.brt0/etc, and the tables I now get are, except for some spacing, carbon copies of the originals. I'm not going to try to
get the exact spacing by adding (even more) &nbsp;'s.

I would put the table footnotes in the “tfoot” element (BTDT).

I'm currently only using "tbody", for now I prefer KISS.

And how do you create the inverted "L" shaped tables that are on PDF pages 83 and 117, to name just two?

In the case of the table on page 83 of the PDF document, simply omit the
last 4 table cells in each row, or add empty cells but style them so that they are not visible.

Obviously I will ***not*** rotate any tables!

I do not see the need for any rotation in the first place :)

I know, web-pages have an infinite width. :)

4) Images? Not yet there, first one on PDF page 71 (Just Cut & Paste) or
the graphs on PDF page 106, where SVG would seem to be the logical option,

Unless you want to do some fancy visualization, if you only want to link to further information about the area of the map, a simple image map (“map” and
“img” element) will suffice (and will be most backwards-compatible). Since
the map contours only have to be approximate, this will be a lot easier to
do than to recreate the map exactly with SVG (unless you have an image
editor that can convert bitmaps to SVG easily – let me know which one, then).

There's something that converts BW images to SVG, <http://potrace.sourceforge.net/>, you've probably heard about it.

At some stage I've tried to convert the .PNG's used in <https://prino.neocities.org/mario_rinvolucri/chapter2.html> to SVG (after rescanning them, I've actually got the book, the original GIF's @ <http://bernd.wechner.info/Hitchhiking/Mario/chapter2.html> are too low-res), but
I seem to remember that the resulting SVG's were bigger than the "PNGOUT" compressed PNG's. For what it's worth Bernd Wechner's webified version of this book uses

"font-family: Arial, Helvetica, sans-serif;"

whereas I don't specify any font, which seems to result in "Times New Roman" with Firefox (and Edge), which I find easier on the eyes.

Otherwise only extract the image using e.g. The GIMP or ImageMagick convert(1), and add an “img” element.

I've actually found an SVG image of Saarland on Wikipedia, and after hacking it into something more compact (Inkscape files contain a hell of a lot of bloat), it's in the current version @ <https://prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.html> (do a find on "abb"). Not sure if the yellow(ish) is the best colour, and the towns were added on a "looks OK" basis. (For what it's worth, the current SVG still contains two groups that are nearly identical, but I've been unable to merge them; doing so would reduce the size even more!)

Robert
--
Robert AH Prins
robert(a)prino(d)org
The hitchhiking grandfather - https://prino.neocities.org/indez.html
Some REXX code for use on z/OS - https://prino.neocities.org/zOS/zOS-Tools.html


From Helmut Richter@21:1/5 to Robert Prins on Sat Jan 23 11:38:24 2021

XPost: comp.infosystems.www.authoring.stylesheets

On Fri, 22 Jan 2021, Robert Prins wrote:

<http://docplayer.org/12394012-An-halterwvesen-und-an-haltergefah-ren.html>

It turns out that the option to download the PDF as Word on the above site
doesn't work (I gave up after Ms PacMan was still biting after nearly an hour),

Have you ever published a URL of the original PDF version (i.e. the most
original of the available versions)?

Without it, one can hardly say anything about the quality of other
versions produced from it.

--
Helmut Richter


From Robert Prins@21:1/5 to Helmut Richter on Sat Jan 23 14:24:20 2021

XPost: comp.infosystems.www.authoring.stylesheets

On 2021-01-23 10:38, Helmut Richter wrote:

On Fri, 22 Jan 2021, Robert Prins wrote:

<http://docplayer.org/12394012-An-halterwvesen-und-an-haltergefah-ren.html>

It turns out that the option to download the PDF as Word on the above site doesn't work (I gave up after Ms PacMan was still biting after nearly an hour),

Have you ever published a URL of the original PDF version (i.e. the most
original of the available versions)?

Without it, one can hardly say anything about the quality of other
versions produced from it.

There's another copy of the same PDF around on the site of a HH friend @ <http://www.franknature.nl/anhalterwesen.pdf>, but those are as far as I know the only two, other than that it also shows up on Google on sites with, to say the least, "dodgy" names that I wouldn't touch with a bargepole.

There also seem to be some paper copies around, just Google the title.

The .RTF version can be (temporarily) found @ <https://prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.rtf>. It renders OK in Word (Word 2002), and very badly in LO Writer, and contains a horrible amount of "images", scan artifacts.

Robert

From Helmut Richter@21:1/5 to Robert Prins on Sat Jan 23 17:21:22 2021

XPost: comp.infosystems.www.authoring.stylesheets

On Sat, 23 Jan 2021, Robert Prins wrote:

On 2021-01-23 10:38, Helmut Richter wrote:

On Fri, 22 Jan 2021, Robert Prins wrote:

<http://docplayer.org/12394012-An-halterwvesen-und-an-haltergefah-ren.html>

It turns out that the option to download the PDF as Word on the above site
doesn't work (I gave up after Ms PacMan was still biting after nearly an
hour),

Have you ever published a URL of the original PDF version (i.e. the most original of the available versions)?

Without it, one can hardly say anything about the quality of other
versions produced from it.

There's another copy of the same PDF around on the site of a HH friend @ <http://www.franknature.nl/anhalterwesen.pdf>, but those are as far as I know

I tried to read it with tesseract, and the outcome looks good at first
sight. Please tell me whether this is of some help for you. Of course, tesseract can only read what it recognises as text, no images or the like.

The main problem was to convert the pdf into a graphics file. I ended up
with 276M TIFF. tesseract took 12 min to make text out of it; the text
in UTF-8 encoding takes 446K and can be found at https://hhr-m.userweb.mwn.de/weblab/anhalterwesen.txt .
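
For anyone wanting to repeat this, the two steps (render the PDF to images, then OCR them) can be sketched in Python as below. This is an assumption-laden sketch, not the procedure used above: it assumes poppler's pdftoppm and tesseract with the German ("deu") language data are installed, it writes one TIFF per page rather than a single large file, and the function names and the 300 dpi resolution are made up for illustration.

```python
import glob
import subprocess


def render_command(pdf_path, prefix, dpi=300):
    """Command line that renders each PDF page to <prefix>-NN.tif."""
    return ["pdftoppm", "-r", str(dpi), "-tiff", pdf_path, prefix]


def ocr_command(image_path, lang="deu"):
    """Command line that makes tesseract write <image base name>.txt."""
    out_base = image_path.rsplit(".", 1)[0]
    return ["tesseract", image_path, out_base, "-l", lang]


def pdf_to_text(pdf_path, prefix="page"):
    """Render the PDF, then OCR every page image in page order."""
    subprocess.run(render_command(pdf_path, prefix), check=True)
    # pdftoppm zero-pads the page numbers, so sorting keeps page order.
    for image in sorted(glob.glob(prefix + "-*.tif")):
        subprocess.run(ocr_command(image), check=True)


if __name__ == "__main__":
    pdf_to_text("anhalterwesen.pdf")
```

The per-page text files still have to be concatenated and proofread; as noted above, tesseract only returns what it recognises as text.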

I have learnt a bit of know-how and a lot of know-how-not.

--
Helmut Richter


From 😉 Good Guy 😉@21:1/5 to Robert Prins on Sun Jan 24 01:58:43 2021

XPost: comp.infosystems.www.authoring.stylesheets

On 23/01/2021 14:24, Robert Prins wrote:

There's another copy of the same PDF around on the site of a HH friend
@ <http://www.franknature.nl/anhalterwesen.pdf>, but those are as far
as I know the only two, other than that it also shows up on Google on
sites with, to say the least, "dodgy" names that I wouldn't touch with
a bargepole.

There also seem to be some paper copies around, just Google the title.

The .RTF version can be (temporarily) found @ <https://prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.rtf>.
It renders OK in Word (Word 2002), and very badly in LO Writer, and
contains a horrible amount of "images", scan artifacts.

Robert

Can you not just embed the file in your HTML like so?

<https://technical.mytechsite.gq/docs/test.html>

<iframe src="anhalterwesen.pdf" width="100%" height="800px"></iframe>

--

With over 1.2 billion devices now running Windows 10, customer
satisfaction is higher than any previous version of windows.


From Robert Prins@21:1/5 to All on Sun Jan 24 16:12:27 2021

XPost: comp.infosystems.www.authoring.stylesheets

On 2021-01-24 01:58, 😉 Good Guy 😉 wrote:

On 23/01/2021 14:24, Robert Prins wrote:

There's another copy of the same PDF around on the site of a HH friend @ <http://www.franknature.nl/anhalterwesen.pdf>, but those are as far as I know the only two, other than that it also shows up on Google on sites with, to say the least, "dodgy" names that I wouldn't touch with a bargepole.

There also seem to be some paper copies around, just Google the title.

The .RTF version can be (temporarily) found @ <https://prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.rtf>. It renders OK in Word (Word 2002), and very badly in LO Writer, and contains a horrible amount of "images", scan artifacts.

Robert

Can you not just embed the file in your HTML like so?

<https://technical.mytechsite.gq/docs/test.html>

Then I might just as well send people to the original. The reasons for the conversion to html are, as mentioned before,

1) Size: the PDF is 12.5 Mb, the final html is likely to be well under 1Mb. I'm currently on page 81/93 of the PDF (of the 197/209 pages of actual contents), and I'm as yet not sure what I'm going to do with the 60 pages containing appendices with the questionnaires. Size right now is a mere 222kb...

2) Accessibility. 'nuff said.

Admittedly, your usual website nowadays carries multi-megabyte behind the scenes
CSS and JS, on my mobile I just deleted another 108(!)Mb of "cookies" left there
by the <http://www.independent.co.uk/>, so accessibility is the main issue, and based on my own experience, quite a bit of the contents of this study is still reasonably relevant!

And it's Covid-19 time, too cold to be out on the balcony sanding and painting doors (only two of 16 left anyway), impossible to hitchhike, although I'm still going to try this week to keep the streak going, so this conversion and updating the PC based copies of my HH programs (about to hit a snag, as some input lines now exceed 255 characters) are useful to keep me busy.

Robert

From Thomas 'PointedEars' Lahn@21:1/5 to Robert Prins on Wed Jan 27 21:29:05 2021

XPost: comp.infosystems.www.authoring.stylesheets

Robert Prins wrote:

For some reason, I missed this post a few days ago, not good...

You’re welcome :->

[…]

--
PointedEars

From Thomas 'PointedEars' Lahn@21:1/5 to Helmut Richter on Fri Jan 29 22:29:19 2021

XPost: comp.infosystems.www.authoring.stylesheets

Helmut Richter wrote:

I tried to read it with tesseract, and the outcome looks good at first
sight.

I did not know it. Thank you for sharing this :)

<https://github.com/tesseract-ocr/tesseract>

--
PointedEars
