Parsing ancient HTML files is something Beautiful Soup is normally
great at. But I've run into a small problem, caused by this sort of
sloppy HTML:
from bs4 import BeautifulSoup
# See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
blob = b"""
<LI>'THERE sinks the nebulous star we call the Sun,
<LI>If that hypothesis of theirs be sound,'
<LI>Said Ida;' let us down and rest:' and we
<LI>Down from the lean and wrinkled precipices,
<LI>By every coppice-feather'd chasm and cleft,
<LI>Dropt thro' the ambrosial gloom to where below
<LI>No bigger than a glow-worm shone the tent
<LI>Lamp-lit from the inner. Once she lean'd on me,
<LI>Descending; once or twice she lent her hand,
<LI>And blissful palpitations in the blood,
<LI>Stirring a sudden transport rose and fell.
</OL>
"""
soup = BeautifulSoup(blob, "html.parser")
print(soup)
On this small snippet, it works acceptably, but puts a large number of
</li> tags immediately before the </ol>. On the original file (see
link if you want to try it), this blows right through the default
recursion limit, due to the crazy number of "nested" list items.
Is there a way to tell BS4 on parse that these <li> elements end at
the next <li>, rather than waiting for the final </ol>? This would
make tidier output, and also eliminate most of the recursion levels.
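(For context on why html.parser nests these: the standard library's event-based parser, which Beautiful Soup builds on here, reports a start event for each <li> but never an end event, so whether the items nest or stay siblings is entirely the tree builder's decision. A minimal stdlib-only sketch of mine, independent of Beautiful Soup:)

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Record raw start/end tag events without building a tree."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

snippet = """
<LI>'THERE sinks the nebulous star we call the Sun,
<LI>If that hypothesis of theirs be sound,'
</OL>
"""

logger = EventLogger()
logger.feed(snippet)
print(logger.events)
# Two "start li" events and one "end ol" -- no "end li" at all, so a
# naive tree builder stacks each <li> inside the previous one.
```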
On 24/10/2022 at 4:29, Chris Angelico wrote:
Using html5lib (install package html5lib) instead of html.parser seems
to do the trick: it inserts </li> right before the next <li>, and one
before the closing </ol>. On my system the same happens when I don't
specify a parser, but IIRC that's a bit fragile because other systems
can choose different parsers if you don't explicitly specify one.
On Mon, 24 Oct 2022 at 18:43, Roel Schroeven <roel@roelschroeven.net>
wrote:
Ah, cool. Thanks. I'm not entirely sure of the various advantages and
disadvantages of the different parsers; is there a tabulation
anywhere, or at least a list of recommendations on choosing a suitable
parser?

There's a bit of information here:
https://beautiful-soup-4.readthedocs.io/en/latest/#installing-a-parser
I'm dealing with a HUGE mess of different coding standards, all the
way from 1990s-level stuff (images for indentation, tables for
formatting, and <FONT FACE="Wingdings">) up through HTML4 (a good few
of the pages have at least some <meta> tags and declare their
encodings, mostly ISO-8859-1 or similar), to fairly modern HTML5.
There are even a couple of pages that use frames - yes, the old style
with a <noframes> block in case the browser can't handle it. I went
with html.parser on the expectation that it'd give the best "across
all standards" results, but I'll give html5lib a try and see if it
does better.
Would rather not try to use different parsers for different files, but
if necessary, I'll figure something out.
(For reference, this is roughly 9000 HTML files that have to be
parsed. Doing things by hand is basically not an option.)
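(Since many of those pages declare their encoding in a <meta> tag, one
batch-friendly trick is to sniff the declared charset from the raw
bytes before decoding; Beautiful Soup's own UnicodeDammit does
something similar, and more thoroughly. A rough sketch of mine - the
regex, the 2048-byte window, and the ISO-8859-1 fallback are my own
assumptions, not anything from the thread:)

```python
import re

# Matches either <meta charset="..."> or the older form
# <meta http-equiv="Content-Type" content="text/html; charset=...">
_CHARSET_RE = re.compile(
    rb'<meta[^>]+charset=["\']?\s*([-\w]+)', re.IGNORECASE)

def sniff_encoding(raw: bytes, default: str = "iso-8859-1") -> str:
    """Best-effort guess at a page's declared encoding."""
    m = _CHARSET_RE.search(raw[:2048])  # declarations live near the top
    if m:
        return m.group(1).decode("ascii", "replace").lower()
    return default

old_page = b'<meta http-equiv="Content-Type" ' \
           b'content="text/html; charset=ISO-8859-1">'
print(sniff_encoding(old_page))                       # iso-8859-1
print(sniff_encoding(b"<html>no declaration</html>")) # fallback
```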
Roel has already noted that the lxml and html5lib parsers do the right
thing, so just for the record:

The HTML fragment above is well-formed and contains a number of li
elements at the same level directly below the ol element, not lots of
nested li elements. The end tag of the li element is optional (except
in XHTML) and li elements don't nest.
On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer <hjp-python@hjp.at> wrote:
That's correct. However, parsing it with html.parser and then
reconstituting it as shown in the example code results in all the
</li> tags coming up right before the </ol>, indicating that the <li>
tags were parsed as deeply nested rather than as siblings.
In order to get a successful parse out of this, I need something which
sees them as siblings, which html5lib seems to be doing fine. Whether
it has other issues, I don't know, but I guess I'll find out....
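(If html5lib ever proves too slow for the batch, the implied-close rule
the standard describes is small enough to hand-roll on top of the
stdlib parser: when a new <li> starts while one is open, close the open
one first. A toy sketch of my own that just collects sibling items - it
is not a Beautiful Soup hook, only an illustration of the rule:)

```python
from html.parser import HTMLParser

class SiblingListParser(HTMLParser):
    """Treat a new <li> as implicitly closing any open <li>."""
    def __init__(self):
        super().__init__()
        self.open_li = False
        self.items = []    # text of each completed list item
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            if self.open_li:
                self._close_li()   # implied </li>, as in the HTML spec
            self.open_li = True

    def handle_endtag(self, tag):
        if tag in ("li", "ol", "ul") and self.open_li:
            self._close_li()

    def handle_data(self, data):
        if self.open_li:
            self._buf.append(data)

    def _close_li(self):
        self.items.append("".join(self._buf).strip())
        self._buf = []
        self.open_li = False

p = SiblingListParser()
p.feed("<OL><LI>one<LI>two<LI>three</OL>")
print(p.items)  # ['one', 'two', 'three']
```

Three flat siblings, no recursion depth to speak of - the open-items
state is a single flag rather than a stack.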
On 2022-10-24 21:56:13 +1100, Chris Angelico wrote:
Yes, I got that. What I wanted to say was that this is indeed a bug in
html.parser and not an error (or sloppiness, as you called it) in the
input or ambiguity in the HTML standard.
which html5lib seems to be doing fine. Whether
it has other issues, I don't know, but I guess I'll find out....
The link somebody posted mentions that it's "very slow". Which may or
may not be a problem when you have to parse 9000 files. But if it does
implement HTML5 correctly, it should parse any file the same as a
modern browser does (maybe excluding quirks mode).
On 2022-10-24, Chris Angelico <rosuav@gmail.com> wrote:
I described the HTML as "sloppy" for a number of reasons, but I was of
the understanding that it's generally recommended to have the closing
tags. Not that it matters much.
Some elements don't need close tags, or even open tags. Unless you're
using XHTML you don't need them and indeed for the case of void tags
(e.g. <br>, <img>) you must not include the close tags.
Adding in the omitted <head>, </head>, <body>, </body>, and </html>
would make no difference and there's no particular reason to recommend
doing so as far as I'm aware.
On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list
<python-list@python.org> wrote:
Some elements don't need close tags, or even open tags. Unless you're
using XHTML you don't need them and indeed for the case of void tags
(e.g. <br>, <img>) you must not include the close tags.
Yep, I'm aware of void tags, but I'm talking about the container tags
- in this case, <li> and <p> - which, in a lot of older HTML pages,
are treated as "separator" tags.
Consider this content:
<HTML>
Hello, world!
<P>
Paragraph 2
<P>
Hey look, a third paragraph!
</HTML>
Stick a doctype onto that and it should be valid HTML5, but as it is,
it's the exact sort of thing that was quite common in the 90s.
Adding in the omitted <head>, </head>, <body>, </body>, and </html>
would make no difference and there's no particular reason to recommend
doing so as far as I'm aware.
And yet most people do it. Why?
Are you saying that it's better to omit them all?
More importantly: Would you omit all the </p> closing tags you can, or
would you include them?
On 2022-10-24, Chris Angelico <rosuav@gmail.com> wrote:
On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list <python-list@python.org> wrote:
Adding in the omitted <head>, </head>, <body>, </body>, and </html>
would make no difference and there's no particular reason to recommend
doing so as far as I'm aware.
And yet most people do it. Why?
They agree with Tim Peters that "Explicit is better than implicit",
I suppose? ;-)
More importantly: Would you omit all the </p> closing tags you can, or
would you include them?

It would depend on how much content was inside them, I guess.
Something like:
<ol>
<li>First item
<li>Second item
<li>Third item
</ol>
is very easy to understand, but if each item was many lines long then it
may be less confusing to explicitly close - not least for indentation purposes.
The <p> tag is not a void tag, but according to the spec, it's legal
to omit the </p> if the element is followed directly by another <p>
element (or any of a specific set of others), or if there is no
further content.
Adding in the omitted <head>, </head>, <body>, </body>, and </html>
would make no difference and there's no particular reason to recommend doing so as far as I'm aware.
And yet most people do it. Why?
Are you saying that it's better to omit them all?
More importantly: Would you omit all the </p> closing tags you can, or
would you include them?
There may be several reasons:
* Historically, some browsers differed in which end tags were actually
optional. Since (AFAIK) no mainstream browser ever implemented a real
SGML parser (they were always "tag soup" parsers with lots of ad-hoc
rules) this sometimes even changed within the same browser depending
on context (e.g. a simple table might work but nested tables wouldn't).
So people started to use end-tags defensively.
* XHTML was for some time popular and it doesn't have any optional tags.
So people got into the habit of always using end tags and writing
empty tags as <XXX />.
* Aesthetics: Always writing the end tags is more consistent and may
look more balanced.
* Cargo-cult: People saw other people do that and copied the habit
without thinking about it.
Are you saying that it's better to omit them all?
If you want to conserve keystrokes :-)
I think it doesn't matter. Both are valid.
More importantly: Would you omit all the </p> closing tags you can, or would you include them?
I usually write them.
I also indent the contents of an element, so I
would write your example as:
<!DOCTYPE html>
<html>
<body>
Hello, world!
<p>
Paragraph 2
</p>
<p>
Hey look, a third paragraph!
</p>
</body>
</html>
(As you can see I would also include the body tags to make that
element explicit. I would normally also add a bit of boilerplate
(especially a head with a charset and viewport definition), but I omit
them here since they would change the parse tree.)
One thing I find quite interesting, though, is the way that browsers
*differ* in the face of bad nesting of tags. Recently I was struggling
to figure out a problem with an HTML form, and eventually found that
there was a spurious <form> tag way up higher in the page. Forms don't
nest, so that's invalid, but different browsers had slightly different
ways of showing it.

Yeah, mismatched form tags can have weird effects. I don't remember
the details but I scratched my head over that one more than once.
On Tue, 25 Oct 2022 at 04:22, Peter J. Holzer <hjp-python@hjp.at> wrote:
Interesting. So which of the above reasons is yours?
On Mon, 24 Oct 2022 at 19:03, Chris Angelico <rosuav@gmail.com> wrote:
Ah, cool. Thanks. I'm not entirely sure of the various advantages and
disadvantages of the different parsers; is there a tabulation
anywhere, or at least a list of recommendations on choosing a suitable
parser?
Coming to this a bit late, but from my experience with BeautifulSoup
and HTML produced by other people ...

lxml is easily the fastest, but also the least forgiving.

html.parser is middling on performance, but as you've seen sometimes
makes mistakes.

html5lib is the slowest, but is most forgiving of malformed input and
edge cases.

I use html5lib - it's fast enough for what I do, and the most likely
to return results matching what the author saw when they maybe tried
it in a single web browser.
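(One way to square "best parser" with "may not be installed everywhere"
is to probe for the preferred tree builders at runtime and fall back to
the stdlib one; Beautiful Soup takes the feature name as its second
argument either way. A small helper of my own, assuming the usual
package names - not anything from the thread:)

```python
from importlib import import_module

def pick_parser(preferred=("html5lib", "lxml")):
    """Return the name of the best installed Beautiful Soup tree
    builder, falling back to the always-available stdlib parser."""
    for name in preferred:
        try:
            import_module(name)   # probe: is the package importable?
            return name
        except ImportError:
            continue
    return "html.parser"

# Then: soup = BeautifulSoup(blob, pick_parser())
print(pick_parser())
```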