• Beautiful Soup - close tags more promptly?

    From Chris Angelico@21:1/5 to All on Mon Oct 24 13:29:13 2022
    Parsing ancient HTML files is something Beautiful Soup is normally
    great at. But I've run into a small problem, caused by this sort of
    sloppy HTML:

    from bs4 import BeautifulSoup
    # See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
    blob = b"""

    <LI>'THERE sinks the nebulous star we call the Sun,
    <LI>If that hypothesis of theirs be sound,'
    <LI>Said Ida;' let us down and rest:' and we
    <LI>Down from the lean and wrinkled precipices,
    <LI>By every coppice-feather'd chasm and cleft,
    <LI>Dropt thro' the ambrosial gloom to where below
    <LI>No bigger than a glow-worm shone the tent
    <LI>Lamp-lit from the inner. Once she lean'd on me,
    <LI>Descending; once or twice she lent her hand,
    <LI>And blissful palpitations in the blood,
    <LI>Stirring a sudden transport rose and fell.
    </OL>
    """
    soup = BeautifulSoup(blob, "html.parser")
    print(soup)


    On this small snippet, it works acceptably, but puts a large number of
    </li> tags immediately before the </ol>. On the original file (see
    link if you want to try it), this blows right through the default
    recursion limit, due to the crazy number of "nested" list items.

    Is there a way to tell BS4 on parse that these <li> elements end at
    the next <li>, rather than waiting for the final </ol>? This would
    make tidier output, and also eliminate most of the recursion levels.
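    (For the curious, the nesting comes from the underlying stdlib parser: html.parser
    is purely event-based and never synthesizes the implied </li>, so a tree builder
    that stacks start tags ends up nesting the items. A minimal sketch of the events
    it reports:)

```python
# Sketch: stdlib html.parser reports tags exactly as written, with no
# implied end tags, so each <li> appears to open inside the previous one.
from html.parser import HTMLParser

class TagTracker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

tracker = TagTracker()
tracker.feed("<ol><li>one<li>two</ol>")
print(tracker.events)
# → [('start', 'ol'), ('start', 'li'), ('start', 'li'), ('end', 'ol')]
```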

    ChrisA

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Roel Schroeven@21:1/5 to All on Mon Oct 24 09:42:13 2022
    Op 24/10/2022 om 4:29 schreef Chris Angelico:
    Parsing ancient HTML files is something Beautiful Soup is normally
    great at. But I've run into a small problem, caused by this sort of
    sloppy HTML:

    from bs4 import BeautifulSoup
    # See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
    blob = b"""

    <LI>'THERE sinks the nebulous star we call the Sun,
    <LI>If that hypothesis of theirs be sound,'
    <LI>Said Ida;' let us down and rest:' and we
    <LI>Down from the lean and wrinkled precipices,
    <LI>By every coppice-feather'd chasm and cleft,
    <LI>Dropt thro' the ambrosial gloom to where below
    <LI>No bigger than a glow-worm shone the tent
    <LI>Lamp-lit from the inner. Once she lean'd on me,
    <LI>Descending; once or twice she lent her hand,
    <LI>And blissful palpitations in the blood,
    <LI>Stirring a sudden transport rose and fell.
    </OL>
    """
    soup = BeautifulSoup(blob, "html.parser")
    print(soup)


    On this small snippet, it works acceptably, but puts a large number of
    </li> tags immediately before the </ol>. On the original file (see
    link if you want to try it), this blows right through the default
    recursion limit, due to the crazy number of "nested" list items.

    Is there a way to tell BS4 on parse that these <li> elements end at
    the next <li>, rather than waiting for the final </ol>? This would
    make tidier output, and also eliminate most of the recursion levels.

    Using html5lib (install package html5lib) instead of html.parser seems
    to do the trick: it inserts </li> right before the next <li>, and one
    before the closing </ol>. On my system the same happens when I don't
    specify a parser, but IIRC that's a bit fragile because other systems
    can choose different parsers if you don't explicitly specify one.
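    A quick check of that behaviour (assuming the third-party beautifulsoup4 and
    html5lib packages are installed):

```python
# Assumes the third-party packages beautifulsoup4 and html5lib are
# installed (pip install beautifulsoup4 html5lib).
from bs4 import BeautifulSoup

blob = "<ol><li>one<li>two<li>three</ol>"
soup = BeautifulSoup(blob, "html5lib")
ol = soup.find("ol")
# html5lib closes each li at the next <li>, so all three are siblings:
items = ol.find_all("li", recursive=False)
print([li.get_text(strip=True) for li in items])
# → ['one', 'two', 'three']
```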

    --
    "I love science, and it pains me to think that to so many are terrified
    of the subject or feel that choosing science means you cannot also
    choose compassion, or the arts, or be awed by nature. Science is not
    meant to cure us of mystery, but to reinvent and reinvigorate it."
    -- Robert Sapolsky

  • From Chris Angelico@21:1/5 to Roel Schroeven on Mon Oct 24 19:02:15 2022
    On Mon, 24 Oct 2022 at 18:43, Roel Schroeven <roel@roelschroeven.net> wrote:

    Op 24/10/2022 om 4:29 schreef Chris Angelico:
    Parsing ancient HTML files is something Beautiful Soup is normally
    great at. But I've run into a small problem, caused by this sort of
    sloppy HTML:

    from bs4 import BeautifulSoup
    # See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
    blob = b"""

    <LI>'THERE sinks the nebulous star we call the Sun,
    <LI>If that hypothesis of theirs be sound,'
    <LI>Said Ida;' let us down and rest:' and we
    <LI>Down from the lean and wrinkled precipices,
    <LI>By every coppice-feather'd chasm and cleft,
    <LI>Dropt thro' the ambrosial gloom to where below
    <LI>No bigger than a glow-worm shone the tent
    <LI>Lamp-lit from the inner. Once she lean'd on me,
    <LI>Descending; once or twice she lent her hand,
    <LI>And blissful palpitations in the blood,
    <LI>Stirring a sudden transport rose and fell.
    </OL>
    """
    soup = BeautifulSoup(blob, "html.parser")
    print(soup)


    On this small snippet, it works acceptably, but puts a large number of </li> tags immediately before the </ol>. On the original file (see
    link if you want to try it), this blows right through the default
    recursion limit, due to the crazy number of "nested" list items.

    Is there a way to tell BS4 on parse that these <li> elements end at
    the next <li>, rather than waiting for the final </ol>? This would
    make tidier output, and also eliminate most of the recursion levels.

    Using html5lib (install package html5lib) instead of html.parser seems
    to do the trick: it inserts </li> right before the next <li>, and one
    before the closing </ol>. On my system the same happens when I don't
    specify a parser, but IIRC that's a bit fragile because other systems
    can choose different parsers if you don't explicitly specify one.


    Ah, cool. Thanks. I'm not entirely sure of the various advantages and disadvantages of the different parsers; is there a tabulation
    anywhere, or at least a list of recommendations on choosing a suitable
    parser?

    I'm dealing with a HUGE mess of different coding standards, all the
    way from 1990s-level stuff (images for indentation, tables for
    formatting, and <FONT FACE="Wingdings">) up through HTML4 (a good few
    of the pages have at least some <meta> tags and declare their
    encodings, mostly ISO-8859-1 or similar), to fairly modern HTML5.
    There's even a couple of pages that use frames - yes, the old style
    with a <noframes> block in case the browser can't handle it. I went
    with html.parser on the expectation that it'd give the best "across
    all standards" results, but I'll give html5lib a try and see if it
    does better.

    Would rather not try to use different parsers for different files, but
    if necessary, I'll figure something out.

    (For reference, this is roughly 9000 HTML files that have to be
    parsed. Doing things by hand is basically not an option.)

    ChrisA

  • From Roel Schroeven@21:1/5 to All on Mon Oct 24 10:09:36 2022
    Op 24/10/2022 om 9:42 schreef Roel Schroeven:
    Using html5lib (install package html5lib) instead of html.parser seems
    to do the trick: it inserts </li> right before the next <li>, and one
    before the closing </ol>. On my system the same happens when I don't
    specify a parser, but IIRC that's a bit fragile because other systems
    can choose different parsers if you don't explicitly specify one.

    Just now I noticed: when I don't specify a parser, BeautifulSoup emits a
    warning naming the parser it selected. In one of my venvs it's html5lib,
    in another it's lxml. Both seem to get a correct result.
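    That warning can also be caught programmatically; a small sketch (assuming the
    third-party beautifulsoup4 package is installed):

```python
# Sketch: capture the warning BeautifulSoup emits when no parser is given
# (assumes the third-party beautifulsoup4 package is installed).
import warnings
from bs4 import BeautifulSoup, GuessedAtParserWarning

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    BeautifulSoup("<p>hello</p>")  # no parser specified
names = [w.category.__name__ for w in caught]
print(names)  # includes 'GuessedAtParserWarning'
```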

    --

    "I love science, and it pains me to think that to so many are terrified
    of the subject or feel that choosing science means you cannot also
    choose compassion, or the arts, or be awed by nature. Science is not
    meant to cure us of mystery, but to reinvent and reinvigorate it."
    -- Robert Sapolsky

  • From Roel Schroeven@21:1/5 to All on Mon Oct 24 10:33:00 2022
    (Oops, accidentally only sent to Chris instead of to the list)

    Op 24/10/2022 om 10:02 schreef Chris Angelico:
    On Mon, 24 Oct 2022 at 18:43, Roel Schroeven <roel@roelschroeven.net>
    wrote:
    Using html5lib (install package html5lib) instead of html.parser seems
    to do the trick: it inserts </li> right before the next <li>, and one
    before the closing </ol>. On my system the same happens when I don't
    specify a parser, but IIRC that's a bit fragile because other systems
    can choose different parsers if you don't explicitly specify one.


    Ah, cool. Thanks. I'm not entirely sure of the various advantages and disadvantages of the different parsers; is there a tabulation
    anywhere, or at least a list of recommendations on choosing a suitable parser?
    There's a bit of information here: https://beautiful-soup-4.readthedocs.io/en/latest/#installing-a-parser
    Not much, but maybe it can be helpful.
    I'm dealing with a HUGE mess of different coding standards, all the
    way from 1990s-level stuff (images for indentation, tables for
    formatting, and <FONT FACE="Wingdings">) up through HTML4 (a good few
    of the pages have at least some <meta> tags and declare their
    encodings, mostly ISO-8859-1 or similar), to fairly modern HTML5.
    There's even a couple of pages that use frames - yes, the old style
    with a <noframes> block in case the browser can't handle it. I went
    with html.parser on the expectation that it'd give the best "across
    all standards" results, but I'll give html5lib a try and see if it
    does better.

    Would rather not try to use different parsers for different files, but
    if necessary, I'll figure something out.

    (For reference, this is roughly 9000 HTML files that have to be
    parsed. Doing things by hand is basically not an option.)

    I'd give lxml a try too. Maybe try to preprocess the HTML using
    html-tidy (https://www.html-tidy.org/), that might actually do a pretty
    good job of getting rid of all kinds of historical inconsistencies.
    Somehow checking if any solution works for thousands of input files will
    always be a pain, I'm afraid.
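    A preprocessing sketch along those lines (the `tidy` command-line flags below are
    assumptions based on the html-tidy docs, and the binary must be on PATH; treat
    this as a starting point, not a drop-in solution):

```python
# Hypothetical preprocessing step: normalize messy HTML with HTML Tidy
# before handing it to BeautifulSoup. Assumes the `tidy` binary from
# https://www.html-tidy.org/ is installed and on PATH.
import subprocess

def tidy_html(raw: bytes) -> bytes:
    # tidy exits 0 on success, 1 with warnings, 2 on errors, so we
    # deliberately avoid check=True and accept warning-level output.
    result = subprocess.run(
        ["tidy", "-utf8", "--quiet", "yes", "--show-warnings", "no"],
        input=raw,
        capture_output=True,
    )
    if result.returncode > 1:  # 2 means tidy hit real errors
        raise RuntimeError(result.stderr.decode("utf-8", "replace"))
    return result.stdout
```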

    --
    "I've come up with a set of rules that describe our reactions to technologies: 1. Anything that is in the world when you’re born is normal and ordinary and is
    just a natural part of the way the world works.
    2. Anything that's invented between when you’re fifteen and thirty-five is new
    and exciting and revolutionary and you can probably get a career in it.
    3. Anything invented after you're thirty-five is against the natural order of things."
    -- Douglas Adams, The Salmon of Doubt

  • From Chris Angelico@21:1/5 to Peter J. Holzer on Mon Oct 24 21:56:13 2022
    On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer <hjp-python@hjp.at> wrote:
    Ron has already noted that the lxml and html5 parser do the right thing,
    so just for the record:

    The HTML fragment above is well-formed and contains a number of li
    elements at the same level directly below the ol element, not lots of
    nested li elements. The end tag of the li element is optional (except in XHTML) and li elements don't nest.

    That's correct. However, parsing it with html.parser and then
    reconstituting it as shown in the example code results in all the
    </li> tags coming up right before the </ol>, indicating that the <li>
    tags were parsed as deeply nested rather than as siblings.

    In order to get a successful parse out of this, I need something which
    sees them as siblings, which html5lib seems to be doing fine. Whether
    it has other issues, I don't know, but I guess I'll find out.... it's
    currently running on the live site and taking several hours (due to
    network delays and the server being slow, so I don't really want to
    parallelize and overload the thing).

    ChrisA

  • From Peter J. Holzer@21:1/5 to Peter J. Holzer on Mon Oct 24 12:34:56 2022
    On 2022-10-24 12:32:11 +0200, Peter J. Holzer wrote:
    Ron has already noted that the lxml and html5 parser do the right thing,
    ^^^
    Oops, sorry. That was Roel.

    hp



    --
    _ | Peter J. Holzer | Story must make more sense than reality.
    |_|_) | |
    | | | hjp@hjp.at | -- Charles Stross, "Creative writing
    __/ | http://www.hjp.at/ | challenge!"

  • From Peter J. Holzer@21:1/5 to Chris Angelico on Mon Oct 24 12:32:11 2022
    On 2022-10-24 13:29:13 +1100, Chris Angelico wrote:
    Parsing ancient HTML files is something Beautiful Soup is normally
    great at. But I've run into a small problem, caused by this sort of
    sloppy HTML:

    from bs4 import BeautifulSoup
    # See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
    blob = b"""

    <LI>'THERE sinks the nebulous star we call the Sun,
    <LI>If that hypothesis of theirs be sound,'
    [...]
    <LI>Stirring a sudden transport rose and fell.
    </OL>
    """
    soup = BeautifulSoup(blob, "html.parser")
    print(soup)


    On this small snippet, it works acceptably, but puts a large number of
    </li> tags immediately before the </ol>.

    Ron has already noted that the lxml and html5 parser do the right thing,
    so just for the record:

    The HTML fragment above is well-formed and contains a number of li
    elements at the same level directly below the ol element, not lots of
    nested li elements. The end tag of the li element is optional (except in
    XHTML) and li elements don't nest.

    hp

    --
    _ | Peter J. Holzer | Story must make more sense than reality.
    |_|_) | |
    | | | hjp@hjp.at | -- Charles Stross, "Creative writing
    __/ | http://www.hjp.at/ | challenge!"

  • From Peter J. Holzer@21:1/5 to Chris Angelico on Mon Oct 24 14:21:34 2022
    On 2022-10-24 21:56:13 +1100, Chris Angelico wrote:
    On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer <hjp-python@hjp.at> wrote:
    Ron has already noted that the lxml and html5 parser do the right thing,
    so just for the record:

    The HTML fragment above is well-formed and contains a number of li
    elements at the same level directly below the ol element, not lots of nested li elements. The end tag of the li element is optional (except in XHTML) and li elements don't nest.

    That's correct. However, parsing it with html.parser and then
    reconstituting it as shown in the example code results in all the
    </li> tags coming up right before the </ol>, indicating that the <li>
    tags were parsed as deeply nested rather than as siblings.

    Yes, I got that. What I wanted to say was that this is indeed a bug in
    html.parser and not an error (or sloppiness, as you called it) in the
    input or ambiguity in the HTML standard.


    In order to get a successful parse out of this, I need something which
    sees them as siblings,

    Right, but Roel (correct name this time) had already posted that lxml
    and html5lib parse this correctly, so I saw no need to belabour that
    point.

    which html5lib seems to be doing fine. Whether
    it has other issues, I don't know, but I guess I'll find out....

    The link somebody posted mentions that it's "very slow". Which may or
    may not be a problem when you have to parse 9000 files. But if it does implement HTML5 correctly, it should parse any file the same as a modern browser does (maybe excluding quirks mode).
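    A rough way to gauge that for a given corpus (assuming beautifulsoup4 and
    html5lib are installed; the absolute numbers are machine-dependent and only the
    relative difference is meaningful):

```python
# Rough, machine-dependent timing sketch comparing html.parser and
# html5lib (assumes beautifulsoup4 and html5lib are installed).
import timeit
from bs4 import BeautifulSoup

blob = "<ol>" + "<li>item" * 200 + "</ol>"
timings = {}
for parser in ("html.parser", "html5lib"):
    timings[parser] = timeit.timeit(
        lambda: BeautifulSoup(blob, parser), number=10)
for parser, seconds in timings.items():
    print(f"{parser}: {seconds:.3f}s")
```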

    hp

    --
    _ | Peter J. Holzer | Story must make more sense than reality.
    |_|_) | |
    | | | hjp@hjp.at | -- Charles Stross, "Creative writing
    __/ | http://www.hjp.at/ | challenge!"

  • From Chris Angelico@21:1/5 to Peter J. Holzer on Tue Oct 25 01:01:19 2022
    On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer <hjp-python@hjp.at> wrote:

    On 2022-10-24 21:56:13 +1100, Chris Angelico wrote:
    On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer <hjp-python@hjp.at> wrote:
    Ron has already noted that the lxml and html5 parser do the right thing, so just for the record:

    The HTML fragment above is well-formed and contains a number of li elements at the same level directly below the ol element, not lots of nested li elements. The end tag of the li element is optional (except in XHTML) and li elements don't nest.

    That's correct. However, parsing it with html.parser and then reconstituting it as shown in the example code results in all the
    </li> tags coming up right before the </ol>, indicating that the <li>
    tags were parsed as deeply nested rather than as siblings.

    Yes, I got that. What I wanted to say was that this is indeed a bug in
    html.parser and not an error (or sloppiness, as you called it) in the
    input or ambiguity in the HTML standard.

    I described the HTML as "sloppy" for a number of reasons, but I was of
    the understanding that it's generally recommended to have the closing
    tags. Not that it matters much.

    which html5lib seems to be doing fine. Whether
    it has other issues, I don't know, but I guess I'll find out....

    The link somebody posted mentions that it's "very slow". Which may or
    may not be a problem when you have to parse 9000 files. But if it does implement HTML5 correctly, it should parse any file the same as a modern browser does (maybe excluding quirks mode).


    Yeah. TBH I think the two-hour run time is primarily dominated by
    network delays, not parsing time, but if I had a service where people
    could upload HTML to be parsed, that might affect throughput.

    For the record, if anyone else is considering html5lib: It is likely
    "fast enough", even if not fast. Give it a try.

    (And I know what slow parsing feels like. Parsing a ~100MB file with a decently-fast grammar-based lexer takes a good while. Parsing the same
    content after it's been converted to JSON? Fast.)

    ChrisA

  • From Chris Angelico@21:1/5 to python-list@python.org on Tue Oct 25 03:09:33 2022
    On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list <python-list@python.org> wrote:

    On 2022-10-24, Chris Angelico <rosuav@gmail.com> wrote:
    On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer <hjp-python@hjp.at> wrote:
    Yes, I got that. What I wanted to say was that this is indeed a bug in
    html.parser and not an error (or sloppiness, as you called it) in the
    input or ambiguity in the HTML standard.

    I described the HTML as "sloppy" for a number of reasons, but I was of
    the understanding that it's generally recommended to have the closing
    tags. Not that it matters much.

    Some elements don't need close tags, or even open tags. Unless you're
    using XHTML you don't need them and indeed for the case of void tags
    (e.g. <br>, <img>) you must not include the close tags.

    Yep, I'm aware of void tags, but I'm talking about the container tags
    - in this case, <li> and <p> - which, in a lot of older HTML pages,
    are treated as "separator" tags. Consider this content:

    <HTML>
    Hello, world!
    <P>
    Paragraph 2
    <P>
    Hey look, a third paragraph!
    </HTML>

    Stick a doctype onto that and it should be valid HTML5, but as it is,
    it's the exact sort of thing that was quite common in the 90s. (I'm
    not sure when lowercase tags became more popular, but in any case (pun intended), that won't affect validity.)

    The <p> tag is not a void tag, but according to the spec, it's legal
    to omit the </p> if the element is followed directly by another <p>
    element (or any of a specific set of others), or if there is no
    further content.

    Adding in the omitted <head>, </head>, <body>, </body>, and </html>
    would make no difference and there's no particular reason to recommend
    doing so as far as I'm aware.

    And yet most people do it. Why? Are you saying that it's better to
    omit them all?

    More importantly: Would you omit all the </p> closing tags you can, or
    would you include them?

    ChrisA

  • From Jon Ribbens@21:1/5 to Chris Angelico on Mon Oct 24 15:34:45 2022
    On 2022-10-24, Chris Angelico <rosuav@gmail.com> wrote:
    On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer <hjp-python@hjp.at> wrote:
    Yes, I got that. What I wanted to say was that this is indeed a bug in
    html.parser and not an error (or sloppiness, as you called it) in the
    input or ambiguity in the HTML standard.

    I described the HTML as "sloppy" for a number of reasons, but I was of
    the understanding that it's generally recommended to have the closing
    tags. Not that it matters much.

    Some elements don't need close tags, or even open tags. Unless you're
    using XHTML you don't need them and indeed for the case of void tags
    (e.g. <br>, <img>) you must not include the close tags.

    A minimal HTML file might look like this:

    <!DOCTYPE html>
    <html lang=en><meta charset=utf-8><title>Minimal HTML file</title>
    <main><h1>Minimal HTML file</h1>This is a minimal HTML file.</main>

    which would be parsed into this:

    <!DOCTYPE html>
    <html lang="en">
    <head>
    <meta charset="utf-8">
    <title>Minimal HTML file</title>
    </head>
    <body>
    <main>
    <h1>Minimal HTML file</h1>
    This is a minimal HTML file.
    </main>
    </body>
    </html>

    Adding in the omitted <head>, </head>, <body>, </body>, and </html>
    would make no difference and there's no particular reason to recommend
    doing so as far as I'm aware.

  • From Jon Ribbens@21:1/5 to Chris Angelico on Mon Oct 24 17:01:00 2022
    On 2022-10-24, Chris Angelico <rosuav@gmail.com> wrote:
    On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list
    <python-list@python.org> wrote:

    On 2022-10-24, Chris Angelico <rosuav@gmail.com> wrote:
    On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer <hjp-python@hjp.at> wrote:
    Yes, I got that. What I wanted to say was that this is indeed a bug in
    html.parser and not an error (or sloppiness, as you called it) in the
    input or ambiguity in the HTML standard.

    I described the HTML as "sloppy" for a number of reasons, but I was of
    the understanding that it's generally recommended to have the closing
    tags. Not that it matters much.

    Some elements don't need close tags, or even open tags. Unless you're
    using XHTML you don't need them and indeed for the case of void tags
    (e.g. <br>, <img>) you must not include the close tags.

    Yep, I'm aware of void tags, but I'm talking about the container tags
    - in this case, <li> and <p> - which, in a lot of older HTML pages,
    are treated as "separator" tags.

    Yes, hence why I went on to talk about container tags.

    Consider this content:

    <HTML>
    Hello, world!
    <P>
    Paragraph 2
    <P>
    Hey look, a third paragraph!
    </HTML>

    Stick a doctype onto that and it should be valid HTML5,

    Nope, it's missing a <title>.

    Adding in the omitted <head>, </head>, <body>, </body>, and </html>
    would make no difference and there's no particular reason to recommend
    doing so as far as I'm aware.

    And yet most people do it. Why?

    They agree with Tim Peters that "Explicit is better than implicit",
    I suppose? ;-)

    Are you saying that it's better to omit them all?

    No, I'm saying neither option is necessarily better than the other.

    More importantly: Would you omit all the </p> closing tags you can, or
    would you include them?

    It would depend on how much content was inside them I guess.
    Something like:

    <ol>
    <li>First item
    <li>Second item
    <li>Third item
    </ol>

    is very easy to understand, but if each item was many lines long then it
    may be less confusing to explicitly close - not least for indentation
    purposes.

  • From Roel Schroeven@21:1/5 to Jon Ribbens via Python-list on Mon Oct 24 20:10:34 2022
    Jon Ribbens via Python-list schreef op 24/10/2022 om 19:01:
    On 2022-10-24, Chris Angelico<rosuav@gmail.com> wrote:
    On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list <python-list@python.org> wrote:
    Adding in the omitted <head>, </head>, <body>, </body>, and </html>
    would make no difference and there's no particular reason to recommend
    doing so as far as I'm aware.

    And yet most people do it. Why?

    They agree with Tim Peters that "Explicit is better than implicit",
    I suppose? ;-)

    I don't write all that much HTML, but when I do, I include those tags
    largely for that reason indeed. We don't write HTML just for the
    browser, we also write it for the web developer. And I think it's easier
    for the web developer when the different sections are clearly
    distinguished, and what better way to do that than to use their tags.

    More importantly: Would you omit all the </p> closing tags you can, or
    would you include them?
    It would depend on how much content was inside them I guess.
    Something like:

    <ol>
    <li>First item
    <li>Second item
    <li>Third item
    </ol>

    is very easy to understand, but if each item was many lines long then it
    may be less confusing to explicitly close - not least for indentation purposes.
    I mostly include closing tags, if for no other reason than that I have
    the impression that editors generally work better (i.e. get things like indentation and syntax highlighting right) that way.

    --
    "Je ne suis pas d’accord avec ce que vous dites, mais je me battrai jusqu’à
    la mort pour que vous ayez le droit de le dire."
    -- Attribué à Voltaire
    "I disapprove of what you say, but I will defend to the death your right to
    say it."
    -- Attributed to Voltaire
    "Ik ben het niet eens met wat je zegt, maar ik zal je recht om het te zeggen tot de dood toe verdedigen"
    -- Toegeschreven aan Voltaire

  • From Peter J. Holzer@21:1/5 to Chris Angelico on Mon Oct 24 19:17:55 2022
    On 2022-10-25 03:09:33 +1100, Chris Angelico wrote:
    On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list <python-list@python.org> wrote:
    On 2022-10-24, Chris Angelico <rosuav@gmail.com> wrote:
    On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer <hjp-python@hjp.at> wrote:
    Yes, I got that. What I wanted to say was that this is indeed a bug in
    html.parser and not an error (or sloppiness, as you called it) in the
    input or ambiguity in the HTML standard.

    I described the HTML as "sloppy" for a number of reasons, but I was of the understanding that it's generally recommended to have the closing tags. Not that it matters much.

    Some elements don't need close tags, or even open tags. Unless you're
    using XHTML you don't need them and indeed for the case of void tags
    (e.g. <br>, <img>) you must not include the close tags.

    Yep, I'm aware of void tags, but I'm talking about the container tags
    - in this case, <li> and <p> - which, in a lot of older HTML pages,
    are treated as "separator" tags. Consider this content:

    <HTML>
    Hello, world!
    <P>
    Paragraph 2
    <P>
    Hey look, a third paragraph!
    </HTML>

    Stick a doctype onto that and it should be valid HTML5, but as it is,
    it's the exact sort of thing that was quite common in the 90s.

    The <p> tag is not a void tag, but according to the spec, it's legal
    to omit the </p> if the element is followed directly by another <p>
    element (or any of a specific set of others), or if there is no
    further content.

    Right. The parser knows the structure of an HTML document, which tags
    are optional and which elements can be inside of which other elements.
    For SGML-based HTML versions (2.0 to 4.01) this is formally described by
    the DTD.

    So when parsing your file, an HTML parser would work like this

    <HTML> - Yup, I expect an HTML element here:
        HTML
    Hello, world! - #PCDATA? Not allowed as a child of HTML. There must
        be a HEAD and a BODY, both of which have optional start tags.
        HEAD can't contain #PCDATA either, so we must be inside of BODY
        and HEAD was empty:
        HTML
        ├─ HEAD
        └─ BODY
           └─ Hello, world!
    <P> - Allowed in BODY, so just add that:
        HTML
        ├─ HEAD
        └─ BODY
           ├─ #PCDATA: Hello, world!
           └─ P
    Paragraph 2 - #PCDATA is allowed in P, so add it as a child:
        HTML
        ├─ HEAD
        └─ BODY
           ├─ #PCDATA: Hello, world!
           └─ P
              └─ #PCDATA: Paragraph 2
    <P> - Not allowed inside of P, so that implicitly closes the
        previous P element and we go up one level:
        HTML
        ├─ HEAD
        └─ BODY
           ├─ #PCDATA: Hello, world!
           ├─ P
           │  └─ #PCDATA: Paragraph 2
           └─ P
    Hey look, a third paragraph! - Same as above:
        HTML
        ├─ HEAD
        └─ BODY
           ├─ #PCDATA: Hello, world!
           ├─ P
           │  └─ #PCDATA: Paragraph 2
           └─ P
              └─ #PCDATA: Hey look, a third paragraph!
    </HTML> - The end tags of P and BODY are optional, so the end of
        HTML closes them implicitly, and we have our final parse tree
        (unchanged from the last step):
        HTML
        ├─ HEAD
        └─ BODY
           ├─ #PCDATA: Hello, world!
           ├─ P
           │  └─ #PCDATA: Paragraph 2
           └─ P
              └─ #PCDATA: Hey look, a third paragraph!

    For a human, the <p> tags might feel like separators here. But
    syntactically they aren't - they start a new element. Note especially
    that "Hello, world!" is not part of a P element but a direct child of
    BODY (which may or may not be intended by the author).
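    You can watch the raw side of this from Python with the stdlib
    tokenizer: html.parser only reports the tags literally present in the
    input; inserting the implied HEAD/BODY and closing the open P elements
    is the tree builder's job. A small sketch (not how any particular
    browser is implemented):

```python
from html.parser import HTMLParser

class ShowEvents(HTMLParser):
    """Record the raw start/end/data events the tokenizer emits."""
    def __init__(self):
        super().__init__()
        self.events = []
    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))
    def handle_endtag(self, tag):
        self.events.append(("end", tag))
    def handle_data(self, data):
        if data.strip():
            self.events.append(("data", data.strip()))

parser = ShowEvents()
parser.feed("<HTML>Hello, world!<P>Paragraph 2"
            "<P>Hey look, a third paragraph!</HTML>")
# No ("end", "p") event ever appears: the tokenizer never invents the
# implied </p> tags - building the tree above is a separate, smarter step.
```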


    Adding in the omitted <head>, </head>, <body>, </body>, and </html>
    would make no difference and there's no particular reason to recommend doing so as far as I'm aware.

    And yet most people do it. Why?

    There may be several reasons:

    * Historically, some browsers differed in which end tags were actually
    optional. Since (AFAIK) no mainstream browser ever implemented a real
    SGML parser (they were always "tag soup" parsers with lots of ad-hoc
    rules) this sometimes even changed within the same browser depending
    on context (e.g. a simple table might work but nested tables wouldn't).
    So people started to use end-tags defensively.
    * XHTML was for some time popular and it doesn't have any optional tags.
    So people got into the habit of always using end tags and writing
    empty tags as <XXX />.
    * Aesthetics: Always writing the end tags is more consistent and may
    look more balanced.
    * Cargo-cult: People saw other people do that and copied the habit
    without thinking about it.


    Are you saying that it's better to omit them all?

    If you want to conserve keystrokes :-)

    I think it doesn't matter. Both are valid.

    More importantly: Would you omit all the </p> closing tags you can, or
    would you include them?

    I usually write them. I also indent the contents of an element, so I
    would write your example as:

    <!DOCTYPE html>
    <html>
      <body>
        Hello, world!
        <p>
          Paragraph 2
        </p>
        <p>
          Hey look, a third paragraph!
        </p>
      </body>
    </html>

    (As you can see I would also include the body tags to make that element explicit. I would normally also add a bit of boilerplate (especially a
    head with a charset and viewport definition), but I omit them here since
    they would change the parse tree)

    hp

    --
       _  | Peter J. Holzer    | Story must make more sense than reality.
    |_|_) |                    |
    | |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
    __/   | http://www.hjp.at/ |       challenge!"

  • From Chris Angelico@21:1/5 to Peter J. Holzer on Tue Oct 25 06:56:58 2022
    On Tue, 25 Oct 2022 at 04:22, Peter J. Holzer <hjp-python@hjp.at> wrote:
    There may be several reasons:

    * Historically, some browsers differed in which end tags were actually
    optional. Since (AFAIK) no mainstream browser ever implemented a real
    SGML parser (they were always "tag soup" parsers with lots of ad-hoc
    rules) this sometimes even changed within the same browser depending
    on context (e.g. a simple table might work but nested tables wouldn't).
    So people started to use end-tags defensively.
    * XHTML was for some time popular and it doesn't have any optional tags.
    So people got into the habit of always using end tags and writing
    empty tags as <XXX />.
    * Aesthetics: Always writing the end tags is more consistent and may
    look more balanced.
    * Cargo-cult: People saw other people do that and copied the habit
    without thinking about it.


    Are you saying that it's better to omit them all?

    If you want to conserve keystrokes :-)

    I think it doesn't matter. Both are valid.

    More importantly: Would you omit all the </p> closing tags you can, or would you include them?

    I usually write them.

    Interesting. So which of the above reasons is yours? Personally, I do
    it for a slightly different reason: Many end tags are *situationally*
    optional, and it's much easier to debug code when you
    change/insert/remove something and nothing changes, than when doing so
    affects the implicit closing tags.

    I also indent the contents of an element, so I
    would write your example as:

    <!DOCTYPE html>
    <html>
      <body>
        Hello, world!
        <p>
          Paragraph 2
        </p>
        <p>
          Hey look, a third paragraph!
        </p>
      </body>
    </html>

    (As you can see I would also include the body tags to make that element explicit. I would normally also add a bit of boilerplate (especially a
    head with a charset and viewport definition), but I omit them here since
    they would change the parse tree)


    Yeah - any REAL page would want quite a bit (very few pages these days
    manage without a style sheet, and it seems that hardly any survive
    without importing a few gigabytes of JavaScript, but that's not
    mandatory), but in ancient pages, there's still a well-defined parse
    structure for every tag sequence.

    One thing I find quite interesting, though, is the way that browsers
    *differ* in the face of bad nesting of tags. Recently I was struggling
    to figure out a problem with an HTML form, and eventually found that
    there was a spurious <form> tag way up higher in the page. Forms don't
    nest, so that's invalid, but different browsers had slightly different
    ways of showing it. (Obviously the W3C Validator was the most helpful
    tool here, since it reports it as an error rather than constructing
    any sort of DOM tree.)
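    The validator is the right tool for this, but a quick pre-check for
    that particular mistake is easy with the stdlib tokenizer. A rough
    sketch that only flags <form> start tags arriving while another form
    is still open (it knows nothing of the spec's full content rules):

```python
from html.parser import HTMLParser

class FormNestingCheck(HTMLParser):
    """Flag <form> start tags seen while another form is still open."""
    def __init__(self):
        super().__init__()
        self.open_forms = 0
        self.problems = []   # (line, column) of each nested <form>
    def handle_starttag(self, tag, attrs):
        if tag == "form":
            if self.open_forms:
                self.problems.append(self.getpos())
            self.open_forms += 1
    def handle_endtag(self, tag):
        if tag == "form" and self.open_forms:
            self.open_forms -= 1

checker = FormNestingCheck()
checker.feed("<form action='/a'><div><form action='/b'>"
             "<input name='q'></form></div></form>")
```

    Here checker.problems holds the position of the one spurious nested
    <form>.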

    ChrisA

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris Angelico@21:1/5 to Peter J. Holzer on Tue Oct 25 09:57:07 2022
    On Tue, 25 Oct 2022 at 09:34, Peter J. Holzer <hjp-python@hjp.at> wrote:
    One thing I find quite interesting, though, is the way that browsers *differ* in the face of bad nesting of tags. Recently I was struggling
    to figure out a problem with an HTML form, and eventually found that
    there was a spurious <form> tag way up higher in the page. Forms don't nest, so that's invalid, but different browsers had slightly different
    ways of showing it.

    Yeah, mismatched form tags can have weird effects. I don't remember the details but I scratched my head over that one more than once.


    Yeah. I think my weirdest issue was one time when I inadvertently had
    a <dialog> element (with a form inside it) inside something else with
    a form (because the </form> was missing). Neither "dialog inside main"
    nor "form in dialog separate from form in main" is a problem, and
    even "oops, missed a closing form tag" isn't that big a deal, but put
    them all together, and you end up with a bizarre situation where
    Firefox 91 behaves one way and Chrome (some-version) behaves another
    way.

    That was a fun day. Remember, folks, even if you think you ran the W3C validator on your code recently, it can still be worth checking. Just
    in case.

    ChrisA

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Peter J. Holzer@21:1/5 to Chris Angelico on Tue Oct 25 00:33:11 2022
    On 2022-10-25 06:56:58 +1100, Chris Angelico wrote:
    On Tue, 25 Oct 2022 at 04:22, Peter J. Holzer <hjp-python@hjp.at> wrote:
    There may be several reasons:

    * Historically, some browsers differed in which end tags were actually
    optional. Since (AFAIK) no mainstream browser ever implemented a real
    SGML parser (they were always "tag soup" parsers with lots of ad-hoc
    rules) this sometimes even changed within the same browser depending
    on context (e.g. a simple table might work but nested tables wouldn't).
    So people started to use end-tags defensively.
    * XHTML was for some time popular and it doesn't have any optional tags.
    So people got into the habit of always using end tags and writing
    empty tags as <XXX />.
    * Aesthetics: Always writing the end tags is more consistent and may
    look more balanced.
    * Cargo-cult: People saw other people do that and copied the habit
    without thinking about it.


    Are you saying that it's better to omit them all?

    If you want to conserve keystrokes :-)

    I think it doesn't matter. Both are valid.

    More importantly: Would you omit all the </p> closing tags you can, or would you include them?

    I usually write them.

    Interesting. So which of the above reasons is yours?

    Mostly the third one at this point I think. The first one has gone away
    for me with HTML5. The second one still lingers at the back of
    my brain, but I've gotten rid of the habit of writing <img .../>, so I'm
    recovering ;-). But I still like my code to be nice and tidy, and
    whether my sense of tidiness was influenced by XML or not, if the end
    tags are missing it looks off, somehow.

    (That said, I do sometimes leave them off to reduce visual clutter.)


    One thing I find quite interesting, though, is the way that browsers
    *differ* in the face of bad nesting of tags. Recently I was struggling
    to figure out a problem with an HTML form, and eventually found that
    there was a spurious <form> tag way up higher in the page. Forms don't
    nest, so that's invalid, but different browsers had slightly different
    ways of showing it.

    Yeah, mismatched form tags can have weird effects. I don't remember the
    details but I scratched my head over that one more than once.

    hp

    --
       _  | Peter J. Holzer    | Story must make more sense than reality.
    |_|_) |                    |
    | |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
    __/   | http://www.hjp.at/ |       challenge!"

  • From Tim Delaney@21:1/5 to Chris Angelico on Wed Oct 26 04:59:02 2022
    On Mon, 24 Oct 2022 at 19:03, Chris Angelico <rosuav@gmail.com> wrote:


    Ah, cool. Thanks. I'm not entirely sure of the various advantages and disadvantages of the different parsers; is there a tabulation
    anywhere, or at least a list of recommendations on choosing a suitable parser?


    Coming to this a bit late, but from my experience with BeautifulSoup and
    HTML produced by other people ...

    lxml is easily the fastest, but also the least forgiving.
    html.parser is middling on performance, but as you've seen sometimes makes mistakes.
    html5lib is the slowest, but is most forgiving of malformed input and edge cases.

    I use html5lib - it's fast enough for what I do, and the most likely to
    return results matching what the author saw when they maybe tried it in a single web browser.

    Tim Delaney

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris Angelico@21:1/5 to Tim Delaney on Wed Oct 26 05:03:25 2022
    On Wed, 26 Oct 2022 at 04:59, Tim Delaney <timothy.c.delaney@gmail.com> wrote:

    On Mon, 24 Oct 2022 at 19:03, Chris Angelico <rosuav@gmail.com> wrote:


    Ah, cool. Thanks. I'm not entirely sure of the various advantages and
    disadvantages of the different parsers; is there a tabulation
    anywhere, or at least a list of recommendations on choosing a suitable
    parser?


    Coming to this a bit late, but from my experience with BeautifulSoup and HTML produced by other people ...

    lxml is easily the fastest, but also the least forgiving.
    html.parser is middling on performance, but as you've seen sometimes makes mistakes.
    html5lib is the slowest, but is most forgiving of malformed input and edge cases.

    I use html5lib - it's fast enough for what I do, and the most likely to return results matching what the author saw when they maybe tried it in a single web browser.

    Cool cool. It sounds like html5lib should really be the recommended
    parser for HTML, unless performance or dependency reduction is
    important enough to change your plans. (But only for HTML. For XML,
    lxml would still be the right choice.)
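    That preference order is easy to encode. A hypothetical helper (not
    part of bs4) that prefers the most forgiving parser actually
    installed and falls back to the always-available stdlib one:

```python
import importlib

def pick_parser() -> str:
    """Return the name of the 'best' available BeautifulSoup tree builder:
    html5lib if installed (most forgiving, slowest), then lxml (fastest,
    least forgiving), then the always-available stdlib html.parser."""
    for candidate in ("html5lib", "lxml"):
        try:
            importlib.import_module(candidate)
            return candidate
        except ImportError:
            pass
    return "html.parser"

# Usage (assuming bs4 is installed):
#     soup = BeautifulSoup(markup, pick_parser())
```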

    ChrisA

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)