Parsing ancient HTML files is something Beautiful Soup is normally
great at. But I've run into a small problem, caused by this sort of
sloppy HTML:
from bs4 import BeautifulSoup
# See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
blob = b"""
<LI>'THERE sinks the nebulous star we call the Sun,
<LI>If that hypothesis of theirs be sound,'
<LI>Said Ida;' let us down and rest:' and we
<LI>Down from the lean and wrinkled precipices,
<LI>By every coppice-feather'd chasm and cleft,
<LI>Dropt thro' the ambrosial gloom to where below
<LI>No bigger than a glow-worm shone the tent
<LI>Lamp-lit from the inner. Once she lean'd on me,
<LI>Descending; once or twice she lent her hand,
<LI>And blissful palpitations in the blood,
<LI>Stirring a sudden transport rose and fell.
</OL>
"""
soup = BeautifulSoup(blob, "html.parser")
print(soup)
On this small snippet, it works acceptably, but puts a large number of
</li> tags immediately before the </ol>. On the original file (see
link if you want to try it), this blows right through the default
recursion limit, due to the crazy number of "nested" list items.
Is there a way to tell BS4 on parse that these <li> elements end at
the next <li>, rather than waiting for the final </ol>? This would
make tidier output, and also eliminate most of the recursion levels.
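(For context on why html.parser nests these: the standard library's event-based parser, which Beautiful Soup builds on here, reports a start event for each <li> but never an end event, so whether the items nest or stay siblings is entirely the tree builder's decision. A minimal stdlib-only sketch of mine, independent of Beautiful Soup:)

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Record raw start/end tag events without building a tree."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

snippet = """
<LI>'THERE sinks the nebulous star we call the Sun,
<LI>If that hypothesis of theirs be sound,'
</OL>
"""

logger = EventLogger()
logger.feed(snippet)
print(logger.events)
# Two "start li" events and one "end ol" -- no "end li" at all, so a
# naive tree builder stacks each <li> inside the previous one.
```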
On 24/10/2022 at 4:29, Chris Angelico wrote:
Using html5lib (install package html5lib) instead of html.parser seems
to do the trick: it inserts </li> right before the next <li>, and one
before the closing </ol>. On my system the same happens when I don't
specify a parser, but IIRC that's a bit fragile because other systems
can choose different parsers if you don't explicitly specify one.
On Mon, 24 Oct 2022 at 18:43, Roel Schroeven <roel@roelschroeven.net>
wrote:
Ah, cool. Thanks. I'm not entirely sure of the various advantages and
disadvantages of the different parsers; is there a tabulation
anywhere, or at least a list of recommendations on choosing a suitable
parser?

There's a bit of information here:
https://beautiful-soup-4.readthedocs.io/en/latest/#installing-a-parser
I'm dealing with a HUGE mess of different coding standards, all the
way from 1990s-level stuff (images for indentation, tables for
formatting, and <FONT FACE="Wingdings">) up through HTML4 (a good few
of the pages have at least some <meta> tags and declare their
encodings, mostly ISO-8859-1 or similar), to fairly modern HTML5.
There are even a couple of pages that use frames - yes, the old style
with a <noframes> block in case the browser can't handle it. I went
with html.parser on the expectation that it'd give the best "across
all standards" results, but I'll give html5lib a try and see if it
does better.
Would rather not try to use different parsers for different files, but
if necessary, I'll figure something out.
(For reference, this is roughly 9000 HTML files that have to be
parsed. Doing things by hand is basically not an option.)
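(Since many of those pages declare their encoding in a <meta> tag, one
batch-friendly trick is to sniff the declared charset from the raw
bytes before decoding; Beautiful Soup's own UnicodeDammit does
something similar, and more thoroughly. A rough sketch of mine - the
regex, the 2048-byte window, and the ISO-8859-1 fallback are my own
assumptions, not anything from the thread:)

```python
import re

# Matches either <meta charset="..."> or the older form
# <meta http-equiv="Content-Type" content="text/html; charset=...">
_CHARSET_RE = re.compile(
    rb'<meta[^>]+charset=["\']?\s*([-\w]+)', re.IGNORECASE)

def sniff_encoding(raw: bytes, default: str = "iso-8859-1") -> str:
    """Best-effort guess at a page's declared encoding."""
    m = _CHARSET_RE.search(raw[:2048])  # declarations live near the top
    if m:
        return m.group(1).decode("ascii", "replace").lower()
    return default

old_page = b'<meta http-equiv="Content-Type" ' \
           b'content="text/html; charset=ISO-8859-1">'
print(sniff_encoding(old_page))                       # iso-8859-1
print(sniff_encoding(b"<html>no declaration</html>")) # fallback
```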
Roel has already noted that the lxml and html5lib parsers do the right
thing, so just for the record:

The HTML fragment above is well-formed and contains a number of li
elements at the same level directly below the ol element, not lots of
nested li elements. The end tag of the li element is optional (except
in XHTML) and li elements don't nest.
On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer <hjp-python@hjp.at> wrote:
That's correct. However, parsing it with html.parser and then
reconstituting it as shown in the example code results in all the
</li> tags coming up right before the </ol>, indicating that the <li>
tags were parsed as deeply nested rather than as siblings.
In order to get a successful parse out of this, I need something which
sees them as siblings, which html5lib seems to be doing fine. Whether
it has other issues, I don't know, but I guess I'll find out....
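(If html5lib ever proves too slow for the batch, the implied-close rule
the standard describes is small enough to hand-roll on top of the
stdlib parser: when a new <li> starts while one is open, close the open
one first. A toy sketch of my own that just collects sibling items - it
is not a Beautiful Soup hook, only an illustration of the rule:)

```python
from html.parser import HTMLParser

class SiblingListParser(HTMLParser):
    """Treat a new <li> as implicitly closing any open <li>."""
    def __init__(self):
        super().__init__()
        self.open_li = False
        self.items = []    # text of each completed list item
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            if self.open_li:
                self._close_li()   # implied </li>, as in the HTML spec
            self.open_li = True

    def handle_endtag(self, tag):
        if tag in ("li", "ol", "ul") and self.open_li:
            self._close_li()

    def handle_data(self, data):
        if self.open_li:
            self._buf.append(data)

    def _close_li(self):
        self.items.append("".join(self._buf).strip())
        self._buf = []
        self.open_li = False

p = SiblingListParser()
p.feed("<OL><LI>one<LI>two<LI>three</OL>")
print(p.items)  # ['one', 'two', 'three']
```

Three flat siblings, no recursion depth to speak of - the open-items
state is a single flag rather than a stack.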
On 2022-10-24 21:56:13 +1100, Chris Angelico wrote:
Yes, I got that. What I wanted to say was that this is indeed a bug in
html.parser and not an error (or sloppiness, as you called it) in the
input or ambiguity in the HTML standard.
which html5lib seems to be doing fine. Whether
it has other issues, I don't know, but I guess I'll find out....
The link somebody posted mentions that it's "very slow". Which may or
may not be a problem when you have to parse 9000 files. But if it does
implement HTML5 correctly, it should parse any file the same as a
modern browser does (maybe excluding quirks mode).
On 2022-10-24, Chris Angelico <rosuav@gmail.com> wrote:
I described the HTML as "sloppy" for a number of reasons, but I was of
the understanding that it's generally recommended to have the closing
tags. Not that it matters much.
Some elements don't need close tags, or even open tags. Unless you're
using XHTML you don't need them and indeed for the case of void tags
(e.g. <br>, <img>) you must not include the close tags.
Adding in the omitted <head>, </head>, <body>, </body>, and </html>
would make no difference and there's no particular reason to recommend
doing so as far as I'm aware.
On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list
<python-list@python.org> wrote:
Some elements don't need close tags, or even open tags. Unless you're
using XHTML you don't need them and indeed for the case of void tags
(e.g. <br>, <img>) you must not include the close tags.
Yep, I'm aware of void tags, but I'm talking about the container tags
- in this case, <li> and <p> - which, in a lot of older HTML pages,
are treated as "separator" tags.
Consider this content:
<HTML>
Hello, world!
<P>
Paragraph 2
<P>
Hey look, a third paragraph!
</HTML>
Stick a doctype onto that and it should be valid HTML5, but as it is,
it's the exact sort of thing that was quite common in the 90s.
Adding in the omitted <head>, </head>, <body>, </body>, and </html>
would make no difference and there's no particular reason to recommend
doing so as far as I'm aware.
And yet most people do it. Why?
Are you saying that it's better to omit them all?
More importantly: Would you omit all the </p> closing tags you can, or
would you include them?
On 2022-10-24, Chris Angelico <rosuav@gmail.com> wrote:
On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list <python-list@python.org> wrote:
Adding in the omitted <head>, </head>, <body>, </body>, and </html>
would make no difference and there's no particular reason to recommend
doing so as far as I'm aware.
And yet most people do it. Why?
They agree with Tim Peters that "Explicit is better than implicit",
I suppose? ;-)
More importantly: Would you omit all the </p> closing tags you can, or
would you include them?

It would depend on how much content was inside them, I guess.
Something like:
<ol>
<li>First item
<li>Second item
<li>Third item
</ol>
is very easy to understand, but if each item was many lines long then it
may be less confusing to explicitly close - not least for indentation purposes.
The <p> tag is not a void tag, but according to the spec, it's legal
to omit the </p> if the element is followed directly by another <p>
element (or any of a specific set of others), or if there is no
further content.
Adding in the omitted <head>, </head>, <body>, </body>, and </html>
would make no difference and there's no particular reason to recommend doing so as far as I'm aware.
And yet most people do it. Why?
Are you saying that it's better to omit them all?
More importantly: Would you omit all the </p> closing tags you can, or
would you include them?
There may be several reasons:
* Historically, some browsers differed in which end tags were actually
optional. Since (AFAIK) no mainstream browser ever implemented a real
SGML parser (they were always "tag soup" parsers with lots of ad-hoc
rules) this sometimes even changed within the same browser depending
on context (e.g. a simple table might work but nested tables wouldn't).
So people started to use end-tags defensively.
* XHTML was for some time popular and it doesn't have any optional tags.
So people got into the habit of always using end tags and writing
empty tags as <XXX />.
* Aesthetics: Always writing the end tags is more consistent and may
look more balanced.
* Cargo-cult: People saw other people do that and copied the habit
without thinking about it.
Are you saying that it's better to omit them all?
If you want to conserve keystrokes :-)
I think it doesn't matter. Both are valid.
More importantly: Would you omit all the </p> closing tags you can, or would you include them?
I usually write them.
I also indent the contents of an element, so I
would write your example as:
<!DOCTYPE html>
<html>
<body>
Hello, world!
<p>
Paragraph 2
</p>
<p>
Hey look, a third paragraph!
</p>
</body>
</html>
(As you can see I would also include the body tags to make that
element explicit. I would normally also add a bit of boilerplate
(especially a head with a charset and viewport definition), but I omit
them here since they would change the parse tree.)
One thing I find quite interesting, though, is the way that browsers
*differ* in the face of bad nesting of tags. Recently I was struggling
to figure out a problem with an HTML form, and eventually found that
there was a spurious <form> tag way up higher in the page. Forms don't
nest, so that's invalid, but different browsers had slightly different
ways of showing it.

Yeah, mismatched form tags can have weird effects. I don't remember
the details but I scratched my head over that one more than once.
On Tue, 25 Oct 2022 at 04:22, Peter J. Holzer <hjp-python@hjp.at> wrote:
Interesting. So which of the above reasons is yours?
On Mon, 24 Oct 2022 at 19:03, Chris Angelico <rosuav@gmail.com> wrote:
Ah, cool. Thanks. I'm not entirely sure of the various advantages and
disadvantages of the different parsers; is there a tabulation
anywhere, or at least a list of recommendations on choosing a suitable
parser?
Coming to this a bit late, but from my experience with BeautifulSoup
and HTML produced by other people ...

lxml is easily the fastest, but also the least forgiving.

html.parser is middling on performance, but as you've seen sometimes
makes mistakes.

html5lib is the slowest, but is most forgiving of malformed input and
edge cases.

I use html5lib - it's fast enough for what I do, and the most likely
to return results matching what the author saw when they maybe tried
it in a single web browser.
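(One way to square "best parser" with "may not be installed everywhere"
is to probe for the preferred tree builders at runtime and fall back to
the stdlib one; Beautiful Soup takes the feature name as its second
argument either way. A small helper of my own, assuming the usual
package names - not anything from the thread:)

```python
from importlib import import_module

def pick_parser(preferred=("html5lib", "lxml")):
    """Return the name of the best installed Beautiful Soup tree
    builder, falling back to the always-available stdlib parser."""
    for name in preferred:
        try:
            import_module(name)   # probe: is the package importable?
            return name
        except ImportError:
            continue
    return "html.parser"

# Then: soup = BeautifulSoup(blob, pick_parser())
print(pick_parser())
```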