A few years ago I created the privacy-breach lintian checks in
order to detect trackers in our documentation.
I think we are losing the battle here.
I believe that we need better tools than sed to fix this kind
of problem.
I have some ideas, like:
- read the HTML tree
- convert the DOM representation of the HTML tree to an XML serialization (so-called
XHTML5 or polyglot)
- apply XSLT 2 rules to this XHTML5 to fix the privacy breach
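The first two steps above could be sketched roughly as follows. This is only an illustration in Python, using the standard library; the thread itself discusses Perl, JavaScript and XSLT 2. The class name LenientTreeBuilder, the function html_to_xml and the "document" wrapper element are made up for the example, and html.parser is merely lenient: it does not implement the HTML5 tree-construction algorithm that a browser uses, and comments and the doctype are dropped.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML void elements never have an end tag, so they are never left open.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class LenientTreeBuilder(HTMLParser):
    """Build an ElementTree from sloppy HTML (not a full HTML5 parser)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # hypothetical wrapper element
        self.stack = [self.root]            # currently open elements

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "async") become empty strings in XML.
        attributes = {name: value or "" for name, value in attrs}
        element = ET.SubElement(self.stack[-1], tag, attributes)
        if tag not in VOID_ELEMENTS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close everything up to the nearest matching open element;
        # stray end tags with no matching start tag are ignored.
        for index in range(len(self.stack) - 1, 0, -1):
            if self.stack[index].tag == tag:
                del self.stack[index:]
                break

    def handle_data(self, data):
        # ElementTree stores text either on the parent or as the "tail"
        # of the preceding sibling.
        parent = self.stack[-1]
        if len(parent):
            last_child = parent[-1]
            last_child.tail = (last_child.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(html_text):
    """Parse sloppy HTML and return a well-formed XML serialization."""
    builder = LenientTreeBuilder()
    builder.feed(html_text)
    builder.close()
    return ET.tostring(builder.root, encoding="unicode")
```

The output of html_to_xml() can then be handed to whatever applies the rules, for example an XSLT 2 processor, which Python's standard library does not provide.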
The problem is which tools to use...
I would like to use JavaScript for this kind of transformation, but
nodejs does not compile on armel, and for Saxon-CE I need GWT, which is
not in Debian...
I could use saxon2, but it would need Java.
Quoting Bastien Roucariès (2021-09-02 17:53:18)
Perl is famous for its text juggling features, and sloppy parsing of
HTML can be done e.g. with HTML::HTML5::Parser (i.e. Debian package libhtml-html5-parser-perl).
Also, debhelper itself is written in Perl, so it is likely easier to
integrate plugins written in Perl as well. If Perl is an option at all, obviously...
I am sure Python/Ruby/PHP/Haskell/Scheme/Rust/etc. folks will argue that their pet language is the right one for the task as well: I think it will
help the conversation if you clarify what you are open to and what your constraints are.
E.g. do you mean that it *must* be JavaScript when you mention that? Or
are you perhaps asking if someone else wants to take over the challenge
from you, so it does not matter how it is done?
- Jonas
Perl is an option: I implemented the privacy-breach test in Perl. The
problem is that I would prefer to drop a debian/package.privacy.xslt file into the package instead of asking the maintainer to code the removal of privacy problems...
The generic rules could be coded in Perl, but for the per-package side I need
something like XSLT 2.
No, it does not have to be JavaScript, but I would like to use V8 or something like
browser internals in order to still get a DOM tree in case of a broken HTML
file, like a browser does. But maybe I am overcautious.
On Fri, 3 Sep 2021 at 01:03, Jonas Smedegaard <jonas@jones.dk> wrote:
If you are asking how to sloppily parse HTML5 files from upstream source and XSLT2 files provided by package maintainers, then with perl you
could use HTML::HTML5::Parser for the first and XML::Saxon::XSLT2 for
the second.
Unfortunately HTML::HTML5::Parser has been RC-buggy for 4 years because of a
UTF-8 handling bug (#750946): https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=750946
Your suggestion will work fine, but we need to find a solution to
this UTF-8 problem...
Quoting Bastien Roucariès (2021-09-02 23:45:30)
If you are asking how to sloppily parse HTML5 files from upstream source
and XSLT2 files provided by package maintainers, then with perl you
could use HTML::HTML5::Parser for the first and XML::Saxon::XSLT2 for
the second.
If you are asking how to parse HTML5 files like a web browser, then with
perl you could use Gtk3::WebKit2 for that.
- Jonas
--
* Jonas Smedegaard - idealist & Internet-arkitekt
* Tlf.: +45 40843136 Website: http://dr.jones.dk/
[x] quote me freely [ ] ask before reusing [ ] keep private
Quoting Bastien ROUCARIES (2021-09-04 19:52:50)
Ouch!
I keep forgetting which packages are affected by that annoying bug :-/
I have recently grown somewhat more familiar with UTF-8 and Perl (in my
work towards fixing bug#867305 in licensecheck), and will try to take a
fresh look at bug#750946...
- Jonas
On Sat, 4 Sep 2021 at 18:03, Jonas Smedegaard <jonas@jones.dk> wrote:
I have recently grown somewhat more familiar with UTF-8 and Perl (in my work towards fixing bug#867305 in licensecheck), and will try to take a
fresh look at bug#750946...
The solution is straightforward; I just sent you a mail. Use HTML5 encoding
sniffing and add an optional parameter to the method to specify the encoding.
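As a rough illustration of that suggestion, the sketch below (in Python, not Perl, and not the actual HTML::HTML5::Parser API) picks an encoding in this order: an explicit override from the caller, a byte order mark, a meta charset declaration in the first 1024 bytes, then a fallback. It is a heavily simplified take on the encoding sniffing described in the HTML standard, not an implementation of it. The function name sniff_encoding, its parameters and the windows-1252 fallback are assumptions made for the example.

```python
import codecs
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def sniff_encoding(data, override=None, default="windows-1252"):
    """Guess the encoding of raw HTML bytes; return Python's codec name."""
    # 1. An explicit encoding given by the caller always wins.
    if override:
        return codecs.lookup(override).name
    # 2. A byte order mark is the next strongest signal.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # 3. Look for a charset declaration near the start of the file.
    match = _META_CHARSET.search(data[:1024])
    if match:
        try:
            return codecs.lookup(match.group(1).decode("ascii")).name
        except LookupError:
            pass  # unknown label: fall through to the default
    # 4. Fall back to an assumed default.
    return codecs.lookup(default).name
```

A caller who already knows the encoding (for instance from packaging metadata) would pass override="utf-8" and skip the guessing entirely.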
Quoting Jonas Smedegaard (2021-09-04 22:51:40)
No, I really need to output XML and then pass it to Saxon...
It is my understanding that upstream would consider the API being
tied to the API - i.e. if you want a different API then look for
a different module.
It seems the upstream author of HTML::HTML5::Parser has himself switched
to HTML5::DOM for his newer work: https://metacpan.org/pod/Types::HTML5
- Jonas