• Ideas for a dh-privacy-helper

    From Bastien Roucariès@21:1/5 to Debian Debian Developers on Thu Sep 2 15:53:18 2021
    Copy: pabs@debian.org (pabs@debian.org)

    Hi,

    A few years ago I created the privacy-breach lintian checks in order to detect trackers in our documentation.

    I think we are losing the battle here.

    I believe that we need better tools than sed in order to fix this kind of problem.

    I have some ideas, like:
    - read the HTML tree
    - convert the DOM representation of the HTML tree to an XML serialization (so-called XHTML5 or polyglot)
    - apply XSLT 2 rules to this XHTML5 to fix the privacy breach
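
The three steps listed above can be sketched roughly as follows. This is only an illustration under loud assumptions: it uses Python's standard library rather than the tools named in this thread, `html.parser` is a tolerant parser but not an HTML5-conformant one, and the hypothetical `strip_remote_scripts` function stands in for what would be an XSLT 2 rule. The names `HtmlTreeBuilder`, `fix_html` and the `root` wrapper element are invented for the example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class HtmlTreeBuilder(HTMLParser):
    """Step 1: read (possibly sloppy) HTML into an element tree."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("root")  # hypothetical wrapper element
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        elem = ET.SubElement(self.stack[-1], tag,
                             {k: v or "" for k, v in attrs})
        if tag not in VOID:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Tolerate stray end tags: close up to the nearest open match, if any.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):  # text that follows a child element
            parent[-1].tail = (parent[-1].tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def is_remote(url):
    return url.startswith(("http://", "https://", "//"))


def strip_remote_scripts(root):
    """Step 3 stand-in: drop <script> elements that load remote code."""
    removed = 0
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "script" and is_remote(child.get("src", "")):
                # Preserve the text that followed the removed element.
                idx = list(parent).index(child)
                if child.tail:
                    if idx > 0:
                        prev = parent[idx - 1]
                        prev.tail = (prev.tail or "") + child.tail
                    else:
                        parent.text = (parent.text or "") + child.tail
                parent.remove(child)
                removed += 1
    return removed


def fix_html(html):
    builder = HtmlTreeBuilder()
    builder.feed(html)
    builder.close()
    removed = strip_remote_scripts(builder.root)
    # Step 2: serialize the tree as well-formed XML.
    return ET.tostring(builder.root, encoding="unicode"), removed
```

Calling `fix_html('<p>Hi<br><script src="https://tracker.example/t.js"></script></p>')` returns the XML text and the number of removed elements. The output is well-formed XML but not necessarily valid polyglot markup (for instance, empty elements are self-closed), so a real helper would need a proper HTML5 parser and serializer.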

    The problem is which tools to use...

    I would like to use JavaScript for this kind of transformation, but nodejs does not compile on armel, and for saxon-ce I need gwt, which is not in Debian...

    I could use saxon2, but it would need Java.

    Any ideas are welcome.

    Bastien

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jonas Smedegaard@21:1/5 to All on Thu Sep 2 18:20:02 2021
    Quoting Bastien Roucariès (2021-09-02 17:53:18)
    > A few years ago I created the privacy-breach lintian checks in
    > order to detect trackers in our documentation.
    >
    > I think we are losing the battle here.
    >
    > I believe that we need better tools than sed in order to fix this kind
    > of problem.
    >
    > I have some ideas, like:
    > - read the HTML tree
    > - convert the DOM representation of the HTML tree to an XML
    >   serialization (so-called XHTML5 or polyglot)
    > - apply XSLT 2 rules to this XHTML5 to fix the privacy breach
    >
    > The problem is which tools to use...
    >
    > I would like to use JavaScript for this kind of transformation, but
    > nodejs does not compile on armel, and for saxon-ce I need gwt, which
    > is not in Debian...
    >
    > I could use saxon2, but it would need Java.

    Perl is famous for its text juggling features, and sloppy parsing of
    HTML can be done e.g. with HTML::HTML5::Parser (i.e. Debian package
    libhtml-html5-parser-perl).

    Also, debhelper itself is written in Perl, so it is likely easier to
    integrate plugins written in Perl as well. If Perl is an option at
    all, obviously...

    I am sure Python/Ruby/PHP/Haskell/Scheme/Rust/etc. folks will argue that
    their pet language is the right one for the task as well: I think it will
    help the conversation if you clarify what you are open to and what your
    constraints are.

    E.g. do you mean that it *must* be JavaScript when you mention that? Or
    are you perhaps asking if someone else wants to take over the challenge
    from you, so it does not matter how it is done?


    - Jonas

    --
    * Jonas Smedegaard - idealist & Internet-arkitekt
    * Tlf.: +45 40843136 Website: http://dr.jones.dk/

    [x] quote me freely [ ] ask before reusing [ ] keep private

  • From Bastien Roucariès@21:1/5 to Debian Debian Developers on Thu Sep 2 21:45:30 2021
    Copy: jonas@jones.dk (Jonas Smedegaard)

    On Thursday, 2 September 2021, 16:11:48 UTC, Jonas Smedegaard wrote:
    > Quoting Bastien Roucariès (2021-09-02 17:53:18)
    > > A few years ago I created the privacy-breach lintian checks in
    > > order to detect trackers in our documentation.
    > >
    > > I think we are losing the battle here.
    > >
    > > I believe that we need better tools than sed in order to fix this
    > > kind of problem.
    > >
    > > I have some ideas, like:
    > > - read the HTML tree
    > > - convert the DOM representation of the HTML tree to an XML
    > >   serialization (so-called XHTML5 or polyglot)
    > > - apply XSLT 2 rules to this XHTML5 to fix the privacy breach
    > >
    > > The problem is which tools to use...
    > >
    > > I would like to use JavaScript for this kind of transformation, but
    > > nodejs does not compile on armel, and for saxon-ce I need gwt, which
    > > is not in Debian...
    > >
    > > I could use saxon2, but it would need Java.
    >
    > Perl is famous for its text juggling features, and sloppy parsing of
    > HTML can be done e.g. with HTML::HTML5::Parser (i.e. Debian package
    > libhtml-html5-parser-perl).
    >
    > Also, debhelper itself is written in Perl, so it is likely easier to
    > integrate plugins written in Perl as well. If Perl is an option at
    > all, obviously...

    Perl is an option: I implemented the privacy-breach test in Perl. The problem is that I would prefer to drop a debian/package.privacy.xslt file in the package instead of asking the maintainer to code the removal of privacy problems...

    Generic ones could be coded in Perl, but for the package-specific side I need something like XSLT 2.

    > I am sure Python/Ruby/PHP/Haskell/Scheme/Rust/etc. folks will argue
    > that their pet language is the right one for the task as well: I think
    > it will help the conversation if you clarify what you are open to and
    > what your constraints are.
    >
    > E.g. do you mean that it *must* be JavaScript when you mention that? Or
    > are you perhaps asking if someone else wants to take over the challenge
    > from you, so it does not matter how it is done?

    No, it need not be JavaScript, but I would like to use V8 or something like browser internals, so that a broken HTML file still yields a DOM tree, as it does in a browser. But maybe I am overly cautious.

    Bastien

    > - Jonas


  • From Jonas Smedegaard@21:1/5 to All on Fri Sep 3 03:10:02 2021
    Quoting Bastien Roucariès (2021-09-02 23:45:30)
    > Perl is an option: I implemented the privacy-breach test in Perl. The
    > problem is that I would prefer to drop a debian/package.privacy.xslt
    > file in the package instead of asking the maintainer to code the
    > removal of privacy problems...
    >
    > Generic ones could be coded in Perl, but for the package-specific side
    > I need something like XSLT 2.

    If you are asking how to sloppily parse HTML5 files from upstream source
    and XSLT2 files provided by package maintainers, then with Perl you
    could use HTML::HTML5::Parser for the first and XML::Saxon::XSLT2 for
    the second.


    > > I am sure Python/Ruby/PHP/Haskell/Scheme/Rust/etc. folks will argue
    > > that their pet language is the right one for the task as well: I
    > > think it will help the conversation if you clarify what you are open
    > > to and what your constraints are.
    > >
    > > E.g. do you mean that it *must* be JavaScript when you mention that?
    > > Or are you perhaps asking if someone else wants to take over the
    > > challenge from you, so it does not matter how it is done?
    >
    > No, it need not be JavaScript, but I would like to use V8 or something
    > like browser internals, so that a broken HTML file still yields a DOM
    > tree, as it does in a browser. But maybe I am overly cautious.

    If you are asking how to parse HTML5 files like a web browser, then with
    Perl you could use Gtk3::WebKit2 for that.


    - Jonas

  • From Paul Wise@21:1/5 to All on Fri Sep 3 03:30:01 2021
    On Thu, 2021-09-02 at 15:53 +0000, Bastien Roucariès wrote:

    > A few years ago I created the privacy-breach lintian checks in
    > order to detect trackers in our documentation.
    >
    > I think we are losing the battle here.

    These lintian checks are a good start, but they are just heuristics
    that cannot detect how documentation will behave when loaded into a
    browser or another appropriate documentation viewer. This is especially
    true for documentation with scripting languages and/or interactive
    features.

    Another thing to do would be to load the documentation in the most
    appropriate viewer, interact with it in expected ways and monitor for
    network activity or other data-leaking mechanisms (e.g. WebBluetooth).

    There are also many other types of privacy issues with using Debian:

    https://wiki.debian.org/PrivacyIssues

    > I believe that we need better tools than sed in order to fix this kind
    > of problem.

    Could you detail the kinds of issues you are seeing with sed that make
    you want to replace it with something else?

    --
    bye,
    pabs

    https://wiki.debian.org/PaulWise

  • From Jonas Smedegaard@21:1/5 to All on Sat Sep 4 20:10:01 2021
    Quoting Bastien ROUCARIES (2021-09-04 19:52:50)
    > On Fri, 3 Sep 2021 at 01:03, Jonas Smedegaard <jonas@jones.dk> wrote:
    > > Quoting Bastien Roucariès (2021-09-02 23:45:30)
    > > > Perl is an option: I implemented the privacy-breach test in Perl.
    > > > The problem is that I would prefer to drop a
    > > > debian/package.privacy.xslt file in the package instead of asking
    > > > the maintainer to code the removal of privacy problems...
    > > >
    > > > Generic ones could be coded in Perl, but for the package-specific
    > > > side I need something like XSLT 2.
    > >
    > > If you are asking how to sloppily parse HTML5 files from upstream
    > > source and XSLT2 files provided by package maintainers, then with
    > > Perl you could use HTML::HTML5::Parser for the first and
    > > XML::Saxon::XSLT2 for the second.
    >
    > Unfortunately, HTML::HTML5::Parser has been RC-buggy for 4 years due
    > to a bug in handling UTF-8 (#750946):
    > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=750946

    Ouch!

    I keep forgetting which packages are affected by that annoying bug :-/


    > Your suggestion will work fine, but we need a solution for this UTF-8
    > problem...

    I have recently grown somewhat more familiar with UTF-8 and Perl (in my
    work towards fixing bug#867305 in licensecheck), and will try to take a
    fresh look at bug#750946...


    - Jonas

  • From Bastien ROUCARIES@21:1/5 to All on Sat Sep 4 20:10:02 2021
    On Fri, 3 Sep 2021 at 01:03, Jonas Smedegaard <jonas@jones.dk> wrote:
    > Quoting Bastien Roucariès (2021-09-02 23:45:30)
    > > Perl is an option: I implemented the privacy-breach test in Perl.
    > > The problem is that I would prefer to drop a
    > > debian/package.privacy.xslt file in the package instead of asking
    > > the maintainer to code the removal of privacy problems...
    > >
    > > Generic ones could be coded in Perl, but for the package-specific
    > > side I need something like XSLT 2.
    >
    > If you are asking how to sloppily parse HTML5 files from upstream source
    > and XSLT2 files provided by package maintainers, then with Perl you
    > could use HTML::HTML5::Parser for the first and XML::Saxon::XSLT2 for
    > the second.

    Unfortunately, HTML::HTML5::Parser has been RC-buggy for 4 years due to
    a bug in handling UTF-8 (#750946):
    https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=750946

    Your suggestion will work fine, but we need a solution for this UTF-8
    problem...

    Bastien

    > > > I am sure Python/Ruby/PHP/Haskell/Scheme/Rust/etc. folks will argue
    > > > that their pet language is the right one for the task as well: I
    > > > think it will help the conversation if you clarify what you are
    > > > open to and what your constraints are.
    > > >
    > > > E.g. do you mean that it *must* be JavaScript when you mention
    > > > that? Or are you perhaps asking if someone else wants to take over
    > > > the challenge from you, so it does not matter how it is done?
    > >
    > > No, it need not be JavaScript, but I would like to use V8 or something
    > > like browser internals, so that a broken HTML file still yields a DOM
    > > tree, as it does in a browser. But maybe I am overly cautious.
    >
    > If you are asking how to parse HTML5 files like a web browser, then with
    > Perl you could use Gtk3::WebKit2 for that.
    >
    > - Jonas

  • From Bastien ROUCARIES@21:1/5 to All on Sat Sep 4 20:50:01 2021
    On Sat, 4 Sep 2021 at 18:03, Jonas Smedegaard <jonas@jones.dk> wrote:
    > Quoting Bastien ROUCARIES (2021-09-04 19:52:50)
    > > On Fri, 3 Sep 2021 at 01:03, Jonas Smedegaard <jonas@jones.dk> wrote:
    > > > Quoting Bastien Roucariès (2021-09-02 23:45:30)
    > > > > Perl is an option: I implemented the privacy-breach test in
    > > > > Perl. The problem is that I would prefer to drop a
    > > > > debian/package.privacy.xslt file in the package instead of
    > > > > asking the maintainer to code the removal of privacy problems...
    > > > >
    > > > > Generic ones could be coded in Perl, but for the
    > > > > package-specific side I need something like XSLT 2.
    > > >
    > > > If you are asking how to sloppily parse HTML5 files from upstream
    > > > source and XSLT2 files provided by package maintainers, then with
    > > > Perl you could use HTML::HTML5::Parser for the first and
    > > > XML::Saxon::XSLT2 for the second.
    > >
    > > Unfortunately, HTML::HTML5::Parser has been RC-buggy for 4 years due
    > > to a bug in handling UTF-8 (#750946):
    > > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=750946
    >
    > Ouch!
    >
    > I keep forgetting which packages are affected by that annoying bug :-/
    >
    > > Your suggestion will work fine, but we need a solution for this
    > > UTF-8 problem...
    >
    > I have recently grown somewhat more familiar with UTF-8 and Perl (in my
    > work towards fixing bug#867305 in licensecheck), and will try to take a
    > fresh look at bug#750946...

    The solution is straightforward; I just sent you a mail. Use HTML5
    encoding sniffing and add an optional parameter to the method to
    specify the encoding.
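
The idea above (sniff the encoding, but let the caller override it through an optional parameter) might look roughly like the following. This is a simplified sketch in Python, not the HTML::HTML5::Parser API and not a full implementation of the HTML encoding sniffing algorithm: the function names `sniff_encoding` and `decode_html`, the `encoding` and `default` parameters, and the choice to let the explicit parameter win over everything else are all assumptions made for illustration.

```python
import codecs
import re

# Byte order marks checked first when no explicit encoding is given.
BOMS = [(codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be")]

# Rough match for <meta charset=...> and <meta ... content="...charset=...">.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE)


def sniff_encoding(raw, encoding=None, default="windows-1252"):
    """Pick an encoding name for raw HTML bytes.

    Order used here: explicit override, BOM, <meta> declaration within
    the first 1024 bytes, then the fallback default.
    """
    if encoding is not None:
        return codecs.lookup(encoding).name
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    match = META_CHARSET.search(raw[:1024])
    if match:
        try:
            return codecs.lookup(match.group(1).decode("ascii")).name
        except LookupError:
            pass  # unknown label: fall through to the default
    return default


def decode_html(raw, encoding=None):
    """Decode raw HTML bytes, honouring an optional explicit encoding."""
    name = sniff_encoding(raw, encoding)
    for bom, _ in BOMS:
        if raw.startswith(bom):
            raw = raw[len(bom):]  # do not leak the BOM into the text
            break
    return raw.decode(name, errors="replace")
```

With this shape, existing callers that pass only the bytes keep working, while a caller that already knows the encoding can write `decode_html(raw, encoding="utf-8")` and skip the guessing.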

    Bastien

    > - Jonas

  • From Jonas Smedegaard@21:1/5 to All on Sat Sep 4 23:00:02 2021
    Quoting Bastien ROUCARIES (2021-09-04 20:28:49)
    > On Sat, 4 Sep 2021 at 18:03, Jonas Smedegaard <jonas@jones.dk> wrote:
    > > Quoting Bastien ROUCARIES (2021-09-04 19:52:50)
    > > > On Fri, 3 Sep 2021 at 01:03, Jonas Smedegaard <jonas@jones.dk> wrote:
    > > > > Quoting Bastien Roucariès (2021-09-02 23:45:30)
    > > > > > Perl is an option: I implemented the privacy-breach test in
    > > > > > Perl. The problem is that I would prefer to drop a
    > > > > > debian/package.privacy.xslt file in the package instead of
    > > > > > asking the maintainer to code the removal of privacy
    > > > > > problems...
    > > > > >
    > > > > > Generic ones could be coded in Perl, but for the
    > > > > > package-specific side I need something like XSLT 2.
    > > > >
    > > > > If you are asking how to sloppily parse HTML5 files from
    > > > > upstream source and XSLT2 files provided by package maintainers,
    > > > > then with Perl you could use HTML::HTML5::Parser for the first
    > > > > and XML::Saxon::XSLT2 for the second.
    > > >
    > > > Unfortunately, HTML::HTML5::Parser has been RC-buggy for 4 years
    > > > due to a bug in handling UTF-8 (#750946):
    > > > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=750946
    > >
    > > Ouch!
    > >
    > > I keep forgetting which packages are affected by that annoying bug :-/
    > >
    > > > Your suggestion will work fine, but we need a solution for this
    > > > UTF-8 problem...
    > >
    > > I have recently grown somewhat more familiar with UTF-8 and Perl (in
    > > my work towards fixing bug#867305 in licensecheck), and will try to
    > > take a fresh look at bug#750946...
    >
    > The solution is straightforward; I just sent you a mail. Use HTML5
    > encoding sniffing and add an optional parameter to the method to
    > specify the encoding.

    It seems to me, and it seems from your posts to the upstream bug report
    that you agree, that a "straightforward" solution breaks the API,
    whereas a solution which preserves the API is hard.

    It is my understanding that upstream would consider the API to be tied
    to the module, i.e. if you want a different API then look for a
    different module.

    Therefore: What do you think about using HTML5::DOM instead? It is not
    yet in Debian, so it will need a "sudo apt install cpanminus; cpanm
    HTML5::DOM". If useful, I can offer to package it for Debian.


    - Jonas

  • From Bastien ROUCARIES@21:1/5 to All on Sun Sep 5 00:00:01 2021
    On Sat, 4 Sep 2021 at 20:54, Jonas Smedegaard <jonas@jones.dk> wrote:
    > Quoting Jonas Smedegaard (2021-09-04 22:51:40)
    > > It is my understanding that upstream would consider the API to be
    > > tied to the module, i.e. if you want a different API then look for a
    > > different module.
    >
    > Seems the upstream author of HTML::HTML5::Parser even himself switched
    > to HTML5::DOM for his newer work: https://metacpan.org/pod/Types::HTML5

    No, I really need to output XML and then pass it to Saxon...

    > - Jonas

  • From Bastien ROUCARIES@21:1/5 to All on Sun Sep 5 00:40:01 2021
    On Sat, 4 Sep 2021 at 20:54, Jonas Smedegaard <jonas@jones.dk> wrote:
    > Quoting Jonas Smedegaard (2021-09-04 22:51:40)
    > > It is my understanding that upstream would consider the API to be
    > > tied to the module, i.e. if you want a different API then look for a
    > > different module.
    >
    > Seems the upstream author of HTML::HTML5::Parser even himself switched
    > to HTML5::DOM for his newer work: https://metacpan.org/pod/Types::HTML5

    OK, reading the source, I need HTML5::DOM and Types::HTML5.

    Types::HTML5 seems to offer an html_to_xml method, which is what I need.

    Bastien

    > - Jonas

  • From Jonas Smedegaard@21:1/5 to All on Thu Sep 9 00:20:01 2021
    Quoting Jonas Smedegaard (2021-09-04 20:02:57)
    > Quoting Bastien ROUCARIES (2021-09-04 19:52:50)
    > > On Fri, 3 Sep 2021 at 01:03, Jonas Smedegaard <jonas@jones.dk> wrote:
    > > > Quoting Bastien Roucariès (2021-09-02 23:45:30)
    > > > > Perl is an option: I implemented the privacy-breach test in
    > > > > Perl. The problem is that I would prefer to drop a
    > > > > debian/package.privacy.xslt file in the package instead of
    > > > > asking the maintainer to code the removal of privacy problems...
    > > > >
    > > > > Generic ones could be coded in Perl, but for the
    > > > > package-specific side I need something like XSLT 2.
    > > >
    > > > If you are asking how to sloppily parse HTML5 files from upstream
    > > > source and XSLT2 files provided by package maintainers, then with
    > > > Perl you could use HTML::HTML5::Parser for the first and
    > > > XML::Saxon::XSLT2 for the second.
    > >
    > > Unfortunately, HTML::HTML5::Parser has been RC-buggy for 4 years due
    > > to a bug in handling UTF-8 (#750946):
    > > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=750946
    >
    > Ouch!
    >
    > I keep forgetting which packages are affected by that annoying bug :-/
    >
    > > Your suggestion will work fine, but we need a solution for this
    > > UTF-8 problem...
    >
    > I have recently grown somewhat more familiar with UTF-8 and Perl (in
    > my work towards fixing bug#867305 in licensecheck), and will try to
    > take a fresh look at bug#750946...

    HTML::HTML5::Parser should now be in better shape.

    Please try version 0.992 now in unstable, if still relevant for your
    work.


    - Jonas
