• Started porting UDD to Python3 (Was: [UDD] Is there some effort to

    From Mattia Rizzolo@21:1/5 to All on Thu May 14 21:30:02 2020
    And, ideally, somebody would contact whoever is providing that file so that they re-encode it with utf8...

    On Thu, 14 May 2020, 9:16 pm Stéphane Blondon, <stephane.blondon@gmail.com> wrote:

    > On 14/05/2020 11:43, Andreas Tille wrote:
    > > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc5 in position 11:
    > > invalid continuation byte
    >
    > The error is like [1] where the file is not encoded utf-8.
    >
    > 1: https://stackoverflow.com/questions/5552555/unicodedecodeerror-invalid-continuation-byte
    >
    > > + f = open(trfile, encoding='utf-8')
    >
    > `f = open(trfile, encoding='latin-1')`
    >
    > could be a (temporary?) solution.
    >
    > Regards,
    > Stéphane
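
To illustrate why switching to latin-1 silences the error: UTF-8 requires the lead byte 0xc5 to be followed by a continuation byte (0x80 to 0xbf), while latin-1 maps every byte value to a character and therefore never fails. A minimal sketch follows; the sample bytes are made up for illustration and are not taken from the actual file.

```python
# Hypothetical sample: 0xc5 at offset 11, followed by an ASCII letter,
# which is not a valid UTF-8 continuation byte.
sample = b"Description\xc5abc"

try:
    sample.decode("utf-8")
    utf8_ok = True
    utf8_error = None
except UnicodeDecodeError as err:
    utf8_ok = False
    utf8_error = err

# latin-1 assigns one code point to each of the 256 byte values, so this
# cannot raise. It may still produce the wrong characters if the data was
# written in some other encoding.
as_latin1 = sample.decode("latin-1")

print(utf8_ok)
print(utf8_error)
print(repr(as_latin1))
```

So latin-1 makes the exception go away, but it does not tell you whether the decoded text is what the author of the file intended.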




    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Andreas Tille@21:1/5 to All on Fri May 15 21:20:01 2020
    On Fri, May 15, 2020 at 08:51:05PM +0200, Stéphane Blondon wrote:
    > > And, ideally, somebody would contact whoever is providing that file so that they re-encode it with utf8...
    >
    > Yes, it's the best long term solution.

    Definitely. But who is providing that file?

    > > > `f = open(trfile, encoding='latin-1')`
    > > >
    > > > could be a (temporary?) solution.
    >
    > Andreas, it's possible that changing the encoding will fix the bug for
    > some files but you will get new errors on other files (encoded in
    > utf-8). Trying several encodings or using the 'chardet' library could be a
    > better workaround.

    Would you mind providing a patch with chardet?
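
The "trying several encodings" idea mentioned above can be sketched with the standard library alone. This is only an illustration: `read_text_with_fallback` is a hypothetical helper, not part of UDD, and the list of candidate encodings is an assumption.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes cleanly.

    Hypothetical helper. Order matters: latin-1 never fails, so it has to
    come last or it would mask every other candidate.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of %r could decode %s" % (encodings, path))


# Usage with two throwaway files, one per encoding.
with tempfile.TemporaryDirectory() as tmp:
    utf8_path = os.path.join(tmp, "utf8.txt")
    latin1_path = os.path.join(tmp, "latin1.txt")
    with open(utf8_path, "wb") as f:
        f.write("Stéphane".encode("utf-8"))
    with open(latin1_path, "wb") as f:
        f.write("Stéphane".encode("latin-1"))
    utf8_result = read_text_with_fallback(utf8_path)
    latin1_result = read_text_with_fallback(latin1_path)

print(utf8_result)
print(latin1_result)
```

Unlike a fixed `encoding='latin-1'`, this keeps UTF-8 files decoding as UTF-8 and only falls back when strict UTF-8 decoding fails.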

    Kind regards

    Andreas.

    --
    http://fam-tille.de

  • From Andreas Tille@21:1/5 to All on Mon May 18 15:20:01 2020
    Hi Stéphane,

    thanks for your patch which I applied in the python3 branch. Unfortunately
    it does not solve the issue:


    udd(python3) $ ./update-and-run.sh ddtp
    Traceback (most recent call last):
      File "/srv/udd.debian.org/udd//udd.py", line 88, in <module>
        exec("gatherer.%s()" % command)
      File "<string>", line 1, in <module>
      File "/srv/udd.debian.org/udd/udd/ddtp_gatherer.py", line 127, in run
        h.update(f.read())
      File "/usr/lib/python3.8/codecs.py", line 322, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc5 in position 11: invalid continuation byte


    Thanks a lot anyway

    Andreas.

    On Mon, May 18, 2020 at 01:15:11PM +0200, Stéphane Blondon wrote:
    Hello,

    On 15/05/2020 21:10, Andreas Tille wrote:
    > Would you mind providing a patch with chardet?

    There is a patch attached to this e-mail.

    I used [1] for the base file. I don't think the patch is great (because
    there are two 'open()' calls) but it makes minimal modifications to the
    current source code. I think that is the better approach for a successful
    migration to Python 3, because it avoids introducing bugs during the
    migration.


    Feel free to ask for more explanations or other stuff if you need.

    1: https://salsa.debian.org/qa/udd/-/blob/master/udd/ddtp_gatherer.py

    --
    Stéphane

    --- ddtp_gatherer.py.orig 2020-05-17 22:54:21.793075000 +0200
    +++ ddtp_gatherer.py 2020-05-18 13:02:47.210764004 +0200
    @@ -25,6 +25,8 @@
    import logging
    import logging.handlers

    +import chardet
    +
    debug=0

    def get_gatherer(connection, config, source):
    @@ -117,7 +119,7 @@
    trfile = trfilepath + file
    # check whether hash recorded in index file fits real file
    try:
    - f = open(trfile)
    + f = _open_file(trfile)
    except IOError, err:
    self.log.error("%s: %s.", str(err), trfile)
    continue
    @@ -236,6 +238,13 @@
    except IOError, err:
    self.log.exception("Error reading %s%s", dir, filename)

    +def _open_file(path):
    +    with open(path, 'rb') as f:
    +        raw_content = f.read()
    +    encoding = chardet.detect(raw_content)["encoding"]
    +    return open(path, encoding=encoding)
    +
    +
    if __name__ == '__main__':
    main()
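
The concern about the two 'open()' calls could be addressed by reading the file once and wrapping the decoded text in an in-memory stream. The following is a sketch under assumptions, not the patch that was applied: `open_text_guessing_encoding` and its `default` parameter are hypothetical, and chardet is treated as optional so the example also runs where it is not installed.

```python
import io
import os
import tempfile

try:
    import chardet  # third-party; optional in this sketch
except ImportError:
    chardet = None


def open_text_guessing_encoding(path, default="utf-8"):
    """Read the file once as bytes and return a text stream.

    Hypothetical helper. chardet.detect() can report None as the encoding
    when it cannot decide, so a default is used in that case.
    """
    with open(path, "rb") as f:
        raw = f.read()
    guess = chardet.detect(raw)["encoding"] if chardet else None
    encoding = guess or default
    # StringIO offers read(), readline() and iteration like a file object.
    return io.StringIO(raw.decode(encoding))


# Usage with a throwaway ASCII file.
content = "Package: example\nDescription-en: an example\n"
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "wb") as f:
        f.write(content.encode("ascii"))
    stream = open_text_guessing_encoding(sample_path)
    read_back = stream.read()

print(read_back == content)
```

Handling the `None` case explicitly matters because `open(path, encoding=None)` falls back to the locale default, which would bring back the original UTF-8 behaviour.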






    --
    http://fam-tille.de

  • From Lucas Nussbaum@21:1/5 to Andreas Tille on Mon May 18 22:10:01 2020
    Hi,

    Do all those people in Cc need to read this? If you really want to keep
    this public, maybe debian-qa@ is enough? (I personally don't feel I
    need to read this at this time; if I had time to spend on UDD, I would
    fix actual bugs)

    Thanks

    Lucas


    On 18/05/20 at 21:57 +0200, Andreas Tille wrote:
    On Mon, May 18, 2020 at 08:35:33PM +0200, Stéphane Blondon wrote:

    Can you send me the file 'gatherer.${I_dont_know_the_command}' which
    raises the UnicodeDecodeError exception? I will try to write a working patch.

    I simply added a debug line:

    udd(python3) $ git diff
    diff --git a/udd/ddtp_gatherer.py b/udd/ddtp_gatherer.py
    index bbf041b..d32b85f 100644
    --- a/udd/ddtp_gatherer.py
    +++ b/udd/ddtp_gatherer.py
    @@ -239,6 +239,7 @@ class ddtp_gatherer(gatherer):
    self.log.exception("Error reading %s%s", dir, filename)

     def _open_file(path):
    +    print(path)
         with open(path, 'rb') as f:
             raw_content = f.read()
         encoding = chardet.detect(raw_content)["encoding"]


    which leads to


    udd(python3) $ ./update-and-run.sh ddtp
    /srv/mirrors/debian/dists/squeeze-proposed-updates/main/i18n/Translation-en.bz2
    /srv/mirrors/debian/dists/squeeze-proposed-updates/non-free/i18n/Translation-en.bz2
    /srv/mirrors/debian/dists/squeeze-proposed-updates/contrib/i18n/Translation-en.bz2
    /srv/mirrors/debian/dists/stretch-proposed-updates/main/i18n/Translation-en.bz2
    Traceback (most recent call last):
      File "/srv/udd.debian.org/udd//udd.py", line 88, in <module>
        exec("gatherer.%s()" % command)
      File "<string>", line 1, in <module>
      File "/srv/udd.debian.org/udd/udd/ddtp_gatherer.py", line 127, in run
        h.update(f.read())
      File "/usr/lib/python3.8/codecs.py", line 322, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc5 in position 11: invalid continuation byte


    While you can download the files from any Debian mirror I've attached
    /srv/mirrors/debian/dists/stretch-proposed-updates/main/i18n/Translation-en.bz2
    to this mail. My guess is that translations from stretch will not be
    touched any more and thus we need to cope somehow with the existing
    encoding.
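
One observation that may help here: the paths printed by the debug line end in .bz2, and a bzip2 stream starts with ten ASCII bytes (`BZh`, a level digit, and the block marker 0x314159265359). The first byte that can break a UTF-8 decode therefore sits at offset 10 or later, which is consistent with the error at position 11. This suggests, but does not prove, that `f.read()` is decoding compressed bytes as text. Assuming that is the case, a sketch of hashing the file in binary mode and decoding only after decompression could look like this. `digest_and_text` is a hypothetical helper, and SHA-256 over the compressed file is an assumption, since the thread does not say which hash the index file records.

```python
import bz2
import hashlib
import os
import tempfile


def digest_and_text(path, encoding="utf-8"):
    """Hash the file as stored on disk, then decompress and decode it.

    Hypothetical helper; the hash algorithm and the choice to hash the
    compressed bytes are assumptions made for this sketch.
    """
    with open(path, "rb") as f:
        raw = f.read()
    digest = hashlib.sha256(raw).hexdigest()
    text = bz2.decompress(raw).decode(encoding)
    return digest, text


# Usage with a throwaway compressed file.
payload = "Package: example\nDescription-en: an example\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Translation-en.bz2")
    with open(path, "wb") as f:
        f.write(bz2.compress(payload.encode("utf-8")))
    with open(path, "rb") as f:
        header = f.read(10)
    digest, text = digest_and_text(path)

print(header)
print(text == payload)
```

Reading in binary mode sidesteps the encoding question for the hash check entirely, because `h.update()` takes bytes in Python 3 anyway.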

    Thanks a lot for your help

    Andreas.

    --
    http://fam-tille.de
