On 14/05/2020 11:43, Andreas Tille wrote:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc5 in position 11: invalid continuation byte
The error looks like the one in [1], where the file is not encoded in UTF-8.
1: https://stackoverflow.com/questions/5552555/unicodedecodeerror-invalid-continuation-byte
+ f = open(trfile, encoding='utf-8')
`f = open(trfile, encoding='latin-1')`
could be a (temporary?) solution.
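
To illustrate why switching the codec makes the exception go away, here is a minimal sketch. The sample bytes are made up for the illustration; the actual content of the failing file is not shown in this thread.

```python
# Hypothetical sample: 0xc5 announces a 2-byte UTF-8 sequence, but the
# next byte (0x20, a space) is not a valid continuation byte.
sample = b"caf\xc5 "

try:
    sample.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    reason = err.reason  # the part of the message after the position

# latin-1 maps every byte value to exactly one character, so this
# decode cannot raise, whatever the input is.
latin1_text = sample.decode("latin-1")
```

Because latin-1 accepts any byte sequence, a file that really is UTF-8 would be read without an exception but with wrong characters.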
Regards,
Stéphane
And, ideally, somebody would contact whoever is providing that file so that they re-encode it with utf8...
Yes, it's the best long term solution.
`f = open(trfile, encoding='latin-1')`
could be a (temporary?) solution.
Andreas, it's possible that changing the encoding will fix the bug for
some files, but you will then get new errors on other files (those
encoded in UTF-8). Trying several encodings or using the 'chardet'
library could be a better workaround.
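
A minimal sketch of the "try several encodings" idea, using only the standard library. The function name `decode_with_fallback` and the default candidate list are assumptions made for this example; they are not part of UDD.

```python
def decode_with_fallback(raw, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes `raw`.

    The candidates are tried in order, so the strictest codec should come
    first. latin-1 accepts any input and only makes sense as the last entry.
    """
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")
```

For example, `decode_with_fallback(b"St\xe9phane")` falls through to latin-1, while the UTF-8 encoded form of the same name is decoded as UTF-8.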
Hello,
On 15/05/2020 21:10, Andreas Tille wrote:
Would you mind providing a patch with chardet?
There is a patch attached to this e-mail.
I used [1] as the base file. I don't think the patch is great (because
there are two 'open()' calls), but it makes minimal modifications to the
current source code. I think that is the better choice for a successful
migration to Python 3, because it avoids introducing bugs during the
migration.
Feel free to ask for more explanations or anything else you need.
1: https://salsa.debian.org/qa/udd/-/blob/master/udd/ddtp_gatherer.py
--
Stéphane
--- ddtp_gatherer.py.orig 2020-05-17 22:54:21.793075000 +0200
+++ ddtp_gatherer.py 2020-05-18 13:02:47.210764004 +0200
@@ -25,6 +25,8 @@
import logging
import logging.handlers
+import chardet
+
debug=0
def get_gatherer(connection, config, source):
@@ -117,7 +119,7 @@
trfile = trfilepath + file
# check whether hash recorded in index file fits real file
try:
- f = open(trfile)
+ f = _open_file(trfile)
except IOError, err:
self.log.error("%s: %s.", str(err), trfile)
continue
@@ -236,6 +238,13 @@
except IOError, err:
self.log.exception("Error reading %s%s", dir, filename)
+def _open_file(path):
+    with open(path, 'rb') as f:
+        raw_content = f.read()
+    encoding = chardet.detect(raw_content)["encoding"]
+    return open(path, encoding=encoding)
+
+
if __name__ == '__main__':
main()
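
As a variation on the patch above, the two 'open()' calls can be avoided by decoding the bytes that were already read. This is only a sketch, not the code that went into UDD: the names `decode_detected` and `read_text` and the latin-1 default are assumptions, and unlike `_open_file` it returns text rather than a file object. It also guards against `chardet.detect()` reporting `None` when it cannot make a guess.

```python
try:
    import chardet  # third-party; treated as optional in this sketch
except ImportError:
    chardet = None


def decode_detected(raw, default="latin-1"):
    """Decode bytes with chardet's guess, falling back to `default`."""
    guess = chardet.detect(raw)["encoding"] if chardet is not None else None
    try:
        # `guess` is None when chardet is missing or could not decide.
        return raw.decode(guess or default)
    except (UnicodeDecodeError, LookupError):
        # The guess was wrong, or Python has no codec with that name.
        return raw.decode(default)


def read_text(path, default="latin-1"):
    """Read `path` once in binary mode and return its decoded content."""
    with open(path, "rb") as f:
        return decode_detected(f.read(), default)
```

The fallback codec has to be one that cannot fail (latin-1 here), otherwise the last `decode()` could still raise.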
On Mon, May 18, 2020 at 08:35:33PM +0200, Stéphane Blondon wrote:
Can you send me the file 'gatherer.${I_dont_know_the_command}' which
raises the UnicodeDecodeError exception? I will try to write a working patch.
I simply added a debug line:
udd(python3) $ git diff
diff --git a/udd/ddtp_gatherer.py b/udd/ddtp_gatherer.py
index bbf041b..d32b85f 100644
--- a/udd/ddtp_gatherer.py
+++ b/udd/ddtp_gatherer.py
@@ -239,6 +239,7 @@ class ddtp_gatherer(gatherer):
self.log.exception("Error reading %s%s", dir, filename)
 def _open_file(path):
+    print(path)
     with open(path, 'rb') as f:
         raw_content = f.read()
     encoding = chardet.detect(raw_content)["encoding"]
which leads to
udd(python3) $ ./update-and-run.sh ddtp
/srv/mirrors/debian/dists/squeeze-proposed-updates/main/i18n/Translation-en.bz2
/srv/mirrors/debian/dists/squeeze-proposed-updates/non-free/i18n/Translation-en.bz2
/srv/mirrors/debian/dists/squeeze-proposed-updates/contrib/i18n/Translation-en.bz2
/srv/mirrors/debian/dists/stretch-proposed-updates/main/i18n/Translation-en.bz2
Traceback (most recent call last):
  File "/srv/udd.debian.org/udd//udd.py", line 88, in <module>
    exec("gatherer.%s()" % command)
  File "<string>", line 1, in <module>
  File "/srv/udd.debian.org/udd/udd/ddtp_gatherer.py", line 127, in run
    h.update(f.read())
  File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc5 in position 11: invalid continuation byte
While you can download the files from any Debian mirror, I've attached
/srv/mirrors/debian/dists/stretch-proposed-updates/main/i18n/Translation-en.bz2
to this mail. My guess is that translations from stretch will not be
touched any more and thus we need to cope somehow with the existing
encoding.
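
One observation on the traceback: the failing call is `h.update(f.read())`, and the file name ends in `.bz2`, which suggests compressed data rather than text. If `h` is a hashlib object (an assumption, as is the choice of SHA-256 below), the hash can be computed from the raw bytes, so no text decoding happens at that step. A hedged sketch, with `file_digest` as a hypothetical helper name:

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, read in binary mode.

    hashlib objects only accept bytes in Python 3, so opening the file
    with 'rb' avoids both the decode step and its possible errors.
    """
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Whether this fits depends on how the rest of `run()` uses `f` afterwards, which is not visible in this thread.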
Thanks a lot for your help
Andreas.
--
http://fam-tille.de