I get data from various sources; client emails, spreadsheets, and
data from web applications. I find that I can do some_string.decode('latin1')
to get unicode that I can use with xlsxwriter,
or put <meta charset="latin1"> in the header of a web page to display European characters correctly. But normally UTF-8 is recommended as
the encoding to use today. latin1 works correctly more often when I
am using data from the wild. It's frustrating that I have to play
a guessing game to figure out how to use incoming text. I'm just wondering if there are any thoughts. What if we just globally decided to use utf-8? Could that ever happen?
On 8/17/22 08:33, Stefan Ram wrote:
> Tobiah <toby@tobiah.org> writes:
>> I get data from various sources; client emails, spreadsheets, and
>> data from web applications. I find that I can do some_string.decode('latin1')
> Strings have no "decode" method. ("bytes" objects do.)
I'm using 2.7. Maybe that's why.
Tobiah <toby@tobiah.org> writes:
> I get data from various sources; client emails, spreadsheets, and
> data from web applications. I find that I can do some_string.decode('latin1')
Strings have no "decode" method. ("bytes" objects do.)
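In Python 3 the distinction is explicit: decoding is a bytes-to-str operation, so only `bytes` has `.decode()` and only `str` has `.encode()`. A minimal sketch:

```python
# In Python 3, decoding turns bytes into str; str has no .decode().
raw = "Montréal".encode("latin1")    # bytes: b'Montr\xe9al'
text = raw.decode("latin1")          # back to str

assert text == "Montréal"
assert not hasattr(text, "decode")   # str objects cannot be decoded
assert hasattr(raw, "decode")        # bytes objects can
```

Under Python 2, `str` was a byte string and did have `.decode()`, which is why the original code worked there.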
> to get unicode that I can use with xlsxwriter,
> or put <meta charset="latin1"> in the header of a web page to display
> European characters correctly.
|You should always use the UTF-8 character encoding. (Remember
|that this means you also need to save your content as UTF-8.)
World Wide Web Consortium (W3C) (2014)
> am using data from the wild. It's frustrating that I have to play
> a guessing game to figure out how to use incoming text. I'm just wondering
You can let Python guess the encoding of a file.

import pathlib

def encoding_of( name ):
    path = pathlib.Path( name )
    for encoding in ( "utf_8", "cp1252", "latin_1" ):
        try:
            with path.open( encoding=encoding, errors="strict" ) as file:
                text = file.read()
            return encoding
        except UnicodeDecodeError:
            pass
    return None
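One caveat worth noting: `latin_1` assigns a character to every one of the 256 byte values, so the final fallback in `encoding_of` can never raise `UnicodeDecodeError` — the function returns `"latin_1"` for any readable byte stream rather than ever reaching `None`. A quick check:

```python
# latin_1 maps all 256 byte values, so any byte sequence "decodes"
# under it -- it is a catch-all, not a real guess.
all_bytes = bytes(range(256))
assert len(all_bytes.decode("latin_1")) == 256

# cp1252, by contrast, leaves a few bytes undefined (e.g. 0x81),
# so it can genuinely fail and acts as a meaningful probe.
try:
    b"\x81".decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False
assert cp1252_ok is False
```

This is why the order of candidates matters: `latin_1` must come last, and any text that is really cp1252 but survives the cp1252 probe will be reported as cp1252 even if some other legacy encoding was intended.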
> if there are any thoughts. What if we just globally decided to use utf-8?
> Could that ever happen?
That decision has been made long ago.
On 17 Aug 2022, at 18:30, Jon Ribbens via Python-list <python-list@python.org> wrote:
On 2022-08-17, Tobiah <toby@tobiah.org> wrote:
> I get data from various sources; client emails, spreadsheets, and
> data from web applications. I find that I can do some_string.decode('latin1')
> to get unicode that I can use with xlsxwriter,
> or put <meta charset="latin1"> in the header of a web page to display
> European characters correctly. But normally UTF-8 is recommended as
> the encoding to use today. latin1 works correctly more often when I
> am using data from the wild. It's frustrating that I have to play
> a guessing game to figure out how to use incoming text. I'm just wondering
> if there are any thoughts. What if we just globally decided to use utf-8?
> Could that ever happen?
That has already been decided, as much as it ever can be. UTF-8 is essentially always the correct encoding to use on output, and almost
always the correct encoding to assume on input absent any explicit
indication of another encoding. (e.g. the HTML "standard" says that
all HTML files must be UTF-8.)
If you are finding that your specific sources are often encoded with
latin-1 instead then you could always try something like:
try:
    text = data.decode('utf-8')
except UnicodeDecodeError:
    text = data.decode('latin-1')
(I think latin-1 text will almost always fail to be decoded as utf-8,
so this would work fairly reliably assuming those are the only two
encodings you see.)
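The claim is easy to check: every non-ASCII latin-1 byte is, to a UTF-8 decoder, either an invalid lead byte or the start of a multi-byte sequence that the following bytes fail to continue, so typical latin-1 text with accented characters fails strict UTF-8 decoding. A quick sketch:

```python
# "é" in latin-1 is the lone byte 0xE9, which UTF-8 reads as the
# lead byte of a 3-byte sequence; the following bytes don't
# continue it, so strict decoding fails.
data = "Montréal, Quebéc".encode("latin-1")

try:
    text = data.decode("utf-8")
    used = "utf-8"
except UnicodeDecodeError:
    text = data.decode("latin-1")
    used = "latin-1"

assert used == "latin-1"
assert text == "Montréal, Quebéc"

# Pure-ASCII latin-1 text, however, decodes fine as UTF-8,
# which is harmless since the two encodings agree on ASCII.
assert b"Montreal".decode("utf-8") == "Montreal"
```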
Or you could use something fancy like https://pypi.org/project/chardet/
--
https://mail.python.org/mailman/listinfo/python-list
> That has already been decided, as much as it ever can be. UTF-8 is
> essentially always the correct encoding to use on output, and almost
> always the correct encoding to assume on input absent any explicit
> indication of another encoding. (e.g. the HTML "standard" says that
> all HTML files must be UTF-8.)
I got an email from a client with blast text that
was in French with stuff like: Montréal, Quebéc.
latin1 did the trick.
Also, whenever I get a spreadsheet from a client and save as .csv,
or take browser data through PHP, it always seems to work with latin1,
but not UTF-8.
On 17 Aug 2022, at 18:30, Jon Ribbens via Python-list <python-list@python.org> wrote:
> (I think latin-1 text will almost always fail to be decoded as utf-8,
> so this would work fairly reliably assuming those are the only two
> encodings you see.)
Only if a reserved byte is used in the string.
It will often work in either.
For web pages it cannot be assumed that markup saying it's utf-8 is
correct. Many pages are in fact cp1252. Usually you find out because
of a smart quote, such as 0x92 in cp1252, which is an illegal byte
sequence in utf-8.
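The smart-quote case is a neat demonstration: 0x92 is cp1252's right single quotation mark, but on its own it is a stray UTF-8 continuation byte, so a page mislabelled as UTF-8 blows up exactly there. A sketch:

```python
# 0x92 is the cp1252 "right single quotation mark" -- a common
# smart-quote byte in pages that claim to be UTF-8 but aren't.
smart = b"it\x92s"

assert smart.decode("cp1252") == "it\u2019s"   # fine as cp1252

# As UTF-8, a lone 0x92 is a continuation byte with no lead byte,
# so strict decoding raises UnicodeDecodeError.
try:
    smart.decode("utf-8")
    decoded = True
except UnicodeDecodeError:
    decoded = False
assert decoded is False
```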
Generally speaking browser submissions were/are supposed to be sent
using the same encoding as the page, so if you're sending the page
as "latin1" then you'll see that a fair amount I should think. If you
send it as "utf-8" then you'll get 100% utf-8 back.
The only trick I know is to use <meta charset="utf-8">.
When you have your own web server or access to the settings
The only trick I know is to use <meta charset="utf-8">. Would
that 'send' the post as utf-8? I always expected it had more
to do with the way the user entered the characters. How do
they, by the way, enter things like Montréal, Quebéc? When they
enter that into a text box on a web page, can we say it's in
a particular encoding at that time? At submit time?
You configure the web server to send:
Content-Type: text/html; charset=...
in the HTTP header when it serves HTML files.
> You configure the web server to send:
> Content-Type: text/html; charset=...
> in the HTTP header when it serves HTML files.
So how does this break down? When a person enters
Montréal, Quebéc into a form field, what are they
doing on the keyboard to make that happen?
As the string sits there in the text box, is it latin1, or utf-8
or something else?
How does the browser know what sort of data it has in that text box?
ram@zedat.fu-berlin.de (Stefan Ram) writes:
> You can let Python guess the encoding of a file.
> def encoding_of( name ):
>     path = pathlib.Path( name )
>     for encoding in ( "utf_8", "cp1252", "latin_1" ):
>         try:
>             with path.open( encoding=encoding, errors="strict" ) as file:
I also read a book which claimed that the tkinter.Text
widget would accept bytes and guess whether these are
encoded in UTF-8 or "ISO 8859-1" and decode them
accordingly. However, today I found that here it does
accept bytes but it always guesses "ISO 8859-1".
main.py
import tkinter
text = tkinter.Text()
text.insert( tkinter.END, "AÄäÖöÜüß".encode( encoding='ISO 8859-1' ))
text.insert( tkinter.END, "AÄäÖöÜüß".encode( encoding='UTF-8' ))
text.pack()
print( text.get( "1.0", "end" ))
output
AÄäÖöÜüßAÄäÖöÜüß
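Rather than relying on the widget's guess, one can decode explicitly before inserting, using the UTF-8-first fallback discussed earlier. A sketch (the helper name `decode_guess` is illustrative; the tkinter part is shown as a comment since Tk needs a display):

```python
def decode_guess(raw: bytes) -> str:
    # Try strict UTF-8 first; fall back to ISO 8859-1, which
    # accepts any byte sequence and so never fails.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")

# Decode explicitly, then hand tkinter a str -- no guessing involved:
#     text = tkinter.Text()
#     text.insert(tkinter.END, decode_guess(some_bytes))
```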
On 25 Oct 2022, at 11:16, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
> However, today I found that here it does
> accept bytes but it always guesses "ISO 8859-1".
The best you can do is assume that if the text cannot be decoded as utf-8 it may be 8859-1.