Dear Python-list,
Yes, I know that Python 2.x is no longer supported.
I have found that the documentation for this method is misleading when the file being read is UTF-8-encoded:
Instead of reading *size* bytes, the method reads *size* UTF-8 byte *sequences*.
Has this error been corrected in the Python 3.x documentation?
On Tue, 10 Jan 2023 at 01:36, Stephen Tucker <stephen_tucker@sil.org> wrote:
> Dear Python-list,
> Yes, I know that Python 2.x is no longer supported.
> I have found that the documentation for this method is misleading when the file being read is UTF-8-encoded:
> Instead of reading *size* bytes, the method reads *size* UTF-8 byte *sequences*.
> Has this error been corrected in the Python 3.x documentation?

What documentation is this? The builtin 'file' type doesn't know
anything about encodings, and only ever returns bytes.
ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Chris -
In the Python 2.7.10 documentation, I am referring to section 5. Built-in Types, subsection 5.9 File Objects.
In that subsection, I have the following paragraph:
file.read([size])
Read at most size bytes from the file (less if the read hits EOF before obtaining size bytes). If the size argument is negative or omitted, read all data until EOF is reached. The bytes are returned as a string object. An empty string is returned when EOF is encountered immediately. (For certain files, like ttys, it makes sense to continue reading after an EOF is hit.) Note that this method may call the underlying C function fread() more than once in an effort to acquire as close to size bytes as possible. Also note that when in non-blocking mode, less data than was requested may be returned, even if no size parameter was given.

1. Create BOM.txt
>>> myfil = open ("BOM.txt", "wb")
>>> myfil.write ("\xef" + "\xbb" + "\xbf")
>>> myfil.close()

2. Input three bytes at once from BOM.txt and print them
>>> myfil = open ("BOM.txt", "rb")
>>> myBOM = myfil.read (3)
>>> myBOM
'\xef\xbb\xbf'
>>> myfil.close()

3. Input three bytes one at a time from BOM.txt and print them
>>> myfil = open ("BOM.txt", "rb")
>>> myBOM_1 = myfil.read (1)
>>> myBOM_2 = myfil.read (1)
>>> myBOM_3 = myfil.read (1)
>>> myBOM_1
'\xef'
>>> myBOM_2
'\xbb'
>>> myBOM_3
'\xbf'
>>> myfil.close()

4. Input three bytes at once from BOM.txt and print them
>>> import codecs
>>> myfil = codecs.open ("BOM.txt", mode="rb", encoding="UTF-8")
>>> myBOM = unicode (myfil.read (3))
>>> myBOM
u'\ufeff'
>>> myfil.close ()

5. Attempt to input three bytes one at a time from BOM.txt and print them
>>> myfil = codecs.open ("BOM.txt", mode="rb", encoding="UTF-8")
>>> myBOM_4 = myfil.read (1)
>>> myBOM_5 = myfil.read (1)
>>> myBOM_6 = myfil.read (1)
>>> myBOM_4
u'\ufeff'
>>> myBOM_5
u''
>>> myBOM_6
u''
>>> myfil.close()

A. The attempt at Part 5 actually inputs all three bytes when we ask it to input just the first one!
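
For readers on Python 3, the five steps above can be approximated with the sketch below. It is not part of the original message. It assumes the built-in open() with an explicit encoding as a stand-in for codecs.open, and it writes BOM.txt into a temporary directory (a hypothetical location chosen only so the example does not touch the current directory).

```python
import os
import tempfile

# Hypothetical scratch location; the thread uses "BOM.txt" in the current directory.
path = os.path.join(tempfile.mkdtemp(), "BOM.txt")

# Step 1: create the file holding the three bytes of a UTF-8 encoded U+FEFF.
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbf")

# Steps 2 and 3: binary mode, so the size argument counts bytes.
with open(path, "rb") as f:
    all_bytes = f.read(3)
with open(path, "rb") as f:
    single_bytes = [f.read(1), f.read(1), f.read(1)]

# Steps 4 and 5: text mode with an encoding, so the size argument counts characters.
with open(path, "r", encoding="utf-8") as f:
    all_text = f.read(3)
with open(path, "r", encoding="utf-8") as f:
    single_chars = [f.read(1), f.read(1), f.read(1)]

print(repr(all_bytes))     # the three raw bytes
print(single_bytes)        # one byte per read(1) call
print(repr(all_text))      # a single decoded character
print(single_chars)        # first read(1) yields the character, the rest hit EOF
```

The binary reads return bytes objects of the requested length, while the text reads return one decoded character followed by empty strings, which mirrors the Python 2 session above.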
On Wed, 11 Jan 2023 at 21:31, Stephen Tucker <stephen_tucker@sil.org> wrote:
> Chris -
> In the Python 2.7.10 documentation, I am referring to section 5. Built-in Types, subsection 5.9 File Objects.
> In that subsection, I have the following paragraph:
> file.read([size])
> Read at most size bytes from the file (less if the read hits EOF before obtaining size bytes). If the size argument is negative or omitted, read all data until EOF is reached. The bytes are returned as a string object. An empty string is returned when EOF is encountered immediately. (For certain files, like ttys, it makes sense to continue reading after an EOF is hit.) Note that this method may call the underlying C function fread() more than once in an effort to acquire as close to size bytes as possible. Also note that when in non-blocking mode, less data than was requested may be returned, even if no size parameter was given.

Yes, so it should be that number of bytes, which is what it does, isn't it?
ChrisA
On Thu, 12 Jan 2023 at 04:31, Stephen Tucker <stephen_tucker@sil.org> wrote:
> 1. Create BOM.txt
> 2. Input three bytes at once from BOM.txt and print them
> 3. Input three bytes one at a time from BOM.txt and print them

All of these correctly show that a file, in binary mode, reads and writes bytes.

> 4. Input three bytes at once from BOM.txt and print them
> >>> import codecs
> >>> myfil = codecs.open ("BOM.txt", mode="rb", encoding="UTF-8")

This is now a codecs file, NOT a vanilla file object. See its docs here:
https://docs.python.org/2.7/library/codecs.html#codecs.open
The output is "codec-dependent" but I would assume that UTF-8 will yield Unicode text strings.

> 5. Attempt to input three bytes one at a time from BOM.txt and print them
> >>> myfil = codecs.open ("BOM.txt", mode="rb", encoding="UTF-8")
> >>> myBOM_4 = myfil.read (1)
> >>> myBOM_4
> u'\ufeff'
> A. The attempt at Part 5 actually inputs all three bytes when we ask it to input just the first one!

On the contrary; you asked it for one *character* and it read one character.
Where were you seeing documentation that disagreed with this?
ChrisA
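
The point about characters versus bytes can be illustrated with a short Python 3 sketch, added here for illustration and not taken from the thread. It assumes in-memory streams (io.BytesIO and io.TextIOWrapper) in place of a real file, and the sample string is hypothetical. Each read(1) on the text stream returns exactly one character even though the characters occupy one, two, and three bytes respectively.

```python
import io

# Hypothetical sample text: 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes) in UTF-8.
text = "a\u00e9\u20ac"
raw = text.encode("utf-8")

# io.BytesIO stands in for a file opened in binary mode.
binary_stream = io.BytesIO(raw)
# io.TextIOWrapper decodes the same bytes as UTF-8, like a file opened in text mode.
text_stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")

# Binary read: size=1 means one byte.
first_byte = binary_stream.read(1)

# Text reads: size=1 means one character, however many bytes it took to encode.
chars = [text_stream.read(1) for _ in range(3)]
bytes_per_char = [len(c.encode("utf-8")) for c in chars]

print(first_byte)
print(chars)
print(bytes_per_char)
```

The number of bytes consumed by a text-mode read(1) therefore varies with the character being decoded, which is why the decoding reader in Part 5 had to consume all three bytes of the BOM to return one character.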
Chris,
Thanks for this clarification.
I have not found documentation that disagrees with you. I simply observe that the documentation I alluded to earlier in this chain (section 5.9 File Objects) could have been made clearer by a note saying that the behaviour of a file's read method, in particular the unit of information it reads ("byte", "UTF-8 encoded character", or whatever), depends on the way in which the file has been opened.
Thank you, Chris (and others) for your attention to my request. I consider this enquiry closed.
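
On the opening question about Python 3: there the distinction is carried by the type of the stream object. The io module documents read() on text streams as returning at most size characters, and read() on binary streams as returning up to size bytes. The sketch below is an illustration only; read_unit is a hypothetical helper name, and the temporary BOM.txt path is an assumption made so the example is self-contained.

```python
import io
import os
import tempfile


def read_unit(stream):
    """Return the unit that stream.read(size) counts in (hypothetical helper)."""
    if isinstance(stream, io.TextIOBase):
        return "characters"
    return "bytes"


# Hypothetical scratch file containing the UTF-8 encoded BOM from the thread.
path = os.path.join(tempfile.mkdtemp(), "BOM.txt")
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbf")

# The same file, opened two ways, gives streams of different kinds.
with open(path, "rb") as binary_file:
    binary_unit = read_unit(binary_file)
    binary_chunk = binary_file.read(1)

with open(path, "r", encoding="utf-8") as text_file:
    text_unit = read_unit(text_file)
    text_chunk = text_file.read(1)

print(binary_unit, type(binary_chunk).__name__, len(binary_chunk))
print(text_unit, type(text_chunk).__name__, len(text_chunk))
```

Checking the stream type this way is one option among several; inspecting the mode string passed to open() would work too, but the isinstance test also covers wrapped or in-memory streams.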