• Re: file.read Method Documentation (Python 2.7.10)

    From Stefan Ram@21:1/5 to Stephen Tucker on Mon Jan 9 14:52:28 2023
    Stephen Tucker <stephen_tucker@sil.org> writes:
    Instead of reading *size* bytes, the method reads *size* UTF-8 byte
    *sequences*.

    A file object is any object exposing a file-oriented API.
    If it's a binary stream, it might really return bytes.
    Objects of subclasses of io.TextIOBase return characters.
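
    A minimal Python 3 sketch of the distinction drawn above. The in-memory
    streams and the variable names are assumptions made for illustration; they
    do not come from the original posts. The same three bytes are read once
    through a binary stream and once through a text stream:

```python
import io

# The UTF-8 encoding of U+FEFF (the BOM discussed in this thread) is three bytes.
data = b"\xef\xbb\xbf"

binary_stream = io.BytesIO(data)
text_stream = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8")

first_byte = binary_stream.read(1)  # binary stream: size counts bytes
first_char = text_stream.read(1)    # text stream: size counts characters

print(type(first_byte).__name__, repr(first_byte))  # bytes b'\xef'
print(type(first_char).__name__, repr(first_char))  # str '\ufeff'

# Only the text stream is an io.TextIOBase subclass instance.
print(isinstance(binary_stream, io.TextIOBase), isinstance(text_stream, io.TextIOBase))
```

    The plain "utf-8" codec does not strip the BOM, so the text stream returns
    it as the single character U+FEFF.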

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Tucker@21:1/5 to All on Mon Jan 9 14:34:30 2023
    Dear Python-list,

    Yes, I know that Python 2.x is no longer supported.

    I have found that the documentation for this method is misleading when the
    file being read is UTF-8-encoded:

    Instead of reading *size* bytes, the method reads *size* UTF-8 byte *sequences*.

    Has this error been corrected in the Python 3.x documentation?

    Stephen Tucker.

  • From Barry Scott@21:1/5 to Stephen Tucker on Mon Jan 9 16:56:36 2023
    On 09/01/2023 14:34, Stephen Tucker wrote:
    Dear Python-list,

    Yes, I know that Python 2.x is no longer supported.

    I have found that the documentation for this method is misleading when the file being read is UTF-8-encoded:

    Instead of reading *size* bytes, the method reads *size* UTF-8 byte *sequences*.

    Has this error been corrected in the Python 3.x documentation?

    Please read the Python 3 docs and let us know if you think it's correct now.

    Barry

  • From Chris Angelico@21:1/5 to Stephen Tucker on Tue Jan 10 05:25:07 2023
    On Tue, 10 Jan 2023 at 01:36, Stephen Tucker <stephen_tucker@sil.org> wrote:

    Dear Python-list,

    Yes, I know that Python 2.x is no longer supported.

    I have found that the documentation for this method is misleading when the file being read is UTF-8-encoded:

    Instead of reading *size* bytes, the method reads *size* UTF-8 byte *sequences*.

    Has this error been corrected in the Python 3.x documentation?


    What documentation is this? The builtin 'file' type doesn't know
    anything about encodings, and only ever returns bytes.

    ChrisA

  • From Stephen Tucker@21:1/5 to rosuav@gmail.com on Wed Jan 11 10:31:31 2023
    Chris -

    In the Python 2.7.10 documentation, I am referring to section 5. Built-in Types, subsection 5.9 File Objects.

    In that subsection, I have the following paragraph:

    file.read([*size*])

    Read at most *size* bytes from the file (less if the read hits EOF before
    obtaining *size* bytes). If the *size* argument is negative or omitted,
    read all data until EOF is reached. The bytes are returned as a string
    object. An empty string is returned when EOF is encountered immediately.
    (For certain files, like ttys, it makes sense to continue reading after an
    EOF is hit.) Note that this method may call the underlying C function
    fread() more than once in an effort to acquire as close to *size* bytes as
    possible. Also note that when in non-blocking mode, less data than was
    requested may be returned, even if no *size* parameter was given.

    Note

    This function is simply a wrapper for the underlying fread() C function,
    and will behave the same in corner cases, such as whether the EOF value is
    cached.

    Stephen.

  • From Chris Angelico@21:1/5 to Stephen Tucker on Wed Jan 11 22:00:21 2023
    On Wed, 11 Jan 2023 at 21:31, Stephen Tucker <stephen_tucker@sil.org> wrote:

    Chris -

    In the Python 2.7.10 documentation, I am referring to section 5. Built-in Types, subsection 5.9 File Objects.

    In that subsection, I have the following paragraph:

    file.read([size])

    Read at most size bytes from the file (less if the read hits EOF before
    obtaining size bytes). If the size argument is negative or omitted, read
    all data until EOF is reached. The bytes are returned as a string object.
    An empty string is returned when EOF is encountered immediately. (For
    certain files, like ttys, it makes sense to continue reading after an EOF
    is hit.) Note that this method may call the underlying C function fread()
    more than once in an effort to acquire as close to size bytes as possible.
    Also note that when in non-blocking mode, less data than was requested may
    be returned, even if no size parameter was given.


    Yes, so it should be that number of bytes, which is what it does, isn't it?

    ChrisA

  • From Chris Angelico@21:1/5 to Stephen Tucker on Thu Jan 12 04:36:30 2023
    On Thu, 12 Jan 2023 at 04:31, Stephen Tucker <stephen_tucker@sil.org> wrote:
    1. Create BOM.txt
    2. Input three bytes at once from BOM.txt and print them
    3. Input three bytes one at a time from BOM.txt and print them

    All of these correctly show that a file, in binary mode, reads and writes bytes.

    4. Input three bytes at once from BOM.txt and print them
    import codecs
    myfil = codecs.open ("BOM.txt", mode="rb", encoding="UTF-8")

    This is now a codecs file, NOT a vanilla file object. See its docs here:

    https://docs.python.org/2.7/library/codecs.html#codecs.open

    The output is "codec-dependent" but I would assume that UTF-8 will
    yield Unicode text strings.

    5. Attempt to input three bytes one at a time from BOM.txt and print them
    -------------------------------------------------------------------------

    myfil = codecs.open ("BOM.txt", mode="rb", encoding="UTF-8")
    myBOM_4 = myfil.read (1)
    myBOM_4
    u'\ufeff'

    A. The attempt at Part 5 actually inputs all three bytes when we ask it to input just the first one!

    On the contrary; you asked it for one *character* and it read one character.

    Where were you seeing documentation that disagreed with this?

    ChrisA

  • From Stephen Tucker@21:1/5 to rosuav@gmail.com on Wed Jan 11 17:31:26 2023
    Chris,

    Thanks for your reply.

    I hope the evidence below (taken from IDLE) clarifies my issue:

    Stephen.

    ======================


    1. Create BOM.txt
    -----------------

    myfil = open ("BOM.txt", "wb")
    myfil.write ("\xef" + "\xbb" + "\xbf")
    myfil.close()

    2. Input three bytes at once from BOM.txt and print them
    --------------------------------------------------------

    myfil = open ("BOM.txt", "rb")
    myBOM = myfil.read (3)
    myBOM
    '\xef\xbb\xbf'
    myfil.close()

    3. Input three bytes one at a time from BOM.txt and print them
    --------------------------------------------------------------

    myfil = open ("BOM.txt", "rb")
    myBOM_1 = myfil.read (1)
    myBOM_2 = myfil.read (1)
    myBOM_3 = myfil.read (1)
    myBOM_1
    '\xef'
    myBOM_2
    '\xbb'
    myBOM_3
    '\xbf'
    myfil.close()

    4. Input three bytes at once from BOM.txt and print them
    --------------------------------------------------------

    import codecs
    myfil = codecs.open ("BOM.txt", mode="rb", encoding="UTF-8")
    myBOM = unicode (myfil.read (3))
    myBOM
    u'\ufeff'
    myfil.close ()

    5. Attempt to input three bytes one at a time from BOM.txt and print them
    -------------------------------------------------------------------------

    myfil = codecs.open ("BOM.txt", mode="rb", encoding="UTF-8")
    myBOM_4 = myfil.read (1)
    myBOM_5 = myfil.read (1)
    myBOM_6 = myfil.read (1)
    myBOM_4
    u'\ufeff'
    myBOM_5
    u''
    myBOM_6
    u''
    myfil.close()

    Notes

    A. The attempt at Part 5 actually inputs all three bytes when we ask it to input just the first one!

    B. The outcome from Part 5 shows that, actually, the request to input text
    in Part 4 brought about a response from the program something like this:

    Input the UTF-8-encoded character as the first "byte";
    As expected, after reaching the end of the file, continue supplying an
    empty string for each of the requested extra bytes.

    ======================


  • From Roel Schroeven@21:1/5 to Chris Angelico on Wed Jan 11 18:49:39 2023
    Chris Angelico schreef op 11/01/2023 om 18:36:
    On Thu, 12 Jan 2023 at 04:31, Stephen Tucker <stephen_tucker@sil.org> wrote:
    1. Create BOM.txt
    2. Input three bytes at once from BOM.txt and print them
    3. Input three bytes one at a time from BOM.txt and print them

    All of these correctly show that a file, in binary mode, reads and writes bytes.

    4. Input three bytes at once from BOM.txt and print them
    import codecs
    myfil = codecs.open ("BOM.txt", mode="rb", encoding="UTF-8")

    This is now a codecs file, NOT a vanilla file object. See its docs here:

    https://docs.python.org/2.7/library/codecs.html#codecs.open

    The output is "codec-dependent" but I would assume that UTF-8 will
    yield Unicode text strings.

    5. Attempt to input three bytes one at a time from BOM.txt and print them
    -------------------------------------------------------------------------

    myfil = codecs.open ("BOM.txt", mode="rb", encoding="UTF-8")
    myBOM_4 = myfil.read (1)
    myBOM_4
    u'\ufeff'

    A. The attempt at Part 5 actually inputs all three bytes when we ask it to input just the first one!

    On the contrary; you asked it for one *character* and it read one character.

    Not exactly. You're right of course that things opened with
    codecs.open() behave differently from vanilla file objects.
    codecs.open() returns a StreamReaderWriter instance, which combines
    StreamReader and StreamWriter. For read(), StreamReader is what matters
    (documented at https://docs.python.org/3.11/library/codecs.html#codecs.StreamReader).
    Its read() method is:

    read(size=-1, chars=-1, firstline=False)

    _size_ indicates the approximate maximum number of encoded bytes or code
    points to read for decoding. The decoder can modify this setting as appropriate.

    _chars_ indicates the number of decoded code points or bytes to return.
    The read() method will never return more data than requested, but it
    might return less, if there is not enough available.

    When only one parameter is provided, without name, it's _size_. So
    myfil.read(1) asks to read enough bytes to decode 1 code point
    (approximately). That's totally consistent with the observed behavior.
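
    A minimal Python 3 sketch of that behaviour. It assumes an in-memory
    io.BytesIO in place of BOM.txt and codecs.getreader("utf-8") in place of
    codecs.open(); both substitutions are made only for this illustration, so
    that the StreamReader can be exercised without touching the disk:

```python
import codecs
import io

raw = io.BytesIO(b"\xef\xbb\xbf")        # U+FEFF encoded as UTF-8 (three bytes)
reader = codecs.getreader("utf-8")(raw)  # a codecs.StreamReader over the bytes

first = reader.read(1)   # the positional argument is `size`
second = reader.read(1)  # nothing is left to decode
print(repr(first), repr(second))  # '\ufeff' ''
print(raw.tell())                 # 3: all three bytes were consumed

# Passing `chars` explicitly caps the number of code points returned.
reader2 = codecs.getreader("utf-8")(io.BytesIO("h\u00e9llo".encode("utf-8")))
two_chars = reader2.read(chars=2)
rest = reader2.read()
print(repr(two_chars), repr(rest))
```

    A single undecodable-so-far byte yields no character, so the reader keeps
    fetching bytes until one code point can be returned, which matches the
    result reported in Part 5.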

    --
    "Peace cannot be kept by force. It can only be achieved through understanding."
    -- Albert Einstein

  • From Stephen Tucker@21:1/5 to rosuav@gmail.com on Thu Jan 12 10:25:11 2023
    Chris,

    Thanks for this clarification.

    I have not found documentation that disagrees with you. I simply observe
    that the documentation that I have alluded to earlier in this chain
    (section 5.9 File Objects) could have been made clearer by the addition of
    a note along the lines that the behaviour of a file's read method (in
    particular, the unit of information that it reads: "byte", "UTF-8 encoded
    character", or whatever) depends on the way in which the file has been
    opened.
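
    For comparison, a small Python 3 sketch of how the mode passed to the
    built-in open() selects the unit that read() counts. The temporary file
    stands in for BOM.txt and the variable names are hypothetical; none of
    this appears in the original posts:

```python
import os
import tempfile

# Write the three-byte UTF-8 BOM from the thread to a temporary file.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"\xef\xbb\xbf")
    path = tmp.name

try:
    # Binary mode: read(1) returns one byte.
    with open(path, "rb") as binary_file:
        binary_kind = type(binary_file).__name__
        binary_result = binary_file.read(1)
    # Text mode with an explicit encoding: read(1) returns one character.
    with open(path, "r", encoding="utf-8") as text_file:
        text_kind = type(text_file).__name__
        text_result = text_file.read(1)
finally:
    os.remove(path)

print(binary_kind, repr(binary_result))  # BufferedReader b'\xef'
print(text_kind, repr(text_result))      # TextIOWrapper '\ufeff'
```

    The two calls return objects of different classes, so each read() method
    follows the documentation of its own class.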

    Thank you, Chris (and others) for your attention to my request. I consider
    this enquiry closed.

    Stephen.





  • From Chris Angelico@21:1/5 to Stephen Tucker on Thu Jan 12 22:49:54 2023
    On Thu, 12 Jan 2023 at 21:25, Stephen Tucker <stephen_tucker@sil.org> wrote:

    Chris,

    Thanks for this clarification.

    I have not found documentation that disagrees with you. I simply observe that the documentation that I have alluded to earlier in this chain (section 5.9 File Objects)

    That's specifically the plain file objects. Other types of objects
    behave differently.

    could have been made clearer by the addition of a note along the lines
    that the behaviour of a file's read method (in particular, what the unit
    of information is that it reads (that is, "byte", "UTF-8 encoded
    character", or whatever)) depends on the way in which the file has been
    opened.

    Thank you, Chris (and others) for your attention to my request. I consider this enquiry closed.


    Cool cool!

    ChrisA
