• Re: Strange UnicodeEncodeError in Windows image on Azure DevOps and Git

    From Eryk Sun@21:1/5 to Jessica Smith on Fri Nov 11 18:10:47 2022
    On 11/10/22, Jessica Smith <12jessicasmith34@gmail.com> wrote:

    > Weird issue I've found on Windows images in Azure DevOps Pipelines and
    > GitHub Actions. Printing Unicode characters fails on these images
    > because, for some reason, the encoding is mapped to cp1252. What is
    > particularly weird about the code page being set to 1252 is that if
    > you execute "chcp" it shows that the code page is 65001.

    If stdout isn't a console (e.g. a pipe), it defaults to using the
    process code page (i.e. CP_ACP), such as legacy code page 1252
    (extended Latin-1). You can override just sys.std* to UTF-8 by setting
    the environment variable `PYTHONIOENCODING=UTF-8`. You can override
    all I/O to use UTF-8 by setting `PYTHONUTF8=1`, or by passing the
    command-line option `-X utf8`.
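    A minimal sketch of how these overrides could be checked, assuming a
    child interpreter whose stdout is a pipe. The helper name
    child_stdout_encoding is made up for illustration; the environment
    variables and the -X utf8 option are the ones named above.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(overrides, extra_args=()):
    """Run a child Python with stdout piped and return its stdout encoding.

    Hypothetical helper: `overrides` is a dict of environment variables,
    `extra_args` are interpreter options such as ("-X", "utf8").
    """
    env = os.environ.copy()
    # Drop inherited settings so they cannot mask the override under test.
    for name in ("PYTHONIOENCODING", "PYTHONUTF8"):
        env.pop(name, None)
    env.update(overrides)
    result = subprocess.run(
        [sys.executable, *extra_args, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    reported = result.stdout.decode("ascii").strip()
    # Normalize spellings such as "UTF-8" and "utf8" to the codec's canonical name.
    return codecs.lookup(reported).name


if __name__ == "__main__":
    print("default          :", child_stdout_encoding({}))
    print("PYTHONIOENCODING :", child_stdout_encoding({"PYTHONIOENCODING": "UTF-8"}))
    print("PYTHONUTF8=1     :", child_stdout_encoding({"PYTHONUTF8": "1"}))
    print("-X utf8          :", child_stdout_encoding({}, ("-X", "utf8")))
```

    On a Windows machine with a legacy process code page, the first line
    would be expected to show something like cp1252, while the other three
    should report utf-8.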

    Background

    The locale system in Windows supports a common system locale, plus a
    separate locale for each user. By default the process code page is
    based on the system locale, and the thread code page (i.e.
    CP_THREAD_ACP) is based on the user locale. The default locale of the
    Universal C runtime combines the user locale with the process code
    page. (This combination may be inconsistent.)

    In Windows 10 and later, the default process and thread code pages can
    be configured to use CP_UTF8 (65001). Applications can also override
    them to UTF-8 in their manifest via the "ActiveCodePage" setting. In
    either case, if the process code page is UTF-8, the C runtime will use
    UTF-8 for its default locale encoding (e.g. "en_uk.utf8").

    Unlike some frameworks, Python has never used the console input code
    page or output code page as a locale encoding. Personally, I wouldn't
    want Python to default to that old MS-DOS behavior. However, I'd be in
    favor of supporting a "console" encoding that's based on the console
    input code page that's returned by GetConsoleCP(). If the process
    doesn't have a console session, the "console" encoding would fall back
    on the process code page from GetACP().
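    The fallback described above could be sketched roughly as follows.
    This is not an existing Python feature; console_encoding and
    format_code_page are hypothetical names, and the sketch assumes that
    GetConsoleCP() returns 0 when the process has no console.

```python
import locale
import sys


def format_code_page(code_page):
    """Format a Windows code page number as an encoding name, e.g. 'cp1252'."""
    return "cp%u" % code_page


def console_encoding():
    """Hypothetical 'console' encoding: console input code page, else the ACP."""
    if sys.platform != "win32":
        # Assumption for portability only: there are no Windows code pages
        # on other platforms, so fall back on the locale encoding.
        return locale.getpreferredencoding(False)

    import ctypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetConsoleCP.restype = ctypes.c_uint
    kernel32.GetACP.restype = ctypes.c_uint
    # Assumed behavior: GetConsoleCP() yields 0 without a console session,
    # in which case the process code page from GetACP() is used instead.
    code_page = kernel32.GetConsoleCP() or kernel32.GetACP()
    return format_code_page(code_page)


if __name__ == "__main__":
    print(console_encoding())
```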

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From 12Jessicasmith34@21:1/5 to All on Fri Nov 11 19:20:17 2022
    > If stdout isn't a console (e.g. a pipe), it defaults to using the
    > process code page (i.e. CP_ACP), such as legacy code page 1252
    > (extended Latin-1).

    First off, really helpful information, thank you. That was exactly the
    background I was missing.

    Two questions. First, any idea why this would be happening in this
    situation? AFAIK, stdout *is* a console when these images are running
    the Python process. Second, is there a way I can check the locale and
    code page values that you mentioned? I assume I could call GetACP using
    ctypes, but maybe there is a simpler way?

  • From Eryk Sun@21:1/5 to 12jessicasmith34@gmail.com on Fri Nov 11 20:16:01 2022
    On 11/11/22, 12Jessicasmith34 <12jessicasmith34@gmail.com> wrote:

    > any idea why this would be happening in this situation? AFAIK, stdout
    > *is* a console when these images are running the python process.

    If sys.std* are console files, then in Python 3.6+,
    sys.std*.buffer.raw will be _io._WindowsConsoleIO. The latter presents
    itself to Python code as a UTF-8 file stream, but internally it uses
    UTF-16LE with the wide-character API functions ReadConsoleW() and
    WriteConsoleW().

    > is there a way I can check the locale and code page values that you
    > mentioned? I assume I could call GetACP using ctypes, but maybe
    > there is a simpler way?

    io.TextIOWrapper uses locale.getpreferredencoding(False) as the
    default encoding. Actually, in 3.11+ it uses locale.getencoding()
    unless UTF-8 mode is enabled, but the result is effectively the same
    as locale.getpreferredencoding(False). On Windows this calls GetACP()
    and formats the result as "cp%u" (e.g. "cp1252").
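    As a rough illustration of the checks mentioned in this reply, the
    sketch below reports a stream's encoding and the type of its raw
    layer, plus the locale encoding. The helper names describe_stream and
    describe_defaults are invented for this example.

```python
import locale
import sys


def describe_stream(stream):
    """Summarize how a text stream is wired up (hypothetical debugging helper)."""
    buffer = getattr(stream, "buffer", None)
    raw = getattr(buffer, "raw", None)
    return {
        "encoding": getattr(stream, "encoding", None),
        "errors": getattr(stream, "errors", None),
        # On a Windows console this would be "_WindowsConsoleIO";
        # for a redirected file it is typically "FileIO".
        "raw_type": type(raw).__name__ if raw is not None else None,
        "isatty": stream.isatty() if hasattr(stream, "isatty") else False,
    }


def describe_defaults():
    """Report the encoding that io.TextIOWrapper would use by default."""
    return {
        "locale_encoding": locale.getpreferredencoding(False),
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


if __name__ == "__main__":
    print("stdout  :", describe_stream(sys.stdout))
    print("defaults:", describe_defaults())
```

    Running this once in an interactive console and once with output
    redirected should make the difference in raw_type and encoding
    visible.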

  • From Inada Naoki@21:1/5 to 12jessicasmith34@gmail.com on Sat Nov 12 11:53:42 2022
    On Sat, Nov 12, 2022 at 10:21 AM 12Jessicasmith34
    <12jessicasmith34@gmail.com> wrote:

    > Two questions: any idea why this would be happening in this
    > situation? AFAIK, stdout *is* a console when these images are running
    > the python process. Second - is there a way I can check the locale
    > and code page values that you mentioned? I assume I could call GetACP
    > using ctypes, but maybe there is a simpler way?

    Maybe Python doesn't write to the console in this case:

    python -(pipe)-> PowerShell -> Console

    In this case, Python uses the ACP for writing to the pipe,
    and PowerShell uses OutputEncoding for reading from the pipe.
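    To make the mismatch concrete, here is a small simulation in pure
    Python. It does not involve PowerShell at all; it only encodes text as
    a writer would and decodes it as a reader with a different setting
    would. The function names through_pipe and can_encode are made up for
    this sketch.

```python
def through_pipe(text, writer_encoding, reader_encoding):
    """Encode text as the writing side would, then decode it as the reader would."""
    data = text.encode(writer_encoding)
    # errors="replace" mimics a lenient reader that substitutes U+FFFD
    # for bytes it cannot decode.
    return data.decode(reader_encoding, errors="replace")


def can_encode(text, encoding):
    """Return True if text can be written with the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    sample = "naïve café"
    # Writer and reader agree: the text survives.
    print(through_pipe(sample, "utf-8", "utf-8"))
    # Writer uses UTF-8, reader assumes cp1252: classic mojibake.
    print(through_pipe(sample, "utf-8", "cp1252"))
    # A character outside cp1252 is the UnicodeEncodeError case from the
    # original report.
    print(can_encode("\u2192", "cp1252"))
```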

    If you want to use UTF-8 with PowerShell on Windows:

    * Set PYTHONUTF8=1 (Python uses UTF-8 for writing into the pipe).
    * Set `$OutputEncoding = [System.Text.Encoding]::GetEncoding('utf-8')`
      in your PowerShell profile.

    Regards,

    --
    Inada Naoki <songofacandy@gmail.com>

  • From Inada Naoki@21:1/5 to songofacandy@gmail.com on Sat Nov 12 12:23:39 2022
    On Sat, Nov 12, 2022 at 11:53 AM Inada Naoki <songofacandy@gmail.com> wrote:

    > Maybe, python doesn't write to console in this case.
    >
    > python -(pipe)-> PowerShell -> Console
    >
    > In this case, python uses ACP for writing to pipe.
    > And PowerShell uses OutputEncoding for reading from pipe.
    >
    > If you want to use UTF-8 on PowerShell in Windows,
    >
    > * Set PYTHONUTF8=1 (Python uses UTF-8 for writing into pipe).
    > * Set `$OutputEncoding = [System.Text.Encoding]::GetEncoding('utf-8')`
    >   in PowerShell profile.

    I forgot [Console]::OutputEncoding. This is what PowerShell uses when
    reading from a pipe. So the PowerShell profile should contain:

    $OutputEncoding = [Console]::OutputEncoding = [System.Text.Encoding]::UTF8

    --
    Inada Naoki <songofacandy@gmail.com>

  • From moi@21:1/5 to All on Sun Nov 13 06:24:47 2022
    On Saturday, November 12, 2022 at 03:54:31 UTC+1, Inada Naoki wrote:
    > If you want to use UTF-8 on PowerShell in Windows,
    >
    > * Set PYTHONUTF8=1 (Python uses UTF-8 for writing into pipe).
    > * Set `$OutputEncoding = [System.Text.Encoding]::GetEncoding('utf-8')`
    >   in PowerShell profile.

    ... which just does not work.

  • From Jessica Smith@21:1/5 to eryksun@gmail.com on Sun Nov 13 13:35:07 2022
    On Fri, Nov 11, 2022 at 8:16 PM Eryk Sun <eryksun@gmail.com> wrote:
    > If sys.std* are console files, then in Python 3.6+,
    > sys.std*.buffer.raw will be _io._WindowsConsoleIO
    >
    > io.TextIOWrapper uses locale.getpreferredencoding(False) as the
    > default encoding

    Thank you for your replies - checking the sys.stdout.buffer.raw value
    is what finally helped me understand. Turns out, the Windows agent is
    redirecting the output of all Python commands to a file, so sys.stdout
    is a file using the locale encoding of cp1252, instead of being a
    stream using encoding utf8. I wrote up a gist with my findings to
    hopefully help out some other poor soul in the future:
    https://gist.github.com/NodeJSmith/e7e37f2d3f162456869f015f842bcf15
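    For scripts that cannot control the agent's environment, one possible
    in-script workaround (not something proposed in this thread) is to
    switch the stream to UTF-8 with TextIOWrapper.reconfigure(), available
    in Python 3.7+. The helper name force_utf8 is hypothetical.

```python
import io
import sys


def force_utf8(stream):
    """Switch a text stream to UTF-8 if it supports reconfigure().

    Returns True when the stream was reconfigured, False when the stream
    has no reconfigure() method (e.g. io.StringIO).
    """
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is None:
        return False
    reconfigure(encoding="utf-8")
    return True


if __name__ == "__main__":
    # Simulate stdout redirected to a file that uses the cp1252 locale encoding.
    target = io.BytesIO()
    redirected = io.TextIOWrapper(target, encoding="cp1252")
    force_utf8(redirected)
    redirected.write("\u2713 done\n")  # U+2713 is not representable in cp1252
    redirected.flush()
    print(target.getvalue())

    # The same call can be applied to the real stream early in a script.
    force_utf8(sys.stdout)
```

    Setting PYTHONUTF8=1 or PYTHONIOENCODING, as suggested earlier in the
    thread, achieves the same effect without touching the script.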

  • From moi@21:1/5 to All on Tue Nov 15 01:41:14 2022
    On Sunday, November 13, 2022 at 20:35:43 UTC+1, Jessica Smith wrote:
    > On Fri, Nov 11, 2022 at 8:16 PM Eryk Sun <ery...@gmail.com> wrote:
    >> If sys.std* are console files, then in Python 3.6+,
    >> sys.std*.buffer.raw will be _io._WindowsConsoleIO
    >>
    >> io.TextIOWrapper uses locale.getpreferredencoding(False) as the
    >> default encoding
    >
    > Thank you for your replies - checking the sys.stdout.buffer.raw value
    > is what finally helped me understand. Turns out, the Windows agent is
    > redirecting the output of all python commands to a file, so sys.stdout
    > is a file using the locale encoding of cp1252, instead of being a
    > stream using encoding utf8. I wrote up a gist with my findings to
    > hopefully help out some other poor soul in the future:
    > https://gist.github.com/NodeJSmith/e7e37f2d3f162456869f015f842bcf15

    Jessica,

    Nice reading.

    May I suggest toying with more elaborate strings, so that you get
    things like this (from Windows-1252 chars):

    PS C:\xxx\zzz> $a = py38 -c "print('$jj')"; $a
    abc霟

    Also take into account that PowerShell does not read a text file
    properly.

    I'm very irritated and frustrated. I do love to toy with Python and
    its "coding of characters" bugginess, and now even the Windows terminal
    does not work correctly...
