• Win11 terminal or powershell not reading a file correctly : explanation

    From moi@21:1/5 to All on Wed Nov 23 08:10:52 2022
    The powershell variants, win11 terminal, pwsh.exe (7.2.7), or *x powershell, are not reading a file correctly.

    When reading a file, PowerShell gives BOM detection priority over an explicitly declared encoding. This cannot work.
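
    A minimal Python sketch of the point above. The two helper names (read_declared, read_bom_first) are hypothetical and only model the two policies; they are not PowerShell's actual implementation. The byte string is the content of zz.txt shown in the transcripts below, without the line ending.

```python
import codecs

# Valid Windows-1252 bytes that happen to begin with EF BB BF.
raw = b"\xef\xbb\xbfabc\xe9\x9c\x9f"


def read_declared(data: bytes, encoding: str) -> str:
    """Decode strictly with the encoding the caller declared."""
    return data.decode(encoding)


def read_bom_first(data: bytes, encoding: str) -> str:
    """Let a leading UTF-8 BOM override the declared encoding."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    return data.decode(encoding)


declared = read_declared(raw, "cp1252")
sniffed = read_bom_first(raw, "cp1252")

# Print code points only, so the output does not depend on the console codec.
print(len(declared), [f"U+{ord(c):04X}" for c in declared])
print(len(sniffed), [f"U+{ord(c):04X}" for c in sniffed])
```

    With the declared encoding honoured, the nine bytes give nine characters; with BOM sniffing first, the same bytes collapse to four.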

    Illustration and stupid cases.

    ms52.txt is a valid Windows-1252 encoded file.

    PS C:\humour> py38
    Python 3.8.10 (tags/v3.8.10:3d8993a, May 3 2021, 11:34:34) [MSC v.1928 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> with open('ms52.txt', 'r', encoding='cp1252') as f:
    ...     print(f.read())
    ...
    ï»¿abcéœŸ



    # cmd.exe : correct
    c:\humour>type ms52.txt
    ï»¿abcéœŸ
    c:\humour>

    PS C:\humour> get-content ms52.txt -encoding default
    abc霟
    PS C:\humour> # expected if utf-8 were the real default, utf-8-bom
    PS C:\humour> get-content ms52.txt
    abc霟
    PS C:\humour> get-content ms52.txt -encoding default
    abc霟

    # win-1252 -> utf-8 conversion : impossible

    PS C:\humour> py38 contenu.py conversion.txt
    bytes 46 b'\xef\xbb\xbf are the three characters you may see...\r\n'
    1252 46 ï»¿ are the three characters you may see...\r\n
    UTF-8 BOM 43 are the three characters you may see...\r\n
    PS C:\humour>
    PS C:\humour> get-content conversion.txt -encoding default | set-content zz.txt -encoding utf8
    PS C:\humour> py38 contenu.py zz.txt
    bytes 46 b'\xef\xbb\xbf are the three characters you may see...\r\n'
    1252 46 ï»¿ are the three characters you may see...\r\n
    UTF-8 BOM 43 are the three characters you may see...\r\n
    PS C:\humour>
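
    For comparison, a sketch of the conversion that was intended, done in Python. The convert function is a hypothetical helper; the file names are reused from the transcript and written to a temporary directory. Because the source is decoded strictly as cp1252, the three leading bytes survive as three characters and are re-encoded as six UTF-8 bytes.

```python
import os
import tempfile


def convert(src: str, dst: str, src_encoding: str = "cp1252") -> None:
    """Re-encode src as UTF-8 without BOM, decoding strictly with src_encoding."""
    # newline="" keeps the original \r\n line endings untouched.
    with open(src, "r", encoding=src_encoding, newline="") as f:
        text = f.read()
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.write(text)


original = b"\xef\xbb\xbf are the three characters you may see...\r\n"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "conversion.txt")
    dst = os.path.join(tmp, "zz.txt")
    with open(src, "wb") as f:
        f.write(original)
    convert(src, dst)
    with open(dst, "rb") as f:
        converted = f.read()

print(original)
print(converted)
```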

    # Probably the height of absurdity: one cannot save a file and read it back
    # with the same codec!

    PS C:\humour> $ll
    ï»¿abcéœŸ
    PS C:\humour> $ll | set-content zz.txt -encoding default
    PS C:\humour> py38 contenu.py zz.txt
    bytes 11 b'\xef\xbb\xbfabc\xe9\x9c\x9f\r\n'
    1252 11 ï»¿abcéœŸ\r\n
    UTF-8 BOM 6 abc霟\r\n
    PS C:\humour> $in = get-content zz.txt -encoding default
    PS C:\humour> $in
    abc霟
    PS C:\humour> $in -eq $ll
    False
    PS C:\humour>

    Ditto with pwsh.exe and an explicit 1252 or windows-1252 encoding name.
    PS C:\humour> $a = get-content zz.txt -encoding 1252
    PS C:\humour> $a
    abc霟
    PS C:\humour>

    Amusing in win11 where the default codec is windows-1252!
    PS C:\humour> "abcéà€" | set-content a.txt -encoding default
    PS C:\humour> py38 contenu.jpy a.txt
    c:\Python38\python.exe: can't open file 'contenu.jpy': [Errno 2] No such file or directory
    PS C:\humour> py38 contenu.py a.txt
    bytes 8 b'abc\xe9\xe0\x80\r\n'
    1252 8 abcéà€\r\n
    UTF-8 NO BOM 8 abc���\r\n
    PS C:\humour>
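
    The script contenu.py is not shown in this post. The following is only a guess at what such an inspection script could look like, reconstructed from its output format; the function name describe and all details are assumptions. The UTF-8 line decodes with errors="replace", which is consistent with the three replacement characters in the last transcript.

```python
import codecs
from typing import List


def describe(data: bytes) -> List[str]:
    """Report the raw bytes, a cp1252 decoding and a BOM-aware UTF-8 decoding."""

    def visible(text: str) -> str:
        # Show line endings literally, as in the transcripts.
        return text.replace("\r", "\\r").replace("\n", "\\n")

    as_1252 = data.decode("cp1252", errors="replace")
    has_bom = data.startswith(codecs.BOM_UTF8)
    # utf-8-sig drops a leading BOM if there is one.
    as_utf8 = data.decode("utf-8-sig", errors="replace")
    label = "UTF-8 BOM" if has_bom else "UTF-8 NO BOM"
    return [
        f"bytes {len(data)} {data!r}",
        f"1252 {len(as_1252)} {visible(as_1252)}",
        f"{label} {len(as_utf8)} {visible(as_utf8)}",
    ]


report = describe(b"abc\xe9\xe0\x80\r\n")
bom_report = describe(b"\xef\xbb\xbfabc\xe9\x9c\x9f\r\n")

for line in report + bom_report:
    # Escape non-ASCII so the demo prints on any console.
    print(line.encode("ascii", "backslashreplace").decode("ascii"))
```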

    Miscellaneous

    PS C:\humour> $psversiontable

    Name Value
    ---- -----
    PSVersion 7.2.7
    PSEdition Core
    GitCommitId 7.2.7
    OS Microsoft Windows 10.0.22621
    Platform Win32NT
    PSCompatibleVersions {1.0, 2.0, 3.0, 4.0…}
    PSRemotingProtocolVersion 2.3
    SerializationVersion 1.1.0.1
    WSManStackVersion 3.0

    and

    Name Value
    ---- -----
    PSVersion 5.1.22621.608
    PSEdition Desktop
    PSCompatibleVersions {1.0, 2.0, 3.0, 4.0...}
    BuildVersion 10.0.22621.608
    CLRVersion 4.0.30319.42000
    WSManStackVersion 3.0
    PSRemotingProtocolVersion 2.3
    SerializationVersion 1.1.0.1

    PS C:\humour> get-content iso5.txt -encoding iso-8859-5
    abc
    PS C:\humour> # wrong
    PS C:\humour> get-content passwordiso2.txt -encoding iso-8859-2
    éz
    PS C:\humour> # wrong

    What is an é in real iso-8859-2?
    PS C:\humour> py38 -c "print('é'.encode('iso-8859-2'))"
    b'\xe9'

    PS C:\humour>
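
    A short Python sketch of why the declared code page has to be honoured for single-byte encodings: the same byte maps to a different character in each of them. The contents of iso5.txt and passwordiso2.txt are not given here, so the sketch uses only the byte from the check above.

```python
# "\u00e9" is the letter é, written as an escape to keep the source ASCII.
byte = "\u00e9".encode("iso-8859-2")

# Decode the same single byte under three declared encodings.
decoded = {
    name: byte.decode(name)
    for name in ("iso-8859-2", "iso-8859-5", "cp1252")
}

for name, ch in decoded.items():
    print(f"{name}: {byte!r} -> U+{ord(ch):04X}")
```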


    23.11.2022. Updated win11 22H2 version.
    Dear devs, you have 24/48 hours to fix this buggy behaviour.

    Regards.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From moi@21:1/5 to All on Wed Nov 23 23:04:16 2022
    On Wednesday, 23 November 2022 at 17:11:03 UTC+1, moi wrote:
    [...]
    --------

    Correction. Cleaned version.

    PS C:\humour> $ll
    ï»¿abcéœŸ
    PS C:\humour> $ll | set-content a.txt -nonewline
    PS C:\humour> $ll2 = get-content a.txt
    PS C:\humour> $ll2
    abc霟
    PS C:\humour> $ll2 -eq $ll
    False
    PS C:\humour> licp($ll2)
    a U+0061
    b U+0062
    c U+0063
    霟 U+971F
    PS C:\humour> licp($ll)
    ï U+00EF
    » U+00BB
    ¿ U+00BF
    a U+0061
    b U+0062
    c U+0063
    é U+00E9
    œ U+0153
    Ÿ U+0178
    PS C:\humour>
    PS C:\humour> $ll3 = get-content52 a.txt
    PS C:\humour> $ll3
    ï»¿abcéœŸ
    PS C:\humour> $ll3 -eq $ll
    True
    PS C:\humour> licp($ll3)
    ï U+00EF
    » U+00BB
    ¿ U+00BF
    a U+0061
    b U+0062
    c U+0063
    é U+00E9
    œ U+0153
    Ÿ U+0178
    PS C:\humour>
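
    The same round trip sketched in Python. The functions licp and get-content52 used above are the poster's own and are not shown; code_points below is a hypothetical stand-in for licp, and the strict cp1252 read stands in for get-content52. The assumption is that the file is written as Windows-1252, which matches the bytes seen earlier.

```python
import os
import tempfile
from typing import List


def code_points(text: str) -> List[str]:
    """List the code point of every character in text."""
    return [f"U+{ord(ch):04X}" for ch in text]


# The nine-character string held in $ll.
ll = b"\xef\xbb\xbfabc\xe9\x9c\x9f".decode("cp1252")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "a.txt")
    with open(path, "w", encoding="cp1252", newline="") as f:
        f.write(ll)
    # A BOM-first read: the leading bytes are taken as a UTF-8 BOM.
    with open(path, "r", encoding="utf-8-sig", newline="") as f:
        ll2 = f.read()
    # A read that honours the codec used for writing.
    with open(path, "r", encoding="cp1252", newline="") as f:
        ll3 = f.read()

print(ll2 == ll, code_points(ll2))
print(ll3 == ll, code_points(ll3))
```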
