• Tcl 8.7a5: why is utf-8 different?

    From Georgios Petasis@21:1/5 to All on Tue Jan 11 22:52:06 2022
    Hi all,

    I observe a difference in converting strings to utf-8 between Tcl
    8.6/8.7a1 and 8.7a5.

    In Tcl 8.6 I get:

    binary scan [encoding convertto utf-8 \uD800] H* hex; puts $hex
    efbfbd

    which seems the correct utf-8 encoding.

    However, in Tcl 8.7a5 I get:

    binary scan [encoding convertto utf-8 \uD800] H* hex; puts $hex
    eda080

    What is the reason for this difference? Shouldn't the utf-8 bytes be the
    same, no matter what internal representation for strings is used by Tcl?

    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Benjamin Riefenstahl@21:1/5 to Georgios Petasis on Wed Jan 12 01:29:40 2022
    Hi Georgios,

    Georgios Petasis writes:
    > I observe a difference in converting strings to utf-8 between Tcl
    > 8.6/8.7a1 and 8.7a5.
    >
    > In Tcl 8.6 I get:
    >
    > binary scan [encoding convertto utf-8 \uD800] H* hex; puts $hex
    > efbfbd
    >
    > which seems the correct utf-8 encoding.
    >
    > However, in Tcl 8.7a5 I get:
    >
    > binary scan [encoding convertto utf-8 \uD800] H* hex; puts $hex
    > eda080
    >
    > What is the reason for this difference?

    8.7 now implements UTF-16 surrogate pairs. 0xD800 is the first word in
    such a pair. Converting 0xD800 to UTF-8 without the second word of the
    pair is strictly speaking not possible, because first the words have to
    be combined and only then the whole code point can be encoded. IOW both
    results are wrong, because the input is invalid.
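
    The "combine first, then encode" step can be sketched in plain Tcl arithmetic, which does not depend on how a given Tcl release treats surrogates. The proc names are invented for this illustration, and the pair D83D/DE00 (U+1F600) is only a sample input, not something taken from this thread.

```tcl
# Combine a UTF-16 surrogate pair into a single code point.
# (hypothetical helper name, not a Tcl built-in)
proc surrogates_to_codepoint {hi lo} {
    if {$hi < 0xD800 || $hi > 0xDBFF || $lo < 0xDC00 || $lo > 0xDFFF} {
        error "not a valid high/low surrogate pair"
    }
    expr {0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00)}
}

# Encode a code point in U+10000..U+10FFFF as four UTF-8 bytes (hex string).
proc utf8_4byte_hex {cp} {
    format %02x%02x%02x%02x \
        [expr {0xF0 | ($cp >> 18)}] \
        [expr {0x80 | (($cp >> 12) & 0x3F)}] \
        [expr {0x80 | (($cp >> 6) & 0x3F)}] \
        [expr {0x80 | ($cp & 0x3F)}]
}

set cp [surrogates_to_codepoint 0xD83D 0xDE00]
puts [format %X $cp]        ;# 1F600
puts [utf8_4byte_hex $cp]   ;# f09f9880
```

    A high surrogate such as 0xD800 without a matching low surrogate fails the range check, which is the sense in which the input is invalid.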

    HTH, benny

  • From Georgios Petasis@21:1/5 to All on Wed Jan 12 18:19:47 2022
    On 12/1/2022 02:29, Benjamin Riefenstahl wrote:
    > 8.7 now implements UTF-16 surrogate pairs. 0xD800 is the first word in
    > such a pair. Converting 0xD800 to UTF-8 without the second word of the
    > pair is strictly speaking not possible, because first the words have to
    > be combined and only then the whole code point can be encoded. IOW both
    > results are wrong, because the input is invalid.
    >
    > HTH, benny

    Yes, but the 8.6 result was useful, at least for me. The new value from
    8.7 is of no use...
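
    For reference, both hex strings in this thread can be reproduced by hand. A minimal sketch in plain Tcl arithmetic (the proc name is invented here): efbfbd is the three-byte UTF-8 form of U+FFFD, the replacement character, while eda080 is what the generic three-byte bit pattern yields when fed 0xD800 directly. This only shows where the bytes come from; it makes no claim about what either release does internally.

```tcl
# Apply the generic three-byte UTF-8 bit pattern (1110xxxx 10xxxxxx 10xxxxxx)
# to a 16-bit value and return the bytes as hex.
# (hypothetical helper name; it deliberately does NOT reject surrogates)
proc utf8_3byte_hex {cp} {
    format %02x%02x%02x \
        [expr {0xE0 | ($cp >> 12)}] \
        [expr {0x80 | (($cp >> 6) & 0x3F)}] \
        [expr {0x80 | ($cp & 0x3F)}]
}

puts [utf8_3byte_hex 0xFFFD]   ;# efbfbd  (the value reported for 8.6)
puts [utf8_3byte_hex 0xD800]   ;# eda080  (the value reported for 8.7a5)
```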

    George

  • From Christian Gollwitzer@21:1/5 to All on Wed Jan 12 21:55:20 2022
    On 12.01.22 at 17:19, Georgios Petasis wrote:
    > On 12/1/2022 02:29, Benjamin Riefenstahl wrote:
    >> 8.7 now implements UTF-16 surrogate pairs. 0xD800 is the first word in
    >> such a pair. Converting 0xD800 to UTF-8 without the second word of the
    >> pair is strictly speaking not possible, because first the words have to
    >> be combined and only then the whole code point can be encoded. IOW both
    >> results are wrong, because the input is invalid.
    >>
    >> HTH, benny
    >
    > Yes, but the 8.6 had a usefulness, at least for me. The new value from
    > 8.7 is not of any use...

    You should talk to Jan Nijtmans, who implemented the UTF-16 thing. I'm
    not sure whether this case is already set in stone; however, my
    understanding of this whole thing is shallow.

    Christian
