I observe a difference in converting strings to utf-8 between Tcl
8.6/8.7a1 and 8.7a5.
In Tcl 8.6 I get:
binary scan [encoding convertto utf-8 \uD800] H* hex; puts $hex
efbfbd
which seems to be the correct UTF-8 encoding.
However, in Tcl 8.7a5 I get:
binary scan [encoding convertto utf-8 \uD800] H* hex; puts $hex
eda080
What is the reason for this difference?
Hi Georgios,
Tcl 8.7 now implements UTF-16 surrogate pairs. 0xD800 is a high
surrogate, the first word of such a pair. Strictly speaking, converting
0xD800 to UTF-8 without the second word of the pair is not possible: the
two words first have to be combined, and only then can the whole code
point be encoded. IOW both results are wrong, because the input is
invalid.
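
To make the two hex strings concrete, here is a minimal sketch that does the arithmetic by hand. It deliberately avoids `encoding convertto`, so it should behave the same on any Tcl version. The proc names `surrogatesToCodePoint` and `utf8Hex` are made up for this illustration and are not part of Tcl, and the sketch makes no claim about how Tcl implements the conversion internally.

```tcl
# Combine a UTF-16 surrogate pair into one code point.
# (Hypothetical helper, plain integer arithmetic only.)
proc surrogatesToCodePoint {hi lo} {
    if {$hi < 0xD800 || $hi > 0xDBFF} { error "not a high surrogate: $hi" }
    if {$lo < 0xDC00 || $lo > 0xDFFF} { error "not a low surrogate: $lo" }
    expr {0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00)}
}

# Apply the UTF-8 bit patterns mechanically to an integer and return hex.
# It does NOT reject surrogates, which a strict UTF-8 encoder must do.
proc utf8Hex {cp} {
    if {$cp < 0x80} {
        set bytes [list $cp]
    } elseif {$cp < 0x800} {
        set bytes [list [expr {0xC0 | ($cp >> 6)}] \
                        [expr {0x80 | ($cp & 0x3F)}]]
    } elseif {$cp < 0x10000} {
        set bytes [list [expr {0xE0 | ($cp >> 12)}] \
                        [expr {0x80 | (($cp >> 6) & 0x3F)}] \
                        [expr {0x80 | ($cp & 0x3F)}]]
    } else {
        set bytes [list [expr {0xF0 | ($cp >> 18)}] \
                        [expr {0x80 | (($cp >> 12) & 0x3F)}] \
                        [expr {0x80 | (($cp >> 6) & 0x3F)}] \
                        [expr {0x80 | ($cp & 0x3F)}]]
    }
    set hex ""
    foreach b $bytes { append hex [format %02x $b] }
    return $hex
}

puts [utf8Hex 0xD800]   ;# eda080: three-byte pattern applied to the lone surrogate
puts [utf8Hex 0xFFFD]   ;# efbfbd: U+FFFD, the replacement character
puts [utf8Hex [surrogatesToCodePoint 0xD83D 0xDE00]]   ;# f09f9880: U+1F600
```

So `efbfbd` is the encoding of the replacement character U+FFFD, i.e. the invalid input was substituted, while `eda080` is what falls out when the three-byte pattern is applied to 0xD800 directly, which is not well-formed UTF-8.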
HTH, benny
On 12/1/2022 at 02:29, Benjamin Riefenstahl wrote:
8.7 now implements UTF-16 surrogate pairs. 0xD800 is the first word in
such a pair. Converting 0xD800 to UTF-8 without the second word of the
pair is strictly speaking not possible, because first the words have to
be combined and only then the whole codepoint can be encoded. IOW both
results are wrong, because the input is invalid.
HTH, benny
Yes, but the 8.6 result was at least useful to me. The new value from
8.7 is of no use...