Forum: >>> Magnum BBS <<<

Tcl 8.7a5: why utf-8 is different?

From Georgios Petasis@21:1/5 to All on Tue Jan 11 22:52:06 2022

Hi all,

I observe a difference in converting strings to utf-8 between Tcl
8.6/8.7a1 and 8.7a5.

In Tcl 8.6 I get:

binary scan [encoding convertto utf-8 \uD800] H* hex; puts $hex

efbfbd

which seems the correct utf-8 encoding.

However, in Tcl 8.7a5 I get:

binary scan [encoding convertto utf-8 \uD800] H* hex; puts $hex

eda080

What is the reason for this difference? Shouldn't the utf-8 bytes be the
same, no matter what internal representation for strings is used by Tcl?
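
For reference, the two hex strings above can be decoded by hand. This is only a minimal sketch, assuming both outputs are ordinary three-byte UTF-8 sequences; `decode3` is a hypothetical helper name, not a Tcl built-in:

```tcl
# Decode a three-byte UTF-8 sequence, given as six hex digits, into its
# code point using plain integer arithmetic (no encoding commands).
# decode3 is a made-up helper for illustration only.
proc decode3 {hex} {
    scan $hex %2x%2x%2x b1 b2 b3
    # Layout: 1110xxxx 10yyyyyy 10zzzzzz -> xxxxyyyyyyzzzzzz
    return [expr {(($b1 & 0x0F) << 12) | (($b2 & 0x3F) << 6) | ($b3 & 0x3F)}]
}

puts [format %04X [decode3 efbfbd]]   ;# FFFD
puts [format %04X [decode3 eda080]]   ;# D800
```

Read this way, the 8.6 output is the encoding of U+FFFD (the replacement character), while the 8.7a5 output is the three-byte pattern applied directly to the value 0xD800.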

George

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Benjamin Riefenstahl@21:1/5 to Georgios Petasis on Wed Jan 12 01:29:40 2022

Hi Georgios,

Georgios Petasis writes:

> I observe a difference in converting strings to utf-8 between Tcl
> 8.6/8.7a1 and 8.7a5.
>
> In Tcl 8.6 I get:
>
> binary scan [encoding convertto utf-8 \uD800] H* hex; puts $hex
>
> efbfbd
>
> which seems the correct utf-8 encoding.
>
> However, in Tcl 8.7a5 I get:
>
> binary scan [encoding convertto utf-8 \uD800] H* hex; puts $hex
>
> eda080
>
> What is the reason for this difference?

8.7 now implements UTF-16 surrogate pairs. 0xD800 is the first word (the
high surrogate) of such a pair. Strictly speaking, converting 0xD800 to
UTF-8 without the second word of the pair is not possible, because the
two words first have to be combined, and only then can the whole code
point be encoded. In other words, both results are wrong, because the
input is invalid.
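
To illustrate the "combine first, then encode" step, here is a rough sketch using the standard UTF-16 arithmetic. The proc names `combineSurrogates` and `utf8Hex4` are made up for this example, and the low surrogate 0xDC00 is just an assumed partner for 0xD800:

```tcl
# Combine a high/low surrogate pair into a single code point
# (hypothetical helper, standard UTF-16 arithmetic).
proc combineSurrogates {hi lo} {
    return [expr {0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00)}]
}

# Encode a code point in the range U+10000..U+10FFFF as four UTF-8
# bytes and return them as hex digits (hypothetical helper).
proc utf8Hex4 {cp} {
    # Layout: 11110www 10xxxxxx 10yyyyyy 10zzzzzz
    return [format %02x%02x%02x%02x \
        [expr {0xF0 | ($cp >> 18)}] \
        [expr {0x80 | (($cp >> 12) & 0x3F)}] \
        [expr {0x80 | (($cp >> 6) & 0x3F)}] \
        [expr {0x80 | ($cp & 0x3F)}]]
}

set cp [combineSurrogates 0xD800 0xDC00]
puts [format U+%X $cp]   ;# U+10000
puts [utf8Hex4 $cp]      ;# f0908080
```

With only 0xD800 available there is no second word to feed into the first step, which is why a lone surrogate has no well-formed UTF-8 encoding.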

HTH, benny

From Georgios Petasis@21:1/5 to All on Wed Jan 12 18:19:47 2022

On 12/1/2022 02:29, Benjamin Riefenstahl wrote:

> 8.7 now implements UTF-16 surrogate pairs. 0xD800 is the first word in
> such a pair. Converting 0xD800 to UTF-8 without the second word of the
> pair is strictly speaking not possible, because first the words have to
> be combined and only then the whole codepoint can be encoded. IOW both
> results are wrong, because the input is invalid.
>
> HTH, benny

Yes, but the 8.6 result was useful, at least to me. The new value from
8.7 is of no use...

George

From Christian Gollwitzer@21:1/5 to All on Wed Jan 12 21:55:20 2022

On 12.01.22 at 17:19, Georgios Petasis wrote:

> On 12/1/2022 02:29, Benjamin Riefenstahl wrote:
>
>> 8.7 now implements UTF-16 surrogate pairs. 0xD800 is the first word in
>> such a pair. Converting 0xD800 to UTF-8 without the second word of the
>> pair is strictly speaking not possible, because first the words have to
>> be combined and only then the whole codepoint can be encoded. IOW both
>> results are wrong, because the input is invalid.
>>
>> HTH, benny
>
> Yes, but the 8.6 had a usefulness, at least for me. The new value from
> 8.7 is not of any use...

You should talk to Jan Nijtmans, who implemented the UTF-16 thing. I'm
not sure whether this case is set in stone already; however, my
understanding of this whole thing is shallow.

Christian
