Forum: >>> Magnum BBS <<<

UTF-8 overlong encodings

From Rainer Weikusat@21:1/5 to All on Fri Dec 17 22:55:51 2021

As usual with technical terms "everyone understands", it gets thrown
around everywhere but is never defined. The definition I derived is
below.

The non-ASCII part of UTF-8 is composed of 5 ranges each of which
starts with a number which has only one bit set. The starting numbers
are 0x80, 0x800, 0x10000, 0x200000 and 0x4000000. An encoding is
'overlong' when this start bit isn't set.

Expressed as (left) shift arguments, the starting bits are 7, 11, 16, 21
and 26.

Each range is composed of a number of six bit blocks plus a
remainder which gets put into the byte starting the encoded
sequence. Again expressed as (left) shift arguments, the highest bits of
the left-most six bit blocks are 5, 11, 17, 23, 29.

Subtracting the shift value corresponding with the highest bit in the
first six bit block from the shift value of the starting bit yiels the
position of this starting bit relative to the highest bit in the first
six bit block. The corresponding values are 2, 0, -1, -2 and -3.

The first case is special because the starting bit is the bit
corresponding with 1 in the first byte. All other start bits are in the
second byte, at positions 5, 4, 3 and 2.

An encoded sequences has a length of 2, 3, 4, 5 or 6 bytes. When
ignoring the initial special case, the shift value relative to the start
of the first six bit block for each encoded sequence is 8 -
its length:

3 -> 5
4 -> 4
5 -> 3
6 -> 2

Any corrections or other comments very much welcome.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ben Bacarisse@21:1/5 to Rainer Weikusat on Sat Dec 18 00:04:05 2021

Rainer Weikusat <rweikusat@talktalk.net> writes:

As usual with technical terms "everyone understands", it gets thrown
around everywhere but is never defined. The definition I derived is
below.

The non-ASCII part of UTF-8 is composed of 5 ranges each of which
starts with a number which has only one bit set. The starting numbers
are 0x80, 0x800, 0x10000, 0x200000 and 0x4000000. An encoding is
'overlong' when this start bit isn't set.

I'd express it in terms of magnitude. An overlong 2-byte sequence will
decode to a value than 0x80. An overlong encoded 3-byte value will be
less than 0x800 and so on. Or going the other way, you need at least
two byte if the value to encode is >= 0x80, 3 bytes if it's >= 0x800 and
so on.

When looking at the encoding itself, an overlong sequence is one that
starts with one of the bytes C0, C1, E0, F0, F8 or FC.

Unicode has said it won't use more than the 21 bits available in the
four-byte encoding, so all sequences of length 5 or 6 are "overlong",
although in an entirely different sense.

Expressed as (left) shift arguments, the starting bits are 7, 11, 16, 21
and 26.

Each range is composed of a number of six bit blocks plus a
remainder which gets put into the byte starting the encoded
sequence. Again expressed as (left) shift arguments, the highest bits of
the left-most six bit blocks are 5, 11, 17, 23, 29.

Subtracting the shift value corresponding with the highest bit in the
first six bit block from the shift value of the starting bit yiels the position of this starting bit relative to the highest bit in the first
six bit block. The corresponding values are 2, 0, -1, -2 and -3.

The first case is special because the starting bit is the bit
corresponding with 1 in the first byte. All other start bits are in the second byte, at positions 5, 4, 3 and 2.

An encoded sequences has a length of 2, 3, 4, 5 or 6 bytes. When
ignoring the initial special case, the shift value relative to the start
of the first six bit block for each encoded sequence is 8 -
its length:

3 -> 5
4 -> 4
5 -> 3
6 -> 2

Any corrections or other comments very much welcome.

I was not sure what this part of the description was supposed to add to
the initial definition.

--
Ben.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Keith Thompson@21:1/5 to Rainer Weikusat on Fri Dec 17 16:37:00 2021

Rainer Weikusat <rweikusat@talktalk.net> writes:

As usual with technical terms "everyone understands", it gets thrown
around everywhere but is never defined. The definition I derived is
below.

The non-ASCII part of UTF-8 is composed of 5 ranges each of which
starts with a number which has only one bit set. The starting numbers
are 0x80, 0x800, 0x10000, 0x200000 and 0x4000000. An encoding is
'overlong' when this start bit isn't set.

Expressed as (left) shift arguments, the starting bits are 7, 11, 16, 21
and 26.

Each range is composed of a number of six bit blocks plus a
remainder which gets put into the byte starting the encoded
sequence. Again expressed as (left) shift arguments, the highest bits of
the left-most six bit blocks are 5, 11, 17, 23, 29.

Subtracting the shift value corresponding with the highest bit in the
first six bit block from the shift value of the starting bit yiels the position of this starting bit relative to the highest bit in the first
six bit block. The corresponding values are 2, 0, -1, -2 and -3.

The first case is special because the starting bit is the bit
corresponding with 1 in the first byte. All other start bits are in the second byte, at positions 5, 4, 3 and 2.

An encoded sequences has a length of 2, 3, 4, 5 or 6 bytes. When
ignoring the initial special case, the shift value relative to the start
of the first six bit block for each encoded sequence is 8 -
its length:

3 -> 5
4 -> 4
5 -> 3
6 -> 2

Any corrections or other comments very much welcome.

Unicode only defines character values up to 0x10fffd, so there are no
valid encodings longer than 4 octets.

Here's a table I came up with a while ago:

00-7F (7 bits) 0xxxxxxx
0080-07FF (11 bits) 110xxxxx 10xxxxxx
0800-FFFF (16 bits) 1110xxxx 10xxxxxx 10xxxxxx
010000-10FFFF (21 bits) 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The character code is determined by concatenating the 'x's together.
A 1-octet encoding has 7 value bits.
A 2-octet encoding has 11 value bits.
And so on.

An octet starting with 0 is always a single-octet character (ASCII compatible). An octet starting with 11 is always the first octet of a multi-octet encoding. An octet starting with 10 is always a continuation octet.

Overlong encodings that use more octets than necessary are invalid. For example, the letter 'k' is 0x6b or 1101011 and is encoded in a single
octet:
01101011
-------
A two-octet encoding of the same character code is invalid:
11000001 10101011
----- ------

You could extrapolate UTF-8 to up to 8-octet encodings, representing up to
42 bits, but that's also not valid UTF-8 (though I can imagine it being
useful for some purposes).

110000-3FFFFFF (26 bits) 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
4000000-7FFFFFF (31 bits) 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
8000000-FFFFFFFFF (36 bits) 11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
1000000000-3FFFFFFFFFF (42 bits) 11111111 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

https://en.wikipedia.org/wiki/UTF-8

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rainer Weikusat@21:1/5 to Ben Bacarisse on Sat Dec 18 17:15:12 2021

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

Rainer Weikusat <rweikusat@talktalk.net> writes:

As usual with technical terms "everyone understands", it gets thrown
around everywhere but is never defined. The definition I derived is
below.

The non-ASCII part of UTF-8 is composed of 5 ranges each of which
starts with a number which has only one bit set. The starting numbers
are 0x80, 0x800, 0x10000, 0x200000 and 0x4000000. An encoding is
'overlong' when this start bit isn't set.

I'd express it in terms of magnitude. An overlong 2-byte sequence will decode to a value than 0x80. An overlong encoded 3-byte value will be
less than 0x800 and so on. Or going the other way, you need at least
two byte if the value to encode is >= 0x80, 3 bytes if it's >= 0x800 and
so on.

Yes. That's an error I made: An overlong sequence is one where none of
the bits between the end of the prefix and the start bit (inclusive) are
set.

[...]

Expressed as (left) shift arguments, the starting bits are 7, 11, 16, 21
and 26.

Each range is composed of a number of six bit blocks plus a
remainder which gets put into the byte starting the encoded
sequence. Again expressed as (left) shift arguments, the highest bits of
the left-most six bit blocks are 5, 11, 17, 23, 29.

Subtracting the shift value corresponding with the highest bit in the
first six bit block from the shift value of the starting bit yiels the
position of this starting bit relative to the highest bit in the first
six bit block. The corresponding values are 2, 0, -1, -2 and -3.

The first case is special because the starting bit is the bit
corresponding with 1 in the first byte. All other start bits are in the
second byte, at positions 5, 4, 3 and 2.

An encoded sequences has a length of 2, 3, 4, 5 or 6 bytes. When
ignoring the initial special case, the shift value relative to the start
of the first six bit block for each encoded sequence is 8 -
its length:

3 -> 5
4 -> 4
5 -> 3
6 -> 2

Any corrections or other comments very much welcome.

I was not sure what this part of the description was supposed to add to
the initial definition.

I want to calculate that with a general algorithm.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rainer Weikusat@21:1/5 to Keith Thompson on Sat Dec 18 17:19:17 2021

Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

Rainer Weikusat <rweikusat@talktalk.net> writes:

As usual with technical terms "everyone understands", it gets thrown
around everywhere but is never defined. The definition I derived is
below.

[...]

Unicode only defines character values up to 0x10fffd, so there are no
valid encodings longer than 4 octets.

Here's a table I came up with a while ago:

00-7F (7 bits) 0xxxxxxx
0080-07FF (11 bits) 110xxxxx 10xxxxxx
0800-FFFF (16 bits) 1110xxxx 10xxxxxx 10xxxxxx
010000-10FFFF (21 bits) 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The Linux UTF-8 man page also has 5 and 6 byte sequences.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ben Bacarisse@21:1/5 to Rainer Weikusat on Sat Dec 18 23:14:15 2021

Rainer Weikusat <rweikusat@talktalk.net> writes:

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

Rainer Weikusat <rweikusat@talktalk.net> writes:

As usual with technical terms "everyone understands", it gets thrown
around everywhere but is never defined. The definition I derived is
below.

The non-ASCII part of UTF-8 is composed of 5 ranges each of which
starts with a number which has only one bit set. The starting numbers
are 0x80, 0x800, 0x10000, 0x200000 and 0x4000000. An encoding is
'overlong' when this start bit isn't set.

I'd express it in terms of magnitude. An overlong 2-byte sequence will
decode to a value than 0x80. An overlong encoded 3-byte value will be
less than 0x800 and so on. Or going the other way, you need at least
two byte if the value to encode is >= 0x80, 3 bytes if it's >= 0x800 and
so on.

Yes. That's an error I made: An overlong sequence is one where none of
the bits between the end of the prefix and the start bit (inclusive) are
set.

[...]

Expressed as (left) shift arguments, the starting bits are 7, 11, 16, 21 >>> and 26.

Each range is composed of a number of six bit blocks plus a
remainder which gets put into the byte starting the encoded
sequence. Again expressed as (left) shift arguments, the highest bits of >>> the left-most six bit blocks are 5, 11, 17, 23, 29.

Subtracting the shift value corresponding with the highest bit in the
first six bit block from the shift value of the starting bit yiels the
position of this starting bit relative to the highest bit in the first
six bit block. The corresponding values are 2, 0, -1, -2 and -3.

The first case is special because the starting bit is the bit
corresponding with 1 in the first byte. All other start bits are in the
second byte, at positions 5, 4, 3 and 2.

An encoded sequences has a length of 2, 3, 4, 5 or 6 bytes. When
ignoring the initial special case, the shift value relative to the start >>> of the first six bit block for each encoded sequence is 8 -
its length:

3 -> 5
4 -> 4
5 -> 3
6 -> 2

Any corrections or other comments very much welcome.

I was not sure what this part of the description was supposed to add to
the initial definition.

I want to calculate that with a general algorithm.

I don't know what "that" refers to. Do you want to calculate the UTF-8 sequence length from the code point? It seems not. Do you want to
determine if a sequence is overlong by looking at the sequence? It
seems not. What is the algorithm given, and what it its result?

--
Ben.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John McCue@21:1/5 to Rainer Weikusat on Sun Dec 19 00:49:58 2021

Rainer Weikusat <rweikusat@talktalk.net> wrote:

Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

Rainer Weikusat <rweikusat@talktalk.net> writes:

<snip>

Unicode only defines character values up to 0x10fffd, so there are no
valid encodings longer than 4 octets.

This is my understanding also.

<snip>

The Linux UTF-8 man page also has 5 and 6 byte sequences.

I saw somewhere 5 and 6 byte sequences were originally
defined or thought it would be needed, but now limited
to 4 bytes.

--
csh(1) - "An elegant shell, for a more... civilized age."
- Paraphrasing Star Wars

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rainer Weikusat@21:1/5 to Rainer Weikusat on Sun Dec 19 21:53:25 2021

Rainer Weikusat <rweikusat@talktalk.net> writes:

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

Rainer Weikusat <rweikusat@talktalk.net> writes:

[...]

An encoded sequences has a length of 2, 3, 4, 5 or 6 bytes. When
ignoring the initial special case, the shift value relative to the start >>>>> of the first six bit block for each encoded sequence is 8 -
its length:

3 -> 5
4 -> 4
5 -> 3
6 -> 2

Any corrections or other comments very much welcome.

I was not sure what this part of the description was supposed to add to >>>> the initial definition.

I want to calculate that with a general algorithm.

I don't know what "that" refers to. Do you want to calculate the UTF-8
sequence length from the code point? It seems not. Do you want to
determine if a sequence is overlong by looking at the sequence? It
seems not. What is the algorithm given, and what it its result?

I want to determine if a sequence is overlong using a generalized
algorithm for that, ie, not by special-casing start byte values. So far,
the untested (and very likely buggy) code for this looks like follows:

u_len is the length of the sequence in bytes, p a pointer to the first
byte. Some unrelated consistency checks removed.

At the moment, I'm convinced that this algorithm is complete
nonsense. :-)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rainer Weikusat@21:1/5 to Ben Bacarisse on Sun Dec 19 21:39:02 2021

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

Rainer Weikusat <rweikusat@talktalk.net> writes:

[...]

An encoded sequences has a length of 2, 3, 4, 5 or 6 bytes. When
ignoring the initial special case, the shift value relative to the start >>>> of the first six bit block for each encoded sequence is 8 -
its length:

3 -> 5
4 -> 4
5 -> 3
6 -> 2

Any corrections or other comments very much welcome.

I was not sure what this part of the description was supposed to add to
the initial definition.

I want to calculate that with a general algorithm.

I don't know what "that" refers to. Do you want to calculate the UTF-8 sequence length from the code point? It seems not. Do you want to
determine if a sequence is overlong by looking at the sequence? It
seems not. What is the algorithm given, and what it its result?

I want to determine if a sequence is overlong using a generalized
algorithm for that, ie, not by special-casing start byte values. So far,
the untested (and very likely buggy) code for this looks like follows:

u_len is the length of the sequence in bytes, p a pointer to the first
byte. Some unrelated consistency checks removed.

mask = (1 << (8 - u_len)) - 1; /* all value bits in the first byte set */
x = *p & mask;
if (u_len == 2) if (x < 2) return U_BIN; /* 2 byte sequence overlong if only the lowest bit set */

y = *++p;

if (!x) { /* x == 0 implies u_len > 2 */
mask = ~((1 << (8 - u_len)) - 1); /* all bits down to start bit in 2nd byte set */
if ((y & mask) == 0x80) return U_BIN; /* overlong if continuation pattern only */
}

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ben Bacarisse@21:1/5 to Rainer Weikusat on Mon Dec 20 03:57:24 2021

Rainer Weikusat <rweikusat@talktalk.net> writes:

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

Rainer Weikusat <rweikusat@talktalk.net> writes:

[...]

An encoded sequences has a length of 2, 3, 4, 5 or 6 bytes. When
ignoring the initial special case, the shift value relative to the start >>>>> of the first six bit block for each encoded sequence is 8 -
its length:

3 -> 5
4 -> 4
5 -> 3
6 -> 2

Any corrections or other comments very much welcome.

I was not sure what this part of the description was supposed to add to >>>> the initial definition.

I want to calculate that with a general algorithm.

I don't know what "that" refers to. Do you want to calculate the UTF-8
sequence length from the code point? It seems not. Do you want to
determine if a sequence is overlong by looking at the sequence? It
seems not. What is the algorithm given, and what it its result?

I want to determine if a sequence is overlong using a generalized
algorithm for that, ie, not by special-casing start byte values.

I don't think I follow what you mean. Over long sequences are special
case so you have to special-case something. Why not the first byte? It
seems to be such a simple method.

So far,
the untested (and very likely buggy) code for this looks like follows:

u_len is the length of the sequence in bytes,

How have you calculated u_len? You can detect and overlong sequence
without knowing it, so there is some risk in using it when it's not
needed.

p a pointer to the first
byte. Some unrelated consistency checks removed.

mask = (1 << (8 - u_len)) - 1; /* all value bits in the first byte set */

That includes one more bit than you want. In a proper UTF-8 sequence,
that bit will be zero, so it's harmless, but have you already checked
that the sequence is valid (other than possibly being overlong).

By the way, I'd use 0xff >> u_len to get the mask. It seems more
natural.

x = *p & mask;
if (u_len == 2) if (x < 2) return U_BIN; /* 2 byte sequence overlong if only the lowest bit set */

(or if no bits are set, but you include that in your test)

y = *++p;

I don't see why you need to look at the next byte.

if (!x) { /* x == 0 implies u_len > 2 */

x == 0 implies an overlong sequence now that you have dealt with the
length 2 case which can have one bit on x set and still be overlong.

mask = ~((1 << (8 - u_len)) - 1); /* all bits down to start bit in 2nd byte set */
if ((y & mask) == 0x80) return U_BIN; /* overlong if continuation pattern only */
}

--
Ben.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Nicolas George@21:1/5 to All on Mon Dec 20 09:42:06 2021

John McCue , dans le message <splvjm$msm$1@dont-email.me>, a �crit�:

I saw somewhere 5 and 6 byte sequences were originally
defined or thought it would be needed, but now limited
to 4 bytes.

Unicode was limited to 20-21 bits because Microsoft and Sun decided to use UTF-16 to go beyond 16 bits instead of making their ABI evolve with regard
to sizeof(wchar_t) or equivalent.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Nicolas George@21:1/5 to All on Mon Dec 20 13:34:08 2021

Rainer Weikusat , dans le message <87tuf446mh.fsf@doppelsaurus.mobileactivedefense.com>, a �crit�:

I want to determine if a sequence is overlong using a generalized
algorithm for that

Just decode and re-encode and see if the length is the same.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Keith Thompson@21:1/5 to Nicolas George on Mon Dec 20 10:56:19 2021

Nicolas George <nicolas$george@salle-s.org> writes:

John McCue , dans le message <splvjm$msm$1@dont-email.me>, a écrit :

I saw somewhere 5 and 6 byte sequences were originally
defined or thought it would be needed, but now limited
to 4 bytes.

Unicode was limited to 20-21 bits because Microsoft and Sun decided to use UTF-16 to go beyond 16 bits instead of making their ABI evolve with regard
to sizeof(wchar_t) or equivalent.

I don't see how that follows. UTF-16, at least in its current form, can represent all 1,112,064 valid Unicode code points.

Sun used UTF-16 for Java. UTF-16, as far as I know, was rare on Solaris.

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rainer Weikusat@21:1/5 to Ben Bacarisse on Mon Dec 20 18:28:47 2021

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

Rainer Weikusat <rweikusat@talktalk.net> writes:

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

[...]

u_len is the length of the sequence in bytes,

How have you calculated u_len? You can detect and overlong sequence
without knowing it, so there is some risk in using it when it's not
needed.

Assuming x is the start byte of a UTF-8 sequences stored as unsigned
32-bit integer, the length of the sequence is (using a gcc extension)

__builtin_clz(x ^ 0xff) - 24

All bits left of 0x80 will already be clear. x ^ 0xff will clear all
bits up to the trailing 0 bit of the prefix and set that to 1.

p a pointer to the first
byte. Some unrelated consistency checks removed.

mask = (1 << (8 - u_len)) - 1; /* all value bits in the first byte set */

That includes one more bit than you want.

Indeed. 8 - u_len is the shift index of the last non-zero prefix bit. It
should have been 7 - u_len or (8 - (u_len + 1)).

[...]

I don't see why you need to look at the next byte.

if (!x) { /* x == 0 implies u_len > 2 */

x == 0 implies an overlong sequence now that you have dealt with the
length 2 case which can have one bit on x set and still be overlong.

According to the Linux man page, a number in the range 0x800 - 0xffff is encoded as three bytes:

1110xxxx 10xxxxxx 10xxxxxx

Program encoding 0x800 in this way:

--------
#include <stdio.h>

int main(void)
{
unsigned u = 0x800;
unsigned char utf[3], *p;

p = utf;
*p++ = 0xe0 | (u >> 12);
*p++ = 0x80 | ((u >> 6) & 63);
*p = 0x80 | (u & 63);

printf("%x %x %x\n", *utf, utf[1], utf[2]);

return 0;
}
-------

And the output is e0 a0 80. The situation is similar for all other
ranges, including the 4-byte sequence which is actually supposed to be
used: The first number of the range corresponds with a set bit in the second byte.

I may again have gotten something wrong here. But I've done all the calculations twice and got the same result.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rainer Weikusat@21:1/5 to Rainer Weikusat on Mon Dec 20 19:30:33 2021

Rainer Weikusat <rweikusat@talktalk.net> writes:

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

[...]

I don't see why you need to look at the next byte.

if (!x) { /* x == 0 implies u_len > 2 */

x == 0 implies an overlong sequence now that you have dealt with the
length 2 case which can have one bit on x set and still be overlong.

According to the Linux man page, a number in the range 0x800 - 0xffff is encoded as three bytes:

1110xxxx 10xxxxxx 10xxxxxx

Program encoding 0x800 in this way:

[...]

And the output is e0 a0 80.

Addition: Not technically a proof of correctness but the ActionCable
(rot in hell) UTF-8 checker I have to placate accepts 0xe0 0xa0 0x80 as
valid UTF-8 sequence.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ben Bacarisse@21:1/5 to Rainer Weikusat on Mon Dec 20 20:46:21 2021

Rainer Weikusat <rweikusat@talktalk.net> writes:

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

<cut>

I don't see why you need to look at the next byte.

if (!x) { /* x == 0 implies u_len > 2 */

x == 0 implies an overlong sequence now that you have dealt with the
length 2 case which can have one bit on x set and still be overlong.

According to the Linux man page, a number in the range 0x800 - 0xffff is encoded as three bytes:

Yup. I was not thinking. You need to look at the first two bytes when
the sequence length is > 2.

If b1 is 0xE0 then b2 must be >= 0xA0.
If b1 is 0xF0 then b2 must be >= 0x90.
If b1 is 0xF8 then b2 must be >= 0x88.
If b1 is 0xFC then b2 must be >= 0x84.

In this diagram, the ^ marks the least significant bit that must be set
for the encoding not to be overlong:

110x xxxx 10xx xxxx
^
1110 xxxx 10xx xxxx 10xx xxxx
^
1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
^
1111 10xx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx
^
1111 110x 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx
^
In terms of bit masks,

(b2 & 0x3f) >> 8 - ulen

must be non zero for ulen > 2.

Sorry for the noise.

--
Ben.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rainer Weikusat@21:1/5 to Ben Bacarisse on Mon Dec 20 21:24:19 2021

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

Rainer Weikusat <rweikusat@talktalk.net> writes:

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

<cut>

I don't see why you need to look at the next byte.

if (!x) { /* x == 0 implies u_len > 2 */

x == 0 implies an overlong sequence now that you have dealt with the
length 2 case which can have one bit on x set and still be overlong.

According to the Linux man page, a number in the range 0x800 - 0xffff is
encoded as three bytes:

[...]

In terms of bit masks,

(b2 & 0x3f) >> 8 - ulen

Idea I had meanwhile myself, too: The expressions become simpler when shifting out the
unwanted bits instead of selecting the wanted ones via &-masking. The
one above is for the second byte, the one for the first would be

b1 << u_len

with the special case that a result of 4 means it's overlong when the
length of the sequence is 2.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Bob Worm
  Tue Apr 16 21:34:28 2024
  from Wales, Uk via Telnet
- Jasonrx8
  Wed Apr 17 13:08:27 2024
  from Sydney Nsw via SSH
- Bob Worm
  Wed Apr 17 10:35:26 2024
  from Wales, Uk via Telnet
- Bob Worm
  Wed Apr 17 08:38:18 2024
  from Wales, Uk via Raw

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	292
Nodes:	16 (2 / 14)
Uptime:	179:34:09
Calls:	6,616
Calls today:	3
Files:	12,165
Messages:	5,314,103

UTF-8 overlong encodings

Who's Online

Recent Visitors

System Info