As usual with technical terms "everyone understands", it gets thrown
around everywhere but is never defined. The definition I derived is
below.
The non-ASCII part of UTF-8 is composed of 5 ranges, each of which
starts at a number that has only one bit set. The starting numbers
are 0x80, 0x800, 0x10000, 0x200000 and 0x4000000. An encoding is
'overlong' when this start bit isn't set.
Expressed as (left) shift arguments, the start bits are 7, 11, 16, 21
and 26.
Each range is composed of a number of six-bit blocks plus a
remainder which gets put into the byte starting the encoded
sequence. Again expressed as (left) shift arguments, the highest bits
of the left-most six-bit blocks are 5, 11, 17, 23 and 29.
Subtracting the shift value corresponding to the highest bit in the
first six-bit block from the shift value of the start bit yields the
position of this start bit relative to the highest bit in the first
six-bit block. The corresponding values are 2, 0, -1, -2 and -3.
The first case is special because the start bit is the bit at
position 1 in the first byte. All other start bits are in the second
byte, at positions 5, 4, 3 and 2.
An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. Ignoring
the initial special case, the shift value of the start bit relative
to the start of the first six-bit block is 8 - the sequence length
(a small program verifying these numbers follows the list):
3 -> 5
4 -> 4
5 -> 3
6 -> 2
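For illustration, here is a small C sketch (my addition, not part of
the derivation above) that tabulates these values and prints the
start-bit position in the second byte for the non-special cases:

#include <stdio.h>

int main(void)
{
    /* shifts of the range start bits 0x80, 0x800, 0x10000, ... */
    static const int start_shift[] = {7, 11, 16, 21, 26};
    unsigned len;

    for (len = 2; len <= 6; ++len) {
        int block_high = 6 * (len - 1) - 1;           /* 5, 11, 17, 23, 29 */
        int rel = start_shift[len - 2] - block_high;  /* 2, 0, -1, -2, -3 */

        printf("len %u: start shift %d, block high %d, rel %d",
               len, start_shift[len - 2], block_high, rel);
        if (len > 2)
            /* position of the start bit in the second byte: 8 - len */
            printf(", second-byte position %d",
                   start_shift[len - 2] - 6 * (len - 2));
        putchar('\n');
    }
    return 0;
}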
Any corrections or other comments very much welcome.
Rainer Weikusat <rweikusat@talktalk.net> writes:
> As usual with technical terms "everyone understands", it gets thrown
> around everywhere but is never defined. The definition I derived is
> below.
> The non-ASCII part of UTF-8 is composed of 5 ranges, each of which
> starts at a number that has only one bit set. The starting numbers
> are 0x80, 0x800, 0x10000, 0x200000 and 0x4000000. An encoding is
> 'overlong' when this start bit isn't set.
I'd express it in terms of magnitude. An overlong 2-byte sequence
will decode to a value less than 0x80, an overlong 3-byte sequence to
a value less than 0x800, and so on. Or going the other way, you need
at least two bytes if the value to encode is >= 0x80, 3 bytes if it's
>= 0x800, and so on.
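A minimal sketch of that check (my illustration, not code from the
thread): decode the sequence, then compare the result against the
smallest value that actually needs its length.

/* smallest value that requires a sequence of the given length;
   the index is the length in bytes (2..6) */
static const unsigned long utf8_min[7] = {
    0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000
};

/* returns nonzero if the decoded value was overlong for its length */
static int overlong_by_magnitude(unsigned long value, unsigned len)
{
    return value < utf8_min[len];
}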
> Expressed as (left) shift arguments, the start bits are 7, 11, 16, 21
> and 26.
> Each range is composed of a number of six-bit blocks plus a
> remainder which gets put into the byte starting the encoded
> sequence. Again expressed as (left) shift arguments, the highest bits
> of the left-most six-bit blocks are 5, 11, 17, 23 and 29.
> Subtracting the shift value corresponding to the highest bit in the
> first six-bit block from the shift value of the start bit yields the
> position of this start bit relative to the highest bit in the first
> six-bit block. The corresponding values are 2, 0, -1, -2 and -3.
> The first case is special because the start bit is the bit at
> position 1 in the first byte. All other start bits are in the second
> byte, at positions 5, 4, 3 and 2.
> An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. Ignoring
> the initial special case, the shift value of the start bit relative
> to the start of the first six-bit block is 8 - the sequence length:
> 3 -> 5
> 4 -> 4
> 5 -> 3
> 6 -> 2
> Any corrections or other comments very much welcome.
I was not sure what this part of the description was supposed to add to
the initial definition.
Rainer Weikusat <rweikusat@talktalk.net> writes:
> As usual with technical terms "everyone understands", it gets thrown
> around everywhere but is never defined. The definition I derived is
> below.
Unicode only defines character values up to 0x10fffd, so there are no
valid encodings longer than 4 octets.
Here's a table I came up with a while ago:
00-7F          ( 7 bits)   0xxxxxxx
0080-07FF      (11 bits)   110xxxxx 10xxxxxx
0800-FFFF      (16 bits)   1110xxxx 10xxxxxx 10xxxxxx
010000-10FFFF  (21 bits)   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
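An encoder following this table directly might look like the sketch
below (my illustration; the function name and interface are made up):

#include <stddef.h>

/* Writes the UTF-8 encoding of c to out (at least 4 bytes free);
   returns the number of bytes written, 0 if c is beyond 0x10FFFF. */
static size_t utf8_encode(unsigned long c, unsigned char *out)
{
    if (c < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xc0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3f));
        return 2;
    }
    if (c < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xe0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
        out[2] = (unsigned char)(0x80 | (c & 0x3f));
        return 3;
    }
    if (c < 0x110000) {                  /* 11110xxx 10xxxxxx ... */
        out[0] = (unsigned char)(0xf0 | (c >> 18));
        out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3f));
        out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
        out[3] = (unsigned char)(0x80 | (c & 0x3f));
        return 4;
    }
    return 0;
}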
Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
> Rainer Weikusat <rweikusat@talktalk.net> writes:
>> As usual with technical terms "everyone understands", it gets thrown
>> around everywhere but is never defined. The definition I derived is
>> below.
>> The non-ASCII part of UTF-8 is composed of 5 ranges, each of which
>> starts at a number that has only one bit set. The starting numbers
>> are 0x80, 0x800, 0x10000, 0x200000 and 0x4000000. An encoding is
>> 'overlong' when this start bit isn't set.
> I'd express it in terms of magnitude. An overlong 2-byte sequence
> will decode to a value less than 0x80, an overlong 3-byte sequence
> to a value less than 0x800, and so on. Or going the other way, you
> need at least two bytes if the value to encode is >= 0x80, 3 bytes
> if it's >= 0x800, and so on.
Yes. That's an error I made: an overlong sequence is one where none of
the bits between the end of the prefix and the start bit (inclusive)
are set.
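Concretely, for the 3-byte case (my illustration): the prefix is 1110
in the first byte and the start bit is bit 5 of the second byte, so a
sequence is overlong exactly when the first byte's value bits and that
start bit are all clear.

#include <stdio.h>

int main(void)
{
    unsigned char bad[]  = {0xe0, 0x80, 0x80};  /* overlong: encodes 0 */
    unsigned char good[] = {0xe0, 0xa0, 0x80};  /* fine: encodes 0x800 */

    /* are the bits between the end of the prefix (1110) and the start
       bit (bit 5 of the second byte), inclusive, all clear? */
    printf("%d %d\n",
           (bad[0] & 0x0f) == 0 && (bad[1] & 0x20) == 0,
           (good[0] & 0x0f) == 0 && (good[1] & 0x20) == 0);
    return 0;   /* prints: 1 0 */
}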
[...]
>> Expressed as (left) shift arguments, the start bits are 7, 11, 16, 21
>> and 26.
>> Each range is composed of a number of six-bit blocks plus a
>> remainder which gets put into the byte starting the encoded
>> sequence. Again expressed as (left) shift arguments, the highest bits
>> of the left-most six-bit blocks are 5, 11, 17, 23 and 29.
>> Subtracting the shift value corresponding to the highest bit in the
>> first six-bit block from the shift value of the start bit yields the
>> position of this start bit relative to the highest bit in the first
>> six-bit block. The corresponding values are 2, 0, -1, -2 and -3.
>> The first case is special because the start bit is the bit at
>> position 1 in the first byte. All other start bits are in the second
>> byte, at positions 5, 4, 3 and 2.
>> An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. Ignoring
>> the initial special case, the shift value of the start bit relative
>> to the start of the first six-bit block is 8 - the sequence length:
>> 3 -> 5
>> 4 -> 4
>> 5 -> 3
>> 6 -> 2
>> Any corrections or other comments very much welcome.
> I was not sure what this part of the description was supposed to add
> to the initial definition.
I want to calculate that with a general algorithm.
Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
> Rainer Weikusat <rweikusat@talktalk.net> writes:
>> [...]
> Unicode only defines character values up to 0x10fffd, so there are no
> valid encodings longer than 4 octets.
The Linux UTF-8 man page also lists 5- and 6-byte sequences.
Rainer Weikusat <rweikusat@talktalk.net> writes:
>>> An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. Ignoring
>>> the initial special case, the shift value of the start bit relative
>>> to the start of the first six-bit block is 8 - the sequence length:
>>> 3 -> 5
>>> 4 -> 4
>>> 5 -> 3
>>> 6 -> 2
>>> Any corrections or other comments very much welcome.
>> I was not sure what this part of the description was supposed to add
>> to the initial definition.
> I want to calculate that with a general algorithm.
I don't know what "that" refers to. Do you want to calculate the UTF-8
sequence length from the code point? It seems not. Do you want to
determine if a sequence is overlong by looking at the sequence? It
seems not. What is the algorithm given, and what is its result?
Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
> Rainer Weikusat <rweikusat@talktalk.net> writes:
> [...]
>>>> An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. Ignoring
>>>> the initial special case, the shift value of the start bit relative
>>>> to the start of the first six-bit block is 8 - the sequence length:
>>>> 3 -> 5
>>>> 4 -> 4
>>>> 5 -> 3
>>>> 6 -> 2
>>>> Any corrections or other comments very much welcome.
>>> I was not sure what this part of the description was supposed to add
>>> to the initial definition.
>> I want to calculate that with a general algorithm.
> I don't know what "that" refers to. Do you want to calculate the
> UTF-8 sequence length from the code point? It seems not. Do you want
> to determine if a sequence is overlong by looking at the sequence?
> It seems not. What is the algorithm given, and what is its result?
I want to determine if a sequence is overlong using a generalized
algorithm for that, i.e., not by special-casing start byte values. So
far, the untested (and very likely buggy) code for this looks as
follows. u_len is the length of the sequence in bytes, p a pointer to
the first byte. Some unrelated consistency checks have been removed.
mask = (1 << (8 - u_len)) - 1; /* all value bits in the first byte set */
x = *p & mask;
if (u_len == 2)
        if (x < 2) return U_BIN; /* 2-byte sequence overlong if at most
                                    the lowest value bit is set */
y = *++p;
if (!x) { /* x == 0 implies u_len > 2 */
        mask = ~((1 << (8 - u_len)) - 1); /* all bits down to the start
                                             bit in the 2nd byte set */
        if ((y & mask) == 0x80) return U_BIN; /* overlong if only the
                                                 continuation pattern is set */
}
> I want to determine if a sequence is overlong using a generalized
> algorithm for that
I saw somewhere that 5- and 6-byte sequences were originally defined
because it was thought they would be needed, but UTF-8 is now limited
to 4 bytes.
John McCue, in message <splvjm$msm$1@dont-email.me>, wrote:
> I saw somewhere that 5- and 6-byte sequences were originally defined
> because it was thought they would be needed, but UTF-8 is now limited
> to 4 bytes.
Unicode was limited to 20-21 bits because Microsoft and Sun decided to use UTF-16 to go beyond 16 bits instead of making their ABI evolve with regard
to sizeof(wchar_t) or equivalent.
Rainer Weikusat <rweikusat@talktalk.net> writes:
> Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
> u_len is the length of the sequence in bytes,
How have you calculated u_len? You can detect an overlong sequence
without knowing it, so there is some risk in using it when it's not
needed.
> p a pointer to the first byte. Some unrelated consistency checks
> have been removed.
> mask = (1 << (8 - u_len)) - 1; /* all value bits in the first byte set */
That includes one more bit than you want.
I don't see why you need to look at the next byte.
> if (!x) { /* x == 0 implies u_len > 2 */
x == 0 implies an overlong sequence now that you have dealt with the
length 2 case, which can have one bit of x set and still be overlong.
Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
> I don't see why you need to look at the next byte.
>> if (!x) { /* x == 0 implies u_len > 2 */
> x == 0 implies an overlong sequence now that you have dealt with the
> length 2 case, which can have one bit of x set and still be overlong.
According to the Linux man page, a number in the range 0x800 - 0xffff
is encoded as three bytes:
1110xxxx 10xxxxxx 10xxxxxx
A program encoding 0x800 in this way:
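(A minimal reconstruction of such a program, as a sketch; the exact
original listing is not reproduced here:)

#include <stdio.h>

int main(void)
{
    unsigned c = 0x800;
    unsigned char seq[3];

    seq[0] = (unsigned char)(0xe0 | (c >> 12));         /* 1110xxxx */
    seq[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3f)); /* 10xxxxxx */
    seq[2] = (unsigned char)(0x80 | (c & 0x3f));        /* 10xxxxxx */
    printf("%02x %02x %02x\n", seq[0], seq[1], seq[2]);
    return 0;
}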
And the output is e0 a0 80.
Rainer Weikusat <rweikusat@talktalk.net> writes:
> Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
> [...]
>> I don't see why you need to look at the next byte.
>>> if (!x) { /* x == 0 implies u_len > 2 */
>> x == 0 implies an overlong sequence now that you have dealt with the
>> length 2 case, which can have one bit of x set and still be overlong.
> According to the Linux man page, a number in the range 0x800 - 0xffff
> is encoded as three bytes:
> 1110xxxx 10xxxxxx 10xxxxxx
In terms of bit masks, the part of the second byte that matters is
(b2 & 0x3f) >> (8 - u_len).
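Putting the discussion together, a consolidated sketch of the
generalized check (my own consolidation, not code posted in the
thread) might look like this:

/* p points at the first byte of a structurally valid sequence,
   len is its length in bytes (2..6); returns nonzero if the
   sequence is overlong */
static int is_overlong(const unsigned char *p, unsigned len)
{
    /* value bits of the start byte, excluding the length prefix */
    unsigned x = p[0] & ((1u << (7 - len)) - 1);

    if (len == 2)
        return x < 2; /* the start bit (shift 7) is still in the first byte */
    /* for longer sequences the start bit sits in the second byte at
       position 8 - len; the sequence is overlong iff no bit at or
       above that position is set in the first two bytes */
    return x == 0 && ((p[1] & 0x3f) >> (8 - len)) == 0;
}

For e0 a0 80 this returns 0, since the start bit (bit 5 of the second
byte) is set, while for the overlong e0 80 80 it returns 1.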