Forum: >>> Magnum BBS <<<

Simple(?) Unicode questions

From Janis Papanagnou@21:1/5 to All on Sat Dec 9 08:04:20 2023

After decades I'm again writing some C code and intended to use some
Unicode characters for output. I'm using C99. I have two questions.

I am able to inline the character in the code like: printf ("█\n");

But I also want to make it a printf argument: printf ("%c\n", '█');
which doesn't work (at least not in the depicted way).

And I want to declare such characters, like: char ch = '█';
which also doesn't work, and neither does: wchar_t ch = '█';
And ideally the character should not be copy/pasted into the code
but given by some standard representation like '\u2588' (or so).

Without giving all the gory details about the "problems of Unicode",
are there practical answers to those questions that "simply work"
and reliably?

I have experimented and observed that working with strings at least
*seems* to work: char * ch = "\u2588"; printf ("%s\n", ch);
Is that an acceptable/reliable and the usual way in C to tackle the
issue?

Thanks.

Janis

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Richard Damon@21:1/5 to Janis Papanagnou on Sat Dec 9 08:01:09 2023

On 12/9/23 2:04 AM, Janis Papanagnou wrote:

After decades I'm again writing some C code and intended to use some
Unicode characters for output. I'm using C99. I have two questions.

There are several things that are considered a "Character" in C.

We have the "char", which is a single "narrow" character;
we have character strings, which can represent multi-byte characters;
we have "wchar_t", which can represent "wide" characters as a single unit.

I am able to inline the character in the code like: printf ("█\n");

Because, while it isn't a single "narrow character", it can be
converted into a "multi-byte-character-string" that represents that
character.

But I also want to make it a printf argument: printf ("%c\n", '█');
which doesn't work (at least not in the depicted way).

Because it isn't a "narrow character" and thus can't be put into a
single "char".

And I want to declare such characters, like: char ch = '█';
which also doesn't work, and neither does: wchar_t ch = '█';
And ideally the character should not be copy/pasted into the code
but given by some standard representation like '\u2588' (or so).

You can use wchar_t ch = L'█'; or wchar_t ch = L'\u2588';
The key is that you are creating a WIDE character, not a narrow character.

Without giving all the gory details about the "problems of Unicode",
are there practical answers to those questions that "simply work"
and reliably?

I have experimented and observed that working with strings at least
*seems* to work: char * ch = "\u2588"; printf ("%s\n", ch);
Is that an acceptable/reliable and the usual way in C to tackle the
issue?

Thanks.

Janis

You need to decide whether you will represent the bigger set of
characters always as wide characters or as
multi-byte-character-strings.

Most often, it is the multi-byte-character-string, as wide characters
are less well supported in most systems.
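
The multi-byte-string approach can be sketched as follows. This is a minimal illustration, not code from the thread: the names FULL_BLOCK and make_bar are made up here, and the character is spelled as its three UTF-8 bytes so the result does not depend on how the compiler translates "\u2588" (that translation relies on the execution character set being UTF-8).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* U+2588 FULL BLOCK written as explicit UTF-8 bytes (an assumption:
   the output device expects UTF-8). */
#define FULL_BLOCK "\xE2\x96\x88"

/* Hypothetical helper: fill buf with `width` block characters and a
   terminating NUL. Returns the number of bytes stored (excluding the
   NUL), or 0 if the buffer is too small. */
size_t make_bar(char *buf, size_t bufsize, size_t width)
{
    const size_t glyph_len = sizeof FULL_BLOCK - 1;   /* 3 bytes */

    assert(buf != NULL);
    if (bufsize == 0)
        return 0;
    if (width > (bufsize - 1) / glyph_len) {
        buf[0] = '\0';
        return 0;
    }
    for (size_t i = 0; i < width; i++)
        memcpy(buf + i * glyph_len, FULL_BLOCK, glyph_len);
    buf[width * glyph_len] = '\0';
    return width * glyph_len;
}
```

The result is an ordinary C string, so it can be passed to printf("%s\n", buf) like any other text; only the byte count and the displayed width differ.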

From jak@21:1/5 to All on Sat Dec 9 15:59:08 2023

Janis Papanagnou wrote:

[...]

Hi,
You merged two questions together. I will try to separate them.
Initialization of wchar_t types:
just as char strings can be initialized with string literals:

char str[] = "Hello";

the same can be done for wchar_t type strings using the prefix L:

wchar_t wstr[] = L"Hello";
wchar_t wstr[] = L"█\n";
wchar_t wstr[] = L"\u2588\n";

A similar thing is possible for individual characters:

char ch = 'a';
wchar_t wch = L'a';

With the prefix L, it is therefore possible to use extended characters:

wchar_t wch = L'█';
or:
wchar_t wch = 0x2588;
or:
wchar_t wch = L'\u2588';
or:
wchar_t wch = L"\u2588"[0];
or:
wchar_t wch = *L"█";

printf also has a corresponding length modifier ('l') for the wchar_t
type:

printf("%s", str);
printf("%ls", wstr);

printf("%c", ch);
printf("%lc", wch);

But it would be more correct to use the wide-character version of printf
(in many cases it is also more convenient):

wprintf(L"%ls", wstr);
wprintf(L"%lc", wch);
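
One caveat when choosing between these calls: a stream becomes byte-oriented or wide-oriented with the first I/O operation on it, so printf and wprintf should not be mixed on the same stream. The conversion that "%lc" performs can also be done explicitly with wcrtomb. The sketch below is only an illustration; the name wide_to_locale and the fallback parameter are assumptions, not something proposed in the thread.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert one wide character to the multibyte
   encoding of the current locale. If the locale cannot represent it,
   the single byte `fallback` is stored instead. Returns the number of
   bytes stored in buf (no terminating NUL is added). */
size_t wide_to_locale(wchar_t wc, char buf[MB_LEN_MAX], char fallback)
{
    mbstate_t state;
    size_t n;

    assert(buf != NULL);
    memset(&state, 0, sizeof state);      /* initial conversion state */

    n = wcrtomb(buf, wc, &state);
    if (n == (size_t)-1) {                /* not representable in this locale */
        buf[0] = fallback;
        return 1;
    }
    return n;
}
```

After setlocale(LC_ALL, "") in a UTF-8 locale, converting L'\u2588' this way would be expected to yield three bytes; in the default "C" locale the fallback byte is the likely result.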

However, remember to set the locale to match your terminal; otherwise
the characters you see may not be the ones you expect, or you may see
nothing at all. Calling 'setlocale' lets the program convert the wide
character it prints into the encoding used by your terminal's locale.
To put it another way: if I write a program that prints a non-ASCII
Unicode character and my terminal uses UTF-8, I will not see anything
unless the program converts the character from its Unicode code point
to UTF-8. To show the conversion, I will send the character to a file:

cat foo.c

#include <stdio.h>
#include <stddef.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
    wchar_t wch = L'\u2588';
    FILE *fp;

    setlocale(LC_ALL, "");

    if((fp = fopen("char.txt", "wb")) != NULL)
    {
        fwprintf(fp, L"%lc", wch);
        fclose(fp);
    }
    return 0;
}

hexdump -C char.txt

00000000 e2 96 88 |...|
00000003

As you can see, the bytes written to the file (e2 96 88) are not the
code point value I passed in (0x2588) but its UTF-8 encoding. With
python it is easy to show the conversion:

python

>>> u'\u2588'.encode('utf-8')
b'\xe2\x96\x88'
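
The same conversion can be written by hand in C, which makes the bit layout behind e2 96 88 visible. This is a sketch under the usual UTF-8 rules; the function name utf8_encode is made up for the illustration and is not a standard library call.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one Unicode code point as UTF-8.
   Returns the number of bytes written to out (1 to 4), or 0 if cp is a
   surrogate or lies beyond U+10FFFF. out must have room for 4 bytes. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {                          /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                         /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                       /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                         /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                     /* 11110xxx + three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For 0x2588 the three-byte branch applies: the top four bits give 0xE2, the middle six give 0x96 and the low six give 0x88.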

From Janis Papanagnou@21:1/5 to Spiros Bousbouras on Sat Dec 9 17:59:32 2023

Thanks Richard, jak, and Spiros, for your explanations!

Some comments on the net about building wrappers around libraries,
and whatnot, irritated me.

In my initial tries I got confused about the error/warning message;
I had omitted the 'L' prefix for the character literal definition.
So that hint helped to get some assurance here.

On 09.12.2023 16:12, Spiros Bousbouras wrote:

My own approach would be to do as much as possible in my own code.

Same here.

If possible, I want to avoid external libraries, unnecessary
dependencies, and language constructs that are not guaranteed to
work reliably or that are non-portable, and I like simplicity and
transparency.

A lot depends on whether you need to pass your own characters (of
whatever type) to some external library which expects a specific type
like wchar_t or not. There are many different scenarios so I will cover
what would be most likely to occur in my own code.

My requirements are quite trivial and there's no exchange of data
between systems, processes, or applications. It's only data to be
displayed at the local screen.

- No external library involved.
- Output encoded in UTF-8
- The text editor I use to write the code stores everything as UTF-8.

With the above assumptions I would simply use ordinary C strings and put UTF-8 in them like "ΑΒΓΔΕΖΗΘ..." and output them in the ordinary way.

It's not guaranteed to work but it most likely will.

That exactly was my uncertainty.

[...]

And ideally the character should not be copy/pasted into the code
but given by some standard representation like '\u2588' (or so).

Why is that ? It seems to me that it makes the code harder to understand.

I'm not encoding non-latin texts (like your Greek example above).

In my case the characters are just "graphical candy", so it's not
important to "read" them; a comment behind the \u encoding appears
to me to be sufficient.

It may also be a habit to have a program coded as ASCII source;
during my first decades of programming there were no languages
that I used that supported anything else than ASCII (or EBCDIC,
or even less, like 6-bit character sets, in some cases [CDC]).

This way (so my assumption goes) fewer things can go wrong. I also
never programmed in languages where the program could be written
in one's native (non-English) language by using Unicode or UTF-8
encoding. I think I had the possibility in Java (but those days
were nothing but an episode, as seen from today).

What works reliably depends a lot on what you're trying to do. Unicode in general is messy.

Yeah, that's why I want to keep it as simple as possible; but it
should of course work reliably.

I have experimented and observed that working with strings at least
*seems* to work: char * ch = "\u2588"; printf ("%s\n", ch);
Is that an acceptable/reliable and the usual way in C to tackle the
issue?

If you do

char * ch = "\u2588" ;
size_t i ;
for (i = 0 ; ch[i] != 0 ; i++) {
    printf("%d " , ch[i]) ;
}
puts("") ;

what output do you get ? I will guess that you see the bytes
226 150 136 .

Almost. I get the same bytes as signed values (char is signed here):
-30 -106 -120

But why are you asking? - To show that "\u2588" is internally
represented by a [UTF-8] code sequence? - Ideally the interface
should not make me care about internal representations. :-)
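
The difference between 226 150 136 and -30 -106 -120 comes only from whether char is signed on the platform; converting each element to unsigned char before printing gives the same result everywhere. A small sketch, with format_bytes being a hypothetical helper name:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the bytes of s into out as unsigned
   decimal values separated by single spaces. Returns the number of
   bytes formatted, or 0 if out is too small. */
size_t format_bytes(const char *s, char *out, size_t outsize)
{
    size_t len, used = 0;

    assert(s != NULL && out != NULL && outsize > 0);
    out[0] = '\0';
    len = strlen(s);
    for (size_t i = 0; i < len; i++) {
        /* The cast yields 0..UCHAR_MAX even when plain char is signed. */
        int n = snprintf(out + used, outsize - used, "%s%u",
                         i ? " " : "", (unsigned)(unsigned char)s[i]);
        if (n < 0 || (size_t)n >= outsize - used) {
            out[0] = '\0';
            return 0;
        }
        used += (size_t)n;
    }
    return len;
}
```

With the UTF-8 bytes of U+2588 as input, the buffer ends up holding the three values guessed in the quoted question.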

The explanations and hints were all helpful - thanks again!

Janis

From Janis Papanagnou@21:1/5 to Spiros Bousbouras on Sat Dec 9 18:43:41 2023

On 09.12.2023 18:19, Spiros Bousbouras wrote:

On Sat, 9 Dec 2023 17:59:32 +0100
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

In my case the characters are just "graphical candy", so it's not
important to "read" them; a comment behind the \u encoding appears
to me to be sufficient.

Well , it's your code. If it is some kind of block characters based
"art" then it may even be more important to be able to see it in the
source.

I actually do have them visibly in my code; but non-functional,
as a comment. That way I have both, the [functional] safety and
the "readability". (And I don't mind the redundancy here.)

BTW, I also had situations the other way round, where I encode
programmatically characters and add comments with their values
(in decimal, hex, or binary, as it fits best for the purpose).
As an example, I had a case with similar or even equal glyphs,
and I wanted to have them specified exactly. A copy/paste from
some Web resource would, in my book, not have been good enough
for specification purposes; you couldn't tell them apart.

Janis

From jak@21:1/5 to All on Sat Dec 9 18:57:01 2023

Spiros Bousbouras wrote:

On Sat, 9 Dec 2023 15:59:08 +0100
jak <nospam@please.ty> wrote:

To explain myself better if I write a program that prints an extended
unicode character and my terminal uses the UTF-8 characters if the
program does not convert the character from Unicode to UTF-8 I will not
see anything. To prove it I will send the character to a file:

cat foo.c

#include <stdio.h>
#include <stddef.h>
#include <wchar.h>
#include <locale.h>

int main()
{
wchar_t wch = L'\u2588';
FILE *fp;

setlocale(LC_ALL, "");

if((fp = fopen("char.txt", "wb")) != NULL)
{
fwprintf(fp, L"%lc", wch);
fclose(fp);
}
return 0;
}

hexdump -C char.txt

00000000 e2 96 88 |...|
00000003

As you can see the character code is not the same that I sent.

In what way is it not the same as what you sent ? With hexdump you
can only hope to see octets regardless of what the octets encode. So
you read back the octets which are the UTF-8 encoding of codepoint
U+2588. What you got is exactly what I would expect to see. If you
use a terminal which supports UTF-8 and has the necessary font and
you do

Sorry but your comment is not clear to me. I gave this explanation
because it seemed to me that it was not clear to the OP that a
conversion takes place during the printf. Also I wouldn't take what
you say for granted:

cat foo.c

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
    /* c overlays w so the in-memory bytes of the wide string can be read */
    union {
        unsigned char c[10 * sizeof(wchar_t)];
        wchar_t w[10];
    } str = {.w = L"\u2588"};

    setlocale(LC_ALL, "");

    printf("\nraw data: ");
    for(size_t i = 0; str.c[i] != '\0'; i++)
        printf("%02X ", str.c[i]);
    printf("\n");

    FILE *fp;
    if((fp = fopen("char.txt", "wb")))
    {
        fwprintf(fp, L"%ls", str.w);
        fclose(fp);
    }
}

compiled with gcc:

gcc foo.c -o foo
foo

raw data: 88 25

od -t x1 char.txt

0000000 e2 96 88
0000003

compiled with tcc:

tcc foo.c
foo

raw data: 88 25

od -t x1 char.txt

0000000 88 25
0000002

Oops...
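
The "raw data" line shows the object representation of the wchar_t, which is separate from whatever the C library writes to the file. That representation can be inspected without a union by copying the object's bytes. The sketch below is an illustration only; wchar_raw_bytes is a made-up name, and the size and byte order it reports depend on the implementation (commonly 4 bytes on Linux and 2 on Windows).

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: copy the object representation of one wchar_t
   into out. Returns sizeof(wchar_t); out must have room for that many
   bytes. */
size_t wchar_raw_bytes(wchar_t wc, unsigned char *out)
{
    assert(out != NULL);
    memcpy(out, &wc, sizeof wc);   /* byte order is that of the machine */
    return sizeof wc;
}
```

On a little-endian machine the first two bytes for 0x2588 are 88 25, matching the output above. What ends up in char.txt then depends on how the library converts wide characters for that stream and locale.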

cat char.txt

what do you see ? I expect you will see the block character.

With python it is easy to highlight the conversion:

python

>>> u'\u2588'.encode('utf-8')
b'\xe2\x96\x88'

From spender@21:1/5 to All on Wed Dec 13 11:05:45 2023

printf("%c",ch), the ch must <0xFF, <255

In c lang, The character must be a character of an ASCII table, i.e. < (int)255. A string is a collection of characters.

On 2023/12/9 15:04, Janis Papanagnou wrote:

[...]

From Janis Papanagnou@21:1/5 to spender on Wed Dec 13 04:24:25 2023

On 2023/12/9 15:04, Janis Papanagnou wrote:

[...] intended to use some Unicode characters for output. [...]

On 13.12.2023 04:05, spender wrote:

printf("%c",ch), the ch must <0xFF, <255

The question was about the output of multi-octet Unicode characters,
it was not about single octet characters.

Though the question has also already been addressed by the other
replies, so don't bother.

In c lang, The character must be a character of an ASCII table,
i.e. < (int)255. A string is a collection of characters.

(Note, ASCII is 7 bit.) In the C language ordinary single-octet
characters may have values of -128..+127 or 0..255, depending on
whether the char type is defined as signed or unsigned.

And you can also output Unicode characters, as has been shown in
this thread.

Janis

From James Kuyper@21:1/5 to spender on Wed Dec 13 00:40:01 2023

On 12/12/23 22:05, spender wrote:

printf("%c",ch), the ch must <0xFF, <255

The only 'ch' in the code that you responded to was declared as "char
*", not char, and that value was used with a "%s" format specifier, for
which char* is the appropriate type.
*ch has char type, and as such must have a value between CHAR_MIN and
CHAR_MAX. If char is signed, CHAR_MIN == SCHAR_MIN, and SCHAR_MIN <=
-128. If char is unsigned, CHAR_MAX == UCHAR_MAX, and UCHAR_MAX >= 255.
Those are inequalities, not equalities, because 8 is the minimum value
for CHAR_BIT, rather than the only permitted value, and there are
real-world systems with other sizes (not many, to be fair), with
CHAR_BIT==16 being the most common alternative.
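
These limits can be queried from <limits.h> rather than assumed. The sketch below uses made-up helper names (char_is_signed, fits_in_char, as_unsigned_byte) purely to illustrate the inequalities described here.

```c
#include <assert.h>
#include <limits.h>

/* Nonzero if plain char is a signed type on this implementation. */
int char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Nonzero if `value` lies within the range of plain char. */
int fits_in_char(long value)
{
    return value >= CHAR_MIN && value <= CHAR_MAX;
}

/* Map a plain char to the range 0..UCHAR_MAX, the form in which the
   character I/O functions handle it. */
unsigned char as_unsigned_byte(char c)
{
    assert(fits_in_char(c));   /* holds for every char value */
    return (unsigned char)c;
}
```

On an implementation with CHAR_BIT == 8, fits_in_char(0x2588) is false, which is the underlying reason a block character cannot be stored in a single char.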

When ch is passed to printf() with "%c", it gets converted to unsigned
char. The maximum resulting value is UCHAR_MAX, which, as noted above,
is allowed to be >255.

In c lang, The character must be a character of an ASCII table, i.e. <

There is no such requirement. The standard explicitly describes the
encoding recognized by C standard library functions such as printf() as
implementation-defined and locale-dependent, and describes it as a
multibyte encoding, though MB_CUR_MAX and MB_LEN_MAX are both allowed to
be 1.
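
Because the multibyte encoding is locale-dependent, counting characters (as opposed to bytes) has to go through the library. A minimal sketch using mbrlen follows; the function name mb_char_count is an assumption made for this example.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Count the characters (not bytes) in a multibyte string, as judged by
   the encoding of the current locale. Returns (size_t)-1 if the string
   contains an invalid or truncated sequence. */
size_t mb_char_count(const char *s)
{
    mbstate_t state;
    size_t remaining, count = 0;

    assert(s != NULL);
    memset(&state, 0, sizeof state);      /* initial conversion state */
    remaining = strlen(s);
    while (remaining > 0) {
        size_t n = mbrlen(s, remaining, &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;            /* invalid or incomplete */
        if (n == 0)
            break;                        /* defensive: NUL reached */
        s += n;
        remaining -= n;
        count++;
    }
    return count;
}
```

For plain ASCII text the count equals strlen. For a string holding the UTF-8 bytes of U+2588 the answer depends on the locale in effect: a UTF-8 locale would be expected to report one character.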

On most Unix-like platforms, the default encoding is UTF-8. For
characters that can be represented in a single byte, that is equivalent
to 7-bit ASCII, not 8-bit, so the maximum is 127, not 255. There are
also a number of other encodings still in use, such as EBCDIC.

The standard only mentions ASCII twice, both times in non-normative
footnotes:
"17) The trigraph sequences enable the input of characters that are not
defined in the Invariant Code Set as described in ISO/IEC 646, which is
a subset of the seven-bit US ASCII code set."

In footnote 215 it mentions 7-bit ASCII as an example, not as something
that is mandated.

From Lew Pitcher@21:1/5 to spender on Wed Dec 13 14:56:06 2023

On Wed, 13 Dec 2023 11:05:45 +0800, spender wrote:

printf("%c",ch), the ch must <0xFF, <255

Not quite.
1) ch /must/ represent an integer value.
2) ch /should/ represent a C char value. Note that a C char /is not/
defined as an 8-bit unsigned quantity, but as a CHAR_BIT quantity,
with implementation-defined sign, where CHAR_BIT is /at least/
8 bits. printf() will happily /mis-interpret/ any other integer
for you, when given the '%c' format specifier.

In c lang, The character must be a character of an ASCII table, i.e. < (int)255. A string is a collection of characters.

Nonsense.

1) The C language does /not/ specify the representation
of char, other than its size in bits and whether or not it carries
a sign. The C language has been implemented in EBCDIC environments
(for instance), which is not even close to ASCII.

2) ASCII is a 7-bit encoding scheme; all valid ASCII values exist between
0 and 127. /Some software/ extends ASCII to 8 bits, with the high-order
bit either extending the character set, or representing some
meta-characteristic (such as parity or sign).

--
Lew Pitcher
"In Skills We Trust"

From Tim Rentsch@21:1/5 to Lew Pitcher on Mon Dec 25 02:03:59 2023

Lew Pitcher <lew.pitcher@digitalfreehold.ca> writes:

On Wed, 13 Dec 2023 11:05:45 +0800, spender wrote:

printf("%c",ch), the ch must <0xFF, <255

Not quite.
1) ch /must/ represent an integer value.

More specifically, it must have a type that is or promotes
to int, or a type that is or promotes to unsigned int, with
a value that is in the common range of int and unsigned int.

2) ch /should/ represent a C char value. Note that a C char /is not/
defined as an 8-bit unsigned quantity, but as a CHAR_BIT quantity,
with implementation-defined sign, where CHAR_BIT is /at least/
8 bits. [...]

This part isn't exactly right. Any value in the range of char
is okay. However, any value in the range of unsigned char is
also okay. The type 'int' for the argument is meant to include
values returned by, for example, getchar(), and such functions
always return non-negative values (not counting EOF). The rules
for character input/output functions generally convert characters
to unsigned char, and such values are meant to be admissible as
arguments for a %c conversion specifier.
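
A quick way to see that both ranges are acceptable is to format the same byte once as a (possibly negative) char value and once as the non-negative value getchar() would return. The sketch assumes CHAR_BIT == 8, and percent_c_byte is a hypothetical helper name.

```c
#include <assert.h>
#include <stdio.h>

/* Return, as a value in 0..UCHAR_MAX, the byte that a "%c" conversion
   writes for the int argument `arg`. */
int percent_c_byte(int arg)
{
    char buf[2];
    int n = snprintf(buf, sizeof buf, "%c", arg);

    assert(n == 1);               /* "%c" always produces one byte */
    (void)n;
    return (unsigned char)buf[0];
}
```

With 8-bit bytes, percent_c_byte(-30) and percent_c_byte(226) give the same result, because the argument is converted to unsigned char before it is written.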

From Tim Rentsch@21:1/5 to James Kuyper on Fri Jan 19 07:43:39 2024

James Kuyper <jameskuyper@alumni.caltech.edu> writes:

On 12/12/23 22:05, spender wrote:

printf("%c",ch), the ch must <0xFF, <255

The only 'ch' in the code that you responded to was declared as
"char *", not char, [...]

The posting in question also gave declarations

char ch = [...];

and

wchar_t ch = [...];

From Tim Rentsch@21:1/5 to Keith Thompson on Sat Jan 20 09:33:42 2024

Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

Lew Pitcher <lew.pitcher@digitalfreehold.ca> writes:

On Wed, 13 Dec 2023 11:05:45 +0800, spender wrote:

printf("%c",ch), the ch must <0xFF, <255

Not quite.
1) ch /must/ represent an integer value.

More specifically, it must have a type that is or promotes
to int, or a type that is or promotes to unsigned int, with
a value that is in the common range of int and unsigned int.

Not quite. "If no l length modifier is present, the int argument
is converted to an unsigned char, and the resulting character is
written." For example printf("%c", -193) is equivalent to
printf("%c", 63), which assuming an ASCII-based character set will
print '?'.

The rule for arguments to printf() is the same as the rule for
accessing variadic arguments using va_arg(). That has always
been true, although not expressed clearly in early versions of
the C standard. Fortunately that shortcoming is addressed in
the upcoming C23 (is it still not yet ratified?): in N3096,
paragraph 9 in section 7.23.6.1 says in part

fprintf shall behave as if it uses va_arg with a type
argument naming the type resulting from applying the
default argument promotions to the type corresponding
to the conversion specification [...]

and the rule for va_arg (in 7.16.1.1 p2) says in part

one type is a signed integer type, the other type is
the corresponding unsigned integer type, and the value
is representable in both types

So supplying an unsigned int argument is okay, provided of
course the value is in the range of values of signed int.
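
The va_arg rule can be illustrated with a small variadic function that reads every argument as int. This is a sketch for illustration only; sum_as_int is a made-up name.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical variadic helper: sum `count` arguments, reading each
   one with va_arg(ap, int). */
int sum_as_int(int count, ...)
{
    va_list ap;
    int total = 0;

    assert(count >= 0);
    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);
    return total;
}
```

A call such as sum_as_int(2, 40u, 2) passes an unsigned int where int is read. That is permitted because the value 40 is representable in both types; an unsigned value above INT_MAX would not be covered by the rule.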

From Tim Rentsch@21:1/5 to Keith Thompson on Wed Jan 24 20:38:26 2024

Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

Lew Pitcher <lew.pitcher@digitalfreehold.ca> writes:

On Wed, 13 Dec 2023 11:05:45 +0800, spender wrote:

printf("%c",ch), the ch must <0xFF, <255

[...]

Re-reading what you wrote, I think I misunderstood your intent (and I
think what you wrote was ambiguous).

"%c" specifies an int argument.

You wrote:

More specifically, it must have a type that is or promotes to int,
or a type that is or promotes to unsigned int, with a value that is
in the common range of int and unsigned int.

I read that as:

More specifically,
(it must have a type that is or promotes to int, or a type that is
or promotes to unsigned int),
with a value that is in the common range of int and unsigned int.

which would incorrectly imply that a negative int value is not allowed.

It's now clear to me that what you meant was:

More specifically,
(it must have a type that is or promotes to int),
or
(a type that is or promotes to unsigned int, with a value that is in
the common range of int and unsigned int).

I agree with that.

Right. Sorry for the confusion.
