• Determine size demand of (Unicode-)characters on terminal from shell

    From Janis Papanagnou@21:1/5 to All on Mon Dec 27 14:07:46 2021
    I'm using ANSI escape codes ("\033[%d;%dH") to position Unicode
    characters in a terminal window. The indices supplied for %d
    work for (e.g.) the Latin character sets, but not for character
    sets whose glyphs occupy more than one column, such as Chinese
    characters. So with a Latin character set I'd use column indices
    1, 2, 3, ... and for the Asian sets I'd use 1, 3, 5, ... to
    position the characters on the screen. My question:

    Is the number of columns that a character's glyph occupies on a
    terminal somehow retrievable, so that I get, say, a value of 1
    for Unicode character \U0041 and a value of 2 for \U30ee, and
    can automate the display on a terminal?
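
    To illustrate what I am after: once the per-glyph widths are known,
    the column indices are just a running sum. A minimal sketch in POSIX
    sh; the function names glyph_columns and place_glyphs are made up for
    this example, and the widths are supplied by hand because obtaining
    them is exactly the open question.

```shell
#!/bin/sh
# Hypothetical helper: print the 1-based start column of each glyph,
# given the display width of every glyph in order.
# Usage: glyph_columns WIDTH...
glyph_columns() {
    col=1
    out=
    for w in "$@"; do
        out="${out:+$out }$col"   # record where this glyph starts
        col=$((col + w))          # the next glyph starts after it
    done
    printf '%s\n' "$out"
}

# Hypothetical helper: write glyphs on one row using "\033[row;colH".
# Usage: place_glyphs ROW WIDTH GLYPH [WIDTH GLYPH]...
place_glyphs() {
    row=$1
    shift
    col=1
    while [ "$#" -ge 2 ]; do
        printf '\033[%d;%dH%s' "$row" "$col" "$2"
        col=$((col + $1))
        shift 2
    done
}
```

    With widths 2 2 1, glyph_columns prints "1 3 5"; with widths 1 1 1
    it prints "1 2 3". Note that col, out, w and row are plain globals
    here, since POSIX sh has no local variables.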

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From marrgol@21:1/5 to Janis Papanagnou on Mon Dec 27 14:39:52 2021
    On 27/12/2021 at 14.07, Janis Papanagnou wrote:
    > Is the size that the character glyphs need for representation
    > on a terminal somehow retrievable [...]

    Quick search reveals: https://unix.stackexchange.com/questions/245013/get-the-display-width-of-a-string-of-characters


    --
    mrg

  • From Janis Papanagnou@21:1/5 to marrgol on Mon Dec 27 15:56:11 2021
    On 27.12.2021 14:39, marrgol wrote:
    > Quick search reveals: https://unix.stackexchange.com/questions/245013/get-the-display-width-of-a-string-of-characters

    Interesting, Stephane asked that question. And wc -L seems to be
    the solution; non-standard but at least works on my system. Thanks!

    $ printf "\U30ee" | wc -L
    2
    $ printf "\U0041" | wc -L
    1


    Janis

  • From Keith Thompson@21:1/5 to Janis Papanagnou on Mon Dec 27 13:38:28 2021
    Janis Papanagnou <janis_papanagnou@hotmail.com> writes:
    > Interesting, Stephane asked that question. And wc -L seems to be
    > the solution; non-standard but at least works on my system. Thanks!
    >
    > $ printf "\U30ee" | wc -L
    > 2
    > $ printf "\U0041" | wc -L
    > 1

    Internally, `wc -L` uses the POSIX `wcwidth()` function.

    https://pubs.opengroup.org/onlinepubs/9699919799/functions/wcwidth.html

    I'm not 100% clear on how the number of column positions for a given
    character is defined.

    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    Working, but not speaking, for Philips
    void Void(void) { Void(); } /* The recursive call of the void */

  • From Janis Papanagnou@21:1/5 to Keith Thompson on Mon Dec 27 23:45:22 2021
    On 27.12.2021 22:38, Keith Thompson wrote:
    > Internally, `wc -L` uses the POSIX `wcwidth()` function.

    Yes, that function seems to be the standard base for a couple tools.
    It's good to have access to that function on Linux in such a simple
    way. (Not sure how reliable that is, though; see below.)

    > https://pubs.opengroup.org/onlinepubs/9699919799/functions/wcwidth.html
    >
    > I'm not 100% clear on how the number of column positions for a given
    > character is defined.

    The issue seems to be quite a mess. In the SE thread Stephane gave a link
    to an article on the Unicode topic that I found interesting and amusing:
    https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/#combining-characters-and-character-width

    Janis

  • From Spiros Bousbouras@21:1/5 to Janis Papanagnou on Tue Dec 28 03:49:35 2021
    On Mon, 27 Dec 2021 23:45:22 +0100
    Janis Papanagnou <janis_papanagnou@hotmail.com> wrote:
    > On 27.12.2021 22:38, Keith Thompson wrote:
    >> I'm not 100% clear on how the number of column positions for a given
    >> character is defined.

    I was wondering about that myself. I'm sure the Unicode standard has
    something to say about it, but whether there are other factors, I don't
    know. This made me wonder whether there are newsgroups discussing
    Unicode. I was only able to find fr.comp.normes.unicode, which has no
    discussion.

    > The issue seems to be quite a mess. In the SE thread Stephane gave a link
    > to an article on the Unicode topic that I found interesting and amusing:
    > https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/#combining-characters-and-character-width

    I had a look at that link myself. I know little about Unicode so I didn't
    see anything to doubt that page. But exploring further on the site I found
    https://eev.ee/blog/2012/04/09/php-a-fractal-of-bad-design :

        In C, functions like strpos return -1 if the item isn't found. If you
        don't check for that case and try to use that as an index, you'll hit
        junk memory and your program will blow up. (Probably. It's C. Who the
        fuck knows. I'm sure there are tools for this, at least.)
        [...]
        For those not down with the C: INT_MAX is the biggest integer that will
        fit in a variable, ever.

    This kind of sloppiness and his style of posting and the fact that comments
    are only allowed through Disqus make me wary.

  • From Janis Papanagnou@21:1/5 to Janis Papanagnou on Fri Jan 14 06:26:46 2022
    On 27.12.2021 15:56, Janis Papanagnou wrote:
    > Interesting, Stephane asked that question. And wc -L seems to be
    > the solution; non-standard but at least works on my system. Thanks!
    >
    > $ printf "\U30ee" | wc -L
    > 2
    > $ printf "\U0041" | wc -L
    > 1

    I just tried that for the Unicode smileys, which start at U+1F600
    (128512) in the Unicode tables, but for these symbols 'wc -L' returns
    0, as if they required no space at all. Too bad.

    Janis

  • From Ben Bacarisse@21:1/5 to Janis Papanagnou on Fri Jan 14 20:30:35 2022
    Janis Papanagnou <janis_papanagnou@hotmail.com> writes:

    > Just tried that for the Unicode-smileys starting in the Unicode tables
    > from position U+1F600 (128512), but for these symbols 'wc -L' returns
    > 0, as if these symbols wouldn't require any space. - Too bad.

    $ printf "\U1f600" | wc -L
    2

    Maybe a locale setting?

    --
    Ben.

  • From Janis Papanagnou@21:1/5 to Ben Bacarisse on Sat Jan 15 01:06:05 2022
    On 14.01.2022 21:30, Ben Bacarisse wrote:
    > Janis Papanagnou <janis_papanagnou@hotmail.com> writes:
    >> Just tried that for the Unicode-smileys starting in the Unicode tables
    >> from position U+1F600 (128512), but for these symbols 'wc -L' returns
    >> 0, as if these symbols wouldn't require any space. - Too bad.
    >
    > $ printf "\U1f600" | wc -L
    > 2

    Hmm..

    > Maybe a locale setting?

    I tried a couple of UTF-8 locales along with the plain C locale; all
    return 0 in my environment.

    Given your post I have now also tried it on a machine with a newer OS
    version, and there it works as expected. So it seems that the locale
    definitions in the system files of that older Linux version are broken?
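
    For anyone wanting to repeat the check, a minimal sketch of how the
    locale can be ruled out first. The function names is_utf8_charmap and
    report_charmap are made up for this example; `locale charmap` is the
    standard way to ask for the active character encoding. If the charmap
    is UTF-8 and `wc -L` still reports 0 for a symbol, the system's width
    data is the more likely suspect (an assumption, not something I have
    verified in the sources).

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given charmap name denotes UTF-8.
is_utf8_charmap() {
    case $1 in
        UTF-8|utf-8|UTF8|utf8) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: report the active charmap and what it implies
# for width measurements of non-ASCII text.
report_charmap() {
    # Fall back to "unknown" if the locale utility is missing or fails.
    cm=$(locale charmap 2>/dev/null) || cm=unknown
    if is_utf8_charmap "$cm"; then
        printf 'charmap %s: multibyte characters are decoded\n' "$cm"
    else
        printf 'charmap %s: widths of non-ASCII text are unreliable\n' "$cm"
    fi
}
```

    Typical use is to run report_charmap once per candidate setting, e.g.
    with LC_ALL set to each locale under test, before comparing the
    `wc -L` results.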

    Janis

  • From Eli the Bearded@21:1/5 to janis_papanagnou@hotmail.com on Sat Jan 15 22:01:10 2022
    In comp.unix.shell, Janis Papanagnou <janis_papanagnou@hotmail.com> wrote:
    > On 27.12.2021 22:38, Keith Thompson wrote:
    >> Internally, `wc -L` uses the POSIX `wcwidth()` function.
    > Yes, that function seems to be the standard base for a couple tools.
    > It's good to have access to that function on Linux in such a simple
    > way. (Not sure how reliable that is, though; see below.)

    I tried it out on NetBSD 9.2 today and found an interesting quirk. As
    it's not GNU, there is no --version, but I pulled this out of wc with
    `strings`:

    GCC: (NetBSD nb4 20200810) 7.5.0
    $NetBSD: crt0.S,v 1.4 2018/11/26 17:37:46 joerg Exp $
    $NetBSD: crt0-common.c,v 1.23 2018/12/28 20:12:35 christos Exp $
    $NetBSD: crti.S,v 1.1 2010/08/07 18:01:35 joerg Exp $
    $NetBSD: crtbegin.S,v 1.2 2010/11/30 18:37:59 joerg Exp $
    $NetBSD: wc.c,v 1.35 2011/09/16 15:39:30 joerg Exp $
    $NetBSD: crtend.S,v 1.1 2010/08/07 18:01:34 joerg Exp $
    $NetBSD: crtn.S,v 1.1 2010/08/07 18:01:35 joerg Exp $
    @(#) Copyright (c) 1980, 1987, 1991, 1993 The Regents of the University of California. All rights reserved.

    $ printf "\U30ee" | wc -L
    0
    $ printf "\U30ee\n" | wc -L
    1
    $

    Compared with a GNU wc:

    $ printf "\U30ee" | gwc -L
    2
    $ printf "\U30ee\n" | gwc -L
    2
    $

    So not only does the NetBSD one not use anything to detect "wide"
    characters, its line length also counts only complete lines.
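
    Given that, a wrapper that always terminates the line at least avoids
    the second problem. A minimal sketch; the name str_width is made up
    for this example, and it assumes a wc that accepts the non-standard
    -L option at all. It does nothing about the first problem: on a wc
    that ignores wide characters the result is still a character count,
    not a column count.

```shell
#!/bin/sh
# Hypothetical helper: print the display width of a string as reported
# by `wc -L` (non-standard option; behaviour varies between systems).
str_width() {
    # Always end the line with a newline, since some implementations
    # count only complete lines. tr strips the padding that some wc
    # implementations put in front of the number.
    printf '%s\n' "$1" | wc -L | tr -d ' '
}
```

    For example, `str_width abc` prints 3. If the argument contains
    embedded newlines, the result is the width of the longest line.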

    Elijah
    ------
    has been looking for a good way to find terminal display length
