• using Unicode codepoints in a bash script

    From paris2venice@21:1/5 to All on Wed Nov 29 00:01:06 2023
    I am trying to use Unicode codepoints along with their UTF-8 encodings in a bash script, in order to compare each codepoint with its matching UTF-8 sequence in a database of the 1071 Egyptian hieroglyphs.

    So what I am trying to do is ensure that each record is accurately validated in the 1071 lines of my file.
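
    Roughly, the per-record check I have in mind looks like the untested sketch below (assuming, purely for illustration, a tab-separated file hieroglyphs.tsv with the codepoint in column 1 and the stored glyph in column 2, and a bash new enough to understand \U):

    #!/usr/bin/env bash
    # Sketch only: rebuild each glyph from its codepoint and compare it
    # with the glyph recorded in the file.
    while IFS=$'\t' read -r codepoint glyph; do
        printf -v expected "\U000${codepoint}"
        if [[ "$expected" != "$glyph" ]]; then
            printf 'mismatch at U+%s\n' "$codepoint"
        fi
    done < hieroglyphs.tsv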

    The codepoint printf works in the bash shell, e.g.
    codepoint=13000
    printf "\U000${codepoint}\n"
    𓀀

    However, if I put the same code into a script, e.g.
    cat > cpt
    #!/bin/bash
    codepoint=13000
    printf "\U000${codepoint}\n"

    chmod +x cpt
    bash -x ./cpt
    + codepoint=13000
    + printf '\U00013000\n'
    \U00013000

    So instead of creating the hieroglyph, the script just ignores the same exact code. Is there any way around this? Thanks for any help.

    bash --version
    GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin20)
    Copyright (C) 2007 Free Software Foundation, Inc.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From paris2venice@21:1/5 to All on Wed Nov 29 00:07:31 2023
    I did chsh to the 5.0.17 version of bash but had the same issue.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Russell Marks@21:1/5 to paris2venice@gmail.com on Wed Nov 29 15:01:05 2023
    paris2venice <paris2venice@gmail.com> wrote:

    I am trying to use Unicode codepoints along with Unicode UTF8s in a
    bash script in order to compare codepoints and their matching UTF8 in
    a database of the 1071 Egyptian hieroglyphs.

    So what I am trying to do is ensure that each record is accurately
    validated in the 1071 lines of my file.

    It works in the bash shell e.g.
    codepoint=13000
    printf "\U000${codepoint}\n"
    𓀀

    It sounds like you're on macOS, so I suspect the interactive shell
    you're using may be zsh, not bash - and probably a newer version.
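
    A quick way to confirm what is actually running interactively (typed at the prompt, not in a script):

    echo "$0"               # name of the current shell
    echo "$BASH_VERSION"    # set only when the shell is bash
    echo "$ZSH_VERSION"     # set only when the shell is zsh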

    However, if I put the same code into a script, e.g.
    cat > cpt
    #!/bin/bash
    codepoint=13000
    printf "\U000${codepoint}\n"

    chmod +x cpt
    bash -x ./cpt
    + codepoint=13000
    + printf '\U00013000\n'
    \U00013000

    So instead of creating the hieroglyph, the script just ignores the
    same exact code. Is there any way around this? Thanks for any
    help.

    bash --version
    GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin20)
    Copyright (C) 2007 Free Software Foundation, Inc.

    That's a very old version of bash. To quote the CHANGES file from a
    newer version, "Fixed several bugs with the handling of valid and
    invalid unicode character values when used with the \u and \U escape
    sequences to printf and $'...'." So the old version not having those
    fixes might be the problem.
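
    A quick check of whether a given bash binary's printf handles \U at all might be something like the following; if it doesn't, the escape comes back literally, as in your trace:

    /bin/bash -c 'printf "\U00013000\n"'
    /usr/local/bin/bash -c 'printf "\U00013000\n"'   # if a newer bash is installed there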

    -Rus.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christian Weisgerber@21:1/5 to paris2venice@gmail.com on Wed Nov 29 14:54:29 2023
    On 2023-11-29, paris2venice <paris2venice@gmail.com> wrote:

    It works in the bash shell e.g.
    codepoint=13000
    printf "\U000${codepoint}\n"
    𓀀

    However, if I put the same code into a script, e.g.
    cat > cpt
    #!/bin/bash
    codepoint=13000
    printf "\U000${codepoint}\n"
    \U00013000

    So instead of creating the hieroglyph, the script just ignores the same exact code.

    That doesn't happen. Something isn't like you say it is.

    bash --version
    GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin20)
    Copyright (C) 2007 Free Software Foundation, Inc.

    Presumably that is the version you use to execute the script.
    It is ancient and may not yet support the \U syntax.

    What bash are you using interactively?

    $ echo $BASH_VERSION

    --
    Christian "naddy" Weisgerber naddy@mips.inka.de

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From paris2venice@21:1/5 to Russell Marks on Wed Nov 29 10:13:16 2023
    On Wednesday, November 29, 2023 at 7:01:12 AM UTC-8, Russell Marks wrote:
    paris2venice wrote:

    I am trying to use Unicode codepoints along with Unicode UTF8s in a
    bash script in order to compare codepoints and their matching UTF8 in
    a database of the 1071 Egyptian hieroglyphs.

    So what I am trying to do is ensure that each record is accurately validated in the 1071 lines of my file.

    It works in the bash shell e.g.
    codepoint=13000
    printf "\U000${codepoint}\n"
    𓀀
    It sounds like you're on macOS, so I suspect the interactive shell
    you're using may be zsh, not bash - and probably a newer version.

    Thanks for your reply, Russell. I do not know zsh so I don't use it. And I disliked Apple trying to decide for me which shell I should use so I ignored them. That was years ago.



    However, if I put the same code into a script, e.g.
    cat > cpt
    #!/bin/bash
    codepoint=13000
    printf "\U000${codepoint}\n"

    chmod +x cpt
    bash -x ./cpt
    + codepoint=13000
    + printf '\U00013000\n'
    \U00013000

    So instead of creating the hieroglyph, the script just ignores the
    same exact code. Is there any way around this? Thanks for any
    help.

    bash --version
    GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin20)
    Copyright (C) 2007 Free Software Foundation, Inc.

    That's a very old version of bash. To quote the CHANGES file from a
    newer version, "Fixed several bugs with the handling of valid and
    invalid unicode character values when used with the \u and \U escape sequences to printf and $'...'." So the old version not having those
    fixes might be the problem.

    -Rus.

    That's interesting. Did you see my following comment about trying it with version 5.0.17? I had the same exact results.

    In any case, the UTF-8 does not fail even with the 3.2.57(1) release:

    utf8a=80 utf8b=80
    utf8_hg=$( printf "\xF0\x93\x${utf8a}\x${utf8b}" )
    echo $utf8_hg
    𓀀

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From paris2venice@21:1/5 to Russell Marks on Wed Nov 29 11:13:31 2023
    On Wednesday, November 29, 2023 at 7:01:12 AM UTC-8, Russell Marks wrote:
    paris2venice wrote:

    I am trying to use Unicode codepoints along with Unicode UTF8s in a
    bash script in order to compare codepoints and their matching UTF8 in
    a database of the 1071 Egyptian hieroglyphs.

    So what I am trying to do is ensure that each record is accurately validated in the 1071 lines of my file.

    It works in the bash shell e.g.
    codepoint=13000
    printf "\U000${codepoint}\n"
    𓀀
    It sounds like you're on macOS, so I suspect the interactive shell
    you're using may be zsh, not bash - and probably a newer version.
    However, if I put the same code into a script, e.g.
    cat > cpt
    #!/bin/bash
    codepoint=13000
    printf "\U000${codepoint}\n"

    chmod +x cpt
    bash -x ./cpt
    + codepoint=13000
    + printf '\U00013000\n'
    \U00013000

    So instead of creating the hieroglyph, the script just ignores the
    same exact code. Is there any way around this? Thanks for any
    help.

    bash --version
    GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin20)
    Copyright (C) 2007 Free Software Foundation, Inc.
    That's a very old version of bash. To quote the CHANGES file from a
    newer version, "Fixed several bugs with the handling of valid and
    invalid unicode character values when used with the \u and \U escape sequences to printf and $'...'." So the old version not having those
    fixes might be the problem.

    -Rus.

    Ciao again.

    I just realized that the codepoint uses \U while the UTF-8 only uses \x, so by changing my shebang from /bin/bash (i.e. 3.2.57(1)) to /usr/local/bin/bash (i.e. 5.0.17), the script does succeed. So many thanks.
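
    For anyone reading along, an alternative to hard-coding that path is to let env pick up whichever bash is first in PATH, with a guard for versions too old for \U (an untested sketch; as far as I know the \u/\U escapes arrived in bash 4.2):

    #!/usr/bin/env bash
    # Guard: refuse to run under a bash too old for \u/\U escapes.
    if (( BASH_VERSINFO[0] < 4 || (BASH_VERSINFO[0] == 4 && BASH_VERSINFO[1] < 2) )); then
        echo "this script needs bash >= 4.2 for \\U escapes" >&2
        exit 1
    fi
    codepoint=13000
    printf "\U000${codepoint}\n"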

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From paris2venice@21:1/5 to Christian Weisgerber on Wed Nov 29 10:58:49 2023
    On Wednesday, November 29, 2023 at 7:30:10 AM UTC-8, Christian Weisgerber wrote:
    On 2023-11-29, paris2venice wrote:

    It works in the bash shell e.g.
    codepoint=13000
    printf "\U000${codepoint}\n"
    𓀀

    However, if I put the same code into a script, e.g.
    cat > cpt
    #!/bin/bash
    codepoint=13000
    printf "\U000${codepoint}\n"
    \U00013000

    So instead of creating the hieroglyph, the script just ignores the same exact code.
    That doesn't happen. Something isn't like you say it is.

    Well, did you try replicating my results? The very simple bash code is right there.


    bash --version
    GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin20)
    Copyright (C) 2007 Free Software Foundation, Inc.
    Presumably that is the version you use to execute the script.
    It is ancient and may not yet support the \U syntax.

    But it does support the \x escapes I use for the UTF-8 bytes, just not the \U codepoint syntax. The basic function of my shell script is to compare Unicode's codepoint with the UTF-8 encoding for all 1071 hieroglyphs.

    The first hieroglyph defined in Unicode is a seated man referred to as A1 in the Gardiner classification.
    Its codepoint is "13000" and its UTF-8 is the matched pair of "80 80" and its hieroglyph is ð“€€.

    utf8a=80 utf8b=80
    utf8_hg=$( printf "\xF0\x93\x${utf8a}\x${utf8b}" )
    echo $utf8_hg
    𓀀

    So the UTF-8 works fine.

    codepoint=13000
    cp_hg=$( echo -e "\U000$codepoint" )
    echo $cp_hg
    𓀀
    bash -version
    GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin20)
    Copyright (C) 2007 Free Software Foundation, Inc.

    So the codepoint works fine in the shell and, as you can see, using the old version.
    Unfortunately, not in the script.


    What bash are you using interactively?

    As I wrote at the end, I typically use the old version ... 3.2.57(1). I downloaded 5.0.17 years ago but stopped using it because I wanted my other, much more intensive script (which uses both bash and AppleScript) to work for any user who might
    download it. I can't really distribute a script for public consumption that relies on Apple staying up to date, which they don't.

    $ echo $BASH_VERSION
    echo $BASH_VERSION
    5.0.17(1)-release

    That's just at the moment though.

    --
    Christian "naddy" Weisgerber

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Russell Marks@21:1/5 to paris2venice@gmail.com on Thu Nov 30 11:49:33 2023
    paris2venice <paris2venice@gmail.com> wrote:

    Russell Marks wrote:
    paris2venice wrote:
    [...]
    So instead of creating the hieroglyph, the script just ignores the
    same exact code. Is there any way around this? Thanks for any
    help.

    bash --version
    GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin20)
    Copyright (C) 2007 Free Software Foundation, Inc.

    That's a very old version of bash. To quote the CHANGES file from a
    newer version, "Fixed several bugs with the handling of valid and
    invalid unicode character values when used with the \u and \U escape
    sequences to printf and $'...'." So the old version not having those
    fixes might be the problem.
    [...]
    That's interesting. Did you see my following comment about trying
    it with version 5.0.17? I had the same exact results.

    That version is also a bit old still, but I'm surprised at it giving
    you the same trouble (assuming that printf is the builtin version).

    Playing around with this on Linux, one way to nearly replicate your
    result with a newer bash is "LC_ALL=C printf '\U00013000\n'" which for
    me will output "\u00013000". So I suppose there could be a locale
    issue involved.
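
    One way to rule that in or out might be to compare what the interactive shell and the script actually see, e.g.:

    locale                     # at the interactive prompt
    /bin/bash -c 'locale'      # what a /bin/bash child process sees
    LC_ALL=en_US.UTF-8 ./cpt   # force a UTF-8 locale for one run of the test script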

    In any case, the UTF-8 does not fail even with the 3.2.57(1)
    release:

    utf8a=80 utf8b=80
    utf8_hg=$( printf "\xF0\x93\x${utf8a}\x${utf8b}" )
    echo $utf8_hg
    𓀀

    That's something at least, and given what you said in a later post
    about wanting to cope with Apple's old bash version for the sake of
    users, you might not have much alternative.
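
    If it comes to that, one fallback (just an untested sketch, and only for the four-byte range U+10000..U+10FFFF that the hieroglyph block lives in) would be to assemble the UTF-8 bytes from the codepoint with shell arithmetic, so only \x is needed and even the old 3.2 bash should cope:

    # Print the \x escapes for a hex codepoint in U+10000..U+10FFFF.
    cp_to_utf8() {
        local cp=$(( 16#$1 ))    # hex string -> number
        printf '\\x%02x\\x%02x\\x%02x\\x%02x' \
            $(( 0xF0 |  cp >> 18 )) \
            $(( 0x80 | (cp >> 12 & 0x3F) )) \
            $(( 0x80 | (cp >>  6 & 0x3F) )) \
            $(( 0x80 | (cp       & 0x3F) ))
    }
    codepoint=13000
    printf "$(cp_to_utf8 "$codepoint")\n"    # should print the A1 glyph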

    -Rus.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christian Weisgerber@21:1/5 to Russell Marks on Thu Nov 30 14:09:55 2023
    On 2023-11-30, Russell Marks <zgedneil@spam^H^H^H^Hgmail.com> wrote:

    Playing around with this on Linux, one way to nearly replicate your
    result with a newer bash is "LC_ALL=C printf '\U00013000\n'" which for
    me will output "\u00013000". So I suppose there could be a locale
    issue involved.

    But if you execute the commands in question first on the command
    line, then in a minimal script as shown, the same locale settings
    will be used for both.

    --
    Christian "naddy" Weisgerber naddy@mips.inka.de

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Russell Marks@21:1/5 to Christian Weisgerber on Thu Nov 30 18:52:56 2023
    Christian Weisgerber <naddy@mips.inka.de> wrote:

    On 2023-11-30, Russell Marks <zgedneil@spam^H^H^H^Hgmail.com> wrote:

    Playing around with this on Linux, one way to nearly replicate your
    result with a newer bash is "LC_ALL=C printf '\U00013000\n'" which for
    me will output "\u00013000". So I suppose there could be a locale
    issue involved.

    But if you execute the commands in question first on the command
    line, then in a minimal script as shown, the same locale settings
    will be used for both.

    True. The old 3.x bash presumably had Unicode bugs though, and the
    differing output of "\U" vs. "\u" could hint at differing causes.
    Also, if the 3.x bash binary is from Apple while the 5.x one isn't (as
    seems likely), I imagine that the binaries could potentially be using
    different libraries and/or locale configs.
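
    A couple of commands might help pin that down (otool being Apple's tool for listing a binary's linked libraries; the 5.x path here is just the one mentioned earlier in the thread):

    type -a bash                    # every bash visible in PATH
    otool -L /bin/bash              # libraries the system bash links against
    otool -L /usr/local/bin/bash    # same for the newer bash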

    Still, I have to admit this is all pretty speculative.

    -Rus.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Qruqs@21:1/5 to All on Sun Aug 18 08:04:28 2024
    On Wed, 29 Nov 2023 00:01:06 -0800 (PST), paris2venice wrote:

    bash --version GNU bash, version 3.2.57(1)-release
    (x86_64-apple-darwin20)
    Copyright (C) 2007 Free Software Foundation, Inc.

    As many have already said, the Bash version might be the issue. I use:

    $ bash --version
    GNU bash, version 5.2.26(1)-release (x86_64-pc-linux-gnu)
    Copyright (C) 2022 Free Software Foundation, Inc.

    Plus, I don't know if there is anyone still around to even care about my posting this (the group seems to have gone dead after that stupid Google Groups gateway was finally turned
    off; "do no harm", right...), and I also know this isn't a Python group, but anyhoo, since it's more or less dead anyway...

    Is there a reason it _has_ to be Bash?

    You could try another language. Python, for instance, can also be run as an executable text file, just like a shell script. You can also call other scripts from Python: do the
    stuff that is easier in Python, then hand off to some other script to solve the rest. Or, using the "sys" module, you can pipe data into
    your Python script and have it send its result back out via stdout (see the one-liner sketch further down). Your
    imagination is the limit here.

    Python 3.12: test-hieroglyphs-post-20240818.py

    CODE:
    ---8<-------------------------------------------------------------------
    #! /usr/bin/env python3
    #coding: utf-8
    print("As is:", "ð“€€")
    print("Using character names:", chr(ord('\N{EGYPTIAN HIEROGLYPH A001}'))) print("The code points as hex:", "ð“€€".encode('utf-8')) ---8<-------------------------------------------------------------------

    Running it:
    $ ./test-hieroglyphs-post-20240818.py
    As is: 𓀀
    Using character names: 𓀀
    The code points as hex bytes: b'\xf0\x93\x80\x80'


    This works because Python 3.x is Unicode aware. All strings are Unicode by default.
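
    As a sketch of the pipe idea above, a bash script could hand a codepoint to a small Python helper on stdin and read the glyph back from stdout; here the "helper" is just an inline one-liner rather than the script shown:

    codepoint=13000
    glyph=$( printf '%s\n' "$codepoint" |
             python3 -c 'import sys; print(chr(int(sys.stdin.read(), 16)))' )
    echo "$glyph"    # prints the A1 hieroglyph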

    There are other languages that might be suitable also. Pick one and try it
    out.


    There are more ways than one to skin a cat.

    Q.
    --
    Currently using: https://manjaro.org/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)