• using Unicode codepoints in a bash script

    From paris2venice@21:1/5 to All on Wed Nov 29 00:01:06 2023
    I am trying to use Unicode codepoints along with their UTF-8 encodings in a bash script, in order to compare each codepoint with its matching UTF-8 sequence in a database of the 1071 Egyptian hieroglyphs.

    So what I am trying to do is ensure that each record is accurately validated in the 1071 lines of my file.
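
    Roughly, the per-record check I have in mind looks like the untested sketch below (assuming, purely for illustration, a tab-separated file hieroglyphs.tsv with the codepoint in column 1 and the stored glyph in column 2, and a bash new enough to understand \U):

    #!/usr/bin/env bash
    # Sketch only: rebuild each glyph from its codepoint and compare it
    # with the glyph recorded in the file.
    while IFS=$'\t' read -r codepoint glyph; do
        printf -v expected "\U000${codepoint}"
        if [[ "$expected" != "$glyph" ]]; then
            printf 'mismatch at U+%s\n' "$codepoint"
        fi
    done < hieroglyphs.tsv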

    The codepoint printf works in the bash shell, e.g.
    codepoint=13000
    printf "\U000${codepoint}\n"
    𓀀

    However, if I put the same code into a script, e.g.
    cat > cpt
    #!/bin/bash
    codepoint=13000
    printf "\U000${codepoint}\n"

    chmod +x cpt
    bash -x ./cpt
    + codepoint=13000
    + printf '\U00013000\n'
    \U00013000

    So instead of creating the hieroglyph, the script just ignores the same exact code. Is there any way around this? Thanks for any help.

    bash --version
    GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin20)
    Copyright (C) 2007 Free Software Foundation, Inc.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From paris2venice@21:1/5 to All on Wed Nov 29 00:07:31 2023
    I did chsh to the 5.0.17 version of bash but had the same issue.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Russell Marks@21:1/5 to paris2venice@gmail.com on Wed Nov 29 15:01:05 2023
    paris2venice <paris2venice@gmail.com> wrote:

    I am trying to use Unicode codepoints along with Unicode UTF8s in a
    bash script in order to compare codepoints and their matching UTF8 in
    a database of the 1071 Egyptian hieroglyphs.

    So what I am trying to do is ensure that each record is accurately
    validated in the 1071 lines of my file.

    It works in the bash shell e.g.
    codepoint=13000
    printf "\U000${codepoint}\n"
    𓀀

    It sounds like you're on macOS, so I suspect the interactive shell
    you're using may be zsh, not bash - and probably a newer version.
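
    A quick way to confirm what is actually running interactively (typed at the prompt, not in a script):

    echo "$0"               # name of the current shell
    echo "$BASH_VERSION"    # set only when the shell is bash
    echo "$ZSH_VERSION"     # set only when the shell is zsh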

    However, if I put the same code into a script, e.g.
    cat > cpt
    #!/bin/bash
    codepoint=13000
    printf "\U000${codepoint}\n"

    chmod +x cpt
    bash -x ./cpt
    + codepoint=13000
    + printf '\U00013000\n'
    \U00013000

    So instead of creating the hieroglyph, the script just ignores the
    same exact code. Is there any way around this? Thanks for any
    help.

    bash --version
    GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin20)
    Copyright (C) 2007 Free Software Foundation, Inc.

    That's a very old version of bash. To quote the CHANGES file from a
    newer version, "Fixed several bugs with the handling of valid and
    invalid unicode character values when used with the \u and \U escape
    sequences to printf and $'...'." So the old version not having those
    fixes might be the problem.
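
    A quick check of whether a given bash binary's printf handles \U at all might be something like the following; if it doesn't, the escape comes back literally, as in your trace:

    /bin/bash -c 'printf "\U00013000\n"'
    /usr/local/bin/bash -c 'printf "\U00013000\n"'   # if a newer bash is installed there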

    -Rus.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christian Weisgerber@21:1/5 to paris2venice@gmail.com on Wed Nov 29 14:54:29 2023
    On 2023-11-29, paris2venice <paris2venice@gmail.com> wrote:

    It works in the bash shell e.g.
    codepoint=13000
    printf "\U000${codepoint}\n"
    𓀀

    However, if I put the same code into a script, e.g.
    cat > cpt
    #!/bin/bash
    codepoint=13000
    printf "\U000${codepoint}\n"
    \U00013000

    So instead of creating the hieroglyph, the script just ignores the same exact code.

    That doesn't happen. Something isn't like you say it is.

    bash --version
    GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin20)
    Copyright (C) 2007 Free Software Foundation, Inc.

    Presumably that is the version you use to execute the script.
    It is ancient and may not yet support the \U syntax.

    What bash are you using interactively?

    $ echo $BASH_VERSION

    --
    Christian "naddy" Weisgerber naddy@mips.inka.de

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From paris2venice@21:1/5 to Russell Marks on Wed Nov 29 10:13:16 2023
    On Wednesday, November 29, 2023 at 7:01:12 AM UTC-8, Russell Marks wrote:
    paris2venice wrote:

    I am trying to use Unicode codepoints along with Unicode UTF8s in a
    bash script in order to compare codepoints and their matching UTF8 in
    a database of the 1071 Egyptian hieroglyphs.

    So what I am trying to do is ensure that each record is accurately validated in the 1071 lines of my file.

    It works in the bash shell e.g.
    codepoint=13000
    printf "\U000${codepoint}\n"
    𓀀
    It sounds like you're on macOS, so I suspect the interactive shell
    you're using may be zsh, not bash - and probably a newer version.

    Thanks for your reply, Russell. I do not know zsh so I don't use it. And I disliked Apple trying to decide for me which shell I should use so I ignored them. That was years ago.



    However, if I put the same code into a script, e.g.
    cat > cpt
    #!/bin/bash
    codepoint=13000
    printf "\U000${codepoint}\n"

    chmod +x cpt
    bash -x ./cpt
    + codepoint=13000
    + printf '\U00013000\n'
    \U00013000

    So instead of creating the hieroglyph, the script just ignores the
    same exact code. Is there any way around this? Thanks for any
    help.

    bash --version
    GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin20)
    Copyright (C) 2007 Free Software Foundation, Inc.

    That's a very old version of bash. To quote the CHANGES file from a
    newer version, "Fixed several bugs with the handling of valid and
    invalid unicode character values when used with the \u and \U escape sequences to printf and $'...'." So the old version not having those
    fixes might be the problem.

    -Rus.

    That's interesting. Did you see my following comment about trying it with version 5.0.17? I had the same exact results.

    In any case, the UTF-8 does not fail even with the 3.2.57(1) release:

    utf8a=80 utf8b=80
    utf8_hg=$( printf "\xF0\x93\x${utf8a}\x${utf8b}" )
    echo $utf8_hg
    𓀀

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From paris2venice@21:1/5 to Russell Marks on Wed Nov 29 11:13:31 2023
    On Wednesday, November 29, 2023 at 7:01:12 AM UTC-8, Russell Marks wrote:
    paris2venice wrote:

    I am trying to use Unicode codepoints along with Unicode UTF8s in a
    bash script in order to compare codepoints and their matching UTF8 in
    a database of the 1071 Egyptian hieroglyphs.

    So what I am trying to do is ensure that each record is accurately validated in the 1071 lines of my file.

    It works in the bash shell e.g.
    codepoint=13000
    printf "\U000${codepoint}\n"
    𓀀
    It sounds like you're on macOS, so I suspect the interactive shell
    you're using may be zsh, not bash - and probably a newer version.
    However, if I put the same code into a script, e.g.
    cat > cpt
    #!/bin/bash
    codepoint=13000
    printf "\U000${codepoint}\n"

    chmod +x cpt
    bash -x ./cpt
    + codepoint=13000
    + printf '\U00013000\n'
    \U00013000

    So instead of creating the hieroglyph, the script just ignores the
    same exact code. Is there any way around this? Thanks for any
    help.

    bash --version
    GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin20)
    Copyright (C) 2007 Free Software Foundation, Inc.
    That's a very old version of bash. To quote the CHANGES file from a
    newer version, "Fixed several bugs with the handling of valid and
    invalid unicode character values when used with the \u and \U escape sequences to printf and $'...'." So the old version not having those
    fixes might be the problem.

    -Rus.

    Ciao again.

    I just realized that the codepoint uses \U while the UTF-8 only uses \x, so by changing my shebang from /bin/bash (i.e. 3.2.57(1)) to /usr/local/bin/bash (i.e. 5.0.17), the script does succeed. So many thanks.
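
    For anyone reading along, an alternative to hard-coding that path is to let env pick up whichever bash is first in PATH, with a guard for versions too old for \U (an untested sketch; as far as I know the \u/\U escapes arrived in bash 4.2):

    #!/usr/bin/env bash
    # Guard: refuse to run under a bash too old for \u/\U escapes.
    if (( BASH_VERSINFO[0] < 4 || (BASH_VERSINFO[0] == 4 && BASH_VERSINFO[1] < 2) )); then
        echo "this script needs bash >= 4.2 for \\U escapes" >&2
        exit 1
    fi
    codepoint=13000
    printf "\U000${codepoint}\n"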

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From paris2venice@21:1/5 to Christian Weisgerber on Wed Nov 29 10:58:49 2023
    On Wednesday, November 29, 2023 at 7:30:10 AM UTC-8, Christian Weisgerber wrote:
    On 2023-11-29, paris2venice wrote:

    It works in the bash shell e.g.
    codepoint=13000
    printf "\U000${codepoint}\n"
    𓀀

    However, if I put the same code into a script, e.g.
    cat > cpt
    #!/bin/bash
    codepoint=13000
    printf "\U000${codepoint}\n"
    \U00013000

    So instead of creating the hieroglyph, the script just ignores the same exact code.
    That doesn't happen. Something isn't like you say it is.

    Well, did you try replicating my results? The very simple bash code is right there.


    bash --version
    GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin20)
    Copyright (C) 2007 Free Software Foundation, Inc.
    Presumably that is the version you use to execute the script.
    It is ancient and may not yet support the \U syntax.

    But it does support the \x escapes I use for the UTF-8 bytes, just not the \U codepoint syntax. The basic function of my shell script is to compare Unicode's codepoint with the UTF-8 encoding for all 1071 hieroglyphs.

    The first hieroglyph defined in Unicode is a seated man referred to as A1 in the Gardiner classification.
    Its codepoint is "13000" and its UTF-8 is the matched pair of "80 80" and its hieroglyph is ð“€€.

    utf8a=80 utf8b=80
    utf8_hg=$( printf "\xF0\x93\x${utf8a}\x${utf8b}" )
    echo $utf8_hg
    𓀀

    So the UTF-8 works fine.

    codepoint=13000
    cp_hg=$( echo -e "\U000$codepoint" )
    echo $cp_hg
    𓀀
    bash -version
    GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin20)
    Copyright (C) 2007 Free Software Foundation, Inc.

    So the codepoint works fine in the shell and, as you can see, using the old version.
    Unfortunately, not in the script.


    What bash are you using interactively?

    As I wrote at the end, I typically use the old version ... 3.2.57(1). I downloaded 5.0.17 years ago but stopped using it because I wanted my other, much more intensive script (which uses both bash and AppleScript) to work for any user who might
    download it. I can't really distribute a script for public consumption that relies on Apple staying up to date, which they don't.

    $ echo $BASH_VERSION
    echo $BASH_VERSION
    5.0.17(1)-release

    That's just at the moment though.

    --
    Christian "naddy" Weisgerber

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Russell Marks@21:1/5 to paris2venice@gmail.com on Thu Nov 30 11:49:33 2023
    paris2venice <paris2venice@gmail.com> wrote:

    Russell Marks wrote:
    paris2venice wrote:
    [...]
    So instead of creating the hieroglyph, the script just ignores the
    same exact code. Is there any way around this? Thanks for any
    help.

    bash --version
    GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin20)
    Copyright (C) 2007 Free Software Foundation, Inc.

    That's a very old version of bash. To quote the CHANGES file from a
    newer version, "Fixed several bugs with the handling of valid and
    invalid unicode character values when used with the \u and \U escape
    sequences to printf and $'...'." So the old version not having those
    fixes might be the problem.
    [...]
    That's interesting. Did you see my following comment about trying
    it with version 5.0.17? I had the same exact results.

    That version is also a bit old still, but I'm surprised at it giving
    you the same trouble (assuming that printf is the builtin version).

    Playing around with this on Linux, one way to nearly replicate your
    result with a newer bash is "LC_ALL=C printf '\U00013000\n'" which for
    me will output "\u00013000". So I suppose there could be a locale
    issue involved.
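
    One way to rule that in or out might be to compare what the interactive shell and the script actually see, e.g.:

    locale                     # at the interactive prompt
    /bin/bash -c 'locale'      # what a /bin/bash child process sees
    LC_ALL=en_US.UTF-8 ./cpt   # force a UTF-8 locale for one run of the test script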

    In any case, the UTF-8 does not fail even with the 3.2.57(1)
    release:

    utf8a=80 utf8b=80
    utf8_hg=$( printf "\xF0\x93\x${utf8a}\x${utf8b}" )
    echo $utf8_hg
    𓀀

    That's something at least, and given what you said in a later post
    about wanting to cope with Apple's old bash version for the sake of
    users, you might not have much alternative.
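
    If it comes to that, one fallback (just an untested sketch, and only for the four-byte range U+10000..U+10FFFF that the hieroglyph block lives in) would be to assemble the UTF-8 bytes from the codepoint with shell arithmetic, so only \x is needed and even the old 3.2 bash should cope:

    # Print the \x escapes for a hex codepoint in U+10000..U+10FFFF.
    cp_to_utf8() {
        local cp=$(( 16#$1 ))    # hex string -> number
        printf '\\x%02x\\x%02x\\x%02x\\x%02x' \
            $(( 0xF0 |  cp >> 18 )) \
            $(( 0x80 | (cp >> 12 & 0x3F) )) \
            $(( 0x80 | (cp >>  6 & 0x3F) )) \
            $(( 0x80 | (cp       & 0x3F) ))
    }
    codepoint=13000
    printf "$(cp_to_utf8 "$codepoint")\n"    # should print the A1 glyph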

    -Rus.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christian Weisgerber@21:1/5 to Russell Marks on Thu Nov 30 14:09:55 2023
    On 2023-11-30, Russell Marks <zgedneil@spam^H^H^H^Hgmail.com> wrote:

    Playing around with this on Linux, one way to nearly replicate your
    result with a newer bash is "LC_ALL=C printf '\U00013000\n'" which for
    me will output "\u00013000". So I suppose there could be a locale
    issue involved.

    But if you execute the commands in question first on the command
    line, then in a minimal script as shown, the same locale settings
    will be used for both.

    --
    Christian "naddy" Weisgerber naddy@mips.inka.de

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Russell Marks@21:1/5 to Christian Weisgerber on Thu Nov 30 18:52:56 2023
    Christian Weisgerber <naddy@mips.inka.de> wrote:

    On 2023-11-30, Russell Marks <zgedneil@spam^H^H^H^Hgmail.com> wrote:

    Playing around with this on Linux, one way to nearly replicate your
    result with a newer bash is "LC_ALL=C printf '\U00013000\n'" which for
    me will output "\u00013000". So I suppose there could be a locale
    issue involved.

    But if you execute the commands in question first on the command
    line, then in a minimal script as shown, the same locale settings
    will be used for both.

    True. The old 3.x bash presumably had Unicode bugs though, and the
    differing output of "\U" vs. "\u" could hint at differing causes.
    Also, if the 3.x bash binary is from Apple while the 5.x one isn't (as
    seems likely), I imagine that the binaries could potentially be using
    different libraries and/or locale configs.
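
    A couple of commands might help pin that down (otool being Apple's tool for listing a binary's linked libraries; the 5.x path here is just the one mentioned earlier in the thread):

    type -a bash                    # every bash visible in PATH
    otool -L /bin/bash              # libraries the system bash links against
    otool -L /usr/local/bin/bash    # same for the newer bash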

    Still, I have to admit this is all pretty speculative.

    -Rus.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Qruqs@21:1/5 to All on Sun Aug 18 08:04:28 2024
    On Wed, 29 Nov 2023 00:01:06 -0800 (PST), paris2venice wrote:

    bash --version GNU bash, version 3.2.57(1)-release
    (x86_64-apple-darwin20)
    Copyright (C) 2007 Free Software Foundation, Inc.

    As many have already said, the Bash version might be the issue. I use:

    $ bash --version
    GNU bash, version 5.2.26(1)-release (x86_64-pc-linux-gnu)
    Copyright (C) 2022 Free Software Foundation, Inc.

    Plus, I don't know if there is anyone still around to even care about my posting this (the group seems to have gone dead after that stupid Google Groups gateway was finally turned
    off; "do no harm", right...), and I also know this isn't a Python group, but anyhoo, since it's more or less dead anyway...

    Is there a reason it _has_ to be Bash?

    You could try another language. Python, for instance, can also be run as an executable text file, just like a shell script. You can also call other scripts from Python: do the
    stuff that is easier in Python, then hand off to some other script to solve the rest. Or, using the "sys" module, you can pipe data into
    your Python script and have it send its result back out via stdout (see the one-liner sketch further down). Your
    imagination is the limit here.

    Python 3.12: test-hieroglyphs-post-20240818.py

    CODE:
    ---8<-------------------------------------------------------------------
    #! /usr/bin/env python3
    #coding: utf-8
    print("As is:", "ð“€€")
    print("Using character names:", chr(ord('\N{EGYPTIAN HIEROGLYPH A001}'))) print("The code points as hex:", "ð“€€".encode('utf-8')) ---8<-------------------------------------------------------------------

    Running it:
    $ ./test-hieroglyphs-post-20240818.py
    As is: 𓀀
    Using character names: 𓀀
    The code points as hex bytes: b'\xf0\x93\x80\x80'


    This works because Python 3.x is Unicode aware. All strings are Unicode by default.
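
    As a sketch of the pipe idea above, a bash script could hand a codepoint to a small Python helper on stdin and read the glyph back from stdout; here the "helper" is just an inline one-liner rather than the script shown:

    codepoint=13000
    glyph=$( printf '%s\n' "$codepoint" |
             python3 -c 'import sys; print(chr(int(sys.stdin.read(), 16)))' )
    echo "$glyph"    # prints the A1 hieroglyph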

    There are other languages that might be suitable also. Pick one and try it
    out.


    There are more ways than one to skin a cat.

    Q.
    --
    Currently using: https://manjaro.org/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)