Forum: >>> Magnum BBS <<<

gsub() & Escaped characters in strings

From J Naman@21:1/5 to All on Wed Apr 28 18:37:44 2021

# I do not understand why gsub() seems to quadruple scan escaped characters inside
# strings when the number of escaped characters is > 6. See below.
BEGIN{ # quick test of regexp
# to match path = ...\foo\...
# (path ~ "\\\\foo\\\\") is required
# looking for "\\\\foo\\\\" string constant <==> regexp /\\foo\\/
# in every case below, gsub() returns 2 = number of substitutions (as expected)
x=";foo;"; n=gsub(/;/,"\\",x); printf("# returns %s\n",x)
x=";foo;"; n=gsub(/;/,"\\\\",x); printf("# returns %s\n",x)
x=";foo;"; n=gsub(/;/,"\\\\\\",x); printf("# returns %s\n",x)
x=";foo;"; n=gsub(/;/,"\\\\\\\\",x); printf("# returns %s\n",x)
x=";foo;"; n=gsub(/;/,"\\\\\\\\\\",x); printf("# returns %s\n",x)
x=";foo;"; n=gsub(/;/,"\\\\\\\\\\\\",x); printf("# returns %s\n",x)
}
# 2 \s returns 1 escaped: \foo\ as expected
# 4 \s returns 2 escaped: \\foo\\ as expected
# 6 \s returns 3 escaped: \\\foo\\\ as expected
# 8 \s returns 2 escaped: \\foo\\ NOT 4 \s
# 10 \s returns 3 escaped: \\\foo\\\ NOT 6 \s
# 12 \s returns 4 escaped: \\\\foo\\\\ ! finally get 4 \s!
# Can anyone explain why 8+ \s are different?
# Why gsub(/;/,"\\\\",x) == gsub(/;/,"\\\\\\\\",x)
Thanks, john

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Janis Papanagnou@21:1/5 to J Naman on Thu Apr 29 14:02:01 2021

### gawk ###        ### nawk ###        ### mawk ###
---                 ---                 ---
Repl:  \            Repl:  \            Repl:  \
Where: \foo         Where: \foo         Where: \foo
---                 ---                 ---
Repl:  \\           Repl:  \\           Repl:  \\
Where: \\foo        Where: \\foo        Where: \foo
---                 ---                 ---
Repl:  \\\          Repl:  \\\          Repl:  \\\
Where: \\\foo       Where: \\\foo       Where: \\foo
---                 ---                 ---
Repl:  \\\\         Repl:  \\\\         Repl:  \\\\
Where: \\foo        Where: \\\\foo      Where: \\foo
---                 ---                 ---
Repl:  \\\\\        Repl:  \\\\\        Repl:  \\\\\
Where: \\\foo       Where: \\\\\foo     Where: \\\foo
---                 ---                 ---
Repl:  \\\\\\       Repl:  \\\\\\       Repl:  \\\\\\
Where: \\\\foo      Where: \\\\\\foo    Where: \\\foo
---                 ---                 ---

Three awks, three different results.

Note: 'Repl' shows the replacement as awk sees it after string-literal
processing; the corresponding literal in the program source is twice as
long (e.g. "\\\\" becomes \\).

Janis

From J Naman@21:1/5 to Janis Papanagnou on Thu Apr 29 07:47:43 2021

I got smart and thought of a much better solution to the entire problem, one that should presumably work in any version of awk:
a) gsub(/\\/,"<non-escaped delim>", path_str);
b) then look for match()es as "normal", using the <non-escaped delim> in a regex.
That bypasses all the "Gory Details" and the whole business of escaping backslashes. Much cleaner, and similar to using tolower(path) to avoid IGNORECASE. Sometimes I forget to look at the forest when playing with twigs ...
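
A minimal sketch of the workaround described above, assuming a POSIX shell with some awk on the PATH. The function name has_foo_dir, the sample paths, and the choice of "/" as the replacement delimiter are illustrative assumptions, not part of the original post.

```shell
# Hypothetical helper: report whether a Windows-style path contains \foo\
has_foo_dir() {
  printf '%s\n' "$1" | awk '{
    path = $0
    # The regex /\\/ matches one literal backslash; the replacement "/"
    # contains no backslashes, so no replacement-string escaping is involved.
    gsub(/\\/, "/", path)
    # After normalizing, the directory test needs no backslashes at all.
    if (path ~ "/foo/")
      print "match"
    else
      print "no match"
  }'
}

has_foo_dir 'C:\tmp\foo\bar.txt'     # prints: match
has_foo_dir 'C:\tmp\foobar\x.txt'    # prints: no match
```

Since backslashes appear only in the regex and never in the replacement string, the implementation differences discussed in this thread should not come into play.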

From Ed Morton@21:1/5 to J Naman on Thu Apr 29 10:36:04 2021

I'm _guessing_ it's because the string gets interpreted twice, once when
the awk interpreter reads it and then again when it uses it, so for the
2 passes of interpretation, depending on how the "use" phase interprets
pairs of backslashes, we could get:

\\ -> read -> \ -> use -> \
\\\\ -> read -> \\ -> use -> \ or \\
\\\\\\ -> read -> \\\ -> use -> \\\
\\\\\\\\ -> read -> \\\\ -> use -> \\ or \\\\

Now WHY any given awk when using the string would interpret 4
backslashes as 2 but not 2 backslashes as 1, I can't guess.

Regards,

Ed.
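
The "read" pass in the guess above can be observed on its own, without involving gsub() at all. A small sketch, assuming a POSIX shell and any awk on the PATH; the function name show_read_pass is hypothetical. length() reports how many backslashes are left once awk has lexed each string literal:

```shell
# Hypothetical helper: show only the first ("read") pass.
# Each literal has 2, 4, 6 and 8 backslashes in the source text.
show_read_pass() {
  awk 'BEGIN {
    print length("\\"), length("\\\\"), length("\\\\\\"), length("\\\\\\\\")
  }'
}

show_read_pass    # prints: 1 2 3 4
```

So the lexer halves the count every time. Whatever differs between awks must therefore happen in the second pass, when sub() or gsub() processes the replacement string.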

From Oğuz@21:1/5 to All on Thu Apr 29 10:35:28 2021

On Thursday, April 29, 2021 at 8:33:21 PM UTC+3, Oğuz wrote:

Now, I have busybox awk, gawk, mawk, nawk, and NetBSD awk installed on my computer, and none of them gives that output.

No, sorry, actually mawk does.
$ mawk 'BEGIN { x = "y"; sub(/y/, "\\y\\\\y\\\\\\y\\\\\\\\y", x); print x }'
\y\y\\y\\y

From Oğuz@21:1/5 to Ed Morton on Thu Apr 29 10:33:19 2021

According to POSIX, when used as the second argument to sub or gsub, "\\" and "\\\\" are the same (one backslash) unless the former is followed by an ampersand. The same goes for "\\\\\\" and "\\\\\\\\" (two backslashes). For example, given the following program,
BEGIN { x = "y"; sub(/y/, "\\y\\\\y\\\\\\y\\\\\\\\y", x); print x }
a POSIX-conformant awk should output:
\y\y\\y\\y

Now, I have busybox awk, gawk, mawk, nawk, and NetBSD awk installed on my computer, and none of them gives that output.
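
For anyone who wants to repeat this comparison locally, here is a minimal sketch. It assumes a POSIX shell and that whichever awk command you pass in is installed; the helper name posix_sub_check is made up for illustration. It runs the test program above and compares the result with the output POSIX calls for.

```shell
# Hypothetical helper: run the test program under the given awk command
# and report whether its output matches the POSIX-specified result.
posix_sub_check() {
  expected='\y\y\\y\\y'
  actual=$("$@" 'BEGIN { x = "y"; sub(/y/, "\\y\\\\y\\\\\\y\\\\\\\\y", x); print x }')
  if [ "$actual" = "$expected" ]; then
    printf '%s: POSIX result %s\n' "$*" "$actual"
  else
    printf '%s: differs, got %s\n' "$*" "$actual"
  fi
}

posix_sub_check awk
# Other candidates, if installed:
#   posix_sub_check gawk --posix
#   posix_sub_check mawk
```

The verdict depends on the awk implementation and version in use, so no expected output is shown here.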

From Luuk@21:1/5 to All on Thu Apr 29 21:19:51 2021

On 29-4-2021 19:35, Oğuz wrote:

On Thursday, April 29, 2021 at 8:33:21 PM UTC+3, Oğuz wrote:

Now, I have busybox awk, gawk, mawk, nawk, and NetBSD awk installed on my computer, and none of them gives that output.

No, sorry, actually mawk does.
$ mawk 'BEGIN { x = "y"; sub(/y/, "\\y\\\\y\\\\\\y\\\\\\\\y", x); print x }'
\y\y\\y\\y

D:\TEMP>gawk "BEGIN { x = \"y\"; sub(/y/, \"\\y\\\\y\\\\\\y\\\\\\\\y\", x); print x }"
\y\\y\\\y\\y

D:\TEMP>gawk -P "BEGIN { x = \"y\"; sub(/y/, \"\\y\\\\y\\\\\\y\\\\\\\\y\", x); print x }"
\y\y\\y\\y

D:\TEMP>gawk --version
GNU Awk 5.1.0, API: 3.0 (GNU MPFR 3.1.5, GNU MP 6.1.2)
Copyright (C) 1989, 1991-2020 Free Software Foundation.

This pro.......
