• gsub() & Escaped characters in strings

    From J Naman@21:1/5 to All on Wed Apr 28 18:37:44 2021
    # I do not understand why gsub() seems to quadruple scan escaped characters inside
    # strings when the number of escaped characters is > 6. See below.
    BEGIN{ # quick test of regexpr
    # to match path = ...\foo\...
    # (path ~ "\\\\foo\\\\") is required
    # looking for "\\\\foo\\\\" string constant <==> regexp /\\foo\\/
    # in every case below, gsub() returns 2 = number of substitutions (as expected)
    x=";foo;"; n=gsub(/;/,"\\",x); printf("# returns %s\n",x)
    x=";foo;"; n=gsub(/;/,"\\\\",x); printf("# returns %s\n",x)
    x=";foo;"; n=gsub(/;/,"\\\\\\",x); printf("# returns %s\n",x)
    x=";foo;"; n=gsub(/;/,"\\\\\\\\",x); printf("# returns %s\n",x)
    x=";foo;"; n=gsub(/;/,"\\\\\\\\\\",x); printf("# returns %s\n",x)
    x=";foo;"; n=gsub(/;/,"\\\\\\\\\\\\",x); printf("# returns %s\n",x)
    }
    # 2 \s returns 1 escaped: \foo\ as expected
    # 4 \s returns 2 escaped: \\foo\\ as expected
    # 6 \s returns 3 escaped: \\\foo\\\ as expected
    # 8 \s returns 2 escaped: \\foo\\ NOT 4 \s
    # 10 \s returns 3 escaped: \\\foo\\\ NOT 6 \s
    # 12 \s returns 4 escaped: \\\\foo\\\\ ! finally get 4 \s!
    # Can anyone explain why 8+ \s are different?
    # Why gsub(/;/,"\\\\",x) == gsub(/;/,"\\\\\\\\",x)
    Thanks, john
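
    The equivalence noted in the comments above (a regexp given as a string needs twice as many backslashes as the same regexp written as a constant) can be checked with a minimal sketch. The sample paths and the function name match_both are made up for illustration.

```sh
# Hypothetical helper: test one path against both spellings of the same regexp.
match_both() {
    printf '%s\n' "$1" | awk '{
        as_const  = ($0 ~ /\\foo\\/)        # regexp constant: \\ matches one backslash
        as_string = ($0 ~ "\\\\foo\\\\")    # string: lexed to \\foo\\ first, then compiled
        print as_const, as_string
    }'
}

match_both 'C:\foo\bar'    # prints "1 1"
match_both 'C:/foo/bar'    # prints "0 0"
```

    No replacement string is involved here, so the extra sub()/gsub() processing discussed in this thread does not come into play.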

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to J Naman on Thu Apr 29 14:02:01 2021
    ### gawk ###        ### nawk ###        ### mawk ###
    ---                 ---                 ---
    Repl:  \            Repl:  \            Repl:  \
    Where: \foo         Where: \foo         Where: \foo
    ---                 ---                 ---
    Repl:  \\           Repl:  \\           Repl:  \\
    Where: \\foo        Where: \\foo        Where: \foo
    ---                 ---                 ---
    Repl:  \\\          Repl:  \\\          Repl:  \\\
    Where: \\\foo       Where: \\\foo       Where: \\foo
    ---                 ---                 ---
    Repl:  \\\\         Repl:  \\\\         Repl:  \\\\
    Where: \\foo        Where: \\\\foo      Where: \\foo
    ---                 ---                 ---
    Repl:  \\\\\        Repl:  \\\\\        Repl:  \\\\\
    Where: \\\foo       Where: \\\\\foo     Where: \\\foo
    ---                 ---                 ---
    Repl:  \\\\\\       Repl:  \\\\\\       Repl:  \\\\\\
    Where: \\\\foo      Where: \\\\\\foo    Where: \\\foo
    ---                 ---                 ---

    Three awks, three different results.

    Note: 'Repl' shows the replacement as the substitution function receives it,
    i.e. after awk's string-literal handling (the literal in the source is twice
    as long, e.g. "\\\\" -> \\ ).

    Janis

  • From J Naman@21:1/5 to Janis Papanagnou on Thu Apr 29 07:47:43 2021
    I got smart and thought of a much better solution to the entire problem, presumably in any version of awk:
    a) gsub(/\\/,"<non-escaped delim>", path_str);
    b) then look for match()s as "normal", using the <non-escaped delim> in a regex.
    That bypasses all the "Gory Details" and the whole business of escaping backslashes. Much cleaner, similar to the method of using tolower(path) to avoid IGNORECASE. Sometimes I forget to look at the forest when playing with twigs ...
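
    A minimal sketch of steps a) and b), assuming "/" as the non-escaped delimiter (any character that cannot occur in the path would do). The function name has_foo_dir and the sample paths are hypothetical.

```sh
# Hypothetical helper: swap every backslash for "/" and then match with an ordinary regexp.
has_foo_dir() {
    printf '%s\n' "$1" | awk '{
        path = $0
        gsub(/\\/, "/", path)    # the replacement holds no backslash, so no escaping rules apply
        print path, (path ~ /\/foo\// ? "match" : "no match")
    }'
}

has_foo_dir 'C:\tmp\foo\bar.txt'     # prints "C:/tmp/foo/bar.txt match"
has_foo_dir 'C:\tmp\food\bar.txt'    # prints "C:/tmp/food/bar.txt no match"
```

    The only backslashes left are in the regexp constant /\\/, which was not the part that differed between implementations in this thread.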

  • From Ed Morton@21:1/5 to J Naman on Thu Apr 29 10:36:04 2021
    I'm _guessing_ it's because the string gets interpreted twice, once when
    the awk interpreter reads it and then again when it uses it, so for the
    2 passes of interpretation, depending on how the "use" phase interprets
    pairs of backslashes, we could get:

    \\ -> read -> \ -> use -> \
    \\\\ -> read -> \\ -> use -> \ or \\
    \\\\\\ -> read -> \\\ -> use -> \\\
    \\\\\\\\ -> read -> \\\\ -> use -> \\ or \\\\

    Now WHY any given awk when using the string would interpret 4
    backslashes as 2 but not 2 backslashes as 1, I can't guess.
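
    The "read" pass described above can be observed on its own, without sub() or gsub(). A small sketch (the function name literal_lengths is made up) that prints how many characters each literal holds once the awk lexer has processed it:

```sh
# Show what a string literal holds after the awk lexer has processed it
# (the "read" pass), before sub()/gsub() ever see it.
literal_lengths() {
    awk 'BEGIN {
        two   = "\\"          # 2 backslashes in the source
        four  = "\\\\"        # 4 backslashes in the source
        eight = "\\\\\\\\"    # 8 backslashes in the source
        print length(two), length(four), length(eight)
    }'
}

literal_lengths    # prints "1 2 4"
```

    Since this halving is the same everywhere, whatever differs between implementations has to come from the second pass, inside sub()/gsub().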

    Regards,

    Ed.

  • From Oğuz@21:1/5 to All on Thu Apr 29 10:35:28 2021
    On Thursday, April 29, 2021 at 8:33:21 PM UTC+3, Oğuz wrote:
    Now, I have busybox awk, gawk, mawk, nawk, and NetBSD awk installed on my computer, and none of them gives that output.
    No, sorry, actually mawk does.
    $ mawk 'BEGIN { x = "y"; sub(/y/, "\\y\\\\y\\\\\\y\\\\\\\\y", x); print x }'
    \y\y\\y\\y

  • From Oğuz@21:1/5 to Ed Morton on Thu Apr 29 10:33:19 2021
    On Thursday, April 29, 2021 at 6:36:06 PM UTC+3, Ed Morton wrote:
    I'm _guessing_ it's because the string gets interpreted twice, once when
    the awk interpreter reads it and then again when it uses it, so for the
    2 passes of interpretation, depending on how the "use" phase interprets
    pairs of backslashes, we could get:

    \\ -> read -> \ -> use -> \
    \\\\ -> read -> \\ -> use -> \ or \\
    \\\\\\ -> read -> \\\ -> use -> \\\
    \\\\\\\\ -> read -> \\\\ -> use -> \\ or \\\\

    According to POSIX, when used as the second argument to sub or gsub, "\\" and "\\\\" are the same (one backslash) unless the former is followed by an ampersand. The same goes for "\\\\\\" and "\\\\\\\\" (two backslashes). For example, given the following program,
    BEGIN { x = "y"; sub(/y/, "\\y\\\\y\\\\\\y\\\\\\\\y", x); print x }
    a POSIX-conformant awk should output:
    \y\y\\y\\y

    Now, I have busybox awk, gawk, mawk, nawk, and NetBSD awk installed on my computer, and none of them gives that output.
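
    The ampersand exception mentioned above is easier to see in isolation. The sketch below (amp_demo is a made-up name) runs three sub() calls that differ only in the backslashes before &. The expected lines follow the POSIX rules, and as far as I know the awks named in this thread agree on these three cases, but treat that as an assumption and check it locally.

```sh
# Hypothetical helper: three sub() calls that differ only in the backslashes before &.
amp_demo() {
    awk 'BEGIN {
        a = "foo"; sub(/o/, "[&]", a)        # & stands for the matched text
        b = "foo"; sub(/o/, "[\\&]", b)      # \& after lexing: a literal ampersand
        c = "foo"; sub(/o/, "[\\\\&]", c)    # \\& after lexing: literal backslash, then the match
        print a; print b; print c
    }'
}

amp_demo
# expected:
# f[o]o
# f[&]o
# f[\o]o
```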



  • From Luuk@21:1/5 to All on Thu Apr 29 21:19:51 2021
    On 29-4-2021 19:35, Oğuz wrote:
    On Thursday, April 29, 2021 at 8:33:21 PM UTC+3, Oğuz wrote:
    Now, I have busybox awk, gawk, mawk, nawk, and NetBSD awk installed on my computer, and none of them gives that output.
    No, sorry, actually mawk does.
    $ mawk 'BEGIN { x = "y"; sub(/y/, "\\y\\\\y\\\\\\y\\\\\\\\y", x); print x }'
    \y\y\\y\\y


    D:\TEMP>gawk "BEGIN { x = \"y\"; sub(/y/, \"\\y\\\\y\\\\\\y\\\\\\\\y\", x); print x }"
    \y\\y\\\y\\y

    D:\TEMP>gawk -P "BEGIN { x = \"y\"; sub(/y/, \"\\y\\\\y\\\\\\y\\\\\\\\y\", x); print x }"
    \y\y\\y\\y

    D:\TEMP>gawk --version
    GNU Awk 5.1.0, API: 3.0 (GNU MPFR 3.1.5, GNU MP 6.1.2)
    Copyright (C) 1989, 1991-2020 Free Software Foundation.

    This pro.......
