• Gawk IGNORECASE=0 vs =1

    From J Naman@21:1/5 to All on Wed Feb 23 13:34:40 2022
    Benchmarks have been mentioned recently in some of these posts. I recently benchmarked IGNORECASE=0 vs =1 under GAWK Ver 5.1.1.
    The result is no surprise: for this one benchmark, IGNORECASE=1 scored 128% of the IGNORECASE=0 figure, i.e. it was about 28% slower.
    I am just reporting the quantitative difference. The wall clock time difference can be non-trivial for processing large files having hundreds of
    thousands of lines of text.
    The benchmark was for a set of six statements:
    four of the form:
    if(str~/^text/) {return;}
    plus two statements:
    if(str~/^[A-Z]+$/) {return;}
    if(str~/^[a-z]+$/) {return;}
    The results over an aggregate 3 million loops:
    Score   Test
    817     Avg IGN=1
    640     Avg IGN=0
    Ratio 817/640 = 128%, i.e. IGN=1 scored about 28% higher (slower)

    Also, I reran the same test with "? at the beginning
    and end of all six regexps. The scores were
    not significantly different than the above.

    * These Scores are scaled. I was previously warned
    not to report actual CPU or clock times for one particular system.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ed Morton@21:1/5 to J Naman on Wed Feb 23 15:55:33 2022
    Were there 128% more matches or some other difference in the matched
    strings? Without knowing what the input contained it's hard to know what
    those results mean. What were you hoping to test by adding `"?` to the
    regexps? Without knowing how IGN=1 compares to the alternative of
    `tolower(str) ~ /^[a-z]+$/` I'm not sure what we could actually do with
    this information.

    Ed.

  • From J Naman@21:1/5 to Ed Morton on Wed Feb 23 20:45:07 2022
    Here are six results, scaled: (not surprising to me)
           IC=0   IC=1   1/0
    low    108    226    109% longer
    Mix    100    228    128% longer
    UP     111    225    102% longer

    low = "include variable function namespace x"
    Mix = "Include Variable Function NameSpace x"
    UP = "INCLUDE VARIABLE FUNCTION NAMESPACE X"

    function testmatch(str, x){ # all 7 regexp are tested every call
        if(str~/^include variable function namespace x/) {x++} # lower
        if(str~/^INCLUDE VARIABLE FUNCTION NAMESPACE X/) {x++} # upper
        if(str~/^Include Variable Function NameSpace x/) {x++} # mixed
        if(str~/^iNCLUDE vARIABLE fUNCTION nAMEsPACE X/) {x++} # Invert case
        if(str~/^INclUde varIABLE FUnCtIon NamEspACe X/) {x++} # random case
        if(str~/^[A-Z ]+$/) {x++}
        if(str~/^[a-z ]+$/) {x++}
    } #eofunc testmatch(str)

    So, worst case, IGNORECASE=1 takes about twice as long. No surprise.
    I forced all 7 regexps to be tested on every call because my real data doesn't match very often.
    All of my regexps are mixed case, and the file data are supposed to be too.
    tolower() on both input and regexp looks to be no better than
    matching mixed-case input against a mixed-case regexp.
    btw: 'random case' is a quirky feature of my editor I never had any use for before.

  • From Ed Morton@21:1/5 to J Naman on Thu Feb 24 06:51:08 2022
    Am I right in thinking that by the above you mean your test script is
    basically a script that calls that function some large number of times
    in a loop with 1 of the stated strings, e.g.

    BEGIN {
        low = "include variable function namespace x"
        for (i=1;i<=1000000;i++) testmatch(low)
    }


    I'm still struggling to understand what we're supposed to **do** with
    the above information. I mean if we need to match a regexp against
    mixed-case input we have 2 choices:

    1) IGNORECASE=1; .. $0 ~ /foo/
    2) tolower($0) ~ /foo/

    and what we cannot do is just:

    3) $0 ~ /foo/

    so what can we do with the information that "1" would be slower than "3"
    since we can't use "3" for this anyway? If you told us that "1" was
    slower than "2" then we could use that information to write scripts
    using "2" instead of "1" but I just don't see how the speed of "1" vs
    the speed of "3" is something we can act on.

    Ed.
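
    To make the three options above concrete, here is a minimal sketch as shell functions. The four-line sample input and the function names (sample, count_ignorecase, count_tolower, count_plain) are made up for illustration. Option 1 needs gawk, since IGNORECASE is a gawk extension; options 2 and 3 should work in any POSIX awk.

```shell
#!/bin/sh
# Hypothetical sample input; "foo" stands in for any lowercase regexp.
sample() { printf '%s\n' 'foo bar' 'Foo Bar' 'FOO BAR' 'baz'; }

# 1) gawk only: IGNORECASE=1 makes regexp matching case-insensitive.
count_ignorecase() {
    sample | gawk 'BEGIN { IGNORECASE = 1 } $0 ~ /foo/ { n++ } END { print n + 0 }'
}

# 2) Portable: fold the input to lowercase and match a lowercase regexp.
count_tolower() {
    sample | awk 'tolower($0) ~ /foo/ { n++ } END { print n + 0 }'
}

# 3) Plain case-sensitive match: only the all-lowercase line matches.
count_plain() {
    sample | awk '$0 ~ /foo/ { n++ } END { print n + 0 }'
}
```

    On this sample, options 1 and 2 both count 3 matching lines while option 3 counts only 1, which is why option 3 is not a substitute when the case of the input varies.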

  • From J Naman@21:1/5 to Ed Morton on Thu Feb 24 15:55:42 2022
    Ed, you're quite right, and I apologize for not telling you what motivated all this. I have files of mixed-case text and regexps that are mixed case, e.g. /New York, NY/. I did an @include "foo" that included some function that set IGNORECASE=1, and
    everything s-l-o-w-e-d down. Once I figured out that IGNORECASE was probably responsible, I benchmarked to see what the times were. Thus, for my data, if and when possible, I match text exactly against the regexp with IGNORECASE=0. As I said, no surprise. Sorry if I
    have wasted people's time. John
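
    One way to keep an included function from slowing down everything else, sketched under the assumption that the included code only needs case-insensitive matching inside its own functions, is to save IGNORECASE on entry and restore it before returning. The names ci_match and scoped_demo are hypothetical, not anything from the actual "foo" file, and this needs gawk.

```shell
#!/bin/sh
scoped_demo() {
    gawk '
    # Hypothetical helper: match str against re ignoring case, then put
    # IGNORECASE back the way the caller had it.
    function ci_match(str, re,    saved, hit) {
        saved = IGNORECASE
        IGNORECASE = 1
        hit = (str ~ re)
        IGNORECASE = saved
        return hit
    }
    BEGIN {
        # Case is ignored only inside the helper.
        print ci_match("new york, ny", "New York, NY")
        # Back outside, matching is exact-case again.
        print ("new york, ny" ~ /New York, NY/)
    }'
}
```

    Calling scoped_demo prints 1 for the helper call and 0 for the plain match, so the rest of the script keeps running with IGNORECASE=0.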

  • From Ed Morton@21:1/5 to J Naman on Thu Feb 24 18:16:40 2022
    Ah, now I understand what this was about. Thanks for the information.

    Ed.

  • From Kpop 2GM@21:1/5 to All on Mon Mar 28 12:59:03 2022
    % ( time ( pvE0 < testnamespacecase_9999999.txt | mawk2 'tolower($0)~"^include variable function namespace x$"' FS='^$' ) | pvE9)| wc5

    in0: 3.40GiB 0:00:04 [ 818MiB/s] [ 818MiB/s] [=============================>] 100%
    out9: 64.3MiB 0:00:04 [15.1MiB/s] [15.1MiB/s] [ <=> ]
    ( pvE 0.1 in0 < testnamespacecase_9999999.txt | mawk2 FS='^$'; ) 4.07s user 0.79s system 113% cpu 4.275 total
    rows = 1773954. | UTF8 chars = 67410252. | bytes = 67410252.


    % ( time ( pvE0 < testnamespacecase_9999999.txt | mawk2 'toupper($0)~"^INCLUDE VARIABLE FUNCTION NAMESPACE X$"' FS='^$' ) | pvE9)| wc5
    in0: 80.2MiB 0:00:00 [ 801MiB/s] [ 801MiB/s] [> ] 2% ETA 0:00:00
    out9: 64.3MiB 0:00:04 [15.1MiB/s] [15.1MiB/s] [ <=> ]
    in0: 3.40GiB 0:00:04 [ 820MiB/s] [ 820MiB/s] [=============================>] 100%
    ( pvE 0.1 in0 < testnamespacecase_9999999.txt | mawk2 FS='^$'; ) 4.06s user 0.79s system 113% cpu 4.268 total
    rows = 1773954. | UTF8 chars = 67410252. | bytes = 67410252.


    % ( time ( pvE0 < testnamespacecase_9999999.txt | mawk2 '/^[Ii][Nn][Cc][Ll][Uu][Dd][Ee] [Vv][Aa][Rr][Ii][Aa][Bb][Ll][Ee] [Ff][Uu][Nn][Cc][Tt][Ii][Oo][Nn] [Nn][Aa][Mm][Ee][Ss][Pp][Aa][Cc][Ee] [Xx]$/' FS='^$' )| pvE9) | wc5

    out9: 64.3MiB 0:00:02 [32.0MiB/s] [32.0MiB/s] [ <=> ]
    in0: 3.40GiB 0:00:02 [1.70GiB/s] [1.70GiB/s] [=============================>] 100%
    ( pvE 0.1 in0 < testnamespacecase_9999999.txt | mawk2 FS='^$'; ) 1.58s user 0.82s system 118% cpu 2.026 total
    rows = 1773954. | UTF8 chars = 67410252. | bytes = 67410252.


    What I'm seeing is that if you simply make EVERY letter a bracket expression covering both upper and lower case, and also prevent it from splitting fields, the run is more than twice as fast (about 2.0s total vs 4.3s). And that's only for mawk-2. For gawk, the savings are unearthly:



    out9: 64.3MiB 0:00:43 [1.48MiB/s] [1.48MiB/s] [ <=> ]
    in0: 3.40GiB 0:00:43 [80.5MiB/s] [80.5MiB/s] [=============================>] 100%
    ( pvE 0.1 in0 < testnamespacecase_9999999.txt | gawk -Se FS='^$'; ) 43.01s user 1.12s system 101% cpu 43.317 total
    rows = 1773954. | UTF8 chars = 67410252. | bytes = 67410252.



    in0: 3.40GiB 0:00:44 [78.0MiB/s] [78.0MiB/s] [=============================>] 100%
    out9: 64.3MiB 0:00:44 [1.44MiB/s] [1.44MiB/s] [ <=> ]
    ( pvE 0.1 in0 < testnamespacecase_9999999.txt | gawk -Se FS='^$'; ) 44.35s user 1.14s system 101% cpu 44.671 total
    rows = 1773954. | UTF8 chars = 67410252. | bytes = 67410252.



    out9: 64.3MiB 0:00:05 [10.7MiB/s] [10.7MiB/s] [ <=> ]
    in0: 3.40GiB 0:00:05 [ 582MiB/s] [ 582MiB/s] [=============================>] 100%
    ( pvE 0.1 in0 < testnamespacecase_9999999.txt | gawk -Se FS='^$'; ) 5.83s user 0.81s system 110% cpu 6.006 total
    rows = 1773954. | UTF8 chars = 67410252. | bytes = 67410252.

    =====================
    echo; ( time ( pvE0 < testnamespacecase_9999999.txt | gawk -Se 'toupper($0)~"^INCLUDE VARIABLE FUNCTION NAMESPACE X$"' FS='^$' ) | pvE9)| wc5; sleep 1;
    ( time ( pvE0 < testnamespacecase_9999999.txt | gawk -Se 'tolower($0)~"^include variable function namespace x$"' FS='^$' ) | pvE9)| wc5 ; sleep 1;
    ( time ( pvE0 < testnamespacecase_9999999.txt | gawk -Se '/^[Ii][Nn][Cc][Ll][Uu][Dd][Ee] [Vv][Aa][Rr][Ii][Aa][Bb][Ll][Ee] [Ff][Uu][Nn][Cc][Tt][Ii][Oo][Nn] [Nn][Aa][Mm][Ee][Ss][Pp][Aa][Cc][Ee] [Xx]$/' FS='^$' )| pvE9) | wc5
    =====================

    The 4Chan Teller
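
    Typing the bracket expressions by hand gets tedious, so here is a minimal sketch that builds such a regexp from a plain string. ci_regexp is a hypothetical helper name, and it assumes the string contains no regexp metacharacters or backslashes (it does no escaping). It should run in any POSIX awk.

```shell
#!/bin/sh
# Hypothetical helper: turn "abc x" into "^[Aa][Bb][Cc] [Xx]$".
# Assumes the argument has no regexp metacharacters or backslashes.
ci_regexp() {
    awk -v s="$1" 'BEGIN {
        out = ""
        for (i = 1; i <= length(s); i++) {
            c = substr(s, i, 1)
            lo = tolower(c)
            up = toupper(c)
            # Letters become a two-character bracket expression;
            # anything without a case distinction is copied as-is.
            piece = (lo != up) ? "[" up lo "]" : c
            out = out piece
        }
        print "^" out "$"
    }'
}
```

    For example, ci_regexp 'include x' prints ^[Ii][Nn][Cc][Ll][Uu][Dd][Ee] [Xx]$, and the result can be passed to a script with something like awk -v re="$(ci_regexp 'include x')" '$0 ~ re'. Whether a dynamic regexp passed this way performs like the regexp literal in the timings above has not been measured here.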
