Forum: >>> Magnum BBS <<<

Gawk IGNORECASE=0 vs =1

From J Naman@21:1/5 to All on Wed Feb 23 13:34:40 2022

Benchmarks have come up in several recent posts here, so I benchmarked IGNORECASE=0 vs IGNORECASE=1 under gawk 5.1.1.
The result is no surprise: for this one benchmark the IGNORECASE=1 score was about 128% of the IGNORECASE=0 score, i.e. roughly 28% slower.
I am just reporting the quantitative difference. The wall-clock difference can be non-trivial when processing large files with hundreds of
thousands of lines of text.
The benchmark was a set of six statements,
four of the form:
if(str~/^text/) {return;}
plus two statements:
if(str~/^[A-Z]+$/) {return;}
if(str~/^[a-z]+$/) {return;}
The results over an aggregate 3 million loops:
Score  Test
817    Avg IGN=1
640    Avg IGN=0
IGN=1 / IGN=0 = 128%

Also, I reran the same test with "? at the beginning
and end of all six regexps. The scores were
not significantly different from the above.

* These scores are scaled. I was previously warned
not to report actual CPU or clock times for one particular system.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ed Morton@21:1/5 to J Naman on Wed Feb 23 15:55:33 2022

Were there 128% more matches or some other difference in the matched
strings? Without knowing what the input contained it's hard to know what
those results mean. What were you hoping to test by adding `"?` to the
regexps? Without knowing how IGN=1 compares to the alternative of
`tolower(str) ~ /^[a-z]+$/` I'm not sure what we could actually do with
this information.

Ed.

From J Naman@21:1/5 to Ed Morton on Wed Feb 23 20:45:07 2022

Here are six results, scaled (not surprising to me):
      IC=0  IC=1  IC=1 vs IC=0
low   108   226   109% longer
Mix   100   228   128% longer
UP    111   225   102% longer

low = "include variable function namespace x"
Mix = "Include Variable Function NameSpace x"
UP = "INCLUDE VARIABLE FUNCTION NAMESPACE X"

function testmatch(str, x){ # all 7 regexp are tested every call
if(str~/^include variable function namespace x/) {x++} # lower
if(str~/^INCLUDE VARIABLE FUNCTION NAMESPACE X/) {x++} # upper
if(str~/^Include Variable Function NameSpace x/) {x++} # mixed
if(str~/^iNCLUDE vARIABLE fUNCTION nAMEsPACE X/) {x++} # Invert case
if(str~/^INclUde varIABLE FUnCtIon NamEspACe X/) {x++} # random case
if(str~/^[A-Z ]+$/) {x++}
if(str~/^[a-z ]+$/) {x++}
} #eofunc testmatch(str)

So, worst case, IGNORECASE=1 takes a little over twice as long. No surprise.
I forced all 7 regexps to be tested on every call because my real data doesn't match very often.
All of my regexps are mixed case, and the file data are supposed to be as well.
Applying tolower() to both input and regexp looks to be no better than
matching mixed-case input against a mixed-case regexp.
btw: 'random case' is a quirky feature of my editor that I never had any use for before.
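
To make it easier to see how many of the seven tests actually succeed for each input with case-sensitive matching, here is a minimal sketch that returns the counter from testmatch(). The shell wrapper name count_matches, the renamed parameter s, the added return statement, and the LC_ALL=C setting are assumptions for illustration and are not part of the benchmark as posted.

```shell
# Sketch: count how many of the seven regexps match one input string.
# Runs with case-sensitive matching (the IGNORECASE=0 situation).
count_matches() {
  LC_ALL=C awk -v str="$1" '
    function testmatch(s,    x) {
      if (s ~ /^include variable function namespace x/) x++  # lower
      if (s ~ /^INCLUDE VARIABLE FUNCTION NAMESPACE X/) x++  # upper
      if (s ~ /^Include Variable Function NameSpace x/) x++  # mixed
      if (s ~ /^iNCLUDE vARIABLE fUNCTION nAMEsPACE X/) x++  # inverted case
      if (s ~ /^INclUde varIABLE FUnCtIon NamEspACe X/) x++  # random case
      if (s ~ /^[A-Z ]+$/) x++
      if (s ~ /^[a-z ]+$/) x++
      return x + 0            # added here so the count can be inspected
    }
    BEGIN { print testmatch(str) }'
}

count_matches "include variable function namespace x"   # low
count_matches "Include Variable Function NameSpace x"   # Mix
count_matches "INCLUDE VARIABLE FUNCTION NAMESPACE X"   # UP
```

Case-sensitively this prints 2, 1 and 2 for low, Mix and UP. Under gawk with IGNORECASE=1 all seven tests should succeed for each of the three strings, so the two settings are not doing the same amount of successful matching.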

From Ed Morton@21:1/5 to J Naman on Thu Feb 24 06:51:08 2022

Am I right in thinking that by the above you mean your test script is
basically a script that calls that function some large number of times
in a loop with 1 of the stated strings, e.g.

BEGIN {
low = "include variable function namespace x"
for (i=1;i<=1000000;i++) testmatch(low)
}

I'm still struggling to understand what we're supposed to **do** with
the above information. I mean if we need to match a regexp against
mixed-case input we have 2 choices:

1) IGNORECASE=1; .. $0 ~ /foo/
2) tolower($0) ~ /foo/

and what we cannot do is just:

3) $0 ~ /foo/

so what can we do with the information that "1" would be slower than "3"
since we can't use "3" for this anyway? If you told us that "1" was
slower than "2" then we could use that information to write scripts
using "2" instead of "1" but I just don't see how the speed of "1" vs
the speed of "3" is something we can act on.
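
To make the difference between "2" and "3" concrete, here is a small sketch on made-up input. The four sample lines and the wrapper name match_counts are illustrative assumptions. It uses only POSIX awk features; "1" is left out because IGNORECASE only has an effect in gawk.

```shell
# Sketch: compare case-sensitive matching with folding the input first.
match_counts() {
  printf '%s\n' 'foo bar' 'Foo Bar' 'FOO BAR' 'baz' |
  awk '
    $0 ~ /foo/          { exact++ }    # option 3: case-sensitive match
    tolower($0) ~ /foo/ { folded++ }   # option 2: lowercase the input, then match
    END { print "exact=" (exact + 0), "folded=" (folded + 0) }'
}

match_counts
```

This prints "exact=1 folded=3": the case-sensitive test misses two of the three lines that contain the word, which is why "3" is not an alternative when the input case varies.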

Ed.

From J Naman@21:1/5 to Ed Morton on Thu Feb 24 15:55:42 2022

Ed, you're quite right, and I apologize for not telling you what motivated all this. I have files of mixed-case text and regexps that are mixed case, e.g. /New York, NY/. I did an @include "foo" that pulled in a function that set IGNORECASE=1, and everything s-l-o-w-e-d down. Once I figured out that IGNORECASE was probably responsible, I benchmarked to see what the times were. So, for my data, whenever possible I match text against regexps exactly, with IGNORECASE=0. As I said, no surprise. Sorry if I have wasted people's time. John
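
One way an included helper can avoid this kind of side effect is to save IGNORECASE on entry and restore it before returning. The following is a minimal sketch under that assumption; the helper name imatch() and the wrapper restore_demo are hypothetical and not taken from the included file. IGNORECASE only changes matching under gawk; other awks treat it as an ordinary variable, so run the awk program with gawk to get the case-insensitive behaviour.

```shell
# Sketch: a helper that wants case-insensitive matching without
# leaving IGNORECASE set for the rest of the program.
restore_demo() {
  awk '
    # imatch() is a hypothetical helper name used for illustration.
    function imatch(str, re,    saved, result) {
      saved = IGNORECASE      # remember the setting the caller had
      IGNORECASE = 1          # gawk: case-insensitive from here on
      result = (str ~ re)
      IGNORECASE = saved      # put it back before returning
      return result
    }
    BEGIN {
      imatch("New York, NY", "^new york")
      print "IGNORECASE after call: " (IGNORECASE + 0)
    }'
}

restore_demo
```

The program prints "IGNORECASE after call: 0", so later exact-case matches in the main script keep running with IGNORECASE=0.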

From Ed Morton@21:1/5 to J Naman on Thu Feb 24 18:16:40 2022

Ah, now I understand what this was about. Thanks for the information.

Ed.

From Kpop 2GM@21:1/5 to All on Mon Mar 28 12:59:03 2022

% ( time ( pvE0 < testnamespacecase_9999999.txt | mawk2 'tolower($0)~"^include variable function namespace x$"' FS='^$' ) | pvE9)| wc5

in0: 3.40GiB 0:00:04 [ 818MiB/s] [ 818MiB/s] [=============================>] 100%
out9: 64.3MiB 0:00:04 [15.1MiB/s] [15.1MiB/s] [ <=> ]
( pvE 0.1 in0 < testnamespacecase_9999999.txt | mawk2 FS='^$'; ) 4.07s user 0.79s system 113% cpu 4.275 total
rows = 1773954. | UTF8 chars = 67410252. | bytes = 67410252.

% ( time ( pvE0 < testnamespacecase_9999999.txt | mawk2 'toupper($0)~"^INCLUDE VARIABLE FUNCTION NAMESPACE X$"' FS='^$' ) | pvE9)| wc5
in0: 80.2MiB 0:00:00 [ 801MiB/s] [ 801MiB/s] [> ] 2% ETA 0:00:00
out9: 64.3MiB 0:00:04 [15.1MiB/s] [15.1MiB/s] [ <=> ]
in0: 3.40GiB 0:00:04 [ 820MiB/s] [ 820MiB/s] [=============================>] 100%
( pvE 0.1 in0 < testnamespacecase_9999999.txt | mawk2 FS='^$'; ) 4.06s user 0.79s system 113% cpu 4.268 total
rows = 1773954. | UTF8 chars = 67410252. | bytes = 67410252.

% ( time ( pvE0 < testnamespacecase_9999999.txt | mawk2 '/^[Ii][Nn][Cc][Ll][Uu][Dd][Ee] [Vv][Aa][Rr][Ii][Aa][Bb][Ll][Ee] [Ff][Uu][Nn][Cc][Tt][Ii][Oo][Nn] [Nn][Aa][Mm][Ee][Ss][Pp][Aa][Cc][Ee] [Xx]$/' FS='^$' )| pvE9) | wc5

out9: 64.3MiB 0:00:02 [32.0MiB/s] [32.0MiB/s] [ <=> ]
in0: 3.40GiB 0:00:02 [1.70GiB/s] [1.70GiB/s] [=============================>] 100%
( pvE 0.1 in0 < testnamespacecase_9999999.txt | mawk2 FS='^$'; ) 1.58s user 0.82s system 118% cpu 2.026 total
rows = 1773954. | UTF8 chars = 67410252. | bytes = 67410252.

What I'm seeing is that if you simply make EVERY letter a bracket test of both upper and lower case, and also prevent awk from splitting fields (FS='^$'), the run takes less than half the time (about 2.0s vs 4.3s). And that's only for mawk-2. For gawk the savings are far larger (about 6s vs 43-44s):

out9: 64.3MiB 0:00:43 [1.48MiB/s] [1.48MiB/s] [ <=> ]
in0: 3.40GiB 0:00:43 [80.5MiB/s] [80.5MiB/s] [=============================>] 100%
( pvE 0.1 in0 < testnamespacecase_9999999.txt | gawk -Se FS='^$'; ) 43.01s user 1.12s system 101% cpu 43.317 total
rows = 1773954. | UTF8 chars = 67410252. | bytes = 67410252.

in0: 3.40GiB 0:00:44 [78.0MiB/s] [78.0MiB/s] [=============================>] 100%
out9: 64.3MiB 0:00:44 [1.44MiB/s] [1.44MiB/s] [ <=> ]
( pvE 0.1 in0 < testnamespacecase_9999999.txt | gawk -Se FS='^$'; ) 44.35s user 1.14s system 101% cpu 44.671 total
rows = 1773954. | UTF8 chars = 67410252. | bytes = 67410252.

out9: 64.3MiB 0:00:05 [10.7MiB/s] [10.7MiB/s] [ <=> ]
in0: 3.40GiB 0:00:05 [ 582MiB/s] [ 582MiB/s] [=============================>] 100%
( pvE 0.1 in0 < testnamespacecase_9999999.txt | gawk -Se FS='^$'; ) 5.83s user 0.81s system 110% cpu 6.006 total
rows = 1773954. | UTF8 chars = 67410252. | bytes = 67410252.

=====================
echo; ( time ( pvE0 < testnamespacecase_9999999.txt | gawk -Se 'toupper($0)~"^INCLUDE VARIABLE FUNCTION NAMESPACE X$"' FS='^$' ) | pvE9)| wc5; sleep 1;
( time ( pvE0 < testnamespacecase_9999999.txt | gawk -Se 'tolower($0)~"^include variable function namespace x$"' FS='^$' ) | pvE9)| wc5 ; sleep 1;
( time ( pvE0 < testnamespacecase_9999999.txt | gawk -Se '/^[Ii][Nn][Cc][Ll][Uu][Dd][Ee] [Vv][Aa][Rr][Ii][Aa][Bb][Ll][Ee] [Ff][Uu][Nn][Cc][Tt][Ii][Oo][Nn] [Nn][Aa][Mm][Ee][Ss][Pp][Aa][Cc][Ee] [Xx]$/' FS='^$' )| pvE9) | wc5
===============================================================
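
Typing the per-letter bracket form by hand is error-prone, so here is a minimal sketch that generates it from a plain string. The function name bracket_ci is an assumption for illustration, and the sketch assumes the input contains no regexp metacharacters, since non-letters are copied through unchanged.

```shell
# Sketch: turn "include x" into "[Ii][Nn][Cc][Ll][Uu][Dd][Ee] [Xx]".
bracket_ci() {
  awk -v s="$1" '
    BEGIN {
      out = ""
      for (i = 1; i <= length(s); i++) {
        c  = substr(s, i, 1)
        lo = tolower(c)
        up = toupper(c)
        # Letters become a two-character bracket expression;
        # everything else (spaces, digits) is copied as-is.
        out = out (lo != up ? "[" up lo "]" : c)
      }
      print out
    }'
}

bracket_ci "include x"
```

This prints "[Ii][Nn][Cc][Ll][Uu][Dd][Ee] [Xx]", which can then be wrapped in ^...$ and used as the pattern in the commands above.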

The 4Chan Teller
