In gawk 4.0 two similar changes were introduced:
1) patsplit() - a new function to split a string into array elements
that match a regexp
2) split() was given a 4th argument to store the strings that match the separator regexp in an array.
For example:
$ echo 'foo13bar27' | awk 'patsplit($0,vals,/[0-9]+/) { for (i in vals) print vals[i] }'
13
27
$ echo 'foo13bar27' | awk 'split($0,tmp,/[0-9]+/,vals) { for (i in vals) print vals[i] }'
13
27
Given that the awk language traditionally only provides constructs that
are hard to implement with other existing constructs, and that both items
were introduced in the same release, there must be something I'm missing:
what is it that patsplit() provides that's hard to implement with
split()?
Ed.
On 4/17/2021 10:08 AM, Ed Morton wrote:
> what is it that patsplit() provides that's hard to implement with
> split()?
Thanks for the response Ben & Janis. You both gave an example of a case
I hadn't considered which is where an empty string could match the regexp:
Janis:
-----
$ echo 'Hello,,World' | awk 'patsplit($0,vals,/[^,]*/) { for (i in vals) print i, vals[i] }'
1 Hello
2
3 World
$ echo 'Hello,,World' | awk 'split($0,tmp,/[^,]*/,vals) { for (i in vals) print i, vals[i] }'
1 Hello
2 World
-----
Ben:
-----
$ echo 'foo13bar27' | awk 'patsplit($0,vals,/[0-9]*/) { for (i in vals) print i, vals[i] }'
1
2
3
4 13
5
6
7 27
$ echo 'foo13bar27' | awk 'split($0,tmp,/[0-9]*/,vals) { for (i in vals) print i, vals[i] }'
1 13
2 27
-----
Is that the only difference - whether or not an empty string can match
the regexp?
Ed.
On Saturday, 17 April 2021 at 20:18:07 UTC-4, Ed Morton wrote:
> Is that the only difference - whether or not an empty string can match
> the regexp?
Maybe patsplit() is a convenience, but it is very important to me. In
addition to CSV files, I use patsplit() to extract all numeric
percentages (e.g. 12.3%) and ALL embedded dates mm/dd/yy or yyyy from
highly UNSTRUCTURED text files that are aggregations of text lines
from multiple sources. The text lines that I get have misspellings,
non-standard abbreviations, bizarre punctuation -- "unNatural Language
Processing". The extracted numeric data then clue me to how to process
the sep[] text data.
Example: patsplit($0, arr, /[0-9]*[.][0-9]*%/,seps); # first, extract all embedded yields (none, some, a lot)
On 4/17/2021 10:39 PM, J Naman wrote:
> Example: patsplit($0, arr, /[0-9]*[.][0-9]*%/,seps); # first, extract
> all embedded yields (none, some, a lot)
Wouldn't you get the same output from
split($0, seps, /[0-9]*[.][0-9]*%/, arr)
though?
I'm just trying to understand what patsplit() does differently from
split() with the array names swapped. So far Ben and Janis gave an
example where it handles null strings differently. Best I can tell that
wouldn't apply in the case you describe, so is there some other difference?
Ed.
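One way to check that question empirically is to run both calls on the same line and print what each one captured. This is only a sketch, assuming gawk 4.0 or later is installed as `gawk`; the function name `compare_split_patsplit`, the arrays `flds` and `arr2`, and the sample text are made up for illustration.

```shell
# compare_split_patsplit: for each stdin line, print the matches found by
# patsplit() and the separators captured by 4-argument split().
# Requires gawk 4.0+; the name is hypothetical, not part of gawk.
compare_split_patsplit() {
    gawk '{
        # patsplit(): arr[] holds the matches, seps[] the text between them
        n = patsplit($0, arr, /[0-9]*[.][0-9]*%/, seps)
        for (i = 1; i <= n; i++) print "patsplit", i, arr[i]

        # split(): flds[] holds the text between matches, arr2[] the matches;
        # m fields are separated by m-1 separators
        m = split($0, flds, /[0-9]*[.][0-9]*%/, arr2)
        for (i = 1; i < m; i++) print "split", i, arr2[i]
    }'
}

echo 'yield 12.3% then 4.5% end' | compare_split_patsplit
```

On that sample line both calls report `12.3%` and `4.5%`, which fits the observation that this regexp cannot match an empty string (it always needs a `.` and a `%`).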
On 4/18/2021 8:53 AM, Ed Morton wrote:
> I'm just trying to understand what patsplit() does differently from
> split() with the array names swapped
Well, best I can tell that handling of null strings that match the
regexp is the only difference between the 3rd arg for patsplit() and the
3rd arg for split() other than the cases where split() is using either
of the special-case FSs of "" or " ".
So the key is that split() takes an FS for the 3rd arg while patsplit()
takes a regexp, and while an FS is regexp-like, it has 3 special cases
that make it different from regexps:
1) FS = "" -> unspecified by POSIX, some awks split into chars.
2) FS = " " -> leading/trailing spaces ignored, split on contiguous spaces.
3) FS = a regexp that can match a null string -> treat it like a regexp
that cannot match a null string (e.g. `,*` gets treated like `,+`).
While that 3rd point makes sense I couldn't actually find anything
documenting the fact that a field separator isn't allowed to match a
null string (except in the case of FS="" in some awks). POSIX says:
---------
The following describes FS behavior:
  If FS is a null string, the behavior is unspecified.
  If FS is a single character:
    If FS is <space>, skip leading and trailing <blank> and
    <newline> characters; fields shall be delimited by sets of one
    or more <blank> or <newline> characters.
    Otherwise, if FS is any other character c, fields shall be
    delimited by each single occurrence of c.
  Otherwise, the string value of FS shall be considered to be an
  extended regular expression. Each occurrence of a sequence
  matching the extended regular expression shall delimit fields.
---------
so in the case of `-F'[^,]*'`, for example, that falls into the final
case above. It should really say "...a sequence _of 1 or more
characters_ matching..." I suppose.
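The well-defined POSIX cases quoted above can be seen with any awk by counting fields. A minimal sketch follows; `count_fields` is a made-up helper name, and the null-string FS and null-matching regexp cases are left out on purpose because their results may differ between awks.

```shell
# count_fields FS: print the number of fields awk sees on each stdin line.
# (count_fields is a hypothetical helper, not part of awk.)
count_fields() {
    awk -F "$1" '{ print NF }'
}

printf '  a  b \n' | count_fields ' '    # FS is <space>: prints 2
printf 'a,,b\n'    | count_fields ','    # single character: prints 3
printf 'a,,b\n'    | count_fields ',+'   # ERE, run of commas: prints 2
```

With FS set to a single space, the leading and trailing blanks are skipped and runs of blanks count once; with a single comma, each comma delimits a field, so the empty field between the two commas is counted; with the ERE `,+`, the two commas together form one separator.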
That difference makes it non-trivial to implement patsplit() using
existing functionality (i.e. split() with the args swapped). Thanks to
all who replied.
Ed.
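For the easy case, where the regexp can never match an empty string, patsplit()-like behavior can be approximated in portable awk with a match()/substr() loop. This is only a rough sketch under that assumption: `my_patsplit` and `list_matches` are hypothetical names, the loop simply stops at the first empty match instead of reproducing gawk's null-match handling, and a regexp anchored with `^` would misbehave because the string is consumed from the left.

```shell
# list_matches REGEXP: print "index match" for every non-empty match of
# REGEXP on each stdin line, using only POSIX awk features.
# my_patsplit() is a hypothetical helper, not a built-in.
list_matches() {
    awk -v pat="$1" '
    function my_patsplit(s, arr, re, seps,    n) {
        split("", arr); split("", seps)            # clear both arrays
        n = 0
        while (match(s, re) && RLENGTH > 0) {      # stop on no match or empty match
            seps[n] = substr(s, 1, RSTART - 1)     # text before this match
            arr[++n] = substr(s, RSTART, RLENGTH)  # the match itself
            s = substr(s, RSTART + RLENGTH)        # resume after the match
        }
        seps[n] = s                                # text after the last match
        return n
    }
    {
        n = my_patsplit($0, vals, pat, seps)
        for (i = 1; i <= n; i++) print i, vals[i]
    }'
}

echo 'foo13bar27' | list_matches '[0-9]+'
```

For the `foo13bar27` input this prints `1 13` and `2 27`, the same values as the examples at the top of the thread. The null-string cases from Ben and Janis are exactly what this loop does not cover.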
I wrote this proof-of-concept for emulating patsplit functionality even without gawk :
mawk2 'BEGIN { sepFS="\301\372"; FS=RS="^$"