• why does patsplit() exist?

    From Ed Morton@21:1/5 to All on Sat Apr 17 10:08:59 2021
    In gawk 4.0 two similar changes were introduced:

    1) patsplit() - a new function to split a string into array elements
    that match a regexp
    2) split() was given a 4th argument to store the strings that match the separator regexp in an array.

    For example:

    $ echo 'foo13bar27' | awk 'patsplit($0,vals,/[0-9]+/) { for (i in vals) print vals[i] }'
    13
    27

    $ echo 'foo13bar27' | awk 'split($0,tmp,/[0-9]+/,vals) { for (i in vals) print vals[i] }'
    13
    27

    Given the awk language traditionally only provides constructs that are
    hard to implement with other existing constructs and that both items
    were introduced in the same release there must be something I'm missing
    - what is it that patsplit() provides that's hard to implement with split()?

    Ed.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to mortonspam@gmail.com on Sat Apr 17 15:31:53 2021
    In article <s5etmc$jjm$1@dont-email.me>,
    Ed Morton <mortonspam@gmail.com> wrote:
    > ...
    > Given the awk language traditionally only provides constructs that are
    > hard to implement with other existing constructs and that both items
    > were introduced in the same release there must be something I'm missing
    > - what is it that patsplit() provides that's hard to implement with split()?

    You need to re-read the parts of the GAWK documentation that explain the
    FPAT concept.

    Note: I do get what you mean about how, with enough gymnastics, you could
    (sort of) do with split() what FPAT/patsplit() do, but who needs that kind
    of pain?

  • From Janis Papanagnou@21:1/5 to Ed Morton on Sat Apr 17 18:48:54 2021
    I think it's probably a convenience function, although it's conceptually
    also clearer to distinguish the two cases, and finally you can construct
    use cases that produce different output:

    echo 'Hello,,world!' |
    awk 'x=patsplit($0,vals,/[^,]*/) {
    print x ; for (i in vals) print vals[i]
    }'

    echo 'Hello,,world!' |
    awk 'x=split($0,tmp,/[^,]*/,vals) {
    print x ; for (i in vals) print vals[i]
    }'

    (where you can't even rely on the return value, e.g. to iterate over
    the fields; for some data the two values are the same, for other data
    they are off by one).

    Janis


  • From Ben Bacarisse@21:1/5 to Ed Morton on Sat Apr 17 22:25:07 2021
    Change the + to *. I don't think split will ever see an empty
    separator, but patsplit is happy with empty fields.

    --
    Ben.

  • From Ben Bacarisse@21:1/5 to Ed Morton on Sat Apr 17 22:20:06 2021
    Change the + to a *. I don't think split will ever see an empty separator.

    --
    Ben.

  • From Ed Morton@21:1/5 to Ed Morton on Sat Apr 17 19:18:04 2021
    Thanks for the response Ben & Janis. You both gave an example of a case
    I hadn't considered which is where an empty string could match the regexp:

    Janis:
    -----
    $ echo 'Hello,,World' | awk 'patsplit($0,vals,/[^,]*/) { for (i in vals) print i, vals[i] }'
    1 Hello
    2
    3 World

    $ echo 'Hello,,World' | awk 'split($0,tmp,/[^,]*/,vals) { for (i in vals) print i, vals[i] }'
    1 Hello
    2 World
    -----

    Ben:
    -----
    $ echo 'foo13bar27' | awk 'patsplit($0,vals,/[0-9]*/) { for (i in vals) print i, vals[i] }'
    1
    2
    3
    4 13
    5
    6
    7 27

    $ echo 'foo13bar27' | awk 'split($0,tmp,/[0-9]*/,vals) { for (i in vals) print i, vals[i] }'
    1 13
    2 27
    -----

    Is that the only difference - whether or not an empty string can match
    the regexp?

    Ed.

  • From J Naman@21:1/5 to Ed Morton on Sat Apr 17 20:39:10 2021
    Maybe patsplit() is a convenience, but it is very important to me. In
    addition to CSV files, I use patsplit() to extract all numeric
    percentages (e.g. 12.3%) and ALL embedded dates (mm/dd/yy or yyyy) from
    highly UNSTRUCTURED text files that are aggregations of text lines
    from multiple sources. The text lines that I get have misspellings,
    non-standard abbreviations, bizarre punctuation -- "unNatural Language
    Processing". The extracted numeric data then clue me in to how to
    process the seps[] text data.
    Example: patsplit($0, arr, /[0-9]*[.][0-9]*%/, seps)  # first, extract all embedded yields (none, some, a lot)

  • From Ed Morton@21:1/5 to J Naman on Sun Apr 18 08:53:20 2021
    On 4/17/2021 10:39 PM, J Naman wrote:
    > Example: patsplit($0, arr, /[0-9]*[.][0-9]*%/,seps); # first, extract all embedded yields (none, some, a lot)

    Wouldn't you get the same output from

    split($0, seps, /[0-9]*[.][0-9]*%/, arr)

    though?

    I'm just trying to understand what patsplit() does differently from
    split() with the array names swapped. So far Ben and Janis gave an
    example where it handles null strings differently; as best I can tell
    that wouldn't apply in the case you describe, so is there some other
    difference?

    Ed.

  • From Ed Morton@21:1/5 to Ed Morton on Sat Apr 24 07:46:16 2021
    Well, best I can tell that handling of null strings that match the
    regexp is the only difference between the 3rd arg for patsplit() and the
    3rd arg for split() other than the cases where split() is using either
    of the special-case FSs of "" or " ".

    So the key is that split() takes an FS for the 3rd arg while patsplit()
    takes a regexp, and while an FS is regexp-like, it has 3 special cases
    that make it different from regexps:

    1) FS = "" -> unspecified by POSIX, some awks split into chars.
    2) FS = " " -> leading/trailing spaces ignored, split on contiguous spaces.
    3) FS = a regexp that can match a null string -> treat it like a regexp
    that cannot match a null string (e.g. `,*` gets treated like `,+`).
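
    The second special case, and the ordinary single-character and regexp
    cases, can be checked with any POSIX-style awk. This is only a sketch:
    the wrapper name fs_demo is made up, and it deliberately leaves out
    FS = "" and null-matching regexps because their behaviour varies
    between implementations:

```shell
# fs_demo is a hypothetical wrapper around three split() calls.
fs_demo() {
    awk 'BEGIN {
        s = "  a  b  "
        print split(s, f, " ")          # " " is special: blanks are trimmed and runs collapse
        print split(s, f, /[ ]/)        # a regexp: every single space delimits a field
        print split("a,b,,c", f, ",")   # any other single char: each occurrence delimits
    }'
}
fs_demo
# prints:
# 2
# 7
# 4
```

    The string has six spaces, so the regexp form yields seven fields,
    most of them empty, while the " " form yields only "a" and "b".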

    While that 3rd point makes sense I couldn't actually find anything
    documenting the fact that a field separator isn't allowed to match a
    null string (except in the case of FS="" in some awks). POSIX says:

    ---------
    The following describes FS behavior:

        If FS is a null string, the behavior is unspecified.

        If FS is a single character:

            If FS is <space>, skip leading and trailing <blank> and
            <newline> characters; fields shall be delimited by sets of
            one or more <blank> or <newline> characters.

            Otherwise, if FS is any other character c, fields shall be
            delimited by each single occurrence of c.

        Otherwise, the string value of FS shall be considered to be an
        extended regular expression. Each occurrence of a sequence
        matching the extended regular expression shall delimit fields.
    ---------

    so the case of `-F'[^,]*'`, for example, falls into the final
    case above. It should really say "...a sequence _of 1 or more
    characters_ matching..." I suppose.

    That difference makes it non-trivial to implement patsplit() using
    existing functionality (i.e. split() with the args swapped). Thanks to
    all who replied.
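
    For the common case where the regexp cannot match a null string, a
    rough emulation is possible with a match()/substr() loop in any POSIX
    awk. The sketch below is not gawk's implementation: the names
    my_patsplit and pat_fields are hypothetical, the regexp is passed as a
    string, anchors such as ^ are not handled because the string is
    consumed from the left, and the loop simply stops at the first null
    match, which is exactly the case that makes a faithful emulation hard:

```shell
# pat_fields (hypothetical): print each match of regexp $2 found in string $1.
# Note: awk -v processes backslash escapes in the assigned values.
pat_fields() {
    awk -v s="$1" -v re="$2" '
    # my_patsplit (hypothetical): fill arr[1..n] with matches and
    # seps[0..n] with the text around them; return n.
    function my_patsplit(str, arr, regex, seps,    n) {
        split("", arr); split("", seps)     # portable way to clear both arrays
        n = 0
        while (match(str, regex) && RLENGTH > 0) {
            seps[n] = substr(str, 1, RSTART - 1)      # text before this match
            arr[++n] = substr(str, RSTART, RLENGTH)   # the match itself
            str = substr(str, RSTART + RLENGTH)       # continue after the match
        }
        seps[n] = str                       # whatever is left after the last match
        return n
    }
    BEGIN {
        n = my_patsplit(s, vals, re, seps)
        for (i = 1; i <= n; i++) print i, vals[i]
    }'
}
pat_fields 'foo13bar27' '[0-9]+'
# prints:
# 1 13
# 2 27
```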

    Ed.

  • From Kpop 2GM@21:1/5 to All on Mon Sep 6 08:53:05 2021
    706571: /. 8./
    706572: /.시./
    706573: /.!!./
    706574: /.꿀케미 커넥션./
    706575: /. <./
    706576: /.랜선친구 아이오아이./
    706577: /.>./
    706578: /.엠넷 본방사수./
    706579: /.=albm=.=277 ▸ ./
    706580: /.프로듀스./
    706581: /. 101=G0636~ADV009~P2015~E0933647=VOD Various Artists=mnetA-487082=
    ./

    real 0m2.368s
    user 0m0.757s
    sys 0m0.315s

    Just tested with mawk 1.3.4: only 2.4 seconds to split out an array with over 700K cells, with the Korean strings in cells by themselves. With luck, the English translated names are conveniently placed in adjacent cells, e.g.

    691357: /., ./
    691358: /.원포유./
    691359: /. (14U) , ./
    691360: /.에이프릴./
    691361: /. (APRIL) , ./
    691362: /.혜이니./
    691363: /. (HEYNE) , ./

  • From Kpop 2GM@21:1/5 to Ed Morton on Mon Sep 6 08:44:36 2021
    I wrote this proof-of-concept for emulating patsplit functionality even without gawk :

    mawk2 'BEGIN { sepFS = "\301\372"; FS = RS = "^$"; OFS = " :: " }
    END {
        mypat = "\352[\\200-\\277][\\200-\\277]|[\\353\\354][\\200-\\277][\\200-\\277]|\355[\\200-\\235][\\200-\\277]"
        # wrap each match (or space-joined run of matches) in the sentinel
        print gsub(mypat "|(" mypat ")( |" mypat ")*(" mypat ")", sepFS "&" sepFS)
        # remove runs of adjacent sentinels so back-to-back matches share a cell
        gsub(sepFS "(" sepFS ")+", "")
        print nx = split($0, arr, sepFS)
        for (x = 1; x <= nx; x++) print "\t" x ": /." arr[x] "./"
    }'

    The test pattern here is all 11,172 Korean Hangul syllables. At the same time, I also didn't want it to chop Korean phrases up at space characters, while also not splitting Latin ASCII text at spaces. mawk2, which isn't Unicode-aware at all, can split out a nearly 72000-cell array in 3.24 seconds, with all the Hangul in the even-numbered cells and all the "non-pat", if you will, in the odd-numbered ones.

    The trick is simply to use a sep string that nearly never exists in proper UTF-8 input. I only included 2 bytes that are illegal in UTF-8, xC1 xFA (\301\372). You can do a quick scan of the data, and if xC1 (\301) doesn't show up at all then just use the single byte xC1 as your sep. If it *does* show up, it's possible you're working with binary data streams, in which case keep padding the sep string with bytes you deem very unlikely to show up.

    (Tip: don't bother with x00 (\000) and xFF (\377). Those 2 bytes are *very* common in a variety of binary file formats.)
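
    Stripped down to its core, the sentinel trick looks roughly like the
    sketch below. It is a simplification, not the script above: the
    function name sentinel_split is made up, the pattern is plain digits,
    and the sentinel is assumed to be the single byte \001, which must not
    occur in the input. Matches land in the even-numbered cells and the
    surrounding text in the odd-numbered ones:

```shell
# sentinel_split (hypothetical): wrap every match in a sentinel byte,
# then split on that byte using only POSIX awk features.
sentinel_split() {
    awk -v s="$1" 'BEGIN {
        sep = "\001"                      # assumption: never present in the input
        gsub(/[0-9]+/, sep "&" sep, s)    # "&" stands for the matched text
        n = split(s, cell, sep)
        for (i = 1; i <= n; i++) print i ": /." cell[i] "./"
    }'
}
sentinel_split 'foo13bar27'
# prints:
# 1: /.foo./
# 2: /.13./
# 3: /.bar./
# 4: /.27./
# 5: /../
```

    The empty fifth cell is the text after the last match; when the input
    starts with a match, cell 1 is empty in the same way.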

    70940: /.베리베리./
    70941: /.XTOO @3./
    70942: /.차 경연./
    70943: /. <./
    70944: /.컬래버레이션 무대./
    70945: /.>=artist=14958011=VOD 657 ▸ ./
    70946: /.로드 투 킹덤./
    70947: /.=year=2020=05-29=secs=251=mstr=NoF=tile=t=info=1280=720=00:04:11=gnr=31219=2908-vod1=clipID=MA_306857=song=.=[Full CAM] ♬ ON - ./
    70948: /.베리베리./
    70949: /.XTOO @3./
    70950: /.차 경연./
    70951: /. <./
    70952: /.컬래버레이션 무대./
    70953: /.>./
    70954: /.킹덤으로 가려는자./
    70955: /., ./
    70956: /.살아남아라./
    70957: /.!Mnet <./
    70958: /.로드 투 킹덤./
    70959: /. (Road to Kingdom) >./
    70960: /.매주 목요일 저녁./
    70961: /. 8./

    I think a similar approach should work for CSV files, using just about any awk variant in circulation. I've personally tested it in mawk 1.3.4, mawk2-beta 1.9.9.6, gawk 5.1.0, and macos 11.5.2 built-in awk

  • From Ed Morton@21:1/5 to All on Mon Sep 6 12:16:09 2021
    On 9/6/2021 10:44 AM, Kpop 2GM wrote:
    <snip>
    > I wrote this proof-of-concept for emulating patsplit functionality even without gawk :
    >
    > mawk2 'BEGIN { sepFS="\301\372"; FS=RS="^$"

    That's still relying on an extension to POSIX awk for multi-char RS. A
    POSIX awk would treat that like `RS="^"`. I'm not going to try to read
    the rest of the script since it was all crammed onto 1 line. Janis's
    response covers that situation well!

    Ed.
