• An RE mystery

    From Helmut Giese@21:1/5 to All on Sat Sep 4 22:19:56 2021
    Hello out there,
    in reply to my post from yesterday, 'ISO conversion tool for text
    widgets', Dave posted code which contains an intriguing RE. After some
    head scratching I finally understand it - almost. The remaining puzzle
    is: What is the difference between the quantifiers '{1,1}?' and '{1}?' ?

    The following code demonstrates what I mean:
    ---
    set txt "This is normal text while this is <i>italic</i>
    and this is <i>too</i>."

    set re1 {<{1,1}?([ib])\s*>(.*?)</\1\s*>}
    set re2 {<{1}?([ib])\s*>(.*?)</\1\s*>}

    set ranges [regexp -all -indices -inline $re1 $txt]
    puts "Version re1"
    puts $ranges
    puts ""
    set ranges [regexp -all -indices -inline $re2 $txt]
    puts "Version re2"
    puts $ranges
    ---
    Why do the two versions give different results? As per the man page: Isn't
    'a sequence of exactly 1 match of the atom'
    the same as
    'a sequence of 1 to 1 (inclusive) matches of the atom'?

    Any enlightenment will be greatly appreciated.
    Helmut

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dave@21:1/5 to Helmut Giese on Sat Sep 4 16:25:15 2021
    From the answer to my own posting ca. 2020:

    Subject: Re: Why does adding \s* to my RE change non-greedy to greedy?

    On 4/13/2020 5:35 PM, heinrichmartin wrote:
    On Monday, April 13, 2020 at 11:17:10 PM UTC+2, Dave wrote:
    Tcl 8.6.8, win7x64

    Adding \s* to my RE changes .*? from non-greedy to greedy

    Test script:

    proc Test {re text} {
        puts "\nRe: \"$re\""
        puts "Matching against \"$text\""

        set n 0
        foreach {match sub1 sub2} [regexp -all -inline -indices $re $text] {
            lassign $match s e
            puts "Match [incr n]: [string range $text $s $e]"
            lassign $sub1 s e
            puts " $n.1: [string range $text $s $e]"
            lassign $sub2 s e
            puts " $n.2: [string range $text $s $e]"
        }
    }

    set string "...<i>111</i>..<i>22</i>.."

    Test {<([ib])>(.*?)</\1>} $string

    Test {<([ib])\s*>(.*?)</\1\s*>} $string

    Output:

    Re: "<([ib])>(.*?)</\1>"
    Matching against "...<i>111</i>..<i>22</i>.."
    Match 1: <i>111</i>
    1.1: i
    1.2: 111
    Match 2: <i>22</i>
    2.1: i
    2.2: 22

    Re: "<([ib])\s*>(.*?)</\1\s*>"
    Matching against "...<i>111</i>..<i>22</i>.."
    Match 1: <i>111</i>..<i>22</i>
    1.1: i
    1.2: 111</i>..<i>22
    (Temp) 1 %

    The "(.*?)" is no longer non-greedy. Why?

    First preference wins, see
    https://www.tcl.tk/man/tcl8.6/TclCmd/re_syntax.htm#M95 with regard to
    greedy vs non-greedy preference.
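
A small sketch of the preference rule cited above, reusing the test string from this thread. As the re_syntax page describes it, a branch takes the preference of the first quantified atom in it that has one. In the second RE that atom is the greedy `\s*`, so the later `.*?` no longer produces the shortest match. The helper name `countMatches` is made up for illustration.

```tcl
# Hypothetical helper: number of non-overlapping matches of re in text.
# Without -inline, [regexp -all] returns the match count.
proc countMatches {re text} {
    return [regexp -all -- $re $text]
}

set string "...<i>111</i>..<i>22</i>.."

# First quantified atom with a preference is .*? (non-greedy).
puts [countMatches {<([ib])>(.*?)</\1>} $string]        ;# 2

# First quantified atom with a preference is \s* (greedy).
puts [countMatches {<([ib])\s*>(.*?)</\1\s*>} $string]  ;# 1
```

The counts agree with the output quoted in this post: two separate matches for the first RE, one long match for the second.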

    Two more remarks:
    * \s* is followed by non-whitespace ">", make it non-greedy.
    * Using regexp to parse XML/HTML is not a good idea. Use e.g. tdom.
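
The first remark can be sketched as follows: if every quantifier in the RE is non-greedy, the first one with a preference is non-greedy too, so the RE as a whole prefers the shortest match. The test string here is an assumption; it adds whitespace inside the second pair of tags to show that `\s*?` still consumes it when the following `>` requires it.

```tcl
# Assumed test string: the second tag pair contains a space before ">".
set string "...<i>111</i>..<i >22</i >.."

# All quantifiers are non-greedy, so the RE prefers the shortest match.
set re {<([ib])\s*?>(.*?)</\1\s*?>}

set found {}
foreach {match tag body} [regexp -all -inline -- $re $string] {
    lappend found $body
    puts "$tag: $body"
}
```

This avoids the `{1,1}?` prefix, at the cost of having to remember the `?` on every quantifier that comes first in a branch.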


    Thank you. I had skimmed past that part because I thought that the (.*?)
    was sufficient. My RE is now {<{1,1}?([ib])\s*>(.*?)</\1\s*>} and it is
    working fine.


    --
    computerjock AT mail DOT com

  • From Helmut Giese@21:1/5 to All on Sun Sep 5 19:41:16 2021
    Hi Dave,
    well, your post didn't really answer my original question
    why is {1}? != {1,1}?
    But since the URL you cited also mentions only {1,1}? to
    'make a RE non-greedy overall'
    and never talks about {1}? I think I'll leave it at that.
    Thanks for the follow-up.
    Helmut
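
For what it is worth, the MATCHING section of re_syntax does appear to cover this case: a quantified atom with `{m}` or `{m}?` has the same preference (possibly none) as the atom itself, while `{m,n}?` prefers the shortest match even when m equals n. A plain `<` has no preference, so with `{1}?` the first atom that has one is the greedy `\s*`. A sketch under that reading of the man page, using a single-line variant of the text from the first post (an assumption made to keep the example compact):

```tcl
set txt "This is normal text while this is <i>italic</i> and this is <i>too</i>."

set re1 {<{1,1}?([ib])\s*>(.*?)</\1\s*>}  ;# {1,1}? sets a non-greedy preference
set re2 {<{1}?([ib])\s*>(.*?)</\1\s*>}    ;# {1}? passes through; \s* decides

foreach {name re} [list re1 $re1 re2 $re2] {
    set matches [regexp -all -inline -- $re $txt]
    # -inline yields the full match plus 2 submatches per hit.
    set count($name) [expr {[llength $matches] / 3}]
    puts "$name: $count($name) match(es)"
    foreach {m tag body} $matches {
        puts "  $m"
    }
}
```

Under this reading, re1 finds the two tag pairs separately, while re2 returns one match running from the first `<i>` to the last `</i>`.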

  • From Conor Williams@21:1/5 to Helmut Giese on Mon Oct 11 12:42:13 2021
    cool.

  • From Conor Williams@21:1/5 to All on Mon Oct 11 16:46:26 2021
    I usually use a flavour of vi for my regexes (and sed the odd time too...)

    but (in Tcl) {4,5} after a dot means match at least 4 and at most 5 characters,
    and {4} after a dot means match exactly 4

    /c:202111102339:23

    Two cases.

    Case 1:
    txt:
    <b>a</b>e<b>bbb</b>f<b>cccc</b>g<b>ddddd</b>h<b>aaaaaaaa</b>

    regex:
    <b>.{4,5}</b>

    yields:
    <b>cccc</b> <b>ddddd</b>

    Case 2:
    regex:
    <b>.{4}</b>

    yields:
    <b>cccc</b>
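
The two cases can be reproduced in Tcl roughly like this. Note that the example shows how many repetitions each bound accepts; for `{1}` versus `{1,1}` the accepted count is the same, and the difference discussed earlier in the thread concerns match preference only.

```tcl
set txt "<b>a</b>e<b>bbb</b>f<b>cccc</b>g<b>ddddd</b>h<b>aaaaaaaa</b>"

# Case 1: between 4 and 5 characters inside the tags.
set case1 [regexp -all -inline -- {<b>.{4,5}</b>} $txt]
puts $case1

# Case 2: exactly 4 characters inside the tags.
set case2 [regexp -all -inline -- {<b>.{4}</b>} $txt]
puts $case2
```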
