Hello out there,
in reply to my post from yesterday, 'ISO conversion tool for text
widgets', Dave posted code which contains an intriguing RE. After some
head scratching I finally understand it - almost. The remaining puzzle
is: what is the difference between the quantifiers '{1,1}?' and '{1}?'?
The following code demonstrates what I mean:
---
set txt "This is normal text while this is <i>italic</i>
and this is <i>too</i>."
set re1 {<{1,1}?([ib])\s*>(.*?)</\1\s*>}
set re2 {<{1}?([ib])\s*>(.*?)</\1\s*>}
set ranges [regexp -all -indices -inline $re1 $txt]
puts "Version re1"
puts $ranges
puts ""
set ranges [regexp -all -indices -inline $re2 $txt]
puts "Version re2"
puts $ranges
---
The two versions print different ranges. Why is this? As per the man page, isn't
'a sequence of exactly 1 match of the atom'
the same as
'a sequence of 1 to 1 (inclusive) matches of the atom'?
Any enlightenment will be greatly appreciated.
Helmut
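For what it's worth, the MATCHING section of the re_syntax man page seems to cover this case: a quantified atom with {m} or {m}? has the same preference (possibly none) as the atom itself, while {m,n}? (even with m equal to n) prefers the shortest match. If that reading is right, '<{1}?' contributes no preference, so the greedy \s* decides for the whole RE, whereas '<{1,1}?' makes the whole RE non-greedy. A minimal sketch that only counts the matches (the helper name countMatches is made up for illustration, it is not from the thread):

```tcl
# Hypothetical helper: number of non-overlapping matches of re in text.
# Without -inline, "regexp -all" returns the match count.
proc countMatches {re text} {
    return [regexp -all -- $re $text]
}

set txt "This is normal text while this is <i>italic</i>
and this is <i>too</i>."

# {1,1}? is a non-greedy quantifier with its own preference.
set re1 {<{1,1}?([ib])\s*>(.*?)</\1\s*>}
# {1}? only passes on the preference of '<', which has none.
set re2 {<{1}?([ib])\s*>(.*?)</\1\s*>}

puts "re1: [countMatches $re1 $txt] match(es)"
puts "re2: [countMatches $re2 $txt] match(es)"
```

Under that assumption I'd expect re1 to find each <i>...</i> element separately and re2 to find one match running from the first <i> to the last </i>.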
On Monday, April 13, 2020 at 11:17:10 PM UTC+2, Dave wrote:
https://www.tcl.tk/man/tcl8.6/TclCmd/re_syntax.htm#M95 with regard to
Tcl 8.6.8, win7x64
Adding \s* to my RE changes .*? from non-greedy to greedy
Test script:
proc Test {re text} {
    puts "\nRe: \"$re\""
    puts "Matching against \"$text\""
    set n 0
    foreach {match sub1 sub2} [regexp -all -inline -indices $re $text] {
        lassign $match s e
        puts "Match [incr n]: [string range $text $s $e]"
        lassign $sub1 s e
        puts " $n.1: [string range $text $s $e]"
        lassign $sub2 s e
        puts " $n.2: [string range $text $s $e]"
    }
}
set string "...<i>111</i>..<i>22</i>.."
Test {<([ib])>(.*?)</\1>} $string
Test {<([ib])\s*>(.*?)</\1\s*>} $string
Output:
Re: "<([ib])>(.*?)</\1>"
Matching against "...<i>111</i>..<i>22</i>.."
Match 1: <i>111</i>
1.1: i
1.2: 111
Match 2: <i>22</i>
2.1: i
2.2: 22
Re: "<([ib])\s*>(.*?)</\1\s*>"
Matching against "...<i>111</i>..<i>22</i>.."
Match 1: <i>111</i>..<i>22</i>
1.1: i
1.2: 111</i>..<i>22
(Temp) 1 %
The "(.*?)" is no longer non-greedy. Why?
First preference wins, see https://www.tcl.tk/man/tcl8.6/TclCmd/re_syntax.htm#M95
Two more remarks:
* \s* is followed by the non-whitespace ">", so make it non-greedy (\s*?).
* Using regexp to parse XML/HTML is not a good idea. Use e.g. tdom.
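The first remark can be tried against the test string from the quoted post. This is only a sketch, assuming the preference rules in re_syntax apply as described there: with \s*? as the first quantified atom that has a preference, the RE as a whole should become non-greedy. The variable names are mine, not Dave's.

```tcl
set text "...<i>111</i>..<i>22</i>.."

# \s*? is now the first quantified atom with a preference (non-greedy),
# so the whole RE should prefer the shortest match.
set re {<([ib])\s*?>(.*?)</\1\s*?>}

# Collect the element bodies, one per match.
set bodies {}
foreach {match tag body} [regexp -all -inline -- $re $text] {
    lappend bodies $body
    puts "$tag: $body"
}
```

If the rules work as I read them, this should report the two elements separately again, as the RE without \s* did.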
Hi Dave, cool.
Well, your post didn't really answer my original question:
why is {1}? != {1,1}?
But since the URL you cited also mentions only {1,1}? to
'make a RE non-greedy overall'
and never talks about {1}? I think I'll leave it at that.
Thanks for the follow-up.
Helmut