When matching a string against a longer string, where both strings have spaces in them, we need to escape the spaces.
This works (no spaces):
import re
example = 'abcdefabcdefabcdefg'
find_string = "abc"
for match in re.finditer(find_string, example):
    print(match.start(), match.end())
That gives me the start and end character positions, which is what I want.
However, this does not work:
import re
example = re.escape('X - cty_degrees + 1 + qq')
find_string = re.escape('cty_degrees + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())
I’ve tried several other attempts based on my research, but still no match.
I don’t have much experience with regex, so I hoped a reg-expert might help.
On 2023-02-27 23:11, Jen Kris via Python-list wrote:
> When matching a string against a longer string, where both strings have spaces in them, we need to escape the spaces.
You need to escape only the pattern, not the string you're searching.
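A minimal sketch of that advice, reusing the strings from the original post: only `find_string` goes through `re.escape`, while `example` is searched as-is. The `spans` name is illustrative and not part of the original code.

```python
import re

# The text being searched is left untouched.
example = 'X - cty_degrees + 1 + qq'
# Only the pattern is escaped, so '+' is matched literally instead of acting as a quantifier.
find_string = re.escape('cty_degrees + 1')

spans = [(match.start(), match.end()) for match in re.finditer(find_string, example)]
print(spans)
```

With the haystack left unescaped, the loop reports one match covering the literal text `cty_degrees + 1`.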
--
https://mail.python.org/mailman/listinfo/python-list
On 28Feb2023 00:11, Jen Kris <jenkris@tutanota.com> wrote:
> However, this does not work:
> import re
> example = re.escape('X - cty_degrees + 1 + qq')
> find_string = re.escape('cty_degrees + 1')
> for match in re.finditer(find_string, example):
>     print(match.start(), match.end())
> I’ve tried several other attempts based on my research, but still no match.
You need to print those strings out. You're escaping the _example_ string, which would make it:
X\ \-\ cty_degrees\ \+\ 1\ \+\ qq
because `+` is a special character in regexps and so `re.escape` escapes it (along with the spaces and the `-`). But you don't want to mangle the string you're searching! After all, the text above does not contain the string `cty_degrees + 1`.
My secondary question is: if you're escaping the thing you're searching _for_, then you're effectively searching for a _fixed_ string, not a pattern/regexp. So why on earth are you using regexps to do your searching?
The `str` type has a `find(substring)` function. Just use that! It'll be faster and the code simpler!
Cheers,
Cameron Simpson <cs@cskk.id.au>
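As a rough illustration of the "print those strings out" advice, the sketch below prints what `re.escape` produces for both strings and checks why the original search came up empty. The variable names (`text`, `needle`, `escaped_text`, `escaped_needle`) are made up for this example.

```python
import re

text = 'X - cty_degrees + 1 + qq'
needle = 'cty_degrees + 1'

escaped_text = re.escape(text)      # what the failing code searched in
escaped_needle = re.escape(needle)  # what the failing code searched for

# Printing both shows the backslashes that re.escape inserted.
print(escaped_text)
print(escaped_needle)

# The escaped pattern matches the literal needle, which the escaped text no longer contains.
print(needle in escaped_text)                               # False
print(re.search(escaped_needle, escaped_text) is not None)  # False
print(re.search(escaped_needle, text) is not None)          # True
```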
Would string.count() work for you then?
On Mon, Feb 27, 2023 at 5:16 PM Jen Kris via Python-list <python-list@python.org> wrote:
I went to the re module because the specified string may appear more than once in the string (in the code I'm writing). For example:
a = "X - abc_degree + 1 + qq + abc_degree + 1"
b = "abc_degree + 1"
q = a.find(b)
print(q)
4
So it correctly finds the start of the first instance, but not the second one. The re code finds both instances. If I knew that the substring occurred only once then the str.find would be best.
I changed my re code after MRAB's comment, it now works.
Thanks much.
Jen
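For comparison, here is a small sketch of what the plain `str` methods report for the same two strings. `str.count` gives only the number of non-overlapping occurrences, and `str.find` accepts an optional start index, so a second call can resume after the first hit. The names `first` and `second` are illustrative.

```python
a = "X - abc_degree + 1 + qq + abc_degree + 1"
b = "abc_degree + 1"

print(a.count(b))                   # how many times b occurs, but not where

first = a.find(b)                   # index of the first occurrence
second = a.find(b, first + len(b))  # resume the search after the first match
print(first, second)
```

So `count` alone cannot give positions, but repeated `find` calls can.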
--
**** Listen to my CD at http://www.mellowood.ca/music/cedars ****
Bob van der Poel ** Wynndel, British Columbia, CANADA **
EMAIL: bob@mellowood.ca
WWW: http://www.mellowood.ca
Yes, that's it. I don't know how long it would have taken to find that detail with research through the voluminous re documentation. Thanks very much.
On 28Feb2023 01:13, Jen Kris <jenkris@tutanota.com> wrote:
> I went to the re module because the specified string may appear more than once in the string (in the code I'm writing).
Sure, but writing a `finditer` for plain `str` is pretty easy (untested):
pos = 0
while True:
    found = s.find(substring, pos)
    if found < 0:
        break
    start = found
    end = found + len(substring)
    # ... do whatever with start and end ...
    pos = end
Many people go straight to the `re` module whenever they're looking for strings. It is often cryptic, error-prone overkill. Just something to keep in mind.
Cheers,
Cameron Simpson <cs@cskk.id.au>
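One way the loop above could be packaged is as a generator, sketched here under the assumption that non-overlapping matches are wanted (the same behaviour as `re.finditer` with a literal pattern). The function name `find_all` and the empty-substring guard are additions for this example, not part of the original post.

```python
def find_all(text, substring):
    """Yield (start, end) for each non-overlapping occurrence of substring in text."""
    if not substring:
        return  # an empty substring would otherwise make the loop run forever
    pos = 0
    while True:
        found = text.find(substring, pos)
        if found < 0:
            break
        end = found + len(substring)
        yield found, end
        pos = end  # continue searching after the end of this match


example = 'X - abc_degree + 1 + qq + abc_degree + 1'
print(list(find_all(example, 'abc_degree + 1')))
```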
>>> short = "hello world"
>>> longer = "hello world is how many programs start for novices but some use hello world! to show how happy they are to say hello world"
>>> short in longer
True
>>> howLong = len(short)
>>> res = [(offset, offset + howLong) for offset in range(len(longer)) if longer.startswith(short, offset)]
>>> res
[(0, 11), (64, 75), (111, 122)]
>>> len(res)
3
>>> print( [ longer[res[index][0]:res[index][1]] for index in range(len(res))])
['hello world', 'hello world', 'hello world']
>>> short = "good good"
>>> longer = "A good good good but not douple plus good good good goody"
>>> howLong = len(short)
>>> res = [(offset, offset + howLong) for offset in range(len(longer)) if longer.startswith(short, offset)]
>>> res
[(2, 11), (7, 16), (37, 46), (42, 51), (47, 56)]
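The "good good" case shows a difference worth knowing about: the `startswith` scan reports overlapping occurrences, whereas `re.finditer` with a plain escaped pattern does not. The sketch below contrasts the two and uses a zero-width lookahead as one possible way to get overlapping hits from `re`. The names `non_overlapping` and `overlapping` are illustrative.

```python
import re

short = "good good"
longer = "A good good good but not douple plus good good good goody"

# re.finditer resumes after the end of each match, so overlapping hits are skipped.
non_overlapping = [m.span() for m in re.finditer(re.escape(short), longer)]

# A zero-width lookahead consumes no text, so every starting offset is reported.
overlapping = [(m.start(), m.start() + len(short))
               for m in re.finditer('(?=' + re.escape(short) + ')', longer)]

print(non_overlapping)
print(overlapping)
```

Which behaviour is right depends on the task; for the expressions in the original question the two agree, because the search string cannot overlap itself.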
And, just for fun, since there is nothing wrong with your code, this minor change is terser:
>>> import re
>>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>>> for match in re.finditer(re.escape('abc_degree + 1'), example):
...     print(match.start(), match.end())
...
4 18
26 40
On 28/02/2023 at 3:44, Thomas Passin wrote:
> Without knowing how general your expressions will be, I think the following version is very readable, certainly more readable than regexes:
> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
> KEY = 'abc_degree + 1'
> for i in range(len(example)):
>     if example[i:].startswith(KEY):
>         print(i, i + len(KEY))
> # prints:
> 4 18
> 26 40
I think it's often a good idea to use a standard library function instead of rolling your own. The issue becomes less clear-cut when the standard library doesn't do exactly what you need (as here, where re.finditer() uses regular expressions while the use case only uses simple search strings). Ideally there would be a str.finditer() method we could use, but in the absence of that I think we still need to consider using the almost-but-not-quite fitting re.finditer().
Two reasons:
(1) I think it's clearer: the name tells us what it does (though of
course we could solve this in a hand-written version by wrapping it in a suitably named function).
(2) Searching for a string in another string, in a performant way, is
not as simple as it first appears. Your version works correctly, but
slowly. In some situations it doesn't matter, but in other cases it
will. For better performance, string searching algorithms jump ahead
either when they found a match or when they know for sure there isn't a
match for some time (see e.g. the Boyer–Moore string-search algorithm).
You could write such a more efficient algorithm, but then it becomes
more complex and more error-prone. Using a well-tested existing function becomes quite attractive.
To illustrate the difference in performance, I did a simple test (using the paragraph above as test text):
import re
import timeit

def using_re_finditer(key, text):
    matches = []
    for match in re.finditer(re.escape(key), text):
        matches.append((match.start(), match.end()))
    return matches

def using_simple_loop(key, text):
    matches = []
    for i in range(len(text)):
        if text[i:].startswith(key):
            matches.append((i, i + len(key)))
    return matches

CORPUS = """Searching for a string in another string, in a performant way, is
not as simple as it first appears. Your version works correctly, but slowly.
In some situations it doesn't matter, but in other cases it will. For better
performance, string searching algorithms jump ahead either when they found a
match or when they know for sure there isn't a match for some time (see e.g.
the Boyer–Moore string-search algorithm). You could write such a more
efficient algorithm, but then it becomes more complex and more error-prone.
Using a well-tested existing function becomes quite attractive."""
KEY = 'in'

print('using_simple_loop:', timeit.repeat(stmt='using_simple_loop(KEY, CORPUS)', globals=globals(), number=1000))
print('using_re_finditer:', timeit.repeat(stmt='using_re_finditer(KEY, CORPUS)', globals=globals(), number=1000))
This does 5 runs of 1000 repetitions each, and reports the time in
seconds for each of those runs.
Result on my machine:
using_simple_loop: [0.13952950000020792, 0.13063130000000456, 0.12803450000001249, 0.13186180000002423, 0.13084610000032626]
using_re_finditer: [0.003861400000005233, 0.004061900000124297, 0.003478999999970256, 0.003413100000216218, 0.0037320000001273]
We find that in this test re.finditer() is more than 30 times faster (despite the overhead of regular expressions).
While speed isn't everything in programming, with such a large
difference in performance and (to me) no real disadvantages of using re.finditer(), I would prefer re.finditer() over writing my own.
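For anyone who wants to repeat the comparison with a third contender, here is a hedged sketch that times a `str.find`-based loop against `re.finditer`. The function `using_str_find` and the `TEXT` sample are assumptions made for this example and were not part of the test above. No timings are quoted, since they vary by machine and Python version.

```python
import re
import timeit


def using_re_finditer(key, text):
    return [m.span() for m in re.finditer(re.escape(key), text)]


def using_str_find(key, text):
    # Hypothetical third contender: let str.find do the scanning.
    # Assumes a non-empty key; an empty key would never advance.
    matches = []
    pos = text.find(key)
    while pos >= 0:
        matches.append((pos, pos + len(key)))
        pos = text.find(key, pos + len(key))
    return matches


TEXT = "in some situations it doesn't matter, but in other cases it will. " * 20
KEY = 'in'

# Both approaches report non-overlapping matches, so the results should agree.
assert using_re_finditer(KEY, TEXT) == using_str_find(KEY, TEXT)

for func in (using_re_finditer, using_str_find):
    print(func.__name__, timeit.repeat(lambda: func(KEY, TEXT), number=1000))
```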
On 2/28/2023 4:33 AM, Roel Schroeven wrote:
> [...]
> (2) Searching for a string in another string, in a performant way, is not as simple as it first appears. Your version works correctly, but slowly. In some situations it doesn't matter, but in other cases it will. For better performance, string searching algorithms jump ahead either when they found a match or when they know for sure there isn't a match for some time (see e.g. the Boyer–Moore string-search algorithm). You could write such a more efficient algorithm, but then it becomes more complex and more error-prone. Using a well-tested existing function becomes quite attractive.
Sure, it all depends on what the real task will be. That's why I
wrote "Without knowing how general your expressions will be". For the
example string, it's unlikely that speed will be a factor, but who
knows what target strings and keys will turn up in the future?
On 28/02/2023 at 14:35, Thomas Passin wrote:
> Sure, it all depends on what the real task will be. That's why I wrote "Without knowing how general your expressions will be". For the example string, it's unlikely that speed will be a factor, but who knows what target strings and keys will turn up in the future?
In hindsight I think it was overthinking things a bit. "It all depends on what the real task will be", you say, and indeed I think that should be the main conclusion here.
On 2/28/2023 10:05 AM, Roel Schroeven wrote:
> In hindsight I think it was overthinking things a bit. "It all depends on what the real task will be", you say, and indeed I think that should be the main conclusion here.
It is interesting, though, how pre-processing the search pattern can
improve search times if you can afford the pre-processing. Here's a
paper on rapidly finding matches when there may be up to one misspelled character. It's easy enough to implement, though in Python you can't
take the additional step of tuning it to stay in cache.
https://Robert.Muth.Org/Papers/1996-Approx-Multi.Pdf
On 2/27/2023 9:16 PM, avi.e.gross@gmail.com wrote:
> And, just for fun, since there is nothing wrong with your code, this minor change is terser:
> >>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
> >>> for match in re.finditer(re.escape('abc_degree + 1'), example):
> ...     print(match.start(), match.end())
> ...
> 4 18
> 26 40
Just for more fun :) -
Without knowing how general your expressions will be, I think the following version is very readable, certainly more readable than regexes:
example = 'X - abc_degree + 1 + qq + abc_degree + 1'
KEY = 'abc_degree + 1'
for i in range(len(example)):
    if example[i:].startswith(KEY):
        print(i, i + len(KEY))
# prints:
4 18
26 40
If you may have variable numbers of spaces around the symbols, OTOH, the whole situation changes and then regexes would almost certainly be the best approach. But the regular expression strings would become harder to read.
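To make the variable-spacing case concrete, here is one possible sketch: each literal token is escaped on its own and the tokens are joined with `\s*`, so any amount of whitespace (including none) is accepted between them. The sample `example` string and the `tokens` list are invented for illustration.

```python
import re

# Hypothetical input where the spacing around '+' is inconsistent.
example = 'X - abc_degree+1 + qq + abc_degree   +  1'

# Escape each literal token, then allow any amount of whitespace between them.
tokens = ['abc_degree', '+', '1']
pattern = r'\s*'.join(re.escape(token) for token in tokens)

matches = list(re.finditer(pattern, example))
for match in matches:
    print(match.start(), match.end(), repr(match.group()))
```

The resulting pattern is already harder to read than the fixed string, which is the trade-off mentioned above.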
Jen,
Can you see what SOME OF US see as ASCII text? We can help you better if we get code that can be copied and run as-is.
What you sent is not terse. It is wrong. It will not run on any python interpreter because you somehow lost a carriage return and indent.
This is what you sent:
example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1') for match in re.finditer(find_string, example):
print(match.start(), match.end())
This is code indented properly:
example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())
Of course I am sure you wrote and ran code more like the latter version but somewhere in your copy/paste process, ....
And, just for fun, since there is nothing wrong with your code, this minor change is terser:
>>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>>> for match in re.finditer(re.escape('abc_degree + 1'), example):
...     print(match.start(), match.end())
...
4 18
26 40
But note once you use regular expressions, and not in your case, you might match multiple things that are far from the same such as matching two repeated words of any kind in any case including "and and" and "so so" or finding words that have multiple doubled letters as in the stereotypical bookkeeper. In those cases, you may want even more than offsets but also show the exact text that matched or even show some characters before and/or after for context.
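A short sketch of the kind of pattern described here, using backreferences. The sample `text`, the pattern names and the five-character context window are all choices made for this illustration.

```python
import re

text = "This is is a test, and AND so so on for the bookkeeper."

# \b(\w+)\s+\1\b: a whole word, whitespace, then the same word again.
# re.IGNORECASE lets the backreference match regardless of case.
repeated_word = re.compile(r'\b(\w+)\s+\1\b', re.IGNORECASE)

for match in repeated_word.finditer(text):
    # Show a few characters of context on either side of the match.
    left = max(match.start() - 5, 0)
    right = min(match.end() + 5, len(text))
    print(match.span(), repr(match.group()), repr(text[left:right]))

# (\w)\1 matches one doubled letter; three such pairs in a row
# pick out words like 'bookkeeper' (oo, kk, ee).
triple_double = re.compile(r'\w*(\w)\1(\w)\2(\w)\3\w*')
print(triple_double.search(text).group())
```

Here the matched text differs from match to match, so printing `match.group()` alongside the offsets becomes useful in a way it is not for a fixed search string.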
-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On Behalf Of Jen Kris via Python-list
Sent: Monday, February 27, 2023 8:36 PM
To: Cameron Simpson <cs@cskk.id.au>
Cc: Python List <python-list@python.org>
Subject: Re: How to escape strings for re.finditer?
I haven't tested it either but it looks like it would work. But for this case I prefer the relative simplicity of:
example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1') for match in re.finditer(find_string, example):
print(match.start(), match.end())
4 18
26 40
I don't insist on terseness for its own sake, but it's cleaner this way.
Jen
The code I sent is correct, and it runs here. Maybe you received it with a carriage return removed, but on my copy after posting, it is correct:
example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())
One question: several people have made suggestions other than regex (not your terser example with regex you showed below). Is there a reason why regex is not preferred to, for example, a list comp? Performance? Reliability?
On 2/28/2023 4:33 AM, Roel Schroeven wrote:use case only uses simple search strings). Ideally there would be a str.finditer() method we could use, but in the absence of that I think we still need to consider using the almost-but-not-quite fitting re.finditer().
Op 28/02/2023 om 3:44 schreef Thomas Passin:
On 2/27/2023 9:16 PM, avi.e.gross@gmail.com wrote:I think it's often a good idea to use a standard library function instead of rolling your own. The issue becomes less clear-cut when the standard library doesn't do exactly what you need (as here, where re.finditer() uses regular expressions while the
And, just for fun, since there is nothing wrong with your code, this minor change is terser:
... print(match.start(), match.end())example = 'X - abc_degree + 1 + qq + abc_degree + 1'
for match in re.finditer(re.escape('abc_degree + 1') , example): >>>>>>>
...
...
4 18
26 40
Just for more fun :) -
Without knowing how general your expressions will be, I think the following version is very readable, certainly more readable than regexes:
example = 'X - abc_degree + 1 + qq + abc_degree + 1'
KEY = 'abc_degree + 1'
for i in range(len(example)):
if example[i:].startswith(KEY):
print(i, i + len(KEY))
# prints:
4 18
26 40
searching algorithms jump ahead either when they found a match or when they know for sure there isn't a match for some time (see e.g. the Boyer–Moore string-search algorithm). You could write such a more efficient algorithm, but then it becomes more
Two reasons:
(1) I think it's clearer: the name tells us what it does (though of course we could solve this in a hand-written version by wrapping it in a suitably named function).
(2) Searching for a string in another string, in a performant way, is not as simple as it first appears. Your version works correctly, but slowly. In some situations it doesn't matter, but in other cases it will. For better performance, string
in the future?
Sure, it all depends on what the real task will be. That's why I wrote "Without knowing how general your expressions will be". For the example string, it's unlikely that speed will be a factor, but who knows what target strings and keys will turn up
To illustrate the difference performance, I did a simple test (using the paragraph above is test text):
import re
import timeit

def using_re_finditer(key, text):
    matches = []
    for match in re.finditer(re.escape(key), text):
        matches.append((match.start(), match.end()))
    return matches

def using_simple_loop(key, text):
    matches = []
    for i in range(len(text)):
        if text[i:].startswith(key):
            matches.append((i, i + len(key)))
    return matches

CORPUS = """Searching for a string in another string, in a performant way, is
not as simple as it first appears. Your version works correctly, but slowly.
In some situations it doesn't matter, but in other cases it will. For better
performance, string searching algorithms jump ahead either when they found a
match or when they know for sure there isn't a match for some time (see e.g.
the Boyer–Moore string-search algorithm). You could write such a more
efficient algorithm, but then it becomes more complex and more error-prone.
Using a well-tested existing function becomes quite attractive."""
KEY = 'in'
print('using_simple_loop:', timeit.repeat(stmt='using_simple_loop(KEY, CORPUS)', globals=globals(), number=1000))
print('using_re_finditer:', timeit.repeat(stmt='using_re_finditer(KEY, CORPUS)', globals=globals(), number=1000))
This does 5 runs of 1000 repetitions each, and reports the time in seconds for each of those runs.
Result on my machine:
using_simple_loop: [0.13952950000020792, 0.13063130000000456, 0.12803450000001249, 0.13186180000002423, 0.13084610000032626]
using_re_finditer: [0.003861400000005233, 0.004061900000124297, 0.003478999999970256, 0.003413100000216218, 0.0037320000001273]
We find that in this test re.finditer() is more than 30 times faster (despite the overhead of regular expressions).
While speed isn't everything in programming, with such a large difference in performance and (to me) no real disadvantages of using re.finditer(), I would prefer re.finditer() over writing my own.
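One behavioural difference is worth keeping in mind when swapping the two functions: re.finditer() reports non-overlapping matches, while the startswith loop tests every position and so also reports overlapping ones. For KEY = 'in' this makes no difference, but for a key such as 'aa' it does. The sketch below shows the difference, plus one possible way to get overlapping matches out of re with a zero-width lookahead. The helper name overlapping_with_re is hypothetical, not something from the thread.

```python
import re


def overlapping_with_re(key, text):
    # (?=...) is a zero-width lookahead: it checks that key starts at the
    # current position without consuming it, so the next match may begin
    # one character further on.
    pattern = re.compile('(?=' + re.escape(key) + ')')
    return [(m.start(), m.start() + len(key)) for m in pattern.finditer(text)]


plain = [m.span() for m in re.finditer(re.escape('aa'), 'aaaa')]
print(plain)                              # [(0, 2), (2, 4)]
print(overlapping_with_re('aa', 'aaaa'))  # [(0, 2), (1, 3), (2, 4)]
```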
--
https://mail.python.org/mailman/listinfo/python-list
Using str.startswith is a cool idea in this case. But is it better than regex for performance or reliability? Regex syntax is not a model of simplicity, but in my simple case it's not too difficult.
Feb 27, 2023, 18:52 by list1@tompassin.net:
If you may have variable numbers of spaces around the symbols, OTOH,
the whole situation changes and then regexes would almost certainly
be the best approach. But the regular expression strings would
become harder to read.
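To make the variable-spacing case concrete, here is one possible sketch. It assumes the key can be split on whitespace into tokens that must appear in order with any amount of whitespace (including none) between them. The helper name flexible_pattern is invented for this example, and an empty key is not handled.

```python
import re


def flexible_pattern(key):
    # Escape each token separately so that '+' stays literal, then allow
    # any run of whitespace (or none) between consecutive tokens.
    tokens = key.split()
    return re.compile(r'\s*'.join(re.escape(tok) for tok in tokens))


text = 'X - abc_degree+1 + qq + abc_degree  +   1'
pattern = flexible_pattern('abc_degree + 1')
print(pattern.pattern)
print([m.span() for m in pattern.finditer(text)])  # [(4, 16), (24, 41)]
```

Note that the matches now have different lengths, so the end position has to come from the match object and not from len(KEY).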
On 2023-02-28, Thomas Passin <list1@tompassin.net> wrote:...
It is interesting, though, how pre-processing the search pattern can
improve search times if you can afford the pre-processing. Here's a
paper on rapidly finding matches when there may be up to one misspelled
character. It's easy enough to implement, though in Python you can't
take the additional step of tuning it to stay in cache.
https://Robert.Muth.Org/Papers/1996-Approx-Multi.Pdf
You've somehow title-cased that URL. The correct URL is:
https://robert.muth.org/Papers/1996-approx-multi.pdf
I wrote my previous message before reading this. Thank you for the test you ran -- it answers the question of performance. You show that re.finditer is 30x faster, so that certainly recommends that over a simple loop, which introduces looping overhead.
def using_simple_loop(key, text):
    matches = []
    for i in range(len(text)):
        if text[i:].startswith(key):
            matches.append((i, i + len(key)))
    return matches
using_simple_loop: [0.13952950000020792, 0.13063130000000456, 0.12803450000001249, 0.13186180000002423, 0.13084610000032626]
using_re_finditer: [0.003861400000005233, 0.004061900000124297, 0.003478999999970256, 0.003413100000216218, 0.0037320000001273]
With a slight tweak to the simple loop code using .find() it becomes a third faster than the RE version though.
def using_simple_loop2(key, text):
    matches = []
    keyLen = len(key)
    start = 0
    while (foundSpot := text.find(key, start)) > -1:
        start = foundSpot + keyLen
        matches.append((foundSpot, start))
    return matches
using_simple_loop: [0.1732664997689426, 0.1601669997908175, 0.15792609984055161, 0.1573973000049591, 0.15759290009737015]
using_re_finditer: [0.003412699792534113, 0.0032823001965880394, 0.0033694999292492867, 0.003354900050908327, 0.0033336998894810677]
using_simple_loop2: [0.00256159994751215, 0.0025471001863479614, 0.0025424999184906483, 0.0025831996463239193, 0.0025555999018251896]
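Two details of the .find() version are easy to miss. It advances by the key length, so like re.finditer() it skips overlapping matches, and with an empty key the start position never advances, so the loop would not terminate. A possible variant that makes both choices explicit is sketched below; the name find_all and the overlapping parameter are assumptions for illustration, not part of the code that was timed above.

```python
def find_all(key, text, overlapping=False):
    """Yield (start, end) for each occurrence of key in text."""
    if not key:
        # text.find('', start) always succeeds at start, so an empty key
        # would make the loop below spin forever.
        raise ValueError('key must be a non-empty string')
    step = 1 if overlapping else len(key)
    start = 0
    while (pos := text.find(key, start)) != -1:
        yield pos, pos + len(key)
        start = pos + step


print(list(find_all('aa', 'aaaa')))                    # [(0, 2), (2, 4)]
print(list(find_all('aa', 'aaaa', overlapping=True)))  # [(0, 2), (1, 3), (2, 4)]
```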
One question: several people have made suggestions other than regex
(not the terser regex example you showed below). Is there a
reason why regex is not preferred to, for example, a list comp?
Performance? Reliability?
Jen,
I had no doubt the code you ran was indented properly or it would not work.
I am merely letting you know that somewhere in the process of copying
the code or the transition between mailers, my version is messed up.
It happens to be easy for me to fix but I sometimes see garbled code I
then simply ignore.
On 2023-02-28 15:25:05 -0500, avi.e.gross@gmail.com wrote:
It happens to be easy for me to fix but I sometimes see garbled code I
then simply ignore.
Truth to be told, that's one reason why I rarely read your mails to the
end. The long lines and the triple-spaced paragraphs make it just too uncomfortable.
On 2023-02-28, Cameron Simpson <cs@cskk.id.au> wrote:
Regexps are:
- cryptic and error prone (you can make them more readable, but the
notation is deliberately both terse and powerful, which means that
small changes can have large effects in behaviour); the "error prone"
part does not mean that a regexp is unreliable, but that writing one
which is _correct_ for your task can be difficult,
The nasty thing is that writing one that _appears_ to be correct for
your task is often fairly easy. It will work as you expect for the
test cases you throw at it, but then fail in confusing ways when
released into the "real world". If you're lucky, it fails frequently
and obviously enough that you notice it right away. If you're not
lucky, it will fail infrequently and subtly for many years to come.
My rule: never use an RE if you can use the normal string methods
(even if it takes a few lines of code using them to replace a single
RE).
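A small illustration of that failure mode, using the strings from the question that started this thread. The test inputs are chosen for the example. Without re.escape() the '+' in the key is read as "one or more of the preceding space", so the pattern misses the text it was meant to find and matches a string it was never meant to match.

```python
import re

key = 'cty_degrees + 1'
text = 'X - cty_degrees + 1 + qq'

# Unescaped: ' +' means "one or more spaces", so the literal '+' is never matched.
unescaped = [m.span() for m in re.finditer(key, text)]
# The same unescaped pattern happily matches a string with no '+' at all.
false_hit = [m.span() for m in re.finditer(key, 'cty_degrees  1')]
# Escaped pattern and plain str.find() both treat the key literally.
escaped = [m.span() for m in re.finditer(re.escape(key), text)]
literal = text.find(key)

print(unescaped)  # []
print(false_hit)  # [(0, 14)]
print(escaped)    # [(4, 19)]
print(literal)    # 4
```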
On 2023-02-28 15:25:05 -0500, avi.e.gross@gmail.com wrote:[...]
I had no doubt the code you ran was indented properly or it would not work.
I am merely letting you know that somewhere in the process of copying
the code or the transition between mailers, my version is messed up.
The problem seems to be at your end. Jen's code looks ok here.
I have no idea why it would join only some lines but not others.
[1] Personally I'd say you shouldn't use Outlook if you are reading
mails where line breaks (or other formatting) is important, but ...