Forum: >>> Magnum BBS <<<

Re: How to escape strings for re.finditer?

From MRAB@21:1/5 to Jen Kris via Python-list on Mon Feb 27 23:45:45 2023

On 2023-02-27 23:11, Jen Kris via Python-list wrote:

When matching a string against a longer string, where both strings have spaces in them, we need to escape the spaces.

This works (no spaces):

import re
example = 'abcdefabcdefabcdefg'
find_string = "abc"
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

That gives me the start and end character positions, which is what I want.

However, this does not work:

import re
example = re.escape('X - cty_degrees + 1 + qq')
find_string = re.escape('cty_degrees + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

I’ve tried several other attempts based on my research, but still no match.

I don’t have much experience with regex, so I hoped a reg-expert might help.

You need to escape only the pattern, not the string you're searching.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Jen Kris@21:1/5 to All on Tue Feb 28 00:11:10 2023

When matching a string against a longer string, where both strings have spaces in them, we need to escape the spaces.

This works (no spaces):

import re
example = 'abcdefabcdefabcdefg'
find_string = "abc"
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

That gives me the start and end character positions, which is what I want.

However, this does not work:

import re
example = re.escape('X - cty_degrees + 1 + qq')
find_string = re.escape('cty_degrees + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

I’ve tried several other attempts based on my research, but still no match.

I don’t have much experience with regex, so I hoped a reg-expert might help.

Thanks,

Jen


From Cameron Simpson@21:1/5 to Jen Kris on Tue Feb 28 10:54:43 2023

On 28Feb2023 00:11, Jen Kris <jenkris@tutanota.com> wrote:

When matching a string against a longer string, where both strings have spaces in them, we need to escape the spaces.

This works (no spaces):

import re
example = 'abcdefabcdefabcdefg'
find_string = "abc"
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

That gives me the start and end character positions, which is what I want.

However, this does not work:

import re
example = re.escape('X - cty_degrees + 1 + qq')
find_string = re.escape('cty_degrees + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

I’ve tried several other attempts based on my research, but still no match.

You need to print those strings out. You're escaping the _example_
string, which would make it:

X - cty_degrees \+ 1 \+ qq

because `+` is a special character in regexps and so `re.escape` escapes
it. But you don't want to mangle the string you're searching! After all,
the text above does not contain the string `cty_degrees + 1`.
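
The advice to print the strings can be tried directly. This is a minimal sketch using the strings from the original post; the names `haystack` and `needle` are illustrative, and the exact escaped text depends on the Python version (newer versions escape spaces and `-` as well), so no expected output is shown for the escaped strings:

```python
import re

haystack = 'X - cty_degrees + 1 + qq'
needle = 'cty_degrees + 1'

# Print what finditer would actually receive if both strings were escaped.
escaped_haystack = re.escape(haystack)
escaped_needle = re.escape(needle)
print(repr(escaped_haystack))
print(repr(escaped_needle))

# The escaped text no longer contains the plain needle ...
print(needle in escaped_haystack)

# ... so escape only the pattern and search the untouched text.
spans = [(m.start(), m.end()) for m in re.finditer(escaped_needle, haystack)]
print(spans)
```

Seeing the backslashes in the first `print` is usually enough to spot that the searched text has been altered.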

My secondary question is: if you're escaping the thing you're searching
_for_, then you're effectively searching for a _fixed_ string, not a pattern/regexp. So why on earth are you using regexps to do your
searching?

The `str` type has a `find(substring)` function. Just use that! It'll be
faster and the code simpler!

Cheers,
Cameron Simpson <cs@cskk.id.au>


From Jen Kris@21:1/5 to All on Tue Feb 28 00:57:22 2023

Yes, that's it. I don't know how long it would have taken to find that detail with research through the voluminous re documentation. Thanks very much.

Feb 27, 2023, 15:47 by python@mrabarnett.plus.com:

You need to escape only the pattern, not the string you're searching.

From avi.e.gross@gmail.com@21:1/5 to Jen Kris via Python-list on Mon Feb 27 19:06:04 2023

MRAB makes a valid point. Only the pattern you are looking for is compiled as a regular expression, and if it contains anything that might be a command, such as a ^ at the start or [12] in the middle, you want that converted so NONE OF THAT is one. It will be compiled to something that looks for a literal ^, including later in the string, and looks for a real [ then a real 1 and a real 2 and a real ], not for one of the choices of 1 or 2.

Your example was 'cty_degrees + 1', which can have a subtle bug introduced. The special character is "+", which means match greedily as many copies of the previous entity as possible (at least one). In this case, the previous entity is a single space. So the unescaped regular expression will match 'cty_degrees', then match the single space it sees, and then, because it is looking for another space and not for a plus, it hits the plus and fails. If your pattern is rewritten in whatever way re.escape uses, it might be 'cty_degrees \+ 1' and then it should work fine.

But converting the text you are searching just breaks that, as the result will have a '\+' in it, which is viewed as two unrelated symbols, and the backslash stops the match from going further.
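
The behaviour described above can be checked with a short experiment. This is a sketch only; the variable names are illustrative, and the second test string with three spaces is made up to show what the unescaped pattern really asks for:

```python
import re

text = 'X - cty_degrees + 1 + qq'
raw_pattern = 'cty_degrees + 1'

# Unescaped: " +" means "one or more spaces", so the pattern wants at
# least two spaces followed by "1"; it never looks for a literal "+".
unescaped_hit = re.search(raw_pattern, text)
spaces_hit = re.search(raw_pattern, 'cty_degrees   1')

# Escaped: every character of the pattern is matched literally.
escaped_hit = re.search(re.escape(raw_pattern), text)

print(unescaped_hit is None)
print(spaces_hit is not None)
print(escaped_hit.span())
```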


From Jen Kris@21:1/5 to All on Tue Feb 28 01:13:32 2023

I went to the re module because the specified string may appear more than once in the string (in the code I'm writing). For example:

a = "X - abc_degree + 1 + qq + abc_degree + 1"
b = "abc_degree + 1"
q = a.find(b)

print(q)
4

So it correctly finds the start of the first instance, but not the second one. The re code finds both instances. If I knew that the substring occurred only once then the str.find would be best.

I changed my re code after MRAB's comment, it now works.

Thanks much.

Jen

Feb 27, 2023, 15:56 by cs@cskk.id.au:

My secondary question is: if you're escaping the thing you're searching _for_, then you're effectively searching for a _fixed_ string, not a pattern/regexp. So why on earth are you using regexps to do your searching?

The `str` type has a `find(substring)` function. Just use that! It'll be faster and the code simpler!


From Cameron Simpson@21:1/5 to Jen Kris on Tue Feb 28 11:33:47 2023

On 28Feb2023 01:13, Jen Kris <jenkris@tutanota.com> wrote:

I went to the re module because the specified string may appear more
than once in the string (in the code I'm writing).

Sure, but writing a `finditer` for plain `str` is pretty easy
(untested):

pos = 0
while True:
    found = s.find(substring, pos)
    if found < 0:
        break
    start = found
    end = found + len(substring)
    # ... do whatever with start and end ...
    pos = end

Many people go straight to the `re` module whenever they're looking for strings. It is often cryptic, error-prone overkill. Just something to
keep in mind.
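
The loop above can be wrapped as a generator so that it is used much like `re.finditer`. This is a sketch under a few assumptions: the name `find_all` is made up, matches are non-overlapping, and an empty substring yields nothing (without that guard the loop would never advance):

```python
def find_all(text, substring):
    """Yield (start, end) for each non-overlapping occurrence of substring."""
    if not substring:
        return  # an empty needle would otherwise loop forever
    pos = 0
    while True:
        found = text.find(substring, pos)
        if found < 0:
            break
        end = found + len(substring)
        yield found, end
        pos = end  # continue searching after this match


a = "X - abc_degree + 1 + qq + abc_degree + 1"
b = "abc_degree + 1"
spans = list(find_all(a, b))
print(spans)
```

With the strings from earlier in the thread this finds both occurrences, where a single `a.find(b)` call reports only the first.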

Cheers,
Cameron Simpson <cs@cskk.id.au>


From avi.e.gross@gmail.com@21:1/5 to All on Mon Feb 27 19:34:49 2023

Just FYI, Jen, there are times a sledgehammer works but perhaps is not the only way. These days people worry less about efficiency and more about programmer time and education and that can be fine.

But if you looked at the methods available on strings or in some other modules, you would find that your situation is quite common. Some may use another RE front end called finditer().

I am NOT suggesting you do what I say next, but imagine writing a loop that takes a substring of what you are searching for of the same length as your search string. Near the end, it stops as there is too little left.

You can now simply test your searched for string against that substring for equality and it tends to return rapidly when they are not equal early on.

Your loop would return whatever data structure or results you want such as that it matched it three times at offsets a, b and c.

But do you allow overlaps? If not, your loop needs to skip len(search_str) after a match.

What you may want to consider is another form of pre-processing. Do you care if "abc_degree + 1" has missing or added spaces at the start or end or anywhere in the middle, as in " abc_degree +1"?

Do you care if stuff is a different case like "Abc_Degree + 1"?

Some such searches can be done if both the pattern and searched string are first converted to a canonical format that maps to the same output. But that complicates things a bit and you may have to display what you match differently.
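
One possible canonical form, as a sketch: collapse runs of whitespace, trim the ends, and fold case. The helper name `canonical` and the sample strings are made up for illustration. It tolerates extra spaces and case differences, but not a missing space such as "+1":

```python
def canonical(s):
    """Collapse runs of whitespace to one space, trim the ends, ignore case."""
    return ' '.join(s.split()).casefold()


pattern = "abc_degree + 1"
candidates = ["  Abc_Degree   + 1 ", "X - ABC_DEGREE + 1 + qq", "abc_degree +1"]

for text in candidates:
    # Offsets found this way refer to the canonical text, not the original.
    print(repr(text), canonical(pattern) in canonical(text))
```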

And are you also willing to match this: "myabc_degree + 1"?

When using a crafted RE there is a way to ask for a word boundary, so abc will only be matched if before it is a space or the start of the string, and not "my".
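
In Python's `re` module the word-boundary assertion is `\b`. A minimal sketch, with illustrative names; note that `\b` matches wherever a word character meets a non-word character or the edge of the string, which is slightly broader than "a space or the start of the string":

```python
import re

needle = 'abc_degree + 1'
# Escape the literal text, then require a word boundary on both sides.
pattern = re.compile(r'\b' + re.escape(needle) + r'\b')

standalone = pattern.search('X - abc_degree + 1 + qq')
embedded = pattern.search('X - myabc_degree + 1 + qq')
print(standalone is not None, embedded is not None)
```

The trailing `\b` only makes sense here because the needle ends in a word character; a needle ending in punctuation would need a different check.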

So this may be a case where you can either solve an easy version, with the chance it can be fooled, or overengineer it. If you are allowing the user to type in what to search for, as many programs, including editors, do, you will often find such false positives unless the user knows RE syntax and applies it and you do not escape it. I have experienced havoc when doing a careless global replace that matched more than I expected, including making changes in comments or constant strings rather than just the name of a function. Adding a paren is helpful, as is not replacing them all at once but one at a time, skipping any that are not wanted.

Good luck.


From Jen Kris@21:1/5 to All on Tue Feb 28 01:39:57 2023

string.count() only tells me there are N instances of the string; it does not say where they begin and end, as does re.finditer.

Feb 27, 2023, 16:20 by bobmellowood@gmail.com:

Would string.count() work for you then?


From Cameron Simpson@21:1/5 to Jen Kris on Tue Feb 28 11:36:45 2023

On 28Feb2023 00:57, Jen Kris <jenkris@tutanota.com> wrote:

Yes, that's it. I don't know how long it would have taken to find that detail with research through the voluminous re documentation. Thanks
very much.

You find things like this by printing out the strings you're actually
working with. Not the original strings, but the strings when you're
invoking `finditer` i.e. in your case, escaped strings.

Then you might have seen that what you were searching no longer
contained what you were searching for.

Don't underestimate the value of the debugging print call. It lets you
see what your programme is actually working with, instead of what you
thought it was working with.

Cheers,
Cameron Simpson <cs@cskk.id.au>


From Jen Kris@21:1/5 to All on Tue Feb 28 02:36:26 2023

I haven't tested it either but it looks like it would work. But for this case I prefer the relative simplicity of:

import re
example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

4 18
26 40

I don't insist on terseness for its own sake, but it's cleaner this way.

Jen


From avi.e.gross@gmail.com@21:1/5 to All on Mon Feb 27 20:56:00 2023

Jen,

What you just described is why that tool is not the right tool for the job, albeit it may help you confirm if whatever method you choose does work correctly and finds the same number of matches.

Sometimes you simply do some searching and roll your own.

Consider this code using a sort of list comprehension feature:

short = "hello world"
longer = "hello world is how many programs start for novices but some use hello world! to show how happy they are to say hello world"

short in longer

True

howLong = len(short)

res = [(offset, offset + howLong) for offset in range(len(longer)) if longer.startswith(short, offset)]
res

[(0, 11), (64, 75), (111, 122)]

len(res)

3

I could do a bit more but it seems to work. Did I get the offsets right? Checking:

print( [ longer[res[index][0]:res[index][1]] for index in range(len(res))])

['hello world', 'hello world', 'hello world']

Seems to work but thrown together quickly so can likely be done much nicer.

But as noted, the above has flaws such as matching overlaps like:

short = "good good"
longer = "A good good good but not douple plus good good good goody"
howLong = len(short)
res = [(offset, offset + howLong) for offset in range(len(longer)) if longer.startswith(short, offset)]
res

[(2, 11), (7, 16), (37, 46), (42, 51), (47, 56)]

It matched five times, as sometimes we had three or four "good" in a row. Some other method might match only three.
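
The difference between the two counts comes down to how far the search advances after a hit. A sketch using `str.find`, with a made-up helper name `spans` and the same test strings as above; it assumes a non-empty key:

```python
short = "good good"
longer = "A good good good but not douple plus good good good goody"


def spans(text, key, overlapping):
    """Collect (start, end) pairs; step by 1 to allow overlaps, else by len(key)."""
    result = []
    pos = text.find(key)
    while pos >= 0:
        result.append((pos, pos + len(key)))
        pos = text.find(key, pos + (1 if overlapping else len(key)))
    return result


print(spans(longer, short, overlapping=True))
print(spans(longer, short, overlapping=False))
```

Stepping by one reproduces the five overlapping matches from the list comprehension; stepping by `len(key)` leaves three.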

What some might do can get long and you clearly want one answer and not tutorials. For example, people can make a loop that finds a match and either sabotages the area by replacing or deleting it, or keeps track and searches again on a substring offset from the beginning.

When you do not find a tool, consider making one. You can take (better) code than I show above and make it into a function and now you have a tool. Even better, you can make it return whatever you want.


From avi.e.gross@gmail.com@21:1/5 to All on Mon Feb 27 21:16:01 2023

Jen,

Can you see what SOME OF US see as ASCII text? We can help you better if we get code that can be copied and run as-is.

What you sent is not terse. It is wrong. It will not run on any python interpreter because you somehow lost a carriage return and indent.

This is what you sent:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1') for match in re.finditer(find_string, example):
print(match.start(), match.end())

This is the code indented properly:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

Of course I am sure you wrote and ran code more like the latter version but somewhere in your copy/paste process, ....

And, just for fun, since there is nothing wrong with your code, this minor change is terser:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
for match in re.finditer(re.escape('abc_degree + 1'), example):
    print(match.start(), match.end())

4 18
26 40

But note that once you use regular expressions (though not in your case), you might match multiple things that are far from the same, such as matching two repeated words of any kind in any case, including "and and" and "so so", or finding words that have multiple doubled letters, as in the stereotypical bookkeeper. In those cases, you may want more than offsets: you may also want to show the exact text that matched, or even some characters before and/or after for context.
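
A sketch of the repeated-word case, showing the matched text and a little context on each side. The sample sentence, the pattern name `doubled_word`, and the five-character context window are all made up for illustration:

```python
import re

text = "It was and and then it went so so quietly, said the bookkeeper."
# (\w+) captures a word; \1 requires the same word again after whitespace.
doubled_word = re.compile(r'\b(\w+)\s+\1\b', re.IGNORECASE)

for m in doubled_word.finditer(text):
    # Show a few characters on each side of the match for context.
    before = text[max(0, m.start() - 5):m.start()]
    after = text[m.end():m.end() + 5]
    print(m.span(), repr(m.group()), repr(before), repr(after))
```

Here the offsets alone would say little, since each match is a different piece of text; `m.group()` recovers what was actually found.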


From Thomas Passin@21:1/5 to avi.e.gross@gmail.com on Mon Feb 27 21:44:26 2023

On 2/27/2023 9:16 PM, avi.e.gross@gmail.com wrote:

And, just for fun, since there is nothing wrong with your code, this minor change is terser:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
for match in re.finditer(re.escape('abc_degree + 1'), example):
    print(match.start(), match.end())

4 18
26 40

Just for more fun :) -

Without knowing how general your expressions will be, I think the
following version is very readable, certainly more readable than regexes:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
KEY = 'abc_degree + 1'

for i in range(len(example)):
    if example[i:].startswith(KEY):
        print(i, i + len(KEY))
# prints:
4 18
26 40

If you may have variable numbers of spaces around the symbols, OTOH, the
whole situation changes and then regexes would almost certainly be the
best approach. But the regular expression strings would become harder
to read.
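
For the variable-spacing case, one way to keep the regex readable is to build it from the plain key rather than write it by hand. This is a sketch under the assumption that whitespace is the only thing allowed to vary; the `example` string with irregular spacing is made up:

```python
import re

KEY = 'abc_degree + 1'
# Escape each whitespace-separated piece, then allow any amount of
# whitespace (including none) between the pieces.
pattern = re.compile(r'\s*'.join(re.escape(part) for part in KEY.split()))

example = 'X - abc_degree +1 + qq + abc_degree  +  1'
for match in pattern.finditer(example):
    print(match.start(), match.end(), repr(match.group()))
```

The key stays legible in the source, and the harder-to-read pattern string is only ever generated.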


From avi.e.gross@gmail.com@21:1/5 to avi.e.gross@gmail.com on Mon Feb 27 22:47:57 2023

I think by now we have given all that is needed by the OP, but Dave's answer strikes me as being able to be a tad faster as a while loop if you are searching a larger corpus, such as an entire ebook or all books, as you can do on books.google.com

I think I mentioned earlier that some assumptions need to apply. The text
needs to be something like an ASCII encoding or seen as code points rather
than bytes. We assume a match should move forward by the length of the
match. And, clearly, there cannot be a match too close to the end.

So a while loop would begin with a variable set to zero to mark the current location of the search. The condition for repeating the loop is that this variable is less than or equal to len(searched_text) - len(key)

In the loop, each comparison is done the same way as David uses, or anything similar enough but the twist is a failure increments the variable by 1 while success increments by len(key).

Will this make much difference? It might as the simpler algorithm counts overlapping matches and wastes some time hunting where perhaps it shouldn't.

And, of course, if you made something like this into a search function, you
can easily add features such as asking that you only return the first N
matches or the next N, simply by making it a generator.

So tying this into an earlier discussion, do you want the LAST match info visible when the while loop has completed? If it was available, it opens up possibilities for running the loop again but starting from where you left off.
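
The while loop described above might look like the following sketch. The function name `search` is made up, and returning nothing for an empty key is an assumption. Because it is a generator, the "first N, then the next N" idea falls out of `itertools.islice`:

```python
from itertools import islice


def search(text, key):
    """Yield (start, end) for non-overlapping matches of key in text."""
    if not key:
        return  # assumption: an empty key yields nothing
    pos = 0
    last_start = len(text) - len(key)
    while pos <= last_start:
        if text.startswith(key, pos):
            yield pos, pos + len(key)
            pos += len(key)   # success: jump past the match
        else:
            pos += 1          # failure: try the next position


text = 'X - abc_degree + 1 + qq + abc_degree + 1'
matches = search(text, 'abc_degree + 1')
first = list(islice(matches, 1))   # only the first match
rest = list(matches)               # resume where the loop left off
print(first, rest)
```

The generator keeps its own position between calls, so no "last match" variable has to be exposed to resume the search.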


From Roel Schroeven@21:1/5 to All on Tue Feb 28 10:33:20 2023

On 28/02/2023 at 3:44, Thomas Passin wrote:

On 2/27/2023 9:16 PM, avi.e.gross@gmail.com wrote:

And, just for fun, since there is nothing wrong with your code, this
minor change is terser:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
for match in re.finditer(re.escape('abc_degree + 1'), example):
    print(match.start(), match.end())
# prints:
4 18
26 40

Just for more fun :) -

Without knowing how general your expressions will be, I think the
following version is very readable, certainly more readable than regexes:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
KEY = 'abc_degree + 1'

for i in range(len(example)):
    if example[i:].startswith(KEY):
        print(i, i + len(KEY))
# prints:
4 18
26 40

I think it's often a good idea to use a standard library function
instead of rolling your own. The issue becomes less clear-cut when the
standard library doesn't do exactly what you need (as here, where
re.finditer() uses regular expressions while the use case only uses
simple search strings). Ideally there would be a str.finditer() method
we could use, but in the absence of that I think we still need to
consider using the almost-but-not-quite fitting re.finditer().

Two reasons:

(1) I think it's clearer: the name tells us what it does (though of
course we could solve this in a hand-written version by wrapping it in a suitably named function).

(2) Searching for a string in another string, in a performant way, is
not as simple as it first appears. Your version works correctly, but
slowly. In some situations it doesn't matter, but in other cases it
will. For better performance, string searching algorithms jump ahead
either when they found a match or when they know for sure there isn't a
match for some time (see e.g. the Boyer–Moore string-search algorithm).
You could write such a more efficient algorithm, but then it becomes
more complex and more error-prone. Using a well-tested existing function becomes quite attractive.
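To give a feel for what "jumping ahead" means, here is a minimal sketch of the Horspool simplification of Boyer-Moore. It is an illustration only, not a claim about what str.find() or the re module do internally, and the function name is made up. In pure Python it is unlikely to beat the built-ins, which is rather the point of (2).

```python
def horspool_find_all(key, text):
    """Return (start, end) for every occurrence of key in text (overlaps included)."""
    m, n = len(key), len(text)
    if m == 0 or m > n:
        return []
    # For each character of the key except the last: how far the window may
    # shift when that character is aligned with the end of the window.
    shift = {ch: m - 1 - i for i, ch in enumerate(key[:-1])}
    matches = []
    i = 0
    while i <= n - m:
        if text[i:i + m] == key:
            matches.append((i, i + m))
        # Characters that do not occur in the key allow a jump of the full key length.
        i += shift.get(text[i + m - 1], m)
    return matches

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
print(horspool_find_all('abc_degree + 1', example))  # [(4, 18), (26, 40)]
```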

To illustrate the difference in performance, I did a simple test (using the paragraph above as test text):

    import re
    import timeit

    def using_re_finditer(key, text):
        matches = []
        for match in re.finditer(re.escape(key), text):
            matches.append((match.start(), match.end()))
        return matches

    def using_simple_loop(key, text):
        matches = []
        for i in range(len(text)):
            if text[i:].startswith(key):
                matches.append((i, i + len(key)))
        return matches

    CORPUS = """Searching for a string in another string, in a performant way, is
    not as simple as it first appears. Your version works correctly, but slowly.
    In some situations it doesn't matter, but in other cases it will. For better
    performance, string searching algorithms jump ahead either when they found a
    match or when they know for sure there isn't a match for some time (see e.g.
    the Boyer–Moore string-search algorithm). You could write such a more
    efficient algorithm, but then it becomes more complex and more error-prone.
    Using a well-tested existing function becomes quite attractive."""
    KEY = 'in'
    print('using_simple_loop:', timeit.repeat(stmt='using_simple_loop(KEY, CORPUS)', globals=globals(), number=1000))
    print('using_re_finditer:', timeit.repeat(stmt='using_re_finditer(KEY, CORPUS)', globals=globals(), number=1000))

This does 5 runs of 1000 repetitions each, and reports the time in
seconds for each of those runs.
Result on my machine:

    using_simple_loop: [0.13952950000020792, 0.13063130000000456, 0.12803450000001249, 0.13186180000002423, 0.13084610000032626]
    using_re_finditer: [0.003861400000005233, 0.004061900000124297, 0.003478999999970256, 0.003413100000216218, 0.0037320000001273]

We find that in this test re.finditer() is more than 30 times faster
(despite the overhead of regular expressions).

While speed isn't everything in programming, with such a large
difference in performance and (to me) no real disadvantages of using re.finditer(), I would prefer re.finditer() over writing my own.

--
"The saddest aspect of life right now is that science gathers knowledge
faster than society gathers wisdom."
-- Isaac Asimov


From Stefan Ram@21:1/5 to Jen Kris on Tue Feb 28 11:47:44 2023

Jen Kris <jenkris@tutanota.com> writes:

example = re.escape('X - cty_degrees + 1 + qq')
find_string = re.escape('cty_degrees + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

One needs to escape the plus sign, not the spaces,
but only in the pattern.

import re

string = 'X - cty_degrees + 1 + qq'
pattern = re.escape( 'cty_degrees + 1' )
for match in re.finditer( pattern, string ):
    print( match.start(), match.end() )
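A small sketch of why only the pattern should be escaped. The variable names haystack and needle are chosen here for illustration; the strings are the ones from the original question.

```python
import re

haystack = 'X - cty_degrees + 1 + qq'
needle = 'cty_degrees + 1'

# Unescaped, '+' is a quantifier: ' +' means "one or more spaces",
# so the literal plus sign in the haystack is never matched.
print(re.search(needle, haystack))  # None

# Escaping only the pattern makes every character literal.
match = re.search(re.escape(needle), haystack)
print(match.start(), match.end())  # 4 19

# Escaping the searched string as well inserts backslashes into it,
# so the literal pattern no longer occurs there.
print(re.search(re.escape(needle), re.escape(haystack)))  # None
```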


From Thomas Passin@21:1/5 to Roel Schroeven on Tue Feb 28 08:35:35 2023

On 2/28/2023 4:33 AM, Roel Schroeven wrote:

Op 28/02/2023 om 3:44 schreef Thomas Passin:

On 2/27/2023 9:16 PM, avi.e.gross@gmail.com wrote:

And, just for fun, since there is nothing wrong with your code, this
minor change is terser:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
for match in re.finditer(re.escape('abc_degree + 1'), example):
    print(match.start(), match.end())
# prints:
4 18
26 40

Just for more fun :) -

Without knowing how general your expressions will be, I think the
following version is very readable, certainly more readable than regexes:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
KEY = 'abc_degree + 1'

for i in range(len(example)):
    if example[i:].startswith(KEY):
        print(i, i + len(KEY))
# prints:
4 18
26 40

I think it's often a good idea to use a standard library function
instead of rolling your own. The issue becomes less clear-cut when the standard library doesn't do exactly what you need (as here, where re.finditer() uses regular expressions while the use case only uses
simple search strings). Ideally there would be a str.finditer() method
we could use, but in the absence of that I think we still need to
consider using the almost-but-not-quite fitting re.finditer().

Two reasons:

(1) I think it's clearer: the name tells us what it does (though of
course we could solve this in a hand-written version by wrapping it in a suitably named function).

(2) Searching for a string in another string, in a performant way, is
not as simple as it first appears. Your version works correctly, but
slowly. In some situations it doesn't matter, but in other cases it
will. For better performance, string searching algorithms jump ahead
either when they found a match or when they know for sure there isn't a
match for some time (see e.g. the Boyer–Moore string-search algorithm).
You could write such a more efficient algorithm, but then it becomes
more complex and more error-prone. Using a well-tested existing function becomes quite attractive.

Sure, it all depends on what the real task will be. That's why I wrote "Without knowing how general your expressions will be". For the example
string, it's unlikely that speed will be a factor, but who knows what
target strings and keys will turn up in the future?

To illustrate the difference in performance, I did a simple test (using the paragraph above as test text):

    import re
    import timeit

    def using_re_finditer(key, text):
        matches = []
        for match in re.finditer(re.escape(key), text):
            matches.append((match.start(), match.end()))
        return matches

    def using_simple_loop(key, text):
        matches = []
        for i in range(len(text)):
            if text[i:].startswith(key):
                matches.append((i, i + len(key)))
        return matches

    CORPUS = """Searching for a string in another string, in a performant way, is
    not as simple as it first appears. Your version works correctly, but slowly.
    In some situations it doesn't matter, but in other cases it will. For better
    performance, string searching algorithms jump ahead either when they found a
    match or when they know for sure there isn't a match for some time (see e.g.
    the Boyer–Moore string-search algorithm). You could write such a more
    efficient algorithm, but then it becomes more complex and more error-prone.
    Using a well-tested existing function becomes quite attractive."""
    KEY = 'in'
    print('using_simple_loop:', timeit.repeat(stmt='using_simple_loop(KEY, CORPUS)', globals=globals(), number=1000))
    print('using_re_finditer:', timeit.repeat(stmt='using_re_finditer(KEY, CORPUS)', globals=globals(), number=1000))

This does 5 runs of 1000 repetitions each, and reports the time in
seconds for each of those runs.
Result on my machine:

    using_simple_loop: [0.13952950000020792, 0.13063130000000456, 0.12803450000001249, 0.13186180000002423, 0.13084610000032626]
    using_re_finditer: [0.003861400000005233, 0.004061900000124297, 0.003478999999970256, 0.003413100000216218, 0.0037320000001273]

We find that in this test re.finditer() is more than 30 times faster
(despite the overhead of regular expressions).

While speed isn't everything in programming, with such a large
difference in performance and (to me) no real disadvantages of using re.finditer(), I would prefer re.finditer() over writing my own.


From Roel Schroeven@21:1/5 to All on Tue Feb 28 16:05:47 2023

Op 28/02/2023 om 14:35 schreef Thomas Passin:

On 2/28/2023 4:33 AM, Roel Schroeven wrote:

[...]
(2) Searching for a string in another string, in a performant way, is
not as simple as it first appears. Your version works correctly, but
slowly. In some situations it doesn't matter, but in other cases it
will. For better performance, string searching algorithms jump ahead
either when they found a match or when they know for sure there isn't
a match for some time (see e.g. the Boyer–Moore string-search
algorithm). You could write such a more efficient algorithm, but then
it becomes more complex and more error-prone. Using a well-tested
existing function becomes quite attractive.

Sure, it all depends on what the real task will be. That's why I
wrote "Without knowing how general your expressions will be". For the
example string, it's unlikely that speed will be a factor, but who
knows what target strings and keys will turn up in the future?

In hindsight I think I was overthinking things a bit. "It all depends
on what the real task will be" you say, and indeed I think that should
be the main conclusion here.

--
"Man had always assumed that he was more intelligent than dolphins because
he had achieved so much — the wheel, New York, wars and so on — whilst all the dolphins had ever done was muck about in the water having a good time.
But conversely, the dolphins had always believed that they were far more intelligent than man — for precisely the same reasons."
-- Douglas Adams


From Thomas Passin@21:1/5 to Roel Schroeven on Tue Feb 28 10:41:11 2023

On 2/28/2023 10:05 AM, Roel Schroeven wrote:

Op 28/02/2023 om 14:35 schreef Thomas Passin:

On 2/28/2023 4:33 AM, Roel Schroeven wrote:

[...]
(2) Searching for a string in another string, in a performant way, is
not as simple as it first appears. Your version works correctly, but
slowly. In some situations it doesn't matter, but in other cases it
will. For better performance, string searching algorithms jump ahead
either when they found a match or when they know for sure there isn't
a match for some time (see e.g. the Boyer–Moore string-search
algorithm). You could write such a more efficient algorithm, but then
it becomes more complex and more error-prone. Using a well-tested
existing function becomes quite attractive.

Sure, it all depends on what the real task will be. That's why I
wrote "Without knowing how general your expressions will be". For the
example string, it's unlikely that speed will be a factor, but who
knows what target strings and keys will turn up in the future?

In hindsight I think I was overthinking things a bit. "It all depends
on what the real task will be" you say, and indeed I think that should
be the main conclusion here.

It is interesting, though, how pre-processing the search pattern can
improve search times if you can afford the pre-processing. Here's a
paper on rapidly finding matches when there may be up to one misspelled character. It's easy enough to implement, though in Python you can't
take the additional step of tuning it to stay in cache.

https://Robert.Muth.Org/Papers/1996-Approx-Multi.Pdf


From Jon Ribbens@21:1/5 to Thomas Passin on Tue Feb 28 16:48:29 2023

On 2023-02-28, Thomas Passin <list1@tompassin.net> wrote:

On 2/28/2023 10:05 AM, Roel Schroeven wrote:

Op 28/02/2023 om 14:35 schreef Thomas Passin:

On 2/28/2023 4:33 AM, Roel Schroeven wrote:

[...]
(2) Searching for a string in another string, in a performant way, is
not as simple as it first appears. Your version works correctly, but
slowly. In some situations it doesn't matter, but in other cases it
will. For better performance, string searching algorithms jump ahead
either when they found a match or when they know for sure there isn't
a match for some time (see e.g. the Boyer–Moore string-search
algorithm). You could write such a more efficient algorithm, but then
it becomes more complex and more error-prone. Using a well-tested
existing function becomes quite attractive.

Sure, it all depends on what the real task will be. That's why I
wrote "Without knowing how general your expressions will be". For the
example string, it's unlikely that speed will be a factor, but who
knows what target strings and keys will turn up in the future?

In hindsight I think I was overthinking things a bit. "It all depends
on what the real task will be" you say, and indeed I think that should
be the main conclusion here.

It is interesting, though, how pre-processing the search pattern can
improve search times if you can afford the pre-processing. Here's a
paper on rapidly finding matches when there may be up to one misspelled character. It's easy enough to implement, though in Python you can't
take the additional step of tuning it to stay in cache.

https://Robert.Muth.Org/Papers/1996-Approx-Multi.Pdf

You've somehow title-cased that URL. The correct URL is:

https://robert.muth.org/Papers/1996-approx-multi.pdf


From Jen Kris@21:1/5 to All on Tue Feb 28 19:07:42 2023

Using str.startswith is a cool idea in this case. But is it better than regex for performance or reliability? Regex syntax is not a model of simplicity, but in my simple case it's not too difficult.

Feb 27, 2023, 18:52 by list1@tompassin.net:

On 2/27/2023 9:16 PM, avi.e.gross@gmail.com wrote:

And, just for fun, since there is nothing wrong with your code, this minor change is terser:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
for match in re.finditer(re.escape('abc_degree + 1'), example):
    print(match.start(), match.end())
# prints:
4 18
26 40

Just for more fun :) -

Without knowing how general your expressions will be, I think the following version is very readable, certainly more readable than regexes:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
KEY = 'abc_degree + 1'

for i in range(len(example)):
    if example[i:].startswith(KEY):
        print(i, i + len(KEY))
# prints:
4 18
26 40

If you may have variable numbers of spaces around the symbols, OTOH, the whole situation changes and then regexes would almost certainly be the best approach. But the regular expression strings would become harder to read.

From Jen Kris@21:1/5 to All on Tue Feb 28 18:57:52 2023

The code I sent is correct, and it runs here. Maybe you received it with a carriage return removed, but on my copy after posting, it is correct:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

One question: several people have made suggestions other than regex (not the terser regex example you showed below). Is there a reason why regex is not preferred to, for example, a list comp? Performance? Reliability?

Feb 27, 2023, 18:16 by avi.e.gross@gmail.com:

Jen,

Can you see what SOME OF US see as ASCII text? We can help you better if we get code that can be copied and run as-is.

What you sent is not terse. It is wrong. It will not run on any python interpreter because you somehow lost a carriage return and indent.

This is what you sent:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1') for match in re.finditer(find_string, example):
print(match.start(), match.end())

This is the code indented properly:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

Of course I am sure you wrote and ran code more like the latter version but somewhere in your copy/paste process, ....

And, just for fun, since there is nothing wrong with your code, this minor change is terser:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
for match in re.finditer(re.escape('abc_degree + 1'), example):
    print(match.start(), match.end())
# prints:
4 18
26 40

But note that once you use real regular expressions (which is not your case), a single pattern might match many different things, such as any repeated word in any case, including "and and" and "so so", or words with multiple doubled letters, as in the stereotypical "bookkeeper". In those cases you may want more than offsets: you may also want to show the exact text that matched, or even some characters before and/or after it for context.
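A short sketch of the repeated-word case, showing the matched text and a little surrounding context alongside the offsets. The sample sentence and the five-character context window are arbitrary choices made for illustration.

```python
import re

text = 'This is is a test, and and so so on.'
# \b(\w+)\s+\1\b : a word, some whitespace, then the same word again.
doubled = re.compile(r'\b(\w+)\s+\1\b', re.IGNORECASE)

for match in doubled.finditer(text):
    start, end = match.span()
    # A few characters on either side, clipped at the start of the text.
    context = text[max(0, start - 5):end + 5]
    print(start, end, repr(match.group()), repr(context))
```

Here the offsets alone would not say which word was repeated, which is why match.group() becomes useful.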

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On Behalf Of Jen Kris via Python-list
Sent: Monday, February 27, 2023 8:36 PM
To: Cameron Simpson <cs@cskk.id.au>
Cc: Python List <python-list@python.org>
Subject: Re: How to escape strings for re.finditer?

I haven't tested it either but it looks like it would work. But for this case I prefer the relative simplicity of:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1') for match in re.finditer(find_string, example):
print(match.start(), match.end())

4 18
26 40

I don't insist on terseness for its own sake, but it's cleaner this way.

Jen

Feb 27, 2023, 16:55 by cs@cskk.id.au:

On 28Feb2023 01:13, Jen Kris <jenkris@tutanota.com> wrote:

I went to the re module because the specified string may appear more than once in the string (in the code I'm writing).

Sure, but writing a `finditer` for plain `str` is pretty easy (untested):

pos = 0
while True:
    found = s.find(substring, pos)
    if found < 0:
        break
    start = found
    end = found + len(substring)
    # ... do whatever with start and end ...
    pos = end

Many people go straight to the `re` module whenever they're looking for strings. It is often cryptic, error-prone overkill. Just something to keep in mind.

Cheers,
Cameron Simpson <cs@cskk.id.au>

From Thomas Passin@21:1/5 to Jen Kris via Python-list on Tue Feb 28 13:23:25 2023

On 2/28/2023 12:57 PM, Jen Kris via Python-list wrote:

The code I sent is correct, and it runs here. Maybe you received it with a carriage return removed, but on my copy after posting, it is correct:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

One question: several people have made suggestions other than regex (not the terser regex example you showed below). Is there a reason why regex is not preferred to, for example, a list comp? Performance? Reliability?

"Some people, when confronted with a problem, think 'I know, I'll use
regular expressions.' Now they have two problems."

-
https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/

Of course, if you actually read the blog post in the link, there's more
to it than that...

Feb 27, 2023, 18:16 by avi.e.gross@gmail.com:

Jen,

Can you see what SOME OF US see as ASCII text? We can help you better if we get code that can be copied and run as-is.

What you sent is not terse. It is wrong. It will not run on any python interpreter because you somehow lost a carriage return and indent.

This is what you sent:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1') for match in re.finditer(find_string, example):
print(match.start(), match.end())

This is the code indented properly:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

Of course I am sure you wrote and ran code more like the latter version but somewhere in your copy/paste process, ....

And, just for fun, since there is nothing wrong with your code, this minor change is terser:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
for match in re.finditer(re.escape('abc_degree + 1'), example):
    print(match.start(), match.end())
# prints:
4 18
26 40

But note that once you use real regular expressions (which is not your case), a single pattern might match many different things, such as any repeated word in any case, including "and and" and "so so", or words with multiple doubled letters, as in the stereotypical "bookkeeper". In those cases you may want more than offsets: you may also want to show the exact text that matched, or even some characters before and/or after it for context.

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On Behalf Of Jen Kris via Python-list
Sent: Monday, February 27, 2023 8:36 PM
To: Cameron Simpson <cs@cskk.id.au>
Cc: Python List <python-list@python.org>
Subject: Re: How to escape strings for re.finditer?

I haven't tested it either but it looks like it would work. But for this case I prefer the relative simplicity of:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1') for match in re.finditer(find_string, example):
print(match.start(), match.end())

4 18
26 40

I don't insist on terseness for its own sake, but it's cleaner this way.

Jen

Feb 27, 2023, 16:55 by cs@cskk.id.au:

On 28Feb2023 01:13, Jen Kris <jenkris@tutanota.com> wrote:

I went to the re module because the specified string may appear more than once in the string (in the code I'm writing).

Sure, but writing a `finditer` for plain `str` is pretty easy (untested):

pos = 0
while True:
    found = s.find(substring, pos)
    if found < 0:
        break
    start = found
    end = found + len(substring)
    # ... do whatever with start and end ...
    pos = end

Many people go straight to the `re` module whenever they're looking for strings. It is often cryptic, error-prone overkill. Just something to keep in mind.

Cheers,
Cameron Simpson <cs@cskk.id.au>

From Jen Kris@21:1/5 to All on Tue Feb 28 19:16:54 2023

I wrote my previous message before reading this. Thank you for the test you ran -- it answers the question of performance. You show that re.finditer is more than 30x faster in that test, which certainly recommends it over a simple loop, with its looping overhead.

Feb 28, 2023, 05:44 by list1@tompassin.net:

On 2/28/2023 4:33 AM, Roel Schroeven wrote:

Op 28/02/2023 om 3:44 schreef Thomas Passin:

On 2/27/2023 9:16 PM, avi.e.gross@gmail.com wrote:

And, just for fun, since there is nothing wrong with your code, this minor change is terser:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
for match in re.finditer(re.escape('abc_degree + 1'), example):
    print(match.start(), match.end())
# prints:
4 18
26 40

Just for more fun :) -

Without knowing how general your expressions will be, I think the following version is very readable, certainly more readable than regexes:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
KEY = 'abc_degree + 1'

for i in range(len(example)):
    if example[i:].startswith(KEY):
        print(i, i + len(KEY))
# prints:
4 18
26 40

I think it's often a good idea to use a standard library function instead of rolling your own. The issue becomes less clear-cut when the standard library doesn't do exactly what you need (as here, where re.finditer() uses regular expressions while the use case only uses simple search strings). Ideally there would be a str.finditer() method we could use, but in the absence of that I think we still need to consider using the almost-but-not-quite fitting re.finditer().

Two reasons:

(1) I think it's clearer: the name tells us what it does (though of course we could solve this in a hand-written version by wrapping it in a suitably named function).

(2) Searching for a string in another string, in a performant way, is not as simple as it first appears. Your version works correctly, but slowly. In some situations it doesn't matter, but in other cases it will. For better performance, string searching algorithms jump ahead either when they found a match or when they know for sure there isn't a match for some time (see e.g. the Boyer–Moore string-search algorithm). You could write such a more efficient algorithm, but then it becomes more complex and more error-prone. Using a well-tested existing function becomes quite attractive.

Sure, it all depends on what the real task will be. That's why I wrote "Without knowing how general your expressions will be". For the example string, it's unlikely that speed will be a factor, but who knows what target strings and keys will turn up in the future?

To illustrate the difference in performance, I did a simple test (using the paragraph above as test text):

    import re
    import timeit

    def using_re_finditer(key, text):
        matches = []
        for match in re.finditer(re.escape(key), text):
            matches.append((match.start(), match.end()))
        return matches

    def using_simple_loop(key, text):
        matches = []
        for i in range(len(text)):
            if text[i:].startswith(key):
                matches.append((i, i + len(key)))
        return matches

    CORPUS = """Searching for a string in another string, in a performant way, is
    not as simple as it first appears. Your version works correctly, but slowly.
    In some situations it doesn't matter, but in other cases it will. For better
    performance, string searching algorithms jump ahead either when they found a
    match or when they know for sure there isn't a match for some time (see e.g.
    the Boyer–Moore string-search algorithm). You could write such a more
    efficient algorithm, but then it becomes more complex and more error-prone.
    Using a well-tested existing function becomes quite attractive."""
    KEY = 'in'
    print('using_simple_loop:', timeit.repeat(stmt='using_simple_loop(KEY, CORPUS)', globals=globals(), number=1000))
    print('using_re_finditer:', timeit.repeat(stmt='using_re_finditer(KEY, CORPUS)', globals=globals(), number=1000))

This does 5 runs of 1000 repetitions each, and reports the time in seconds for each of those runs.
Result on my machine:

    using_simple_loop: [0.13952950000020792, 0.13063130000000456, 0.12803450000001249, 0.13186180000002423, 0.13084610000032626]
    using_re_finditer: [0.003861400000005233, 0.004061900000124297, 0.003478999999970256, 0.003413100000216218, 0.0037320000001273]

We find that in this test re.finditer() is more than 30 times faster (despite the overhead of regular expressions).

While speed isn't everything in programming, with such a large difference in performance and (to me) no real disadvantages of using re.finditer(), I would prefer re.finditer() over writing my own.


From Thomas Passin@21:1/5 to Jen Kris on Tue Feb 28 13:30:32 2023

On 2/28/2023 1:07 PM, Jen Kris wrote:

Using str.startswith is a cool idea in this case. But is it better than regex for performance or reliability? Regex syntax is not a model of simplicity, but in my simple case it's not too difficult.

The trouble is that we don't know what your case really is. If you are
talking about a short pattern like your example and a small text to
search, and you don't need to do it too often, then my little code
example is probably ideal. Reliability wouldn't be an issue, and
performance would not be relevant. If your case is going to be much
larger, called many times in a loop, or be much more complicated in some
other way, then a regex or some other approach is likely to be much faster.

Feb 27, 2023, 18:52 by list1@tompassin.net:

On 2/27/2023 9:16 PM, avi.e.gross@gmail.com wrote:

And, just for fun, since there is nothing wrong with your code,
this minor change is terser:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
for match in re.finditer(re.escape('abc_degree + 1'), example):
    print(match.start(), match.end())
# prints:
4 18
26 40

Just for more fun :) -

Without knowing how general your expressions will be, I think the
following version is very readable, certainly more readable than
regexes:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
KEY = 'abc_degree + 1'

for i in range(len(example)):
    if example[i:].startswith(KEY):
        print(i, i + len(KEY))
# prints:
4 18
26 40

If you may have variable numbers of spaces around the symbols, OTOH,
the whole situation changes and then regexes would almost certainly
be the best approach. But the regular expression strings would
become harder to read.

From avi.e.gross@gmail.com@21:1/5 to All on Tue Feb 28 13:31:22 2023

Roel,

You make some good points. One to consider is that when you ask a regular expression matcher to search for something that uses NO regular expression features, much of the complexity disappears, and what it builds is probably similar enough to a plain string search, except that the loops and all are run by fast functions, probably written in C.

That is one reason the roll-your-own versions are at a disadvantage, unless you roll your own in a similar way by writing a similar C function.

Nobody has shown us a simple but fast text search function that does a similar job, the kind of thing that really should be out there. It may still exist, but as you point out, perhaps it is not needed as long as people just use the re version.
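One candidate worth measuring is a loop around str.find(), which also does its scanning in C and only returns to Python once per match. The sketch below is an assumption-laden comparison, not a benchmark result: the function names, the repeated test text and the repetition counts are all invented here, and timings will vary by machine and Python version, so no numbers are quoted.

```python
import re
import timeit

def using_str_find(key, text):
    # str.find() scans in C; the Python-level loop runs once per match.
    # Assumes a non-empty key (an empty key would never advance).
    matches = []
    pos = text.find(key)
    while pos >= 0:
        matches.append((pos, pos + len(key)))
        pos = text.find(key, pos + len(key))
    return matches

def using_re_finditer(key, text):
    return [(m.start(), m.end()) for m in re.finditer(re.escape(key), text)]

TEXT = 'X - abc_degree + 1 + qq + abc_degree + 1 ' * 200
KEY = 'abc_degree + 1'

# Both approaches should agree on the non-overlapping matches.
assert using_str_find(KEY, TEXT) == using_re_finditer(KEY, TEXT)

for func in (using_str_find, using_re_finditer):
    best = min(timeit.repeat(lambda: func(KEY, TEXT), number=200))
    print(func.__name__, best)
```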

Avi

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On Behalf Of Roel Schroeven
Sent: Tuesday, February 28, 2023 4:33 AM
To: python-list@python.org
Subject: Re: How to escape strings for re.finditer?

Op 28/02/2023 om 3:44 schreef Thomas Passin:

On 2/27/2023 9:16 PM, avi.e.gross@gmail.com wrote:

And, just for fun, since there is nothing wrong with your code, this
minor change is terser:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
for match in re.finditer(re.escape('abc_degree + 1'), example):
    print(match.start(), match.end())
# prints:
4 18
26 40

Just for more fun :) -

Without knowing how general your expressions will be, I think the
following version is very readable, certainly more readable than regexes:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
KEY = 'abc_degree + 1'

for i in range(len(example)):
    if example[i:].startswith(KEY):
        print(i, i + len(KEY))
# prints:
4 18
26 40

I think it's often a good idea to use a standard library function instead of rolling your own. The issue becomes less clear-cut when the standard library doesn't do exactly what you need (as here, where
re.finditer() uses regular expressions while the use case only uses simple search strings). Ideally there would be a str.finditer() method we could use, but in the absence of that I think we still need to consider using the almost-but-not-quite fitting
re.finditer().

Two reasons:

(1) I think it's clearer: the name tells us what it does (though of course we could solve this in a hand-written version by wrapping it in a suitably named function).

(2) Searching for a string in another string, in a performant way, is not as simple as it first appears. Your version works correctly, but slowly. In some situations it doesn't matter, but in other cases it will. For better performance, string searching
algorithms jump ahead either when they found a match or when they know for sure there isn't a match for some time (see e.g. the Boyer–Moore string-search algorithm).
You could write such a more efficient algorithm, but then it becomes more complex and more error-prone. Using a well-tested existing function becomes quite attractive.
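
To make the "jump ahead" idea concrete, here is a rough sketch of the Boyer-Moore-Horspool simplification. This is not the full Boyer-Moore algorithm and not a claim about what str.find() or re does internally. The function name horspool_find_all is invented for this illustration, and it reports non-overlapping matches the way re.finditer() does.

```python
def horspool_find_all(key, text):
    """Return (start, end) spans of non-overlapping occurrences of key in text."""
    m, n = len(key), len(text)
    if m == 0 or m > n:
        return []
    # For every character of key except the last, record how far the window
    # may slide when that character is aligned with the window's last cell.
    # Later (rightmost) occurrences overwrite earlier ones, giving the
    # smaller, safe shift.
    shift = {ch: m - 1 - i for i, ch in enumerate(key[:-1])}
    matches = []
    i = 0
    while i <= n - m:
        if text[i:i + m] == key:
            matches.append((i, i + m))
            i += m  # skip past the match, so results do not overlap
        else:
            # Characters that never occur in key allow a jump of the full
            # key length.
            i += shift.get(text[i + m - 1], m)
    return matches

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
print(horspool_find_all('abc_degree + 1', example))
```

In pure Python this is unlikely to beat the built-in methods, which rather supports the point about preferring well-tested existing functions.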

To illustrate the difference in performance, I did a simple test (using the paragraph above as test text):

import re
import timeit

def using_re_finditer(key, text):
    matches = []
    for match in re.finditer(re.escape(key), text):
        matches.append((match.start(), match.end()))
    return matches

def using_simple_loop(key, text):
    matches = []
    for i in range(len(text)):
        if text[i:].startswith(key):
            matches.append((i, i + len(key)))
    return matches

CORPUS = """Searching for a string in another string, in a performant way, is
not as simple as it first appears. Your version works correctly, but slowly.
In some situations it doesn't matter, but in other cases it will.
For better
performance, string searching algorithms jump ahead either when they found a
match or when they know for sure there isn't a match for some time (see e.g.
the Boyer–Moore string-search algorithm). You could write such a more
efficient algorithm, but then it becomes more complex and more error-prone.
Using a well-tested existing function becomes quite attractive."""
KEY = 'in'
print('using_simple_loop:',
      timeit.repeat(stmt='using_simple_loop(KEY, CORPUS)', globals=globals(), number=1000))
print('using_re_finditer:',
      timeit.repeat(stmt='using_re_finditer(KEY, CORPUS)', globals=globals(), number=1000))

This does 5 runs of 1000 repetitions each, and reports the time in seconds for each of those runs.
Result on my machine:

using_simple_loop: [0.13952950000020792, 0.13063130000000456, 0.12803450000001249, 0.13186180000002423, 0.13084610000032626]
using_re_finditer: [0.003861400000005233, 0.004061900000124297, 0.003478999999970256, 0.003413100000216218, 0.0037320000001273]

We find that in this test re.finditer() is more than 30 times faster (despite the overhead of regular expressions).

While speed isn't everything in programming, with such a large difference in performance and (to me) no real disadvantages of using re.finditer(), I would prefer re.finditer() over writing my own.

--
"The saddest aspect of life right now is that science gathers knowledge faster than society gathers wisdom."
-- Isaac Asimov


From Thomas Passin@21:1/5 to Jon Ribbens via Python-list on Tue Feb 28 14:17:30 2023

On 2/28/2023 11:48 AM, Jon Ribbens via Python-list wrote:

On 2023-02-28, Thomas Passin <list1@tompassin.net> wrote:

...

It is interesting, though, how pre-processing the search pattern can
improve search times if you can afford the pre-processing. Here's a
paper on rapidly finding matches when there may be up to one misspelled
character. It's easy enough to implement, though in Python you can't
take the additional step of tuning it to stay in cache.

https://Robert.Muth.Org/Papers/1996-Approx-Multi.Pdf

You've somehow title-cased that URL. The correct URL is:

https://robert.muth.org/Papers/1996-approx-multi.pdf

Thanks, not sure how that happened ...


From David Raymond@21:1/5 to All on Tue Feb 28 19:40:10 2023

I wrote my previous message before reading this. Thank you for the test you ran -- it answers the question of performance. You show that re.finditer is 30x faster, so that certainly recommends that over a simple loop, which introduces looping overhead.

def using_simple_loop(key, text):
    matches = []
    for i in range(len(text)):
        if text[i:].startswith(key):
            matches.append((i, i + len(key)))
    return matches

using_simple_loop: [0.13952950000020792, 0.13063130000000456, 0.12803450000001249, 0.13186180000002423, 0.13084610000032626]
using_re_finditer: [0.003861400000005233, 0.004061900000124297, 0.003478999999970256, 0.003413100000216218, 0.0037320000001273]

With a slight tweak to the simple loop code using .find() it becomes a third faster than the RE version though.

def using_simple_loop2(key, text):
    matches = []
    keyLen = len(key)
    start = 0
    while (foundSpot := text.find(key, start)) > -1:
        start = foundSpot + keyLen
        matches.append((foundSpot, start))
    return matches

using_simple_loop: [0.1732664997689426, 0.1601669997908175, 0.15792609984055161, 0.1573973000049591, 0.15759290009737015]
using_re_finditer: [0.003412699792534113, 0.0032823001965880394, 0.0033694999292492867, 0.003354900050908327, 0.0033336998894810677]
using_simple_loop2: [0.00256159994751215, 0.0025471001863479614, 0.0025424999184906483, 0.0025831996463239193, 0.0025555999018251896]
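
One thing the timings do not show is that the three functions are not quite equivalent. The sketch below repeats them (copied from this thread) and runs them on a deliberately awkward input where occurrences overlap. The input 'aaaa' and key 'aa' are chosen purely for illustration. The walrus operator needs Python 3.8 or later.

```python
import re

def using_re_finditer(key, text):
    return [(m.start(), m.end()) for m in re.finditer(re.escape(key), text)]

def using_simple_loop(key, text):
    matches = []
    for i in range(len(text)):
        if text[i:].startswith(key):
            matches.append((i, i + len(key)))
    return matches

def using_simple_loop2(key, text):
    matches = []
    key_len = len(key)
    start = 0
    while (found := text.find(key, start)) > -1:
        start = found + key_len
        matches.append((found, start))
    return matches

overlapping = using_simple_loop('aa', 'aaaa')    # tests every position
via_regex = using_re_finditer('aa', 'aaaa')      # resumes after each match
via_find = using_simple_loop2('aa', 'aaaa')      # resumes after each match
print(overlapping, via_regex, via_find)
```

For the keys used in the benchmark this makes no difference, but it is worth knowing which behaviour you want before swapping one version for another.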


From avi.e.gross@gmail.com@21:1/5 to Jen Kris on Tue Feb 28 15:40:26 2023

This message is more for Thomas than Jen,

You made me think of what happens in fairly large cases. What happens if I ask you to search a thousand pages looking for your name?

One solution might be to break the problem into parts that can be run in independent threads or processes and perhaps across different CPU's or on many machines at once. Think of it as a variant on a merge sort where each chunk returns where it found one
or more items and then those are gathered together and merged upstream.

The problem is you cannot just randomly divide the text. Any matches across a divide are lost. So if you know you are searching for "Thomas Passin" you need an overlap big enough to hold enough of that size. It would not be made as something like a pure
binary tree and if the choices made included variant sizes in what might match, you would get duplicates. So the merging part would obviously have to eventually remove those.
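
A minimal sketch of the overlap idea, run sequentially for simplicity (each chunk could in principle be handed to a separate worker). The name find_in_chunks and the chunk_size parameter are assumptions made for this illustration. With an overlap of exactly len(key) - 1 each match is found once; the set is there to absorb duplicates if a larger overlap were used.

```python
def find_in_chunks(key, text, chunk_size):
    """Find every occurrence of key by searching overlapping chunks of text."""
    if not key or chunk_size < 1:
        raise ValueError('key must be non-empty and chunk_size positive')
    overlap = len(key) - 1  # enough for a match that starts at a chunk's end
    found = set()
    for base in range(0, len(text), chunk_size):
        chunk = text[base:base + chunk_size + overlap]
        pos = chunk.find(key)
        while pos != -1:
            # Convert the chunk-relative offset back to a position in text.
            found.add((base + pos, base + pos + len(key)))
            pos = chunk.find(key, pos + 1)
    # The merge step: gather, de-duplicate (via the set) and sort.
    return sorted(found)

text = 'Thomas Passin wrote to Thomas Passin'
print(find_in_chunks('Thomas Passin', text, chunk_size=5))
```

Even with a chunk size far smaller than the key, both occurrences are found because every chunk is extended by the overlap.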

I have often wondered how Google and other such services are able to find millions of things in hardly any time and arguably never show most of them as who looks past a few pages/screens?

I think much of that may involve other techniques including quite a bit of pre-indexing. But they also seem to enlist lots of processors that each do the search on a subset of the problem space and combine and prioritize.

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On Behalf Of Thomas Passin
Sent: Tuesday, February 28, 2023 1:31 PM
To: python-list@python.org
Subject: Re: How to escape strings for re.finditer?

On 2/28/2023 1:07 PM, Jen Kris wrote:

Using str.startswith is a cool idea in this case. But is it better
than regex for performance or reliability? Regex syntax is not a
model of simplicity, but in my simple case it's not too difficult.

The trouble is that we don't know what your case really is. If you are talking about a short pattern like your example and a small text to search, and you don't need to do it too often, then my little code example is probably ideal. Reliability wouldn't
be an issue, and performance would not be relevant. If your case is going to be much larger, called many times in a loop, or be much more complicated in some other way, then a regex or some other approach is likely to be much faster.


From avi.e.gross@gmail.com@21:1/5 to All on Tue Feb 28 15:25:05 2023

Jen,

I had no doubt the code you ran was indented properly or it would not work.

I am merely letting you know that somewhere in the process of copying the code or the transition between mailers, my version is messed up. It happens to be easy for me to fix but I sometimes see garbled code I then simply ignore.

At times what may help is to leave blank lines that python ignores but also keeps the line rearrangements minimal.

On to your real question.

In my OPINION, there are many interesting questions that can get in the way of just getting a working solution. Some may be better in some abstract way but except for big projects it often hardly matters.

So regex is one thing or more a cluster of things and a list comp is something completely different. They are both tools you can use and abuse or lose.

The distinction I believe we started with was how to find a fixed string inside another fixed string in as many places as needed and perhaps return offset info. So this can be solved in too many ways using a side of python focused on pure text. As
discussed, solutions can include explicit loops such as “for” and “while” and their syntactic sugar cousin of a list comp. Not mentioned yet are other techniques like a recursive function that finds the first and passes on the rest of the string
to itself to find the rest, or various functional programming techniques that may do sort of hidden loops. YOU DO NOT NEED ALL OF THEM but it can be interesting to learn.
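
Purely as an illustration of two of those alternatives, here is a sketch of a list comprehension and a recursive version. The function names are invented here. Note that the list comp tests every position and so reports overlapping matches, while the recursive one resumes after each match; for the example string both give the same answer.

```python
def find_all_listcomp(key, text):
    # str.startswith accepts a start offset, which avoids slicing the text.
    return [(i, i + len(key))
            for i in range(len(text) - len(key) + 1)
            if text.startswith(key, i)]

def find_all_recursive(key, text, offset=0):
    if not key:
        raise ValueError('key must be non-empty')
    pos = text.find(key, offset)
    if pos == -1:
        return []
    # Record this match, then let the function handle the rest of the string.
    return [(pos, pos + len(key))] + find_all_recursive(key, text, pos + len(key))

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
print(find_all_listcomp('abc_degree + 1', example))
print(find_all_recursive('abc_degree + 1', example))
```

The recursive version would hit Python's recursion limit on a text with very many matches, so treat it as a curiosity more than a recommendation.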

Regex is a completely different universe that is a bit more of MORE. If I ask you for a ride to the grocery store, I might expect you to show up with a car and not a James Bond vehicle that also is a boat, submarine, airplane, and maybe spaceship. Well,
Regex is the latter. And in your case, it is this complexity that meant you had to convert your text so it will not see what it considers commands or hints.

In normal use, put a bit too simply, it wants a carefully crafted pattern to be spelled out and it weaves an often complex algorithm it then sort of compiles that represents the understanding of what you asked for. The simplest pattern is to match
EXACTLY THIS. That is your case.

A more complex pattern may say to match Boston OR Chicago followed by any amount of whitespace then a number of digits between 3 and 5 and then should not be followed by something specific. Oh, and by the way, save selected parts in parentheses to be
accessed as \1 or \2 so I can ask you to do things like match a word followed by itself. It goes on and on.
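
To give a flavor of what such patterns look like in Python's re module, here is a small sketch. The "something specific" that must not follow is assumed here to be a further digit, and the sample strings are made up.

```python
import re

# Boston or Chicago, any amount of whitespace, 3 to 5 digits,
# and then NOT another digit (negative lookahead).
CITY_CODE = re.compile(r'(Boston|Chicago)\s*(\d{3,5})(?!\d)')

# A word followed by whitespace and then the same word again (\1).
DOUBLED_WORD = re.compile(r'\b(\w+)\s+\1\b')

samples = ['Boston 02134', 'Chicago   606', 'Boston 1234567', 'Denver 80202']
results = []
for sample in samples:
    m = CITY_CODE.search(sample)
    results.append(m.groups() if m else None)
print(results)

doubled = DOUBLED_WORD.search('this is is a test')
print(doubled.group(1), doubled.span())
```

The third sample fails because of the lookahead: seven digits in a row never leave a run of 3 to 5 digits that is not followed by another digit.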

Be warned RE is implemented now all over the place including outside the usual UNIX roots and there are somewhat different versions. For your need, it does not matter.

The compiled monstrosity though can be fairly fast and might be a tad hard for you to write by yourself as a bunch of if statements nested that are weirdly matching various patterns with some look ahead or look behind.

What you are being told is that despite this being way more than you asked for, it not only works but is fairly fast when doing the simple thing you asked for. That may be why a text version you are looking for is hard to find.

I am not clear what exactly the rest of your project is about but my guess is your first priority is completing it decently and not to try umpteen methods and compare them. Not today. Of course if the working version is slow and you profile it and find
this part seems to be holding it back, it may be worth examining.

From: Jen Kris <jenkris@tutanota.com>
Sent: Tuesday, February 28, 2023 12:58 PM
To: avi.e.gross@gmail.com
Cc: 'Python List' <python-list@python.org>
Subject: RE: How to escape strings for re.finditer?

The code I sent is correct, and it runs here. Maybe you received it with a carriage return removed, but on my copy after posting, it is correct:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

One question: several people have made suggestions other than regex (not your terser example with regex you showed below). Is there a reason why regex is not preferred to, for example, a list comp? Performance? Reliability?

Feb 27, 2023, 18:16 by avi.e.gross@gmail.com <mailto:avi.e.gross@gmail.com> :

Jen,

Can you see what SOME OF US see as ASCII text? We can help you better if we get code that can be copied and run as-is.

What you sent is not terse. It is wrong. It will not run on any python interpreter because you somehow lost a carriage return and indent.

This is what you sent:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'

find_string = re.escape('abc_degree + 1') for match in re.finditer(find_string, example):

print(match.start(), match.end())

This is code indented properly:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

Of course I am sure you wrote and ran code more like the latter version but somewhere in your copy/paste process, ....

And, just for fun, since there is nothing wrong with your code, this minor change is terser:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'

for match in re.finditer(re.escape('abc_degree + 1'), example):
    print(match.start(), match.end())

4 18
26 40

But note once you use regular expressions, and not in your case, you might match multiple things that are far from the same such as matching two repeated words of any kind in any case including "and and" and "so so" or finding words that have multiple
doubled letter as in the stereotypical bookkeeper. In those cases, you may want even more than offsets but also show the exact text that matched or even show some characters before and/or after for context.
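
A small sketch of that last point, returning the matched text plus some surrounding context along with the offsets. The helper name matches_with_context and its width parameter are hypothetical, and the doubled-word pattern with the inline (?i) flag is just one way to write it.

```python
import re

def matches_with_context(pattern, text, width=10):
    """Return (start, end, matched_text, before, after) for each match."""
    results = []
    for m in re.finditer(pattern, text):
        before = text[max(0, m.start() - width):m.start()]
        after = text[m.end():m.end() + width]
        results.append((m.start(), m.end(), m.group(), before, after))
    return results

# Doubled words in any case: (?i) makes both the word and the \1 back
# reference case-insensitive.
text = 'It was and And so so it goes'
hits = matches_with_context(r'(?i)\b(\w+)\s+\1\b', text, width=4)
for start, end, matched, before, after in hits:
    print(start, end, repr(before), repr(matched), repr(after))
```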

-----Original Message-----

From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org <mailto:python-list-bounces+avi.e.gross=gmail.com@python.org> > On Behalf Of Jen Kris via Python-list

Sent: Monday, February 27, 2023 8:36 PM

To: Cameron Simpson <cs@cskk.id.au <mailto:cs@cskk.id.au> >

Cc: Python List <python-list@python.org <mailto:python-list@python.org> >

Subject: Re: How to escape strings for re.finditer?

I haven't tested it either but it looks like it would work. But for this case I prefer the relative simplicity of:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'

find_string = re.escape('abc_degree + 1') for match in re.finditer(find_string, example):

print(match.start(), match.end())

4 18

26 40

I don't insist on terseness for its own sake, but it's cleaner this way.

Jen

Feb 27, 2023, 16:55 by cs@cskk.id.au <mailto:cs@cskk.id.au> :

On 28Feb2023 01:13, Jen Kris <jenkris@tutanota.com <mailto:jenkris@tutanota.com> > wrote:

I went to the re module because the specified string may appear more than once in the string (in the code I'm writing).

Sure, but writing a `finditer` for plain `str` is pretty easy (untested):

pos = 0
while True:
    found = s.find(substring, pos)
    if found < 0:
        break
    start = found
    end = found + len(substring)
    # ... do whatever with start and end ...
    pos = end

Many people go straight to the `re` module whenever they're looking for strings. It is often cryptic error prone overkill. Just something to keep in mind.
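
The loop sketched above can be wrapped as a generator so it is used much like re.finditer(). This is a minimal sketch; the name str_finditer is made up, and the guard against an empty substring is an addition that is not in the original loop (without it the loop would never advance).

```python
def str_finditer(substring, s):
    """Yield (start, end) for each non-overlapping occurrence of substring in s."""
    if not substring:
        raise ValueError('substring must be non-empty')
    pos = 0
    while True:
        found = s.find(substring, pos)
        if found < 0:
            return
        end = found + len(substring)
        yield found, end
        pos = end  # continue searching after this match

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
for start, end in str_finditer('abc_degree + 1', example):
    print(start, end)
```

No escaping is needed here because str.find() treats the '+' and the spaces as ordinary characters.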

Cheers,

Cameron Simpson <cs@cskk.id.au <mailto:cs@cskk.id.au> >


From Thomas Passin@21:1/5 to David Raymond on Tue Feb 28 16:26:04 2023

On 2/28/2023 2:40 PM, David Raymond wrote:

With a slight tweak to the simple loop code using .find() it becomes a third faster than the RE version though.

def using_simple_loop2(key, text):
    matches = []
    keyLen = len(key)
    start = 0
    while (foundSpot := text.find(key, start)) > -1:
        start = foundSpot + keyLen
        matches.append((foundSpot, start))
    return matches

using_simple_loop: [0.1732664997689426, 0.1601669997908175, 0.15792609984055161, 0.1573973000049591, 0.15759290009737015]
using_re_finditer: [0.003412699792534113, 0.0032823001965880394, 0.0033694999292492867, 0.003354900050908327, 0.0033336998894810677]
using_simple_loop2: [0.00256159994751215, 0.0025471001863479614, 0.0025424999184906483, 0.0025831996463239193, 0.0025555999018251896]

On my system the difference is way bigger than that:

KEY = '''it doesn't matter, but in other cases it will.'''

using_simple_loop2: [0.0004955999902449548, 0.0004844000213779509, 0.0004862999776378274, 0.0004800999886356294, 0.0004792999825440347]

using_re_finditer: [0.002840900036972016, 0.0028330000350251794, 0.002701299963518977, 0.0028105000383220613, 0.0029977999511174858]

Shorter keys show the least differential:

KEY = 'in'

using_simple_loop2: [0.001983499969355762, 0.0019614999764598906, 0.0019617999787442386, 0.002027600014116615, 0.0020669000223279]

using_re_finditer: [0.002787900040857494, 0.0027620999608188868, 0.0027723999810405076, 0.002776700013782829, 0.002946800028439611]

Brilliant!

Python 3.10.9
Windows 10 AMD64 (build 10.0.19044) SP0


From avi.e.gross@gmail.com@21:1/5 to All on Tue Feb 28 16:16:54 2023

David,

Your results suggest we need to be reminded that lots depends on other
factors. There are multiple versions/implementations of python out there including some written in C but also other underpinnings. Each can often
have sections of pure python code replaced carefully with libraries of
compiled code, or not. So your results will vary.

Just as an example, assume you derive a type of your own as a subclass of
str and you over-ride the find method by writing it in pure python using
loops and maybe add a few bells and whistles. If you used your improved algorithm using this variant of str, might it not be quite a bit slower? Imagine how much slower if your improvement also implemented caching and logging and the option of ignoring case which are not really needed here.

This type of thing can happen in many other scenarios and some module may be shared that is slow and a while later is updated but not everyone installs
the update so performance stats can vary wildly.

Some people advocate using some functional programming tactics, in various languages, partially because the more general loops are SLOW. But that is largely because some of the functional stuff is a compiled function that
hides the loops inside a faster environment than the interpreter.


From Cameron Simpson@21:1/5 to Jen Kris on Wed Mar 1 08:58:59 2023

On 28Feb2023 18:57, Jen Kris <jenkris@tutanota.com> wrote:

One question: several people have made suggestions other than regex
(not your terser example with regex you showed below). Is there a
reason why regex is not preferred to, for example, a list comp?

These are different things; I'm not sure a comparison is meaningful.

Performance? Reliability?

Regexps are:
- cryptic and error prone (you can make them more readable, but the
notation is deliberately both terse and powerful, which means that
small changes can have large effects in behaviour); the "error prone"
part does not mean that a regexp is unreliable, but that writing one
which is _correct_ for your task can be difficult, and also difficult
to debug
- have a compile step, which slows things down
- can be slower to execute as well, as a regexp does a bunch of
housekeeping for you
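
On the compile step: if a regexp is used anyway, one common way to pay that cost only once is to compile the pattern up front and reuse it. A minimal sketch, with KEY_RE and find_spans as illustrative names only (the re module also keeps its own cache of recently compiled patterns, so the saving may be small):

```python
import re

# Compiled once at import time, reused for every search.
KEY_RE = re.compile(re.escape('abc_degree + 1'))

def find_spans(text):
    """Return (start, end) for every match of the precompiled pattern."""
    return [m.span() for m in KEY_RE.finditer(text)]

spans = find_spans('X - abc_degree + 1 + qq + abc_degree + 1')
print(spans)
```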

The more complex the tool the more... indirection between your solution
using that tool and the smallest thing which needs to be done, and often
the slower the solution. This isn't absolute; there are times for the
complex tool.

Common opinion here is often that if you're doing simple fixed-string
things such as your task, which was finding instances of a fixed string,
just use the existing str methods. You'll end up writing what you need
directly and overtly.

I've a personal maxim that one should use the "smallest" tool which
succinctly solves the problem. I usually use it to choose a programming language (eg sed vs awk vs shell vs python in loose order of problem difficulty), but it applies also to choosing tools within a language.

Cheers,
Cameron Simpson <cs@cskk.id.au>


From Peter J. Holzer@21:1/5 to avi.e.gross@gmail.com on Wed Mar 1 01:01:42 2023

On 2023-02-28 15:25:05 -0500, avi.e.gross@gmail.com wrote:

Jen,

I had no doubt the code you ran was indented properly or it would not work.

I am merely letting you know that somewhere in the process of copying
the code or the transition between mailers, my version is messed up.

The problem seems to be at your end. Jen's code looks ok here.

The content type is text/plain, no format=flowed or anything which would
affect the interpretation of line endings. However, after
base64-decoding it only contains unix-style LF line endings, not CRLF
line endings. That might throw your mailer off, but I have no idea why
it would join only some lines but not others.

It happens to be easy for me to fix but I sometimes see garbled code I
then simply ignore.

Truth be told, that's one reason why I rarely read your mails to the
end. The long lines and the triple-spaced paragraphs make it just too
uncomfortable.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"


From Peter J. Holzer@21:1/5 to Peter J. Holzer on Wed Mar 1 01:25:58 2023

On 2023-03-01 01:01:42 +0100, Peter J. Holzer wrote:

On 2023-02-28 15:25:05 -0500, avi.e.gross@gmail.com wrote:

It happens to be easy for me to fix but I sometimes see garbled code I
then simply ignore.

Truth be told, that's one reason why I rarely read your mails to the
end. The long lines and the triple-spaced paragraphs make it just too
uncomfortable.

Hmm, since I was now paying a bit more attention to formatting problems
I saw that only about half of your messages have those long lines
although all seem to be sent with the same mailer. Don't know what's
going on there.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"


From Weatherby,Gerard@21:1/5 to All on Wed Mar 1 01:01:48 2023

Regex is fine if it works for you. The critiques - "difficult to read" - are subjective. Unless the code is in a section that has been profiled to be a bottleneck, I don't sweat performance at this level.

For me, using code that has already been written and vetted is the preferred approach to writing new code I have to test and maintain. I use an online regex tester, https://pythex.org, to get the syntax right before copying and pasting it into my code.

From: Python-list <python-list-bounces+gweatherby=uchc.edu@python.org> on behalf of Jen Kris via Python-list <python-list@python.org>
Date: Tuesday, February 28, 2023 at 1:11 PM
To: Thomas Passin <list1@tompassin.net>
Cc: python-list@python.org <python-list@python.org>
Subject: Re: How to escape strings for re.finditer?
*** Attention: This is an external email. Use caution responding, opening attachments or clicking on links. ***

Using str.startswith is a cool idea in this case. But is it better than regex for performance or reliability? Regex syntax is not a model of simplicity, but in my simple case it's not too difficult.


From avi.e.gross@gmail.com@21:1/5 to Peter J. Holzer on Tue Feb 28 21:19:05 2023

Peter,

Nobody here would appreciate it if I tested it by sending out multiple
copies of each email to see if the same message wraps differently.

I am using a fairly standard mailer in Outlook that interfaces with gmail
and I could try mailing directly from gmail but apparently there are
systemic problems and I experience other complaints when sending directly
from AOL mail too.

So, if some people don't read me, I can live with that. I mean the right people, LOL!

Or did I get that wrong?

I do appreciate the feedback. Ironically, when I politely shared how someone else's email was displaying on my screen, it seems I am equally causing
similar issues for others.

An interesting question is whether any of us reading the archived copies see different things including with various browsers:

https://mail.python.org/pipermail/python-list/

I am not sure which letters from me had the anomalies you mention but spot-checking a few of them showed a normal display when I use Chrome.

But none of this is really a python issue except insofar as you never know
what functionality in the network was written in python.


From Grant Edwards@21:1/5 to Cameron Simpson on Wed Mar 1 09:04:08 2023

On 2023-02-28, Cameron Simpson <cs@cskk.id.au> wrote:

Regexps are:
- cryptic and error prone (you can make them more readable, but the
notation is deliberately both terse and powerful, which means that
small changes can have large effects in behaviour); the "error prone"
part does not mean that a regexp is unreliable, but that writing one
which is _correct_ for your task can be difficult,

The nasty thing is that writing one that _appears_ to be correct for
your task is often fairly easy. It will work as you expect for the
test cases you throw at it, but then fail in confusing ways when
released into the "real world". If you're lucky, it fails frequently
and obviously enough that you notice it right away. If you're not
lucky, it will fail infrequently and subtly for many years to come.

My rule: never use an RE if you can use the normal string methods
(even if it takes a few lines of code using them to replace a single
RE).
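
The question that started this thread is a small instance of this. A sketch, using the strings from the original post; the two-space test string is added here only to show what the unescaped pattern really means.

```python
import re

text = 'X - cty_degrees + 1 + qq'
unescaped = 'cty_degrees + 1'

# As a regex, ' +' means "one or more spaces", not a literal plus sign,
# so the pattern silently fails on the very text it was copied from ...
no_match = re.search(unescaped, text)

# ... yet happily matches a string that contains no plus sign at all.
surprise = re.search(unescaped, 'cty_degrees  1')

# Escaping the pattern (and only the pattern) gives the intended result.
intended = re.search(re.escape(unescaped), text)

print(no_match, surprise.group(), intended.span())
```

Nothing raises an error in the unescaped case, which is exactly what makes this kind of mistake easy to miss.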

--
Grant


From Thomas Passin@21:1/5 to Grant Edwards on Wed Mar 1 12:48:27 2023

On 3/1/2023 12:04 PM, Grant Edwards wrote:

On 2023-02-28, Cameron Simpson <cs@cskk.id.au> wrote:

Regexps are:
- cryptic and error prone (you can make them more readable, but the
notation is deliberately both terse and powerful, which means that
small changes can have large effects in behaviour); the "error prone"
part does not mean that a regexp is unreliable, but that writing one
which is _correct_ for your task can be difficult,

The nasty thing is that writing one that _appears_ to be correct for
your task is often fairly easy. It will work as you expect for the
test cases you throw at it, but then fail in confusing ways when
released into the "real world". If you're lucky, it fails frequently
and obviously enough that you notice it right away. If you're not
lucky, it will fail infrequently and subtly for many years to come.

My rule: never use an RE if you can use the normal string methods
(even if it takes a few lines of code using them to replace a single
RE).

A corollary is that once you get a working regex, don't mess with it if
you do not absolutely have to.

From Peter J. Holzer@21:1/5 to Peter J. Holzer on Thu Mar 2 21:08:35 2023

On 2023-03-01 01:01:42 +0100, Peter J. Holzer wrote:

On 2023-02-28 15:25:05 -0500, avi.e.gross@gmail.com wrote:

I had no doubt the code you ran was indented properly or it would not work.

I am merely letting you know that somewhere in the process of copying
the code or the transition between mailers, my version is messed up.

The problem seems to be at your end. Jen's code looks ok here.

[...]

I have no idea why it would join only some lines but not others.

Actually I do have an idea now, since I noticed something similar at
work today: Outlook has an option "remove additional line breaks from
text-only messages" (translated from German) in the "Email / Message
Format" section. You want to make sure this is off if you are reading
mails where line breaks might be important[1].

hp

[1] Personally I'd say you shouldn't use Outlook if you are reading
mails where line breaks (or other formatting) are important, but ...

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

From avi.e.gross@gmail.com@21:1/5 to Peter J. Holzer on Thu Mar 2 18:08:36 2023

Thanks, Peter. Excellent advice, even if only for any of us using Microsoft Outlook as our mailer. I made the changes and we will see but they should mainly impact what I see. I did tweak another parameter.

The problem for me was finding where they hid the options menu I needed.
Then, I started translating the menus back into German until I realized I
was being silly! Good practice though. LOL!

The truth is I generally can handle receiving mangled code as most of the
time I can re-edit it into shape, or am just reading it and not copying/pasting.

What concerns me is being able to send out the pure text content many seem
to need in a way that does not introduce the anomalies people see. Something
like a lowest common denominator.

Or I could switch mailers. But my guess is that reading and responding from
the native Gmail editor may also need option changes and yet still affect
some readers.

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On Behalf Of Peter J. Holzer
Sent: Thursday, March 2, 2023 3:09 PM
To: python-list@python.org
Subject: Re: How to escape strings for re.finditer?

On 2023-03-01 01:01:42 +0100, Peter J. Holzer wrote:

On 2023-02-28 15:25:05 -0500, avi.e.gross@gmail.com wrote:

I had no doubt the code you ran was indented properly or it would not work.

I am merely letting you know that somewhere in the process of copying
the code or the transition between mailers, my version is messed up.

The problem seems to be at your end. Jen's code looks ok here.

[...]

I have no idea why it would join only some lines but not others.

Actually I do have an idea now, since I noticed something similar at
work today: Outlook has an option "remove additional line breaks from
text-only messages" (translated from German) in the "Email / Message
Format" section. You want to make sure this is off if you are reading
mails where line breaks might be important[1].

hp

[1] Personally I'd say you shouldn't use Outlook if you are reading mails
where line breaks (or other formatting) are important, but ...

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

From Grant Edwards@21:1/5 to Peter J. Holzer on Thu Mar 2 15:19:28 2023

On 2023-03-02, Peter J. Holzer <hjp-python@hjp.at> wrote:

[1] Personally I'd say you shouldn't use Outlook if you are reading
mails where line breaks (or other formatting) are important, but ...

I'd shorten that to

"You shouldn't use Outlook if mail is important."
