I need fast text-search on a large (not huge, let's say 30k records
in total) list of items. Here's a sample of my raw data (a list of US
cars: model and make)
On 3/4/2023 10:43 PM, Dino wrote:
I need fast text-search on a large (not huge, let's say 30k records
in total) list of items. Here's a sample of my raw data (a list of US
cars: model and make)
I suspect I am really close to answering my own question...
import time
lis = [str(a**2 + a*3 + a) for a in range(0, 30000)]

s = time.process_time_ns(); res = [el for el in lis if "13467" in el]; print(time.process_time_ns() - s)
753800
s = time.process_time_ns(); res = [el for el in lis if "52356" in el]; print(time.process_time_ns() - s)
1068300
s = time.process_time_ns(); res = [el for el in lis if "5256" in el]; print(time.process_time_ns() - s)
862000
s = time.process_time_ns(); res = [el for el in lis if "6" in el]; print(time.process_time_ns() - s)
1447300
s = time.process_time_ns(); res = [el for el in lis if "1" in el]; print(time.process_time_ns() - s)
1511100
s = time.process_time_ns(); res = [el for el in lis if "13467" in el]; print(time.process_time_ns() - s); print(len(res), res[:10])
926900
2 ['134676021', '313467021']
I can do a substring search in a list of 30k elements in less than 2ms
with Python. Is my reasoning sound?
Dino, sending lots of data to an archived forum is not a great idea. I snipped most of it out below so as not to replicate it.
Your question does not look difficult unless your real question is about speed. Realistically, much of the time is generally spent reading in the file, and the actual search can be quite rapid with a wide range of methods.
The data looks boring enough and seems to not have much structure other than one comma possibly separating two fields. Do you want the data as one wide field, or perhaps in two parts, which is what a CSV file is normally used to represent? Do you ever have questions like "tell me all cars whose name begins with the letter D and that have a V6 engine"? If so, you may want more than a vanilla search.
What exactly do you want to search for? Is it a set of built-in searches or something the user types in?
The data seems to be sorted by the first field and then by the second, and I did not check whether some searches might be ambiguous. Can there be many entries containing III? Yep. Can the same words, like Cruiser or Hybrid, appear more than once?
So is this a one-time search, or multiple searches once loaded, as in a service that stays resident and fields requests? The latter may be worth speeding up.
I don't NEED to know any of this, but I want you to know that the answer may depend on this and similar factors. We had a long discussion lately on whether to search using regular expressions or string methods. If your data is meant to be used once, you may not even need to read the file into memory; you could read something like a line at a time and test it. Or, if you end up with more data, like how many cylinders a car has, it may be time to read it not just into a list of lines, but to get numpy/pandas involved and use their many search methods on something like a DataFrame.
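For illustration, a minimal pandas sketch of that idea (the column names are my invention; the file name comes from the grep example below):

import pandas as pd

# Assumes a two-column CSV with no header row: make,model
df = pd.read_csv("all_cars_unique.csv", names=["make", "model"])

# Case-insensitive substring match on the model column
hits = df[df["model"].str.contains("v60", case=False, na=False)]
print(hits)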
Of course if you are worried about portability, keep using Get Regular Expression Print.
Your example was:
$ grep -i v60 all_cars_unique.csv
Genesis,GV60
Volvo,V60
You seem to have wanted case folding and that is NOT a normal search. And your search is matching anything on any line. If you wanted only a complete field, such as all text after a comma to the end of the line, you could use grep specifications to say that.
But once inside Python, you would need to make choices depending on what kinds of searches you want to allow, but also things like: do you want all matching lines shown if you search for, say, "a"?
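As a rough sketch of that field-restricted, case-folded matching in plain Python (assuming one make,model pair per line):

def match_model(lines, query):
    """Match only the model field (the text after the comma), case-insensitively."""
    q = query.casefold()
    hits = []
    for line in lines:
        make, _, model = line.rstrip("\n").partition(",")
        if q in model.casefold():
            hits.append((make, model))
    return hits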
I would probably ingest the data at startup into a dictionary - or
perhaps several depending on your access patterns - and then you will
only need to do a fast lookup in one or more dictionaries.
If your access pattern would be easier with SQL queries, load the data
into an SQLite database on startup.
IOW, do the bulk of the work once at startup.
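A minimal sketch of that startup load into SQLite, assuming the two-column make,model CSV from the grep example above (table and column names are my own):

import csv
import sqlite3

con = sqlite3.connect(":memory:")  # in-memory DB, rebuilt at every startup
con.execute("CREATE TABLE cars (make TEXT, model TEXT)")
with open("all_cars_unique.csv", newline="") as f:
    con.executemany("INSERT INTO cars VALUES (?, ?)", csv.reader(f))
con.commit()

# LIKE is case-insensitive for ASCII in SQLite, so this finds V60 and GV60
rows = con.execute("SELECT make, model FROM cars WHERE model LIKE ?",
                   ("%v60%",)).fetchall()
print(rows)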
I just did a similar test with your actual data and got
about the same result. If that's fast enough for you,
then you don't need to do anything fancy.
The idea that someone types into an input field and matches start
dancing in the browser made me think that this was exactly what I
needed, and hence I figured that asking here about Whoosh would be a
good idea. I now realize that Whoosh would be overkill for my use-case,
as a simple (case-insensitive) substring query would get me 90% of what
I want. Speed is on the order of a few milliseconds out of the box,
which is chump change in the context of a web UI.
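Something like this sketch, doing the case folding once at startup so each keystroke only pays for the scan (file name again from the grep example):

with open("all_cars_unique.csv") as f:
    rows = [line.rstrip("\n") for line in f]
folded = [r.casefold() for r in rows]  # built once at startup

def search(query):
    q = query.casefold()
    return [r for r, fr in zip(rows, folded) if q in fr]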
Not sure if this is what Thomas meant, but I was also thinking dictionaries.
Dino could build a set of dictionaries with keys “a” through “z” that contain the entries with those letters in them (I'm assuming case-insensitive search), and then just search “v” if that's what the user starts with.
Increased performance may be achieved by building dictionaries “aa”, “ab” ... “zz”, and so on.
Of course, it's trading CPU for memory usage, and there's likely a point at which the cost of building dictionaries exceeds the savings in searching.
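A rough sketch of that bucketing idea (my own reading of it, with stand-in data): map each character to the rows containing it, then scan only the bucket for the query's first character:

from collections import defaultdict

rows = ["Volvo,V60", "Genesis,GV60", "Acura,CL"]  # stand-in data
buckets = defaultdict(list)
for row in rows:
    for ch in set(row.casefold()):
        buckets[ch].append(row)

def search(query):
    q = query.casefold()
    # Only rows containing the first character can contain the whole query
    return [row for row in buckets.get(q[0], []) if q in row.casefold()]

print(search("v6"))  # ['Volvo,V60', 'Genesis,GV60']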
On 3/5/2023 9:05 PM, Thomas Passin wrote:
I would probably ingest the data at startup into a dictionary - or
perhaps several depending on your access patterns - and then you will
only need to do a fast lookup in one or more dictionaries.
If your access pattern would be easier with SQL queries, load the data
into an SQLite database on startup.
Thank you. SQLite would be overkill here, plus all the machinery that I
would need to set up to make sure that the DB is rebuilt/updated regularly. Do you happen to know something about Whoosh? Have you ever used it?
IOW, do the bulk of the work once at startup.
Sound advice
Thank you
Thomas,
I may have missed any discussion where the OP explained more about the proposed usage. If the program is designed to load the full data once, never get updates except by re-reading some file, and then handle multiple requests, then some things may be worth doing.
It looked to me, and I may well be wrong, like he wanted to search for a string anywhere in the text, so a grep-like solution is a reasonable start, with the actual data stored as something like a list of character strings you can search "one line" at a time. I suspect a numpy variant may work faster.
But using lots of dictionaries strikes me as only helping if you are searching for text anchored to the start of a line, so if you ask for "Honda" you instead ask the dictionary called "h" and search perhaps just for "onda", then recombine the prefix in any results. But the example given wanted to match something like "V6" in the middle of the text, and I do not see how that would work, as you would now need to search all 26 dictionaries completely.
And of course any search function he builds can be made to remember some or all previous searches using a cache decorator. That generally uses a dictionary for the search keys internally.
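For instance, a minimal sketch of that caching idea with functools.lru_cache (the search body is just a stand-in):

from functools import lru_cache

lines = ["Volvo,V60", "Genesis,GV60"]  # stand-in data

@lru_cache(maxsize=1024)
def search(query):
    # lru_cache keys on the query string; repeated queries skip the scan
    q = query.casefold()
    return tuple(line for line in lines if q in line.casefold())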
models = {'v60': 'Volvo', 'GV60': 'Genesis', 'cl': 'Acura'}
entry = '60'
candidates = (m for m in models.keys() if entry in m)
list(candidates)
['v60', 'GV60']
-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On Behalf Of Thomas Passin
Sent: Monday, March 6, 2023 11:03 AM
To: python-list@python.org
Subject: Re: Fast full-text searching in Python (job for Whoosh?)
On 3/6/2023 10:32 AM, Weatherby,Gerard wrote:
Not sure if this is what Thomas meant, but I was also thinking dictionaries.
Dino could build a set of dictionaries with keys “a” through “z” that contain the entries with those letters in them (I'm assuming case-insensitive search), and then just search “v” if that's what the user starts with.
Increased performance may be achieved by building dictionaries “aa”, “ab” ... “zz”, and so on.
Of course, it's trading CPU for memory usage, and there's likely a point at which the cost of building dictionaries exceeds the savings in searching.
Chances are it would only be seconds at most to build the data cache,
and then subsequent queries would respond very quickly.
If mailing space is a consideration, we could all help by keeping our replies short and to the point.
I need fast text-search on a large (not huge, let's say 30k records
in total) list of items. Here's a sample of my raw data (a list of US
cars: model and make)
One issue that was also correctly foreseen by some is that there's going
to be a new request at every user keystroke. Known problem. JavaScript
programmers use a trick called "debouncing" to be reasonably sure that
the user is done typing before a request is issued:
https://schier.co/blog/wait-for-user-to-stop-typing-using-javascript
It must be nice to have a server or two...
On Mon, 6 Mar 2023 21:55:37 -0500, Dino wrote:
https://schier.co/blog/wait-for-user-to-stop-typing-using-javascript
That could be annoying. My use case is address entry. When the user types
102 ma
the suggestions might be
main
manson
maple
massachusetts
masten
in a simple case. When they enter 's' it's narrowed down. Typically I'm
only dealing with a city or county so the data to be searched isn't huge.
The maps.google.com address search covers the world and they're also
throwing in a geographical constraint so the suggestions are applicable to the area you're viewing.
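A sketch of that prefix narrowing over a sorted street list, using bisect so each keystroke costs a binary search instead of a full scan (street names are stand-ins):

import bisect

streets = sorted(["main", "manson", "maple", "massachusetts", "masten"])

def suggest(prefix):
    # Strings sharing a prefix form a contiguous run in a sorted list
    lo = bisect.bisect_left(streets, prefix)
    hi = bisect.bisect_right(streets, prefix + "\uffff")
    return streets[lo:hi]

print(suggest("ma"))   # all five
print(suggest("mas"))  # ['massachusetts', 'masten']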
How can I implement this? A library called Whoosh seems very promising
(albeit it's so feature-rich that it's almost like shooting a fly with
a bazooka in my case), but I see two problems:
1) Whoosh is either abandoned or the project is a mess in terms of
community and support
(https://groups.google.com/g/whoosh/c/QM_P8cGi4v4), and
2) Whoosh seems to be a Python-only thing, which is great for now,
but I wouldn't want this to become an obstacle should I need to port it to
a different language at some point.
It must be nice to have a server or two...
No kidding
About everything else you wrote, it makes a ton of sense; in fact it's a
dilemma I am facing now. My back-end returns 10 entries (I am limiting
to max 10 matches server side for reasons you can imagine).
As the user keeps typing, should I restrict the existing result set
based on the new information or re-issue an API call to the server?
Things get confusing pretty fast for the user. You don't want too many
cooks in the kitchen, I guess.
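The narrowing option can be sketched like this (a hypothetical helper; note it assumes the previous result set was NOT truncated, which the 10-entry cap above would break):

def narrow(prev_query, prev_results, new_query, full_data):
    # If the new query merely extends the old one, every new match is
    # already among the old matches, so filter the smaller set.
    if prev_query and prev_query in new_query:
        pool = prev_results
    else:
        pool = full_data
    q = new_query.casefold()
    return [row for row in pool if q in row.casefold()]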
Played a little bit with both approaches in my little application. Re-requesting from the server seems to win hands down in my case.
I am sure that them google engineers reached spectacular levels of UI
finesse with stuff like this.
Some of the discussions here leave me confused, as the info we think we got early does not stay intact for long and often morphs into something else, and we find much of the discussion is misdirected or wasted.
But I'll note that I use whoosh from time to time and I find it stable
and pleasant to work with. It's true that development stopped, but it
stopped in a very stable place. I don't recommend using whoosh here, but
I would recommend experimenting with it more generally.
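For anyone curious, a minimal Whoosh sketch (directory, schema, and data are made up for the example):

import os
from whoosh.fields import Schema, TEXT
from whoosh.index import create_in
from whoosh.qparser import QueryParser

os.makedirs("indexdir", exist_ok=True)
schema = Schema(make=TEXT(stored=True), model=TEXT(stored=True))
ix = create_in("indexdir", schema)

writer = ix.writer()
writer.add_document(make="Volvo", model="V60")
writer.add_document(make="Genesis", model="GV60")
writer.commit()

with ix.searcher() as searcher:
    query = QueryParser("model", ix.schema).parse("v60")
    for hit in searcher.search(query):
        print(hit["make"], hit["model"])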
On 3/7/2023 7:33 AM, Dino wrote:
in fact it's a dilemma I am facing now. My back-end returns 10
entries (I am limiting to max 10 matches server side for reasons you
can imagine). As the user keeps typing, should I restrict the
existing result set based on the new information or re-issue an API
call to the server? Things get confusing pretty fast for the user.
You don't want too many cooks in the kitchen, I guess.
Played a little bit with both approaches in my little application. Re-requesting from the server seems to win hands down in my case.
I am sure that them google engineers reached spectacular levels of UI finesse with stuff like this.
Subject of course to trying this out, I would be inclined to send a much
larger list of responses to the client, and let the client reduce the
number to be displayed. The latency for sending a longer list will be
smaller than establishing a new connection, or even reusing an old one,
to send a new, short list of responses.
On 2023-03-08 00:12:04 -0500, Thomas Passin wrote:
On 3/7/2023 7:33 AM, Dino wrote:
in fact it's a dilemma I am facing now. My back-end returns 10
entries (I am limiting to max 10 matches server side for reasons you
can imagine). As the user keeps typing, should I restrict the
existing result set based on the new information or re-issue an API
call to the server? Things get confusing pretty fast for the user.
You don't want too many cooks in the kitchen, I guess.
Played a little bit with both approaches in my little application.
Re-requesting from the server seems to win hands down in my case.
I am sure that them google engineers reached spectacular levels of UI
finesse with stuff like this.
Subject of course to trying this out, I would be inclined to send a much
larger list of responses to the client, and let the client reduce the
number to be displayed. The latency for sending a longer list will be
smaller than establishing a new connection, or even reusing an old one,
to send a new, short list of responses.
That depends very much on how long that list can become. If it's 200
matches - sure, send them all, even if the client will display only 10
of them. Probably even for 2000. But if you might get 20 million matches
you surely don't want to send them all to the client.
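Capping the match count server-side is cheap either way; a sketch with itertools.islice so the scan stops as soon as the cap is hit (the cap value is arbitrary):

from itertools import islice

def search_capped(lines, query, cap=2000):
    q = query.casefold()
    matches = (line for line in lines if q in line.casefold())
    return list(islice(matches, cap))  # stops scanning after `cap` matches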