- How would you structure the caching so that different caching
strategies are "pluggable"? change one line of code (or even a config
file) and a different caching strategy is used in the next run. Is this
the job for a design pattern such as factory or facade?
First off, a big shout out to Peter J. Holzer, who mentioned roaring
bitmaps a few days ago and led me to quite a discovery.
On 11/02/2023 00:39, Dino wrote:
First off, a big shout out to Peter J. Holzer, who mentioned roaringI was intrigued to hear about roaring bitmaps and discover they really
bitmaps a few days ago and led me to quite a discovery.
were a thing (not a typo as I suspected at first).
What next, I wonder?
argumentative arrays
chattering classes (on second thoughts, we have those already)
dancing dictionaries
garrulous generators
laughing lists
piping pipelines
singing strings
speaking sets
stuttering sorts
talking tuples
whistling walruses?
The future awaits [pun not intended] ...
On 2023-02-14 22:20, Rob Cliffe via Python-list wrote:
On 11/02/2023 00:39, Dino wrote:
First off, a big shout out to Peter J. Holzer, who mentioned roaringI was intrigued to hear about roaring bitmaps and discover they really
bitmaps a few days ago and led me to quite a discovery.
were a thing (not a typo as I suspected at first).
What next, I wonder?
argumentative arrays
chattering classes (on second thoughts, we have those already)
dancing dictionaries
garrulous generators
laughing lists
piping pipelines
singing strings
speaking sets
stuttering sorts
talking tuples
whistling walruses?
babbling bytestrings?
On 11/02/2023 00:39, Dino wrote:
First off, a big shout out to Peter J. Holzer, who mentioned roaringI was intrigued to hear about roaring bitmaps and discover they really
bitmaps a few days ago and led me to quite a discovery.
were a thing (not a typo as I suspected at first).
What next, I wonder?
argumentative arrays
chattering classes (on second thoughts, we have those already)
dancing dictionaries
garrulous generators
laughing lists
piping pipelines
singing strings
speaking sets
stuttering sorts
talking tuples
whistling walruses?
The future awaits [pun not intended] ...
Roaring bitmaps claim to be an improvement not only over uncompressed structures but some other compressed versions but my reading shows it
may be limited to some uses. Bitsets in general seem to be useful only
for a largely contiguous set of integers where each sequential bit
represents whether the nth integer above the lowest is in the set or
not.
Of course, properly set up, this makes Unions and Intersections
and some other operations fairly efficient. But sets are not the same
as dictionaries and often you are storing other data types than
smaller integers.
Many normal compression techniques can require lots of time to
uncompress to find anything. My impression is that Roaring Bitmaps is
a tad like the pkzip software that tries various compression
techniques on each file and chooses whatever seems to work better on
each one. That takes extra time when zipping, but when unzipping a
file, it goes directly to the method used to compress it as the type
is in a header and just expands it one way.
My impression is that Roaring bitmaps breaks up the range of integers
into smaller chunks and depending on what is being stored in that
chunk, may leave it as an uncompressed bitmap, or a list of the sparse contents, or other storage methods and can search each version fairly quickly.
So, I have no doubt it works great for some applications such as
treating social security numbers as integers.
It likely would be overkill to store something like the components of
an IP address between 0 and 255 inclusive.
But having said that, there may well be non-integer data that can be
mapped into and out of integers. As an example, airports or radio
stations have names like LAX or WPIX. If you limit yourself to ASCII
letters then every one of them can be stored as a 32-bit integer,
perhaps with some padding.
Roaring bitmaps claim to be an improvement not only over uncompressed structures but some other compressed versions but my reading shows it
may be limited to some uses. Bitsets in general seem to be useful only
for a largely contiguous set of integers where each sequential bit
represents whether the nth integer above the lowest is in the set or
not.
Of course, properly set up, this makes Unions and Intersections and
some other operations fairly efficient. But sets are not the same as dictionaries and often you are storing other data types than smaller integers.
Many normal compression techniques can require lots of time to
uncompress to find anything. My impression is that Roaring Bitmaps is
a tad like the pkzip software that tries various compression
techniques on each file and chooses whatever seems to work better on
each one. That takes extra time when zipping, but when unzipping a
file, it goes directly to the method used to compress it as the type
is in a header and just expands it one way.
My impression is that Roaring bitmaps breaks up the range of integers
into smaller chunks and depending on what is being stored in that
chunk, may leave it as an uncompressed bitmap, or a list of the sparse contents, or other storage methods and can search each version fairly quickly.
So, I have no doubt it works great for some applications such as
treating social security numbers as integers.
It likely would be overkill to store something like the components of
an IP address between 0 and 255 inclusive.
But having said that, there may well be non-integer data that can be
mapped into and out of integers. As an example, airports or radio
stations have names like LAX or WPIX. If you limit yourself to ASCII
letters then every one of them can be stored as a 32-bit integer,
perhaps with some padding.
Analogies I am sharing are mainly for me to wrap my head around an idea by seeing if it matches any existing ideas or templates and is not meant to be exact. Fair enough?
But in this case, from my reading, the analogy is rather reasonable.
The implementation of Roaring Bitmaps seems to logically break up the
space of all possible values it looks up into multiple "zones" that
are somewhat analogous to individual files,
I did not raise the issue and thus have no interest in promoting this technology nor knocking it down. Just wondering what it was under the hood and whether I might even have a need for it. I am not saying Social Security numbers are a fit, simply that some form of ID number might fit.
If a company has a unique ID number for each client and uses it
consistently, then an implementation that holds a set stored this way
of people using product A, such as house insurance, and those using
product B, such as car insurance, and perhaps product C is an umbrella policy, might easily handle some queries such as who uses two or all
three (intersections of sets) or who might merit a letter telling them
how much they can save if they subscribed to two or all three as a way
to get more business. Again, just a made-up example I can think
about. A company which has a million customers over the years will
have fairly large sets as described.
What is helpful to me in thinking about something will naturally often not
be helpful to you or others but nothing you wrote makes me feel my first
take was in any serious way wrong. It still makes sense to me.
And FYI, the largest integer in signed 32 bits is 2_147_483_647
which is 10 digits. A Social Security number look like xxx-xx-xxxx at
this time which is only 9 digits.
As for my other EXAMPLE, I fail to see why I need to provide a specific need for an application. I don't care what they need it for. The thought was
about whether something that does not start as an integer can be uniquely mapped into and out of integers of size 32 bits.
From what you wrote, the algorithm chosen is fairly simple BUT I have to ask if these bitmaps are static or can be changed at run time? I mean if youhave a region that is sparse and then you keep adding, does the software
Analogies I am sharing are mainly for me to wrap my head around an
idea by seeing if it matches any existing ideas or templates and is
not meant to be exact. Fair enough?
But in this case, from my reading, the analogy is rather reasonable.
The implementation of Roaring Bitmaps seems to logically break up the
space of all possible values it looks up into multiple "zones" that
are somewhat analogous to individual files,
I did not raise the issue and thus have no interest in promoting this technology nor knocking it down. Just wondering what it was under themight fit.
hood and whether I might even have a need for it. I am not saying
Social Security numbers are a fit, simply that some form of ID number
If a company has a unique ID number for each client and uses it
consistently, then an implementation that holds a set stored this way
of people using product A, such as house insurance, and those using
product B, such as car insurance, and perhaps product C is an umbrella policy, might easily handle some queries such as who uses two or all
three (intersections of sets) or who might merit a letter telling them
how much they can save if they subscribed to two or all three as a way
to get more business. Again, just a made-up example I can think
about. A company which has a million customers over the years will
have fairly large sets as described.
What is helpful to me in thinking about something will naturally often
not be helpful to you or others but nothing you wrote makes me feel my
first take was in any serious way wrong. It still makes sense to me.
And FYI, the largest integer in signed 32 bits is 2_147_483_647
which is 10 digits. A Social Security number look like xxx-xx-xxxx at
this time which is only 9 digits.
As for my other EXAMPLE, I fail to see why I need to provide a
specific need for an application. I don't care what they need it for.
The thought was about whether something that does not start as an
integer can be uniquely mapped into and out of integers of size 32 bits.
I do not know the internals of any Roaring Bitmap implementation so all I
did gather was that once the problem is broken into accessing individual things I chose to call zones for want of a more specific name, then each
zone is stored in one of an unknown number of ways depending on some logic.
I do not know the internals of any Roaring Bitmap implementation so all I
did gather was that once the problem is broken into accessing individual things I chose to call zones for want of a more specific name, then each
zone is stored in one of an unknown number of ways depending on some logic. You say an implementation chose two ways and that is fine. But in theory, it may make sense to choose other ways as well and the basic outline of the algorithm remains similar. I can imagine if a region/zone is all the even numbers, then a function that checks if you are searching for an odd number may be cheaper. That is an example, not something I expect to see or that is useful enough. But the concept of storing a function as the determiner for a region is general enough that it can support many more realistic ideas.
From what you wrote, the algorithm chosen is fairly simple BUT I have to ask
if these bitmaps are static or can be changed at run time? I mean if you
have a region that is sparse and then you keep adding, does the software pause and rewrite that region as a bitmap if it is a list of offsets? Or, if a bitmap loses enough, ...
On to your other points. Again, understand I am talking way more abstractly than you and thus it really does not matter what the length of a particular ID in some country is for the main discussion. The assumption is that if you are using something with limits, like a Roaring Bitmap, that you do things within the limits. When I lived in Austria, I did not bother getting an Austrian Sozialversicherungsnummer so I have no idea it is ten digits long. In any case, many things grow over time such as the length of telephone numbers.
The same applies to things like airport codes. They can get longer for many reasons and may well exceed 4 characters, and certainly UNICODE or other
such representations may exceed four bytes now if you allow non-ASCII characters that map into multiple bytes. My point was to think about how useful a Roaring bitmap is if it takes only 32 bit integers and one trivial mapping was to use any four bytes to represent a unique integer. But clearly you could map longer strings easily enough if you restrict yourself to 26 upper case letters and perhaps a few other symbols that can be encoded in 5 bits. I am not saying it is worth the effort, but that means 6 characters
can fit in 32 bits.
I do wonder if the basic idea has to be limited to 32 bits or if it can expand to say 64 or use additional but fast methods of storing the data beyond the two mentioned.
There are variants of other ideas I can think of like sparse arrays or matrices such as you find in the scipy module in Python. If they hold a Boolean value, they sound like they are a similar idea where you simply keep track of the ones marked True, or if it makes sense, the ones considered False.
On 2/18/2023 2:59 PM, avi.e.gross@gmail.com wrote:
I do not know the internals of any Roaring Bitmap implementation so all I did gather was that once the problem is broken into accessing individual things I chose to call zones for want of a more specific name, then each zone is stored in one of an unknown number of ways depending on some logic.
Somewhat tangential, but back quite a while ago there was a C library called IIRC "julia list".
It implemented lists in several different ways, some quite
sophisticated, depending on the size and usage pattern. I remembered
it as soon as I took a look at Roaring Bitmap and saw that the latter
can use different representations depending on size and access
patterns.
I do not know the internals of any Roaring Bitmap implementation so
all I did gather was that once the problem is broken into accessing individual things I chose to call zones for want of a more specific
name, then each zone is stored in one of an unknown number of ways depending on some logic.
Note how this can cause problems with the original idea here of caching strategies. Imagine a function that checks the environment as to what encoding or human language and so on to produce text in. If you cache it so it produces results that arestored in something like a dictionary with a key, and later the user changes the environment as it continues running, the cache may now contain invalid results. You might need to keep track of the environment and empty the cache if things change, or
But Hebrew calls it the holiday of love, sort of (חג האהבה). Portuguese is similar but includes day as well as love (Dia dos
Namorados)
Esperanto tosses in more about sainthood (Sankta Valentín) and in a sense Spanish does both ways with day and saint (Día de San Valentín).
On 2023-02-18 15:59:32 -0500, Thomas Passin wrote:
On 2/18/2023 2:59 PM, avi.e.gross@gmail.com wrote:
I do not know the internals of any Roaring Bitmap implementation so all I >>> did gather was that once the problem is broken into accessing individual >>> things I chose to call zones for want of a more specific name, then each >>> zone is stored in one of an unknown number of ways depending on some logic. >>Somewhat tangential, but back quite a while ago there was a C library called >> IIRC "julia list".
ITYM Judy arrays. They were mentioned here already.
It implemented lists in several different ways, some quite
sophisticated, depending on the size and usage pattern. I remembered
it as soon as I took a look at Roaring Bitmap and saw that the latter
can use different representations depending on size and access
patterns.
Yes, Roaring Bitmaps are somewhat similar. Judy arrays are more
versatile (they support more data types) and in many ways more
sophisticated, despite being 10+ years older. OTOH Roaring Bitmaps are a
lot simpler which may have contributed to their popularity.
hp
Sysop: | Keyop |
---|---|
Location: | Huddersfield, West Yorkshire, UK |
Users: | 300 |
Nodes: | 16 (2 / 14) |
Uptime: | 75:25:07 |
Calls: | 6,716 |
Calls today: | 4 |
Files: | 12,246 |
Messages: | 5,357,399 |