• Comparing caching strategies

    From Dino@21:1/5 to All on Fri Feb 10 19:39:48 2023
    First off, a big shout out to Peter J. Holzer, who mentioned roaring
    bitmaps a few days ago and led me to quite a discovery.

    Now I am stuck with an internal dispute with another software architect
    (well, with a software architect, I should say, as I probably shouldn't
    define myself a software architect when confronted with people with more experience than me in building more complex systems).
    Anyway, now that I know what roaring bitmaps are (and what they can
    do!), my point is that we should abandon other attempts to build a
    caching layer for our project and just veer decidedly towards relying on
    those magic bitmaps and screw anything else. Sure, there is some
    overhead marshaling our entries into integers and back, but the sheer
    speed and compactness of RBMs trump any other consideration (according
    to me, not according to the other guy, obviously).

    Long story short: I want to prototype a couple of caching strategies in
    Python using bitmaps, and measure both performance and speed.

    So, here are a few questions from an inexperienced programmer for you,
    friends. Apologies if they are a bit "open ended".

    - How would you structure the caching so that different caching
    strategies are "pluggable"? change one line of code (or even a config
    file) and a different caching strategy is used in the next run. Is this
    the job for a design pattern such as factory or facade?

    - what tool should I use to measure/log performance and memory
    occupation of my script? Google is coming up with quite a few options,
    but I value the opinion of people here a lot.

    Thank you for any feedback you may be able to provide.

    Dino

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dino@21:1/5 to Dino on Tue Feb 14 17:08:18 2023
    On 2/10/2023 7:39 PM, Dino wrote:

    - How would you structure the caching so that different caching
    strategies are "pluggable"? change one line of code (or even a config
    file) and a different caching strategy is used in the next run. Is this
    the job for a design pattern such as factory or facade?


    turns out that the strategy pattern was the right one for me.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rob Cliffe@21:1/5 to Dino on Tue Feb 14 22:20:12 2023
    On 11/02/2023 00:39, Dino wrote:
    First off, a big shout out to Peter J. Holzer, who mentioned roaring
    bitmaps a few days ago and led me to quite a discovery.

    I was intrigued to hear about roaring bitmaps and discover they really
    were a thing (not a typo as I suspected at first).
    What next, I wonder?
        argumentative arrays
        chattering classes (on second thoughts, we have those already)
        dancing dictionaries
        garrulous generators
        laughing lists
        piping pipelines
        singing strings
        speaking sets
        stuttering sorts
        talking tuples
        whistling walruses?
    The future awaits [pun not intended] ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MRAB@21:1/5 to Rob Cliffe via Python-list on Fri Feb 17 04:23:53 2023
    On 2023-02-14 22:20, Rob Cliffe via Python-list wrote:
    On 11/02/2023 00:39, Dino wrote:
    First off, a big shout out to Peter J. Holzer, who mentioned roaring
    bitmaps a few days ago and led me to quite a discovery.

    I was intrigued to hear about roaring bitmaps and discover they really
    were a thing (not a typo as I suspected at first).
    What next, I wonder?
        argumentative arrays
        chattering classes (on second thoughts, we have those already)
        dancing dictionaries
        garrulous generators
        laughing lists
        piping pipelines
        singing strings
        speaking sets
        stuttering sorts
        talking tuples
        whistling walruses?

    babbling bytestrings?

    The future awaits [pun not intended] ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris Angelico@21:1/5 to MRAB on Fri Feb 17 16:04:46 2023
    On Fri, 17 Feb 2023 at 15:28, MRAB <python@mrabarnett.plus.com> wrote:

    On 2023-02-14 22:20, Rob Cliffe via Python-list wrote:
    On 11/02/2023 00:39, Dino wrote:
    First off, a big shout out to Peter J. Holzer, who mentioned roaring
    bitmaps a few days ago and led me to quite a discovery.

    I was intrigued to hear about roaring bitmaps and discover they really
    were a thing (not a typo as I suspected at first).
    What next, I wonder?
    argumentative arrays
    chattering classes (on second thoughts, we have those already)
    dancing dictionaries
    garrulous generators
    laughing lists
    piping pipelines
    singing strings
    speaking sets
    stuttering sorts
    talking tuples
    whistling walruses?

    babbling bytestrings?


    Excited exceptions. I'm sure there's a particle physics crossover
    waiting for this one.

    ChrisA

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From avi.e.gross@gmail.com@21:1/5 to Rob Cliffe via Python-list on Fri Feb 17 00:07:12 2023
    I am less interested in the choice of names than the pro and con of when these Roaring bitmaps are worth using and when they are not.

    It is a bit like discussing whether various compression techniques are worth using as the storage or memory costs can be weighed against the CPU or transient memory costs of compressing and uncompressing. The value tends to depend on many factors and
    there may even be times you want to store the data in multiple data structures with each optimized to store that kind or amount of data.

    Roaring bitmaps claim to be an improvement not only over uncompressed structures but some other compressed versions but my reading shows it may be limited to some uses. Bitsets in general seem to be useful only for a largely contiguous set of integers
    where each sequential bit represents whether the nth integer above the lowest is in the set or not. Of course, properly set up, this makes Unions and Intersections and some other operations fairly efficient. But sets are not the same as dictionaries and
    often you are storing other data types than smaller integers.

    Many normal compression techniques can require lots of time to uncompress to find anything. My impression is that Roaring Bitmaps is a tad like the pkzip software that tries various compression techniques on each file and chooses whatever seems to work
    better on each one. That takes extra time when zipping, but when unzipping a file, it goes directly to the method used to compress it as the type is in a header and just expands it one way.

    My impression is that Roaring bitmaps breaks up the range of integers into smaller chunks and depending on what is being stored in that chunk, may leave it as an uncompressed bitmap, or a list of the sparse contents, or other storage methods and can
    search each version fairly quickly.

    So, I have no doubt it works great for some applications such as treating social security numbers as integers. It likely would be overkill to store something like the components of an IP address between 0 and 255 inclusive.

    But having said that, there may well be non-integer data that can be mapped into and out of integers. As an example, airports or radio stations have names like LAX or WPIX. If you limit yourself to ASCII letters then every one of them can be stored as a
    32-bit integer, perhaps with some padding. Of course for such fairly simple data, some might choose to place the data in a balanced tree structure and get reasonable search speed.

    I am curious about the size of some of these structures but obviously it depends. Are they stored on disk in this form too?


    -----Original Message-----
    From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On Behalf Of MRAB
    Sent: Thursday, February 16, 2023 11:24 PM
    To: python-list@python.org
    Subject: Re: Comparing caching strategies

    On 2023-02-14 22:20, Rob Cliffe via Python-list wrote:
    On 11/02/2023 00:39, Dino wrote:
    First off, a big shout out to Peter J. Holzer, who mentioned roaring
    bitmaps a few days ago and led me to quite a discovery.

    I was intrigued to hear about roaring bitmaps and discover they really
    were a thing (not a typo as I suspected at first).
    What next, I wonder?
    argumentative arrays
    chattering classes (on second thoughts, we have those already)
    dancing dictionaries
    garrulous generators
    laughing lists
    piping pipelines
    singing strings
    speaking sets
    stuttering sorts
    talking tuples
    whistling walruses?

    babbling bytestrings?

    The future awaits [pun not intended] ...

    --
    https://mail.python.org/mailman/listinfo/python-list

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Peter J. Holzer@21:1/5 to avi.e.gross@gmail.com on Fri Feb 17 19:47:02 2023
    On 2023-02-17 00:07:12 -0500, avi.e.gross@gmail.com wrote:
    Roaring bitmaps claim to be an improvement not only over uncompressed structures but some other compressed versions but my reading shows it
    may be limited to some uses. Bitsets in general seem to be useful only
    for a largely contiguous set of integers where each sequential bit
    represents whether the nth integer above the lowest is in the set or
    not.

    They don't really have to be that contiguous. As long as your integers
    fit into 32 bits you're fine.

    Of course, properly set up, this makes Unions and Intersections
    and some other operations fairly efficient. But sets are not the same
    as dictionaries and often you are storing other data types than
    smaller integers.

    Of course. Different data types are useful for different applications.

    Many normal compression techniques can require lots of time to
    uncompress to find anything. My impression is that Roaring Bitmaps is
    a tad like the pkzip software that tries various compression
    techniques on each file and chooses whatever seems to work better on
    each one. That takes extra time when zipping, but when unzipping a
    file, it goes directly to the method used to compress it as the type
    is in a header and just expands it one way.

    While not completely wrong, I don't think this comparison is very
    helpful.


    My impression is that Roaring bitmaps breaks up the range of integers
    into smaller chunks and depending on what is being stored in that
    chunk, may leave it as an uncompressed bitmap, or a list of the sparse contents, or other storage methods and can search each version fairly quickly.

    It's been a few years since I looked at the implementation, but that's
    the gist of it.


    So, I have no doubt it works great for some applications such as
    treating social security numbers as integers.

    Not sure what you mean here. You mean storing sets of social security
    numbers as roaring bitmaps? You might be able to do that (Last time I
    looked RB's were limited to 32 bits, which may not be enough to
    represent SSNs unmodified), but are set operations on SSNs something you routinely do?

    It likely would be overkill to store something like the components of
    an IP address between 0 and 255 inclusive.

    But having said that, there may well be non-integer data that can be
    mapped into and out of integers. As an example, airports or radio
    stations have names like LAX or WPIX. If you limit yourself to ASCII
    letters then every one of them can be stored as a 32-bit integer,
    perhaps with some padding.

    Again: What would be the application?

    hp

    --
    _ | Peter J. Holzer | Story must make more sense than reality.
    |_|_) | |
    | | | hjp@hjp.at | -- Charles Stross, "Creative writing
    __/ | http://www.hjp.at/ | challenge!"

    -----BEGIN PGP SIGNATURE-----

    iQIzBAABCgAdFiEETtJbRjyPwVTYGJ5k8g5IURL+KF0FAmPvy6AACgkQ8g5IURL+ KF3sHw//fISEsIGMXpgzKk3KLXsroyCyGKHs0yFkat5FcrqQYG4kdqEy5ypKwJwQ QUKnyKGAlPEvXC9RKEIfpO/AC6inP35grb5w05XfScMGpkf+2hciGd30HTG8juZh MOq6Yu+8uM+3yNRN3eF9ClXLfQzCBHBV3H/6aTKTdBGlz3UsxN35HYo1QFiD/cD4 H3xw2PqyQe0kg/EGSWhDw5C53tLgSztnHYsu9TGcwMJ9Iq2inzk6+0n6elEdTIrw m6byw4ohQ3D5zal9CnmqgoCkmenTlZSrNrxMcCYXqEM7eVQmJ2hdPFImwSI4RgnT 5CZlhEDPCIvkW92xilqgqq+lW6mHx9nzrnDiHb3dv8+KzBUs9dbw7oPeqMjel5xa IUkVwRbEqLti3bTEV/flo5Qb15wUPHD7u0X7KKvkXvulVm5zs0kChp8dYqWeW3Q6 kCLJOgVY9nhoreXrrOIjBfNNCyaOh3QZRgczLN7dsCGeSiZ4bL+DpbMg5kJN1nS9 IB4+cS+j6r7iUkemE00gb+VU12HjYCaRSRk9AIQnEsgONVhYMJBh1XZZHHBj9iez VwPp860kiyCBtj8v4Zw9M1qhObjyejNYEX7sC16Xpd0zFrHudIfZMxTkJkKIu3ms 3HzdRbkfPE37ArohQeSBUW6ghns9HFYofOhrkKC
  • From avi.e.gross@gmail.com@21:1/5 to avi.e.gross@gmail.com on Fri Feb 17 18:08:09 2023
    Peter,

    Analogies I am sharing are mainly for me to wrap my head around an idea by seeing if it matches any existing ideas or templates and is not meant to be exact. Fair enough?

    But in this case, from my reading, the analogy is rather reasonable. The implementation of Roaring Bitmaps seems to logically break up the space of
    all possible values it looks up into multiple "zones" that are somewhat analogous to individual files, albeit probably in memory as the program
    runs. In the pkzip analogy, each file is processed and stored independently alongside a header that provides enough detail to uniquely open up the
    contents when needed. The collection of such compressed files is then one bigger file that is saved that uses up less space. Roaring bitmaps seems to determine how best to store each zone and only processes that zone when requested, hence some of the speedup as each zone is stored in a manner that generally allows fast access.

    I did not raise the issue and thus have no interest in promoting this technology nor knocking it down. Just wondering what it was under the hood
    and whether I might even have a need for it. I am not saying Social Security numbers are a fit, simply that some form of ID number might fit. If a
    company has a unique ID number for each client and uses it consistently,
    then an implementation that holds a set stored this way of people using
    product A, such as house insurance, and those using product B, such as car insurance, and perhaps product C is an umbrella policy, might easily handle some queries such as who uses two or all three (intersections of sets) or
    who might merit a letter telling them how much they can save if they
    subscribed to two or all three as a way to get more business. Again, just a made-up example I can think about. A company which has a million customers
    over the years will have fairly large sets as described.

    What is helpful to me in thinking about something will naturally often not
    be helpful to you or others but nothing you wrote makes me feel my first
    take was in any serious way wrong. It still makes sense to me.

    And FYI, the largest integer in signed 32 bits is 2_147_483_647 which is 10 digits. A Social Security number look like xxx-xx-xxxx at this time which is only 9 digits. Not that it matters, but it seems it still qualifies to be
    used as I describe, as long as Roaring bitmaps allows it, minus any spaces
    or hyphens and converted to an integer.

    As for my other EXAMPLE, I fail to see why I need to provide a specific need for an application. I don't care what they need it for. The thought was
    about whether something that does not start as an integer can be uniquely mapped into and out of integers of size 32 bits. So I considered a few
    examples of short textual items such as three letter airport abbreviations.
    But if you cannot imagine an application, consider one similar enough to the above. I think there are currently over 47,000 such airports in the world
    and apparently about 20,000 in the US. That seems to exceed the possible combinations of 26 letters (cubed) so it seems there are 4-letter codes too such as ZSPD. It should still fit into 4 bytes, for now.

    So assume you have a variety of facts such as which airports have
    handicapped accessible bathrooms, or have an extra long/strong runway that
    can handle huge planes or anything else that is worth knowing. You might
    have bitmaps (as is being discussed) that may be sparse for some such info
    and fairly dense for other info like having jet fuel available. As above, finding an airport that has various mixtures may be doable with these sets
    and perhaps faster than say queries on a relational database storing the
    same info.

    I will end by saying this is a hypothetical discussion for me. I am not
    doing any projects now where I expect to use Roaring bitmaps but am now
    aware of them should any need or opportunity arise. My mind is very full
    with such trivia and very little is needed albeit I never know what may come
    in as useful.

    Respectfully,

    Avi

    -----Original Message-----
    From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On Behalf Of Peter J. Holzer
    Sent: Friday, February 17, 2023 1:47 PM
    To: python-list@python.org
    Subject: Re: Comparing caching strategies

    On 2023-02-17 00:07:12 -0500, avi.e.gross@gmail.com wrote:
    Roaring bitmaps claim to be an improvement not only over uncompressed structures but some other compressed versions but my reading shows it
    may be limited to some uses. Bitsets in general seem to be useful only
    for a largely contiguous set of integers where each sequential bit
    represents whether the nth integer above the lowest is in the set or
    not.

    They don't really have to be that contiguous. As long as your integers fit
    into 32 bits you're fine.

    Of course, properly set up, this makes Unions and Intersections and
    some other operations fairly efficient. But sets are not the same as dictionaries and often you are storing other data types than smaller integers.

    Of course. Different data types are useful for different applications.

    Many normal compression techniques can require lots of time to
    uncompress to find anything. My impression is that Roaring Bitmaps is
    a tad like the pkzip software that tries various compression
    techniques on each file and chooses whatever seems to work better on
    each one. That takes extra time when zipping, but when unzipping a
    file, it goes directly to the method used to compress it as the type
    is in a header and just expands it one way.



    My impression is that Roaring bitmaps breaks up the range of integers
    into smaller chunks and depending on what is being stored in that
    chunk, may leave it as an uncompressed bitmap, or a list of the sparse contents, or other storage methods and can search each version fairly quickly.

    It's been a few years since I looked at the implementation, but that's the
    gist of it.


    So, I have no doubt it works great for some applications such as
    treating social security numbers as integers.

    Not sure what you mean here. You mean storing sets of social security
    numbers as roaring bitmaps? You might be able to do that (Last time I looked RB's were limited to 32 bits, which may not be enough to represent SSNs unmodified), but are set operations on SSNs something you routinely do?

    It likely would be overkill to store something like the components of
    an IP address between 0 and 255 inclusive.

    But having said that, there may well be non-integer data that can be
    mapped into and out of integers. As an example, airports or radio
    stations have names like LAX or WPIX. If you limit yourself to ASCII
    letters then every one of them can be stored as a 32-bit integer,
    perhaps with some padding.

    Again: What would be the application?

    hp

    --
    _ | Peter J. Holzer | Story must make more sense than reality.
    |_|_) | |
    | | | hjp@hjp.at | -- Charles Stross, "Creative writing
    __/ | http://www.hjp.at/ | challenge!"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Peter J. Holzer@21:1/5 to avi.e.gross@gmail.com on Sat Feb 18 11:40:04 2023
    On 2023-02-17 18:08:09 -0500, avi.e.gross@gmail.com wrote:
    Analogies I am sharing are mainly for me to wrap my head around an idea by seeing if it matches any existing ideas or templates and is not meant to be exact. Fair enough?

    Yeah. But if you are venting your musings into a public space you
    shouldn't be surprised if people react to them. And we can only react to
    what you write, not what you think.

    But in this case, from my reading, the analogy is rather reasonable.

    Although that confused me a bit. You are clearly responding to something
    I thought about but which you didn't quote below. Did I just think about
    it and not write it, but you responded anyway because you're a mind
    reader? Nope, it just turns out you (accidentally) deleted that sentence.


    The implementation of Roaring Bitmaps seems to logically break up the
    space of all possible values it looks up into multiple "zones" that
    are somewhat analogous to individual files,

    That part is correct. But you presented that in the form of a
    performance/space tradeoff, writing about "trying multiple methods" to
    find the smallest, and that that makes compression slower. That may be
    the case for pkzip, but it's not what RB does: Instead it uses a very
    simple heuristic: If there are less than 25% of the bits set in a zone,
    it uses a list of offsets, otherwise a plain bitmap. (according to
    their 2016 paper which I just skimmed through again - maybe the
    implementation is a bit more complex now). So I think your description
    would lead the reader to anticipate problems which aren't there and
    probably miss ones which are there. So I'll stay with my "not completely
    wrong but not very helpful" assessment.


    I did not raise the issue and thus have no interest in promoting this technology nor knocking it down. Just wondering what it was under the hood and whether I might even have a need for it. I am not saying Social Security numbers are a fit, simply that some form of ID number might fit.

    Yeah, that's the point: Any form of ID which is a small-ish integer
    number fits.

    And maybe it's just because I work with databases a lot, but
    representing things with numeric ids (in relational databases we call
    them "surrogate keys") is such a basic operation that it doesn't warrant
    more than a sentence or two.


    If a company has a unique ID number for each client and uses it
    consistently, then an implementation that holds a set stored this way
    of people using product A, such as house insurance, and those using
    product B, such as car insurance, and perhaps product C is an umbrella policy, might easily handle some queries such as who uses two or all
    three (intersections of sets) or who might merit a letter telling them
    how much they can save if they subscribed to two or all three as a way
    to get more business. Again, just a made-up example I can think
    about. A company which has a million customers over the years will
    have fairly large sets as described.

    A much better example. This is indeed how you would use roaring bitmaps.


    What is helpful to me in thinking about something will naturally often not
    be helpful to you or others but nothing you wrote makes me feel my first
    take was in any serious way wrong. It still makes sense to me.

    And FYI, the largest integer in signed 32 bits is 2_147_483_647

    I know. It's been some time since I could do hexadecimal arithmetic
    in my head but the the limits of common data types are still burned into
    my brain ;-).

    which is 10 digits. A Social Security number look like xxx-xx-xxxx at
    this time which is only 9 digits.

    These are US social security numbers. Other countries have other
    schemes. For example, Austrian SSNs have 10 digits, so you would need 34
    bits to represent them exactly. However, they (obviously) contain some redundancy (one of the digits is a checksum, and there aren't 999999
    days in a century) so it could algorithmically be compressed down to 26
    bits. But you probably wouldn't do that because in almost any real
    application you wouldn't use the SSN as a primary key (some people don't
    have one, and there have been mixups resulting in two people getting the
    same SSN).


    As for my other EXAMPLE, I fail to see why I need to provide a specific need for an application. I don't care what they need it for. The thought was
    about whether something that does not start as an integer can be uniquely mapped into and out of integers of size 32 bits.

    That's what confused me. You seemed to concentrate on the "map things to integers" part which has been solved for decades and is absolutely
    incidental to roaring bitmaps and completely ignored what you would be
    using them for.

    So I thought I was missing something, but it seems I wasn't.

    hp

    --
    _ | Peter J. Holzer | Story must make more sense than reality.
    |_|_) | |
    | | | hjp@hjp.at | -- Charles Stross, "Creative writing
    __/ | http://www.hjp.at/ | challenge!"

    -----BEGIN PGP SIGNATURE-----

    iQIzBAABCgAdFiEETtJbRjyPwVTYGJ5k8g5IURL+KF0FAmPwqwAACgkQ8g5IURL+ KF3mFg//aRg2c+prlaxpnmBwI8h2eAVUrd1CV7i+uUXDklrRcnPLeUk1OLrJz100 6ol+Jf419Ih0+tokxJGRCD/i15Qpmw1CAwTmIepwx4FjTFTPk+pNppSx8l13XiOC aMAQgmOh3XpmtaNWNJq7BWvq0F6QnMvJl57NOMK45oHZuRPE6qacSvzxP19PoxNb M7o3t4Dci1/aETi3cZzTFzDIlb+ULajd++RV2Pq97/epqP3TWZNSJH4jqqlQ1QVq au4z+ZYavgAZWS+BH7fNPRF79VwkRIVrWi2sxgYuzIfuiW9LoBJKAN3QRt1m6GnJ i6W/8AI5jFphYlko6xdgaGAm4lEO7AWkJQDTzqnHXn1X2ddYgdNdpJGd/UfJ6BJn LcNkYNaUOQjur21um7txE9aRPNg7UMBASjKpnJ6JsRF7D5/q3X5vn9VWyBNPPfFY 9mQTg8bvWcz+g91mi0L6N8T+d/roRHn414XndfCETXLKSFry5h8bVovHkGx4MxbE h0gSWG/hkg1Aa+s/6+din2EpGixDr3uhpb0VcpB0onunPe0b5muDnaVKMBwdGknj BdNtMElRFIw1lhoqWSOvbxkSC5ROPuXvUfTIBU0BQcng4LaCk3PRdnn/EezfBwVC 76w3qT1uG52BWntGOYP7ljL4j/eg2Ttv0MBx7ip
  • From avi.e.gross@gmail.com@21:1/5 to avi.e.gross@gmail.com on Sat Feb 18 14:59:37 2023
    David,

    This conversation strikes me as getting antagonistic and as such, I will not continue it here after this message.

    I can nitpick at least as well as you but have no interest. It is also wandering away from the original point.

    The analogy I gave remains VALID no matter if you do not accept it as being precise. Analogies are not equalities. I did not say this does at all what programs like pkzip do in entirety or even anything similarly. I simply said that pkzip (as it no doubt evolves) has a series of compression methods to choose from and each has a corresponding method to undo and get back the original. As it happens, you can write any number of algorithms that
    determine which method to use and get back the original. It depend on many relative costs including not just the size of the compressed object but how often it will be accessed and how much effort each takes. In some cases you will compress everything once and extract it many times and in many places,
    so it may be worth the effort to try various compression techniques and
    measure size and even the effort to extract from that and decide which to
    use.

    I do not know the internals of any Roaring Bitmap implementation so all I
    did gather was that once the problem is broken into accessing individual
    things I chose to call zones for want of a more specific name, then each
    zone is stored in one of an unknown number of ways depending on some logic.
    You say an implementation chose two ways and that is fine. But in theory, it may make sense to choose other ways as well and the basic outline of the algorithm remains similar. I can imagine if a region/zone is all the even numbers, then a function that checks if you are searching for an odd number
    may be cheaper. That is an example, not something I expect to see or that is useful enough. But the concept of storing a function as the determiner for a region is general enough that it can support many more realistic ideas.

    From what you wrote, the algorithm chosen is fairly simple BUT I have to ask if these bitmaps are static or can be changed at run time? I mean if you
    have a region that is sparse and then you keep adding, does the software
    pause and rewrite that region as a bitmap if it is a list of offsets? Or, if
    a bitmap loses enough, ...

    On to your other points. Again, understand I am talking way more abstractly than you and thus it really does not matter what the length of a particular
    ID in some country is for the main discussion. The assumption is that if you are using something with limits, like a Roaring Bitmap, that you do things within the limits. When I lived in Austria, I did not bother getting an Austrian Sozialversicherungsnummer so I have no idea it is ten digits long.
    In any case, many things grow over time such as the length of telephone numbers.

    The same applies to things like airport codes. They can get longer for many reasons and may well exceed 4 characters, and certainly UNICODE or other
    such representations may exceed four bytes now if you allow non-ASCII characters that map into multiple bytes. My point was to think about how
    useful a Roaring bitmap is if it takes only 32 bit integers and one trivial mapping was to use any four bytes to represent a unique integer. But clearly you could map longer strings easily enough if you restrict yourself to 26
    upper case letters and perhaps a few other symbols that can be encoded in 5 bits. I am not saying it is worth the effort, but that means 6 characters
    can fit in 32 bits.

    I do wonder if the basic idea has to be limited to 32 bits or if it can
    expand to say 64 or use additional but fast methods of storing the data
    beyond the two mentioned.

    There are variants of other ideas I can think of like sparse arrays or
    matrices such as you find in the scipy module in Python. If they hold a
    Boolean value, they sound like they are a similar idea where you simply keep track of the ones marked True, or if it makes sense, the ones considered
    False.

    -----Original Message-----
    From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On Behalf Of Peter J. Holzer
    Sent: Saturday, February 18, 2023 5:40 AM
    To: python-list@python.org
    Subject: Re: Comparing caching strategies

    On 2023-02-17 18:08:09 -0500, avi.e.gross@gmail.com wrote:
    Analogies I am sharing are mainly for me to wrap my head around an
    idea by seeing if it matches any existing ideas or templates and is
    not meant to be exact. Fair enough?

    Yeah. But if you are venting your musings into a public space you shouldn't
    be surprised if people react to them. And we can only react to what you
    write, not what you think.

    But in this case, from my reading, the analogy is rather reasonable.

    Although that confused me a bit. You are clearly responding to something I thought about but which you didn't quote below. Did I just think about it
    and not write it, but you responded anyway because you're a mind reader?
    Nope, it just turns out you (accidentally) deleted that sentence.


    The implementation of Roaring Bitmaps seems to logically break up the
    space of all possible values it looks up into multiple "zones" that
    are somewhat analogous to individual files,

    That part is correct. But you presented that in the form of a
    performance/space tradeoff, writing about "trying multiple methods" to find
    the smallest, and that that makes compression slower. That may be the case
    for pkzip, but it's not what RB does: Instead it uses a very simple
    heuristic: If there are less than 25% of the bits set in a zone, it uses a
    list of offsets, otherwise a plain bitmap. (according to their 2016 paper
    which I just skimmed through again - maybe the implementation is a bit more complex now). So I think your description would lead the reader to
    anticipate problems which aren't there and probably miss ones which are
    there. So I'll stay with my "not completely wrong but not very helpful" assessment.


    I did not raise the issue and thus have no interest in promoting this technology nor knocking it down. Just wondering what it was under the
    hood and whether I might even have a need for it. I am not saying
    Social Security numbers are a fit, simply that some form of ID number
    might fit.

    Yeah, that's the point: Any form of ID which is a small-ish integer number fits.

    And maybe it's just because I work with databases a lot, but representing things with numeric ids (in relational databases we call them "surrogate
    keys") is such a basic operation that it doesn't warrant more than a
    sentence or two.


    If a company has a unique ID number for each client and uses it
    consistently, then an implementation that holds a set stored this way
    of people using product A, such as house insurance, and those using
    product B, such as car insurance, and perhaps product C is an umbrella policy, might easily handle some queries such as who uses two or all
    three (intersections of sets) or who might merit a letter telling them
    how much they can save if they subscribed to two or all three as a way
    to get more business. Again, just a made-up example I can think
    about. A company which has a million customers over the years will
    have fairly large sets as described.

    A much better example. This is indeed how you would use roaring bitmaps.


    What is helpful to me in thinking about something will naturally often
    not be helpful to you or others but nothing you wrote makes me feel my
    first take was in any serious way wrong. It still makes sense to me.

    And FYI, the largest integer in signed 32 bits is 2_147_483_647

    I know. It's been some time since I could do hexadecimal arithmetic in my
    head but the the limits of common data types are still burned into my brain ;-).

    which is 10 digits. A Social Security number look like xxx-xx-xxxx at
    this time which is only 9 digits.

    These are US social security numbers. Other countries have other schemes.
    For example, Austrian SSNs have 10 digits, so you would need 34 bits to represent them exactly. However, they (obviously) contain some redundancy
    (one of the digits is a checksum, and there aren't 999999 days in a century)
    so it could algorithmically be compressed down to 26 bits. But you probably wouldn't do that because in almost any real application you wouldn't use the SSN as a primary key (some people don't have one, and there have been mixups resulting in two people getting the same SSN).


    As for my other EXAMPLE, I fail to see why I need to provide a
    specific need for an application. I don't care what they need it for.
    The thought was about whether something that does not start as an
    integer can be uniquely mapped into and out of integers of size 32 bits.

    That's what confused me. You seemed to concentrate on the "map things to integers" part which has been solved for decades and is absolutely
    incidental to roaring bitmaps and completely ignored what you would be using them for.

    So I thought I was missing something, but it seems I wasn't.

    hp

    --
    _ | Peter J. Holzer | Story must make more sense than reality.
    |_|_) | |
    | | | hjp@hjp.at | -- Charles Stross, "Creative writing
    __/ | http://www.hjp.at/ | challenge!"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Passin@21:1/5 to avi.e.gross@gmail.com on Sat Feb 18 15:59:32 2023
    On 2/18/2023 2:59 PM, avi.e.gross@gmail.com wrote:
    I do not know the internals of any Roaring Bitmap implementation so all I
    did gather was that once the problem is broken into accessing individual things I chose to call zones for want of a more specific name, then each
    zone is stored in one of an unknown number of ways depending on some logic.

    Somewhat tangential, but back quite a while ago there was a C library
    called IIRC "julia list". It implemented lists in several different
    ways, some quite sophisticated, depending on the size and usage pattern.
    I remembered it as soon as I took a look at Roaring Bitmap and saw
    that the latter can use different representations depending on size and
    access patterns.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MRAB@21:1/5 to avi.e.gross@gmail.com on Sat Feb 18 20:34:41 2023
    On 2023-02-18 19:59, avi.e.gross@gmail.com wrote:
    [snip]
    I do not know the internals of any Roaring Bitmap implementation so all I
    did gather was that once the problem is broken into accessing individual things I chose to call zones for want of a more specific name, then each
    zone is stored in one of an unknown number of ways depending on some logic. You say an implementation chose two ways and that is fine. But in theory, it may make sense to choose other ways as well and the basic outline of the algorithm remains similar. I can imagine if a region/zone is all the even numbers, then a function that checks if you are searching for an odd number may be cheaper. That is an example, not something I expect to see or that is useful enough. But the concept of storing a function as the determiner for a region is general enough that it can support many more realistic ideas.

    From what you wrote, the algorithm chosen is fairly simple BUT I have to ask
    if these bitmaps are static or can be changed at run time? I mean if you
    have a region that is sparse and then you keep adding, does the software pause and rewrite that region as a bitmap if it is a list of offsets? Or, if a bitmap loses enough, ...

    A region is either "sparse" (<= 4096 entries) or "dense" (> 4096
    entries). It'll be converted to the other type as and when necessary.

    On to your other points. Again, understand I am talking way more abstractly than you and thus it really does not matter what the length of a particular ID in some country is for the main discussion. The assumption is that if you are using something with limits, like a Roaring Bitmap, that you do things within the limits. When I lived in Austria, I did not bother getting an Austrian Sozialversicherungsnummer so I have no idea it is ten digits long. In any case, many things grow over time such as the length of telephone numbers.

    The same applies to things like airport codes. They can get longer for many reasons and may well exceed 4 characters, and certainly UNICODE or other
    such representations may exceed four bytes now if you allow non-ASCII characters that map into multiple bytes. My point was to think about how useful a Roaring bitmap is if it takes only 32 bit integers and one trivial mapping was to use any four bytes to represent a unique integer. But clearly you could map longer strings easily enough if you restrict yourself to 26 upper case letters and perhaps a few other symbols that can be encoded in 5 bits. I am not saying it is worth the effort, but that means 6 characters
    can fit in 32 bits.

    The Unicode Consortium have said that Unicode will be limited to U+0000..U+10FFFF (21 bits).

    I do wonder if the basic idea has to be limited to 32 bits or if it can expand to say 64 or use additional but fast methods of storing the data beyond the two mentioned.

    There are variants of other ideas I can think of like sparse arrays or matrices such as you find in the scipy module in Python. If they hold a Boolean value, they sound like they are a similar idea where you simply keep track of the ones marked True, or if it makes sense, the ones considered False.

    [snip]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Peter J. Holzer@21:1/5 to Thomas Passin on Sat Feb 18 23:55:44 2023
    On 2023-02-18 15:59:32 -0500, Thomas Passin wrote:
    On 2/18/2023 2:59 PM, avi.e.gross@gmail.com wrote:
    I do not know the internals of any Roaring Bitmap implementation so all I did gather was that once the problem is broken into accessing individual things I chose to call zones for want of a more specific name, then each zone is stored in one of an unknown number of ways depending on some logic.

    Somewhat tangential, but back quite a while ago there was a C library called IIRC "julia list".

    ITYM Judy arrays. They were mentioned here already.

    It implemented lists in several different ways, some quite
    sophisticated, depending on the size and usage pattern. I remembered
    it as soon as I took a look at Roaring Bitmap and saw that the latter
    can use different representations depending on size and access
    patterns.

    Yes, Roaring Bitmaps are somewhat similar. Judy arrays are more
    versatile (they support more data types) and in many ways more
    sophisticated, despite being 10+ years older. OTOH Roaring Bitmaps are a
    lot simpler which may have contributed to their popularity.

    hp

    --
    _ | Peter J. Holzer | Story must make more sense than reality.
    |_|_) | |
    | | | hjp@hjp.at | -- Charles Stross, "Creative writing
    __/ | http://www.hjp.at/ | challenge!"

    -----BEGIN PGP SIGNATURE-----

    iQIzBAABCgAdFiEETtJbRjyPwVTYGJ5k8g5IURL+KF0FAmPxV2sACgkQ8g5IURL+ KF238w//S3eklTADlvphM++zcRU8Skwd2TJ/Rh5e+c25EmRU5dmiaPztuJoO37mq Bgl+yiFqIoQ6RFikF2vuHJf9TGvJVQgfJqA0Ww9R/lqNCJ9RisgReIq/8M1WtNQk DCp9Usf+O1SdFZ/0A75PiHEso5lqjrKacG2Ud45haAKd9fMedwVtXLq/B/rURm3I gVLq43L+F3cRwMjWZz/1p+u8KFzP1azDpIhp3xOrjEbNCAa5IuGsYK0+GQRtykUH UNYeKfXrNw3jx5SnUtM6b/6u67Rq/a8tmlFUu/MRIF0196qR62mPL1eOL6g1xLSw g1qVkTEDMPYFYcwrIvTsPyiXe6aZy8rtuBKngNIGBABghtsfIaoqYymO95uMVOEr ZpGGtZOz0tpwwwovuY7e6lTdCKABsf/pyk5zrUl5mdB1qLBz6rSd3aLpEn1Kokc3 cTcq8zJgBOObGkd44YHoh7q4JDBbiNvicXHHFF2LnsX7Mp9KBScTCjTuJIVAgs0s WxAM5x75OW8DtEQvCt0xSRFuVUTG4gJVMTt3S9U/2HBoM4vja3o/MgVvdxpj0LeT tUYmKsx2lNiBY9x2VYBuC2Kf9Wwmqez1EF7KY9jp7txw2qAsxnRwdtNhh27VFe6D 1RBGAvQrvWyqjVZv2tikvKuMNip0jIKMch0v/Vv
  • From avi.e.gross@gmail.com@21:1/5 to avi.e.gross@gmail.com on Sat Feb 18 18:04:25 2023
    It is not an unusual pattern, Thomas, to do something selective to some object rather than do all parts just one way.

    The history of computing has often been one where you had to deal with scarcity of expensive resources.

    Consider the Python "list" as a rather wasteful container that is best used for diverse contents. As soon as you make all the contents of the same basic type, you start wondering if you are better off with a restricted type that holds only one kind of
    item, often as something like a numpy array. Or, you start wondering if maybe a good way to store the contents is in a dictionary instead, as searching it is often way faster.

    But fundamentally, there is nothing WRONG with uses of a Python list as it handles almost anything and for small sample sizes, it is often not work spending time redoing it. Still, if you implement a 3-D matrix as a list of lists of lists, and use
    slicing and other techniques to do things like matrix multiplication, it gets unwieldly enough so arguably it is the wrong tool.

    If you look at the history of Python, they deliberately left out many of the containers other programming languages used and stressed lists and tuples. No vectors or arrays were the focus. Later stuff had to be added as people noted that generality has
    costs. If you look at a language like JavaScript, it is beyond weird how they decided to use attributes of an object to make an array. So an object might have attributes like "name" and "length" alongside some like "1" and "2" and "666" and when you
    wanted to treat the object like an array, it might look at the attributes and ignore the non-numerical ones and look like it has a sparsely populated array/vector of items indexed from 1 to 666. You can do all kinds of arithmetic operations and add
    missing indices or remove some and it still works like a typical array but with weird costs such as sometimes having to reindex lots of items if you want to insert a new item as the 3rd or 1st element so any above need renumbering the hard way! It works
    but strikes me as a kludge.

    If you look within Python at numpy and pandas and similar utilities, they are well aware of the idea of abstracting out concepts with different implementations and use lots of Python tools that access different storage methods as needed often by using
    objects that implement various "protocols" to the point where manipulating them similarly seems straightforward. In principle, you could create something like a pandas data.frame and then call a function that examines the columns, and perhaps rows, and
    returns a modified (or new) data.frame where the contents have been adjusted so the storage is smaller or can be accessed more efficiently, based on any other preferences and hints you supply, such as if it can be immutable. A column which is all 1's or
    Trues can obviously be done other ways, and one that has all values under 256, again, can use less storage.

    Of course this would need to be done very carefully so that changes that violate assumptions will result in a refactoring to a version that handles the changes, or results in an error. And a long-running system could be set to keep track of how an object
    is used and perhaps make adjustments. As one example, after some number of accesses, you might change a function "live" to begin caching, or have an object reconfigure to be faster to search even if occupying more room.

    Back to Python lists, a typical change is simply to convert them to a tuple which makes them easier to store and often faster to search. And, if the keys are unique, now that dictionaries maintain order of insertion, sometimes you may want to convert the
    list to a dict.

    But I hasten to note some "improvements" may not really improve. In a language like R, many operations such as changing one vector in a data.frame are done lazily and the new copy is still sharing mostly the same storage as the old. Making some changes
    can result in negative effects. A common problem people have is that trying to store some objects in an existing vector can work except when done, the entire vector has been transformed into one that can carry any of the contents. A vector of integers
    may become a vector of doubles or even a vector of characters that now have entries like "777" as a character string. The flexibility comes with lots of programming errors!

    Note how this can cause problems with the original idea here of caching strategies. Imagine a function that checks the environment as to what encoding or human language and so on to produce text in. If you cache it so it produces results that are stored
    in something like a dictionary with a key, and later the user changes the environment as it continues running, the cache may now contain invalid results. You might need to keep track of the environment and empty the cache if things change, or start a new
    cache and switch to it. An example would be the way I use Google Translate. I sometimes am studying or using a language and want to look up a word or phrase or entire sentence. If Google Translate keeps track, it may notice repeated requests like "
    Valentines Day" and cache it for re-use. But I often click to switch languages and see if the other one uses a similar or different way to describe what it means or something similar but spelled another way. German does the latter as in Valentinstag
    which is a fairly literal translation as does Dutch (Valentijnsdag ) and Hungarian (Valentin nap) .

    But Hebrew calls it the holiday of love, sort of (חג האהבה). Portuguese is similar but includes day as well as love (Dia dos Namorados)

    Esperanto tosses in more about sainthood (Sankta Valentín) and in a sense Spanish does both ways with day and saint (Día de San Valentín).

    You could continue playing with such phrases quite a bit as Translate translates N languages into N languages. So even caching from English to anything is problematic and the same word spelling can be found in many languages and the translation of that
    word into English also varies.

    So if you had a function like ask_translate(from, to, phrase) then something like an @lru_cache() decorator would not suffice as caching might require either combining the three arguments into one longer string to cache, or multiple caches accessed by
    something like an NxN data structure to select which cache.

    And it can also have other considerations such as not using naked queries as keys, but customizing them first into some canonical format such as removing leading and trailing whitespace as well as multiple instances between words, shifting everything to
    lower case, removing some punctuation and perhaps replacing special characters with a single version. Otherwise, the cache may be too large and often less effective.

    As stated earlier, any analogies or comparisons I use are for the purpose of discussion only and opinions are mine.

    -----Original Message-----
    From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On Behalf Of Thomas Passin
    Sent: Saturday, February 18, 2023 4:00 PM
    To: python-list@python.org
    Subject: Re: Comparing caching strategies

    On 2/18/2023 2:59 PM, avi.e.gross@gmail.com wrote:
    I do not know the internals of any Roaring Bitmap implementation so
    all I did gather was that once the problem is broken into accessing individual things I chose to call zones for want of a more specific
    name, then each zone is stored in one of an unknown number of ways depending on some logic.

    Somewhat tangential, but back quite a while ago there was a C library called IIRC "julia list". It implemented lists in several different ways, some quite sophisticated, depending on the size and usage pattern.
    I remembered it as soon as I took a look at Roaring Bitmap and saw that the latter can use different representations depending on size and access patterns.

    --
    https://mail.python.org/mailman/listinfo/python-list

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MRAB@21:1/5 to All on Sun Feb 19 00:04:26 2023
    T24gMjAyMy0wMi0xOCAyMzowNCwgYXZpLmUuZ3Jvc3NAZ21haWwuY29tIHdyb3RlOg0KW3Nu aXBdDQo+IA0KPiBOb3RlIGhvdyB0aGlzIGNhbiBjYXVzZSBwcm9ibGVtcyB3aXRoIHRoZSBv cmlnaW5hbCBpZGVhIGhlcmUgb2YgY2FjaGluZyBzdHJhdGVnaWVzLiBJbWFnaW5lIGEgZnVu Y3Rpb24gdGhhdCBjaGVja3MgdGhlIGVudmlyb25tZW50IGFzIHRvIHdoYXQgZW5jb2Rpbmcg b3IgaHVtYW4gbGFuZ3VhZ2UgYW5kIHNvIG9uIHRvIHByb2R1Y2UgdGV4dCBpbi4gSWYgeW91 IGNhY2hlIGl0IHNvIGl0IHByb2R1Y2VzIHJlc3VsdHMgdGhhdCBhcmUgc3RvcmVkIGluIHNv bWV0aGluZyBsaWtlIGEgZGljdGlvbmFyeSB3aXRoIGEga2V5LCBhbmQgbGF0ZXIgdGhlIHVz ZXIgY2hhbmdlcyB0aGUgZW52aXJvbm1lbnQgYXMgaXQgY29udGludWVzIHJ1bm5pbmcsIHRo ZSBjYWNoZSBtYXkgbm93IGNvbnRhaW4gaW52YWxpZCByZXN1bHRzLiBZb3UgbWlnaHQgbmVl ZCB0byBrZWVwIHRyYWNrIG9mIHRoZSBlbnZpcm9ubWVudCBhbmQgZW1wdHkgdGhlIGNhY2hl IGlmIHRoaW5ncyBjaGFuZ2UsIG9yIHN0YXJ0IGEgbmV3IGNhY2hlIGFuZCBzd2l0Y2ggdG8g aXQuICBBbiBleGFtcGxlIHdvdWxkIGJlIHRoZSB3YXkgSSB1c2UgR29vZ2xlIFRyYW5zbGF0 ZS4gSSBzb21ldGltZXMgYW0gc3R1ZHlpbmcgb3IgdXNpbmcgYSBsYW5ndWFnZSBhbmQgd2Fu dCB0byBsb29rIHVwIGEgd29yZCBvciBwaHJhc2Ugb3IgZW50aXJlIHNlbnRlbmNlLiBJZiBH b29nbGUgVHJhbnNsYXRlIGtlZXBzIHRyYWNrLCBpdCBtYXkgbm90aWNlIHJlcGVhdGVkIHJl cXVlc3RzIGxpa2UgIlZhbGVudGluZXMgRGF5IiBhbmQgY2FjaGUgaXQgZm9yIHJlLXVzZS4g QnV0IEkgb2Z0ZW4gY2xpY2sgdG8gc3dpdGNoIGxhbmd1YWdlcyBhbmQgc2VlIGlmIHRoZSBv dGhlciBvbmUgdXNlcyBhIHNpbWlsYXIgb3IgZGlmZmVyZW50IHdheSB0byBkZXNjcmliZSB3 aGF0IGl0IG1lYW5zIG9yIHNvbWV0aGluZyBzaW1pbGFyIGJ1dCBzcGVsbGVkIGFub3RoZXIg d2F5LiBHZXJtYW4gZG9lcyB0aGUgbGF0dGVyIGFzIGluIFZhbGVudGluc3RhZyB3aGljaCBp cyBhIGZhaXJseSBsaXRlcmFsIHRyYW5zbGF0aW9uIGFzIGRvZXMgRHV0Y2ggKFZhbGVudGlq bnNkYWcgKSBhbmQgIEh1bmdhcmlhbiAoVmFsZW50aW4gbmFwKSAuDQo+IA0KPiBCdXQgSGVi cmV3IGNhbGxzIGl0IHRoZSBob2xpZGF5IG9mIGxvdmUsIHNvcnQgb2YgKNeX15Ig15TXkNeU 15HXlCkuIFBvcnR1Z3Vlc2UgaXMgc2ltaWxhciBidXQgaW5jbHVkZXMgZGF5IGFzIHdlbGwg YXMgbG92ZSAoRGlhIGRvcyBOYW1vcmFkb3MpDQo+IA0KPiBFc3BlcmFudG8gdG9zc2VzIGlu IG1vcmUgYWJvdXQgc2FpbnRob29kIChTYW5rdGEgVmFsZW50w61uKSBhbmQgaW4gYSBzZW5z ZSBTcGFuaXNoIGRvZXMgYm90aCB3YXlzIHdpdGggZGF5IGFuZCBzYWludCAoRMOtYSBkZSBT YW4gVmFsZW50w61uKS4NCj4gDQpUaGUgRXNwZXJhbnRvIGRpZG4ndCBsb29rIHJpZ2h0IHRv IG1lOyBpdCdzICJWYWxlbnRlbmEgdGFnbyIgb3IgDQoiU2Fua3QtVmFsZW50ZW5hIHRhZ28i Lg0KDQpbc25pcF0NCg0K

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From avi.e.gross@gmail.com@21:1/5 to avi.e.gross@gmail.com on Sat Feb 18 20:08:59 2023
    MRAB,

    I made it very clear I was using the translation provided by Google Translate. I copied exactly what it said and as I speak the languages involved, they seemed reasonable. I often find it provides somewhat different translations than I expect and
    sometimes I need to supply a longer sentence to get it on track and sometimes it is just plain wrong. Getting it to provide female-specific sentences can be an exercise in frustration, for example.

    What it produces is not important to the main point.

    It is hard to have conversations when people keep niggling on details and especially details that make absolutely no difference.

    My point was you can have functions that cannot be cached for results just any trivial way. The EXAMPLE I gave suggests if your had a Python program that did something long these lines between multiple languages, the RESULTS will depend on not just the
    phrase used. But they can be cached well in several ways if you want.

    Let me point out another related use. When you type a message on your phone, you may have a sort of prediction algorithm running that keeps offering to auto-complete your words or fix spelling errors. It does this dynamically and learns new words and
    eventually may start remembering patterns. I have totally mucked up my ENGLISH keyboard because it now remembers snippets of my typing in languages like Spanish and German and lately Esperanto and it makes suggestions from a very confused composite state
    including lots of funny accented characters. I now switch keyboards when I remember but much of the damage is done! LOL!

    That too is an example of some hidden data the program uses which is loaded with past data of what I have typed and of short phrases so it can make a guess that word A is often followed by word B and after seeing both A and B is extremely likely to be
    followed by C. Now obviously this would not be a trivial use of caching as part of guessing but could be part of a method that increments some keys like "A B" or "A B C" with a count of how often they have seen that combo. Yes, this is not direct caching,
    but also a side effect you might want to add to a function.

    As for Esperanto, I am still learning it and as a created language, the snippet I used can likely be translated multiple ways and as someone who could care less about Valentines Day, I don't ever intend on using any of those ways in conversation. Most
    languages have alternate ways of saying things and it would not shock me if there are sentences like "The Holiday used in Capitalist Western Countries to commemorate a day of love based on a person some religions consider of merit and have given a
    religious title" or whatever.

    So back on topic, an original question here was how to cache things, perhaps using a LRU algorithm with a data structure using some maximum.

    My comment was that using a function decorator that caches any result may not be adequate in many cases. I presented several examples and my point is that in the above example, it may make more sense to have multiple caches that exist perhaps outside any
    one function, or a more complex cache that stores using a more complex key

    -----Original Message-----
    From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On Behalf Of MRAB
    Sent: Saturday, February 18, 2023 7:04 PM
    To: python-list@python.org
    Subject: Re: Comparing caching strategies

    On 2023-02-18 23:04, avi.e.gross@gmail.com wrote:
    [snip]

    Note how this can cause problems with the original idea here of caching strategies. Imagine a function that checks the environment as to what encoding or human language and so on to produce text in. If you cache it so it produces results that are
    stored in something like a dictionary with a key, and later the user changes the environment as it continues running, the cache may now contain invalid results. You might need to keep track of the environment and empty the cache if things change, or
    start a new cache and switch to it. An example would be the way I use Google Translate. I sometimes am studying or using a language and want to look up a word or phrase or entire sentence. If Google Translate keeps track, it may notice repeated requests
    like "Valentines Day" and cache it for re-use. But I often click to switch languages and see if the other one uses a similar or different way to describe what it means or something similar but spelled another way. German does the latter as in
    Valentinstag which is a fairly literal translation as does Dutch (Valentijnsdag ) and Hungarian (Valentin nap) .

    But Hebrew calls it the holiday of love, sort of (חג האהבה). Portuguese is similar but includes day as well as love (Dia dos
    Namorados)

    Esperanto tosses in more about sainthood (Sankta Valentín) and in a sense Spanish does both ways with day and saint (Día de San Valentín).

    The Esperanto didn't look right to me; it's "Valentena tago" or "Sankt-Valentena tago".

    [snip]

    --
    https://mail.python.org/mailman/listinfo/python-list

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Passin@21:1/5 to Peter J. Holzer on Sat Feb 18 19:31:13 2023
    On 2/18/2023 5:55 PM, Peter J. Holzer wrote:
    On 2023-02-18 15:59:32 -0500, Thomas Passin wrote:
    On 2/18/2023 2:59 PM, avi.e.gross@gmail.com wrote:
    I do not know the internals of any Roaring Bitmap implementation so all I >>> did gather was that once the problem is broken into accessing individual >>> things I chose to call zones for want of a more specific name, then each >>> zone is stored in one of an unknown number of ways depending on some logic. >>
    Somewhat tangential, but back quite a while ago there was a C library called >> IIRC "julia list".

    ITYM Judy arrays. They were mentioned here already.

    Ha! Fading memory, I guess. Maybe that's why I couldn't find them with
    an internet search.

    It implemented lists in several different ways, some quite
    sophisticated, depending on the size and usage pattern. I remembered
    it as soon as I took a look at Roaring Bitmap and saw that the latter
    can use different representations depending on size and access
    patterns.

    Yes, Roaring Bitmaps are somewhat similar. Judy arrays are more
    versatile (they support more data types) and in many ways more
    sophisticated, despite being 10+ years older. OTOH Roaring Bitmaps are a
    lot simpler which may have contributed to their popularity.

    hp



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)