• Integer sizes and DFAs

    From Christopher F Clark@21:1/5 to All on Sat Mar 26 17:16:15 2022
    One of the posters mentioned that ignoring the size of integers used
    in the DFAs was "not computer science". Actually, it is. It's called
    the RAM model of computing. And, in it you assume each "integer"
    occupies one location and is also sufficient to address any memory
    location. Sa, accessing any integer any where in memory is an O(1)
    operation.

    Note how that even works for the DNA DFAs, assuming 32 bit integers
    and a byte addressed machine. With that model, you can fit 10 million
    (1E7) states in the 2 billion byte (2GB) address space (each state has
    four 4-byte entries, i.e. 16 bytes). You might even fit something over
    100 million (1E8) states there. Thus, your scan time is still linear
    in the input, even in the worst case (assuming every request is a
    cache miss), because each access costs a bounded amount.
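
    (A minimal C sketch of the layout being assumed here; the type and
    field names are illustrative, not from any particular implementation:)

        /* One DFA state over the 4-symbol DNA alphabet (A, C, G, T).
           Each transition is a 4-byte index of the next state, so a
           state occupies 4 * 4 = 16 bytes; 2 GB / 16 bytes per state
           is ~1.3E8 states, consistent with the estimate above. */
        #include <stdint.h>

        typedef struct {
            uint32_t next[4];   /* next-state index for A, C, G, T */
        } dna_state;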

    With 64 bit integers, one is essentially limited only by total memory
    (including disk), and the scan is still linear, although in the worst
    case every access is a page fault. Probably not particularly
    practical, but CS is about theory and how you model it.

    --
    ******************************************************************************
    Chris Clark                email: christopher.f.clark@compiler-resources.com
    Compiler Resources, Inc.   Web Site: http://world.std.com/~compres
    23 Bailey Rd               voice: (508) 435-5016
    Berlin, MA 01503 USA       twitter: @intel_chris
    ------------------------------------------------------------------------------

  • From Kaz Kylheku@21:1/5 to Christopher F Clark on Sat Mar 26 18:39:53 2022
    On 2022-03-26, Christopher F Clark <christopher.f.clark@compiler-resources.com> wrote:
    > One of the posters mentioned that ignoring the size of integers used
    > in the DFAs was "not computer science". Actually, it is. It's called
    > the RAM model of computing. And, in it you assume each "integer"
    > occupies one location and is also sufficient to address any memory
    > location. So, accessing any integer anywhere in memory is an O(1)
    > operation.

    The O notation is based on the independent variable(s) like n being
    able to take on unlimited, arbitrarily large values.

    In certain common situations we model elementary operations as being
    O(1), with certain justifications. That simplifies the analysis, while
    leaving its results still relevant and true for practical applications
    running on present hardware.

    The justification needs to be articulated though.

    > Note how that even works for the DNA DFAs, assuming 32 bit integers
    > and a byte addressed machine. With that model, you can fit 10 million
    > (1E7) states in the 2 billion byte (2GB) address space (each state
    > has four 4-byte entries, i.e. 16 bytes).

    A state occupies only four 4-byte entries if it doesn't express that
    it branches to other states on additional input symbols.

    A DFA state is a row in a table, not just the number which identifies/enumerates that row.

    If the input symbols are bytes, then the abstract row size is capped
    at 256 entries.

    (Larger symbol sets can exist (e.g. Unicode), but it's a reasonable
    assumption that the symbol set is finite. Still, it can be large.)
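
    (To make the row-vs-number distinction concrete, here is a minimal C
    sketch of a dense byte-alphabet state; purely illustrative:)

        /* A dense DFA state over a byte alphabet: one full row of the
           transition table, 256 next-state entries of 4 bytes each,
           i.e. 1 KB per state -- not just a 32-bit state number. */
        #include <stdint.h>

        typedef struct {
            uint32_t next[256];   /* indexed directly by the input byte */
        } byte_state;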

    If we have a DFA built from certain lexical rules, and we extend those
    rules, it could be the case that existing states blow up in size,
    because of having to branch to many more other states on new characters
    that were not included in the old rules.

    If a machine initially recognizes the set of strings { "foo", "bar" }
    then its state zero might have a condensed representation of two
    transitions on the symbols 'f' and 'b', such as a tiny two-element
    array that is subject to a linear search.

    If we extend the recognized set of strings, state 0 may have to branch
    on more symbols than just 'f' and 'b'. Depending on how we are
    representing states, the size of the object may change.
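
    (A hedged sketch of such a condensed representation; the fixed
    capacity and names are hypothetical, and a real implementation would
    grow or switch layouts as rules are added:)

        /* A sparse state: small parallel arrays of (symbol, next-state)
           pairs searched linearly. State 0 of the { "foo", "bar" }
           machine needs only two entries, for 'f' and 'b'. */
        #include <stdint.h>

        typedef struct {
            uint8_t  n;          /* number of live transitions */
            uint8_t  sym[8];     /* symbols with outgoing edges */
            uint32_t next[8];    /* corresponding next states */
        } sparse_state;

        /* Returns the next state, or UINT32_MAX on no matching edge. */
        static uint32_t step(const sparse_state *s, uint8_t c) {
            for (uint8_t i = 0; i < s->n; i++)
                if (s->sym[i] == c)
                    return s->next[i];
            return UINT32_MAX;
        }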

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    [No existing computer is actually Turing equivalent because every
    real computer has a finite memory, but we manage to do O(whatever)
    calculations anyway. I'm not sure how productive this line of
    argument is beyond noting what sorts of tasks are or aren't likely
    to fit in real computers. -John]

  • From Christopher F Clark@21:1/5 to Kaz Kylheku on Sun Mar 27 00:54:35 2022
    Kaz, I don't know whether you simply didn't read what I wrote or chose
    to ignore it.

    > In certain common situations we model elementary operations as being
    > O(1), with certain justifications. That simplifies the analysis,
    > while leaving its results still relevant and true for practical
    > applications running on present hardware.

    And, my point was that 2**32 is large enough to be considered
    arbitrarily large with respect to most DFAs. Not quite enough for the
    human genome; see the extended analysis below. Here was my first
    analysis.

    me > Note how that even works for the DNA DFAs, assuming 32 bit integers
    me > and a byte addressed machine. With that model, you can fit 10 million
    me > (1E7) states in the 2 billion byte (2GB) address space (each state
    me > has four 4-byte entries, i.e. 16 bytes).

    > A state occupies only four 4-byte entries if it doesn't express that
    > it branches to other states on additional input symbols.

    A state isn't 32 bits; a transition is. A transition (in a DFA) is
    just the address of another state. A state in a DFA for DNA has 4
    such transitions (one each for A, T, C, and G, the 4 DNA bases).
    Those are the only legal symbols in DNA. So a state is 4 * 4 bytes:
    each transition is 4 bytes and there are 4 of them per state.

    Thus, I explained what the justification is and how that allows more
    than 100 million states in the combined DFA. That's without any fancy
    tricks.

    Now, I just looked up the size of the human genome: it is 3 billion
    bases, so that's a little more than another order of magnitude bigger,
    and you definitely need slightly bigger integers (or to do some
    compression). 64 bit integers are overkill, but convenient to work
    with, and with those you could easily fit the genome in a 128 GB SSD,
    which would be addressable by 64 bit integers/pointers. And, for most
    other applications even 1 million states are more than sufficient.
    Thus, saying a transition can happen in O(1) time is a perfectly
    reasonable assumption. You can lay out nearly any DFA we expect to
    see in memory and directly address it. Thus, the RAM model holds.
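
    (A minimal sketch of the transition loop this models, reusing the
    dna_state layout sketched earlier; each step is one indexed load,
    hence O(1) per symbol under the RAM model:)

        /* Scan `len` input symbols (already mapped to 0..3 for
           A, C, G, T) through the DFA. One array access per symbol,
           so the whole scan is linear in the input length. */
        #include <stddef.h>
        #include <stdint.h>

        uint32_t scan(const dna_state *dfa, const uint8_t *in, size_t len)
        {
            uint32_t s = 0;                 /* start state */
            for (size_t i = 0; i < len; i++)
                s = dfa[s].next[in[i]];     /* O(1) transition */
            return s;                       /* state reached at end */
        }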

    By the way, with 40 bit integers you could fit it in 60GB (20 bytes
    per state and 3G states). We don't quite have laptops with that much
    DRAM memory yet.
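
    (A hedged sketch of one way 40 bit transitions might be packed; the
    little-endian assumption and the names are mine:)

        /* Pack four 40-bit (5-byte) next-state indices into a 20-byte
           state record. 40 bits addresses up to ~1E12 states, ample
           for the 3G states discussed above. */
        #include <stdint.h>
        #include <string.h>

        typedef struct { uint8_t bytes[20]; } packed_state; /* 4 * 5 bytes */

        static uint64_t get_next(const packed_state *s, unsigned base /*0..3*/)
        {
            uint64_t v = 0;
            memcpy(&v, &s->bytes[base * 5], 5); /* assumes little-endian host */
            return v;                           /* 40-bit state index */
        }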


  • From gah4@21:1/5 to Christopher F Clark on Sat Mar 26 19:32:17 2022
    On Saturday, March 26, 2022 at 4:42:55 PM UTC-7, Christopher F Clark wrote:

    (snip)

    > And, my point was that 2**32 is large enough to be considered
    > arbitrarily large with respect to most DFAs. Not quite enough for
    > the human genome; see the extended analysis below. Here was my
    > first analysis.

    About 24 years ago I was working with a DNA sequencing group, and was
    interested in speeding up this problem. The approach I was most
    interested in was special purpose hardware with many of the largest
    DRAMs I could find, arranged just to do this operation.

    (Note that you need one more bit, to indicate when a match is found.)

    There would be logic to read data off disk and pass it directly to
    the DFA array. There is, then, logic to store the offset into the
    disk file, and the state at which the hit occurred, to be read out
    later.

    But we went on to other projects, and I never got to build one.

    Since then, DRAM has gotten much larger, but so has the DNA database.

    Yes, the human genome is 3 gigabase, but the whole of GenBank is
    now about 16 terabase, including WGS (whole genome sequences).

  • From Christopher F Clark@21:1/5 to All on Sun Mar 27 15:02:16 2022
    @gah4: Ok, the state space for the whole GenBank is larger. And, if
    we ever get to the point where we can deal with epigenomic factors, we
    probably need extra coding space for that too. But, I still think it
    is likely that we will be in the range where we can make "disks" (or
    "disk arrays") large enough for them and use 64 bit integers to
    address them.

    And, wandering off into the space of hardware implementations: at
    Intel we were considering doing that circa 2010, so a bit later than
    you were. In particular, we were looking to see if we could adapt
    our technology to do the "FASTA" algorithm. Like you, we never
    brought that to market.

    But, I want to make a couple of slight points on the encoding of transitions.

    If one is using byte addressing and one always puts the states on
    aligned boundaries, one has a few low order bits in the address that
    will always be 0. You can tweak one of those to be your "match
    complete/final" bit and then mask it off when using the address to
    reach the next state. Alternatively, one keeps a separate list of
    "final states"; that is more common in implementations. The third
    choice is to have a state header with extra data, such as whether
    the state is final or not. The last two alternatives expand your
    memory footprint somewhat, but by a relatively minimal amount.
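
    (A minimal C sketch of the first trick; with 16-byte aligned states
    the low 4 address bits are zero, and bit 0 is stolen here. Names are
    illustrative:)

        /* States are 16-byte aligned, so the low 4 bits of a state
           address are always 0; use bit 0 as the "final" flag. */
        #include <stdint.h>

        #define FINAL_BIT ((uintptr_t)1)

        static inline int is_final(uintptr_t tagged) {
            return (int)(tagged & FINAL_BIT);     /* test the stolen bit */
        }

        static inline void *state_ptr(uintptr_t tagged) {
            return (void *)(tagged & ~FINAL_BIT); /* mask off before use */
        }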

    Now, if you don't use byte addressing, but use "word" addressing (a
    la most computers in the 1960s), you actually have shrunk your
    address requirement: you've moved those 0 bits from the bottom of
    the address to the top. Even more so if you use "state" addressing.
    Of course, if your hardware has byte addressable memory, you may
    have to convert those addresses into byte addresses before going to
    actual memory. But all of that doesn't involve extra memory lookups,
    just some small amount of integer/bit arithmetic, which on modern
    computers is essentially irrelevant and gets swamped by pipeline
    stalls waiting for the cache.
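
    (A sketch of the conversion being described, assuming the 16-byte
    states from earlier, so "state" addresses become byte addresses with
    one shift:)

        /* Convert a state number to a byte address: one shift and one
           add, no extra memory lookups (16-byte states => shift by 4). */
        #include <stdint.h>

        #define STATE_SHIFT 4   /* log2(16 bytes per state) */

        static inline uint64_t byte_addr(uint64_t base, uint64_t state_no) {
            return base + (state_no << STATE_SHIFT);
        }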

    Per your idea of keeping them on disk, you can "split" a 64 bit
    address into a "page address" and an "address within a page" and use
    the paging mechanism to bring in the relevant pages. I believe some
    "huge page" implementations actually do that to work well with TLBs
    and caches. Anyway, what you proposed is not far-fetched.
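
    (The split might look like this; the 2 MB huge-page size is an
    assumption for illustration:)

        /* Split a 64-bit state address into a page number and an
           offset within the page (2 MB pages assumed). */
        #include <stdint.h>

        #define PAGE_BITS 21                           /* 2 MB pages */
        #define PAGE_MASK ((UINT64_C(1) << PAGE_BITS) - 1)

        static inline uint64_t page_of(uint64_t addr)   { return addr >> PAGE_BITS; }
        static inline uint64_t offset_of(uint64_t addr) { return addr & PAGE_MASK; }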

    Beyond that there are lots of tricks one can play. We thought for a
    while of Huffman encoding the addresses and using relative pointers.
    If you are doing something like the Aho-Corasick algorithm, pointers
    that "match" lead further on in the machine, so as relative offsets
    they are always positive numbers, while "failure" pointers are the
    ones that point backward, and those point to a limited number of
    states, where absolute addressing makes sense. There is more you can
    do along those lines, but those are the basic steps.
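
    (A hedged sketch of such an encoding; the split between relative
    match edges and absolute failure edges is illustrative, not a
    description of any shipped design:)

        /* One possible Aho-Corasick edge encoding: match edges as
           positive forward offsets, failure edges as absolute indices
           into a small table of fallback states. */
        #include <stdint.h>

        typedef struct {
            uint32_t is_failure : 1;  /* 0 = relative match edge      */
            uint32_t value      : 31; /* forward offset, or fallback- */
        } edge;                       /*   state table index          */

        static uint64_t follow(uint64_t cur, edge e,
                               const uint64_t *fallback_states) {
            return e.is_failure ? fallback_states[e.value] /* absolute */
                                : cur + e.value;           /* relative */
        }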

    Now, in actual genes there are "network expressions" and "spacers".
    The network expressions can be segments that are repeated, and that
    leads to some extra backward links, but the number of them is also
    small (and those are relative and not failure links).

    ---------

    Still, none of this impacts the fact that, when doing analysis, you
    can still treat those memory references as O(1) and say your lexer
    is linear in the number of bytes processed (and that having more
    states doesn't impact that), because fundamentally reading from the
    memory is the bottleneck, and we are always talking about fixed
    memory images (RAM or "disks"; that is, it is a FINITE state
    machine), not reading from a Turing Machine tape.


  • From gah4@21:1/5 to Christopher F Clark on Sat Mar 26 19:45:48 2022
    On Saturday, March 26, 2022 at 4:42:55 PM UTC-7, Christopher F Clark wrote:

    (snip)

    > Now, I just looked up the size of the human genome: it is 3 billion
    > bases, so that's a little more than another order of magnitude
    > bigger, and you definitely need slightly bigger integers

    Note also that there are some larger genomes, such as the Japanese
    flower, Paris japonica, at about 150 gigabase.

    https://en.wikipedia.org/wiki/Paris_japonica

    We do like to think that we are the most special species, but it seems
    that in genome size, we aren't the winner.
    [Unless we have some new way to apply DFAs to genomes, this seems to
    be wandering away from our topic. -John]
