• Integer sizes and DFAs

    From Christopher F Clark@21:1/5 to All on Sat Mar 26 17:16:15 2022
    One of the posters mentioned that ignoring the size of integers used
    in the DFAs was "not computer science". Actually, it is. It's called
    the RAM model of computing. And, in it you assume each "integer"
    occupies one location and is also sufficient to address any memory
    location. Sa, accessing any integer any where in memory is an O(1)
    operation.

    Note how that even works for the DNA DFAs, assuming 32 bit integers
    and a byte addressed machine. With that model, you can fit 10 million
    (1E7) states in the 2 billion byte (2GB) address space (each state has
    four 4-byte entries, i.e. 16 bytes). You might even fit something over
    100 million (1E8) states there. Thus, your scan time is still linear
    in the input, even in the worst case (assuming every request is a
    cache miss), because each access costs a bounded amount.
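
    (A minimal C sketch of the layout being assumed here; the type and
    field names are illustrative, not from any particular implementation:)

        /* One DFA state over the 4-symbol DNA alphabet (A, C, G, T).
           Each transition is a 4-byte index of the next state, so a
           state occupies 4 * 4 = 16 bytes; 2 GB / 16 bytes per state
           is ~1.3E8 states, consistent with the estimate above. */
        #include <stdint.h>

        typedef struct {
            uint32_t next[4];   /* next-state index for A, C, G, T */
        } dna_state;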

    With 64 bit integers, one is essentially limited only by total memory
    (including disk), and the scan is still linear, although in the worst
    case every access is a page fault. Probably not particularly
    practical, but CS is about theory and how you model it.

    --
    ******************************************************************************
    Chris Clark                email: christopher.f.clark@compiler-resources.com
    Compiler Resources, Inc.   Web Site: http://world.std.com/~compres
    23 Bailey Rd               voice: (508) 435-5016
    Berlin, MA 01503 USA       twitter: @intel_chris
    ------------------------------------------------------------------------------

  • From Kaz Kylheku@21:1/5 to Christopher F Clark on Sat Mar 26 18:39:53 2022
    On 2022-03-26, Christopher F Clark <christopher.f.clark@compiler-resources.com> wrote:
    > One of the posters mentioned that ignoring the size of integers used
    > in the DFAs was "not computer science". Actually, it is. It's called
    > the RAM model of computing. And, in it you assume each "integer"
    > occupies one location and is also sufficient to address any memory
    > location. So, accessing any integer anywhere in memory is an O(1)
    > operation.

    The O notation is based on the independent variable(s) like n being
    able to take on unlimited, arbitrarily large values.

    In certain common situations we model elementary operations as being
    O(1), with certain justifications. That simplifies the analysis, while
    leaving its results still relevant and true for practical applications
    running on present hardware.

    The justification needs to be articulated though.

    > Note how that even works for the DNA DFAs, assuming 32 bit integers
    > and a byte addressed machine. With that model, you can fit 10 million
    > (1E7) states in the 2 billion byte (2GB) address space (each state
    > has four 4-byte entries, i.e. 16 bytes).

    A state occupies only four 4-byte entries if it doesn't express that
    it branches to other states on additional input symbols.

    A DFA state is a row in a table, not just the number which identifies/enumerates that row.

    If the input symbols are bytes, then the abstract row size is capped
    at 256 entries.

    (Larger symbol sets can exist (e.g. Unicode), but it's a reasonable
    assumption that the symbol set is finite. Still, it can be large.)
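
    (To make the row-vs-number distinction concrete, here is a minimal C
    sketch of a dense byte-alphabet state; purely illustrative:)

        /* A dense DFA state over a byte alphabet: one full row of the
           transition table, 256 next-state entries of 4 bytes each,
           i.e. 1 KB per state -- not just a 32-bit state number. */
        #include <stdint.h>

        typedef struct {
            uint32_t next[256];   /* indexed directly by the input byte */
        } byte_state;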

    If we have a DFA built from certain lexical rules, and we extend those
    rules, it could be the case that existing states blow up in size,
    because of having to branch to many more other states on new characters
    that were not included in the old rules.

    If a machine initially recognizes the set of strings { "foo", "bar" }
    then its state zero might have a condensed representation of two
    transitions on the symbols 'f' and 'b', such as a tiny two-element
    array that is subject to a linear search.

    If we extend the recognized set of strings, state 0 may have to branch
    on more symbols than just 'f' and 'b'. Depending on how we are
    representing states, the size of the object may change.
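
    (A hedged sketch of such a condensed representation; the fixed
    capacity and names are hypothetical, and a real implementation would
    grow or switch layouts as rules are added:)

        /* A sparse state: small parallel arrays of (symbol, next-state)
           pairs searched linearly. State 0 of the { "foo", "bar" }
           machine needs only two entries, for 'f' and 'b'. */
        #include <stdint.h>

        typedef struct {
            uint8_t  n;          /* number of live transitions */
            uint8_t  sym[8];     /* symbols with outgoing edges */
            uint32_t next[8];    /* corresponding next states */
        } sparse_state;

        /* Returns the next state, or UINT32_MAX on no matching edge. */
        static uint32_t step(const sparse_state *s, uint8_t c) {
            for (uint8_t i = 0; i < s->n; i++)
                if (s->sym[i] == c)
                    return s->next[i];
            return UINT32_MAX;
        }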

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    [No existing computer is actually Turing equivalent because every
    real computer has a finite memory, but we manage to do O(whatever)
    calculations anyway. I'm not sure how productive this line of
    argument is beyond noting what sorts of tasks are or aren't likely
    to fit in real computers. -John]

  • From Christopher F Clark@21:1/5 to Kaz Kylheku on Sun Mar 27 00:54:35 2022
    Kaz, I don't know whether you simply didn't read what I wrote or chose
    to ignore it.

    > In certain common situations we model elementary operations as being
    > O(1), with certain justifications. That simplifies the analysis,
    > while leaving its results still relevant and true for practical
    > applications running on present hardware.

    And, my point was that 2**32 is large enough to be considered
    arbitrarily large with respect to most DFAs. Not quite enough for the
    human genome; see the extended analysis below. Here was my first
    analysis.

    me > Note how that even works for the DNA DFAs, assuming 32 bit integers
    me > and a byte addressed machine. With that model, you can fit 10 million
    me > (1E7) states in the 2 billion byte (2GB) address space (each state
    me > has four 4-byte entries, i.e. 16 bytes).

    > A state occupies only four 4-byte entries if it doesn't express that
    > it branches to other states on additional input symbols.

    A state isn't 32 bits; a transition is. A transition (in a DFA) is
    just the address of another state. A state in a DFA for DNA has 4
    such transitions (one each for A, T, C, and G, the 4 DNA bases).
    Those are the only legal symbols in DNA. So a state is 4 * 4 bytes:
    each transition is 4 bytes and there are 4 of them per state.

    Thus, I explained what the justification is and how that allows more
    than 100 million states in the combined DFA. That's without any fancy
    tricks.

    Now, I just looked up the size of the human genome: it is 3 billion
    bases, so that's a little more than another order of magnitude bigger,
    and you definitely need slightly bigger integers (or to do some
    compression). 64 bit integers are overkill, but convenient to work
    with, and with those you could easily fit the genome in a 128 GB SSD,
    which would be addressable by 64 bit integers/pointers. And, for most
    other applications even 1 million states are more than sufficient.
    Thus, saying a transition can happen in O(1) time is a perfectly
    reasonable assumption. You can lay out nearly any DFA we expect to
    see in memory and directly address it. Thus, the RAM model holds.
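
    (A minimal sketch of the transition loop this models, reusing the
    dna_state layout sketched earlier; each step is one indexed load,
    hence O(1) per symbol under the RAM model:)

        /* Scan `len` input symbols (already mapped to 0..3 for
           A, C, G, T) through the DFA. One array access per symbol,
           so the whole scan is linear in the input length. */
        #include <stddef.h>
        #include <stdint.h>

        uint32_t scan(const dna_state *dfa, const uint8_t *in, size_t len)
        {
            uint32_t s = 0;                 /* start state */
            for (size_t i = 0; i < len; i++)
                s = dfa[s].next[in[i]];     /* O(1) transition */
            return s;                       /* state reached at end */
        }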

    By the way, with 40 bit integers you could fit it in 60GB (20 bytes
    per state and 3G states). We don't quite have laptops with that much
    DRAM memory yet.
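
    (A hedged sketch of one way 40 bit transitions might be packed; the
    little-endian assumption and the names are mine:)

        /* Pack four 40-bit (5-byte) next-state indices into a 20-byte
           state record. 40 bits addresses up to ~1E12 states, ample
           for the 3G states discussed above. */
        #include <stdint.h>
        #include <string.h>

        typedef struct { uint8_t bytes[20]; } packed_state; /* 4 * 5 bytes */

        static uint64_t get_next(const packed_state *s, unsigned base /*0..3*/)
        {
            uint64_t v = 0;
            memcpy(&v, &s->bytes[base * 5], 5); /* assumes little-endian host */
            return v;                           /* 40-bit state index */
        }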


  • From gah4@21:1/5 to Christopher F Clark on Sat Mar 26 19:32:17 2022
    On Saturday, March 26, 2022 at 4:42:55 PM UTC-7, Christopher F Clark wrote:

    (snip)

    > And, my point was that 2**32 is large enough to be considered
    > arbitrarily large with respect to most DFAs. Not quite enough for
    > the human genome; see the extended analysis below. Here was my
    > first analysis.

    About 24 years ago I was working with a DNA sequencing group, and was
    interested in speeding up this problem. The approach I was most
    interested in was special purpose hardware with many of the largest
    DRAMs I could find, arranged just to do this operation.

    (Note that you need one more bit, to indicate when a match is found.)

    There would be logic to read data off disk and pass it directly to
    the DFA array. There is, then, logic to store the offset into the
    disk file, and the state at which the hit occurred, to be read out
    later.

    But we went on to other projects, and I never got to build one.

    Since then, DRAM has gotten much larger, but so has the DNA database.

    Yes, the human genome is 3 gigabase, but the whole of GenBank is
    now about 16 terabase, including WGS (whole genome sequences).

  • From Christopher F Clark@21:1/5 to All on Sun Mar 27 15:02:16 2022
    @gah4: Ok, the state space for the whole GenBank is larger. And, if
    we ever get to the point where we can deal with epigenomic factors, we
    probably need extra coding space for that too. But, I still think it
    is likely that we will be in the range where we can make "disks" (or
    "disk arrays") large enough for them and use 64 bit integers to
    address them.

    And, wandering off into the space of hardware implementations: at
    Intel we were considering doing that circa 2010, so a bit later than
    you were. In particular, we were looking to see if we could adapt
    our technology to do the "FASTA" algorithm. Like you, we never
    brought that to market.

    But, I want to make a couple of slight points on the encoding of transitions.

    If one is using byte addressing and one always puts the states on
    aligned boundaries, one has a few low order bits in the address that
    will always be 0. You can tweak one of those to be your "match
    complete/final" bit and then mask it off when using the address to
    reach the next state. Alternatively, one keeps a separate list of
    "final states"; that is more common in implementations. The third
    choice is to have a state header with extra data, such as whether
    the state is final or not. The last two alternatives expand your
    memory footprint somewhat, but by a relatively minimal amount.
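
    (A minimal C sketch of the first trick; with 16-byte aligned states
    the low 4 address bits are zero, and bit 0 is stolen here. Names are
    illustrative:)

        /* States are 16-byte aligned, so the low 4 bits of a state
           address are always 0; use bit 0 as the "final" flag. */
        #include <stdint.h>

        #define FINAL_BIT ((uintptr_t)1)

        static inline int is_final(uintptr_t tagged) {
            return (int)(tagged & FINAL_BIT);     /* test the stolen bit */
        }

        static inline void *state_ptr(uintptr_t tagged) {
            return (void *)(tagged & ~FINAL_BIT); /* mask off before use */
        }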

    Now, if you don't use byte addressing, but use "word" addressing (a
    la most computers in the 1960s), you actually have shrunk your
    address requirement: you've moved those 0 bits from the bottom of
    the address to the top. Even more so if you use "state" addressing.
    Of course, if your hardware has byte addressable memory, you may
    have to convert those addresses into byte addresses before going to
    actual memory. But all of that doesn't involve extra memory lookups,
    just some small amount of integer/bit arithmetic, which on modern
    computers is essentially irrelevant and gets swamped by pipeline
    stalls waiting for the cache.
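
    (A sketch of the conversion being described, assuming the 16-byte
    states from earlier, so "state" addresses become byte addresses with
    one shift:)

        /* Convert a state number to a byte address: one shift and one
           add, no extra memory lookups (16-byte states => shift by 4). */
        #include <stdint.h>

        #define STATE_SHIFT 4   /* log2(16 bytes per state) */

        static inline uint64_t byte_addr(uint64_t base, uint64_t state_no) {
            return base + (state_no << STATE_SHIFT);
        }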

    Per your idea of keeping them on disk, you can "split" a 64 bit
    address into a "page address" and an "address within a page" and use
    the paging mechanism to bring in the relevant pages. I believe some
    "huge page" implementations actually do that to work well with TLBs
    and caches. Anyway, what you proposed is not far-fetched.
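
    (The split might look like this; the 2 MB huge-page size is an
    assumption for illustration:)

        /* Split a 64-bit state address into a page number and an
           offset within the page (2 MB pages assumed). */
        #include <stdint.h>

        #define PAGE_BITS 21                           /* 2 MB pages */
        #define PAGE_MASK ((UINT64_C(1) << PAGE_BITS) - 1)

        static inline uint64_t page_of(uint64_t addr)   { return addr >> PAGE_BITS; }
        static inline uint64_t offset_of(uint64_t addr) { return addr & PAGE_MASK; }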

    Beyond that there are lots of tricks one can play. We thought for a
    while of Huffman encoding the addresses and using relative pointers.
    If you are doing something like the Aho-Corasick algorithm, pointers
    that "match" lead further on in the machine, so as relative offsets
    they are always positive numbers, while "failure" pointers are the
    ones that point backward, and those point to a limited number of
    states, where absolute addressing makes sense. There is more you can
    do along those lines, but those are the basic steps.
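
    (A hedged sketch of such an encoding; the split between relative
    match edges and absolute failure edges is illustrative, not a
    description of any shipped design:)

        /* One possible Aho-Corasick edge encoding: match edges as
           positive forward offsets, failure edges as absolute indices
           into a small table of fallback states. */
        #include <stdint.h>

        typedef struct {
            uint32_t is_failure : 1;  /* 0 = relative match edge      */
            uint32_t value      : 31; /* forward offset, or fallback- */
        } edge;                       /*   state table index          */

        static uint64_t follow(uint64_t cur, edge e,
                               const uint64_t *fallback_states) {
            return e.is_failure ? fallback_states[e.value] /* absolute */
                                : cur + e.value;           /* relative */
        }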

    Now, in actual genes there are "network expressions" and "spacers".
    The network expressions can be segments that are repeated, and that
    leads to some extra backward links, but the number of them is also
    small (and those are relative and not failure links).

    ---------

    Still, none of this impacts the fact that, when doing analysis, you
    can still treat those memory references as O(1) and say your lexer
    is linear in the number of bytes processed (and that having more
    states doesn't impact that), because fundamentally reading from the
    memory is the bottleneck, and we are always talking about fixed
    memory images (RAM or "disks"; that is, it is a FINITE state
    machine), not reading from a Turing Machine tape.


  • From gah4@21:1/5 to Christopher F Clark on Sat Mar 26 19:45:48 2022
    On Saturday, March 26, 2022 at 4:42:55 PM UTC-7, Christopher F Clark wrote:

    (snip)

    > Now, I just looked up the size of the human genome: it is 3 billion
    > bases, so that's a little more than another order of magnitude
    > bigger, and you definitely need slightly bigger integers

    Note also that there are some larger genomes, such as the Japanese
    flower, Paris japonica, at about 150 gigabase.

    https://en.wikipedia.org/wiki/Paris_japonica

    We do like to think that we are the most special species, but it seems
    that in genome size, we aren't the winner.
    [Unless we have some new way to apply DFAs to genomes, this seems to
    be wandering away from our topic. -John]
