• Why are there no many-core Forth FPGA CPUs?

    From Christopher Lozinski@21:1/5 to All on Sat Dec 31 00:03:17 2022
    There are many single-core Forth FPGA CPUs, but not yet any many-core CPUs. There is the Core 1 project, not yet quite ready: https://www.youtube.com/watch?v=KXjQdKBl7ag&t=1115s
    There is the GA144, inspirational, but it has too little memory.
    There is the Parallax P2 with Taqoz Forth, also a bit tight on memory.
    There is the 6 GHz project mentioned here, still not shipping.
    If a Forth core is so small, an obvious win is to put lots of them on an FPGA, but so far no one has done that.

    Am I the only one interested in using such a thing?
    Chris

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lorem Ipsum@21:1/5 to caloz...@gmail.com on Sat Dec 31 00:55:50 2022
    On Saturday, December 31, 2022 at 3:03:18 AM UTC-5, caloz...@gmail.com wrote:
    There are many single-core Forth FPGA CPUs, but not yet any many-core CPUs. There is the Core 1 project, not yet quite ready: https://www.youtube.com/watch?v=KXjQdKBl7ag&t=1115s
    There is the GA144, inspirational, but it has too little memory.
    There is the Parallax P2 with Taqoz Forth, also a bit tight on memory.
    There is the 6 GHz project mentioned here, still not shipping.
    If a Forth core is so small, an obvious win is to put lots of them on an FPGA, but so far no one has done that.

    Am I the only one interested in using such a thing?
    Chris

    I never know what people actually mean when they talk about "core", so I'll ignore that. I believe you are asking why, while there are so many stack processor designs for FPGAs, there are almost no available stack processors in the form of an ASIC.

    The Parallax P2 is not a stack processor. I believe it is just a CPU with a Forth program like any other processor. I don't know if you will ever see the 6 GHz stack processor in a usable form. This is almost certainly intended for proprietary work, rather than a general-purpose CPU.

    You can put as many stack processors on an FPGA as you wish. That's the beauty of FPGAs. But why would you have a need for that? If you need it, why can't you do that yourself?

    --

    Rick C.

    - Get 1,000 miles of free Supercharging
    - Tesla referral code - https://ts.la/richard11209

  • From Christopher Lozinski@21:1/5 to All on Sat Dec 31 06:01:04 2022
    If you need it, why can't you do that yourself?
    Sadly I am not yet an FPGA designer. But I am looking into going back to school to learn how to do this.

    Yes, I am aware that the Parallax is not a stack processor.

    And no, I am not asking why there are no stack processors as ASICs.
    I am looking for multiple stack processors on a single FPGA.

    But your inability to understand my question is very interesting. As if the idea of many Forth CPUs on an FPGA were nutty, something that almost no one would even consider doing.

    That was helpful. Thank you.

  • From Lorem Ipsum@21:1/5 to caloz...@gmail.com on Sat Dec 31 11:39:02 2022
    On Saturday, December 31, 2022 at 9:01:06 AM UTC-5, caloz...@gmail.com wrote:
    If you need it, why can't you do that yourself?
    Sadly I am not yet an FPGA designer. But I am looking into going back to school to learn how to do this.

    Yes, I am aware that the Parallax is not a stack processor.

    And no, I am not asking why there are no stack processors as ASICs.
    I am looking for multiple stack processors on a single FPGA.

    But your inability to understand my question is very interesting. As if the idea of many Forth CPUs on an FPGA were nutty, something that almost no one would even consider doing.

    That was helpful. Thank you.

    If you had a design for many processors on an FPGA, what would you do with it?

    I think that is the reason there are no stack multiprocessors for FPGAs. There's no real need for them.

    I actually have an idea for a stack processor that, rather than running very fast (200+ MIPS), would run like eight independent processors at perhaps 25 or 30 MIPS each. The only reason this would be useful is to use the independent processors to run separate tasks, rather than trying to run multiple tasks on a single processor. Then there is no context switching to slow down the processor. The trick will be to provide a complete change of context on every clock cycle as the design cycles through the processors, without incurring additional delay.

    As an FPGA designer, I seldom deal with issues of running multiple processes on a single processor. In an FPGA, when you have multiple processes, you assign multiple processors. I'm using "process" for any computation and "processor" for any hardware required to implement the process. So logic that calculates a sum and logic that chooses between two values would be an adder and a multiplexer, separate and operating in parallel.

    This may seem simple and obvious, but it is very alien in the software world where processors typically are complex and run fast, so are shared across many processes.
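    The time-sliced scheme described above can be sketched in software (a hypothetical toy model, not any actual CPU design): one execution unit issues one instruction per clock, rotating through per-context program counters and stacks, so the context changes every cycle with zero switching overhead.

    ```python
    # Toy model of a barrel-style stack processor: one execution unit,
    # time-sliced across several contexts, one instruction per "clock".
    # All names are invented for illustration.

    class Context:
        def __init__(self):
            self.pc = 0          # per-context program counter
            self.stack = []      # per-context data stack (a small RAM in hardware)

    def run(programs, cycles):
        """Interleave the programs, issuing one instruction per clock."""
        ctxs = [Context() for _ in programs]
        for clock in range(cycles):
            i = clock % len(programs)       # rotate to the next context
            ctx, prog = ctxs[i], programs[i]
            if ctx.pc < len(prog):
                op, arg = prog[ctx.pc]
                if op == "push":
                    ctx.stack.append(arg)
                elif op == "add":
                    b, a = ctx.stack.pop(), ctx.stack.pop()
                    ctx.stack.append(a + b)
                ctx.pc += 1
        return [c.stack for c in ctxs]

    # Two tiny tasks; each effectively runs at 1/2 of the issue rate here,
    # 1/8 with eight contexts.
    progs = [[("push", 1), ("push", 2), ("add", None)],
             [("push", 10), ("push", 20), ("add", None)]]
    results = run(progs, 6)   # results == [[3], [30]]
    ```

    No task ever waits for a context switch; the "switch" is just the rotating index.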

    --

    Rick C.


  • From Lorem Ipsum@21:1/5 to Matthias Koch on Sat Dec 31 14:13:16 2022
    On Saturday, December 31, 2022 at 4:49:24 PM UTC-5, Matthias Koch wrote:
    I am doing Forth-on-stackprocessor-on-FPGA (Mecrisp-Ice) both for work and for fun, and the reason is simple: Because in FPGA one has dedicated logic for complex peripheral IO. A traditional manycore microcontroller is used to get timing right on
    multiple interfaces or run two timing-critical tasks in parallel, but in an FPGA, the Forth is usually only orchestrating the various peripherals that work standalone otherwise.

    Nevertheless, it would be perfectly possible to do if a need arises.

    Traditional processors use a fair amount of logic and implement complex instructions that take multiple clock cycles. One of the big advantages of a stack processor is that it is often also a MISC (minimal instruction set computer). My CPU designs use one clock cycle per instruction, which greatly facilitates context switching.

    So, most MCUs have few CPUs but lots of peripherals, as peripherals tend to be smaller. But if your processing requirements are not high, you can interleave multiple virtual processors on the same CPU hardware. This requires adding only a small amount of hardware to facilitate the context switching, compared to multiple processors or the many peripherals. Essentially, each register needs to become a small RAM: the TOS (assuming it is actually a register), possibly the NOS, and the stack pointers. Most FPGAs support register files in the logic-element LUTs, so that is a small cost.

    So, along the lines of the GA144 or the Propeller, multiple processors can implement many peripherals, but without adding significant logic.
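    A minimal sketch of the "each register becomes a small RAM" point, assuming a TOS/NOS-register stack machine (all names hypothetical): the per-context copies of TOS, NOS and the stack pointer are just N-entry arrays indexed by the active context.

    ```python
    # Each "register" of the stack CPU is replicated per context, i.e. it
    # becomes a small N-entry RAM indexed by the active context number.
    # Hypothetical illustration, not any actual Mecrisp-Ice/GA144 design.

    N = 8                                    # interleaved virtual processors
    tos = [0] * N                            # top-of-stack register, one per context
    nos = [0] * N                            # next-on-stack register, one per context
    dsp = [0] * N                            # data-stack pointer, one per context
    dstack = [[0] * 16 for _ in range(N)]    # per-context memory stack

    def dup(ctx):
        """Forth DUP for context `ctx`: TOS is copied, old NOS spills to memory."""
        dstack[ctx][dsp[ctx]] = nos[ctx]     # spill old NOS to the memory stack
        dsp[ctx] += 1
        nos[ctx] = tos[ctx]                  # TOS value is now also in NOS
    ```

    In hardware these arrays would map onto the LUT-based register files mentioned above, so switching contexts is just changing the index.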

    --

    Rick C.


  • From Matthias Koch@21:1/5 to All on Sat Dec 31 22:49:22 2022
    I am doing Forth-on-stackprocessor-on-FPGA (Mecrisp-Ice) both for work and for fun, and the reason is simple: Because in FPGA one has dedicated logic for complex peripheral IO. A traditional manycore microcontroller is used to get timing right on
    multiple interfaces or run two timing-critical tasks in parallel, but in an FPGA, the Forth is usually only orchestrating the various peripherals that work standalone otherwise.

    Nevertheless, it would be perfectly possible to do if a need arises.

    Happy new year,
    Matthias

  • From Theo@21:1/5 to Christopher Lozinski on Tue Jan 3 11:31:43 2023
    Christopher Lozinski <calozinski@gmail.com> wrote:
    There are many single-core Forth FPGA CPUs, but not yet any many-core CPUs. There is the Core 1 project, not yet quite ready: https://www.youtube.com/watch?v=KXjQdKBl7ag&t=1115s
    There is the GA144, inspirational, but it has too little memory.
    There is the Parallax P2 with Taqoz Forth, also a bit tight on memory.
    There is the 6 GHz project mentioned here, still not shipping.
    If a Forth core is so small, an obvious win is to put lots of them on an FPGA, but so far no one has done that.

    What do you plan to use this for?

    One of the challenges with FPGA design concerns memory bandwidth. You can
    have a tightly coupled local memory to a small core, and that scales with
    the number of cores you lay down. But those memories are only of the order
    of kilobytes.

    If you want to go with off-chip memories, you contend for bandwidth. Everything has to share the same 16/32/64-bit memory interface, which can only be accessed by one core at a time. DRAM also has a lot of latency, so it's slow to switch from one address to another. This means a lot of small cores making small requests to a DRAM is quite inefficient.

    The solution to the DRAM problem is either to go wider (vector style) or to
    use caches, but they come at a cost. You also need a memory interconnect
    that connects all your cores to the memory, and that takes area.

    Another option is not to have memory interconnect but just a lot of communicating cores, and have them pass message through other cores to
    access the DRAM. This only works if DRAM accesses are rare.

    One model is the 'systolic array' where a core only needs to communicate
    with its neighbours. This is fine for a 2D problem that maps nicely to a 2D chip, but as soon as you go to more dimensions the point to point wiring
    gets complicated. The solution to that is a network rather than point to
    point wiring, and we're now back to the interconnect question.

    Having small cores doesn't really help here, because now you're doing less compute in the cores, but with more cores you need more interconnect overhead.
    It makes sense to spend area on bigger (wider) cores and proportionately less area on interconnect.

    So my question would be: what applications would fit a sea of small cores
    with small local memories, but little inter-core communication?
    Are there any which aren't currently served by existing hardware, and for
    which a tiled Forth core would beat a tile of simple RISC (eg RISC-V) cores?

    Theo

  • From minforth@arcor.de@21:1/5 to Theo on Tue Jan 3 06:44:58 2023
    Theo wrote on Tuesday, January 3, 2023 at 12:31:48 PM UTC+1:
    [..]

    So my question would be: what applications would fit a sea of small cores with small local memories, but little inter-core communication?
    Are there any which aren't currently served by existing hardware, and for which a tiled Forth core would beat a tile of simple RISC (eg RISC-V) cores?


    Good points. Thinking about RISC-V arrays, there is already a plethora of development tools available, even some QEMU emulations.

    IOW the tool stack for a hypothetical Forth CPU array would have to be developed as well, including design decisions on whether vector instructions would make sense to broaden the range of use cases.

  • From Marcel Hendrix@21:1/5 to Theo on Tue Jan 3 10:22:24 2023
    On Tuesday, January 3, 2023 at 12:31:48 PM UTC+1, Theo wrote:
    Christopher Lozinski <caloz...@gmail.com> wrote:
    [..]
    What do you plan to use this for?

    One of the challenges with FPGA design concerns memory bandwidth. You can have a tightly coupled local memory to a small core, and that scales with
    the number of cores you lay down. But those memories are only of the order
    of kilobytes.
    [..]
    So my question would be: what applications would fit a sea of small cores with small local memories, but little inter-core communication?
    Are there any which aren't currently served by existing hardware, and for which a tiled Forth core would beat a tile of simple RISC (eg RISC-V) cores?

    Thanks Theo, very enlightening!

    Up to now I was toying with the idea of eventually putting my algorithm
    in an FPGA, using a few hundred cores that all worked on a tiny part of
    the problem. Running the numbers with your scheme, I come up with 24
    bytes/10ns (the other data + code can be kept locally in say 10 kbytes),
    so with 100 cores I'd need 240GB/s memory bandwidth (3 x that of an
    AMD 7950X) and 1MB on chip ... IOW the problem is hopelessly I/O
    bound.

    I will have to wait for a few more FPGA/memory generations, or buy an
    A100 (2TB memory bandwidth $32,097.00) instead.
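    Marcel's numbers can be reproduced with straightforward arithmetic (assuming, as stated, 24 bytes per core every 10 ns and roughly 10 kbytes of local code + data per core):

    ```python
    # Back-of-envelope check of the bandwidth estimate above.
    bytes_per_access = 24
    cycle_s = 10e-9          # 10 ns per cycle
    cores = 100
    local_kb = 10            # per-core code + data kept on chip

    per_core = bytes_per_access / cycle_s     # ~2.4 GB/s per core
    total = per_core * cores                  # aggregate off-chip bandwidth
    on_chip = cores * local_kb * 1024         # total local memory in bytes

    # total ≈ 240 GB/s, on_chip ≈ 1 MB, matching the figures in the post.
    ```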

    -marcel

  • From Lorem Ipsum@21:1/5 to Marcel Hendrix on Tue Jan 3 11:47:27 2023
    On Tuesday, January 3, 2023 at 1:22:25 PM UTC-5, Marcel Hendrix wrote:
    [..]
    Thanks Theo, very enlightening!

    Up to now I was toying with the idea of eventually putting my algorithm
    in an FPGA, using a few hundred cores that all worked on a tiny part of
    the problem. Running the numbers with your scheme, I come up with 24 bytes/10ns (the other data + code can be kept locally in say 10 kbytes),
    so with 100 cores I'd need 240GB/s memory bandwidth (3 x that of an
    AMD 7950X) and 1MB on chip ... IOW the problem is hopelessly I/O
    bound.

    I will have to wait for a few more FPGA/memory generations, or buy an
    A100 (2TB memory bandwidth $32,097.00) instead.

    Don't expect FPGA memory to increase significantly in any reasonable amount of time. 1 MB of SRAM sucks down power like you wouldn't believe, even at ever smaller feature sizes.

    If you need three times the data rate of existing memory, why can't you use three memories? They've done that on CPUs for some time now, two memory buses for higher throughput.

    If you need lots of CPUs and lots of memory bandwidth, have you looked at using a graphics processor board? You will get a lot more than 100 CPUs.

    --

    Rick C.


  • From Marcel Hendrix@21:1/5 to Theo on Tue Jan 3 14:08:14 2023
    On Tuesday, January 3, 2023 at 10:30:55 PM UTC+1, Theo wrote:
    Marcel Hendrix <m...@iae.nl> wrote:
    [..]
    so with 100 cores I'd need 240GB/s memory bandwidth (3 x that of an
    AMD 7950X) and 1MB on chip ... IOW the problem is hopelessly I/O
    bound.

    Note that with 240GB/s I meant Gbytes/s, not Gbits/s.

    And I am not an Underpants Gnome :--)

    Another option is HBM: the Agilex M parts can do up to 820Gbits/s (102Gbytes/s) and that's a lot easier to use than DRAM.

    Still a factor 3 too low, but nice to know!
    GPUs like the A100 seem to be able to do it but at what cost...

    However, in either case your application will have to be built around maximising the DRAM bandwidth, and you'll have to do whatever it takes to
    get that. You'll be micromanaging everything to get peak DRAM performance.

    I agree that the problem is quite different from what I expected before
    reading your message: It is all in the amount and speed of RAM and
    com links, not in tricky new processor ideas.

    I don't see Forth being anywhere near being optimal.

    I am working on SPICE hardware simulation. This software is stuck in
    a rut because they want/need to stay compatible with the greybeards and
    their dusty decks. With iSPICE (SPICE in iForth) I can prove (demonstrate)
    that it is possible to write a competitive program (10 - 100 times faster
    than other packages) in a relatively short amount of time. This would not
    have been possible without using some of the essential Forth ingredients,
    not least the Forth philosophy.

    Plug: a while ago we had a paper on this - not a Forth CPU but custom logic, and compared it with vector processing for maximising DRAM bandwidth. Many
    of the same issues apply though: https://www.cl.cam.ac.uk/~atm26/pubs/FPL2013-BlueVec.pdf

    Reading it.

    [.. refreshingly level-headed arguments skipped ..]

    I think it's one of those things that seems 'obvious' until you discover what you have to do in addition to make the 'obvious' thing work, which makes it a lot less attractive.

    Thanks for showing that one of the vague ideas I have is certain to fail.

    Unfortunately there are also things that seem 'impossible' until you just start doing them.

    -marcel

  • From Theo@21:1/5 to Marcel Hendrix on Tue Jan 3 21:30:50 2023
    Marcel Hendrix <mhx@iae.nl> wrote:
    Up to now I was toying with the idea of eventually putting my algorithm
    in an FPGA, using a few hundred cores that all worked on a tiny part of
    the problem. Running the numbers with your scheme, I come up with 24 bytes/10ns (the other data + code can be kept locally in say 10 kbytes),
    so with 100 cores I'd need 240GB/s memory bandwidth (3 x that of an
    AMD 7950X) and 1MB on chip ... IOW the problem is hopelessly I/O
    bound.

    I will have to wait for a few more FPGA/memory generations, or buy an
    A100 (2TB memory bandwidth $32,097.00) instead.

    You can just about get there with DDR5: 5 GT/s, 64 bits wide, 320 Gbits/s. If you're using an FPGA you can also use the ECC bits, so that's 72 bits per channel. Four channels give 180 Gbytes/s.
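    The channel arithmetic works out as follows (assuming DDR5 at 5 GT/s and 72-bit ECC channels, as above):

    ```python
    # DDR5 channel bandwidth arithmetic from the post above.
    transfers_per_s = 5e9        # 5 GT/s
    bits_ecc = 72                # 64 data bits + 8 ECC bits, usable on an FPGA

    per_channel = transfers_per_s * bits_ecc / 8   # bytes/s per channel
    four_channels = 4 * per_channel

    print(per_channel / 1e9)     # 45.0  (GB/s per 72-bit channel)
    print(four_channels / 1e9)   # 180.0 (GB/s across four channels)
    ```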

    Another option is HBM: the Agilex M parts can do up to 820Gbits/s
    (102Gbytes/s) and that's a lot easier to use than DRAM.

    However, in either case your application will have to be built around maximising the DRAM bandwidth, and you'll have to do whatever it takes to
    get that. You'll be micromanaging everything to get peak DRAM performance.
    I don't see Forth being anywhere near being optimal.

    Plug: a while ago we had a paper on this - not a Forth CPU but custom logic, and compared it with vector processing for maximising DRAM bandwidth. Many
    of the same issues apply though: https://www.cl.cam.ac.uk/~atm26/pubs/FPL2013-BlueVec.pdf


    On Forth hardware more generally, it seems a little like the Underpants
    Gnomes:
    1. Build a Forth chip
    2. ????
    3. Profit!

    and step 2 is never quite clear.

    I'm not convinced by the massive array pitch for the reasons above.

    The size argument (for a single processor) overlooks the fact that a chip needs a certain minimum area in square mm just for the bond pads (which connect the silicon to the outside world), so you already have enough mm2 to fit a small conventional (RISC etc.) CPU in any silicon process that isn't seriously antique.

    (another point is that chips are built on silicon wafers which are then cut into rectangular dice. If the size of your chip is of the order of the kerf
    of the cutter, you're wasting more area in cuts than you have in usable silicon. So there's another practical limit of how small you can go)

    The one remaining selling point is power. I could sorta see there might be
    an argument that a tiny CPU would have the absolute minimum number of transistors switching, and that could matter in some super low power applications. But again there's the overhead of whatever it's trying to
    *do* (ie sense/actuate/communicate), and that may end up dwarfing the power taken by the compute.

    I think it's one of those things that seems 'obvious' until you discover what you have to do in addition to make the 'obvious' thing work, which makes it a lot less attractive.

    Theo

  • From Anton Ertl@21:1/5 to Marcel Hendrix on Wed Jan 4 07:30:24 2023
    Marcel Hendrix <mhx@iae.nl> writes:
    Up to now I was toying with the idea of eventually putting my algorithm
    in an FPGA, using a few hundred cores that all worked on a tiny part of
    the problem. Running the numbers with your scheme, I come up with 24 bytes/10ns (the other data + code can be kept locally in say 10 kbytes),
    so with 100 cores I'd need 240GB/s memory bandwidth (3 x that of an
    AMD 7950X) and 1MB on chip ... IOW the problem is hopelessly I/O
    bound.

    How about using a shared cache? I found the following numbers on <https://www.hardwaretimes.com/amd-ryzen-9-7900x-delivers-nearly-50-more-cache-bandwidth-than-the-12th-gen-intel-core-cpus-leak/>:

    |The 12-core chip [Ryzen 7900X) recorded a [L3 cache] memory read and
    |write bandwidth of 1,494.8 GB/s and 1,445.7 GB/s,
    |respectively. Meanwhile, the memory copy speed peaks at 1,476.6 GB/s
    |with the latency coming up to 10,1ns.

    If you can organize the computation such that it stays mostly in the
    cache for single-core implementation, it should be possible to stay in
    the cache for multi-core implementation, no? Yes, there is some
    additional buffering necessary, because the next core does not pick up
    the data immediately, but with cache sizes of 32MB and more, you can
    afford ~0.1ms of average slack given the data rate you mentioned.
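    The slack estimate follows directly from the cache size and the 240 GB/s figure quoted earlier in the thread:

    ```python
    # How long 32 MB of L3 can buffer data at Marcel's quoted rate.
    cache_bytes = 32e6           # 32 MB of L3 cache
    data_rate = 240e9            # 240 GB/s aggregate data rate

    slack_s = cache_bytes / data_rate   # ≈ 1.3e-4 s, i.e. ~0.13 ms of headroom
    ```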

    As for FPGAs, they immediately give you a slowdown and power
    disadvantage by about a factor of 10 over full-custom silicon; and
    that includes software on the full-custom silicon, if the operations
    on the full-custom silicon match the problem (e.g., if you do a
    sequence of FP adds in software, then a sequence of FP adds on an FPGA
    is going to suffer the slowdown even if you perform the sequencing in
    (FPGA) hardware rather than in software). FPGA wins only if software
    would require ~10 times more steps, e.g., for some bit-swizzling code
    like some crypto or encoding/decoding operations (and for the more
    popular of those, CPUs or GPUs get full-custom hardware).

    If you can split your problem into hundreds of small parallel tasks,
    you can make good use of multi-core CPUs, possibly with a custom
    scheduler for each core/thread.

    I don't know enough about your problem and GPUs to comment on whether
    GPUs are useful for it. The general impression I have about GPUs is
    that they are good for doing the same thing to a lot of data.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

  • From Marcel Hendrix@21:1/5 to Anton Ertl on Wed Jan 4 05:03:14 2023
    On Wednesday, January 4, 2023 at 9:20:33 AM UTC+1, Anton Ertl wrote:
    Marcel Hendrix <m...@iae.nl> writes:
    [..]
    If you can organize the computation such that it stays mostly in the
    cache for single-core implementation, it should be possible to stay in
    the cache for multi-core implementation, no? Yes, there is some
    additional buffering necessary, because the next core does not pick up
    the data immediately, but with cache sizes of 32MB and more, you can
    afford ~0.1ms of average slack given the data rate you mentioned.

    Circuit simulation can't be done completely in cache because we want to see the results (typically N * 3 doubles per 10 ns cycle, with N the number of cores/devices). However, this could be minimized to a number of *significant* results, typically 3..10 instead of N for everything.

    If you can split your problem into hundreds of small parallel tasks,
    you can make good use of multi-core CPUs, possibly with a custom
    scheduler for each core/thread.

    Yes, and it would be more convenient to develop for (and a lot cheaper).

    I don't know enough about your problem and GPUs to comment on whether
    GPUs are useful for it. The general impression I have about GPUs is
    that they are good for doing the same thing to a lot of data.

    For a considerable number of cycles that is about it, but an
    input, event, or system state may trigger transition to different
    system matrices, of which there can be 2^M with M the number
    of switches in the circuit. The 2^M is much larger than the actual
    number of physically possible states, which is why I have an
    LRU cache of the system matrices.
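    Such an LRU cache of system matrices might be sketched like this (a hypothetical illustration, not iSPICE's actual implementation): the cache is keyed by the circuit's switch-state bit pattern, with capacity far below 2^M since few states are physically reachable.

    ```python
    from collections import OrderedDict

    # Hypothetical LRU cache of (factorized) system matrices, keyed by the
    # M-bit switch-state pattern of the circuit.

    class MatrixCache:
        def __init__(self, capacity, build):
            self.capacity = capacity
            self.build = build            # builds/factorizes the matrix for a state
            self.cache = OrderedDict()    # insertion order tracks recency

        def get(self, switch_state):
            if switch_state in self.cache:
                self.cache.move_to_end(switch_state)   # mark most recently used
                return self.cache[switch_state]
            matrix = self.build(switch_state)          # miss: build and insert
            self.cache[switch_state] = matrix
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)         # evict least recently used
            return matrix
    ```

    Only the states actually visited ever get built, so the 2^M worst case never materializes.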

    -marcel
