MicroCore Barrel Processor

    From Christopher Lozinski@21:1/5 to All on Sun Apr 30 22:34:07 2023
    Lorem Ipsum recommended a barrel processor. I now think he had the right idea.

    One problem is off-chip memory access. The CPU stalls while waiting for external memory. Reportedly Ting's EP32/24/16 stalls during a branch instruction; it has to go and find the code. The microCore takes two clock cycles for on-chip memory access.

    Okay, so just run a barrel processor, and suddenly there is lots of time for memory access, or for long math operations. The clock speed can be increased. All of the threads can access the same memory.
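    To make the latency-hiding argument concrete, here is a toy Python sketch of a strict round-robin barrel. The thread count and memory latency are assumed numbers, not microCore figures:

        # Toy model of a barrel processor: N hardware threads issue in strict
        # round-robin order, so each thread gets one issue slot every N cycles.
        # Any stall shorter than N-1 cycles is hidden behind the other threads.
        # (Thread count and latency are assumed numbers, purely illustrative.)

        N_THREADS = 8          # hypothetical number of barrel slots
        MEM_LATENCY = 6        # cycles an external access takes to return data

        def issue_slots(cycles):
            """Yield (cycle, thread) pairs for a strict round-robin barrel."""
            for cycle in range(cycles):
                yield cycle, cycle % N_THREADS

        # A thread that issues a load has its data back before its next slot
        # as long as MEM_LATENCY <= N_THREADS - 1, so the barrel never stalls.
        hidden = MEM_LATENCY <= N_THREADS - 1
        print(f"{N_THREADS} threads, {MEM_LATENCY}-cycle loads, stalls hidden: {hidden}")

        for cycle, thread in issue_slots(16):
            print(f"cycle {cycle:2d}: issue from thread {thread}")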

    The Parallax Propeller has 8 cores, each of which can access the large shared memory only once every 8 clock cycles. That means some of the 8 cores could be sitting idle at any point in time, waiting for memory, with multiple copies of the ALU sitting idle. A barrel processor, with a shared ALU, would make much more sense to me.
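    A back-of-the-envelope comparison of the two arrangements (a sketch with invented numbers; the Propeller's real hub timing is more involved than this):

        # Rough utilization comparison, using assumed numbers only:
        # 8 independent cores that each stall up to 7 cycles waiting for their
        # shared-memory slot, versus one ALU time-sliced across 8 barrel threads.

        HUB_WAIT = 7       # worst-case cycles a core waits for its memory slot
        MEM_EVERY = 4      # assume 1 in 4 instructions touches shared memory

        # 8 separate ALUs: each core alternates useful work and waiting.
        useful = MEM_EVERY                       # cycles of work per window
        per_core_util = useful / (useful + HUB_WAIT)
        print(f"8 separate ALUs, worst case: {per_core_util:.0%} busy each")

        # One shared barrel ALU: some thread issues on every cycle, so the
        # single ALU stays close to 100% busy in this idealized model.
        print("1 shared barrel ALU: ~100% busy (idealized)")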

    Mainstream CPUs are optimized for single-thread speed, so they have big cache memories. Better to have lots of threads running, each of which can afford to wait a while for external memory. Certainly for web crawlers the network delay is so large that one does not need the fastest threads; one needs lots of energy-efficient threads.
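    A quick illustration of why latency-bound work wants many slow threads rather than a few fast ones (the latency and crawl-rate numbers below are invented):

        # Little's law: requests in flight = throughput * latency.
        # For a crawler the per-request time is dominated by the network,
        # so throughput scales with thread count, not per-thread speed.
        # (Numbers are invented, purely illustrative.)

        network_latency_s = 0.2       # assume a 200 ms round trip per page
        target_pages_per_s = 500      # assumed crawl rate

        threads_needed = target_pages_per_s * network_latency_s
        print(f"need ~{threads_needed:.0f} fetches in flight")   # ~100
        # Making each thread 10x faster barely changes this; adding threads does.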

    I like how the microCore has a large number of instructions (84), plus 26 software traps. The instructions were selected based on many years of building real applications. Big, fat instructions; very energy-efficient computing. With so many instructions, and ignoring memory access and a few stack operations, there is less chance that two barrel cores will need the same instruction's hardware at the same time. And if they do conflict, one processor just pauses.

    I like that the microCore already exists. I can test my modules against their working modules.

    The microCore is an actively maintained project. This Saturday is the German language virtual Forth meetup. (I wish I were studying German instead of Italian!) I suspect the author will be there.

    Here is the Wikipedia page on barrel processors: https://en.wikipedia.org/wiki/Barrel_processor
    They claim that the XMOS chips are barrel processors. Reportedly XMOS evolved from the Transputer's Communicating Sequential Processes.

    Why was I not able to hear Lorem's advice the first time? I like the idea of a GA144-style systolic array. I can't wait for the 6 GHz chip to be released. But I do need a path forward, and this seems like a good way to go.

    I am sure that once again I got some things wrong in this post, but I am making progress. Your comments are most welcome.

    From Lorem Ipsum@21:1/5 to Christopher Lozinski on Mon May 1 07:49:10 2023
    On Monday, May 1, 2023 at 1:34:09 AM UTC-4, Christopher Lozinski wrote:
    > Lorem Ipsum recommended a barrel processor. I now think he had the right idea.

    > One problem is off-chip memory access. The CPU stalls while waiting for external memory. Reportedly Ting's EP32/24/16 stalls during a branch instruction; it has to go and find the code. The microCore takes two clock cycles for on-chip memory access.

    I've never designed a CPU for an FPGA which needed off-chip memory. If your design needs more memory than can fit on an FPGA, why are you designing your own CPU? Why not use one of the many, many CPUs available in the world?


    > Okay, so just run a barrel processor, and suddenly there is lots of time for memory access, or for long math operations. The clock speed can be increased. All of the threads can access the same memory.

    I don't think it works that way. When you talk of external memory, these days that almost universally means DRAM, which takes multiple clock cycles for a single memory access. That is why DRAM is used to transfer blocks of memory: once the several-clock-cycle penalty is paid to start the transaction, data is streamed on every clock cycle.
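    To put rough numbers on that (a sketch with an assumed setup latency, not any particular DRAM part):

        # Effective DRAM efficiency for a burst: pay a fixed setup latency once,
        # then stream one word per clock. (Assumed numbers; real DDR timing has
        # many more parameters than this.)

        SETUP_CYCLES = 20        # assumed activate + CAS latency, in clocks

        def efficiency(burst_words):
            """Fraction of clocks that actually move data."""
            return burst_words / (SETUP_CYCLES + burst_words)

        for burst in (1, 8, 64, 512):
            print(f"burst of {burst:3d} words: {efficiency(burst):5.1%} of clocks carry data")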

    It may be possible to overlap accesses to DRAM. It has been a long time since I looked at DRAM, so I'm not fresh on the details. But, the CPUs will stall when accessing external memory. That much I'm sure of.

    I don't get what you mean about "long math operations". The barrel processor is pipelined so as to share the calculations between processes, not to speed up the calculations. If the CPU were not pipelined at all, it would run at about the same speed as a single thread on the barrel processor. The single thread of the barrel processor gains no advantage from the pipelining.


    > The Parallax Propeller has 8 cores, each of which can access the large shared memory only once every 8 clock cycles. That means some of the 8 cores could be sitting idle at any point in time, waiting for memory, with multiple copies of the ALU sitting idle. A barrel processor, with a shared ALU, would make much more sense to me.

    The Parallax would be running each CPU at 8x the speed of the barrel processor, if everything else is the same, so this is a false comparison. Eight processors is not remotely like an 8-way barrel processor, so of course there will be significant advantages to using eight separate processors.


    > Mainstream CPUs are optimized for single-thread speed, so they have big cache memories. Better to have lots of threads running, each of which can afford to wait a while for external memory. Certainly for web crawlers the network delay is so large that one does not need the fastest threads; one needs lots of energy-efficient threads.

    If you think there are no energy-optimized, multithreaded processors, you are mistaken. Network servers are a huge part of the CPU business, with many designs targeting that market.


    > I like how the microCore has a large number of instructions (84), plus 26 software traps. The instructions were selected based on many years of building real applications. Big, fat instructions; very energy-efficient computing. With so many instructions, and ignoring memory access and a few stack operations, there is less chance that two barrel cores will need the same instruction's hardware at the same time. And if they do conflict, one processor just pauses.

    Large numbers of instructions mean lots more logic and most likely multiple clock cycles to run, again complicating implementation. The decode of a simple instruction set for a single-clock processor depends on the instruction word only (and possibly flags for conditionals). When you have multiple clocks per instruction, the decode now has to track the cycle number and add that to the decode logic. It also makes cycle counting for hard real-time code harder. Many multiple-clock-cycle CPUs have variable numbers of clocks for some instructions, which is even harder to count. Pipelined processors make the process much more difficult, along with pipeline stalls. Cache makes it virtually impossible, and speed is measured rather than calculated.

    With a processor where every instruction takes one clock cycle, in contrast, calculating speed is idiot simple: you just count the instructions. No stalls, no cache, no off-chip memory access. If you have all those things, why not buy a CPU?
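    For example (a toy sketch; the opcode names are made up, not microCore mnemonics):

        # Static timing for a CPU where every instruction takes exactly one
        # clock: the cycle count of a straight-line block is simply its length.
        # (Opcode names are invented for illustration.)

        block = ["lit 5", "lit 7", "add", "store", "branch loop"]

        cycles = len(block)     # one clock each; no stalls, no cache misses
        print(f"{cycles} instructions -> {cycles} clocks")

        # With multi-cycle or variable-cycle instructions you would need a
        # per-opcode (and sometimes per-operand) table instead, and pipelines
        # and caches make even that insufficient, which is the point above.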


    > I like that the microCore already exists. I can test my modules against their working modules.

    > The microCore is an actively maintained project. This Saturday is the German language virtual Forth meetup. (I wish I were studying German instead of Italian!) I suspect the author will be there.

    > Here is the Wikipedia page on barrel processors: https://en.wikipedia.org/wiki/Barrel_processor
    > They claim that the XMOS chips are barrel processors. Reportedly XMOS evolved from the Transputer's Communicating Sequential Processes.

    > Why was I not able to hear Lorem's advice the first time?

    It's just my experiences and biases. It may, or may not, fit your interests. To be honest, I'm not clear on your interests and goals. It's been a while since you last posted on that, and I don't recall. I think you said it was about education rather than an actual use for a processor.


    > I like the idea of a GA144-style systolic array. I can't wait for the 6 GHz chip to be released. But I do need a path forward, and this seems like a good way to go.

    If they are going to design a chip like the GA144, they need to at least provide some libraries of code to support the communications. Chuck Moore sat down to write some code for the chip, well after the chip was built, and found limitations in the
    comms. Why wasn't this explored *before* the chip was designed, so that the chip could have been built without this limitation? Add in the very limited memory size and you end up with a nearly pointless CPU. That's why it has found virtually no use in
    the world. It's a rare CPU that finds so few customers. It must have very serious flaws.


    > I am sure that once again I got some things wrong in this post, but I am making progress. Your comments are most welcome.

    Ok, now you have a few more of my thoughts, and here's one more.

    My last exploration was to try to minimize the impact of the stack on code design. By that I mean the stack-juggling instructions like DUP, SWAP, OVER, etc. Some people here will say that is the programmer's job, but I'm not just talking about what the programmer can do. I'm talking about the limitations of the stack vs. the flexibility of registers. Register-based machines have no instructions like DROP, for example.

    Some prototyping I did with pencil and paper (a spreadsheet actually) showed my example task could use 30% fewer instructions and run that much faster. That's a *huge* improvement. This would require more data paths in the stack and more logic to control it, but it would seem to be well worth it... as long as the instructions don't run 30% slower. lol
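    As a tiny illustration of where that kind of saving comes from, consider computing (a+b)*(a-b) on a stack machine versus a three-operand register machine. The instruction sequences below are hypothetical, not microCore code, and only show where the stack-juggling overhead goes:

        # Count instructions for (a+b)*(a-b) on a stack machine vs. a
        # three-operand register machine. The sequences are hypothetical and
        # just illustrate the stack-shuffling overhead.

        stack_code = ["over", "over", "+", "-rot", "-", "*"]   # ( a b -- result )
        register_code = ["add r3,r1,r2", "sub r4,r1,r2", "mul r3,r3,r4"]

        juggling = [op for op in stack_code if op in ("dup", "swap", "over", "-rot", "drop")]
        print(f"stack machine:    {len(stack_code)} instructions, {len(juggling)} of them just shuffle the stack")
        print(f"register machine: {len(register_code)} instructions, none of them shuffle")
        print(f"difference: {1 - len(register_code)/len(stack_code):.0%} fewer instructions in this toy case")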

    --

    Rick C.

    - Get 1,000 miles of free Supercharging
    - Tesla referral code - https://ts.la/richard11209
