• 8 Forth Cores for Real Time Control

    From Christopher Lozinski@21:1/5 to All on Sat Apr 1 04:25:39 2023
    Okay, no one was too excited about my previous proposal, so I am proposing something else.

    For my master's thesis at Silesian University of Technology, I am now considering building an 8-core Forth CPU for real-time control. Hard real-time control on a single CPU is difficult; it is much better to allocate one CPU to controlling each signal.

    This would be a bit like the Parallax Propeller 2, and a bit different. That device uses register machines; this would be based on stack machines.
    That device emulates Forth. This would have native Forth instructions.
    In that device, each core has:
    512 longs of dual-port register RAM for code and fast variables;
    512 longs of dual-port lookup RAM for code, streamer lookup, and variables;
    access to (1M?) hub RAM every 8 clock cycles.
    Pairs of Parallax cores can access some of their neighbors' registers.

    I would like to make it 8 proper Forth CPUs rather than register machines. Rather than a big central hub memory, I would like each core to have more memory. How much? With dual-port memories, they could each share 2 * (1/8)th of the memory: (1/4) of the total memory. Not bad.

    I would like communication between adjacent cores. I like how the GA144 allows neighboring cores to communicate. That seems important to me.

    I wonder if this would be of interest to anyone?

    There is a good chance that I would do this in cooperation with the AI and Robotics guys and their CORE 1 CPU. Ting's ep16/24/32 are also interesting. There are a bunch of other cores I need to evaluate as well. Everyone speaks well of the J1.

    In other news, school is going well. I am really impressed with the education here. As a software developer, I completely misunderstood how to write Verilog. If I had tried it, it would have been a disaster. One software developer famously used
    nested Verilog while loops to generate a slow clock pulse. I strongly advise any developer considering designing a chip to get educated in digital design first.

    Alternatively, you can do something like use the Intel design tools to lay out components and their connectivity. I am sure that there are other such tools out there. But starting with Verilog, or even VHDL, as a software developer is bound to lead
    to an endless stream of problems.

    Warm Regards
    Christopher Lozinski


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lorem Ipsum@21:1/5 to Christopher Lozinski on Sat Apr 1 10:01:26 2023
    On Saturday, April 1, 2023 at 7:25:42 AM UTC-4, Christopher Lozinski wrote:
    Okay, no one was too excited about my previous proposal, so I am proposing something else.

    For my master's thesis at Silesian University of Technology, I am now considering building an 8 core Forth CPU for real time control. Hard real time control on a single cpu is difficult, much better to allocate one cpu to controlling each signal.

    This would be a bit like the Propeller Parallax 2, and a bit different. That device uses register machines, this would be based on stack machines.
    That device emulates Forth. This would have native Forth instructions.
    In that device, each core has
    512 longs of dual-port register RAM for code and fast variables;
    512 longs of dual-port lookup RAM for code, streamer lookup, and variables;
    Access to to (1M?) hub RAM every 8 clock cycles.
    Pairwise Parallax cores can access some of their neighbors registers.

    I would like to make it 8 proper Forth CPU's rather than register machines.

    It has been mentioned elsewhere that multiple CPUs can share the same hardware by the use of pipelining. In this case, rather than have 8 processors, one processor would have an 8-deep pipeline, each virtual processor having one phase of the pipeline.
    One advantage is that there is no need to consider the impacts of flow control on the pipeline, since from the perspective of the virtual processors, there is no pipeline.

    The only disadvantage is that the pipelined processor is unlikely to run 8 times as fast as 8 distinct processors, but you should be able to achieve clock speedups of 4 or so.
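    The interleaving described here can be sketched in a toy Python model: one execution unit serves 8 virtual cores round-robin, so each core sees a plain, non-pipelined machine. The `VirtualCore` class, its tiny instruction set, and the programs are illustrative assumptions, not any real design:

```python
# Toy model of the "barrel" scheme: one physical execution unit,
# 8 virtual cores, one instruction retired per clock, round-robin.
# Instruction set and core count are hypothetical, for illustration only.

class VirtualCore:
    def __init__(self, program):
        self.pc = 0
        self.stack = []
        self.program = program  # list of (op, arg) tuples

    def step(self):
        if self.pc >= len(self.program):
            return  # program finished; core idles in its slot
        op, arg = self.program[self.pc]
        self.pc += 1
        if op == "push":
            self.stack.append(arg)
        elif op == "add":
            b, a = self.stack.pop(), self.stack.pop()
            self.stack.append(a + b)

def run_barrel(cores, clocks):
    # One pipeline: on each clock exactly one virtual core advances,
    # so a core never sees its own instructions overlap.
    for clock in range(clocks):
        cores[clock % len(cores)].step()

progs = [[("push", i), ("push", 1), ("add", None)] for i in range(8)]
cores = [VirtualCore(p) for p in progs]
run_barrel(cores, 24)  # 24 clocks = 3 instructions per core
print([c.stack[-1] for c in cores])  # each core computed i + 1
```

    Because each virtual core only advances every 8th clock, there are no instruction interactions within a core, which is the "no pipeline from the virtual processor's perspective" point above.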


    Rather than a big central hub memory, I would like each core to have more memory. How much? With two port memories, they could each share 2 * (1/8)th of the memory. (1/4) of the total memory. Not bad.

    You need to consider the applications to know how large the memory needs to be. Another advantage of the pipelined CPU is that if memory access is limited to one phase, every virtual processor has full access to the entire memory. Since FPGA memories
    are very fast, this should not be a bottleneck.


    I would like communication between adjacent cores. I like how the GA144 allows neighboring cores to communicate. That seems important to me.

    "Seems important" is not an engineering evaluation. The communications in the GA144 is one of its weaknesses. To communicate between two arbitrary processors requires the message to be routed through other processors. With only 8 processors, much
    better would be the use of shared memory. This allows the exchange of data between *any* two processors, or even between them all.

    The GA144 idea was to provide for an arbitrary number of CPUs to communicate. But even that scheme does not scale, because it works poorly for even small arrays.


    I wonder if this would be of interest to anyone?

    There is a good chance that I would do this in cooperation with the AI and Robotics guys and their CORE 1 CPU.

    I'm not familiar with this CPU. Is it multiple processors?


    Ting's ep16/24/32 are also interesting.

    What in particular is interesting about them?


    There are a bunch of other cores I need to evaluate as well. Everyone speaks well of the J1.

    In other news, school is going well. I am really impressed with the education here. As a software developer, I completely misunderstood how to write Verilog.

    Yes, HDL stands for Hardware Description Language. It's not really the same as writing most software that is purely functional. The code in an HDL has to be written so it can be efficiently translated into simple hardware elements. I have a hardware
    background, so I tend to think in terms of the hardware first, then figure out what HDL will give me that.

    I once helped a software guy learn to program in VHDL. He had a very large chip and virtually no performance limitations. He just needed to get a demo running. He wrote code like it was software, and got it running. I was impressed. So, I no longer
    tell software people to forget everything they know, just half of it. lol


    If I had tried it, it would have been a disaster One software developer famously used nested verilog while loops to generate a slow clock pulse. I strongly advise any developer considering designing a chip, to get educated in digital design first.

    Are you talking about designing a custom CHIP? That's a whole different animal from an FPGA design. I recommend that you work with FPGAs for the first ten or twenty iterations of your design. Maybe 100. Designing a chip requires a lot more attention
    to detail and a lot more details to pay attention to. Get someone who has done chip design to assist you.


    Alternatively, you can do something like use the Intel design tools, to lay out components and their connectivity. I am sure that there are other such tools out there. But starting with Verilog, or even vhdl, for a software developer, is bound to come
    to an endless stream of problems.

    I'm not sure what design tools you are talking about. I'm not aware that anyone supports anything other than HDL. I guess there might be some schematic capture for the top level design, but these are very seldom used, since they make version control
    much more difficult. How do you do a diff on a schematic?

    --

    Rick C.

    - Get 1,000 miles of free Supercharging
    - Tesla referral code - https://ts.la/richard11209

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christopher Lozinski@21:1/5 to Lorem Ipsum on Sat Apr 1 11:04:47 2023
    On Saturday, April 1, 2023 at 7:01:28 PM UTC+2, Lorem Ipsum wrote:

    It has been mentioned elsewhere, that multiple CPUs can share the same hardware by the use of pipelining.
    You totally lost me on that one. I can clearly imagine 8 Forth cores. I have no idea how to turn them into a pipeline. Is there a link? My whole goal was small, fast, and simple.

    With only 8 processors, much better would be the use of shared memory. This allows the exchange of data between *any* two processors, or even between them all.
    Very, very interesting point. That is clearly how modern CPUs work, how EEs think: a large L1 cache. I am rather interested in the other end of the design space. Maybe by the time I complete my training, I will agree more with traditional
    electrical engineers. We will see.


    There is a good chance that I would do this in cooperation with the AI and Robotics guys and their CORE 1 CPU.
    I'm not familiar with the CORE 1 CPU. Is it multiple processors?
    https://www.youtube.com/watch?v=KXjQdKBl7ag&t=599s https://github.com/angelus9/AI-Robotics

    He wrote code like it was software, and got it running.
    Great story.

    I'm not aware that anyone supports anything other than HDL.
    I could not find it, but when we start using it after Easter, I will post the link.

    How do you do a diff on a schematic?
    Great question.

    Thank you for the engagement. If I recall correctly, you also wrote a Forth CPU, could you be so kind as to provide a link?

    But still, no one else is into many-core Forth CPUs. I am a minority within a minority.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lorem Ipsum@21:1/5 to Christopher Lozinski on Sat Apr 1 11:41:49 2023
    On Saturday, April 1, 2023 at 2:04:49 PM UTC-4, Christopher Lozinski wrote:
    On Saturday, April 1, 2023 at 7:01:28 PM UTC+2, Lorem Ipsum wrote:

    It has been mentioned elsewhere, that multiple CPUs can share the same hardware by the use of pipelining.
    You totally lost me on that one. I can clearly imagine 8 Forth Cores. I have no idea how to turn them into a pipeline. Is there a link? My whole goal was small fast and simple.

    It's not 8 processors. It's one processor that is broken up into an 8-stage pipeline. Each stage of the pipeline would be a different logical processor.

    Let's say you have a CPU design that runs at 25 MHz. It could be turned into a pipelined design by adding registers in each path of logic. For simplicity, we'll just add one extra set to make it a 2-stage pipeline. The ALU is a part that is often
    speed limiting, so add a stage that handles the add and half of the carry, latching that into an intermediate pipeline register. Then a second stage handles the rest of the carry, producing the final result two clocks after the data was input. Now,
    you can provide inputs on every clock cycle, but the first result will only emerge after two clock cycles. Think of a bucket brigade carrying water to a fire.

    The entire CPU could be broken up this way. If the data being entered into the pipeline is for a different logical processor on each phase of the pipeline, it is the same as having multiple processors. A single block RAM can hold the stacks for
    every processor. In a single instruction, the stack is potentially read twice and potentially written once. It is common for the top of stack to be a register separate from the stack memory. So you can read the stack RAM, read the top-of-stack
    register (actually a register array corresponding to the number of processors), and write the stack RAM and top-of-stack register, all somewhere in the multiple phases of a given CPU.

    I know this is a bit to take in. It's good if you first understand pipelining well. This is actually simpler than typical pipelining, where multiple instructions are being processed for the same processor. In that case, you have to worry about
    interactions between instructions. Did the first instruction write its result before the next instruction read that result from the stack? Branches in the code become very difficult to handle. The Pentium 4 processor could run at a very high clock rate
    because they increased the length of the pipeline, but the impact of program branching requiring the pipeline to be refilled resulted in it not being a lot faster than the Pentium 3.


    With only 8 processors, much better would be the use of shared memory. This allows the exchange of data between *any* two processors, or even between them all.
    Very very interesting point. That is clearly how modern cpus work. How EE's think. A large L1 cache. I am rather interested in the other end of the design space. Maybe by the time I complete my training, I will agree more with traditional electrical
    engineers. We will see.

    I'm not sure what design space you are at the other end of.


    There is a good chance that I would do this in cooperation with the AI and Robotics guys and their CORE 1 CPU.
    I'm not familiar with the CORE 1 CPU. Is it multiple processors?
    https://www.youtube.com/watch?v=KXjQdKBl7ag&t=599s https://github.com/angelus9/AI-Robotics

    Was that a yes or a no? I don't have time to watch videos to get answers to simple questions.


    He wrote code like it was software, and got it running.
    Great story.
    I'm not aware that anyone supports anything other than HDL.
    I could not find it, but when we start using it after Easter, I will post the link.
    How do you do a diff on a schematic?
    Great question.

    Thank you for the engagement. If I recall correctly, you also wrote a Forth CPU, could you be so kind as to provide a link?

    I don't think I ever posted it on the web. I shared my notes with some people, but they were too cryptic for anyone else to understand. It wasn't anything special. To me, this is just a fancy way of implementing counters and comparisons in software,
    rather than directly in hardware. So one important goal was for every instruction to take only one clock cycle. Even the interrupt takes just one clock cycle, as it is implemented as a call instruction, which also saves the processor status word on the
    data stack, in addition to the return address on the return stack.

    I had worked a bit on a combined design, which uses a stack with offset addressing. This allows many stack manipulation instructions to be avoided. I was using as a test case an interrupt routine that controlled a DDS. The interrupt has its own
    stack space, so it could keep relevant variables on that stack rather than in memory, but a lot of stack ops were required to get the data to the top, work on it, and then put it away. The offset addressing allowed 30% of the instructions to be
    eliminated, which means 30% faster, as well as less code. Again, I never did enough with it for anyone else to be able to understand my notes.
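    The offset-addressing idea can be illustrated with a toy Python model. The `pick` helper and the values are hypothetical, not taken from the actual notes; the point is that an operand several cells down is read directly instead of being shuffled to the top:

```python
# Plain stack machine: to add the bottom cell to the top, you would need
# shuffle words (ROT, SWAP, >R ...) first. With an offset read, the
# operand is addressed in place, so the shuffles disappear.

stack = [10, 20, 30]    # 30 is top of stack

def pick(stack, n):
    # read the cell n below the top (n = 0 is TOS) without popping
    return stack[-1 - n]

# one "add with offset" instead of a shuffle sequence plus a plain add:
stack.append(pick(stack, 2) + pick(stack, 0))
print(stack)  # [10, 20, 30, 40]
```

    Each shuffle word eliminated is a clock cycle saved on a one-instruction-per-clock machine, which is where the quoted 30% comes from.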


    But still no one else is in many core Forth cpus. I am a minority in a minority.

    Yea, Forth is such a minority, that people laugh at the idea of using it. Stack processors pretty much died off several decades ago. But they make sense for very small designs, particularly in FPGAs.

    --

    Rick C.

    + Get 1,000 miles of free Supercharging
    + Tesla referral code - https://ts.la/richard11209

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christopher Lozinski@21:1/5 to Lorem Ipsum on Sat Apr 1 13:16:00 2023
    On Saturday, April 1, 2023 at 8:41:51 PM UTC+2, Lorem Ipsum wrote:

    I'm not familiar with the CORE 1 CPU. Is it multiple processors?

    The current Verilog is a single processor. They plan on going multi-processor, but like me they are not sure which application to target.

    I do understand how to make a register machine pipelined. I have no idea how to make a stack machine pipelined.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lorem Ipsum@21:1/5 to Christopher Lozinski on Sat Apr 1 14:20:15 2023
    On Saturday, April 1, 2023 at 4:16:01 PM UTC-4, Christopher Lozinski wrote:
    On Saturday, April 1, 2023 at 8:41:51 PM UTC+2, Lorem Ipsum wrote:

    I'm not familiar with the CORE 1 CPU. Is it multiple processors?
    The current verilog is a single processor, they plan on going multi-processor, but like me they are not sure which application to target.

    I do understand how to make a register machine pipelined. I have no idea how to make a stack machine pipelined.

    How is it any different???

    --

    Rick C.

    -- Get 1,000 miles of free Supercharging
    -- Tesla referral code - https://ts.la/richard11209

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christopher Lozinski@21:1/5 to All on Sat Apr 1 21:20:40 2023

    I do understand how to make a register machine pipelined. I have no idea how to make a stack machine pipelined.
    How is it any different???

    Fetch the instruction,
    fetch the operands,
    do the instruction,
    write the results.

    On a stack machine, the operands are already on the stack, and the result is written to the stack,
    so there is no opportunity to pipeline those. The only thing you could do is to fetch the next instruction at the same time, or to parse a word into multiple instructions, but that is only a two-stage pipeline, so the word "pipeline" did not come to mind.
    I just thought of that as doing two things simultaneously. Maybe it is 3 things.
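    The two-stage overlap conceded here (fetch the next instruction while executing the current one) can be sketched in Python. The three-word program and mini instruction set are illustrative assumptions:

```python
# Two-stage pipeline for a stack machine: stage 1 fetches the next
# instruction while stage 2 executes the current one against the stack.

def execute(op, stack):
    if isinstance(op, int):
        stack.append(op)                          # push literal
    elif op == "dup":
        stack.append(stack[-1])
    elif op == "+":
        stack.append(stack.pop() + stack.pop())

program = [3, "dup", "+"]   # Forth-style: 3 dup +  ->  6
stack = []
fetched = program[0]        # prime the fetch stage
for pc in range(1, len(program) + 1):
    executing = fetched
    # stage 1: fetch the next instruction (None past the end)
    fetched = program[pc] if pc < len(program) else None
    # stage 2: execute the previously fetched instruction
    execute(executing, stack)
print(stack)  # [6]
```

    Since operands come implicitly from the stack, fetch and execute never contend for operand decoding, which is why even this simple overlap works without hazard logic.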

    Thank you for the question, it helped my understanding grow.

    Still, nobody I know of but me and the CORE-1 guys is interested in multi-core Forth machines. I am not sure what I am going to do next.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Christopher Lozinski on Sun Apr 2 08:18:53 2023
    Christopher Lozinski <calozinski@gmail.com> writes:

    I do understand how to make a register machine pipelined. I have no idea how to make a stack machine pipelined.
    How is it any different???

    Fetch the instruction,
    fetch the operands,
    do the instruction,
    write the results.

    On a stack machine, the operands are already on the stack, and the result is written to the stack,
    so there is no opportunity to pipeline those.

    The way you present it, you have just the same opportunities as for a
    register machine (and of course, also the costs, such as forwarding
    the result to the input if you want to be able to execute instructions back-to-back). And if you do it as a barrel processor, as suggested
    by Lorem Ipsum, AFAICS you have to do that.

    I don't think that pipelining to make a barrel processor makes much
    sense for you. It increases the design cost to possibly save some transistors/area compared to having that many individual processors,
    but the pipelining itself also costs transistors/area, and it's not
    clear that you actually save something. Note that nobody has ever
    done a successful barrel processor design for a CPU (and I only
    remember the Tera MTA as an attempt to do it at all); the well-known
    example of a barrel processor is the I/O processor of the CDC 6600.

    Of course, if you put 8 individual cores on a chip with a single
    memory interface, you somehow have to arbitrate the access to the
    memory; one way to do that may be to have a single load/store unit
    that gets requests from the individual cores and processes them one
    after the other, somewhat like a barrel processor.

    Back to the question of pipelining: If you let your stack machine run
    only a single thread, you save quite a bit compared to a register
    machine: the output of the ALU is one of its inputs (well, you may
    want to MUX the output of the load unit (and other units) in between),
    so you get the benefits of forwarding automatically. Let's assume you implement the rest of the stack as a register file (plus a stack
    pointer register, maybe predecoded); then the stack architecture tells
    you early which register is the other operand, and you can perform the
    access in parallel with the instruction fetch.

    You can perform the execution of a given instruction in parallel with
    the fetch of the next one, if you do the instruction fetch in a
    separate pipeline stage.
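    Anton's single-thread arrangement can be sketched as a toy Python model: the TOS lives in a register fed directly from the ALU output (so the register machine's forwarding bypass is the normal path), and the rest of the stack is a small register file indexed by a stack pointer. Class and sizes are hypothetical:

```python
# Toy model of a single-thread stack CPU: TOS register + register-file
# stack. The ALU writes straight into TOS, giving forwarding "for free".

class StackCPU:
    def __init__(self):
        self.tos = 0            # top-of-stack register, fed by the ALU output
        self.regs = [0] * 16    # rest of the stack as a register file
        self.sp = 0             # stack pointer into the register file

    def push(self, v):
        self.regs[self.sp] = self.tos   # spill old TOS into the file
        self.sp += 1
        self.tos = v

    def add(self):
        # The NOS address (sp - 1) is known from the opcode alone, so the
        # register-file read can start in parallel with instruction fetch.
        self.sp -= 1
        self.tos = self.tos + self.regs[self.sp]

cpu = StackCPU()
cpu.push(2)
cpu.push(3)
cpu.add()
print(cpu.tos)  # 5
```

    Note the contrast with a barrel design: here one physical TOS register suffices, whereas a barrel processor would need one TOS (and one stack pointer) per virtual core, losing this free forwarding path.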

    The only thing you could do is to fetch the next instruction at the same time, or to parse a word into multiple instructions, but that is only a two-stage pipeline, so the word pipeline did not come to mind.

    A two-stage pipeline is still a pipeline.

    Still nobody I know of but me and the CORE-1 guys are interested in multi-core Forth machines.

    Chuck Moore's work for quite a while has been on multi-core Forth
    machines, but the interest from potential users seems to be limited;
    most interest seems to be based on his earlier merits as discoverer
    (as he puts it) of Forth.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lorem Ipsum@21:1/5 to Christopher Lozinski on Sun Apr 2 02:52:38 2023
    On Sunday, April 2, 2023 at 12:20:42 AM UTC-4, Christopher Lozinski wrote:
    I do understand how to make a register machine pipelined. I have no idea how to make a stack machine pipelined.
    How is it any different???
    Fetch the instruction,
    fetch the operands,
    do the instruction,
    write the results.

    On a stack machine, the operands are already on the stack, and the result is written to the stack,
    so there is no opportunity to pipeline those. The only thing you could do is to fetch the next instruction at the same time, or to parse a word into multiple instructions, but that is only a two stage pipeline, so the word pipeline did not come to mind.
    I just thought of that as doing two things simultaneously. Maybe it is 3 things.

    I don't think you are familiar with what has to happen in a CPU design. How many different sources do you have for things that go onto the stack? Each of those things has to be selected via a multiplexer. That can be pipelined. The ALU is in the
    path between the stack output and the stack input; that can be pipelined.

    This is one of those areas where having little understanding of the hardware being produced limits your understanding of how to optimize a design. Pipelining can be done at a very fine level. What you are looking at is more an issue of parallel processes.
    Those can be pipelined, but it can also be much finer.


    Thank you for the question, it helped my understanding grow.

    Still nobody I know of but me and the CORE-1 guys are are interested in a multi-core forth machines. I am not sure what I am going to do next.

    Exactly what advantage do you see from using a stack processor over register based processors, when going multi-core?

    --

    Rick C.

    -+ Get 1,000 miles of free Supercharging
    -+ Tesla referral code - https://ts.la/richard11209

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lorem Ipsum@21:1/5 to Christopher Lozinski on Sun Apr 2 03:14:17 2023
    On Sunday, April 2, 2023 at 5:50:41 AM UTC-4, Christopher Lozinski wrote:
    Note that nobody has ever done a successful barrel processor design for a CPU
    If I understand correctly, then XMOS has a barrel processor. Each core has multiple sets of registers, which can get swapped instantly.

    But they are a tiny company.

    What does that have to do with anything???

    I don't know that the term "barrel processor" has any real meaning. I thought Anton was using the term for the sort of pipelined design multi-processor I was referring to. That is not the same as the XMOS thing. There, they have multiple, independent
    processors, which use a common memory, by interleaving accesses. I believe they do this in a way that does not slow the processors. Some people swear by these devices, but given that most processors are picked by performance vs. cost, the XMOS
    processors don't do any better than other designs. In fact, they are often excluded because of their higher cost. Not many projects require the higher performance and most people are very familiar with designing standard processors. Not entirely
    unlike the issues of promoting stack processors.

    --

    Rick C.

    ++ Get 1,000 miles of free Supercharging
    ++ Tesla referral code - https://ts.la/richard11209

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christopher Lozinski@21:1/5 to All on Sun Apr 2 02:50:39 2023
    Note that nobody has ever done a successful barrel processor design for a CPU

    If I understand correctly, then XMOS has a barrel processor. Each core has multiple sets of registers, which can get swapped instantly.

    But they are a tiny company.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lorem Ipsum@21:1/5 to Anton Ertl on Sun Apr 2 03:07:29 2023
    On Sunday, April 2, 2023 at 4:53:14 AM UTC-4, Anton Ertl wrote:
    Christopher Lozinski <caloz...@gmail.com> writes:

    I do understand how to make a register machine pipelined. I have no idea how to make a stack machine pipelined.
    How is it any different???

    Fetch the instruction,
    fetch the operands,
    do the instruction,
    write the results.

    On a stack machine, the operands are already on the stack, and the result is written to the stack,
    so there is no opportunity to pipeline those.
    The way you present it, you have just the same opportunities as for a register machine (and of course, also the costs, such as forwarding
    the result to the input if you want to be able to execute instructions back-to-back). And if you do it as a barrel processor, as suggested
    by Lorem Ipsum, AFAICS you have to do that.

    I don't know what AFAICS means, but in a "barrel" processor, as you call it, you don't need any special additions to the design to accommodate this type of pipelining, because there is no overlap of processing instructions of a single virtual processor.
    The instruction is processed 100% before beginning the next instruction. With no overlap, there's no need for "forwarding the result".


    I don't think that pipelining to make a barrel processor makes much
    sense for you. It increases the design cost to possibly save some transistors/area compared to having that many individual processors,
    but the pipelining itself also costs transistors/area, and it's not
    clear that you actually save something.

    If you say that, you don't understand what is going on. The only added cost in a barrel processor is the added FFs, which are not "added" relative to multiple cores. Meanwhile, you have saved all the logic between the FFs. The amount of additional
    logic would be very minimal. So there would be a large savings in logic overall.


    Note that nobody has ever
    done a successful barrel processor design for a CPU (and I only
    remember the Tera MTA as an attempt to do it at all); the well-known
    example of a barrel processor is the I/O processor of the CDC 6600.

    You are referring to commercial successes. How many commercial stack processors have you seen in the last 20 years? I know of none. So why bother trying to design a stack processor?


    Of course, if you put 8 individual cores on a chip with a single
    memory interface, you somehow have to arbitrate the access to the
    memory; one way to do that may be to have a single load/store unit
    that gets requests from the individual cores and proecesses them one
    after the other, somewhat like a barrel processor.

    Back to the question of pipelining: If you let you stack machine run
    only a single thread, you save quite a bit compared to a register
    machine: the output of the ALU is one of its inputs (well, you may
    want to MUX the output of the load unit (and other units) in between),
    so you get the benefits of forwarding automatically. Let's assume you implement the rest of the stack as a register file (plus a stack
    pointer register, maybe predecoded); then the stack architecture tells
    you early which register is the other operand, and you can perform the access in parallel with the instruction fetch.

    You can perform the execution of a given instruction in parallel with
    the fetch of the next one, if you do the instruction fetch in a
    separate pipeline stage.

    The only thing you could do is to fetch the next instruction at the same time, or to parse a word into multiple instructions, but that is only a two stage pipeline, so the word pipeline did not come to mind.
    A two-stage pipeline is still a pipeline.

    Still nobody I know of but me and the CORE-1 guys are interested in multi-core Forth machines.

    Chuck Moore's work for quite a while has been on multi-core Forth
    machines, but the interest from potential users seems to be limited;
    most interest seems to be based on his earlier merits as discoverer
    (as he puts it) of Forth.

    The Moore stack processors excel in one area only: providing raw MIPS at low power and small area. But these MIPS are not very usable, because of the many bottlenecks in the design. This includes very tiny per-CPU memories, I/O bandwidth limitations,
    limited inter-processor communications, and limited memory interfaces. There's actually little good about chips such as the GA144.

    --

    Rick C.

    +- Get 1,000 miles of free Supercharging
    +- Tesla referral code - https://ts.la/richard11209

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Lorem Ipsum on Sun Apr 2 12:29:50 2023
    Lorem Ipsum <gnuarm.deletethisbit@gmail.com> writes:
    On Sunday, April 2, 2023 at 5:50:41 AM UTC-4, Christopher Lozinski wrote:
    Note that nobody has ever done a successful barrel processor design for a CPU
    If I understand correctly, then XMOS has a barrel processor. Each core has multiple sets of registers, which can get swapped instantly.
    ...
    I don't know that the term "barrel processor" has any real meaning. I thought Anton was using the term for the sort of pipelined design multi-processor I was referring to.

    I did.

    That is not the same as the XMOS thing. There, they have multiple, independent processors, which use a common memory, by interleaving accesses.

    <https://en.wikipedia.org/wiki/Barrel_processor> claims that the XCore
    XS1 is a barrel processor. So either this claim is wrong, or they
    have switched from a barrel processor design to one more along the
    lines of what I have suggested (if he wants to go for multi-core at
    all). Either variant supports my claim of lack of success for barrel processors in the CPU market.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Lorem Ipsum on Sun Apr 2 12:36:25 2023
    Lorem Ipsum <gnuarm.deletethisbit@gmail.com> writes:
    On Sunday, April 2, 2023 at 4:53:14 AM UTC-4, Anton Ertl wrote:
    Christopher Lozinski <caloz...@gmail.com> writes:
    I do understand how to make a register machine pipelined. I have no idea how to make a stack machine pipelined.
    How is it any different???
    Fetch the instruction,
    fetch the operands,
    do the instruction,
    write the results.
    On a stack machine, the operands are already on the stack, and the result is written to the stack, so there is no opportunity to pipeline those.
    The way you present it, you have just the same opportunities as for a register machine (and of course, also the costs, such as forwarding the result to the input if you want to be able to execute instructions back-to-back). And if you do it as a barrel processor, as suggested by Lorem Ipsum, AFAICS you have to do that.

    I don't know what AFAICS means,

    As Far As I Can See.

    but in a "barrel" processor, as you call it, you don't need any special additions to the design to accommodate this type of pipelining, because there is no overlap of processing instructions of a single, virtual processor. The instruction is processed 100% before beginning the next instruction. With no overlap, there's no need for "forwarding the result".

    Yes. My wording was misleading. What I meant: If you want to
    implement a barrel processor with a stack architecture, you have to
    treat the stack in many respects like a register file, possibly
    resulting in a pipeline like above.

    By contrast, for a single-thread stack-based CPU, what is the
    forwarding bypass (i.e., an optimization) of a register machine is the
    normal path for the TOS of a stack machine; but not for a barrel
    processor with a stack architecture.
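    Anton's point can be sketched in software. This is a hedged behavioural model, not any real design (the class and method names are invented for illustration): the TOS lives in a register that doubles as the ALU output latch, so what a register machine implements as a forwarding bypass is simply the normal data path here.

```python
# Hypothetical sketch: a single-threaded stack machine where the
# top-of-stack (TOS) lives in a register.  The ALU result is written
# straight into TOS, so the next instruction reads it with no
# forwarding network -- the "bypass" IS the normal path.

class StackCPU:
    def __init__(self):
        self.tos = 0        # TOS register: doubles as the ALU output latch
        self.stack = []     # items below TOS (block RAM in hardware)

    def push(self, value):
        self.stack.append(self.tos)   # spill old TOS into the RAM
        self.tos = value

    def add(self):
        # Right operand popped from RAM, left operand is the TOS register;
        # the sum lands back in TOS, ready for the very next instruction.
        self.tos = self.tos + self.stack.pop()

cpu = StackCPU()
cpu.push(3)
cpu.push(4)
cpu.add()          # TOS register now holds 7; no stall, no bypass logic
print(cpu.tos)     # 7
```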

    If you say that, you don't understand what is going on. The only added cost in a barrel processor is the added FFs, which are not "added" relative to multiple cores. Meanwhile, you have saved all the logic between the FFs. The amount of additional logic would be very minimal. So there would be a large savings in logic overall.

    The logic added in pipelining depends on what is pipelined (over in
    comp.arch Mitch Alsup has explained several times how expensive a
    deeply pipelined multiplier is: at some design points it's cheaper to
    have two multipliers with half the pipelining that are used in
    alternating cycles). In any case, the cost is significant in
    transistors, in area and in power; in the early 2000s Intel and AMD
    planned to continue their clock race by even deeper pipelining than
    they had until then (looking at pipelines with 8 FO4 gate equivalents
    per stage), but they found that they had trouble cooling the resulting
    CPUs, and so settled on ~16 FO4 gate equivalents per stage.

    How many commercial stack processors have you seen in the last 20 years? I know of none. So why bother trying to design a stack processor?

    My understanding is that this is a project he does for educational
    purposes. I think that he can learn something from designing a stack processor; and if that's not enough, maybe some extension or other.
    He may also learn something from designing a barrel processor. But
    from designing a barrel processor with a stack architecture, at best
    he will learn why that squanders the implementation benefits of a
    stack architecture; but without first designing a single-threaded
    stack machine, I fear that he would miss that, and would not learn
    much about what the difference between stack and register machines
    means for the implementation, and he may also miss some interesting
    properties of barrel processors.

    - anton

  • From Lorem Ipsum@21:1/5 to Anton Ertl on Sun Apr 2 10:29:38 2023
    On Sunday, April 2, 2023 at 8:36:22 AM UTC-4, Anton Ertl wrote:
    Lorem Ipsum <gnuarm.del...@gmail.com> writes:
    On Sunday, April 2, 2023 at 5:50:41 AM UTC-4, Christopher Lozinski wrote:
    Note that nobody has ever done a successful barrel processor design for a CPU
    If I understand correctly, then XMOS has a barrel processor. Each core has multiple sets of registers, which can get swapped instantly.
    ...
    I don't know that the term "barrel processor" has any real meaning. I thought Anton was using the term for the sort of pipelined design multi-processor I was referring to.

    I did.

    That is not the same as the XMOS thing. There, they have multiple, independent processors, which use a common memory, by interleaving accesses.

    <https://en.wikipedia.org/wiki/Barrel_processor> claims that the XCore
    XS1 is a barrel processor. So either this claim is wrong, or they
    have switched from a barrel processor design to one more along the
    lines of what I have suggested (if he wants to go for multi-core at
    all). Either variant supports my claim of lack of success for barrel processors in the CPU market.

    I had some discussions with a strong XMOS proponent some time ago and he never said their chips were barrel processors. Wikipedia is often wrong about details, and this is one of those times.

    https://www.xmos.ai/download/The-XMOS-XS1-Architecture(X7879A).pdf

    From XMOS,

    3 Concurrent Threads
    Each XCore has hardware support for executing a number of concurrent threads. This
    includes:
    • a set of registers for each thread.
    • a thread scheduler which dynamically selects which thread to execute.
    • a set of synchronisers to synchronise thread execution.
    ...

    This is not a pipelined barrel processor. This is simply a processor with multiple register sets and various comms mechanisms to facilitate multithreading/multitasking. Multiple register sets has been done more than once. Doesn't the ARM have two
    register sets, one for some sort of interrupt?

    --

    Rick C.


  • From Lorem Ipsum@21:1/5 to Anton Ertl on Sun Apr 2 10:53:11 2023
    On Sunday, April 2, 2023 at 9:03:48 AM UTC-4, Anton Ertl wrote:
    Lorem Ipsum <gnuarm.del...@gmail.com> writes:
    On Sunday, April 2, 2023 at 4:53:14 AM UTC-4, Anton Ertl wrote:
    Christopher Lozinski <caloz...@gmail.com> writes:
    I do understand how to make a register machine pipelined. I have no idea how to make a stack machine pipelined.
    How is it any different???
    Fetch the instruction,
    fetch the operands,
    do the instruction,
    write the results.
    On a stack machine, the operands are already on the stack, and the result is written to the stack, so there is no opportunity to pipeline those.
    The way you present it, you have just the same opportunities as for a register machine (and of course, also the costs, such as forwarding the result to the input if you want to be able to execute instructions back-to-back). And if you do it as a barrel processor, as suggested by Lorem Ipsum, AFAICS you have to do that.

    I don't know what AFAICS means,
    As Far As I Can See.

    but in a "barrel" processor, as you call it, you don't need any special additions to the design to accommodate this type of pipelining, because there is no overlap of processing instructions of a single, virtual processor. The instruction is processed 100% before beginning the next instruction. With no overlap, there's no need for "forwarding the result".

    Yes. My wording was misleading. What I meant: If you want to
    implement a barrel processor with a stack architecture, you have to
    treat the stack in many respects like a register file, possibly
    resulting in a pipeline like above.

    I'm still not following. I'm not sure what you have to do with the register file, other than to have N of them like all other logic. The stack can be implemented in block RAM. A small counter points to the stack being processed at that time. You can
    only perform one stack read and one write for each processor per instruction.

    To make it simple, say it was a 4x design. The four stages could be instruction decode, ALU1, ALU2 and final. The instruction fetch happens on the final cycle, as do stack ops. There is no special stack "read", as a stack always presents the top item
    and next on stack, but the inputs to the ALU need to be captured at the end of instruction decode in the additional pipeline registers. IIRC, in my designs (not pipelined), I had the memory operations a half clock out of step which would be equivalent
    to doing memory read/write in ALU1 cycle.

    Some aspects of the stack operations might be pipelined. In my early CPU design, the stack ops were speed limiting to the entire CPU. But this had to do with producing over/underflow flags, which were reported in a processor status word. This is not
    an essential part of a stack processor. In the above example, the stack ops could be split and half done in the instruction decode phase.
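    The barrel arrangement described here can be modelled behaviourally. This is a sketch under assumed details (four phases, a toy two-instruction stack machine), not a hardware design: a rotating phase counter selects which of four independent instruction streams owns the shared datapath each cycle, so an instruction always retires before the same stream issues its next one, and no forwarding or stall logic is needed.

```python
# Minimal 4-way barrel processor model: four independent stack-machine
# contexts share one execution unit, selected round-robin by a phase
# counter.  Each stream only issues every 4th cycle, so there are no
# data hazards between its own instructions.

PHASES = 4

# One tiny context per phase: its own program, stack, and program counter.
programs = [
    [("push", 1), ("push", 2), ("add",)],
    [("push", 10), ("push", 20), ("add",)],
    [("push", 5), ("push", 5), ("add",)],
    [("push", 7), ("push", 0), ("add",)],
]
stacks = [[] for _ in range(PHASES)]
pcs = [0] * PHASES

cycle = 0
while any(pcs[p] < len(programs[p]) for p in range(PHASES)):
    phase = cycle % PHASES            # the rotating phase counter
    if pcs[phase] < len(programs[phase]):
        op = programs[phase][pcs[phase]]
        if op[0] == "push":
            stacks[phase].append(op[1])
        elif op[0] == "add":
            b, a = stacks[phase].pop(), stacks[phase].pop()
            stacks[phase].append(a + b)
        pcs[phase] += 1
    cycle += 1

print([s[-1] for s in stacks])   # each stream's result: [3, 30, 10, 7]
```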

    I would expect register ops to be simple and fast enough to not require pipelining. But the address (register index) calculation might require pipelining. Register CPUs are typically RMW, since the registers have to be selected before being "read". A
    stack processor can be designed to have its top two elements available immediately after a stack operation. It's a bit like a register machine with dedicated ALU registers. I recall some processors always did ALU ops using one fixed register and a
    selectable other register.


    By contrast, for a single-thread stack-based CPU, what is the
    forwarding bypass (i.e., an optimization) of a register machine is the normal path for the TOS of a stack machine; but not for a barrel
    processor with a stack architecture.

    I guess I simply don't know what you mean by "forwarding bypass". I found this.

    https://en.wikipedia.org/wiki/Operand_forwarding

    But I don't follow that either. This has to do with the data of the two instruction being related. In the barrel stack processor, each phase of the processor is an independent instruction stream. So there are no data dependencies involving the stack.
    In a pipelined stack CPU, there very much could be data dependencies. Every time the stack is adjusted, the CPU would stall.


    If you say that, you don't understand what is going on. The only added cost in a barrel processor is the added FFs, which are not "added" relative to multiple cores. Meanwhile, you have saved all the logic between the FFs. The amount of additional logic would be very minimal. So there would be a large savings in logic overall.

    The logic added in pipelining depends on what is pipelined (over in comp.arch Mitch Alsup has explained several times how expensive a
    deeply pipelined multiplier is: at some design points it's cheaper to
    have two multipliers with half the pipelining that are used in
    alternating cycles).

    If you are talking about adding logic for a pipeline, that is some optimization you are performing. It's not inherent in the pipelining itself. Pipelining only requires that the logic flow be broken into steps by registers. This reduces the clock
    cycle time. In a pipeline with independent instruction streams, there is no added logic to deal with problems like stalls from data interactions.


    In any case, the cost is significant in
    transistors, in area and in power; in the early 2000s Intel and AMD
    planned to continue their clock race by even deeper pipelining than
    they had until then (looking at pipelines with 8 FO4 gate equivalents
    per stage), but they found that they had trouble cooling the resulting
    CPUs, and so settled on ~16 FO4 gate equivalents per stage.

    I can't say anything about massive Intel processors. In the small CPUs we are working with, this problem does not exist, mostly because there is no additional logic, other than the registers and the phase counter.


    How many commercial stack processors have you seen in the last 20 years? I know of none. So why bother trying to design a stack processor?

    My understanding is that this is a project he does for educational
    purposes. I think that he can learn something from designing a stack processor; and if that's not enough, maybe some extension or other.
    He may also learn something from designing a barrel processor. But
    from designing a barrel processor with a stack architecture, at best
    he will learn why that squanders the implementation benefits of a
    stack architecture; but without first designing a single-threaded
    stack machine, I fear that he would miss that, and would not learn
    much about what the difference between stack and register machines
    means for the implementation, and he may also miss some interesting properties of barrel processors.

    He is talking about building a chip. That doesn't sound like an educational project. If he wants to learn, I think he should design both the register CPU, and a stack CPU. How else to compare the issues of each?

    So you are suggesting he build both the stack and register machine as non-pipelined and as pipelined? How else to learn about all types?

    How does a barrel stack processor "squander" anything??? He wants to design a chip with eight processors. I'm showing him he can design a single logical processor, and pipeline it to work as eight processors. His initial statement was about a real
    time control CPU for his thesis. That's where the barrel processor excels. It provides eight processors in much less logic than 8 separate processors would take. Multiple processors are often essential because multitasking on a single processor can
    have significant limitations and place significant burdens on the CPUs and software.

    I realize this is just a master's thesis, but designing what is in reality, a simple CPU, doesn't seem to come up to the level required. Using pipelining to implement eight processors in a single CPU architecture would seem to be a bit more "interesting"
    project.

    I've changed a lot since I entered the workplace. Now, I would expect the student to have done an analysis to determine the requirements for this processor, and how the unique features of the design contribute to meeting those requirements. In school,
    I was not taught a single thing about the real world, other than that digital waveforms were not the smooth, clean signals in our textbooks. Even in labs, we didn't get much practical information.

    --

    Rick C.


  • From Christopher Lozinski@21:1/5 to Rick C on Mon Apr 3 08:53:42 2023
    Lorem Ipsum wrote:
    Barrel Processor.

    Okay, now I get it. Share the ALU between cores on a time-sliced basis.
    In some sense the Parallax Propeller does this. You only get access to core memory every 8 cycles.
    But we can do even better than that.

    Ting did an ALU, where he calculated everything at once.
    One could share all of that logic, and each "barrel core" could just use the ALU operation it needed.
    Some delays, but overall huge sharing, and maybe energy and space savings.
    Thank you.
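    The "calculate everything at once" ALU idea attributed to Ting here can be illustrated behaviourally. A hedged sketch (the opcode names and widths are made up for illustration): every operation is evaluated in parallel and a final multiplexer picks the one the current instruction wants.

```python
# Toy model of an "everything at once" ALU: all results are computed
# (combinationally, in hardware) and an output mux selects by opcode.

def alu(a, b, opcode):
    results = {                 # all operations evaluated in parallel
        "add": (a + b) & 0xFFFF,
        "sub": (a - b) & 0xFFFF,
        "and": a & b,
        "or":  a | b,
        "xor": a ^ b,
    }
    return results[opcode]      # the output mux: select by opcode

print(alu(6, 3, "add"), alu(6, 3, "and"))   # 9 2
```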


    Rick C asked:
    Exactly what advantage do you see from using a stack processor over register based processors, when going multi-core?

    I believe that a stack machine is smaller than a register machine and takes up less real estate, so you can have a lot more on a single chip/FPGA. Faster clock cycles too, and less energy per useful computation.

    Rick C. asks:
    How many commercial stack processors have you seen in the last 20 years? I know of none. So why bother trying to design a stack processor?

    Because many small processors should be able to outperform a few big processors. Because all the engineers keep putting more into each layer of the hardware and software stacks. And there is a huge benefit to just shrinking the entire stack and making it understandable to mere mortals. Don't optimize the pieces, optimize the entire system.

    Thank you everyone for the most interesting discussions. I will read it all a few more times to make sure I get it all.



  • From Lorem Ipsum@21:1/5 to Christopher Lozinski on Mon Apr 3 10:34:08 2023
    On Monday, April 3, 2023 at 11:53:44 AM UTC-4, Christopher Lozinski wrote:
    Lorem Ipsum wrote:
    Barrel Processor.

    Okay, now I get it. Share the ALU between cores on a time-sliced basis.
    In some sense the Parallax Propeller does this. You only get access to core memory every 8 cycles.
    But we can do even better than that.

    Ting did an ALU, where he calculated everything at once.
    One could share all of that logic, and each "barrel core" could just use the ALU operation it needed.
    Some delays, but overall huge sharing, and maybe energy and space savings.
    Thank you.

    By calculating "everything at once", I assume you mean the various operations such as ADD, SUB, OR, AND, etc. For a single such ALU to be shared between instruction streams would require a massive amount of multiplexers, which are as resource-intensive as addition. I think you will find no ALU has been shared this way, because it is obvious on simple inspection that it is very resource hungry, with no advantage.

    I'm not trying to give you a hard time, but I think the failure to see this comes from a lack of experience with hardware design. Try working with a few design segments and you will see what works well and what doesn't.


    Rick C asked:
    Exactly what advantage do you see from using a stack processor over register based processors, when going multi-core?
    I believe that a stack machine is smaller than a register machine, takes up less real estate, so you can have a lot more on a single chip/fpga. Faster clock cycles too, less energy per useful computation.

    Here is a link to Jim Brakefield's compendium of soft core CPUs. It's a lot of data, but you will be able to see that there's no inherent advantage to stack processors in terms of the speed/resource usage tradeoff.

    https://github.com/jimbrake/cpu_soft_cores

    You will find the performance metric of KIPS/LUT to have both register and stack based processors near the top ranks. The top designs are register based, with a few stack designs in the top 10 or 20. The J1 is in the top 10.

    But there are other metrics, such as code size for a given application. That was the impetus for designing the J1: the MicroBlaze used too much memory for a particular application. The initial J1 used less memory and ran faster. However, that was the Xilinx version of the MicroBlaze. Others have streamlined it into a much faster processor, but with the same instruction set, so the same code size.


    Rick C. asks:
    How many commercial stack processors have you seen in the last 20 years? I know of none. So why bother trying to design a stack processor?

    Because many small processors should be able to outperform a few big processors.

    I fail to see why this should be true. The only reason the large CPU makers have gone with multiple CPUs on a chip, is because they simply have fewer ways to use more transistors productively. So they add cores, and struggle with the memory bandwidth.


    Because all the engineers keep putting more into each layer of the hardware and software stacks. And there is a huge benefit to just shrinking the entire stack and making it understandable to mere mortals. Don't optimize the pieces, optimize the entire
    system.

    If the applications could be reduced in complexity, they would be. Don't make the mistake of comparing current desktop computers to simple, single chip MCUs. Different applications, so different approaches. I've been assuming you are talking about
    something equivalent to an ARM, used for real time control. Multiple cores can definitely help there, but it has to be a thoughtful design with specific design goals in mind.

    --

    Rick C.


  • From Christopher Lozinski@21:1/5 to Lorem Ipsum on Mon Apr 3 12:34:06 2023
    Lorem Ipsum wrote:
    The top designs are register based, with a few stack designs in the top 10 or 20. The J1 is in the top 10.

    What a great link. What superb guidance you are giving me. I think you are quite right that I need to test my theory (it is only a theory) that a stack machine gives greater performance per LUT/area than a register machine.

    I cannot deal with 500 CPUs, but I could compare a few. I like the idea of comparing a leading soft core RISC-V chip vs the J1 or Ting's EP24.

    A week from Thursday we program our first FPGA board. Turn a light on or off by pressing a button. The teaching here is really really thorough.

    That is the second time you have shifted my thinking. (My highest compliment.) First about sharing resources between cpus (Barrel), and now about actually collecting the data to support my theory. I am hugely grateful.

  • From Anton Ertl@21:1/5 to Lorem Ipsum on Sun Apr 9 17:20:51 2023
    Lorem Ipsum <gnuarm.deletethisbit@gmail.com> writes:
    On Sunday, April 2, 2023 at 9:03:48 AM UTC-4, Anton Ertl wrote:
    Yes. My wording was misleading. What I meant: If you want to
    implement a barrel processor with a stack architecture, you have to
    treat the stack in many respects like a register file, possibly
    resulting in a pipeline like above.

    I'm still not following. I'm not sure what you have to do with the register file, other than to have N of them like all other logic. The stack can be implemented in block RAM.

    Like a register file.

    By contrast, with a single-threaded approach, you can use the ALU
    output latch or the left ALU input latch as the TOS, reducing the
    porting requirements or increasing the performance.

    A small counter points to the stack being processed at that time. You can only perform one stack read and one write for each processor per instruction.

    That means that an instruction like + would need two cycles if both
    operands come from the block RAM. By contrast, with a single-threaded
    stack processor you can use a single-ported SRAM block for the stack
    items below the TOS, and still perform + in one cycle.

    By contrast, for a single-thread stack-based CPU, what is the
    forwarding bypass (i.e., an optimization) of a register machine is the
    normal path for the TOS of a stack machine; but not for a barrel
    processor with a stack architecture.

    I guess I simply don't know what you mean by "forwarding bypass". I found this.

    https://en.wikipedia.org/wiki/Operand_forwarding

    But I don't follow that either. This has to do with the data of the two instructions being related. In the barrel stack processor, each phase of the processor is an independent instruction stream.

    Yes, so you throw away the advantage that the stack architecture gives
    you:

    For a register architecture, the barrel processor approach means that
    you don't need to implement the forwarding bypass.

    For a single-threaded stack architecture, you don't need the data path
    of the TOS through the register file/SRAM block (well, not quite, you
    need to put the TOS in the register file when you perform an
    instruction that just pushes something, but the usual path is directly
    from the ALU output to the left ALU input). I discussed the
    advantages of that above. A barrel processor approach means that this advantage goes away or at least the whole thing becomes quite a bit
    more complex.

    Every time the stack is adjusted, the CPU would stall.

    Does not sound like a competent microarchitectural design to me.

    The logic added in pipelining depends on what is pipelined (over in
    comp.arch Mitch Alsup has explained several times how expensive a
    deeply pipelined multiplier is: at some design points it's cheaper to
    have two multipliers with half the pipelining that are used in
    alternating cycles).

    If you are talking about adding logic for a pipeline, that is some optimization you are performing. It's not inherent in the pipelining itself. Pipelining only requires that the logic flow be broken into steps by registers.

    Yes, and these registers are additional logic that costs area. In the
    case of the deeply pipelined multiplier there would be so many bits
    that would have to be stored in registers for some pipeline stage that
    it's cheaper to have a second multiplier with half the pipelining
    depth.

    - anton

  • From Lorem Ipsum@21:1/5 to Anton Ertl on Sun Apr 9 12:38:17 2023
    On Sunday, April 9, 2023 at 1:46:46 PM UTC-4, Anton Ertl wrote:
    Lorem Ipsum <gnuarm.del...@gmail.com> writes:
    On Sunday, April 2, 2023 at 9:03:48 AM UTC-4, Anton Ertl wrote:
    Yes. My wording was misleading. What I meant: If you want to
    implement a barrel processor with a stack architecture, you have to
    treat the stack in many respects like a register file, possibly
    resulting in a pipeline like above.

    I'm still not following. I'm not sure what you have to do with the register file, other than to have N of them like all other logic. The stack can be implemented in block RAM.

    Like a register file.

    In what way does this impact the pipeline??? You are talking, but not explaining.


    By contrast, with a single-threaded approach, you can use the ALU
    output latch or the left ALU input latch as the TOS, reducing the
    porting requirements or increasing the performance.

    Sorry, I don't know what you mean. You are describing something that is in your head, without explaining it.

    The ALU does not require a register on the output. You can do that, but you also need multiplexing to allow other sources to reach the TOS register. You can try to use the ALU as your mux, but, in reality, that just moves the mux to the input of the
    ALU. For example, R> needs a data path from the return stack to the data stack. That can be input to a mux feeding the TOS register, or it can be input to a mux feeding an ALU input. It's a mux, either way.


    A small counter points to the stack being processed at that time. You can only perform one stack read and one write for each processor per instruction.

    That means that an instruction like + would need two cycles if both
    operands come from the block RAM. By contrast, with a single-threaded
    stack processor you can use a single-ported SRAM block for the stack
    items below the TOS, and still perform + in one cycle.

    I don't know what a single threaded anything is. I don't understand your usage.

    The TOS can be a separate register from the block ram, OR you can use two ports on the block RAM. I prefer to use a TOS register, and use the two block ram ports for read and write, because the addresses are typically different. You read from address x
    or you write to address x+1. So the address counter for the stack has an output from the register and an output from the increment/decrement logic.
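    The organisation Rick describes here, a dedicated TOS register with the items below it in a RAM whose read and write addresses differ by one, can be sketched behaviourally. A hedged model (depth and method names are assumptions for illustration):

```python
# Sketch of a stack built from a TOS register plus a RAM with one read
# port and one write port.  The stack-pointer counter supplies both
# "address x" (read on pop) and "address x+1" (write on push).

RAM_DEPTH = 16

class Stack:
    def __init__(self):
        self.tos = 0                  # dedicated TOS register
        self.ram = [0] * RAM_DEPTH    # block RAM for NOS and below
        self.sp = 0                   # points at the current NOS slot

    def push(self, value):
        self.sp += 1
        self.ram[self.sp] = self.tos  # write port: old TOS spills to RAM
        self.tos = value

    def add(self):
        # TOS and next-on-stack are both visible at once, so a binary op
        # needs only one read-port access and no extra cycle.
        self.tos = self.tos + self.ram[self.sp]
        self.sp -= 1

s = Stack()
s.push(3)
s.push(4)
s.add()
print(s.tos)    # 7
```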


    By contrast, for a single-thread stack-based CPU, what is the
    forwarding bypass (i.e., an optimization) of a register machine is the
    normal path for the TOS of a stack machine; but not for a barrel
    processor with a stack architecture.

    I guess I simply don't know what you mean by "forwarding bypass". I found this.

    https://en.wikipedia.org/wiki/Operand_forwarding

    But I don't follow that either. This has to do with the data of the two instructions being related. In the barrel stack processor, each phase of the processor is an independent instruction stream.
    Yes, so you throw away the advantage that the stack architecture gives
    you:

    Sorry, that is not remotely clear to me. Using a pipeline to turn a single processor into multiple processors, uses the same logic in the same way, for multiple instruction streams, with no interference. Using pipelining to speed up a single
    instruction stream results in extra logic being required and limited speed up from pipeline stalls and flushes.


    For a register architecture, the barrel processor approach means that
    you don't need to implement the forwarding bypass.

    Which is not needed for the stack processor. What is your point???


    For a single-threaded stack architecture, you don't need the data path
    of the TOS through the register file/SRAM block (well, not quite, you
    need to put the TOS in the register file when you perform an
    instruction that just pushes something, but the usual path is directly
    from the ALU output to the left ALU input). I discussed the
    advantages of that above. A barrel processor approach means that this advantage goes away or at least the whole thing becomes quite a bit
    more complex.

    Sorry, I have no idea what you are talking about. Why are you talking about TOS and register files??? Do you mean TOS and stack?


    Every time the stack is adjusted, the CPU would stall.

    Does not sound like a competent microarchitectural design to me.

    Whatever. You have so butchered the quoting and this statement is hanging in isolation, so I have no idea what the context is.

    Can you reply without the garbage at the ends of lines? What is the =20 thing?


    The logic added in pipelining depends on what is pipelined (over in
    comp.arch Mitch Alsup has explained several times how expensive a
    deeply pipelined multiplier is: at some design points it's cheaper to
    have two multipliers with half the pipelining that are used in
    alternating cycles).

    If you are talking about adding logic for a pipeline, that is some optimization you are performing. It's not inherent in the pipelining itself. Pipelining only requires that the logic flow be broken into steps by registers.

    Yes, and these registers are additional logic that costs area. In the
    case of the deeply pipelined multiplier there would be so many bits
    that would have to be stored in registers for some pipeline stage that
    it's cheaper to have a second multiplier with half the pipelining
    depth.

    I have no idea what you are getting at. Of course pipeline registers use space on a chip. Duh! Do you have a point about this, or are you just looking to debate the topic ad infinitum?

    1) In FPGAs, the registers are typically free. They have a register with nearly every logic element.

    2) When pipelining a stack processor, there is no need to pipeline the stack, unless you have an overly complex design that was overly slow to begin with. A stack is a block of RAM with an address pointer. In a barrel processor, the address pointer is a small RAM as well, rotating through the phases as the pipeline progresses (typically implemented in distributed RAM). An instruction like ADD pops the stack and writes the ALU result into the TOS register. One operation, one clock cycle, no need to shuffle anything between phases. No pipelining of the stack.

    Is there anything here, that is not clear?

    --

    Rick C.

    -++ Get 1,000 miles of free Supercharging
    -++ Tesla referral code - https://ts.la/richard11209

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Lorem Ipsum on Sun Apr 9 21:24:44 2023
    Lorem Ipsum <gnuarm.deletethisbit@gmail.com> writes:
    On Sunday, April 9, 2023 at 1:46:46 PM UTC-4, Anton Ertl wrote:
    By contrast, with a single-threaded approach, you can use the ALU
    output latch or the left ALU input latch as the TOS, reducing the
    porting requirements or increasing the performance.

    Sorry, I don't know what you mean. You are describing something that is in your head, without explaining it.

    The ALU does not require a register on the output. You can do that, but you also need multiplexing to allow other sources to reach the TOS register. You can try to use the ALU as your mux, but, in reality, that just moves the mux to the input of the ALU. For example, R> needs a data path from the return stack to the data stack. That can be input to a mux feeding the TOS register, or it can be input to a mux feeding an ALU input. It's a mux, either way.

    If you really want to avoid that, you can feed all the other stuff
    through the ALU on the other input, but yes, it's probably more
    efficient to have a mux somewhere in the TOS->ALU->TOS loop.

    But the point I was trying to make is that the TOS is not part of the
    register file (or "block RAM" in FPGA jargon) for the rest of the
    stack, and therefore you don't need a register file with two read and
    one write port per cycle (which you do for a 1-wide register machine
    that should execute 1 instruction per cycle).
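    [Editor's note: the porting argument above can be sketched as a toy model. This is Python used as hardware pseudocode, not any actual core's datapath; the class and names are invented for illustration. The point it demonstrates is that with the TOS held outside the stack SRAM, a binary op like + needs only one RAM access per cycle.]

```python
class SingleThreadedStackMachine:
    """Toy model: the TOS lives in its own register, so the stack SRAM
    needs only a single port per cycle, even for a binary op like +."""

    def __init__(self, depth=16):
        self.tos = 0
        self.ram = [0] * depth  # single-ported SRAM: items below the TOS
        self.sp = 0             # points at the next free slot

    def push(self, value):
        self.ram[self.sp] = self.tos  # the cycle's one RAM access (write)
        self.sp += 1
        self.tos = value

    def add(self):
        nos = self.ram[self.sp - 1]   # the cycle's one RAM access (read)
        self.sp -= 1
        self.tos = self.tos + nos     # ALU result goes straight back to TOS

m = SingleThreadedStackMachine()
m.push(2)
m.push(3)
m.add()       # one "cycle": one RAM read, ALU, TOS update
print(m.tos)  # -> 5
```

    Each method touches the RAM at most once, which is why a 2-read/1-write register file is not required here.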

    That means that an instruction like + would need two cycles if both
    operands come from the block RAM. By contrast, with a single-threaded
    stack processor you can use a single-ported SRAM block for the stack
    items below the TOS, and still perform + in one cycle.

    I don't know what a single threaded anything is. I don't understand your usage.

    That's a normal processor, in contrast to the multi-threaded barrel
    processor.

    The TOS can be a separate register from the block RAM, OR you can use two ports on the block RAM. I prefer to use a TOS register, and use the two block RAM ports for read and write, because the addresses are typically different.

    Now consider how that changes for a barrel processor.

    Sorry, that is not remotely clear to me. Using a pipeline to turn a single processor into multiple processors, uses the same logic in the same way, for multiple instruction streams, with no interference.

    Now you have, say, 8 TOSs, 8 stack pointers, and 8 copies of the rest
    of the stack contents. And the 8 TOSs are on the critical path that
    determines the clock rate. You probably can work around that with
    more pipelining, but that increases the design complexity and area.

    For a register architecture, the barrel processor approach means that
    you don't need to implement the forwarding bypass.

    Which is not needed for the stack processor. What is your point???

    What is the forwarding bypass for a register machine is the TOS in a
    stack machine.

    Sorry, I have no idea what you are talking about. Why are you talking about TOS and register files??? Do you mean TOS and stack?

    The stack is what the programmer sees. In the implementation you
    implement the part of the stack that's not the TOS as block RAM in an
    FPGA or as register file in custom hardware (plus a stack pointer).
    Other options are possible, but these are the ones that are usually
    used.

    Can you reply without the garbage at the ends of lines? What is the =20 thing?

    That's the quoted-printable garbage that is coming from your Usenet
    client. Some clients repair this garbage, but mine doesn't, so it
    gets cited like you posted it. You can see in


    <http://al.howardknight.net/?STYPE=msgid&MSGI=%3Cec17a8fd-b59b-4e16-b8a7-2225c6a2a9f2n%40googlegroups.com%3E>
    <http://al.howardknight.net/?STYPE=msgid&MSGI=%3Cccf8d2a0-6ee4-4896-8f82-e49791c66729n%40googlegroups.com%3E>

    how your last two postings have been butchered by your Usenet client.

    I have no idea what you are getting at. Of course pipeline registers use space on a chip. Duh! Do you have a point about this

    You claimed that pipelining needs no additional logic, but it does.

    1) In FPGAs, the registers are typically free. They have a register with nearly every logic element.

    For custom hardware, you have to pay extra for the registers, and they
    are not cheap.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lorem Ipsum@21:1/5 to Anton Ertl on Sun Apr 9 17:00:11 2023
    On Sunday, April 9, 2023 at 6:00:36 PM UTC-4, Anton Ertl wrote:
    Lorem Ipsum <gnuarm.del...@gmail.com> writes:
    On Sunday, April 9, 2023 at 1:46:46 PM UTC-4, Anton Ertl wrote:
    By contrast, with a single-threaded approach, you can use the ALU
    output latch or the left ALU input latch as the TOS, reducing the
    porting requirements or increasing the performance.

    Sorry, I don't know what you mean. You are describing something that is in your head, without explaining it.

    The ALU does not require a register on the output. You can do that, but you also need multiplexing to allow other sources to reach the TOS register. You can try to use the ALU as your mux, but, in reality, that just moves the mux to the input of the ALU. For example, R> needs a data path from the return stack to the data stack. That can be input to a mux feeding the TOS register, or it can be input to a mux feeding an ALU input. It's a mux, either way.

    If you really want to avoid that, you can feed all the other stuff
    through the ALU on the other input, but yes, it's probably more
    efficient to have a mux somewhere in the TOS->ALU->TOS loop.

    The point is, you need the mux. The only question is where to put it.


    But the point I was trying to make is that the TOS is not part of the register file (or "block RAM" in FPGA jargon) for the rest of the
    stack, and therefore you don't need a register file with two read and
    one write port per cycle (which you do for a 1-wide register machine
    that should execute 1 instruction per cycle).

    That means that an instruction like + would need two cycles if both
    operands come from the block RAM. By contrast, with a single-threaded
    stack processor you can use a single-ported SRAM block for the stack
    items below the TOS, and still perform + in one cycle.

    I don't know what a single threaded anything is. I don't understand your usage.

    That's a normal processor, in contrast to the multi-threaded barrel processor.

    So why confuse the issue by using multiple terms for the same thing?

    The point is, the barrel processor does not require much extra logic to run N processes, without interference, resulting in a much higher processing rate, but more importantly... running multiple processes invisibly, no complex multi-tasking software required.

    I don't know what other people do with processors, but my designs typically need to have multiple events monitored and acted on. I don't like dealing with the potential hazards of conventional multitasking. Running independent processes on independent
    processors is ideal for my work. That's what a barrel processor gives me. Simple and effective.


    The TOS can be a separate register from the block RAM, OR you can use two ports on the block RAM. I prefer to use a TOS register, and use the two block RAM ports for read and write, because the addresses are typically different.

    Now consider how that changes for a barrel processor.

    So how does it? The TOS is now an N-way register or small RAM, just like the rest of the stack. Instead of asking open-ended questions, why not make a statement?


    Sorry, that is not remotely clear to me. Using a pipeline to turn a single processor into multiple processors, uses the same logic in the same way, for multiple instruction streams, with no interference.
    Now you have, say, 8 TOSs, 8 stack pointers, and 8 copies of the rest
    of the stack contents. And the 8 TOSs are on the critical path that determines the clock rate. You probably can work around that with
    more pipelining, but that increases the design complexity and area.

    The TOS is distributed RAM (one LUT per bit). The stacks are all in one block RAM. The stack pointers are also distributed RAM.

    Where's the problem? The distributed RAM is nearly as fast as registers, so there is no pipelining impact. It is the logic and multiplexers that need to be pipelined.


    For a register architecture, the barrel processor approach means that you don't need to implement the forwarding bypass.

    Which is not needed for the stack processor. What is your point???
    What is the forwarding bypass for a register machine is the TOS in a
    stack machine.

    Sorry, I have no idea what you are talking about. Why are you talking about TOS and register files??? Do you mean TOS and stack?

    The stack is what the programmer sees. In the implementation you
    implement the part of the stack that's not the TOS as block RAM in an
    FPGA or as register file in custom hardware (plus a stack pointer).
    Other options are possible, but these are the ones that are usually
    used.

    Why do you keep calling the stack a register file? A rose by any other name... You are not making a point. You seem to be hiding from whatever point you want to make.

    Yes, registers are registers. So??? In a CPU they need to be addressed by some mechanism. Please get to a point.


    Can you reply without the garbage at the ends of lines? What is the =20 thing?

    That's the quoted-printable garbage that is coming from your Usenet
    client. Some clients repair this garbage, but mine doesn't, so it
    gets cited like you posted it. You can see in

    It only shows up in your replies. I can't do anything about it. Can you?


    <http://al.howardknight.net/?STYPE=msgid&MSGI=%3Cec17a8fd-b59b-4e16-b8a7-2225c6a2a9f2n%40googlegroups.com%3E>
    <http://al.howardknight.net/?STYPE=msgid&MSGI=%3Cccf8d2a0-6ee4-4896-8f82-e49791c66729n%40googlegroups.com%3E>

    how your last two postings have been butchered by your Usenet client.

    Ok, I'll just ignore this.


    I have no idea what you are getting at. Of course pipeline registers use space on a chip. Duh! Do you have a point about this
    You claimed that pipelining needs no additional logic, but it does.

    1) In FPGAs, the registers are typically free. They have a register with nearly every logic element.

    For custom hardware, you have to pay extra for the registers, and they
    are not cheap.

    The registers are there in any event. The comparison is an N-way barrel processor, or N processors. Same number of registers, but in one case, much less logic.

    If you want to run a single processor, it won't run N times faster, unless you pipeline it, adding the registers back.

    So what are you talking about?

    --

    Rick C.

    -++ Get 1,000 miles of free Supercharging
    -++ Tesla referral code - https://ts.la/richard11209

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christopher Lozinski@21:1/5 to All on Sun Apr 9 23:07:59 2023
    These conversations are really interesting.
    The microcore has one cpu, but multiple stack regions in memory.
    The barrel processor has multiple stacks sharing resources round robin.
    The Propeller Parallax has multiple cpus sharing central memory round robin.
    The transputer has multiple cpus, each with its own memory, and multiple tasks time sliced.
    The XMOS chip has several poorly connected cpus, each with multiple sets of registers sharing resources time sliced.
    I had originally thought of doing multiple small cpus.
    So many choices. I can see now why you ask what is my goal. Then the best choice would be obvious.

    My imagination of what is possible has been hugely stretched. My certainty of what I wanted to build has evaporated.

    Thank you
    -Chris

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lorem Ipsum@21:1/5 to Christopher Lozinski on Mon Apr 10 01:28:08 2023
    On Monday, April 10, 2023 at 2:08:01 AM UTC-4, Christopher Lozinski wrote:
    These conversations are really interesting.
    The microcore has one cpu, but multiple stack regions in memory.
    The barrel processor has multiple stacks sharing resources round robin.

    The barrel processor actually only has one stack. It has multiple stack pointers. That is what is hard to explain to people. The stack is one block RAM, with the upper n bits of address controlled by the pipeline counter. So each section of stack is
    addressed at the appropriate time.
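    [Editor's note: the one-block-RAM, many-stack-pointers scheme described above can be sketched in a few lines. Python as datapath pseudocode; the sizes (8 phases, 32 cells per slice) are invented for illustration, not taken from any real design.]

```python
N_PHASES = 8   # barrel phases / hardware threads (illustrative)
SLICE = 32     # stack cells per thread; phase number forms the upper address bits

ram = [0] * (N_PHASES * SLICE)  # ONE block RAM holding all eight stacks
sp = [0] * N_PHASES             # stack pointers, one per phase (distributed RAM)
tos = [0] * N_PHASES            # TOS per phase (a small RAM, not 8 registers)

def stack_addr(phase, offset):
    # upper bits come from the pipeline counter, lower bits from that phase's SP
    return phase * SLICE + offset

def push(phase, value):
    ram[stack_addr(phase, sp[phase])] = tos[phase]
    sp[phase] += 1
    tos[phase] = value

def add(phase):
    sp[phase] -= 1
    tos[phase] += ram[stack_addr(phase, sp[phase])]

# round robin: each clock services the next phase; no phase ever
# addresses another phase's slice, so the threads cannot interfere
for clock in range(3 * N_PHASES):
    phase = clock % N_PHASES   # the pipeline counter
    step = clock // N_PHASES
    if step == 0:
        push(phase, 2)
    elif step == 1:
        push(phase, 3)
    else:
        add(phase)

print(tos)  # -> [5, 5, 5, 5, 5, 5, 5, 5]
```

    Each section of stack is selected purely by the phase counter, which is the "addressed at the appropriate time" behaviour described in the post.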


    The Propeller Parallax has multiple cpus sharing central memory round robin. The transputer has multiple cpus, each with its own memory, and multiple tasks time sliced.
    The XMOS chip has several poorly connected cpus, each with multiple sets of registers sharing resources time sliced.

    Isn't the XMOS chip the same as the Parallax Propeller? I know they have more than one chip design, but I don't think they are fundamentally different. They just did a better job of it in the second go around.


    I had originally thought of doing multiple small cpus.
    So many choices. I can see now why you ask what is my goal. Then the best choice would be obvious.

    That's the point. The best choice is not obvious, until the requirements are understood. So far, I think this is the question you have not understood, mostly because you don't have requirements. You are learning, and don't actually have requirements
    as such.


    My imagination of what is possible has been hugely stretched. My certainty of what I wanted to build has evaporated.

    Yeah, I think your original image of what these processors are was not realistic. Hopefully you have a better grasp of hardware. I would recommend that you design a stack processor in an HDL, to learn about the process, and more so, the nature of
    hardware. Few software people have a good understanding of what it takes to make good hardware.

    --

    Rick C.

    +-- Get 1,000 miles of free Supercharging
    +-- Tesla referral code - https://ts.la/richard11209

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Lorem Ipsum on Mon Apr 10 07:15:59 2023
    Lorem Ipsum <gnuarm.deletethisbit@gmail.com> writes:
    On Sunday, April 9, 2023 at 6:00:36 PM UTC-4, Anton Ertl wrote:
    [...]
    The point is, the barrel processor does not require much extra logic to run N processes, without interference, resulting in a much higher processing rate

    Where do you get the "much higher processing rate" from, especially
    without "much extra logic"? If you just multiply all the stuff
    implementing the stack by, say, 8, the clock rate and thus the
    processing rate slows down. You need to introduce additional
    pipelining (i.e., additional logic) to compensate for that. And once
    you have compensated for that, each thread runs at 1/8th of the speed.

    If you have memory with a latency >1 cycle, you can pipeline the
    memory access with extra logic, and then you can use the
    multi-threading to fill the memory access latency. In that case you
    would have an increased rate of processing, if you use all 8 threads,
    but each individual thread is still dog slow.

    As an example, here's the benchmark numbers for Gforth 0.7.0 on two 2005-vintage CPUs (compiled with 2006-vintage compilers):

    sieve bubble matrix fib
    2.114 2.665 1.494 1.912 0.7.0; UltraSparc T1 1GHz; gcc-4.0.2
    0.176 0.244 0.100 0.308 0.7.0; K8 2.2Ghz (Athlon 64 X2 4400+); gcc-4.0.4

    The UltraSPARC T1 has 4 threads per core (and <https://en.wikipedia.org/wiki/UltraSPARC_T1> describes it as a barrel processor), while the K8 has only one. Both are implemented in a 90nm
    process. Admittedly, the UltraSPARC T1 has less area/core (8 cores on
    378mm^2) than the Athlon 64 X2 (2 cores on 199mm^2). But if we compute
    the throughput per mm^2 when using all threads (assuming perfect
    scaling for both, which is more questionable for the UltraSPARC T1),
    the Athlon 64 X2 wins with 0.012 executions/(s*mm^2) (executions of
    all these benchmarks) compared to 0.010 for the UltraSparc T1.

    ( T1) 32e 2.114e 2.665e 1.494e 1.912e f+ f+ f+ f/ 379e f/ f.
    ( K8) 2e 0.176e 0.244e 0.100e 0.308e f+ f+ f+ f/ 199e f/ f.

    And of course, when you have less than 32 threads, things look even
    worse for the T1. When you have only one thread, it's more than 10
    times slower.

    And these are both register-machine architectures. There is a reason
    why barrel processors have not taken off for CPUs.

    I don't know what other people do with processors, but my designs typically need to have multiple events monitored and acted on. I don't like dealing with the potential hazards of conventional multitasking. Running independent processes on independent processors is ideal for my work. That's what a barrel processor gives me. Simple and effective.

    And the customers of Sun had servers that served multiple customers simultaneously, so Sun thought something like the UltraSPARC T1 would
    be simple and effective for them.

    Now consider how that changes for a barrel processor.

    So how does it? The TOS is now an N-way register or small RAM, just like the rest of the stack.

    Which means that you need more fan-out from the ALU to the TOS's and multiplexing from the TOS's to the ALU, both of which slows down the
    cycle time.

    Instead of asking open-ended questions, why not make a statement?

    I am sorry that I did not know that you needed this all spelled out.
    I was expecting that my question was enough to make the additional
    costs obvious.

    Can you reply without the garbage at the ends of lines? What is the =20 thing?
    That's the quoted-printable garbage that is coming from your Usenet
    client. Some clients repair this garbage, but mine doesn't, so it
    gets cited like you posted it. You can see in

    It only shows up in your replies. I can't do anything about it. Can you?

    It's you who complained, so why should I?

    If you wanted to do something about it, you might ask what you can do
    about it, so I conclude that you don't want to do anything about it.

    how your last two postings have been butchered by your Usenet client.

    Ok, I'll just ignore this.

    This confirms my conclusion.

    The registers are there in any event. The comparison is an N-way barrel processor, or N processors. Same number of registers, but in one case, much less logic.

    If you don't want to incur longer cycle times, and increase the
    pipelining for that purpose, you need additional registers.

    If you want to run a single processor, it won't run N times faster, unless you pipeline it, adding the registers back.

    For the same amount of pipelining, a single-threaded processor will
    run at a higher clock rate than an N-way barrel processor, and the
    barrel processor will process the instructions of a thread at a 1/N
    per-cycle rate, so one thread will run more than N times faster. N
    threads will still run faster if you switch the threads rarely enough.
    You need to add more pipelining (more complexity, more logic, more
    area) to get a throughput benefit from barrel processing.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lorem Ipsum@21:1/5 to Anton Ertl on Mon Apr 10 07:18:41 2023
    On Monday, April 10, 2023 at 4:21:46 AM UTC-4, Anton Ertl wrote:
    Lorem Ipsum <gnuarm.del...@gmail.com> writes:
    On Sunday, April 9, 2023 at 6:00:36 PM UTC-4, Anton Ertl wrote: [...]
    The point is, the barrel processor does not require much extra logic to run N processes, without interference, resulting in a much higher processing rate

    Where do you get the "much higher processing rate" from, especially
    without "much extra logic"? If you just multiply all the stuff
    implementing the stack by, say, 8, the clock rate and thus the
    processing rate slows down. You need to introduce additional
    pipelining (i.e., additional logic) to compensate for that. And once
    you have compensated for that, each thread runs at 1/8th of the speed.

    That's the purpose of pipelining. The original path is M ns long. Add N-1 pipeline registers and it is now M/N ns long. Of course, that is ideal, but that's the concept. So the clock can now run at F * N. Since each phase of the barrel takes N
    clocks to get through the pipeline, the rate of execution is F, for each processor.
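    [Editor's note: the arithmetic in that paragraph can be checked directly. The path length M = 40 ns is an invented figure for illustration; the relationships are the ones stated above: ideal pipelining multiplies the clock by N, and each of the N barrel phases then executes at the original rate F.]

```python
M = 40.0   # ns: combinational path of the unpipelined design (made-up figure)
N = 8      # pipeline registers inserted -> N stages, N barrel phases

f_original = 1e3 / M         # MHz: one instruction per M ns
f_pipelined = 1e3 / (M / N)  # ideal case: the clock now runs N times faster

# each phase gets one slot in N, so the per-thread instruction rate is:
per_thread = f_pipelined / N

print(f_original, f_pipelined, per_thread)  # -> 25.0 200.0 25.0
# 25 MHz unpipelined; 200 MHz barrel clock; each of the 8 threads still
# executes at 25 MHz, the original single-cycle rate F
```

    This is the ideal case assumed in the post; real pipelines lose some of the N-fold clock gain to register setup times and uneven stage balancing.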

    If this is not clear, we are talking about different things.


    If you have memory with a latency >1 cycle, you can pipeline the
    memory access with extra logic, and then you can use the
    multi-threading to fill the memory access latency. In that case you
    would have an increased rate of processing, if you use all 8 threads,
    but each individual thread is still dog slow.

    You are looking at a very different technology. I'm not interested in ASIC design. I also am not in the kennel business. Let's stick to processor design.


    As an example, here's the benchmark numbers for Gforth 0.7.0 on two 2005-vintage CPUs (compiled with 2006-vintage compilers):

    sieve bubble matrix fib
    2.114 2.665 1.494 1.912 0.7.0; UltraSparc T1 1GHz; gcc-4.0.2
    0.176 0.244 0.100 0.308 0.7.0; K8 2.2Ghz (Athlon 64 X2 4400+); gcc-4.0.4

    The UltraSPARC T1 has 4 threads per core (and <https://en.wikipedia.org/wiki/UltraSPARC_T1> describes it as a barrel processor), while the K8 has only one. Both are implemented in a 90nm process. Admittedly, the UltraSPARC T1 has less area/core (8 cores on 378mm^2) than the Athlon 64 X2 (2 cores on 199mm^2). But if we compute
    the throughput per mm^2 when using all threads (assuming perfect
    scaling for both, which is more questionable for the UltraSPARC T1),
    the Athlon 64 X2 wins with 0.012 executions/(s*mm^2) (executions of
    all these benchmarks) compared to 0.010 for the UltraSparc T1.

    ( T1) 32e 2.114e 2.665e 1.494e 1.912e f+ f+ f+ f/ 379e f/ f.
    ( K8) 2e 0.176e 0.244e 0.100e 0.308e f+ f+ f+ f/ 199e f/ f.

    And of course, when you have less than 32 threads, things look even
    worse for the T1. When you have only one thread, it's more than 10
    times slower.

    And these are both register-machine architectures. There is a reason
    why barrel processors have not taken off for CPUs.

    Ok.


    I don't know what other people do with processors, but my designs typically need to have multiple events monitored and acted on. I don't like dealing with the potential hazards of conventional multitasking. Running independent processes on independent processors is ideal for my work. That's what a barrel processor gives me. Simple and effective.

    And the customers of Sun had servers that served multiple customers simultaneously, so Sun thought something like the UltraSPARC T1 would
    be simple and effective for them.

    I have no idea why you are comparing such distinct processors. You've already acknowledged that they are very different with wildly different numbers of transistors. This is getting silly.


    Now consider how that changes for a barrel processor.

    So how does it? The TOS is now an N-way register or small RAM, just like the rest of the stack.

    Which means that you need more fan-out from the ALU to the TOS's and multiplexing from the TOS's to the ALU, both of which slows down the
    cycle time.

    I'm sorry that you don't understand FPGA design. There is no fan out between the ALU and the TOS. The TOS is a single entity in the FPGA, a small RAM, with two ports, one read, one write.


    Instead of asking open-ended questions, why not make a statement?

    I am sorry that I did not know that you needed this all spelled out.
    I was expecting that my question was enough to make the additional
    costs obvious.

    What question?


    Can you reply without the garbage at the ends of lines? What is the =20 thing?
    That's the quoted-printable garbage that is coming from your Usenet
    client. Some clients repair this garbage, but mine doesn't, so it
    gets cited like you posted it. You can see in

    It only shows up in your replies. I can't do anything about it. Can you?

    It's you who complained, so why should I?

    I can't do anything to fix it.


    If you wanted to do something about it, you might ask what you can do
    about it, so I conclude that you don't want to do anything about it.

    Ok, I'm sorry that you are getting hostile about this discussion.


    how your last two postings have been butchered by your Usenet client.

    Ok, I'll just ignore this.

    This confirms my conclusion.

    The registers are there in any event. The comparison is an N-way barrel processor, or N processors. Same number of registers, but in one case, much less logic.

    If you don't want to incur longer cycle times, and increase the
    pipelining for that purpose, you need additional registers.

    Sorry, you are not at all clear.


    If you want to run a single processor, it won't run N times faster, unless you pipeline it, adding the registers back.

    For the same amount of pipelining, a single-threaded processor will
    run at a higher clock rate than an N-way barrel processor,

    If you want to compare pipelining to barrel design, then they will run at roughly the same clock speed.


    and the
    barrel processor will process the instructions of a thread at a 1/N per-cycle rate, so one thread will run more than N times faster.

    You aren't clear here, but I assume you mean the pipelined design will run faster than a single thread on the barrel processor. That is true, but as I've already stated, my goal is to process multiple threads, not a single thread. It is exactly the complications that come from multitasking on a single-threaded processor, that I wish to avoid. So if you wish to completely ignore the multiple threads running on the barrel processor, then why have the discussion?


    N
    threads will still run faster if you switch the threads rarely enough.
    You need to add more pipelining (more complexity, more logic, more
    area) to get a throughput benefit from barrel processing.

    You keep saying things with zero support. You also fail to state clearly what you are saying. In this paragraph, I don't know what you are comparing. N threads in which processor?

    I think we are not actually talking about the same thing. You continually fail to understand the architecture I am describing and focus only on ASICs. I don't see any point in continuing to discuss this. So you get the last word.

    Thank you,

    --

    Rick C.

    +-+ Get 1,000 miles of free Supercharging
    +-+ Tesla referral code - https://ts.la/richard11209

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christopher Lozinski@21:1/5 to Lorum Ipsum on Mon Apr 10 13:11:31 2023
    Lorum Ipsum said:
    I would recommend that you design a stack processor in an HDL, to learn about the process, and more so, the nature of hardware.

    Great advice. First we will be pressing buttons, then blinking lights, adders, etc. I have a ways to go.

    We are also in a different era. There are so many interesting cpus out there. An important part of the process is to read them. In practice I will probably start with an existing CPU, and existing network on a chip, and merge them together.

    Few software people have a good understanding of what it takes to make good hardware.
    I think that is very deeply true. Very very different mindsets.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From S@21:1/5 to Christopher Lozinski on Fri Apr 28 04:05:25 2023
    On Saturday, April 1, 2023 at 9:25:42 PM UTC+10, Christopher Lozinski wrote:
    Okay, no one was too excited about my previous proposal, so I am proposing something else.

    For my master's thesis at Silesian University of Technology, I am now considering building an 8 core Forth CPU for real time control. Hard real time control on a single cpu is difficult, much better to allocate one cpu to controlling each signal.

    This would be a bit like the Propeller Parallax 2, and a bit different. That device uses register machines, this would be based on stack machines.
    That device emulates Forth. This would have native Forth instructions.
    In that device, each core has
    512 longs of dual-port register RAM for code and fast variables;
    512 longs of dual-port lookup RAM for code, streamer lookup, and variables;
    Access to (1M?) hub RAM every 8 clock cycles.
    Pairwise Parallax cores can access some of their neighbors registers.

    I would like to make it 8 proper Forth CPU's rather than register machines. Rather than a big central hub memory, I would like each core to have more memory. How much? With two port memories, they could each share 2 * (1/8)th of the memory. (1/4) of the total memory. Not bad.

    I would like communication between adjacent cores. I like how the GA144 allows neighboring cores to communicate. That seems important to me.

    I wonder if this would be of interest to anyone?

    There is a good chance that I would do this in cooperation with the AI and Robotics guys and their CORE 1 CPU. Ting's ep16/24/32 are also interesting. There are a bunch of other cores I need to evaluate as well. Everyone speaks well of the J1.

    In other news, school is going well. I am really impressed with the education here. As a software developer, I completely misunderstood how to write Verilog. If I had tried it, it would have been a disaster. One software developer famously used nested Verilog while loops to generate a slow clock pulse. I strongly advise any developer considering designing a chip, to get educated in digital design first.

    Alternatively, you can do something like use the Intel design tools, to lay out components and their connectivity. I am sure that there are other such tools out there. But starting with Verilog, or even VHDL, for a software developer, is bound to lead to an endless stream of problems.

    Warm Regards
    Christopher Lozinski
    Don't worry too much about him. Look up my threads for ideas.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)