• Interpreter Dispatch in C (was: Effect of CPP Tags)

    From bart@21:1/5 to David Brown on Tue Jan 16 19:21:27 2024
    On 16/01/2024 15:29, David Brown wrote:
    On 16/01/2024 13:26, bart wrote:

    Let me just say that, in my interpreter, the extensive use of inline
    assembly in one module, makes some programs run twice as fast, as a
    gcc-O3-compiled C rendering.

    That can sometimes happen, for particular kinds of code.  It is not
    often that hand-written assembly will be so significantly faster than
    gcc unless there are close matches for unusual instructions that the
    compiler does not generate.  But it can certainly happen.

    Often the C code can be improved in various ways - perhaps using
    extensions (such as gcc attributes).  Getting the very best out of a compiler for key performance-critical code is rarely just a matter of
    writing "-O3" and hoping for the best.

    If you can boil this down to a short and manageable piece of code, then
    it might be fun to look at ways of improving the speed using either pure standard C, or gcc extensions, and compare it to the hand-generated assembly.  I realise that making such as small sample that keeps the
    effect is unlikely to be a trivial task.  (And if you do this, put it in
    a new thread :-) )

    I have done experiments that comprise under 100 lines of code and that
    test a very small number of bytecodes, maybe just one.

    But I don't believe you can extract useful conclusions that will scale.

    gcc's agressive optimiser is likely to reduce such a simple program to
    nothing, or it will be able to infer things are not possible when spread
    over 20,000 lines of code over multiple modules, and where the test
    bytecode is a runtime input, not set up in a data structure.

    I've worked with four main kinds of bytecode dispatcher:

    (1) Using a table of function pointers. The dispatcher is then a simple
    3-line loop

    (2) Using a large switch statement. Each case can be either inline code
    or can call a function. gcc likes this one because it can inline
    function calls

    (3) Using computed-goto. In C, this would need gcc's extension to use
    label pointers

    (4) Using a threaded-code dispatcher, making extensive use of inline
    assembly, which is an overlay over (1): when a bytecode can't be
    fully handled here, it calls one of the handlers from (1).

    These are progressively faster. I no longer use (2) or (3); there's no
    point if I have (4).

    The C version used (1), compared with (4) using my compiler and language.

    I have in the past compared C versions of (2) and (3) with (4), and (4)
    was still faster, although that was some time ago.

    When I lookat at CPython sources a decade ago, they used method (3) when compiled on Linux. On Windows however, they used method (2), since it
    needs MSVC to build, which does not have the needed extension.

    (My language also has label pointers, but it also has a built-in feature
    for computed-goto: you only have to tweak one line to change a regular switch-loop into one that uses computed-goto. That is, using multiple
    loop-back points so that each can have its own branch prediction.

    I don't use that in this product, only in a separate project.)

    -------------

    My (4) dispatcher uses thread-code functions that try and keep execution
    within a tight, register-based environment:

    * Essential globals are kept in registers

    * There is no function entry/exit code: each handler jumps directly to
    the next, without any loops

    * ABI considerations are put aside (eg. all non-volatiles are saved once
    at the beginning, and restored at the end)

    * Most handlers use inline assembly

    * When it is necessary to call a normal HLL handler, the environment
    must be saved and restored.

    This an example of a very simple handler that uses inline assembly:

    threadedproc j_jump*=
    assem
    mov Dprog,[Dprog+kopnda]
    *jumpnext
    end
    end

    And this is one which uses the HLL handler:

    threadedproc j_jumpptr*=
    saveregs
    k_jumpptr()
    loadregs
    jumpnext
    end

    saveregs/loadregs are macros. In both cases, 'jumpnext' is this macro
    (it needs * to invoke it from assembly):

    macro jumpnext = asm jmp [Dprog]

    So, the overheads of executing 'goto L' in the interpreted language are
    two machine instructions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to bart on Tue Jan 16 23:24:42 2024
    On 16/01/2024 20:21, bart wrote:
    On 16/01/2024 15:29, David Brown wrote:
    On 16/01/2024 13:26, bart wrote:

    Let me just say that, in my interpreter, the extensive use of inline
    assembly in one module, makes some programs run twice as fast, as a
    gcc-O3-compiled C rendering.

    That can sometimes happen, for particular kinds of code.  It is not
    often that hand-written assembly will be so significantly faster than
    gcc unless there are close matches for unusual instructions that the
    compiler does not generate.  But it can certainly happen.

    Often the C code can be improved in various ways - perhaps using
    extensions (such as gcc attributes).  Getting the very best out of a
    compiler for key performance-critical code is rarely just a matter of
    writing "-O3" and hoping for the best.

    If you can boil this down to a short and manageable piece of code,
    then it might be fun to look at ways of improving the speed using
    either pure standard C, or gcc extensions, and compare it to the
    hand-generated assembly.  I realise that making such as small sample
    that keeps the effect is unlikely to be a trivial task.  (And if you
    do this, put it in a new thread :-) )

    I have done experiments that comprise under 100 lines of code and that
    test a very small number of bytecodes, maybe just one.

    But I don't believe you can extract useful conclusions that will scale.


    That is sometimes the case.

    gcc's agressive optimiser is likely to reduce such a simple program to nothing, or it will be able to infer things are not possible when spread
    over 20,000 lines of code over multiple modules, and where the test
    bytecode is a runtime input, not set up in a data structure.


    A trick here is to make your source data "volatile" - or make it depend
    on a volatile (so that you are not affecting the access to the data
    itself). And put the results into a volatile.

    volatile int nothing = 0;
    volatile int result;

    void foo(void) {
    int test_data = 1234;
    double more_data = 3.14;

    test_data += nothing;

    // or

    if (nothing) {
    test_data = nothing;
    more_data = nothing;
    }

    // Start timer
    int x = run_tests(test_data, more_data);
    // Stop timer

    result = x;
    }

    It doesn't really matter how you use "nothing", just that the value of
    the inputs could be affected by it.

    (I can show you an alternative trick using target-independent inline
    assembly, if you like.)

    I've worked with four main kinds of bytecode dispatcher:

    (1)  Using a table of function pointers. The dispatcher is then a simple
         3-line loop


    OK.

    (2)  Using a large switch statement. Each case can be either inline code
         or can call a function. gcc likes this one because it can inline
         function calls


    I prefer switches to jump tables - I avoid function pointers where I
    can, because they make it so much harder to analyse call trees.

    (3)  Using computed-goto. In C, this would need gcc's extension to use
         label pointers

    That is also sometimes used for such code. Again, I prefer the switch here.


    (4)  Using a threaded-code dispatcher, making extensive use of inline
         assembly, which is an overlay over (1): when a bytecode can't be
         fully handled here, it calls one of the handlers from (1).

    These are progressively faster. I no longer use (2) or (3); there's no
    point if I have (4).

    The C version used (1), compared with (4) using my compiler and language.


    It is perhaps a bit unfair to compare a slow algorithm in C with a fast algorithm in a different language!

    Much will depend on cache usage, branch prediction, and if you can
    arrange for common cases to be checked first. And also remember that
    "-O3" is not always faster than "-O2", and that sometimes there are
    other flags that make a significant difference.

    I have in the past compared C versions of (2) and (3) with (4), and (4)
    was still faster, although that was some time ago.

    Depending on your value of "some", that might make a difference - gcc
    has improved over time.


    When I lookat at CPython sources a decade ago, they used method (3) when compiled on Linux. On Windows however, they used method (2), since it
    needs MSVC to build, which does not have the needed extension.


    There's a lot of work been done on such bytecode interpreters. One of
    the best-known experts, Anton Ertl, is the author of the GForth bytecode interpreter. His code is all in gcc-extended C. (He gets really worked
    up about some gcc optimisations "breaking" his "correct" code - we've
    had a few disagreements of opinion on that topic. You'd like him :-) )
    You may find some interesting papers on his webpage:

    <http://www.complang.tuwien.ac.at/projects/interpreters.html>

    He hangs out in comp.arch, and possibly other newsgroups.

    (My language also has label pointers, but it also has a built-in feature
    for computed-goto: you only have to tweak one line to change a regular switch-loop into one that uses computed-goto. That is, using multiple loop-back points so that each can have its own branch prediction.

    I don't use that in this product, only in a separate project.)

    -------------

    My (4) dispatcher uses thread-code functions that try and keep execution within a tight, register-based environment:

    * Essential globals are kept in registers

    * There is no function entry/exit code: each handler jumps directly to
      the next, without any loops

    * ABI considerations are put aside (eg. all non-volatiles are saved once
      at the beginning, and restored at the end)

    * Most handlers use inline assembly

    * When it is necessary to call a normal HLL handler, the environment
      must be saved and restored.

    This an example of a very simple handler that uses inline assembly:

        threadedproc j_jump*=
            assem
                mov Dprog,[Dprog+kopnda]
                *jumpnext
            end
        end

    And this is one which uses the HLL handler:

        threadedproc j_jumpptr*=
            saveregs
            k_jumpptr()
            loadregs
            jumpnext
        end

    saveregs/loadregs are macros. In both cases, 'jumpnext' is this macro
    (it needs * to invoke it from assembly):

        macro jumpnext = asm jmp [Dprog]

    So, the overheads of executing 'goto L' in the interpreted language are
    two machine instructions.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to David Brown on Wed Jan 17 07:07:51 2024
    On 16.01.2024 18:46, David Brown wrote:
    On 16/01/2024 16:15, bart wrote:

    The need for assembly usually trumps whatever minor optimisation my
    compiler might do.

    The good thing is that compilers do also major optimizations.
    (And the minor ones come for free.)

    But the most important insight is that tackling complexity
    algorithmically is what is most contributing for speed.

    (This may be different for folks fiddling only on assembler
    level, where _separate_ assembler projects serve better.
    It's IMO more sensible for time critical stuff to just use
    an assembler for such functions and embed them by a clean
    function interface in any HLL than to inline assembler code.
    YMMV.)


    Ah, there we differ. I use tools that do good optimisations. And
    perhaps 70% of the use-cases of inline assembly will be lost if the tool can't generate efficient code when there is inline assembly.

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to David Brown on Wed Jan 17 07:11:58 2024
    On 16.01.2024 19:03, David Brown wrote:

    [...], because they know that Norwegians are superior to Swedes in every way...

    I just wanted to reply on that but then I saw your email address... :-)

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Scott Lurndal on Wed Jan 17 06:25:34 2024
    On 16.01.2024 16:57, Scott Lurndal wrote:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    On 16.01.2024 14:42, David Brown wrote:
    On 16/01/2024 12:54, bart wrote:

    Which processor is this for again?

    It's for a cpu with a "madd" instruction that implements "x = x + y * z" >>> in a single instruction - as Kaz pointed out, doing this in inline
    assembly would make sense if it the cpu had such a dedicated
    instruction. [...]

    I recall such a feature from a 35+ years old assembler project I did.
    It was on the TI TMS320C25 DSP, and the instruction was called 'MAC'
    (Multiply and ACcumulate). - Not sure it clarifies anything but just
    as an amendment if someone is interested in searching for keywords
    on such a function.

    Pretty much every modern architecture has multiply and accumulate instructions, even the ARM Cortex-M7 cores (MLA instruction).

    Yeah, I inferred that already from David's response. At these days,
    though, I hadn't seen 'MAC' in the common standard CPUs. (And since
    that time I didn't do any assembler any more, so I don't know the
    contemporary architectures.)

    It's maybe noteworthy that despite such advanced CPU functions I
    didn't make use of them, despite I had functions like sum(x[i]*y[i])
    to implement. The reason was that an algorithmic transformation,
    an optimization, made use of that 1-cycle MAC an inferior solution!

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Janis Papanagnou on Wed Jan 17 14:17:21 2024
    On 17/01/2024 07:11, Janis Papanagnou wrote:
    On 16.01.2024 19:03, David Brown wrote:

    [...], because they know that Norwegians are superior to Swedes in every way...

    I just wanted to reply on that but then I saw your email address... :-)


    I'm Scottish by origin, but have lived in Norway for about 30 years. So
    I am a Norwegian by practice (and citizenship) rather than birth.

    (Norway and Sweden view each other as "brother" countries, with very
    close ties and cooperation, but also good-natured rivalry and occasional teasing.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to David Brown on Wed Jan 17 16:33:26 2024
    David Brown <david.brown@hesbynett.no> writes:
    On 17/01/2024 07:11, Janis Papanagnou wrote:
    On 16.01.2024 19:03, David Brown wrote:

    [...], because they know that Norwegians are superior to Swedes in every way...

    I just wanted to reply on that but then I saw your email address... :-)


    I'm Scottish by origin, but have lived in Norway for about 30 years. So
    I am a Norwegian by practice (and citizenship) rather than birth.

    (Norway and Sweden view each other as "brother" countries, with very
    close ties and cooperation, but also good-natured rivalry and occasional >teasing.)

    And at various times in the past they've been one country (c.f. United Kingdoms).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Scott Lurndal on Wed Jan 17 18:47:50 2024
    On 17/01/2024 17:33, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 17/01/2024 07:11, Janis Papanagnou wrote:
    On 16.01.2024 19:03, David Brown wrote:

    [...], because they know that Norwegians are superior to Swedes in every way...

    I just wanted to reply on that but then I saw your email address... :-)


    I'm Scottish by origin, but have lived in Norway for about 30 years. So
    I am a Norwegian by practice (and citizenship) rather than birth.

    (Norway and Sweden view each other as "brother" countries, with very
    close ties and cooperation, but also good-natured rivalry and occasional
    teasing.)

    And at various times in the past they've been one country (c.f. United Kingdoms).


    Norway and Sweden have always been separate countries (since they became countries). They were in a union, with Norway dominated and ruled by
    Sweden, but they were still different countries.

    The Ununited Kingdom, on the other hand, is in some aspects one country,
    in other aspects four different countries (with a whole bunch of
    complicated bits), and in many aspects a complete mess.

    (I should stop there - this is no place to risk getting into politics.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to David Brown on Wed Jan 17 18:04:14 2024
    David Brown <david.brown@hesbynett.no> writes:
    On 17/01/2024 17:33, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 17/01/2024 07:11, Janis Papanagnou wrote:
    On 16.01.2024 19:03, David Brown wrote:

    [...], because they know that Norwegians are superior to Swedes in every way...

    I just wanted to reply on that but then I saw your email address... :-) >>>>

    I'm Scottish by origin, but have lived in Norway for about 30 years. So >>> I am a Norwegian by practice (and citizenship) rather than birth.

    (Norway and Sweden view each other as "brother" countries, with very
    close ties and cooperation, but also good-natured rivalry and occasional >>> teasing.)

    And at various times in the past they've been one country (c.f. United Kingdoms).


    Norway and Sweden have always been separate countries (since they became >countries). They were in a union, with Norway dominated and ruled by
    Sweden, but they were still different countries.

    Note the plural kingdoms above.


    "United Kingdoms of Sweden and Norway, and known as the United Kingdoms"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to David Brown on Wed Jan 17 19:15:50 2024
    On 17.01.2024 14:17, David Brown wrote:
    On 17/01/2024 07:11, Janis Papanagnou wrote:
    On 16.01.2024 19:03, David Brown wrote:

    [...], because they know that Norwegians are superior to Swedes in
    every way...

    I just wanted to reply on that but then I saw your email address... :-)

    I'm Scottish by origin, but have lived in Norway for about 30 years. So
    I am a Norwegian by practice (and citizenship) rather than birth.

    By birth, or whatever; I think we are what we feel to be. :-)


    (Norway and Sweden view each other as "brother" countries, with very
    close ties and cooperation,

    Yes, that's also how these countries are seen from other places.
    I've been a couple times in SE, and once in NO; I like these
    countries.

    Also with respect to computer languages; I was amazed by Simula,
    originating from the NCC in Oslo/NO, and I had a compiler that
    had been developed in Lund/SE, where I could even talk to one
    of the developers in the 1980's.)

    but also good-natured rivalry and occasional teasing.)

    Within DE there's rivalry even between Bavaria and Prussia. ;-)
    (Not seriously but mainly also only some teasing. Historically,
    though, it had been a much more serious conflict.)

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Otto J. Makela@21:1/5 to Janis Papanagnou on Thu Jan 18 17:22:41 2024
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    On 16.01.2024 19:03, David Brown wrote:
    [...], because they know that Norwegians are superior to Swedes in
    every way...

    I just wanted to reply on that but then I saw your email address... :-)

    A Finn smirking from the sidelines.
    --
    /* * * Otto J. Makela <om@iki.fi> * * * * * * * * * */
    /* Phone: +358 40 765 5772, ICBM: N 60 10' E 24 55' */
    /* Mail: Mechelininkatu 26 B 27, FI-00100 Helsinki */
    /* * * Computers Rule 01001111 01001011 * * * * * * */

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Phil Carmody@21:1/5 to Otto J. Makela on Sun Mar 24 14:24:28 2024
    om@iki.fi (Otto J. Makela) writes:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    On 16.01.2024 19:03, David Brown wrote:
    [...], because they know that Norwegians are superior to Swedes in
    every way...

    I just wanted to reply on that but then I saw your email address... :-)

    A Finn smirking from the sidelines.

    As if you have no rivalry with Swedes. What's the score? 5-1? Ooops, now
    it's 5-6. We have a great view here from Estonia.

    /* Mail: Mechelininkatu 26 B 27, FI-00100 Helsinki */

    Send my regards to Vastarannan Kiiski.

    Phil
    --
    We are no longer hunters and nomads. No longer awed and frightened, as we have gained some understanding of the world in which we live. As such, we can cast aside childish remnants from the dawn of our civilization.
    -- NotSanguine on SoylentNews, after Eugen Weber in /The Western Tradition/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)