Forum: >>> Magnum BBS <<<

Interpreter Dispatch in C (was: Effect of CPP Tags)

From bart@21:1/5 to David Brown on Tue Jan 16 19:21:27 2024

On 16/01/2024 15:29, David Brown wrote:

On 16/01/2024 13:26, bart wrote:

Let me just say that, in my interpreter, the extensive use of inline
assembly in one module, makes some programs run twice as fast, as a
gcc-O3-compiled C rendering.

That can sometimes happen, for particular kinds of code. It is not
often that hand-written assembly will be so significantly faster than
gcc unless there are close matches for unusual instructions that the
compiler does not generate. But it can certainly happen.

Often the C code can be improved in various ways - perhaps using
extensions (such as gcc attributes). Getting the very best out of a compiler for key performance-critical code is rarely just a matter of
writing "-O3" and hoping for the best.

If you can boil this down to a short and manageable piece of code, then
it might be fun to look at ways of improving the speed using either pure standard C, or gcc extensions, and compare it to the hand-generated assembly. I realise that making such as small sample that keeps the
effect is unlikely to be a trivial task. (And if you do this, put it in
a new thread :-) )

I have done experiments that comprise under 100 lines of code and that
test a very small number of bytecodes, maybe just one.

But I don't believe you can extract useful conclusions that will scale.

gcc's agressive optimiser is likely to reduce such a simple program to
nothing, or it will be able to infer things are not possible when spread
over 20,000 lines of code over multiple modules, and where the test
bytecode is a runtime input, not set up in a data structure.

I've worked with four main kinds of bytecode dispatcher:

(1) Using a table of function pointers. The dispatcher is then a simple
3-line loop

(2) Using a large switch statement. Each case can be either inline code
or can call a function. gcc likes this one because it can inline
function calls

(3) Using computed-goto. In C, this would need gcc's extension to use
label pointers

(4) Using a threaded-code dispatcher, making extensive use of inline
assembly, which is an overlay over (1): when a bytecode can't be
fully handled here, it calls one of the handlers from (1).

These are progressively faster. I no longer use (2) or (3); there's no
point if I have (4).

The C version used (1), compared with (4) using my compiler and language.

I have in the past compared C versions of (2) and (3) with (4), and (4)
was still faster, although that was some time ago.

When I lookat at CPython sources a decade ago, they used method (3) when compiled on Linux. On Windows however, they used method (2), since it
needs MSVC to build, which does not have the needed extension.

(My language also has label pointers, but it also has a built-in feature
for computed-goto: you only have to tweak one line to change a regular switch-loop into one that uses computed-goto. That is, using multiple
loop-back points so that each can have its own branch prediction.

I don't use that in this product, only in a separate project.)

-------------

My (4) dispatcher uses thread-code functions that try and keep execution
within a tight, register-based environment:

* Essential globals are kept in registers

* There is no function entry/exit code: each handler jumps directly to
the next, without any loops

* ABI considerations are put aside (eg. all non-volatiles are saved once
at the beginning, and restored at the end)

* Most handlers use inline assembly

* When it is necessary to call a normal HLL handler, the environment
must be saved and restored.

This an example of a very simple handler that uses inline assembly:

threadedproc j_jump*=
assem
mov Dprog,[Dprog+kopnda]
*jumpnext
end
end

And this is one which uses the HLL handler:

threadedproc j_jumpptr*=
saveregs
k_jumpptr()
loadregs
jumpnext
end

saveregs/loadregs are macros. In both cases, 'jumpnext' is this macro
(it needs * to invoke it from assembly):

macro jumpnext = asm jmp [Dprog]

So, the overheads of executing 'goto L' in the interpreted language are
two machine instructions.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to bart on Tue Jan 16 23:24:42 2024

On 16/01/2024 20:21, bart wrote:

On 16/01/2024 15:29, David Brown wrote:

On 16/01/2024 13:26, bart wrote:

Let me just say that, in my interpreter, the extensive use of inline
assembly in one module, makes some programs run twice as fast, as a
gcc-O3-compiled C rendering.

That can sometimes happen, for particular kinds of code. It is not
often that hand-written assembly will be so significantly faster than
gcc unless there are close matches for unusual instructions that the
compiler does not generate. But it can certainly happen.

Often the C code can be improved in various ways - perhaps using
extensions (such as gcc attributes). Getting the very best out of a
compiler for key performance-critical code is rarely just a matter of
writing "-O3" and hoping for the best.

If you can boil this down to a short and manageable piece of code,
then it might be fun to look at ways of improving the speed using
either pure standard C, or gcc extensions, and compare it to the
hand-generated assembly. I realise that making such as small sample
that keeps the effect is unlikely to be a trivial task. (And if you
do this, put it in a new thread :-) )

I have done experiments that comprise under 100 lines of code and that
test a very small number of bytecodes, maybe just one.

But I don't believe you can extract useful conclusions that will scale.

That is sometimes the case.

gcc's agressive optimiser is likely to reduce such a simple program to nothing, or it will be able to infer things are not possible when spread
over 20,000 lines of code over multiple modules, and where the test
bytecode is a runtime input, not set up in a data structure.

A trick here is to make your source data "volatile" - or make it depend
on a volatile (so that you are not affecting the access to the data
itself). And put the results into a volatile.

volatile int nothing = 0;
volatile int result;

void foo(void) {
int test_data = 1234;
double more_data = 3.14;

test_data += nothing;

// or

if (nothing) {
test_data = nothing;
more_data = nothing;
}

// Start timer
int x = run_tests(test_data, more_data);
// Stop timer

result = x;
}

It doesn't really matter how you use "nothing", just that the value of
the inputs could be affected by it.

(I can show you an alternative trick using target-independent inline
assembly, if you like.)

I've worked with four main kinds of bytecode dispatcher:

(1) Using a table of function pointers. The dispatcher is then a simple
     3-line loop

OK.

(2) Using a large switch statement. Each case can be either inline code
     or can call a function. gcc likes this one because it can inline
     function calls

I prefer switches to jump tables - I avoid function pointers where I
can, because they make it so much harder to analyse call trees.

(3) Using computed-goto. In C, this would need gcc's extension to use
     label pointers

That is also sometimes used for such code. Again, I prefer the switch here.

(4) Using a threaded-code dispatcher, making extensive use of inline
     assembly, which is an overlay over (1): when a bytecode can't be
     fully handled here, it calls one of the handlers from (1).

These are progressively faster. I no longer use (2) or (3); there's no
point if I have (4).

The C version used (1), compared with (4) using my compiler and language.

It is perhaps a bit unfair to compare a slow algorithm in C with a fast algorithm in a different language!

Much will depend on cache usage, branch prediction, and if you can
arrange for common cases to be checked first. And also remember that
"-O3" is not always faster than "-O2", and that sometimes there are
other flags that make a significant difference.

I have in the past compared C versions of (2) and (3) with (4), and (4)
was still faster, although that was some time ago.

Depending on your value of "some", that might make a difference - gcc
has improved over time.

When I lookat at CPython sources a decade ago, they used method (3) when compiled on Linux. On Windows however, they used method (2), since it
needs MSVC to build, which does not have the needed extension.

There's a lot of work been done on such bytecode interpreters. One of
the best-known experts, Anton Ertl, is the author of the GForth bytecode interpreter. His code is all in gcc-extended C. (He gets really worked
up about some gcc optimisations "breaking" his "correct" code - we've
had a few disagreements of opinion on that topic. You'd like him :-) )
You may find some interesting papers on his webpage:

<http://www.complang.tuwien.ac.at/projects/interpreters.html>

He hangs out in comp.arch, and possibly other newsgroups.

(My language also has label pointers, but it also has a built-in feature
for computed-goto: you only have to tweak one line to change a regular switch-loop into one that uses computed-goto. That is, using multiple loop-back points so that each can have its own branch prediction.

I don't use that in this product, only in a separate project.)

-------------

My (4) dispatcher uses thread-code functions that try and keep execution within a tight, register-based environment:

* Essential globals are kept in registers

* There is no function entry/exit code: each handler jumps directly to
the next, without any loops

* ABI considerations are put aside (eg. all non-volatiles are saved once
at the beginning, and restored at the end)

* Most handlers use inline assembly

* When it is necessary to call a normal HLL handler, the environment
must be saved and restored.

This an example of a very simple handler that uses inline assembly:

    threadedproc j_jump*=
        assem
            mov Dprog,[Dprog+kopnda]
            *jumpnext
        end
    end

And this is one which uses the HLL handler:

    threadedproc j_jumpptr*=
        saveregs
        k_jumpptr()
        loadregs
        jumpnext
    end

saveregs/loadregs are macros. In both cases, 'jumpnext' is this macro
(it needs * to invoke it from assembly):

    macro jumpnext = asm jmp [Dprog]

So, the overheads of executing 'goto L' in the interpreted language are
two machine instructions.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Janis Papanagnou@21:1/5 to David Brown on Wed Jan 17 07:07:51 2024

On 16.01.2024 18:46, David Brown wrote:

On 16/01/2024 16:15, bart wrote:

The need for assembly usually trumps whatever minor optimisation my
compiler might do.

The good thing is that compilers do also major optimizations.
(And the minor ones come for free.)

But the most important insight is that tackling complexity
algorithmically is what is most contributing for speed.

(This may be different for folks fiddling only on assembler
level, where _separate_ assembler projects serve better.
It's IMO more sensible for time critical stuff to just use
an assembler for such functions and embed them by a clean
function interface in any HLL than to inline assembler code.
YMMV.)

Ah, there we differ. I use tools that do good optimisations. And
perhaps 70% of the use-cases of inline assembly will be lost if the tool can't generate efficient code when there is inline assembly.

Janis

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Janis Papanagnou@21:1/5 to David Brown on Wed Jan 17 07:11:58 2024

On 16.01.2024 19:03, David Brown wrote:

[...], because they know that Norwegians are superior to Swedes in every way...

I just wanted to reply on that but then I saw your email address... :-)

Janis

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Janis Papanagnou@21:1/5 to Scott Lurndal on Wed Jan 17 06:25:34 2024

On 16.01.2024 16:57, Scott Lurndal wrote:

Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:

On 16.01.2024 14:42, David Brown wrote:

On 16/01/2024 12:54, bart wrote:

Which processor is this for again?

It's for a cpu with a "madd" instruction that implements "x = x + y * z" >>> in a single instruction - as Kaz pointed out, doing this in inline
assembly would make sense if it the cpu had such a dedicated
instruction. [...]

I recall such a feature from a 35+ years old assembler project I did.
It was on the TI TMS320C25 DSP, and the instruction was called 'MAC'
(Multiply and ACcumulate). - Not sure it clarifies anything but just
as an amendment if someone is interested in searching for keywords
on such a function.

Pretty much every modern architecture has multiply and accumulate instructions, even the ARM Cortex-M7 cores (MLA instruction).

Yeah, I inferred that already from David's response. At these days,
though, I hadn't seen 'MAC' in the common standard CPUs. (And since
that time I didn't do any assembler any more, so I don't know the
contemporary architectures.)

It's maybe noteworthy that despite such advanced CPU functions I
didn't make use of them, despite I had functions like sum(x[i]*y[i])
to implement. The reason was that an algorithmic transformation,
an optimization, made use of that 1-cycle MAC an inferior solution!

Janis

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Janis Papanagnou on Wed Jan 17 14:17:21 2024

On 17/01/2024 07:11, Janis Papanagnou wrote:

On 16.01.2024 19:03, David Brown wrote:

[...], because they know that Norwegians are superior to Swedes in every way...

I just wanted to reply on that but then I saw your email address... :-)

I'm Scottish by origin, but have lived in Norway for about 30 years. So
I am a Norwegian by practice (and citizenship) rather than birth.

(Norway and Sweden view each other as "brother" countries, with very
close ties and cooperation, but also good-natured rivalry and occasional teasing.)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to David Brown on Wed Jan 17 16:33:26 2024

David Brown <david.brown@hesbynett.no> writes:

On 17/01/2024 07:11, Janis Papanagnou wrote:

On 16.01.2024 19:03, David Brown wrote:

[...], because they know that Norwegians are superior to Swedes in every way...

I just wanted to reply on that but then I saw your email address... :-)

I'm Scottish by origin, but have lived in Norway for about 30 years. So
I am a Norwegian by practice (and citizenship) rather than birth.

(Norway and Sweden view each other as "brother" countries, with very
close ties and cooperation, but also good-natured rivalry and occasional >teasing.)

And at various times in the past they've been one country (c.f. United Kingdoms).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Scott Lurndal on Wed Jan 17 18:47:50 2024

On 17/01/2024 17:33, Scott Lurndal wrote:

David Brown <david.brown@hesbynett.no> writes:

On 17/01/2024 07:11, Janis Papanagnou wrote:

On 16.01.2024 19:03, David Brown wrote:

[...], because they know that Norwegians are superior to Swedes in every way...

I just wanted to reply on that but then I saw your email address... :-)

I'm Scottish by origin, but have lived in Norway for about 30 years. So
I am a Norwegian by practice (and citizenship) rather than birth.

(Norway and Sweden view each other as "brother" countries, with very
close ties and cooperation, but also good-natured rivalry and occasional
teasing.)

And at various times in the past they've been one country (c.f. United Kingdoms).

Norway and Sweden have always been separate countries (since they became countries). They were in a union, with Norway dominated and ruled by
Sweden, but they were still different countries.

The Ununited Kingdom, on the other hand, is in some aspects one country,
in other aspects four different countries (with a whole bunch of
complicated bits), and in many aspects a complete mess.

(I should stop there - this is no place to risk getting into politics.)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to David Brown on Wed Jan 17 18:04:14 2024

David Brown <david.brown@hesbynett.no> writes:

On 17/01/2024 17:33, Scott Lurndal wrote:

David Brown <david.brown@hesbynett.no> writes:

On 17/01/2024 07:11, Janis Papanagnou wrote:

On 16.01.2024 19:03, David Brown wrote:

[...], because they know that Norwegians are superior to Swedes in every way...

I just wanted to reply on that but then I saw your email address... :-) >>>>

I'm Scottish by origin, but have lived in Norway for about 30 years. So >>> I am a Norwegian by practice (and citizenship) rather than birth.

(Norway and Sweden view each other as "brother" countries, with very
close ties and cooperation, but also good-natured rivalry and occasional >>> teasing.)

And at various times in the past they've been one country (c.f. United Kingdoms).

Norway and Sweden have always been separate countries (since they became >countries). They were in a union, with Norway dominated and ruled by
Sweden, but they were still different countries.

Note the plural kingdoms above.

"United Kingdoms of Sweden and Norway, and known as the United Kingdoms"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Janis Papanagnou@21:1/5 to David Brown on Wed Jan 17 19:15:50 2024

On 17.01.2024 14:17, David Brown wrote:

On 17/01/2024 07:11, Janis Papanagnou wrote:

On 16.01.2024 19:03, David Brown wrote:

[...], because they know that Norwegians are superior to Swedes in
every way...

I just wanted to reply on that but then I saw your email address... :-)

I'm Scottish by origin, but have lived in Norway for about 30 years. So
I am a Norwegian by practice (and citizenship) rather than birth.

By birth, or whatever; I think we are what we feel to be. :-)

(Norway and Sweden view each other as "brother" countries, with very
close ties and cooperation,

Yes, that's also how these countries are seen from other places.
I've been a couple times in SE, and once in NO; I like these
countries.

Also with respect to computer languages; I was amazed by Simula,
originating from the NCC in Oslo/NO, and I had a compiler that
had been developed in Lund/SE, where I could even talk to one
of the developers in the 1980's.)

but also good-natured rivalry and occasional teasing.)

Within DE there's rivalry even between Bavaria and Prussia. ;-)
(Not seriously but mainly also only some teasing. Historically,
though, it had been a much more serious conflict.)

Janis

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Otto J. Makela@21:1/5 to Janis Papanagnou on Thu Jan 18 17:22:41 2024

Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

On 16.01.2024 19:03, David Brown wrote:

[...], because they know that Norwegians are superior to Swedes in
every way...

I just wanted to reply on that but then I saw your email address... :-)

A Finn smirking from the sidelines.
--
/* * * Otto J. Makela <om@iki.fi> * * * * * * * * * */
/* Phone: +358 40 765 5772, ICBM: N 60 10' E 24 55' */
/* Mail: Mechelininkatu 26 B 27, FI-00100 Helsinki */
/* * * Computers Rule 01001111 01001011 * * * * * * */

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Phil Carmody@21:1/5 to Otto J. Makela on Sun Mar 24 14:24:28 2024

om@iki.fi (Otto J. Makela) writes:

Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

On 16.01.2024 19:03, David Brown wrote:

[...], because they know that Norwegians are superior to Swedes in
every way...

I just wanted to reply on that but then I saw your email address... :-)

A Finn smirking from the sidelines.

As if you have no rivalry with Swedes. What's the score? 5-1? Ooops, now
it's 5-6. We have a great view here from Estonia.

/* Mail: Mechelininkatu 26 B 27, FI-00100 Helsinki */

Send my regards to Vastarannan Kiiski.

Phil
--
We are no longer hunters and nomads. No longer awed and frightened, as we have gained some understanding of the world in which we live. As such, we can cast aside childish remnants from the dawn of our civilization.
-- NotSanguine on SoylentNews, after Eugen Weber in /The Western Tradition/

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Bob Worm
  Thu May 2 20:36:49 2024
  from Wales, Uk via Telnet
- Bob Worm
  Thu May 2 20:26:33 2024
  from Wales, Uk via Telnet
- Bob Worm
  Thu May 2 20:10:37 2024
  from Wales, Uk via Telnet
- Bob Worm
  Thu May 2 19:46:45 2024
  from Wales, Uk via Telnet

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	299
Nodes:	16 (2 / 14)
Uptime:	51:12:04
Calls:	6,689
Calls today:	7
Files:	12,225
Messages:	5,344,600
Posted today:	1

Interpreter Dispatch in C (was: Effect of CPP Tags)

Who's Online

Recent Visitors

System Info