On 16/01/2024 13:26, bart wrote:
Let me just say that, in my interpreter, the extensive use of inline
assembly in one module makes some programs run twice as fast as a
gcc-O3-compiled C rendering.
That can sometimes happen, for particular kinds of code. It is not
often that hand-written assembly will be so significantly faster than
gcc unless there are close matches for unusual instructions that the
compiler does not generate. But it can certainly happen.
Often the C code can be improved in various ways - perhaps using
extensions (such as gcc attributes). Getting the very best out of a compiler for key performance-critical code is rarely just a matter of
writing "-O3" and hoping for the best.
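To make "gcc attributes" a little more concrete, here is a minimal sketch (not from the thread) using two GCC/Clang extensions: `__attribute__((hot))` to mark a function as hot and `__builtin_expect` to mark a branch as unlikely. The function `sum_checked` and the `HOT`/`UNLIKELY` macros are hypothetical names; the macros expand to nothing on other compilers, so the code remains standard C.

```c
#include <stddef.h>

/* Hypothetical portability macros: GCC/Clang hints, no-ops elsewhere. */
#if defined(__GNUC__)
#define HOT          __attribute__((hot))
#define UNLIKELY(x)  __builtin_expect(!!(x), 0)
#else
#define HOT
#define UNLIKELY(x)  (x)
#endif

/* Sum n values; a negative value is treated as a rare error.
   Returns -1 and sets *error to 1 on failure. */
HOT long sum_checked(const int *v, size_t n, int *error)
{
    long total = 0;
    *error = 0;
    for (size_t i = 0; i < n; i++) {
        if (UNLIKELY(v[i] < 0)) {   /* tell gcc the error path is cold */
            *error = 1;
            return -1;
        }
        total += v[i];
    }
    return total;
}
```

Whether these hints change the generated code depends on the compiler version and target; they are hints, not guarantees.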
If you can boil this down to a short and manageable piece of code, then
it might be fun to look at ways of improving the speed using either pure
standard C or gcc extensions, and compare it to the hand-generated
assembly. I realise that making such a small sample that keeps the
effect is unlikely to be a trivial task. (And if you do this, put it in
a new thread :-) )
On 16/01/2024 15:29, David Brown wrote:
[...]
I have done experiments that comprise under 100 lines of code and that
test a very small number of bytecodes, maybe just one.
But I don't believe you can extract useful conclusions that will scale.
gcc's aggressive optimiser is likely to reduce such a simple program to
nothing, or to infer things that it could not infer when the code is
spread over 20,000 lines across multiple modules, and where the test
bytecode is a runtime input rather than being set up in a data structure.
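One common way to stop gcc from folding a tiny benchmark into a constant is to make the bytecode a runtime input, for example by taking it from `argv`. A minimal sketch, assuming a hypothetical one-character-per-opcode encoding (`+`, `-`, and `d` for "double"); `run_ops` is an illustrative name, not code from the thread:

```c
/* Toy interpreter whose "bytecode" is a string supplied at run time.
   Hypothetical encoding: '+' adds 1, '-' subtracts 1, 'd' doubles,
   anything else is ignored. */
long run_ops(const char *ops, long iterations)
{
    long acc = 0;
    for (long i = 0; i < iterations; i++) {
        for (const char *p = ops; *p != '\0'; p++) {
            switch (*p) {
            case '+': acc += 1; break;
            case '-': acc -= 1; break;
            case 'd': acc *= 2; break;
            default:  break;    /* unknown opcode: skip it */
            }
        }
    }
    return acc;
}
```

A driver would call something like `run_ops(argv[1], strtol(argv[2], NULL, 10))`. This only prevents constant folding; it does not address the other point above, that a 100-line test still lets the optimiser see far more than it could across 20,000 lines in several modules.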
I've worked with four main kinds of bytecode dispatcher:
(1) Using a table of function pointers. The dispatcher is then a simple
3-line loop
(2) Using a large switch statement. Each case can be either inline code
or can call a function. gcc likes this one because it can inline
function calls
(3) Using computed-goto. In C, this would need gcc's extension to use
label pointers
(4) Using a threaded-code dispatcher, making extensive use of inline
assembly, which is an overlay over (1): when a bytecode can't be
fully handled here, it calls one of the handlers from (1).
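Dispatcher (1) can be sketched in C roughly as follows. This is a toy machine with one accumulator and three hypothetical opcodes, not bart's interpreter; the point is that the whole dispatcher is the `while` loop in `run_table`, with each bytecode costing a table lookup and an indirect call.

```c
/* Hypothetical toy VM: one accumulator, three opcodes. */
typedef struct VM {
    const int *pc;      /* points at the next bytecode */
    long acc;           /* accumulator */
    int running;        /* cleared by OP_HALT */
} VM;

typedef void (*Handler)(VM *vm);

enum { OP_HALT, OP_INC, OP_DOUBLE, NUM_OPS };

static void h_halt(VM *vm)   { vm->running = 0; }
static void h_inc(VM *vm)    { vm->acc += 1; }
static void h_double(VM *vm) { vm->acc *= 2; }

/* One handler per opcode, indexed by opcode value. */
static const Handler handlers[NUM_OPS] = { h_halt, h_inc, h_double };

/* Run until OP_HALT. Opcodes are not range-checked in this sketch. */
long run_table(const int *code)
{
    VM vm = { code, 0, 1 };
    while (vm.running)
        handlers[*vm.pc++](&vm);   /* fetch, look up, call */
    return vm.acc;
}
```

Style (2) replaces the table lookup and call with a `switch` on the opcode, which is what lets gcc inline the handler bodies.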
These are progressively faster. I no longer use (2) or (3); there's no
point if I have (4).
The C version used (1), compared with (4) using my compiler and language.
I have in the past compared C versions of (2) and (3) with (4), and (4)
was still faster, although that was some time ago.
When I looked at the CPython sources a decade ago, they used method (3)
when compiled on Linux. On Windows, however, they used method (2), since
the Windows build needs MSVC, which does not have the necessary extension.
(My language has label pointers too, but it also has a built-in feature
for computed goto: you only have to tweak one line to change a regular
switch loop into one that uses computed goto. That is, it uses multiple
loop-back points so that each can have its own branch prediction.
I don't use that in this product, only in a separate project.)
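A minimal sketch of (2) and (3) side by side, assuming GCC or Clang for the labels-as-values extension (`&&label`, `goto *p`) and falling back to a switch loop elsewhere, similar to the Linux/MSVC split described above for CPython. The opcodes and `run_goto` are hypothetical. In the computed-goto branch each handler ends with its own indirect jump, which gives the multiple loop-back points mentioned above.

```c
/* Hypothetical opcodes for a toy accumulator machine. */
enum { OP_HALT, OP_INC, OP_DOUBLE };

/* Run bytecode until OP_HALT and return the accumulator.
   Opcodes are not range-checked in this sketch. */
long run_goto(const int *pc)
{
    long acc = 0;
#if defined(__GNUC__)
    /* GNU labels as values: label addresses indexed by opcode. */
    static void *const labels[] = { &&do_halt, &&do_inc, &&do_double };
    /* Every handler ends with its own copy of this indirect jump,
       so each one gets a separate entry in the branch predictor. */
#define DISPATCH() goto *labels[*pc++]
    DISPATCH();
do_inc:
    acc += 1;
    DISPATCH();
do_double:
    acc *= 2;
    DISPATCH();
do_halt:
    return acc;
#undef DISPATCH
#else
    /* Portable fallback (e.g. MSVC): one switch, one shared loop-back. */
    for (;;) {
        switch (*pc++) {
        case OP_INC:    acc += 1; break;
        case OP_DOUBLE: acc *= 2; break;
        case OP_HALT:   return acc;
        }
    }
#endif
}
```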
-------------
My (4) dispatcher uses threaded-code functions that try to keep execution within a tight, register-based environment:
* Essential globals are kept in registers
* There is no function entry/exit code: each handler jumps directly to
the next, without any loops
* ABI considerations are put aside (e.g. all non-volatile registers are
saved once at the beginning, and restored at the end)
* Most handlers use inline assembly
* When it is necessary to call a normal HLL handler, the environment
must be saved and restored.
This is an example of a very simple handler that uses inline assembly:
threadedproc j_jump*=
assem
mov Dprog,[Dprog+kopnda]
*jumpnext
end
end
And this is one which uses the HLL handler:
threadedproc j_jumpptr*=
saveregs
k_jumpptr()
loadregs
jumpnext
end
saveregs/loadregs are macros. In both cases, 'jumpnext' is this macro
(it needs * to invoke it from assembly):
macro jumpnext = asm jmp [Dprog]
So, the overhead of executing 'goto L' in the interpreted language is
two machine instructions.
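The `j_jump` handler and `jumpnext` macro above can be approximated in GNU C with direct-threaded code: the bytecode is translated once into an array of label addresses, so "jump to the next handler" becomes `goto *pc->handler`. This is a sketch under stated assumptions (GCC/Clang labels as values, a hypothetical three-opcode toy machine, no validation of the input), not a C translation of bart's code.

```c
#include <stdlib.h>

/* Hypothetical toy opcodes; OP_JUMP is followed by a target index. */
enum { OP_HALT, OP_INC, OP_JUMP };

typedef union Cell {
    void *handler;          /* address of the label implementing the op */
    union Cell *target;     /* jump operand, resolved to a cell address */
} Cell;

/* Direct-threaded interpreter. code[] must be non-empty, well formed and
   reach OP_HALT; nothing is checked. Returns the accumulator, or -1 if
   allocation fails. */
long run_threaded(const int *code, size_t n)
{
    static void *const labels[] = { &&l_halt, &&l_inc, &&l_jump };
    Cell *cells = malloc(n * sizeof *cells);
    if (cells == NULL)
        return -1;

    /* Translate opcodes into handler addresses once, before running. */
    for (size_t i = 0; i < n; ) {
        cells[i].handler = labels[code[i]];
        if (code[i] == OP_JUMP) {
            cells[i + 1].target = &cells[code[i + 1]];
            i += 2;
        } else {
            i += 1;
        }
    }

    Cell *pc = cells;
    long acc = 0;
#define JUMPNEXT() goto *pc->handler      /* cf. "jmp [Dprog]" */
    JUMPNEXT();
l_inc:
    acc += 1;
    pc += 1;
    JUMPNEXT();
l_jump:
    pc = pc[1].target;                   /* cf. "mov Dprog,[Dprog+kopnda]" */
    JUMPNEXT();
l_halt:
    free(cells);
    return acc;
#undef JUMPNEXT
}
```

Whether gcc keeps `pc` in a register and emits only a load plus an indirect jump for `l_jump` depends on the compiler and optimisation level; the global-register and no-prologue tricks in the list above have no portable C equivalent.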
On 16/01/2024 16:15, bart wrote:
The need for assembly usually trumps whatever minor optimisation my
compiler might do.
Ah, there we differ. I use tools that do good optimisations. And
perhaps 70% of the use-cases of inline assembly will be lost if the tool can't generate efficient code when there is inline assembly.
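As an illustration of inline assembly written so the compiler can still optimise around it, here is a sketch using GCC extended asm on x86-64. The `"+r"` constraint tells gcc that `x` is read and written in a register of its choosing, so it can allocate registers and schedule surrounding code freely. The name `swap32` is hypothetical; other targets fall back to `__builtin_bswap32` or plain C.

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit value. */
static inline uint32_t swap32(uint32_t x)
{
#if defined(__GNUC__) && defined(__x86_64__)
    /* Extended asm: operand constraint instead of hard-coded registers. */
    __asm__("bswap %0" : "+r"(x));
#elif defined(__GNUC__)
    x = __builtin_bswap32(x);
#else
    x = (x >> 24) | ((x >> 8) & 0x0000FF00u) |
        ((x << 8) & 0x00FF0000u) | (x << 24);
#endif
    return x;
}
```

For this particular operation the builtin is usually the better choice, because gcc understands its semantics and can constant-fold it, whereas the asm statement is opaque to the optimiser.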
[...], because they know that Norwegians are superior to Swedes in every way...
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
On 16.01.2024 14:42, David Brown wrote:
On 16/01/2024 12:54, bart wrote:
Which processor is this for again?
It's for a cpu with a "madd" instruction that implements "x = x + y * z"
in a single instruction - as Kaz pointed out, doing this in inline
assembly would make sense if the cpu had such a dedicated
instruction. [...]
I recall such a feature from an assembler project I did 35+ years ago.
It was on the TI TMS320C25 DSP, and the instruction was called 'MAC'
(Multiply and ACcumulate). Not sure it clarifies anything, but just
as an addendum in case someone is interested in searching for keywords
on such a function.
Pretty much every modern architecture has multiply and accumulate instructions, even the ARM Cortex-M7 cores (MLA instruction).
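For the "x = x + y * z" case, plain C is often enough: compilers can frequently map an accumulate expression onto a multiply-accumulate instruction where the target has one (such as MLA on ARM), with no inline assembly. A minimal sketch with hypothetical function names; for floating point, C99's `fma()` in `<math.h>` is the standard way to request a fused multiply-add.

```c
#include <stddef.h>

/* The "x = x + y * z" pattern written as plain C. */
int madd(int x, int y, int z)
{
    return x + y * z;
}

/* Dot product as a multiply-accumulate loop. The caller must keep the
   result within int range (signed overflow is undefined in C). */
int dot(const int *a, const int *b, size_t n)
{
    int acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}
```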
On 16.01.2024 19:03, David Brown wrote:
[...], because they know that Norwegians are superior to Swedes in every way...
I just wanted to reply on that but then I saw your email address... :-)
On 17/01/2024 07:11, Janis Papanagnou wrote:
On 16.01.2024 19:03, David Brown wrote:
[...], because they know that Norwegians are superior to Swedes in every way...
I just wanted to reply on that but then I saw your email address... :-)
I'm Scottish by origin, but have lived in Norway for about 30 years. So
I am a Norwegian by practice (and citizenship) rather than birth.
(Norway and Sweden view each other as "brother" countries, with very
close ties and cooperation, but also good-natured rivalry and occasional
teasing.)
David Brown <david.brown@hesbynett.no> writes:
[...]
(Norway and Sweden view each other as "brother" countries, with very
close ties and cooperation, but also good-natured rivalry and occasional
teasing.)
And at various times in the past they've been one country (cf. United Kingdoms).
On 17/01/2024 17:33, Scott Lurndal wrote:
[...]
And at various times in the past they've been one country (cf. United Kingdoms).
Norway and Sweden have always been separate countries (since they became
countries). They were in a union, with Norway dominated and ruled by
Sweden, but they were still different countries.
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
On 16.01.2024 19:03, David Brown wrote:
[...], because they know that Norwegians are superior to Swedes in
every way...
I just wanted to reply on that but then I saw your email address... :-)
A Finn smirking from the sidelines.
/* Mail: Mechelininkatu 26 B 27, FI-00100 Helsinki */