• Is This a Dumb Idea? parallelizing byte codes

    From Jon Forrest@21:1/5 to All on Sat Oct 22 11:00:36 2022
    Modern CPUs employ all kinds of clever techniques to improve
    instruction level parallelism (ILP). I was wondering if it
    makes sense to try to employ similar techniques in the
    virtual machines used to execute byte code produced by language
    compilers.

    By that I mean what if virtual machines were to examine byte code
    streams to detect when it would be safe to execute multiple
    byte codes concurrently? Then, based on its findings, the virtual
    machine would execute as many byte codes concurrently as is safe.

    I have no idea if the overhead of the byte code examination would
    exceed any advantage of the concurrent execution, although it's
    important to point out that this examination would only have to
    be done once, and the results could somehow be stored along with
    the byte code. Of course, if the byte code changes the examination
    would have to be done again.

    I'm also worried that internal virtual machine locking requirements
    might make this idea infeasible. For example, in a virtual machine with
    a global interpreter lock, would it be possible for there to be any
    concurrent execution?

    This idea, if it works, would be a great way to take advantage of
    multiple cores without having to rewrite any user code. The big
    question is whether it would work.

    Comments?
    Jon Forrest

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Alain Ketterlin@21:1/5 to Jon Forrest on Sat Oct 22 23:50:49 2022
    Jon Forrest <nobozo@gmail.com> writes:

    Modern CPUs employ all kinds of clever techniques to improve
    instruction level parallelism (ILP). I was wondering if it
    makes sense to try to employ similar techniques in the
    virtual machines used to execute byte code produced by language
    compilers.

    I think it does not, mostly because such fine grain parallelism cannot
    be implemented efficiently enough. Even waking up some sort of worker
    threads, along with the necessary synchronization, would probably cost
    more than the gain achieved by executing a handful of byte-codes in
    parallel.

    I think there is no way to efficiently implement fine-grain, ILP-like
    parallelism in software (except for vectorization, but that's a
    completely different topic).

    By that I mean what if virtual machines were to examine byte code
    streams to detect when it would be safe to execute multiple
    byte codes concurrently? Then, based on its findings, the virtual
    machine would execute as many byte codes concurrently as is safe.

    This implies that some static analysis can be performed on the
    byte-codes. It may be possible in some cases (the JVM comes to mind),
    and nearly impossible in others (essentially dynamic languages, of which
    Python is the epitome).

    I have no idea if the overhead of the byte code examination would
    exceed any advantage of the concurrent execution, although it's
    important to point out that this examination would only have to
    be done once, and the results could somehow be stored along with
    the byte code. Of course, if the byte code changes the examination
    would have to be done again.

    You're right about static, "compile-time" analysis, whenever possible.
    Dynamic analysis of byte-code streams, plus run-time fine-grain parallelization, is probably a lost battle in terms of efficiency.

    I'm also worried that internal virtual machine locking requirements
    might make this idea infeasible. For example, in a virtual machine with
    a global interpreter lock, would it be possible for there to be any concurrent execution?

    That's only part of the problem. Note also that not all virtual machines
    have a "global interpreter lock".

    This idea, if it works, would be a great way to take advantage of
    multiple cores without having to rewrite any user code. The big
    question is whether it would work.

    Why not just let super-scalar processors take care of that? Modern
    processors can handle tens of instructions at a time, do all the dirty
    work (handling dependencies, essentially), and they'll also do all kind
    of crazy stuff you probably won't even think of implementing in software
    (like branch prediction, data prefetches, and more). You'll probably get
    much more gain by working on making the virtual machine loop efficient
    enough to leverage the power of the hardware.

    I've heard/read several times that byte-code micro-optimizations are not
    worth the trouble. Here is a paper from 2015 on a related subject
    ("Branch prediction and the performance of interpreters -- Don't trust folklore"):

    https://ieeexplore.ieee.org/document/7054191

    (you may find the corresponding research report if you can't access the
    full text from that site). It shows how far processors have gone in what
    was once left to the program designer.

    -- Alain.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Hans-Peter Diettrich@21:1/5 to Jon Forrest on Sun Oct 23 02:21:53 2022
    On 10/22/22 8:00 PM, Jon Forrest wrote:
    Modern CPUs employ all kinds of clever techniques to improve
    instruction level parallelism (ILP). I was wondering if it
    makes sense to try to employ similar techniques in the
    virtual machines used to execute byte code produced by language
    compilers.


    By that I mean what if virtual machines were to examine byte code
    streams to detect when it would be safe to execute multiple
    byte codes concurrently?

    I came across instruction reordering when writing decompilers for
    various machines. The solution was easy to find, but not so easy to
    implement: dependency analysis between successive instructions. On
    which register contents or state flags does each instruction depend,
    and which was the most recent instruction that changed these items?

    Less important for decompilation are dependencies on memory contents. I
    found it sufficient and safe to execute all memory writing instructions preceding a memory read, so that memory writes and their addresses did
    not require special handling.

    If the same analysis is possible with virtual machine code then it will
    be possible to emulate multiple instructions in parallel.
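    As a sketch of that kind of dependency bookkeeping (mine, not DoDi's;
    the instruction layout and the hypothetical register VM are invented),
    one can record which registers each byte code reads and writes and mark
    adjacent instructions as independent when there is no hazard between
    them:

    /* Sketch of pairwise dependency analysis for a made-up register VM.
       Two adjacent instructions may be emulated concurrently if neither
       reads or writes a register that the other writes (no RAW, WAR or
       WAW hazard).  Memory is handled conservatively, as in the post:
       any instruction that touches memory serializes. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint8_t opcode;
        uint8_t dst;            /* register written */
        uint8_t src1, src2;     /* registers read */
        bool    touches_memory;
    } VMInsn;

    bool independent(const VMInsn *a, const VMInsn *b)
    {
        if (a->touches_memory || b->touches_memory)
            return false;                   /* conservative about memory */
        if (b->src1 == a->dst || b->src2 == a->dst)
            return false;                   /* read-after-write  */
        if (a->src1 == b->dst || a->src2 == b->dst)
            return false;                   /* write-after-read  */
        if (a->dst == b->dst)
            return false;                   /* write-after-write */
        return true;
    }

    /* mark[i] set means instructions i and i+1 could run in parallel. */
    void mark_parallel_pairs(const VMInsn *code, int n, bool *mark)
    {
        for (int i = 0; i + 1 < n; i++)
            mark[i] = independent(&code[i], &code[i + 1]);
    }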


    I have no idea if the overhead of the byte code examination would
    exceed any advantage of the concurrent execution, although it's
    important to point out that this examination would only have to
    be done once, and the results could somehow be stored along with
    the byte code. Of course, if the byte code changes the examination
    would have to be done again.

    The next step would be compilation of the virtual machine code into
    physical machine code. This too can usually be done once, and the
    compiled version can then be much faster than the interpreted code.

    But here a word of warning, also from my decompiler research:

    When Microsoft announced a fantastic speedup of compiled vs. interpreted
    Visual Basic, the promised 7x speed gain soon shrank to a more realistic
    "up to 30%", and even then they did not succeed in compiling to really
    equivalent code. AFAIR the biggest problem was the Variant type, which
    left little room for optimization. The presented example code was a
    real hall of shame, where desperate coders tried whatever tricks they
    could to evaluate the simplest boolean expressions, which were
    randomized by the infamous compiler :-(

    To be honest: later VB versions came with a much better compiler,
    but even then the speed gain was, hmmm, noticeable. A look at the
    .NET-compiled VB code revealed how much housekeeping was required for
    line numbers alone.

    This idea, if it works, would be a great way to take advantage of
    multiple cores without having to rewrite any user code. The big
    question is whether it would work.

    Automated decompilation and re-compilation to native code may be the
    most promising approach for a real speed explosion. Provided there are
    no line numbers, Variants, On Error statements and other BASIC brake
    shoes in the virtual machine. But here again an observation:

    If fully automated decompilation is not possible then don't do it! Else
    you'll spend more time in the manual analysis of the next program
    version than you gain in faster execution.

    Just my $0.02
    DoDi
    [I'm wondering whether coarse-grain analysis might be worth it, e.g.,
    look at two Java functions and see that they share no writable data so
    you can run the entire functions in parallel. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From gah4@21:1/5 to nob...@gmail.com on Sat Oct 22 23:50:51 2022
    On Saturday, October 22, 2022 at 11:51:31 AM UTC-7, nob...@gmail.com wrote:
    Modern CPUs employ all kinds of clever techniques to improve
    instruction level parallelism (ILP). I was wondering if it
    makes sense to try to employ similar techniques in the
    virtual machines used to execute byte code produced by language
    compilers.

    I always find it interesting that we call any instruction set designed
    to be interpreted (by a program written in a high-level language) a
    byte code.

    It seems to me that no such language is more byte-oriented than VAX.
    (Though VAX was designed to be interpreted by microcode, so maybe that
    counts.)

    In any case, one reason VAX didn't stay around as long as its
    designers expected, was the difficulty in parallel execution.

    Instructions have up to 6 operands, each of which has an addressing
    mode byte, possibly an index operation, and then a variable number of
    additional bytes as required by the specified addressing mode. The mode
    specified by each mode byte determines where the next mode byte is.
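    To make that concrete, here is a sketch of the decode loop (with
    invented, simplified mode lengths rather than the real VAX encoding):
    the length of each operand specifier depends on its mode byte, so the
    decoder cannot even locate operand k+1, let alone the next instruction,
    before it has examined operand k:

    /* Sketch of why VAX-style decoding is inherently sequential.  The
       mode numbers and lengths are simplified for illustration. */
    #include <stddef.h>
    #include <stdint.h>

    /* Size in bytes of one operand specifier, given its mode byte. */
    static size_t specifier_length(uint8_t mode_byte)
    {
        switch (mode_byte >> 4) {       /* hypothetical mode field */
        case 0x5: return 1;             /* register: mode byte only     */
        case 0xA: return 2;             /* byte displacement:  +1 byte  */
        case 0xC: return 3;             /* word displacement:  +2 bytes */
        case 0xE: return 5;             /* long displacement:  +4 bytes */
        default:  return 1;
        }
    }

    /* Walking the operand specifiers to find the instruction length is
       an intrinsically serial loop: each step needs the previous one. */
    size_t instruction_length(const uint8_t *p, int operand_count)
    {
        size_t len = 1;                             /* opcode byte */
        for (int i = 0; i < operand_count; i++)
            len += specifier_length(p[len]);
        return len;
    }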

    On the other hand, the instruction bits for RISC-V are arranged in an
    unusual order, such that related bits stay in the same position.

    By that I mean what if virtual machines were to examine byte code
    streams to detect when it would be safe to execute multiple
    byte codes concurrently? Then, based on its findings, the virtual
    machine would execute as many byte codes concurrently as is safe.

    Pipelined hardware can make some of those decisions in parallel.

    But okay, in the case of JIT compilers, if it can figure it out once
    and use it many times, then it might work.

    I have no idea if the overhead of the byte code examination would
    exceed any advantage of the concurrent execution, although it's
    important to point out that this examination would only have to
    be done once, and the results could somehow be stored along with
    the byte code. Of course, if the byte code changes the examination
    would have to be done again.

    Done once and results stored is pretty much how JIT works.

    I'm also worried that internal virtual machine locking requirements
    might make this idea infeasible. For example, in a virtual machine with
    a global interpreter lock, would it be possible for there to be any concurrent execution?

    This idea, if it works, would be a great way to take advantage of
    multiple cores without having to rewrite any user code. The big
    question is whether it would work.

    As someone else mentioned, vectors (and vector processors)
    are the favorite way to do this. JIT should work well for vectors.

    As for fine-grain parallelism, there is OpenMP for explicit
    (requested by the programmer) parallel execution, based
    on threads and thread synchronization.

    I am not sure at all how much time is spent keeping the
    threads doing what they are supposed to do, and not
    doing actual work. There is still Amdahl's law.

    As well as I know, OpenMP works best on vector
    operations, which would, of course, be convenient
    to run on actual vector processors. (Too bad there
    aren't any more running Cray-1 machines.)

    So, what you didn't ask, how about explicit requests
    like OpenMP?
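    For concreteness, a minimal OpenMP example in C (the loop and the data
    are invented; the point is only that the parallelism is requested
    explicitly by the programmer rather than discovered by an interpreter):

    /* Minimal OpenMP sketch: explicit, programmer-requested parallelism
       over independent loop iterations.  Compile with e.g. -fopenmp. */
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N], c[N];

        for (int i = 0; i < N; i++) {       /* set up some data */
            a[i] = i;
            b[i] = 2.0 * i;
        }

        /* The pragma asks the runtime to split the iterations across
           threads; the iterations are independent, so no synchronization
           is needed inside the loop. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[123] = %g\n", c[123]);    /* prints 369 */
        return 0;
    }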

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christopher F Clark@21:1/5 to All on Sun Oct 23 10:17:23 2022
    This is not necessarily a dumb idea. However, the details are important.

    Relevant important factors are the cost of communication and
    synchronization. Fine-grained parallelism often flounders on those
    aspects.

    ILP works in things like modern x86 processors because the x86 is
    effectively a very odd byte code developed under severe constraints.
    The code an out-of-order x86 processor runs doesn't really look like
    the x86; tricks like register renaming and internal micro-ops make that
    possible.

    However, at the more generic byte code level, you have different
    things to worry about (unless you are doing an actual hardware
    design). The impact of the software framework that is used to
    implement the byte code interpreter can take a large toll. That's one
    of the reasons hotspot compilers are effective: they can remove that
    framework, and they can exploit the fine-grained parallelism that mimics
    what goes on behind the scenes in an out-of-order processor.

    We were able to exploit that in Hyperscan (a software regular
    expression accelerator) that Intel acquired, optimized, and then open
    sourced. That was essentially my last project at Intel. We took an
    NFA, turned it into up to 8 parallel DFAs, and then, with the help of
    VTune, wrote a specialized interpreter that interleaved the 8 DFAs.
    By interleaving the code carefully to reduce instruction stalls (which
    were mostly due to memory reads), we managed to run the 8 DFAs at the
    same performance as the original DFA code ran 1, simply by effectively
    filling up "delay slots" that were implicit in the code. Note the NFA
    and DFAs were essentially byte code; it was just the interpreter we
    tweaked.
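    A sketch of the interleaving idea (mine, not the Hyperscan source; the
    number of machines and the table layout are invented): several DFAs are
    stepped over the same input in one loop, so the table load for one
    machine can overlap the stalls of the others:

    /* Sketch of interleaving several DFAs over one input buffer.  While
       the load for one machine's next state is in flight, the loop is
       already issuing the loads for the other machines, filling what
       would otherwise be stall cycles.  NDFA, the transition tables and
       the state width are placeholders. */
    #include <stddef.h>
    #include <stdint.h>

    #define NDFA    4
    #define NSTATES 256

    /* transition[k][state][byte] -> next state of DFA k */
    extern const uint8_t transition[NDFA][NSTATES][256];

    void run_interleaved(const uint8_t *input, size_t len,
                         uint8_t final_state[NDFA])
    {
        uint8_t state[NDFA] = { 0 };        /* all machines start in state 0 */

        for (size_t i = 0; i < len; i++) {
            uint8_t byte = input[i];
            /* One step of every machine per input byte; the loads are
               independent of each other, so they can overlap. */
            for (int k = 0; k < NDFA; k++)
                state[k] = transition[k][state[k]][byte];
        }
        for (int k = 0; k < NDFA; k++)
            final_state[k] = state[k];
    }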

    A different example was exploited at NuoDB, where we were optimizing
    SQL queries. We did this by noting that most queries deal with
    multiple rows, and changed the emitted byte code to pass batches of
    rows through. The interpreter would then process the rows as batches,
    meaning that 1 byte code operation could operate on 1000 rows
    concurrently, reducing the overhead by the batch size. This was a
    speedup even without using the vector instructions on the machine.
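    Again as a sketch rather than NuoDB's actual code (opcodes and column
    layout invented): one byte code operation is applied to a whole batch
    of rows before the next dispatch, so the dispatch cost is paid once per
    batch instead of once per row:

    /* Sketch of batch-oriented byte-code execution: each opcode processes
       a whole vector of rows, so dispatch cost is amortized over the
       batch.  The two opcodes and the column layout are invented. */
    #include <stddef.h>

    enum { OP_ADD_CONST, OP_MUL_COLS, OP_END };

    typedef struct {
        int op;
        int dst, src1, src2;    /* column indices */
        double konst;
    } BatchOp;

    void run_batch(const BatchOp *prog, double *cols[], size_t nrows)
    {
        for (const BatchOp *p = prog; p->op != OP_END; p++) {
            switch (p->op) {                    /* one dispatch per batch... */
            case OP_ADD_CONST:
                for (size_t r = 0; r < nrows; r++)      /* ...not per row */
                    cols[p->dst][r] = cols[p->src1][r] + p->konst;
                break;
            case OP_MUL_COLS:
                for (size_t r = 0; r < nrows; r++)
                    cols[p->dst][r] = cols[p->src1][r] * cols[p->src2][r];
                break;
            }
        }
    }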

    But notice in both cases, we did analysis to figure out what the
    relevant parallelism was. It was not ad hoc. And, it was cases where
    we had large enough and slow enough sections that were key
    bottlenecks. Very rarely can one take serial code and find that
    parallelism without having a model that exposes it.

    --
    ******************************************************************************
    Chris Clark                email: christopher.f.clark@compiler-resources.com
    Compiler Resources, Inc.   Web Site: http://world.std.com/~compres
    23 Bailey Rd               voice: (508) 435-5016
    Berlin, MA 01503 USA       twitter: @intel_chris
    ------------------------------------------------------------------------------

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Jon Forrest on Sun Oct 23 12:33:40 2022
    Jon Forrest <nobozo@gmail.com> writes:
    Modern CPUs employ all kinds of clever techniques to improve
    instruction level parallelism (ILP). I was wondering if it
    makes sense to try to employ similar techniques in the
    virtual machines used to execute byte code produced by language
    compilers.

    I'll first assume that you mean virtual-machine interpreters.

    First of all, if you use a modern CPU with OoO execution, it does much
    of that by itself.

    If you use an in-order execution CPU, you can try to software-pipeline
    the execution of the virtual-machine instructions, in particular
    perform virtual-machine instruction fetching early [hoogerbrugge+99].

    It is helpful to design the virtual machine for that: Have a mostly
    fixed encoding, so that, e.g., the dispatch can be prepared in
    advance, rather than, e.g., having to wait until you know about the
    present instruction before you can start the fetch of the next
    instruction.

    By that I mean what if virtual machines were to examine byte code
    streams to detect when it would be safe to execute multiple
    byte codes concurrently? Then, based on its findings, the virtual
    machine would execute as many byte codes concurrently as is safe.

    There has been quite a bit of work on combining sequences of
    virtual-machine instructions into superinstructions
    [hughes82,proebsting95,naessen+01,ertl&gregg03euroforth,eller05], but
    the typical approach was to combine a sequence of dependent
    instructions, which has several advantages:

    * One VM instruction often communicates the data to subsequent ones
    through memory (i.e., D-cache), which often incurs a latency of
    several cycles. The superinstruction can communicate the data
    through a register.

    * If the data is addressed explicitly by the VM instruction (e.g., in
    a register-oriented VM), that requires code for dealing with this
    addressing, which can be eliminated for the data passed implicitly
    in a superinstruction. E.g. "mul r3=r1*r2; add r5=r3+r4" requires
    dealing with six VM "register" accesses, while "madd r5=r1*r2+r4"
    requires only four.
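    A sketch of what such a superinstruction can look like in a register-VM
    interpreter (the encoding and opcode names are invented):

    /* Sketch of a superinstruction for a made-up register VM: the fused
       OP_MADD handler passes the product in a C variable (i.e. a machine
       register) instead of going through the VM register file and a
       second dispatch. */
    #include <stdint.h>

    typedef struct {
        uint8_t op, a, b, c, d;     /* invented fixed-width encoding */
    } Insn;

    enum { OP_MUL, OP_ADD, OP_MADD };

    void step(long reg[], const Insn *i)
    {
        switch (i->op) {
        case OP_MUL:                /* reg[a] = reg[b] * reg[c] */
            reg[i->a] = reg[i->b] * reg[i->c];
            break;
        case OP_ADD:                /* reg[a] = reg[b] + reg[c] */
            reg[i->a] = reg[i->b] + reg[i->c];
            break;
        case OP_MADD:               /* reg[a] = reg[b] * reg[c] + reg[d]:
                                       one dispatch, four VM register
                                       accesses instead of six, and the
                                       product stays in a machine register */
            reg[i->a] = reg[i->b] * reg[i->c] + reg[i->d];
            break;
        }
    }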

    If you are actually thinking about JIT compilation of the virtual
    machine, you can use the usual techniques for extracting
    instruction-level parallelism: instruction scheduling (within basic
    blocks, or within traces or superblocks), and software pipelining.
    Some of these techniques incur a significant overhead, however, so you
    may not want to employ them in a JIT compiler.

    I'm also worried that internal virtual machine locking requirements
    might make this idea infeasible. For example, in a virtual machine with
    a global interpreter lock, would it be possible for there to be any
    concurrent execution?

    The memory-ordering requirements may restrict the reorderings of
    memory accesses, but otherwise I don't see a problem.

    This idea, if it works, would be a great way to take advantage of
    multiple cores without having to rewrite any user code. The big
    question is whether it would work.

    The stuff I have outlined does not introduce additional execution
    threads, and I don't think you can find thread-level parallelism at
    the virtual-machine level unless you have your virtual machine (and
    the programming language that it models) designed for exactly that.

    - anton


    @string{spe="Software---Practice and Experience"}
    @Article{hoogerbrugge+99,
    author = "Jan Hoogerbrugge and Lex Augusteijn and Jeroen Trum
    and Rik van de Wiel",
    title = "A code compression system based on pipelined
    interpreters",
    journal = spe,
    volume = "29",
    number = "11",
    pages = "1005--1023",
    month = sep,
    year = "1999",
    OPTannote= ""
    }

    @InProceedings{hughes82,
    author = "R. J. M. Hughes",
    title = "Super-Combinators",
    booktitle = "Conference Record of the 1980 LISP Conference,
    Stanford, CA",
    pages = "1--11",
    publisher = "ACM",
    address = "New York",
    year = "1982",
    OPTannote = {}
    }

    @InProceedings{proebsting95,
    author = "Todd A. Proebsting",
    title = "Optimizing an {ANSI~C} Interpreter with Superoperators",
    crossref = "popl95",
    pages = "322--332",
    annote = "Interpreter performance is optimized by combining
    operators during code generation, when they are
    still organized as trees. So a different, optimized
    interpreter
    is used for each program. Speedups of 1.8--3.1 are
    achieved, but this is probably strongly dependent on
    the original set of operators. The paper uses lcc's
    intermediate code operators \cite{fraser&hanson91a}."
    }

    @Proceedings{popl95,
    booktitle = "Principles of Programming Languages (POPL '95)",
    title = "Principles of Programming Languages (POPL '95)",
    year = "1995",
    key = "POPL '95"
    }

    @InProceedings{naessen+01,
    author = {Henrik N\"ass\'en and Mats Carlsson and Konstantinos
    Sagonas},
    title = {Instruction Merging and Specialization in the
    {SICStus Prolog} Virtual Machine},
    booktitle = {Principles and Practice of Declarative Programming
    (PPDP01)},
    OPTpages = {},
    year = {2001},
    url = {http://www.csd.uu.se/%7Ekostis/Papers/sicstus.ps.gz},
    annote = {Gives an overview of various WAM optimization
    techniques and then evaluates combining (merging)
    pairs of instructions into (about 60)
    superinstructions, specializing WAM instructions for
    specific immediate arguments (in particular,
    specific registers, for about 200 new instructions),
    and a combination of both (for about 100 new
    instructions). Instruction merging produces small
    speedups (about 8\% on average), specialization
    produces a small slowdown on average, and both
    combined are about as fast as instruction merging
    alone. VM code size is reduced by around 10\% with
    these techniques, and the VM emulator size grows by
    up to 15KB.}
    }

    @InProceedings{ertl&gregg03euroforth,
    author = {M. Anton Ertl and David Gregg},
    title = {Implementation Issues for Superinstructions in
    {Gforth}},
    booktitle = {EuroForth 2003 Conference Proceedings},
    OPTpages = {},
    year = {2003},
    URL = {http://www.complang.tuwien.ac.at/papers/ertl%26gregg03euroforth.ps.gz},
    abstract = {Combining Forth primitives into superinstructions
    provides nice speedups. Several approaches to
    superinstructions were explored in the Gforth
    project. This paper discusses the effects of these
    approaches on performance, compilation time,
    implementation effort, and on programming tools such
    as the decompiler and backtracing.}
    }

    @MastersThesis{eller05,
    author = {Helmut Eller},
    title = {Optimizing Interpreters with Superinstructions},
    school = {TU Wien},
    year = {2005},
    type = {Diplomarbeit},
    url = {http://www.complang.tuwien.ac.at/Diplomarbeiten/eller05.ps.gz},
    abstract = {Superinstructions can be used to make virtual
    machine (VM) interpreters faster. A superinstruction
    is a combination of simpler VM instructions which
    can be executed faster than the corresponding
    sequence of simpler VM instructions, because the
    interpretative overhead, like instruction dispatch
    and argument fetching, is reduced. This work
    discusses the following three topics related to
    superinstructions. First, I present some heuristics
    to choose superinstructions. I evaluated the
    heuristics for Forth and Java programs. If the
    number of allowed superinstructions was very large,
    $> 1000$, then the heuristic which chooses all
    possible subsequences up to length 4 achieved the
    best results. If the number of allowed
    superinstructions was more limited, then a heuristic
    which favors short sequences and sequences which
    occur in many different programs and many different
    basic blocks performed better than the
    others. Second, I compare a simple greedy algorithm
    and an optimal algorithm to cover a program with
    superinstructions. I found that the greedy algorithm
    achieves almost optimal results. Finally, I compare
    superinstructions with non-sequential patterns. In
    my experiments, superinstructions performed slightly
    better than non-sequential patterns.}
    }


    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Alain Ketterlin on Sun Oct 23 13:16:54 2022
    Alain Ketterlin <alain@universite-de-strasbourg.fr> writes:
    I've heard/read several times that byte-code micro-optimizations are not
    worth the trouble.

    Apart from the paper below, which is discussed below, what else?

    Here is a paper from 2015 on a related subject
    ("Branch prediction and the performance of interpreters -- Don't trust >folklore"):

    https://ieeexplore.ieee.org/document/7054191

    (you may find the corresponding research report if you can't access the
    full text from that site). It shows how far processors have gone in what
    was once left to the program designer.

    On that I can only say: Not all research papers are trustworthy.
    Catchy titles may be a warning signal.

    I did my own measurements on a Haswell (the same CPU they used in the
    paper) and published them in <2015Sep7.142507@mips.complang.tuwien.ac.at>
    (<http://al.howardknight.net/?ID=158702747000> for those of you who
    don't know what to do with Message-IDs).

    If you don't want to read that posting, the executive summary is that
    the sentence in the abstract "we show that the accuracy of indirect
    branch prediction is no longer critical for interpreters." is wrong.

    Looking at the run-time in seconds for the large benchmarks:

    |                shared        non-shared
    |              --no-dynamic   --no-dynamic   --no-super   default
    |   benchgc        3.332          2.440         2.276      1.468
    |   brainless      1.652          1.524         1.064      0.544
    |   brew           4.016          3.416         2.288      1.520
    |   cd16sim        3.420          3.236         2.232      1.232
    |   fcp            2.956          2.684         1.484      0.864
    |   lexex         13.128          9.848         9.280      7.840

    We see a speedup factor of 1.08-1.37 (but, measuring mispredictions,
    no consistent reduction in mispredictions) from (non-shared)
    dispatching the code for the next VM instruction at the end of the
    code of every VM instruction, rather than jumping to a shared piece of
    dispatch code (from "shared --no-dynamic" to "non-shared
    --no-dynamic").

    We see a speedup factor of 1.06-1.81 and a reduction in mispredictions
    by a factor of 1.35-8.76 from replicating the code for each occurrence of
    a VM instruction (from "non-shared --no-dynamic" to "--no-super").

    We see a speedup factor of 1.18-1.96 and a reduction in branch
    mispredictions by up to a factor of 3.2 by then eliminating the
    dispatches at the end of non-control-flow VM instructions (from
    "--no-super" to default).

    The overall speedup factor for all these steps is 1.67-3.42.

    The somewhat longer summary from the posting above:

    |Haswell's indirect branch prediction is really much
    |better than before, but for larger programs running on an interpreter
    |like Gforth, replication still provides substantial branch prediction
    |improvements that result in significant speedups. And while there is
    |no longer a branch prediction advantage to keeping separate indirect
    |branches for dispatch, there is still a significant speed advantage;
    |and dynamic superinstructions also provide a good speedup, resulting
    |in overall speedups by factors 1.67-3.42 for the application
    |benchmarks.
    |
    |Why are the results here different from those in the paper?
    |1) Different Interpreter 2) different benchmarks. If you write an
    |interpreter, and look for performance, should you go for interpreter
    |optimizations like threaded-code, replication, and dynamic
    |superinstructions like I suggest, or just use a switch-based
    |interpreter like the paper suggests? Threaded code is a good idea
    |with little cost in any case. If that provides a significant speedup
    |and your VM instruction implementations are short (you run into cache
    |trouble with long VM instructions [vitale&abdelrahman04]), then
    |replication with superinstructions will probably give a good speedup.

    - anton
    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Alain Ketterlin@21:1/5 to Anton Ertl on Sun Oct 23 21:29:34 2022
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Alain Ketterlin <alain@universite-de-strasbourg.fr> writes:
    I've heard/read several times that byte-code micro-optimizations are not
    worth the trouble.

    Apart from the paper below, which is discussed below, what else?

    This is not directly related to the paper I mention later. I was talking
    about optimizing bytecode vs. compiler optimizations. I know of no
    interpreter doing elaborate static byte-code optimization.

    https://ieeexplore.ieee.org/document/7054191

    On that I can only say: Not all research papers are trustworthy.
    Catchy titles may be a warning signal.

    I did my own measurements on a Haswell (the same CPU they used in the
    paper) and published them in <2015Sep7.142507@mips.complang.tuwien.ac.at>
    (<http://al.howardknight.net/?ID=158702747000> for those of you who
    don't know what to do with Message-IDs).
    [...]

    |Why are the results here different from those in the paper?
    |1) Different Interpreter 2) different benchmarks.

    I'm glad it works for you. For the record: they consider interpreters
    for Python, Javascript and CLI, on a fairly broad set of benchmarks. And
    they also evaluate (through simulation) branch-predictors that may (or
    may not) be included in more recent architectures.

    -- Alain.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From gah4@21:1/5 to nob...@gmail.com on Wed Oct 26 18:18:33 2022
    On Saturday, October 22, 2022 at 11:51:31 AM UTC-7, nob...@gmail.com wrote:
    Modern CPUs employ all kinds of clever techniques to improve
    instruction level parallelism (ILP). I was wondering if it
    makes sense to try to employ similar techniques in the
    virtual machines used to execute byte code produced by language
    compilers.

    Seems to me it is not parallelizing byte codes that is
    the dumb idea, but byte codes themselves.

    This was known when Alpha replaced VAX. Work on making faster VAX
    systems was stuck with the byte-oriented instruction stream, which was
    impossible to pipeline.

    Not so many years after IBM S/360 became the first "computer
    architecture", DEC wanted VAX to be an architecture with a wide range
    of implementations, one that would last for many years. But VAX didn't
    last so long, while descendants of S/360, now z/Architecture, are still
    around.

    S/360 instructions are in units of 16 bits, always aligned, and you
    can tell from the first byte how long the instruction is. That allows
    the processor to quickly parse instructions and pass them to
    reservation stations (in the case of the 360/91).
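    Concretely (my sketch, and from memory): the two high-order bits of the
    first opcode byte give the S/360 instruction length, so instruction
    boundaries can be found without decoding anything else:

    /* S/360 instruction length from the two high-order opcode bits. */
    #include <stdint.h>

    unsigned s360_insn_length(uint8_t opcode)
    {
        switch (opcode >> 6) {
        case 0:  return 2;      /* RR format, one halfword         */
        case 1:
        case 2:  return 4;      /* RX/RS/SI formats, two halfwords */
        default: return 6;      /* SS format, three halfwords      */
        }
    }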

    But VAX instructions have to be parsed byte by byte, until the
    hardware finds how long the instruction is.

    So it seems that the real answer is to devise a word oriented, or in
    other words RISC, virtual machine. (Actual RISC hardware might not be
    a good choice.)


    I have in the past, maybe here, wondered about the ability to improve
    a RISC processor. As one example, many include a branch delay slot,
    which depends somewhat on the processor pipeline.

    It seems that one fix would be for compilers to write out not the
    actual instructions, but a form from slightly earlier in the code
    generator.

    Then a processor specific program would, fairly quickly, convert that
    to actual instructions. That could be done at program installation
    time, or at program load time, as needed.

    If a new processor came along that, for example, needed two branch
    delay slots, it would generate them.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to gah4@u.washington.edu on Thu Oct 27 14:51:04 2022
    On 2022-10-27, gah4 <gah4@u.washington.edu> wrote:
    On Saturday, October 22, 2022 at 11:51:31 AM UTC-7, nob...@gmail.com wrote:
    Modern CPUs employ all kinds of clever techniques to improve
    instruction level parallelism (ILP). I was wondering if it
    makes sense to try to employ similar techniques in the
    virtual machines used to execute byte code produced by language
    compilers.

    Seems to me it is not parallelizing byte codes that is
    the dumb idea, but byte codes themselves.

    I think you're taking "byte code" too literally, to refer to
    a virtual machine where the instructions are byte-wide
    opcodes that implicitly refer to operands on a stack.

    I think that nowadays the term refers to any software-based
    synthetic instruction set oriented toward supporting a higher
    level language.

    This was known when Alpha replaced VAX. Work on making faster VAX
    systems was stuck with the byte oriented instruction stream which was impossible to pipeline.

    Not impossible for Intel and AMD, obviously.

    Variable-length instruction encodings do not inherently hamper
    pipelining.

    What might hamper pipelining would be variable-length instruction
    encodings /where the length is not known until the instruction is
    executed, due to depending on its output somehow/.

    If you can decode an instruction and then immediately know where
    the next one starts, you can pipeline.

    The internal representation in a pipelined implementation doesn't use
    the original variable-length encoding any more; the instruction bytes
    do not literally move through the pipeline.

    So it seems that the real answer is to devise a word oriented, or in
    other words RISC, virtual machine. (Actual RISC hardware might not be
    a good choice.)

    I designed one in TXR Lisp; but the "byte code" terminology appears
    numerous times in the source code, and leaks into the name of one
    API function called vm-desc-bytecode, which accesses the code vector
    of a virtual machine description.

    The opcodes are actually four byte words, stored in the local endian.
    (When a compiled file is loaded that was compiled on a different
    endian system, the load function will swap the byte order on all
    the four byte words in the "bytecode" vector).

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    [The VAX had auto-increment indexed address modes like

    addl3 (ra)+[rb],(rc)+[rd],(re)+[rf]

    which means to take the word that ra+4*rb points to, add 4 to ra, take
    the word that rc+4*rd points to, add 4 to rc, put their sum in the word
    that re+4*rf points to, and add 4 to re. If any of those registers are
    the same register, the fetches and increments have to happen as if it was
    all done sequentially. There were instructions that took six operands.
    While this much address complication was rare, it had to work. -John]
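    A sketch of that as-if-sequential semantics in C (a simplification:
    memory is modelled as an array of longwords and registers hold byte
    addresses):

    /* Sketch of the as-if-sequential semantics of
           addl3 (ra)+[rb],(rc)+[rd],(re)+[rf]
       as described above: each operand is evaluated (and its
       autoincrement applied) in order, so later operands observe the
       side effects of earlier ones even when registers are reused. */
    #include <stdint.h>

    void addl3_autoinc_indexed(uint32_t reg[16], uint32_t mem[] /* longwords */,
                               int ra, int rb, int rc, int rd, int re, int rf)
    {
        /* operand 1: fetch, then bump ra -- visible to later operands */
        uint32_t src1 = mem[(reg[ra] + 4 * reg[rb]) / 4];
        reg[ra] += 4;

        /* operand 2: evaluated after operand 1's side effect */
        uint32_t src2 = mem[(reg[rc] + 4 * reg[rd]) / 4];
        reg[rc] += 4;

        /* operand 3 (destination): address computed last, then written */
        uint32_t *dst = &mem[(reg[re] + 4 * reg[rf]) / 4];
        reg[re] += 4;
        *dst = src1 + src2;
    }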

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to gah4@u.washington.edu on Sat Oct 29 09:06:38 2022
    gah4 <gah4@u.washington.edu> writes:
    Seems to me it is not parallelizing byte codes that is
    the dumb idea, but byte codes themselves.

    This was known when Alpha replaced VAX. Work on making faster VAX
    systems was stuck with the byte-oriented instruction stream, which was
    impossible to pipeline.

    Pipelined VAX implementations exist, e.g. the VAX 8800 and NVAX. AMD64
    is also a byte-granularity instruction set, although byte granularity
    makes wide decoders more expensive (by roughly a factor of 4 compared
    to an instruction set with 32-bit granularity for the same number of
    decoded bytes). RISC-V's compressed (C) instructions have 16-bit
    granularity, so at least the RISC-V designers think that the benefits
    of instruction compression outweigh the costs in decoding.

    Anyway, this has little to do with the question of the original
    poster. Virtual machines (often called byte codes, even if they use a
    different instruction granularity) use instruction sets that are quite
    different from that of VAX or AMD64; in particular, they have no
    "instruction formats" with "addressing modes", where every base
    instruction can be combined with a set of addressing modes in an
    orthogonal way. That's because they are not decoded like hardware
    instructions.
    There are, however, costs to byte granularity in VM interpreters:

    * Byte granularity makes it harder to use direct-threaded code and
    optimizations that start with that, in particular dynamic
    superinstructions: You cannot rewrite the VM code into
    direct-threaded code in-place, but have to translate it to another
    place, which also means that you cannot reuse the branch targets
    as-is.

    * Also, if the VM instruction has an inline argument wider than one
    byte, the access to that argument can be significantly more
    expensive on architectures with alignment restrictions (e.g.,
    ironically, SPARC). But alignment restrictions have died out in
    general-purpose computers, and VM interpreters are not that popular
    on embedded systems.
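    As an aside (my sketch, not from the post): the usual portable way to
    read a wider-than-one-byte inline operand from a byte-granularity code
    stream is memcpy, which compilers turn into a plain load on machines
    without alignment restrictions:

    /* Portable access to a 32-bit inline operand at an arbitrary byte
       offset in the VM code stream.  memcpy avoids undefined behaviour on
       alignment-restricted machines; elsewhere it compiles to one load. */
    #include <stdint.h>
    #include <string.h>

    static inline int32_t fetch_i32(const unsigned char *ip)
    {
        int32_t v;
        memcpy(&v, ip, sizeof v);
        return v;
    }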

    - anton
    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Alain Ketterlin on Fri Oct 28 17:06:55 2022
    Alain Ketterlin <alain@universite-de-strasbourg.fr> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Alain Ketterlin <alain@universite-de-strasbourg.fr> writes:
    I've heard/read several times that byte-code micro-optimizations are not
    worth the trouble.
    ...
    This is not directly related to the paper I mention later. I was talking
    about optimizing bytecode vs. compiler optimizations. I know of no
    interpreter doing elaborate static byte-code optimization.

    If I understand you correctly, you mean optimizations that the
    compiler that generates "byte code" performs, e.g., stuff like partial redundancy elimination.

    I expect that these optimizations are as effective for virtual machine
    code as for native (i.e., real-machine) code, but if you want to go to
    these lengths, you use a native-code compiler. And for systems that
    uses a JIT compiler (i.e., a two stage process: source -> VM (aka byte
    code) -> native code), the preferred place for putting these
    optimizations is in the second stage (probably because it enables
    optimization decisions with consideration of the target machine).
    There have been some efforts to have analysis at the source code level
    (or anyway, before JIT compilation), and embed the results as an optional
    component in the .class file to speed up JIT compilation, but has this
    made it into production systems?

    Otherwise: I dimly remember optimizations by Prolog compilers that
    generate WAM (Warren abstract machine) code.

    https://ieeexplore.ieee.org/document/7054191

    https://hal.inria.fr/hal-01100647/document

    I'm glad it works for you.

    What's "it"? Anyway you miss the point: The paper suggests that one
    should just write a switch-based interpreter and that more advanced
    techniques are no longer needed. My results disprove this, on the
    same hardware that they base their claims on. Branch mispredictions
    may play a smaller role now than they used to, but apparently there
    are other reasons that make the more advanced techniques still very
    profitable.

    This was somewhat surprising for me, too. We also did some work with
    simulations of more advanced branch predictors in this context
    [ertl&gregg03jilp], so I expected the performance benefits of our
    advanced techniques to diminish significantly when the hardware
    acquires such techniques, but I never really saw that happen. And
    that's even on hardware that has very good indirect branch prediction
    (as Rohou et al. showed).

    @Article{ertl&gregg03jilp,
    author = {M. Anton Ertl and David Gregg},
    title = {The Structure and Performance of \emph{Efficient}
    Interpreters},
    journal = {The Journal of Instruction-Level Parallelism},
    year = {2003},
    volume = {5},
    month = nov,
    url = {http://www.complang.tuwien.ac.at/papers/ertl%26gregg03jilp.ps.gz},
    url2 = {http://www.jilp.org/vol5/v5paper12.pdf},
    note = {http://www.jilp.org/vol5/},
    abstract = {Interpreters designed for high general-purpose
    performance typically perform a large number of
    indirect branches (3.2\%--13\% of all executed
    instructions in our benchmarks). These branches
    consume more than half of the run-time in a number
    of configurations we simulated. We evaluate how
    accurate various existing and proposed branch
    prediction schemes are on a number of interpreters,
    how the mispredictions affect the performance of the
    interpreters and how two different interpreter
    implementation techniques perform with various
    branch predictors. We also suggest various ways in
    which hardware designers, C compiler writers, and
    interpreter writers can improve the performance of
    interpreters.}
    }

    - anton
    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)