• [debian-hppa] poor performance of Debian 11 HPPA with qemu-system-hppa

    From Nelson H. F. Beebe@21:1/5 to All on Sat Aug 14 16:40:01 2021
    In a previous message to the debian-hppa list today, I described how I
    finally got a virtual machine successfully created for running Debian
    11 on HPPA (aka PA-RISC).

    On the same host

    Dell Precision 7920 (1 16-core CPU, 32 hyperthreads,
    2200MHz Intel Xeon Platinum 8253,
    384GB DDR-4 RAM);
Ubuntu 20.04.2 LTS (Focal Fossa)

    I have VMs running with QEMU emulation for Alpha, ARM64, M68K, MIPS32,
    MIPS64, RISC-V64, S390x, and SPARC64, and most of them have quite
    reasonable interactive performance, making it possible to use the
    emacs editor in terminal windows and X11 windows without any serious
    response problems.

However, for the new Debian 11 HPPA VM, interactive performance is a
huge issue: shell type-in sometimes gets immediate character echo, but
frequently suffers delays of 10 to 30 seconds per input character.
That makes it extremely hard for a fast typist to enter commands and
text: one is never sure whether input keys have been dropped.

I develop mathematical software, and a large package that I'm writing
for multiple-precision arithmetic provides a testbed for evaluating VM
performance. Most of the QEMU CPU types support multiple processors,
but M68K and SPARC64 sun4u only permit one CPU. For HPPA, I have 4 CPUs
and 3GB DRAM; the latter is a hard limit imposed in the QEMU source code.
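For reference, an invocation along these lines starts such a guest. This is a hypothetical sketch: the disk-image name and console options are placeholders, not the author's actual command line.

```shell
# Hypothetical qemu-system-hppa invocation matching the setup described
# above: 4 virtual CPUs and the 3GB guest-RAM ceiling. The image path
# and the -nographic console choice are placeholders.
qemu-system-hppa \
    -smp 4 \
    -m 3G \
    -drive file=debian-11-hppa.img,format=raw \
    -nographic
```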

    Here is a table of running the equivalent of

    date; make all check ; date

    on these systems, using QEMU-6.0.0, unless noted. Both compilations
    and test programs are run in parallel, by internal "make -j" commands.

    make timing (wall clock)

Debian 11 Alpha         07:43:16 -- 08:23:05    39m 49s
Debian 11 ARM64         07:58:02 -- 08:24:45    26m 43s
Debian 11 M68K          07:43:15 -- 08:30:56    47m 41s
Debian 11 HPPA          13:23:16 -- 21:40:19   497m 03s
Debian 11 HPPA          07:29:18 -- 18:07:19   638m 01s   [qemu-6.1.0-rc3]
NetBSD 9.2 HPPA         11:22:10 -- 01:25:46   843m 36s
Debian 11 MIPS32        09:21:49 -- 10:42:41    80m 52s
Debian 11 SPARC64       14:45:16 -- 06:19:00   933m 44s
Debian 11 SPARC64       17:57:58 -- 04:02:42   603m 44s   [qemu-6.1.0-rc3]
Ubuntu 18.04 S390x      18:34:34 -- 19:04:36    30m 02s
Ubuntu 20.04 S390x      18:34:35 -- 19:16:54    42m 19s
FreeBSD 13 RISC-V64     07:41:14 -- 08:34:00    52m 46s
FreeBSD 14 RISC-V64     08:35:27 -- 09:25:35    50m 08s
Fedora 34 RISC-V64      07:43:17 -- 08:02:55    19m 38s

For comparison, here are results on native hardware with local disk
(not NFS, unless indicated) [clock speed in GHz is abbreviated to G]:

ArchLinux ARM32                       09:57:34 -- 10:07:43   10m 09s
Debian 11 UltraSparc T2               08:30:54 -- 08:41:18   10m 24s
Solaris 10 UltraSparc T2              09:46:31 -- 09:59:32   13m 01s
Ubuntu 20.04 Xeon 8253                09:34:52 -- 09:35:36    0m 44s
CentOS 7.9 Xeon E6-1600v3             09:39:00 -- 09:39:42    0m 42s
CentOS 7.9 Xeon E6-1600v3             10:42:43 -- 10:43:30    0m 47s   [NFS]
CentOS 7.9 EPYC 7502 2.0G 64C/128T    10:02:01 -- 10:02:27    0m 26s
CentOS 7.9 EPYC 7502 2.5G 32C/64T     10:02:00 -- 10:02:25    0m 25s
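To put the two tables on a common scale, the elapsed times can be reduced to seconds and compared directly. The small helper below is illustrative only (it is not part of the author's test suite) and uses figures copied from the tables above.

```shell
# Convert "Xm Ys" elapsed times from the tables above into seconds and
# compute rough slowdown ratios. Illustrative helper, not original code.
to_seconds() {
    # e.g. "497m 03s" -> 29823
    echo "$1" | awk '{ print ($1 + 0) * 60 + ($2 + 0) }'
}

hppa=$(to_seconds "497m 03s")    # Debian 11 HPPA under QEMU
arm64=$(to_seconds "26m 43s")    # Debian 11 ARM64 under QEMU
native=$(to_seconds "0m 44s")    # Ubuntu 20.04 on the Xeon 8253 host

echo "HPPA emulation vs ARM64 emulation: $(( hppa / arm64 ))x slower"
echo "HPPA emulation vs native host:     $(( hppa / native ))x slower"
```

Even against another emulated 64-bit target, the HPPA run is more than an order of magnitude slower.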

    The tests produce about 62,000 total lines of text output, spread over
    about 180 files. They read no input data, and are primarily compute
    bound in loops with integer, not floating-point, arithmetic, using
    32-bit and 64-bit integer types.

    I have generated machine language for representative code from the
    hotspot loop using the -S option of gcc and clang, and found that
    64-bit arithmetic is expanded inline with 32-bit instructions on
    ARM32, HPPA, and M68K, none of which have 64-bit arithmetic
    instructions. The loop instruction counts are comparable across all
    of those systems, typically 10 to 20 instructions, compared to 5 or so
    on those CPUs that have 64-bit arithmetic.

The dramatic slowdowns on HPPA and SPARC64 are a big surprise, but the
HPPA slowdown matches the poor interactive response. The SPARC64 VM
is much more responsive interactively, and it DOES have 64-bit integer
arithmetic.

I have not yet done profiling builds of qemu-system-hppa and
qemu-system-sparc64, but that remains an option for further
investigation to find out what is responsible for the slowness.

    I can also do profiling builds of parts of my test suite to see
    whether there are unexpected hotspots on HPPA and SPARC64 that are
    absent on other CPU types.

    I have physical SPARC64 hardware running Debian 11 and Solaris 10 on
    identical boxes, and have done builds of TeX Live on them with no
    difficulty. However, the slow speed of QEMU HPPA makes it impractical
    to try TeX Live builds for Debian 11 HPPA, which is disappointing.

    Does any list member have any idea of why QEMU emulation of HPPA and
    SPARC64 is so bad? Are there Debian kernel parameters that might be
    tweaked? Have any of you used Debian on QEMU HPPA and seen similar
    slowness compared to other CPU types?

    Notice from my first table above that NetBSD 9.2 on HPPA is also very
    slow, which tends to point the finger at QEMU as the source of the
    dismal performance, rather than the VM guest O/S.

    For the record, here is how QEMU releases downloaded from

    https://www.qemu.org/
    https://download.qemu.org/

are built here, using the most recent QEMU release as the example:

    tar xf $prefix/src/qemu/qemu-6.1.0-rc3.tar.xz
    cd qemu-6.1.0-rc3
    unsetenv CONFIG_SITE
    mkdir build
    cd build
    env CC=cc CFLAGS=-O2 ../configure --prefix=$prefix && make all -j && make check

QEMU builds require prior installation of the ninja-build package
available on major GNU/Linux distributions. On completion, the needed
qemu-system-xxx executables are present in the build subdirectory.

    On Ubuntu 20.04, the QEMU builds are clean, and pass the entire
    validation suite without any failures.

-------------------------------------------------------------------------------
- Nelson H. F. Beebe                    Tel: +1 801 581 5254                  -
- University of Utah                    FAX: +1 801 581 4148                  -
- Department of Mathematics, 110 LCB    Internet e-mail: beebe@math.utah.edu  -
- 155 S 1400 E RM 233                       beebe@acm.org  beebe@computer.org -
- Salt Lake City, UT 84112-0090, USA    URL: http://www.math.utah.edu/~beebe/ -
-------------------------------------------------------------------------------

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John David Anglin@21:1/5 to Nelson H. F. Beebe on Sat Aug 14 19:40:02 2021
    Hi Nelson,

Helge Deller is the expert on this and you likely will have to wait
until he returns from vacation for an answer.  I think the pasta buildd
running hppa emulation is configured for one CPU, although I could be
wrong.  Performance is a little slower than a real 800 MHz PA8800
machine.

    Some profiling likely would be helpful.

    Dave

    On 2021-08-14 10:35 a.m., Nelson H. F. Beebe wrote:
[...]



    --
    John David Anglin dave.anglin@bell.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John David Anglin@21:1/5 to John David Anglin on Sat Aug 14 20:50:01 2021
I was wrong.  Pasta is configured for 8 CPUs.

    Dave

    On 2021-08-14 1:31 p.m., John David Anglin wrote:
    Hi Nelson,

    Helge Deller is the expert on this and you likely will have to wait until he returns from vacation
    for an answer.  I think the pasta buildd running hppa emulation is configured for one cpu although
    I could be wrong.  Performance is a little slower than a real 800 MHz PA8800 machine.

    Some profiling likely would be helpful.

    Dave

    On 2021-08-14 10:35 a.m., Nelson H. F. Beebe wrote:
[...]




    --
    John David Anglin dave.anglin@bell.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John David Anglin@21:1/5 to John David Anglin on Sat Aug 14 23:00:01 2021
    Some hints are here:
    https://parisc.wiki.kernel.org/index.php/Qemu

    Dave

    On 2021-08-14 2:42 p.m., John David Anglin wrote:
    I was wrong.  Pasta is configured for 8 cpus.

    Dave

    On 2021-08-14 1:31 p.m., John David Anglin wrote:
[...]




    --
    John David Anglin dave.anglin@bell.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Helge Deller@21:1/5 to Nelson H. F. Beebe on Sun Sep 5 23:10:01 2021
    Hi Nelson,

    On 8/14/21 4:35 PM, Nelson H. F. Beebe wrote:
[...]

However, for the new Debian 11 HPPA VM, interactive performance is a
huge issue: shell typein sometimes gets immediate character echo, but
frequently gets delays of 10 to 30 seconds for each input character.
That makes it extremely hard for a fast typist to type commands and
text: one is never sure whether input keys have been dropped.

I haven't seen this yet.

I develop mathematical software, and a large package that I'm writing
for multiple precision arithmetic provides a testbed for evaluating VM
performance. Most of the QEMU CPU types support multiple processors,
but M68K and SPARC64 sun4u only permit one CPU. For HPPA, I have 4 CPUs
and 3GB DRAM; the latter is a hard limit imposed in QEMU source code.

Yes, 3GB (actually 3.5GB) is the maximum for 32-bit hppa systems.

If you run with 4 emulated CPUs, make sure to add:
-accel tcg,thread=multi
when starting qemu.
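Applied to the 4-CPU HPPA guest discussed above, that would look something like this; the disk-image path is a placeholder, not an actual command line from the thread.

```shell
# Enable multi-threaded TCG so the four guest CPUs run on separate host
# threads, per the suggestion above. The image name is a placeholder.
qemu-system-hppa \
    -accel tcg,thread=multi \
    -smp 4 \
    -m 3G \
    -drive file=debian-11-hppa.img,format=raw \
    -nographic
```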

    Here is a table of running the equivalent of

    date; make all check ; date

    on these systems, using QEMU-6.0.0, unless noted. Both compilations
    and test programs are run in parallel, by internal "make -j" commands.

    make timing (wall clock)

    Debian 11 Alpha 07:43:16 -- 08:23:05 39m 49s
    Debian 11 ARM64 07:58:02 -- 08:24:45 26m 43s
    Debian 11 M68K 07:43:15 -- 08:30:56 47m 41s
    Debian 11 HPPA 13:23:16 -- 21:40:19 497m 03s
    Debian 11 HPPA 07:29:18 -- 18:07:19 638m 01s [qemu-6.1.0-rc3]
    NetBSD 9.2 HPPA 11:22:10 -- 01:25:46 843m 36s

    It would be interesting to see the performance on hppa on real hardware.
    If needed I can give you access to a physical machine to test. Just let me know.

[...]

I have not yet done profiling builds of qemu-system-hppa and
qemu-system-sparc64, but that remains an option for further
investigation to find out what is responsible for the slowness.

    It would be good if you find some time for further analysis.

[...]

    Does any list member have any idea of why QEMU emulation of HPPA and
    SPARC64 is so bad? Are there Debian kernel parameters that might be
    tweaked? Have any of you used Debian on QEMU HPPA and seen similar
    slowness compared to other CPU types?

Again, I'd like to compare qemu-emulated hppa with physical hppa
performance to rule out any qemu slowness.

[...]

    For the record, here is how QEMU releases downloaded from

    https://www.qemu.org/
    https://download.qemu.org/

    are built here, taking the most recent QEMU release for the sample:

    tar xf $prefix/src/qemu/qemu-6.1.0-rc3.tar.xz
    cd qemu-6.1.0-rc3
    unsetenv CONFIG_SITE
    mkdir build
    cd build
    env CC=cc CFLAGS=-O2 ../configure --prefix=$prefix && make all -j && make check

You should make sure to disable debugging when building.
These are my configure options:
'--target-list=hppa-softmmu' '--enable-numa' '--disable-mpath' '--disable-spice' '--disable-opengl' '--disable-sanitizers' --disable-docs --disable-debug-mutex --disable-debug-tcg --disable-containers --disable-pie --disable-qom-cast-debug
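Gathered into a single command, those options would look roughly like this; the --prefix value is an assumption carried over from the earlier build recipe, not part of Helge's quoted options.

```shell
# Helge's configure options assembled into one invocation, run from the
# build subdirectory as in the recipe above. --prefix is an assumption.
../configure --prefix=$prefix \
    --target-list=hppa-softmmu \
    --enable-numa \
    --disable-mpath --disable-spice --disable-opengl \
    --disable-sanitizers --disable-docs --disable-containers \
    --disable-pie --disable-qom-cast-debug \
    --disable-debug-mutex --disable-debug-tcg
```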



    Helge

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)