• Confused about avx and sse2 performance

    From Branimir Maksimovic@21:1/5 to All on Thu May 27 20:03:12 2021
Can someone explain to me why this is faster: https://github.com/bmaxa/shootout/blob/main/nbody/nbodysse2.asm
than this:
https://github.com/bmaxa/shootout/blob/main/nbody/nbody2.asm

Also, can someone run it to see whether this is a quirk of my CPU (Zen1)?
You need fasm; assemble with fasm and then link with gcc:
    [code]
    ~/shootout/nbody >>> fasm nbody2.asm ±[●][main]
    flat assembler version 1.73.27 (16384 kilobytes memory)
    4 passes, 10432 bytes.
    ~/shootout/nbody >>> gcc nbody2.o -o nbody2 -no-pie ±[●][main]
    ~/shootout/nbody >>>
    [/code]

    Thanks!
    --
    current job title: senior software engineer
skills: x86 assembler,c++,c,rust,go,nim,haskell...

    press any key to continue or any other to quit...

  • From Branimir Maksimovic@21:1/5 to Branimir Maksimovic on Fri May 28 16:43:58 2021
    On 2021-05-27, Branimir Maksimovic <branimir.maksimovic@nospicedham.gmail.com> wrote:
> Can someone explain to me why this is faster: https://github.com/bmaxa/shootout/blob/main/nbody/nbodysse2.asm
> than this:
> https://github.com/bmaxa/shootout/blob/main/nbody/nbody2.asm
>
> Also, can someone run it to see whether this is a quirk of my CPU (Zen1)?
> You need fasm; assemble with fasm and then link with gcc:
> [code]
> ~/shootout/nbody >>> fasm nbody2.asm ±[●][main]
> flat assembler version 1.73.27 (16384 kilobytes memory)
> 4 passes, 10432 bytes.
> ~/shootout/nbody >>> gcc nbody2.o -o nbody2 -no-pie ±[●][main]
> ~/shootout/nbody >>>
> [/code]
>
> Thanks!
Look at this C entry (the fastest), heavily optimized for AVX: https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/nbody-gcc-9.html
    [code]
-0.169075164
    -0.169059907

    Performance counter stats for './fastc 50000000':

    1,988.28 msec task-clock:u # 0.999 CPUs utilized
    0 context-switches:u # 0.000 K/sec
    0 cpu-migrations:u # 0.000 K/sec
    67 page-faults:u # 0.034 K/sec
    8,059,303,822 cycles:u # 4.053 GHz (62.46%)
    200,129 stalled-cycles-frontend:u # 0.00% frontend cycles idle (62.46%)
    7,382,686,755 stalled-cycles-backend:u # 91.60% backend cycles idle (62.46%)
    6,589,592,669 instructions:u # 0.82 insn per cycle
    # 1.12 stalled cycles per insn (62.46%)
    49,984,983 branches:u # 25.140 M/sec (62.47%)
    1,753 branch-misses:u # 0.00% of all branches (62.57%)
    4,498,871,311 L1-dcache-loads:u # 2262.693 M/sec (62.57%)
    8,279 L1-dcache-load-misses:u # 0.00% of all L1-dcache accesses (62.55%)
    <not supported> LLC-loads:u
    <not supported> LLC-load-misses:u

    1.990374103 seconds time elapsed

    1.982009000 seconds user
    0.000000000 seconds sys
    [/code]
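For anyone who wants to reproduce these numbers: the counters above look like perf's detailed stat set, so something like the following should give comparable output (the exact event list may vary with kernel version):
[code]
perf stat -d ./fastc 50000000
[/code]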
My SSE2 version:
    [code]
    -0.169075164
    -0.169059907

    Performance counter stats for './nbodysse2 50000000':

    2,603.44 msec task-clock:u # 0.999 CPUs utilized
    0 context-switches:u # 0.000 K/sec
    0 cpu-migrations:u # 0.000 K/sec
    53 page-faults:u # 0.020 K/sec
    10,290,993,689 cycles:u # 3.953 GHz (62.35%)
    545,799 stalled-cycles-frontend:u # 0.01% frontend cycles idle (62.37%)
    6,563,060,268 stalled-cycles-backend:u # 63.77% backend cycles idle (62.49%)
    32,393,147,312 instructions:u # 3.15 insn per cycle
    # 0.20 stalled cycles per insn (62.61%)
    1,943,915,974 branches:u # 746.673 M/sec (62.70%)
    12,391 branch-misses:u # 0.00% of all branches (62.61%)
    13,260,365,081 L1-dcache-loads:u # 5093.407 M/sec (62.50%)
    11,338 L1-dcache-load-misses:u # 0.00% of all L1-dcache accesses (62.38%)
    <not supported> LLC-loads:u
    <not supported> LLC-load-misses:u

    2.606247643 seconds time elapsed

    2.585024000 seconds user
    0.003319000 seconds sys
    [/code]
Then the version I reorganized for SIMD (arrays of elements instead of structures):
    [code]
-0.169075164
    -0.169059907

    Performance counter stats for './nbody2 50000000':

    3,188.49 msec task-clock:u # 0.999 CPUs utilized
    0 context-switches:u # 0.000 K/sec
    0 cpu-migrations:u # 0.000 K/sec
    53 page-faults:u # 0.017 K/sec
    12,961,532,262 cycles:u # 4.065 GHz (62.49%)
    267,650 stalled-cycles-frontend:u # 0.00% frontend cycles idle (62.49%)
    10,087,414,286 stalled-cycles-backend:u # 77.83% backend cycles idle (62.49%)
    22,690,576,224 instructions:u # 1.75 insn per cycle
    # 0.44 stalled cycles per insn (62.49%)
    1,498,517,339 branches:u # 469.976 M/sec (62.49%)
    10,791 branch-misses:u # 0.00% of all branches (62.52%)
    11,845,468,225 L1-dcache-loads:u # 3715.066 M/sec (62.52%)
    10,205 L1-dcache-load-misses:u # 0.00% of all L1-dcache accesses (62.52%)
    <not supported> LLC-loads:u
    <not supported> LLC-load-misses:u

    3.191446269 seconds time elapsed

    3.178639000 seconds user
    0.000000000 seconds sys

    [/code]

As you can see, the C version is almost branchless, and my new AVX version also has fewer branches and fewer instructions than the SSE2 one, yet it is slower than the SSE2 version. The main difference is the memory access pattern: the SSE2 version accesses one element of a structure at a time, while the AVX version accesses memory in a vector manner, because it stores arrays of elements instead of arrays of structures. I credit the difference to that. But why would the array layout be slower?
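To make the layout difference concrete, here is a minimal C sketch of the two arrangements described above; the type and field names are mine, not taken from either .asm file:
[code]
/* Array of structures (AoS): the layout the SSE2 version walks, per the
 * description above. All components of one body sit next to each other. */
struct body {
    double x, y, z;      /* position */
    double vx, vy, vz;   /* velocity */
    double mass;
};
struct body bodies_aos[5];

/* Structure of arrays (SoA): the layout of the new nbody2 version, meant to
 * be SIMD-friendly. Each component is contiguous across all bodies. */
struct bodies_soa {
    double x[5], y[5], z[5];
    double vx[5], vy[5], vz[5];
    double mass[5];
};
struct bodies_soa bodies_soa;

/* The AoS loop reads bodies_aos[i].x, bodies_aos[i].y, ... one pair of
 * bodies at a time; the SoA loop loads e.g. bodies_soa.x[i..i+3] as one
 * vector register. */
[/code]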
To back that up, I also have an AVX version that uses the same memory access pattern as the SSE2 version, and it too is faster than the new version that I thought was optimized for SIMD:
    [code]
    -0.169075164
    -0.169059907

    Performance counter stats for './nbodyavx 50000000':

    2,784.27 msec task-clock:u # 0.999 CPUs utilized
    0 context-switches:u # 0.000 K/sec
    0 cpu-migrations:u # 0.000 K/sec
    54 page-faults:u # 0.019 K/sec
    10,806,896,379 cycles:u # 3.881 GHz (62.51%)
    276,219 stalled-cycles-frontend:u # 0.00% frontend cycles idle (62.53%)
    8,428,091,473 stalled-cycles-backend:u # 77.99% backend cycles idle (62.53%)
    21,319,940,647 instructions:u # 1.97 insn per cycle
    # 0.40 stalled cycles per insn (62.53%)
    1,848,343,065 branches:u # 663.853 M/sec (62.53%)
    7,706 branch-misses:u # 0.00% of all branches (62.47%)
    12,510,801,413 L1-dcache-loads:u # 4493.394 M/sec (62.45%)
    14,623 L1-dcache-load-misses:u # 0.00% of all L1-dcache accesses (62.45%)
    <not supported> LLC-loads:u
    <not supported> LLC-load-misses:u

    2.786929447 seconds time elapsed

    2.773920000 seconds user
    0.003332000 seconds sys

    [/code]
So you see, the AVX versions execute fewer instructions but also achieve fewer instructions per cycle. Zen1 can execute up to four SSE instructions per cycle, but only two AVX instructions, the same as Intel: both end up with two 256-bit pipes, Intel natively, while Zen1 pairs its four 128-bit units to execute 256-bit AVX operations. That is why Zen1 does somewhat better than Intel with SSE2 code.
Also, I used plain old sqrtpd instead of the reciprocal-square-root approximation, since on AMD (Zen1) the approximation is unnecessary, but the approximation path is still there (commented out) for Intel users:
[code]
; approximation path (commented out): single-precision rsqrtps estimate of
; 1/sqrt(dsq), refined below with two Newton-Raphson steps; meant for Intel
; cvtpd2ps xmm4,xmm3
; rsqrtps xmm4,xmm4
; full-precision path (active); xmm3 holds the squared distance (dsq)
sqrtpd xmm7,xmm3 ; xmm7 = sqrt(dsq)
mulpd xmm3,xmm7 ; xmm3 = dsq*sqrt(dsq)
divpd xmm6,xmm3 ; xmm6 /= dsq*sqrt(dsq) -> mag
; remainder of the approximation path (assuming [L2] = 0.5 and [L1] = 1.5):
; mulpd xmm3,dqword[L2] ; xmm3 = 0.5*dsq
; cvtps2pd xmm4,xmm4 ; estimate back to double precision
;--------------------

; movapd xmm7, xmm4 ; y = initial rsqrt estimate

; first Newton-Raphson step: y = 1.5*y - 0.5*dsq*y^3
; movapd xmm8,xmm3
; mulpd xmm8, xmm7
; mulpd xmm8, xmm7
; mulpd xmm8, xmm7

; mulpd xmm7,dqword[L1]

; subpd xmm7,xmm8

;------------------------

; second Newton-Raphson step
; movapd xmm8,xmm3
; mulpd xmm8, xmm7
; mulpd xmm8, xmm7
; mulpd xmm8, xmm7

; mulpd xmm7,dqword[L1]
; subpd xmm7,xmm8 ; distance -> xmm7

;--------------------------

; mulpd xmm6,xmm7 ; mag -> xmm6
[/code]
Of course, I didn't bother to port this Intel optimization to the new AVX version, so if you want you can add it.
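In case the commented-out path above is hard to read in isolation, here is a rough C sketch of the idea; this is my own illustration, not a line-for-line transcription of the register flow, and it assumes [L1] and [L2] hold 1.5 and 0.5:
[code]
/* rsqrtps estimate of 1/sqrt(dsq), refined with two Newton-Raphson steps,
 * then used to build mag = dt/(dsq*sqrt(dsq)) = dt * (1/sqrt(dsq))^3. */
#include <emmintrin.h>   /* SSE2 intrinsics (pulls in SSE as well) */

static inline __m128d approx_mag(__m128d dsq, __m128d dt)
{
    /* cvtpd2ps + rsqrtps + cvtps2pd: cheap single-precision estimate */
    __m128d y = _mm_cvtps_pd(_mm_rsqrt_ps(_mm_cvtpd_ps(dsq)));

    const __m128d half = _mm_set1_pd(0.5);  /* assumed value of [L2] */
    const __m128d c15  = _mm_set1_pd(1.5);  /* assumed value of [L1] */
    __m128d half_dsq = _mm_mul_pd(dsq, half);

    /* two refinement steps: y = 1.5*y - 0.5*dsq*y^3 */
    for (int i = 0; i < 2; i++) {
        __m128d t = _mm_mul_pd(_mm_mul_pd(half_dsq, y), _mm_mul_pd(y, y));
        y = _mm_sub_pd(_mm_mul_pd(c15, y), t);
    }

    /* mag = dt * (1/sqrt(dsq))^3 */
    return _mm_mul_pd(dt, _mm_mul_pd(_mm_mul_pd(y, y), y));
}
[/code]
Whether this approximation actually beats sqrtpd plus divpd depends on the microarchitecture, which is exactly the Zen1 point above.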


    --
    current job title: senior software engineer
skills: x86 assembler,c++,c,rust,go,nim,haskell...

    press any key to continue or any other to quit...
