• Cortex-M buses

    From antispam@math.uni.wroc.pl@21:1/5 to All on Thu Dec 29 04:32:55 2022
    I want to understand the impact of buses and wait states on Cortex-M
    performance. The documentation about this seems to be
    scattered or unavailable (I would appreciate pointers
    to appropriate documentation). I was unable to find answers
    in the documentation, so I did some testing. Below is part of
    my results. I ran tests on STM32F030, STM32F103, STM32F407,
    STM32F411 and two Chinese clones, namely CKS32F103 and
    Air32F103. Let me mention from the start that the Chinese
    clones show quite different results from STM32F103.

    My first test was a delay loop (I needed it for some other
    tests, but even alone it gives some info). In assembler
    (GNU as) it is:

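    counted_delay:
        sub r0, #1              @ decrement the loop counter (the branch relies on the flags)
        bgt counted_delay       @ repeat while the counter is still > 0
        bx lr                   @ return to the caller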

    I also did four tests that flip bits in GPIO ports. One is:

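    pin_test1:
        ldr r1, [pc, #16]       @ r1 = bit-band alias address (literal below)
        movs r2, #0x1
        str r2, [r1]            @ set the pin
        ldr r1, [pc, #12]       @ reload the same literal
        movs r2, #0x0
        str r2, [r1]            @ clear the pin
        sub r0, #1              @ decrement the loop counter
        bgt pin_test1           @ repeat while the counter is still > 0
        bx lr
        .balign 4
        .long 0x42210198        @ bit-band alias of the output bit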

    This is straightforward code; I expect gcc to generate code like
    this. It uses access via the bit-band region to set a single
    bit in the output register. The second test uses the same code, but
    writes directly to the output register (so it sets all bits
    configured as output).
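
    For reference, the literal 0x42210198 can be decoded with the
    standard Cortex-M3/M4 peripheral bit-band mapping
    (alias = alias_base + byte_offset*32 + bit*4). The sketch below is
    host-side C and not part of the original tests; the helper names
    are made up for illustration. Under that mapping the literal
    corresponds to bit 6 of the word at 0x4001080C, which on STM32F103
    should be the GPIOA output data register.

```c
#include <stdint.h>

/* Cortex-M3/M4 peripheral bit-band region and its alias region. */
#define PERIPH_BB_BASE   0x40000000u
#define PERIPH_BB_ALIAS  0x42000000u

/* Alias word address for bit `bit` (0..7) of the byte at `addr`.
   Hypothetical helper name, not taken from any vendor header. */
uint32_t bitband_alias(uint32_t addr, unsigned bit)
{
    return PERIPH_BB_ALIAS + (addr - PERIPH_BB_BASE) * 32u + bit * 4u;
}

/* Inverse mapping: byte address targeted by an alias word. */
uint32_t bitband_target_addr(uint32_t alias)
{
    return PERIPH_BB_BASE + (alias - PERIPH_BB_ALIAS) / 32u;
}

/* Inverse mapping: bit number (0..7) within that byte. */
unsigned bitband_target_bit(uint32_t alias)
{
    return ((alias - PERIPH_BB_ALIAS) / 4u) % 8u;
}
```

    So bitband_target_addr(0x42210198) gives 0x4001080C and
    bitband_target_bit(0x42210198) gives 6.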

    The third test used an improved loop, to avoid repeatedly loading
    constants into registers:

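    pin_test3:
        ldr r1, [pc, #0x00C]    @ r1 = bit-band alias address (literal below)
        movs r2, #0x1           @ value that sets the pin
        movs r3, #0x0           @ value that clears the pin
    pin_testl3:
        str r2, [r1]            @ set the pin
        sub r0, #1              @ decrement the loop counter
        str r3, [r1]            @ clear the pin (str leaves the flags alone)
        bgt pin_testl3          @ repeat while the counter is still > 0
        bx lr
        .long 0x42210198        @ bit-band alias of the output bit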

    Again, this one used the bit-band region. The fourth test was like
    the third, but did a full write to the output register.

    I ran the output tests only on F103-compatible processors. For
    convenience I ran most tests in RAM. The results are below,
    all times in clocks. Note: I measured time by reading the SysTick
    counter. There is some constant overhead/inaccuracy, but
    it looks like for a given count the time is small_constant + count*coeff,
    where coeff is in the table and count means the repetition count of
    the loop.
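
    The linear model above means coeff can be recovered from two runs
    with different counts, since the constant overhead cancels. A
    minimal host-side sketch in C, with hypothetical function names,
    assuming SysTick is left free-running with the maximum 24-bit
    reload value and wraps at most once during a measurement:

```c
#include <stdint.h>

/* SysTick is a 24-bit down-counter, so the elapsed ticks are
   start - end modulo 2^24. Assumes reload = 0xFFFFFF and at most
   one wrap between the two readings. */
uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    return (start - end) & 0x00FFFFFFu;
}

/* Slope of the line through two (count, clocks) measurements.
   The constant overhead cancels, leaving clocks per iteration.
   Requires count_b != count_a. */
double clocks_per_iteration(uint32_t count_a, uint32_t clocks_a,
                            uint32_t count_b, uint32_t clocks_b)
{
    return ((double)clocks_b - (double)clocks_a) /
           ((double)count_b - (double)count_a);
}
```

    For example, made-up readings of 4012 clocks at count 1000 and
    8012 clocks at count 2000 would give a coefficient of 4.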

                                     delay  pin1  pin2  pin3  pin4
    STM32F103 ram                        4    28    18    23    12
    CKS32F103 ram                        6    29    22    24    14
    Air32F103 ram                        4    22    14    19     8
    STM32F103 flash, 2 wait states       6
    STM32F103 flash, 0 wait states       3
    STM32F030 ram                        4
    STM32F407 ram1                       6
    STM32F407 ram0                       3
    For STM32F407 ram1 means first ram bank at default location,
    ram0 means first ram bank remapped to address 0. On STM32F401
    I got the same results as STM32F407.

    Now, already the delay loop raises some questions: STM claims that
    RAM is zero wait states, but from the timings we see that on
    STM32F103 we effectively get 1 wait state, compared to 0 wait
    state flash. OTOH 2 wait state flash actually causes a loss of
    3 cycles. One guess was that with 2 wait states the delay loop
    may be bandwidth limited: each jump seems to cause two accesses
    to flash due to prefetch and they need 6 clocks. But disabling
    flash prefetch still gives 6 clocks (it changed other timings).
    Also, for CKS32F103 and STM32F407 the penalty compared to the optimal
    case is 3 clocks. STM32F030 is unremarkable here, the time is
    exactly as the ARM docs say.

    Now the buses: the ARM docs say that Cortex-M3/M4 has three buses.
    In the area of STM RAM the core uses the "system bus", which has some
    buffering. When executing from lower addresses (flash or remapped RAM)
    the core uses the "ICode bus" for instruction fetches and the
    "DCode bus" for data accesses. Clearly forcing all accesses onto a
    single bus is suboptimal, but for the delay loop alone it should not
    matter: the delay loop only fetches instructions, all work is done in
    registers. ARM says that the system bus is "buffered" and the
    others unbuffered, but it is rather unclear why/if this should
    impact timings.
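
    The address-to-bus split described above can be written down as a
    small C sketch. It follows the Cortex-M3 memory map as I understand
    it from the ARM documentation; the enum and function names are
    invented here, and the private peripheral bus (where SysTick lives)
    is listed separately from the three main buses.

```c
#include <stdint.h>

enum cm3_bus {
    BUS_CODE,    /* ICode for fetches, DCode for data accesses */
    BUS_SYSTEM,  /* SRAM, peripherals, bit-band aliases, external */
    BUS_PPB      /* private peripheral bus (SysTick, NVIC, ...) */
};

/* Which core bus interface handles an access to `addr`
   (hypothetical helper, simplified memory map). */
enum cm3_bus cm3_bus_for_address(uint32_t addr)
{
    if (addr < 0x20000000u)
        return BUS_CODE;
    if (addr >= 0xE0000000u && addr <= 0xE00FFFFFu)
        return BUS_PPB;
    return BUS_SYSTEM;
}
```

    With this, code at 0x08000000 (flash) or at 0 (remapped RAM) lands
    on the code buses, while code at 0x20000000 and the GPIO writes to
    0x42210198 both land on the system bus.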

    IIUC bit-band access uses a read-modify-write sequence and probably
    the whole sequence keeps exclusive use of the "system bus" during
    execution. Since the core is fetching instructions on the "system bus"
    this must slow down execution. Compared to simple writes, bit-band
    access seems to cause an overhead of order 8-11 clocks. There are two
    accesses per iteration, so the overhead for a single access seems to
    be 4-6 cycles. I must admit that this looks surprisingly high.
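
    The overhead figures can be rechecked directly from the table. The
    C sketch below (struct and function names are illustrative only)
    subtracts the plain-write timing from the bit-band timing for each
    chip. The six differences come out between 7 and 11 clocks, that is
    3.5 to 5.5 clocks per access; the CKS32F103 pin1/pin2 pair gives 7,
    slightly below the range quoted above.

```c
/* Measured clocks per iteration, RAM rows of the table. */
struct pin_timings {
    const char *chip;
    int pin1, pin2, pin3, pin4;
};

static const struct pin_timings timings[] = {
    { "STM32F103", 28, 18, 23, 12 },
    { "CKS32F103", 29, 22, 24, 14 },
    { "Air32F103", 22, 14, 19,  8 },
};

/* Bit-band overhead with constants reloaded each pass (pin1 - pin2). */
int bitband_overhead_reload(int i)
{
    return timings[i].pin1 - timings[i].pin2;
}

/* Bit-band overhead with constants hoisted out of the loop (pin3 - pin4). */
int bitband_overhead_hoisted(int i)
{
    return timings[i].pin3 - timings[i].pin4;
}
```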

    The gain from moving the loading of constants outside the loop is
    almost as expected: two memory fetches each needing 2 clocks and
    two single-clock instructions together give 6 clocks, which
    in several cases agrees with the measured results. But there
    are a few discrepancies.
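
    To make the comparison concrete, here is a small C sketch (names
    are hypothetical) that computes the expected saving from the
    per-instruction costs assumed above and the measured saving from
    the table. The plain-write pair (pin2 vs pin4) matches the expected
    6 clocks on STM32F103 and Air32F103; the other cases deviate.

```c
/* Costs assumed in the text: 2 clocks per literal load,
   1 clock per movs. Two of each are hoisted out of the loop. */
enum { LOAD_CLOCKS = 2, MOVS_CLOCKS = 1 };

int expected_gain(void)
{
    return 2 * LOAD_CLOCKS + 2 * MOVS_CLOCKS;
}

/* Measured clocks per iteration, RAM rows of the table. */
struct pin_row {
    const char *chip;
    int pin1, pin2, pin3, pin4;
};

static const struct pin_row rows[] = {
    { "STM32F103", 28, 18, 23, 12 },
    { "CKS32F103", 29, 22, 24, 14 },
    { "Air32F103", 22, 14, 19,  8 },
};

/* Measured gain for the bit-band variant (pin1 - pin3). */
int gain_bitband(int i) { return rows[i].pin1 - rows[i].pin3; }

/* Measured gain for the plain-write variant (pin2 - pin4). */
int gain_plain(int i) { return rows[i].pin2 - rows[i].pin4; }
```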

    It may be of some interest that on this very artificial test
    the 3 F103-alikes show quite different performance, with
    CKS32F103 the slowest, Air32F103 the fastest and STM32F103 in
    the middle.

    --
    Waldek Hebisch
