• Performance issue with CPython 3.10 + Cython

    From Andreas Ames@21:1/5 to All on Tue Oct 4 11:05:42 2022
    Hi all,

    I am wrapping an embedded application (, which does not use any dynamic
    memory management,) using Cython to call it from CPython. The wrapped application uses a cyclic executive, i.e. everything is done in the input-logic-output design, typical for some real-time related domains. Consequentially, every application cycle executes more or less the very
    same code. As I am still in a prototyping stadium, the wrapped process is completely CPU-bound, i.e. except of some logging output there is no I/O whatsoever.

    During one second of "application time", I am doing exactly 120 calls into
    the application through three Cython-wrapped API functions. The
    application uses some platform-dependent APIs, which I have also wrapped
    with Cython, so that there are numerous callbacks into the Python realm per call into the application. What I am observing now, is that the performance
    per "application second" decreases (remember: executing code that does the
    same thing on every cycle) and extending the number of loop iterations does
    not seem to cause any bound to this decrease. In the log ouput below, you
    can see the GC counts, which look innocent to me. The "real time" is
    measured using "time.time()". The "top" utility does not suggest any memory leak either. I am developing on WSL2, but I have checked that this
    performance effect also happens on physical machines. Right now, I am
    staring at "kcachegrind", but I have no idea, how to determine time series
    for the performance of functions (I am not looking for those functions,
    which need the most time, but for those, which consume more and more
    execution time).

    One more thing to look for could be memory fragmentation, but before that I would like to ask the experts here for their ideas and experiences and/or
    for tools, which could help to find the culprit.

    2022-10-04 10:40:50|INFO |__main__ |Execution loop 0 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to 0.06862711906433105 seconds real time.
    2022-10-04 10:40:51|INFO |__main__ |Execution loop 100 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.08224177360534668 seconds real time.
    2022-10-04 10:40:52|INFO |__main__ |Execution loop 200 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.08225250244140625 seconds real time.
    2022-10-04 10:40:53|INFO |__main__ |Execution loop 300 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.10176873207092285 seconds real time.
    2022-10-04 10:40:54|INFO |__main__ |Execution loop 400 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.10900592803955078 seconds real time.
    2022-10-04 10:40:55|INFO |__main__ |Execution loop 500 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.12233948707580566 seconds real time.
    2022-10-04 10:40:56|INFO |__main__ |Execution loop 600 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.14058256149291992 seconds real time.
    2022-10-04 10:40:58|INFO |__main__ |Execution loop 700 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.14777183532714844 seconds real time.
    2022-10-04 10:40:59|INFO |__main__ |Execution loop 800 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.15729451179504395 seconds real time.
    2022-10-04 10:41:01|INFO |__main__ |Execution loop 900 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.17365813255310059 seconds real time.
    2022-10-04 10:41:03|INFO |__main__ |Execution loop 1000 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.17772984504699707 seconds real time.
    2022-10-04 10:41:05|INFO |__main__ |Execution loop 1100 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.1955263614654541 seconds real time.
    2022-10-04 10:41:07|INFO |__main__ |Execution loop 1200 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.20046710968017578 seconds real time.
    2022-10-04 10:41:09|INFO |__main__ |Execution loop 1300 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.22513842582702637 seconds real time.
    2022-10-04 10:41:11|INFO |__main__ |Execution loop 1400 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.23578548431396484 seconds real time.
    2022-10-04 10:41:13|INFO |__main__ |Execution loop 1500 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.24581527709960938 seconds real time.
    2022-10-04 10:41:16|INFO |__main__ |Execution loop 1600 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.2541334629058838 seconds real time.
    2022-10-04 10:41:19|INFO |__main__ |Execution loop 1700 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.26812195777893066 seconds real time.
    2022-10-04 10:41:21|INFO |__main__ |Execution loop 1800 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.28777456283569336 seconds real time.
    2022-10-04 10:41:24|INFO |__main__ |Execution loop 1900 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.3005337715148926 seconds real time.
    2022-10-04 10:41:28|INFO |__main__ |Execution loop 2000 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.30992960929870605 seconds real time.
    2022-10-04 10:41:31|INFO |__main__ |Execution loop 2100 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.32058119773864746 seconds real time.
    2022-10-04 10:41:34|INFO |__main__ |Execution loop 2200 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.33020949363708496 seconds real time.
    2022-10-04 10:41:37|INFO |__main__ |Execution loop 2300 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.34426307678222656 seconds real time.
    2022-10-04 10:41:41|INFO |__main__ |Execution loop 2400 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.35851263999938965 seconds real time.
    2022-10-04 10:41:45|INFO |__main__ |Execution loop 2500 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.3664553165435791 seconds real time.
    2022-10-04 10:41:48|INFO |__main__ |Execution loop 2600 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.3767662048339844 seconds real time.
    2022-10-04 10:41:52|INFO |__main__ |Execution loop 2700 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.39760732650756836 seconds real time.
    2022-10-04 10:41:56|INFO |__main__ |Execution loop 2800 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.42221736907958984 seconds real time.
    2022-10-04 10:42:01|INFO |__main__ |Execution loop 2900 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.4237234592437744 seconds real time.


    Thanks in advance,

    Andreas

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Andreas Ames@21:1/5 to All on Fri Oct 7 16:39:00 2022
    Answering to myself, just for the records:

    1. The culprit was me. As lazy as I am, I have used f-strings all over the place in calls to `logging.logger.debug()` and friends, evaluating all arguments regardless of whether the logger was enabled or not. Replacing
    these f-strings by regular printf-like format strings solved the issue.
    Now the application bowls happily along, consistently below 0.02 seconds
    per second application time.
    2. Valgrind + callgrind is an awesome toolchain to spot performance issues, even on VMs.


    Am Di., 4. Okt. 2022 um 11:05 Uhr schrieb Andreas Ames < andreas.0815.qwertz@gmail.com>:

    Hi all,

    I am wrapping an embedded application (, which does not use any dynamic memory management,) using Cython to call it from CPython. The wrapped application uses a cyclic executive, i.e. everything is done in the input-logic-output design, typical for some real-time related domains. Consequentially, every application cycle executes more or less the very
    same code. As I am still in a prototyping stadium, the wrapped process is completely CPU-bound, i.e. except of some logging output there is no I/O whatsoever.

    During one second of "application time", I am doing exactly 120 calls into the application through three Cython-wrapped API functions. The
    application uses some platform-dependent APIs, which I have also wrapped
    with Cython, so that there are numerous callbacks into the Python realm per call into the application. What I am observing now, is that the performance per "application second" decreases (remember: executing code that does the same thing on every cycle) and extending the number of loop iterations does not seem to cause any bound to this decrease. In the log ouput below, you can see the GC counts, which look innocent to me. The "real time" is measured using "time.time()". The "top" utility does not suggest any memory leak either. I am developing on WSL2, but I have checked that this performance effect also happens on physical machines. Right now, I am staring at "kcachegrind", but I have no idea, how to determine time series for the performance of functions (I am not looking for those functions,
    which need the most time, but for those, which consume more and more execution time).

    One more thing to look for could be memory fragmentation, but before that
    I would like to ask the experts here for their ideas and experiences and/or for tools, which could help to find the culprit.

    2022-10-04 10:40:50|INFO |__main__ |Execution loop 0 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.06862711906433105 seconds real time.
    2022-10-04 10:40:51|INFO |__main__ |Execution loop 100 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.08224177360534668 seconds real time.
    2022-10-04 10:40:52|INFO |__main__ |Execution loop 200 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.08225250244140625 seconds real time.
    2022-10-04 10:40:53|INFO |__main__ |Execution loop 300 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.10176873207092285 seconds real time.
    2022-10-04 10:40:54|INFO |__main__ |Execution loop 400 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.10900592803955078 seconds real time.
    2022-10-04 10:40:55|INFO |__main__ |Execution loop 500 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.12233948707580566 seconds real time.
    2022-10-04 10:40:56|INFO |__main__ |Execution loop 600 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.14058256149291992 seconds real time.
    2022-10-04 10:40:58|INFO |__main__ |Execution loop 700 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.14777183532714844 seconds real time.
    2022-10-04 10:40:59|INFO |__main__ |Execution loop 800 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.15729451179504395 seconds real time.
    2022-10-04 10:41:01|INFO |__main__ |Execution loop 900 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.17365813255310059 seconds real time.
    2022-10-04 10:41:03|INFO |__main__ |Execution loop 1000 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.17772984504699707 seconds real time.
    2022-10-04 10:41:05|INFO |__main__ |Execution loop 1100 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.1955263614654541 seconds real time.
    2022-10-04 10:41:07|INFO |__main__ |Execution loop 1200 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.20046710968017578 seconds real time.
    2022-10-04 10:41:09|INFO |__main__ |Execution loop 1300 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.22513842582702637 seconds real time.
    2022-10-04 10:41:11|INFO |__main__ |Execution loop 1400 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.23578548431396484 seconds real time.
    2022-10-04 10:41:13|INFO |__main__ |Execution loop 1500 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.24581527709960938 seconds real time.
    2022-10-04 10:41:16|INFO |__main__ |Execution loop 1600 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.2541334629058838 seconds real time.
    2022-10-04 10:41:19|INFO |__main__ |Execution loop 1700 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.26812195777893066 seconds real time.
    2022-10-04 10:41:21|INFO |__main__ |Execution loop 1800 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.28777456283569336 seconds real time.
    2022-10-04 10:41:24|INFO |__main__ |Execution loop 1900 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.3005337715148926 seconds real time.
    2022-10-04 10:41:28|INFO |__main__ |Execution loop 2000 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.30992960929870605 seconds real time.
    2022-10-04 10:41:31|INFO |__main__ |Execution loop 2100 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.32058119773864746 seconds real time.
    2022-10-04 10:41:34|INFO |__main__ |Execution loop 2200 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.33020949363708496 seconds real time.
    2022-10-04 10:41:37|INFO |__main__ |Execution loop 2300 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.34426307678222656 seconds real time.
    2022-10-04 10:41:41|INFO |__main__ |Execution loop 2400 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.35851263999938965 seconds real time.
    2022-10-04 10:41:45|INFO |__main__ |Execution loop 2500 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.3664553165435791 seconds real time.
    2022-10-04 10:41:48|INFO |__main__ |Execution loop 2600 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.3767662048339844 seconds real time.
    2022-10-04 10:41:52|INFO |__main__ |Execution loop 2700 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.39760732650756836 seconds real time.
    2022-10-04 10:41:56|INFO |__main__ |Execution loop 2800 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.42221736907958984 seconds real time.
    2022-10-04 10:42:01|INFO |__main__ |Execution loop 2900 done. GC
    counts = (381, 9, 3); 1 second of application time corresponds to
    0.4237234592437744 seconds real time.


    Thanks in advance,

    Andreas



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)