• Stack analysis tool that really works?

    From Niklas Holsti@21:1/5 to pozz on Thu Jul 29 19:36:55 2021
    On 2021-07-29 19:22, pozz wrote:
    arm gcc and Cortex-Mx MCUs embedded systems.

    Is there a compilation-time (static) tool for stack analysis that really works?

    The best I could find is -fstack-usage and avstack.pl Perl script, but I guess there's another and better way.


    If you can pay, there is https://www.absint.com/stackanalyzer/index.htm.
    I haven't used it, but I have used other tools from AbsInt
    (WCET-analysis tools) which worked well, and (I believe) perform stack
    analysis internally, so I guess the simpler stack-analysis tool also works.

    There is also https://www.adacore.com/gnatpro/toolsuite/gnatstack, which
    I have used successfully. I used it for an Ada application, but it
    should also work for C and C++.

    Unfortunately I don't know of any free tools for your target.

  • From pozz@21:1/5 to All on Thu Jul 29 18:22:00 2021
    arm gcc and Cortex-Mx MCUs embedded systems.

    Is there a compilation-time (static) tool for stack analysis that really
    works?

    The best I could find is -fstack-usage and the avstack.pl Perl script, but I guess there's another, better way.
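
    For reference, -fstack-usage makes the compiler write a .su file next to each object, with one line per function. A minimal sketch (the function and the byte count here are illustrative, not measured):

        /* sum.c -- build with: arm-none-eabi-gcc -c -fstack-usage sum.c */
        #include <stdint.h>

        uint32_t sum(const uint32_t *buf, int n)
        {
            uint32_t acc = 0;
            for (int i = 0; i < n; i++)
                acc += buf[i];
            return acc;
        }

        /* sum.su then contains something like:
         *   sum.c:4:10:sum    16    static
         * "static" means a fixed frame size; avstack.pl walks the call graph
         * and sums these per-function numbers along each call chain.
         */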

  • From Don Y@21:1/5 to pozz on Mon Aug 2 17:56:17 2021
    On 7/29/2021 9:22 AM, pozz wrote:
    arm gcc and Cortex-Mx MCUs embedded systems.

    Is there a compilation-time (static) tool for stack analysis that really works?

    What do you mean by "really works"? It's a (potentially) unbounded problem
    as the compiler can't know what the inputs will drive the code to do, in practice.

    Your best practical option is to tease a callgraph (tree) out and use your
    own knowledge of what the code WANTS to do to identify the worst-case
    stack penetration path. This is also a great way to get a feel for whether your code might benefit from a good refactoring!
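
    For example (function names and frame sizes purely illustrative), if the deepest chain in the call tree turns out to be

        main (32 B) -> cmd_parse (48 B) -> log_printf (96 B) -> vsnprintf (360 B)

    then the static worst case for that task is 32 + 48 + 96 + 360 = 536 bytes, plus whatever the ISRs add if they share the same stack.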

    The best I could find is -fstack-usage and avstack.pl Perl script, but I guess
    there's another and better way.

    There are general tools that you can probably instrument to coax a
    number out of the source (if you don't have access to the source, you're typically SoL) but they will be expensive in time or money.

  • From Niklas Holsti@21:1/5 to Don Y on Tue Aug 3 11:43:42 2021
    On 2021-08-03 3:56, Don Y wrote:
    On 7/29/2021 9:22 AM, pozz wrote:
    arm gcc and Cortex-Mx MCUs embedded systems.

    Is there a compilation-time (static) tool for stack analysis that
    really works?

    What do you mean by "really works"?  It's a (potentially) unbounded problem as the compiler can't know what the inputs will drive the code to do, in practice.


    There are practical static-analysis methods to compute an upper bound on
    the stack usage that work well for many embedded programs. An upper
    bound is usually all one needs. The exact worst case is less interesting.

    Some problems arise for programs that branch to dynamically computed
    addresses, for example by calling functions via function pointers.
    Static analyses may not be able to find the set of possible target
    addresses. When that happens you have to guide the analysis by
    specifying where such control transfers can end up. For typical embedded
    code that is not hard to do, but a program that relies extensively on
    virtual function calls may be hard to analyze.


    There are general tools that you can probably instrument to coax a
    number out of the source (if you don't have access to the source, you're typically SoL) but they will be expensive in time or money.


    The exact stack usage of a subprogram can't be derived from source code
    without knowing what machine code the compiler and linker will produce.
    The static stack-analysis tools typically work on the final executable
    code (which also means that they can be independent of the programming language).

    But (as I said in an earlier post) the tools I am aware of, for the OP's target, are not free.

  • From Don Y@21:1/5 to Niklas Holsti on Tue Aug 3 02:28:26 2021
    On 8/3/2021 1:43 AM, Niklas Holsti wrote:
    On 2021-08-03 3:56, Don Y wrote:
    On 7/29/2021 9:22 AM, pozz wrote:
    arm gcc and Cortex-Mx MCUs embedded systems.

    Is there a compilation-time (static) tool for stack analysis that really works?

    What do you mean by "really works"? It's a (potentially) unbounded problem as the compiler can't know what the inputs will drive the code to do, in practice.

    There are practical static-analysis methods to compute an upper bound on the stack usage that work well for many embedded programs. An upper bound is usually all one needs. The exact worst case is less interesting.

    That depends on the actual algorithm being implemented, the style of the developer and, often, the inputs provided.

    For example, I often use recursive algorithms for pattern matching (because they are easier to "get right"). Looking at the algorithm, you can't know
    how deep the recursion will be -- without seeing the input it will be fed.

    I, OTOH, have intimate knowledge of that input (or, at least *one* half of
    the comparison) and can ensure the recursion 1) stops and 2) stops at a predictable depth (based on that one half of the comparison).

    The same for "layered" UI's -- the structure of the user interface is
    encoded elsewhere; the code that implements it is driven by those
    structures (which can change -- or be changed -- at runtime).

    Any analysis (even by human beings) won't provide those figures -- without having a peek at the inputs.

    Some problems arise for programs that branch to dynamically computed addresses,
    for example by calling functions via function pointers. Static analyses may not
    be able to find the set of possible target addresses. When that happens you have to guide the analysis by specifying where such control transfers can end up. For typical embedded code that is not hard to do, but a program that relies
    extensively on virtual function calls may be hard to analyze.

    The same is true of a program that is driven by external events.
    The code doesn't (typically) "know" the constraints placed on those
    events. (one can argue as to whether or not this is "good practice")
    So, it can't "find the end".

    There are general tools that you can probably instrument to coax a
    number out of the source (if you don't have access to the source, you're
    typically SoL) but they will be expensive in time or money.

    The exact stack usage of a subprogram can't be derived from source code without
    knowing what machine code the compiler and linker will produce. The static stack-analysis tools typically work on the final executable code (which also means that they can be independent of the programming language).

    If you don't have the sources, you can't *do* anything with the results
    of the analysis -- other than define a minimum stack size (which may
    be a surprise to you *and* your hardware!)

    If you don't have the sources, you likely haven't a clue as to how
    the code behaves, under the complete set of I/O conditions.

    But (as I said in an earlier post) the tools I am aware of, for the OP's target, are not free.

    You can possibly instrument some DSE-like tools. But, I would imagine
    the execution time of the tool would be prohibitively long -- for all
    but trivial codebases.

    [I should try that when I have some time!]

    How efficient are those you've used? Do they operate in "near compile time"? Or, something considerably longer (slower)? I.e., are they practical to
    use iteratively? On an entire application?

  • From pozz@21:1/5 to All on Tue Aug 3 11:49:29 2021
    On 03/08/2021 02:56, Don Y wrote:
    On 7/29/2021 9:22 AM, pozz wrote:
    arm gcc and Cortex-Mx MCUs embedded systems.

    Is there a compilation-time (static) tool for stack analysis that
    really works?

    What do you mean by "really works"?  It's a (potentially) unbounded problem as the compiler can't know what the inputs will drive the code to do, in practice.

    I mean a simple tool that can be instructed, even manually, to produce a
    good result.

    -fstack-usage is not usable "by hand".

    You need at least a call graph (generated by a tool), to fill each branch (function) with its stack usage (generated by the compiler), and to fill in every branch that the tools cannot know at compile time (calls through function pointers, recursion, and so on).

    It's surprising to me that there isn't a single inexpensive tool that helps with this, considering there is a multitude of good free tools for developers.


    Your best practical option is to tease a callgraph (tree) out and use your own knowledge of what the code WANTS to do to identify the worst-case
    stack penetration path.  This is also a great way to get a feel for if your code might benefit from a good refactoring!

    The best I could find is -fstack-usage and avstack.pl Perl script, but
    I guess there's another and better way.

    There are general tools that you can probably instrument to coax a
    number out of the source (if you don't have access to the source, you're typically SoL) but they will be expensive in time or money.

  • From Don Y@21:1/5 to pozz on Tue Aug 3 20:17:14 2021
    On 8/3/2021 2:49 AM, pozz wrote:
    On 03/08/2021 02:56, Don Y wrote:
    On 7/29/2021 9:22 AM, pozz wrote:
    arm gcc and Cortex-Mx MCUs embedded systems.

    Is there a compilation-time (static) tool for stack analysis that really works?

    What do you mean by "really works"? It's a (potentially) unbounded problem as the compiler can't know what the inputs will drive the code to do, in practice.

    I mean a simple tool that can be instructed, even manually, to produce a good result.

    What's "good"? Does it have to solve *most* peoples' needs? Or, just yours?

    As I said, the first step is understanding what the dynamics of execution
    in your task happen to be. You may be surprised to see functions being
    dragged in that you'd not have thought were part of the mix! ("why is
    printf() being called, here?? and, couldn't I use something like itoa() instead?")

    The compiler won't know anything about your run-time environment. It won't know if you have a separate stack for ISRs, if the function you are analyzing runs as the "root" of a particular process tree, etc. Do you know how much work is done BEFORE the first line of your function is executed?
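
    (As a concrete Cortex-M example of work done before your first line runs: on exception entry the hardware alone pushes the 8-word basic frame -- R0-R3, R12, LR, PC, xPSR -- onto the active stack, i.e. 8 x 4 = 32 bytes, or 26 words = 104 bytes when the extended FPU frame is stacked, before the ISR prologue allocates anything of its own.)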

    It also won't know which execution paths *will* be exercised in practice.
    You may have paths that can't be executed (given a particular set of
    inputs). How will you know to recognize their impact on any figures
    reported?

    -fstack-usage is not usable "by hands".

    No, but you can use that data *with* the call tree to get an idea of
    where your maximum lies -- assuming no recursion and worst-case path
    coverage.

    You can also fill the stack space with 0x2B00B1E5, run your COMPREHENSIVE
    test case suite that exercises ALL paths through the code and check to see
    what the high-water mark was. (you *are* testing the code, right?)
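
    A minimal sketch of that technique, assuming a single downward-growing stack whose limits are exported by the linker script as _sstack/_estack (those symbol names are an assumption -- use whatever your script defines):

        #include <stdint.h>
        #include <stddef.h>

        extern uint32_t _sstack;    /* lowest stack address (assumed linker symbol) */
        extern uint32_t _estack;    /* initial SP / top of stack (assumed symbol)   */
        #define STACK_FILL 0x2B00B1E5u

        /* Call very early (before the RTOS/main loop starts): paint everything
         * below the current frame with the fill pattern. */
        void stack_paint(void)
        {
            uint32_t *top = (uint32_t *)__builtin_frame_address(0);
            for (uint32_t *p = &_sstack; p < top; p++)
                *p = STACK_FILL;
        }

        /* After the test run: the first word that no longer holds the pattern
         * marks the deepest stack use seen so far. */
        size_t stack_high_water(void)
        {
            uint32_t *p = &_sstack;
            while (p < &_estack && *p == STACK_FILL)
                p++;
            return (size_t)((uint8_t *)&_estack - (uint8_t *)p);
        }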

    Hint: you *really* want tasks to be of the lowest practical complexity
    that can meet your requirements -- even if that means splitting tasks
    into more subunits.

    You need at least a call-graph (generated by a tool), fill each branch (function) with a stack usage (generated by the compiler),

    These can be done with a script (as I suggested below)

    fill every branch
    that is not known at compilation time by the tools (call functions by pointers,
    recursive, and so on).

    Ah, well... there's the rub! How smart should this "free" tool be?
    And, how should it handle cases that can't be known at compile time
    ("Sorry, your codebase appears to need AT LEAST X bytes of stack; but,
    I can't tell you how much more!" -- what use, that?)

    It's surprisingly to me there isn't a single non-expensive tool that helps in this, considering there are a multitude of good free tools for developers.

    It costs money to engage in any business (non-hobbyist) endeavor.
    Expecting all of your tools to be "free" is a bit naive.

    There are some absolutely *amazing* tools available that can save
    man-years of development effort and greatly improve the quality/correctness
    of code. You can buy them -- or, spend man-hours trying to do what they
    do. Hopefully, without error.

    And, having *done* that, do it all over again on your NEXT project!

    <frown>

    Your best practical option is to tease a callgraph (tree) out and use your own knowledge of what the code WANTS to do to identify the worst-case stack penetration path. This is also a great way to get a feel for if your code might benefit from a good refactoring!

    The best I could find is -fstack-usage and avstack.pl Perl script, but I guess there's another and better way.

    There are general tools that you can probably instrument to coax a
    number out of the source (if you don't have access to the source, you're
    typically SoL) but they will be expensive in time or money.

    *Try* building the call tree (there are some graphic tools) and
    have a look at it. Is it what you expected? Do you understand
    *why* each function is being invoked in each edge?

    Then, try to *guess* where the "stack pigs" are located.

    Finally, see what the compiler tells you for each of them and
    how well that fits with your "understanding" of the code.

    [The reason you want to do this is so you have a better feel
    for the costs of particular approaches -- instead of just
    reacting to the "final assessment". For example, most newbies
    are stunned to discover how expensive an off-the-shelf
    printf() can be!]

  • From pozz@21:1/5 to All on Wed Aug 4 16:43:10 2021
    On 04/08/2021 05:17, Don Y wrote:
    On 8/3/2021 2:49 AM, pozz wrote:
    On 03/08/2021 02:56, Don Y wrote:
    On 7/29/2021 9:22 AM, pozz wrote:
    arm gcc and Cortex-Mx MCUs embedded systems.

    Is there a compilation-time (static) tool for stack analysis that
    really works?

    What do you mean by "really works"?  It's a (potentially) unbounded problem as the compiler can't know what the inputs will drive the code to do, in practice.

    I mean a simple tool that can be instructed, even manually, to produce
    a good result.

    What's "good"?  Does it have to solve *most* peoples' needs?  Or, just
    yours?

    Most people's needs, which are similar: a good estimate of the worst-case stack usage.


    As I said, the first step is understanding what the dynamics of execution
    in your task happen to be.  You may be surprised to see functions being dragged in that you'd not have thought were part of the mix!  ("why is printf() being called, here??  and, couldn't I use something like itoa() instead?")

    Oh yes, but a simple tool that automatically generates a call graph could be very useful for this.


    The compiler won't know anything about your run-time environment.  It won't know if you have a separate stack for ISRs, if the function you are
    analyzing
    runs as the "root" of a particular process tree, etc.  Do you know how much work is done BEFORE the first line of your function executed?

    It also won't know which execution paths *will* be exercised in practice.
    You may have paths that can't be executed (given a particular set of inputs).  How will you know to recognize their impact on any figures reported?

    I know that the compiler can't know everything, but *I* can instruct the tool with that kind of info.


    -fstack-usage is not usable "by hands".

    No, but you can use that data *with* the call tree to get an idea of
    where your maximum lies -- assuming no recursion and worst-case path coverage.

    I think this job could be done by a single tool that creates a call graph and fills in the values from the stack-usage data.


    You can also fill the stack space with 0x2B00B1E5, run your COMPREHENSIVE test case suite that exercises ALL paths through the code and check to see what the high-water mark was.  (you *are* testing the code, right?)

    This is the dynamic approach; I was exploring the static approach.


    Hint:  you *really* want tasks to be of the lowest practical complexity
    that can meet your requirements -- even if that means splitting tasks
    into more subunits.

    You need at least a call-graph (generated by a tool), fill each branch
    (function) with a stack usage (generated by the compiler),

    These can be done with a script (as I suggested below)

    fill every branch that is not known at compilation time by the tools
    (call functions by pointers, recursive, and so on).

    Ah, well... there's the rub!  How smart should this "free" tool be?
    And, how should it handle cases that can't be known at compile time
    ("Sorry, your codebase appears to need AT LEAST X bytes of stack; but,
    I can't tell you how much more!" -- what use, that?)

    It's surprisingly to me there isn't a single non-expensive tool that
    helps in this, considering there are a multitude of good free tools
    for developers.

    It costs money to engage in any business (non-hobbyist) endeavor.
    Expecting all of your tools to be "free" is a bit naive.

    Anyway, there is plenty of complex, good, free and open-source software (gcc is one example). So it's strange that there isn't a similar tool for stack analysis. That's all; it seems strange to me, but I don't expect all my preferred tools to be free.


    There are some absolutely *amazing* tools available that can save
    man-years of development effort and greatly improve the quality/correctness of code.  You can buy them -- or, spend man-hours trying to do what they
    do.  Hopefully, without error.

    And, having *done* that, do it all over again on your NEXT project!

    <frown>

    Yes, I know these are the only solutions. I was asking in case a free tool existed that I didn't know about.


    Your best practical option is to tease a callgraph (tree) out and use
    your
    own knowledge of what the code WANTS to do to identify the worst-case
    stack penetration path.  This is also a great way to get a feel for
    if your
    code might benefit from a good refactoring!

    The best I could find is -fstack-usage and avstack.pl Perl script,
    but I guess there's another and better way.

    There are general tools that you can probably instrument to coax a
    number out of the source (if you don't have access to the source, you're typically SoL) but they will be expensive in time or money.

    *Try* building the call tree (there are some graphic tools) and
    have a look at it.  Is it what you expected?  Do you understand
    *why* each function is being invoked in each edge?

    Then, try to *guess* where the "stack pigs" are located.

    Finally, see what the compiler tells you for each of them and
    how well that fits with your "understanding" of the code.

    [The reason you want to do this is so you have a better feel
    for the costs of particular approaches -- instead of just
    reacting to the "final assessment".  For example, most newbies
    are stunned to discover how expensive an off-the-shelf
    printf() can be!]

  • From Niklas Holsti@21:1/5 to pozz on Wed Aug 4 20:00:37 2021
    On 2021-08-03 12:49, pozz wrote:
    On 03/08/2021 02:56, Don Y wrote:
    On 7/29/2021 9:22 AM, pozz wrote:
    arm gcc and Cortex-Mx MCUs embedded systems.

    Is there a compilation-time (static) tool for stack analysis that
    really works?

    What do you mean by "really works"? It's a (potentially) unbounded
    problem as the compiler can't know what the inputs will drive the
    code to do, in practice.

    I mean a simple tool that can be instructed, even manually, to
    produce a good result.

    -fstack-usage is not usable "by hands".

    You need at least a call-graph (generated by a tool), fill each
    branch (function) with a stack usage (generated by the compiler),
    fill every branch that is not known at compilation time by the tools
    (call functions by pointers, recursive, and so on).

    It's surprisingly to me there isn't a single non-expensive tool that
    helps in this, considering there are a multitude of good free tools
    for developers.


    One reason is that for good results, such a tool has to analyze the
    executable code, and therefore must be target-specific, or at least have
    ports to the various targets, increasing the effort to implement and
    maintain the tool. The gnatstack tool gets around that, to some extent,
    by relying on stack-usage and call information from the compiler (gcc).
    Note that a source-level call graph will not show calls to the various
    support routines (run-time routines) inserted by the compiler, but those
    calls do use stack.
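
    A concrete example of such a hidden call on Cortex-M (__aeabi_uldivmod is a real ARM EABI helper in libgcc; the function around it is only an illustration):

        #include <stdint.h>

        uint64_t average(uint64_t total, uint32_t count)
        {
            /* No call is visible in the source, but arm gcc implements this
             * 64-bit division by calling the libgcc helper __aeabi_uldivmod,
             * which has its own stack frame and will not appear in a call
             * graph built from the C source alone. */
            return total / count;
        }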

    I am the main author of a WCET-analysis tool, Bound-T, that also does
    stack analysis and is now free. However, there is (as yet) no port for
    ARM Cortex-M. See http://www.bound-t.com/.

  • From Don Y@21:1/5 to Niklas Holsti on Wed Aug 4 10:50:06 2021
    On 8/4/2021 10:00 AM, Niklas Holsti wrote:
    On 2021-08-03 12:49, pozz wrote:
    On 03/08/2021 02:56, Don Y wrote:
    On 7/29/2021 9:22 AM, pozz wrote:
    arm gcc and Cortex-Mx MCUs embedded systems.

    Is there a compilation-time (static) tool for stack analysis that
    really works?

    What do you mean by "really works"? It's a (potentially) unbounded
    problem as the compiler can't know what the inputs will drive the
    code to do, in practice.

    I mean a simple tool that can be instructed, even manually, to
    produce a good result.

    -fstack-usage is not usable "by hands".

    You need at least a call-graph (generated by a tool), fill each
    branch (function) with a stack usage (generated by the compiler),
    fill every branch that is not known at compilation time by the tools
    (call functions by pointers, recursive, and so on).

    It's surprisingly to me there isn't a single non-expensive tool that
    helps in this, considering there are a multitude of good free tools
    for developers.


    One reason is that for good results, such a tool has to analyze the executable
    code, and therefore must be target-specific, or at least have ports to the various targets, increasing the effort to implement and maintain the tool. The
    gnatstack tool gets around that, to some extent, by relying on stack-usage and
    call information from the compiler (gcc).

    I am seeing an increasing number of tools relying on intermediate
    encodings (e.g., via LLVM) to give more portability to their
    application. But, then you're stuck *needing* access to the
    sources.

    Note that a source-level call graph
    will not show calls to the various support routines (run-time routines) inserted by the compiler, but those calls do use stack.

    I am the main author of a WCET-analysis tool, Bound-T, that also does stack analysis and is now free. However, there is (as yet) no port for ARM Cortex-M.
    See http://www.bound-t.com/.

    But not free as in open (?)

  • From Don Y@21:1/5 to pozz on Wed Aug 4 11:06:08 2021
    On 8/4/2021 7:43 AM, pozz wrote:
    As I said, the first step is understanding what the dynamics of execution
    in your task happen to be. You may be surprised to see functions being
    dragged in that you'd not have thought were part of the mix! ("why is
    printf() being called, here?? and, couldn't I use something like itoa()
    instead?")

    Oh yes, but a simple tool that generates automatically a call graph could be very useful for this.

    How "simple" does it have to be? There are already tools that will
    do this for you, even with GUI output. How "pretty" is a matter of debate
    given that it's hard to make an algorithm that can find the "prettiest"
    way of drawing such a graph.

    The compiler won't know anything about your run-time environment. It won't know if you have a separate stack for ISRs, if the function you are analyzing runs as the "root" of a particular process tree, etc. Do you know how much work is done BEFORE the first line of your function executed?

    It also won't know which execution paths *will* be exercised in practice.
    You may have paths that can't be executed (given a particular set of
    inputs). How will you know to recognize their impact on any figures
    reported?

    I know that compiler can't know everything, but *I* can instruct the tool with
    that kind of info.

    When does your effort fall to the level of being "by hand"?
    Note that you really would like a tool to do everything
    (at least, everything IMPORTANT) for you so you don't
    screw up or misapply it.

    -fstack-usage is not usable "by hands".

    No, but you can use that data *with* the call tree to get an idea of
    where your maximum lies -- assuming no recursion and worst-case path
    coverage.

    I think this job could be done by a single tool that creates a call graph and fill in the values from stack usage.

    Have you looked at egypt? There are other similar tools (that may
    require more than one command to build the output)
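
    For the record, egypt works from gcc's RTL dumps; roughly (file names illustrative -- check egypt's documentation for the exact invocation on your gcc version):

        arm-none-eabi-gcc -c -fdump-rtl-expand -fstack-usage main.c uart.c
        egypt main.c.*.expand uart.c.*.expand | dot -Tsvg -o callgraph.svg

    The .su files from -fstack-usage then give you a frame size to write on each node of that graph.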

    You can also fill the stack space with 0x2B00B1E5, run your COMPREHENSIVE
    test case suite that exercises ALL paths through the code and check to see what the high-water mark was. (you *are* testing the code, right?)

    This is the dynamic approach, I was exploring the static approach.

    My point is that you still have to perform tests that exercise every path through your code. So, the test suite can be instrumented to harvest
    this data KNOWING that every option has been examined.

    It costs money to engage in any business (non-hobbyist) endeavor.
    Expecting all of your tools to be "free" is a bit naive.

    Anyway there are plenty of complex and good free and open-source software (gcc
    is one of them).
    So it's strange there isn't a similar tool for stack analysis. That's all, it's
    strange for me, but I don't pretend all my preferred tools must be free.

    What's available is what SOMEONE decided they wanted to develop *and*
    offer up to others. You typically have fewer choices for "free" than
    "for pay". Developing any tool (and maintaining it -- or, providing enough information that OTHERS can maintain it without you) isn't a trivial job.

    And, *a* free tool will likely soak up much of the effort that others
    may have spent on some OTHER approach to that free tool. So, the first drives/hinders future development; it's hard to completely redesign an
    existing tool -- esp if others have bought into it on the maintenance side!

    I build lots of tools to address *my* needs. But, rarely publish them
    (beyond close colleagues) because I don't want to "pay the tax" of
    supporting others -- even if they just have simple questions and aren't
    looking for some change to the tool. I have my own interests; maintaining
    a tool FOR OTHERS isn't one of them! <frown>

  • From Niklas Holsti@21:1/5 to Don Y on Wed Aug 4 21:50:06 2021
    On 2021-08-04 20:50, Don Y wrote:
    On 8/4/2021 10:00 AM, Niklas Holsti wrote:

    ...

    I am the main author of a WCET-analysis tool, Bound-T, that also does
    stack analysis and is now free. However, there is (as yet) no port for
    ARM Cortex-M. See http://www.bound-t.com/.

    But not free as in open (?)


    The source code is downloadable; see http://bound-t.com/download/src.
    The copyright text is of my own writing, but I'm open to switching to
    some better-known copyright such as GPL or even some non-viral version.

    However, you may want to read about the state of the tool at http://bound-t.com/status.html.

  • From Niklas Holsti@21:1/5 to Don Y on Wed Aug 4 22:14:35 2021
    On 2021-08-03 12:28, Don Y wrote:
    On 8/3/2021 1:43 AM, Niklas Holsti wrote:
    On 2021-08-03 3:56, Don Y wrote:
    On 7/29/2021 9:22 AM, pozz wrote:
    arm gcc and Cortex-Mx MCUs embedded systems.

    Is there a compilation-time (static) tool for stack analysis that
    really works?

    What do you mean by "really works"?  It's a (potentially) unbounded problem as the compiler can't know what the inputs will drive the code to do, in practice.

    There are practical static-analysis methods to compute an upper bound
    on the stack usage that work well for many embedded programs. An upper
    bound is usually all one needs. The exact worst case is less interesting.

    That depends on the actual algorithm being implemented, the style of the developer and, often, the inputs provided.

    For example, I often use recursive algorithms for pattern matching (because they are easier to "get right").  Looking at the algorithm, you can't know how deep the recursion will be -- without seeing the input it will be fed.


    Well, most embedded programmers avoid recursion. And if you implement
    recursion in a memory-constrained system, your design and code should
    set a hard limit on the recursion depth and reject input that would
    require deeper recursion.
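
    For example, a sketch of what such a hard limit typically looks like (a hypothetical parser for nested parentheses; the limit value is arbitrary):

        /* Worst-case stack use is bounded by MAX_DEPTH times the frame size
         * of parse_parens, which a tool (or a human) can then compute. */
        #define MAX_DEPTH 16

        /* Returns the nesting depth of balanced parentheses at *p,
         * or -1 on malformed input or when the depth limit is exceeded. */
        static int parse_parens(const char **p, int depth)
        {
            if (depth >= MAX_DEPTH)
                return -1;               /* reject input instead of recursing deeper */

            int deepest = 0;
            while (**p == '(') {
                (*p)++;
                int inner = parse_parens(p, depth + 1);
                if (inner < 0 || **p != ')')
                    return -1;
                (*p)++;
                if (inner + 1 > deepest)
                    deepest = inner + 1;
            }
            return deepest;
        }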

    Some (admittedly complex) static analyses can discover such limits from
    the machine code, in the same way as they discover loop iteration limits
    for WCET analysis. (In fact, I believe that the AbsInt tools translate
    loops into recursions before the analysis.)


    Some problems arise for programs that branch to dynamically computed
    addresses, for example by calling functions via function pointers.
    Static analyses may not be able to find the set of possible target
    addresses. When that happens you have to guide the analysis by
    specifying where such control transfers can end up. For typical
    embedded code that is not hard to do, but a program that relies
    extensively on virtual function calls may be hard to analyze.

    The same is true of a program that is driven by external events.
    The code doesn't (typically) "know" the constraints placed on those
    events.  (one can argue as to whether or not this is "good practice")
    So, it can't "find the end".


    Again, if some sequences of external events might use an unbounded
    amount of stack, the design and code should set a hard limit. I have not encountered such programs.

    The typical design is to allow some fixed maximum nesting of event
    processing (eg. a finite set of interrupt priority levels) and otherwise enqueue the events for sequential processing, which does not require an unbounded stack.
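
    That pattern, in outline (a hypothetical single-producer/single-consumer sketch; names and sizes are illustrative):

        #include <stdint.h>
        #include <stdbool.h>

        #define QUEUE_LEN 32u                 /* fixed length, so no unbounded growth */

        typedef struct { uint8_t type; uint8_t data; } event_t;

        static volatile event_t queue[QUEUE_LEN];
        static volatile uint32_t head, tail;

        /* Called from an ISR: just enqueue, a few words of stack at most. */
        bool event_post(event_t e)
        {
            uint32_t next = (head + 1u) % QUEUE_LEN;
            if (next == tail)
                return false;                 /* queue full: drop or count the event */
            queue[head] = e;
            head = next;
            return true;
        }

        /* Called from the main loop: events are handled one at a time, so the
         * handlers' stack frames never nest beyond a single event's worth. */
        bool event_get(event_t *e)
        {
            if (head == tail)
                return false;
            *e = queue[tail];
            tail = (tail + 1u) % QUEUE_LEN;
            return true;
        }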


    There are general tools that you can probably instrument to coax a
    number out of the source (if you don't have access to the source, you're typically SoL) but they will be expensive in time or money.

    The exact stack usage of a subprogram can't be derived from source
    code without knowing what machine code the compiler and linker will
    produce. The static stack-analysis tools typically work on the final
    executable code (which also means that they can be independent of the
    programming language).

    If you don't have the sources, you can't *do* anything with the results
    of the analysis -- other than define a minimum stack size (which may
    be a surprise to you *and* your hardware!)


    If the program is single-threaded, the allocated stack size is usually
    set in the linker command file. Sure, you need the source for that file
    if you want to change the stack size. Of course you are screwed if your
    machine does not have enough memory for the size of stack you want.
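
    (In GNU ld scripts for Cortex-M that allocation is often just a symbol plus a section-size check; in many vendor-generated scripts it looks roughly like

        _Min_Stack_Size = 0x400;    /* reserve 1 KiB for the main stack */

    with the startup code taking the initial SP from _estack. Exact symbol names vary by vendor, so this is only an illustration.)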


    If you don't have the sources, you likely haven't a clue as to how
    the code behaves, under the complete set of I/O conditions.


    I've analyzed the stack usage (upper bounds) of many libraries without
    access to source code. Typically embedded libraries have call-graphs
    that can be extracted from the machine code without assumptions about
    the I/O or other run-time values. But that may no longer be the case for libraries written in C++ or other languages with run-time binding of calls.

    I agree that for complete programs one sometimes needs more insight, in particular if there are calls through function pointers or local
    variables with dynamic (data-dependent) size.


    But (as I said in an earlier post) the tools I am aware of, for the
    OP's target, are not free.

    You can possibly instrument some DSE-like tools.


    Sorry, what is a "DSE-like" tool? DSE = ?


    How efficient are those you've used?  Do they operate in "near compile
    time"?


    For stack-usage analysis the tools are quite fast, typically, because
    embedded programs tend not to allocate dynamically-sized local
    variables. The main time is spent in reading the executable and decoding
    the instructions.

    For programs that allocate local variables of dynamic size the analysis
    becomes much more complex, can need lots of time and memory, and can
    often fail.
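
    The classic construct behind that, for reference (with -fstack-usage, gcc then reports the frame as "dynamic" instead of a fixed number):

        #include <string.h>

        void copy_message(const char *src, unsigned len)
        {
            char tmp[len];             /* variable-length array: the frame size */
            memcpy(tmp, src, len);     /* now depends on run-time input         */
            /* ... use tmp ... */
        }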


    I.e., are they practical to use iteratively? On an entire
    application?

    Yes, for many applications. And if you are implementing a
    safety-critical system, you design your application to be analysable.

  • From Don Y@21:1/5 to Niklas Holsti on Wed Aug 4 15:43:10 2021
    On 8/4/2021 12:14 PM, Niklas Holsti wrote:
    On 2021-08-03 12:28, Don Y wrote:
    On 8/3/2021 1:43 AM, Niklas Holsti wrote:
    On 2021-08-03 3:56, Don Y wrote:
    On 7/29/2021 9:22 AM, pozz wrote:
    arm gcc and Cortex-Mx MCUs embedded systems.

    Is there a compilation-time (static) tool for stack analysis that really works?

    What do you mean by "really works"? It's a (potentially) unbounded problem as the compiler can't know what the inputs will drive the code to do, in practice.

    There are practical static-analysis methods to compute an upper bound on the stack usage that work well for many embedded programs. An upper bound is usually all one needs. The exact worst case is less interesting.

    That depends on the actual algorithm being implemented, the style of the
    developer and, often, the inputs provided.

    For example, I often use recursive algorithms for pattern matching (because they are easier to "get right"). Looking at the algorithm, you can't know how deep the recursion will be -- without seeing the input it will be fed.

    Well, most embedded programmers avoid recursion. And if you implement recursion
    in a memory-constrained system, your design and code should set a hard limit on
    the recursion depth and reject input that would require deeper recursion.

    Yes, but you don't need that limit to be enforced by a "depth counter". Instead, it can be encoded in the data that *drives* the recursion.
    A *human* understands that there is a limit (and exactly what that limit
    is, for any dataset). But, an algorithm may be taxed with trying to
    determine this.

    I shouldn't have to adopt a coding style just for the sake of a tool.

    Some (admittedly complex) static analyses can discover such limits from the machine code, in the same way as they discover loop iteration limits for WCET analysis. (In fact, I believe that the AbsInt tools translate loops into recursions before the analysis.)

    Some problems arise for programs that branch to dynamically computed
    addresses, for example by calling functions via function pointers. Static analyses may not be able to find the set of possible target addresses. When that happens you have to guide the analysis by specifying where such control transfers can end up. For typical embedded code that is not hard to do, but a program that relies extensively on virtual function calls may be hard to analyze.

    The same is true of a program that is driven by external events.
    The code doesn't (typically) "know" the constraints placed on those
    events. (one can argue as to whether or not this is "good practice")
    So, it can't "find the end".

    Again, if some sequences of external events might use an unbounded amount of stack, the design and code should set a hard limit. I have not encountered such
    programs.

    The typical design is to allow some fixed maximum nesting of event processing (eg. a finite set of interrupt priority levels) and otherwise enqueue the events for sequential processing, which does not require an unbounded stack.

    Events need not be interrupts. They may be user actions/keystrokes.
    But, the "grammar" (odd choice of words) that defines the range of
    permissible actions can limit how deeply the "generic" algorithm
    recurses. Again, something that a human can see and verify but that
    could evade anything short of a sophisticated analysis of code+data.

    There are general tools that you can probably instrument to coax a
    number out of the source (if you don't have access to the source, you're typically SoL) but they will be expensive in time or money.

    The exact stack usage of a subprogram can't be derived from source code
    without knowing what machine code the compiler and linker will produce. The static stack-analysis tools typically work on the final executable code
    (which also means that they can be independent of the programming language).

    If you don't have the sources, you can't *do* anything with the results
    of the analysis -- other than define a minimum stack size (which may
    be a surprise to you *and* your hardware!)

    If the program is single-threaded, the allocated stack size is usually set in the linker command file. Sure, you need the source for that file if you want to
    change the stack size. Of course you are screwed if your machine does not have
    enough memory for the size of stack you want.

    But you can't *change* your algorithm, having discovered that it uses
    more stack than you had anticipated (or, that other consumers have
    left available for its use). Without the sources, your only option
    is to change the hardware to make more stack available.

    [And, even that may not be possible; e.g., multithreaded in a single
    address space]

    If you don't have the sources, you likely haven't a clue as to how
    the code behaves, under the complete set of I/O conditions.

    I've analyzed the stack usage (upper bounds) of many libraries without access to source code. Typically embedded libraries have call-graphs that can be extracted from the machine code without assumptions about the I/O or other run-time values. But that may no longer be the case for libraries written in C++ or other languages with run-time binding of calls.

    I agree that for complete programs one sometimes needs more insight, in particular if there are calls through function pointers or local variables with
    dynamic (data-dependent) size.

    But (as I said in an earlier post) the tools I am aware of, for the OP's target, are not free.

    You can possibly instrument some DSE-like tools.

    Sorry, what is a "DSE-like" tool? DSE = ?

    Dynamic Symbolic Execution.

    How efficient are those you've used? Do they operate in "near compile time"?

    For stack-usage analysis the tools are quite fast, typically, because embedded
    programs tend not to allocate dynamically-sized local variables. The main time
    is spent in reading the executable and decoding the instructions.

    For programs that allocate local variables of dynamic size the analysis becomes
    much more complex, can need lots of time and memory, and can often fail.

    Ah. The symbolic tools (alluded to above) are *really* resource intensive.
    You can spend HOURS analyzing a piece of code -- with a fast workstation.
    (my point being this is a tool that you'd use to VERIFY your assumptions; relying on it to tell you what you don't yet know -- esp if your code is evolving -- is way too costly!)

    I.e., are they practical to use iteratively? On an entire
    application?

    Yes, for many applications. And if you are implementing a safety-critical system, you design your application to be analysable.

    But, wouldn't you inherently *know* where your design is headed?
    And, just need the tool to confirm that? (and put real numbers on it)

  • From Don Y@21:1/5 to Niklas Holsti on Wed Aug 4 15:46:20 2021
    On 8/4/2021 11:50 AM, Niklas Holsti wrote:
    On 2021-08-04 20:50, Don Y wrote:
    On 8/4/2021 10:00 AM, Niklas Holsti wrote:

    ...

    I am the main author of a WCET-analysis tool, Bound-T, that also does stack analysis and is now free. However, there is (as yet) no port for ARM Cortex-M. See http://www.bound-t.com/.

    But not free as in open (?)

    The source code is downloadable; see http://bound-t.com/download/src. The copyright text is of my own writing, but I'm open to switching to some better-known copyright such as GPL or even some non-viral version.

    Ah, OK. I may have a look at it to see how well it works and
    how much work to port to ARM objects. Given that you've not
    done so, already (despite your familiarity with the codebase),
    I suspect that's not a trivial task?

    However, you may want to read about the state of the tool at http://bound-t.com/status.html.

    Hmmm... that sounds ominous! :> (or, am I just too much of a Cynic?)

  • From Niklas Holsti@21:1/5 to Don Y on Thu Aug 5 12:00:06 2021
    On 2021-08-04 21:06, Don Y wrote:
    On 8/4/2021 7:43 AM, pozz wrote:
    As I said, the first step is understanding what the dynamics of
    execution in your task happen to be. You may be surprised to see
    functions being dragged in that you'd not have thought were part
    of the mix! ("why is printf() being called, here?? and,
    couldn't I use something like itoa() instead?")


    Oh yes, but a simple tool that generates automatically a call graph
    could be very useful for this.

    How "simple" does it have to be?  There are already tools that will
    do this for you.  Even GUI out.  How "pretty" is a matter of debate
    given that it's hard to make an algorithm that can find the "prettiest"
    way of drawing such a graph.


    The "dot" tool in the GraphViz package does a good job of graph lay-out.
    See https://en.wikipedia.org/wiki/Graphviz. Bound-T generates its graphs
    in "dot" format. The AbsInt tools have their own graph-layout package;
    it may even be sold separately.

    However, as I said earlier, a source-level call-graph is often not
    complete with respect to the calls that happen at run time.


    This is the dynamic approach, I was exploring the static approach.

    My point is that you still have to perform tests that exercise every path through your code.


    That (full path coverage) is almost never done, because it is
    exponential in the number of branches.

    In case the sizes of local variables depend on input data, it may be
    very hard to find the input values that lead to the actual worst-case stack-usage path. Safe upper bounds are easier to find by analysis.


    What's available is what SOMEONE decided they wanted to develop *and*
    offer up to others.  You typically have fewer choices for "free" than
    "for pay".  Developing any tool (and maintaining it -- or, providing enough information that OTHERS can maintain it without you) isn't a trivial job.


    My experience is that rather few developers favour the static-analysis
    approach to resource analysis -- most just instrument and test. There
    are a few application domains that require better guarantees --
    aerospace, automotive, nuclear -- and they are also prepared to pay (a
    bit) for the tools. However, tools for such domains usually need some
    kind of formal qualification or certification, which can be expensive.

    Also, while static analysis is still possible for stack usage, it is
    becoming impossible for WCET, because of the complex, dynamic
    architecture of current high-end embedded processors (out-of-order
    execution, complex caches, multi-core with shared-resource conflicts,
    and so on and on). The field seems to be moving towards hybrid analysis
    methods that combine measured execution times with static flow analysis,
    for example the tools from Rapita (https://www.rapitasystems.com/).

  • From Niklas Holsti@21:1/5 to Don Y on Thu Aug 5 13:18:36 2021
    On 2021-08-04 20:50, Don Y wrote:
    On 8/4/2021 10:00 AM, Niklas Holsti wrote:
    On 2021-08-03 12:49, pozz wrote:
    On 03/08/2021 02:56, Don Y wrote:
    On 7/29/2021 9:22 AM, pozz wrote:
    arm gcc and Cortex-Mx MCUs embedded systems.

    Is there a compilation-time (static) tool for stack analysis that
     really works?

    What do you mean by "really works"?  It's a (potentially) unbounded
    problem as the compiler can't know what the inputs will drive the
    code to do, in practice.

    I mean a simple tool that can be instructed, even manually, to
    produce a good result.

    -fstack-usage is not usable "by hands".

    You need at least a call-graph (generated by a tool), fill each
    branch (function) with a stack usage (generated by the compiler),
    fill every branch that is not known at compilation time by the tools
    (call functions by pointers, recursive, and so on).

    It's surprisingly to me there isn't a single non-expensive tool that
    helps in this, considering there are a multitude of good free tools
    for developers.


    One reason is that for good results, such a tool has to analyze the
    executable code, and therefore must be target-specific, or at least
    have ports to the various targets, increasing the effort to implement
    and maintain the tool. The gnatstack tool gets around that, to some
    extent, by relying on stack-usage and call information from the
    compiler (gcc).

    I am seeing an increasing number of tools relying on intermediate
    encodings (e.g., via LLVM) to give more portability to their
    application.


    LLVM IR and other similar program representations are a good "level" for
    some semantic analyses, such as value analysis (finding possible ranges
    of variable values) and control-flow analysis. But it is typically too
    high a level for analysing machine resources such as stack usage and WCET.

    If one can set up a reliable mapping between the IR entities (variables
    and control flow) and the machine-level entities (registers, memory
    locations, branch instructions) the two levels of analysis can support
    each other. Unfortunately that mapping is defined by the compiler and
    linker and is usually described only incompletely in the debugging
    information emitted from the compiler and linker.

  • From Niklas Holsti@21:1/5 to Don Y on Thu Aug 5 13:43:54 2021
    On 2021-08-05 1:46, Don Y wrote:
    On 8/4/2021 11:50 AM, Niklas Holsti wrote:
    On 2021-08-04 20:50, Don Y wrote:
    On 8/4/2021 10:00 AM, Niklas Holsti wrote:

        ...

    I am the main author of a WCET-analysis tool, Bound-T, that also
    does stack analysis and is now free. However, there is (as yet) no
    port for ARM Cortex-M. See http://www.bound-t.com/.

    But not free as in open (?)

    The source code is downloadable; see http://bound-t.com/download/src.
    The copyright text is of my own writing, but I'm open to switching to
    some better-known copyright such as GPL or even some non-viral version.

    Ah, OK.  I may have a look at it to see how well it works and
    how much work to port to ARM objects.  Given that you've not
    done so, already (despite your familiarity with the codebase),
    I suspect that's not a trivial task?


    There is an ARM TDMI port (32-bit ARM code and THUMB, pre Cortex), but
    of course that architecture is out of date. I started a port to ARM
    Cortex-M, but did not (yet) finish it for the reasons described on the
    status page.

    Note that the tool is implemented in Ada.

    Porting to a new target processor means writing procedures to decode the machine instructions and translate them to the internal representation
    used by the analysis. Sometimes it is also necessary to modify or extend
    the tool parts that read the executable files and especially the
    debugging information -- some of that may be compiler-specific,
    unfortunately. It is a fair amount of work and requires a good
    understanding both of the target processor and of the Bound-T internal representation. And Ada, of course, but every programmer should know
    Ada, IMO :-)

    However, you may want to read about the state of the tool at
    http://bound-t.com/status.html.

    Hmmm... that sounds ominous!  :>  (or, am I just too much of a Cynic?)


    The problems described on the status page are more relevant to WCET
    analysis than to stack analysis, but can affect stack analysis too, in
    some cases.

  • From Don Y@21:1/5 to Niklas Holsti on Thu Aug 5 04:59:38 2021
    On 8/5/2021 3:18 AM, Niklas Holsti wrote:
    On 2021-08-04 20:50, Don Y wrote:
    On 8/4/2021 10:00 AM, Niklas Holsti wrote:
    On 2021-08-03 12:49, pozz wrote:
    On 03/08/2021 02:56, Don Y wrote:
    On 7/29/2021 9:22 AM, pozz wrote:
    arm gcc and Cortex-Mx MCUs embedded systems.

    Is there a compilation-time (static) tool for stack analysis that
    really works?

    What do you mean by "really works"? It's a (potentially) unbounded
    problem as the compiler can't know what the inputs will drive the
    code to do, in practice.

    I mean a simple tool that can be instructed, even manually, to
    produce a good result.

    -fstack-usage is not usable "by hands".

    You need at least a call-graph (generated by a tool), fill each
    branch (function) with a stack usage (generated by the compiler),
    fill every branch that is not known at compilation time by the tools
    (call functions by pointers, recursive, and so on).

    It's surprisingly to me there isn't a single non-expensive tool that
    helps in this, considering there are a multitude of good free tools
    for developers.


    One reason is that for good results, such a tool has to analyze the
    executable code, and therefore must be target-specific, or at least have ports to the various targets, increasing the effort to implement and maintain the tool. The gnatstack tool gets around that, to some extent, by relying on stack-usage and call information from the compiler (gcc).

    I am seeing an increasing number of tools relying on intermediate
    encodings (e.g., via LLVM) to give more portability to their
    application.

    LLVM IR and other similar program representations are a good "level" for some semantic analyses, such as value analysis (finding possible ranges of variable
    values) and control-flow analysis.

    Yes. I use such tools for automatically determining test coverage conditions. "Look at my code and figure out the inputs necessary to 'go everywhere'".

    But it is typically too high a level for
    analysing machine resources such as stack usage and WCET.

    Timing, agreed. But, I suspect there might be a way to add hooks to
    the analysis that "expose" the current SP to each function -- and
    then extract that. (I've not thought about it beyond conceptually;
    the tool WILL visit each variant of function invocation so if it
    can be coerced to take note of SP then it would simply be a matter
    of looking for max() -- and, it would be able to tell you *how*
    it got to that point!)
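
    A cruder, purely dynamic cousin of that idea is already available in gcc: -finstrument-functions calls a hook on every function entry, and the hook can record the stack pointer. The switch and the hook names below are gcc's own; the rest is an illustrative sketch:

        #include <stdint.h>

        static uintptr_t lowest_sp = ~(uintptr_t)0;   /* stacks grow downwards here */

        __attribute__((no_instrument_function))
        void __cyg_profile_func_enter(void *this_fn, void *call_site)
        {
            uintptr_t sp = (uintptr_t)__builtin_frame_address(0);
            if (sp < lowest_sp)
                lowest_sp = sp;           /* deepest stack point observed so far */
            (void)this_fn;
            (void)call_site;
        }

        __attribute__((no_instrument_function))
        void __cyg_profile_func_exit(void *this_fn, void *call_site)
        {
            (void)this_fn;
            (void)call_site;
        }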

    If one can set up a reliable mapping between the IR entities (variables and control flow) and the machine-level entities (registers, memory locations, branch instructions) the two levels of analysis can support each other. Unfortunately that mapping is defined by the compiler and linker and is usually
    described only incompletely in the debugging information emitted from the compiler and linker.

  • From Don Y@21:1/5 to Niklas Holsti on Thu Aug 5 04:54:43 2021
    On 8/5/2021 2:00 AM, Niklas Holsti wrote:
    On 2021-08-04 21:06, Don Y wrote:
    This is the dynamic approach, I was exploring the static approach.

    My point is that you still have to perform tests that exercise every path
    through your code.

    That (full path coverage) is almost never done, because it is exponential in the number of branches.

    It's impractical for a "whole program" but is relatively easy to accomplish
    for tasks designed to be small and single-focus. A solution that is implemented in this sort of "decomposed" manner is easier to analyze
    whereas something "monolithic" is hard to (later) "chop up" to yield testable SMALLER/simpler pieces.

    In case the sizes of local variables depend on input data, it may be very hard
    to find the input values that lead to the actual worst-case stack-usage path. Safe upper bounds are easier to find by analysis.

    What's available is what SOMEONE decided they wanted to develop *and*
    offer up to others. You typically have fewer choices for "free" than
    "for pay". Developing any tool (and maintaining it -- or, providing enough >> information that OTHERS can maintain it without you) isn't a trivial job.

    My experience is that rather few developers favour the static-analysis approach
    to resource analysis -- most just instrument and test. There are a few application domains that require better guarantees -- aerospace, automotive, nuclear -- and they are also prepared to pay (a bit) for the tools. However, tools for such domains usually need some kind of formal qualification or certification, which can be expensive.

    Ditto medical/pharma/gaming. In addition to tool qualification, there
    are also *process* qualification issues. This tends to reduce the number
    of variants/releases of a "product" as each release requires running
    through the validation effort, again.

    [And, a release begs the question: "Do you mean there were things that
    DID NOT WORK in the prior release? If so, then how comprehensive was
    the previous validation effort??? Assure me that your process isn't
    inherently flawed..."]

    The fact that this effort exists (is required) means you will spend
    money to lessen its cost AND increase the apparent integrity of
    your process to customers/agencies. Esp as you will be doing this
    repeatedly (for this product or others). Imagine having to *manually*
    repeat the entire effort from the previous release PLUS the changes
    brought about by the new release... for EVERY successive release!

    [No, you can't just *claim* that all of the stuff you validated before
    STILL WORKS!]

    Also, while static analysis is still possible for stack usage, it is becoming impossible for WCET, because of the complex, dynamic architecture of current high-end embedded processors (out-of-order execution, complex caches, multi-core with shared-resource conflicts, and so on and on). The field seems to be moving towards hybrid analysis methods that combine measured execution times with static flow analys, for example the tools from Rapita (https://www.rapitasystems.com/).

    I've taken a different approach on my current (real-time) project: assume
    any task can fail to meet its deadlines and provide mechanisms to handle
    those overruns. (a "hard" real-time task has a very simple deadline
    handler: kill the task! :> )

    But, that's because the resource load (and resource complement) is not known
    a priori so you can't make *any* guarantees.
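
    As a rough illustration of that "deadline handler" idea (nothing from the actual project; the names and the one-thread-per-task model are invented), a monitor can simply wait on a completion flag with a timeout and invoke a handler on overrun:

        import threading

        def run_with_deadline(body, deadline_s, on_overrun):
            """Run body() in a worker; call on_overrun if it misses its deadline."""
            done = threading.Event()

            def worker():
                body()
                done.set()

            t = threading.Thread(target=worker, daemon=True)
            t.start()
            if not done.wait(timeout=deadline_s):
                on_overrun(t)          # a "hard" task's handler just abandons it
            return done.is_set()

        # Example: a handler that simply gives up on the overrunning task.
        # (Python cannot forcibly kill a thread; a real RTOS could.)
        run_with_deadline(lambda: sum(range(10**7)), 0.01,
                          lambda t: print("deadline missed, task abandoned"))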

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Niklas Holsti on Thu Aug 5 06:16:36 2021
    On 8/5/2021 3:43 AM, Niklas Holsti wrote:
    On 2021-08-05 1:46, Don Y wrote:

    The source code is downloadable; see http://bound-t.com/download/src. The copyright text is of my own writing, but I'm open to switching to some
    better-known copyright such as GPL or even some non-viral version.

    Ah, OK. I may have a look at it to see how well it works and
    how much work to port to ARM objects. Given that you've not
    done so, already (despite your familiarity with the codebase),
    I suspect that's not a trivial task?

    There is an ARM TDMI port (32-bit ARM code and THUMB, pre Cortex), but of course that architecture is out of date. I started a port to ARM Cortex-M, but
    did not (yet) finish it for the reasons described on the status page.

    Note that the tool is implemented in Ada.

    OK. It's been ~20 years since I wrote a line of Ada code but I doubt it
    will take much to "refresh" those memory cells. Esp as this likely doesn't have synchronization issues, RT constraints, etc.

    Porting to a new target processor means writing procedures to decode the machine instructions and translate them to the internal representation used by
    the analysis.

    That shouldn't be hard -- except for the "timing uncertainties" in the processor. (Did you rely on some blanket generalization -- like all
    caches cold -- as you are looking for WORST case timing? OTOH, a
    loop known to fit in a cache line *should* expect that speedup... :< )

    Sometimes it is also necessary to modify or extend the tool parts
    that read the executable files and especially the debugging information -- some
    of that may be compiler-specific, unfortunately. It is a fair amount of work and requires a good understanding both of the target processor and of the Bound-T internal representation.

    The latter will likely be the bigger effort. It requires trying to infer
    your mindset when you began the design.

    [I now include "rationales" in my documentation for exactly this purpose;
    to let those "following" understand why particular choices were made
    instead of choices that may (in hindsight) appear better]

    And Ada, of course, but every programmer
    should know Ada, IMO :-)

    However, you may want to read about the state of the tool at
    http://bound-t.com/status.html.

    Hmmm... that sounds ominous! :> (or, am I just too much of a Cynic?)

    The problems described on the status page are more relevant to WCET analysis than to stack analysis, but can affect stack analysis too, in some cases.

    OK. I'll try to make some time to look through it. Sadly, a neighbor was admitted to hospital with congestive heart failure, last night. So, I've
    some added responsibilities cutting into my time (meal prep, chores around
    his house, etc.) until he's got a firmer prognosis...

    [I don't *need* to sleep, do I? :< ]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to Don Y on Thu Aug 5 18:39:51 2021
    On 2021-08-05 16:16, Don Y wrote:
    On 8/5/2021 3:43 AM, Niklas Holsti wrote:
    On 2021-08-05 1:46, Don Y wrote:

    The source code is downloadable; see
    http://bound-t.com/download/src. The copyright text is of my own
    writing, but I'm open to switching to some better-known copyright
    such as GPL or even some non-viral version.

    Ah, OK.  I may have a look at it to see how well it works and
    how much work to port to ARM objects.  Given that you've not
    done so, already (despite your familiarity with the codebase),
    I suspect that's not a trivial task?

    There is an ARM TDMI port (32-bit ARM code and THUMB, pre Cortex), but
    of course that architecture is out of date. I started a port to ARM
    Cortex-M, but did not (yet) finish it for the reasons described on the
    status page.

    Note that the tool is implemented in Ada.

    OK.  It's been ~20 years since I wrote a line of Ada code but I doubt it
    will take much to "refresh" those memory cells.  Esp as this likely doesn't have synchronization issues, RT constraints, etc.


    Right. The only use of tasking features is to set an upper limit on the execution time of the analysis, and that feature is easy to remove if
    needed.

    The code mostly relies on the 1995 Ada standard, with some use of Ada
    2005 additions.


    Porting to a new target processor means writing procedures to decode
    the machine instructions and translate them to the internal
    representation used by the analysis.

    That shouldn't be hard -- except for the "timing uncertainties" in the processor.  (Did you rely on some blanket generalization -- like all
    caches cold -- as you are looking for WORST case timing?   OTOH, a
    loop known to fit in a cache line *should* expect that speedup...  :< )

    At present the tool assumes that every instruction and control transfer
    can be assigned its own WCET, in machine cycles, independent of other
    context or data values. The only dynamic aspects of instruction
    execution time that are modelled are the possible pipeline stalls due to read-after-write interlocks. As a special case, in the version for
    SPARC, the concurrent execution of the Integer Unit and the Floating
    Point Unit is modelled to estimate where and for how long the IU has to
    wait for the FPU to complete an operation.

    Good algorithms for computing WCET bounds for most kinds of instruction
    caches are known, but I did not get around to implementing any of those,
    so only cache-less processors are supported now. If you need WCET
    analysis of caches, the AbsInt tool is best.

    Data caches are still a hard problem, where the WCET analyses tend to be
    quite pessimistic. Moreover, the increasing use of eviction algorithms
    other than LRU (for example, pseudo-random eviction) lessens the
    precision of the cache analyses, even for instruction caches.


    Sometimes it is also necessary to modify or extend the tool parts that
    read the executable files and especially the debugging information --
    some of that may be compiler-specific, unfortunately. It is a fair
    amount of work and requires a good understanding both of the target
    processor and of the Bound-T internal representation.

    The latter will likely be the bigger effort.  It requires trying to infer your mindset when you began the design.


    You could start by reading http://bound-t.com/whats-inside.html.


    [I now include "rationales" in my documentation for exactly this purpose;
    to let those "following" understand why particular choices were made
    instead of choices that may (in hindsight) appear better]


    Me too. Ceterum censeo that calling the in-code documentation "comments"
    was a very poor choice of term and very misleading for what such
    documentation should contain and its importance.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Niklas Holsti on Fri Aug 6 01:09:46 2021
    On 8/5/2021 8:39 AM, Niklas Holsti wrote:
    On 2021-08-05 16:16, Don Y wrote:
    On 8/5/2021 3:43 AM, Niklas Holsti wrote:
    On 2021-08-05 1:46, Don Y wrote:

    The source code is downloadable; see http://bound-t.com/download/src. The >>>>> copyright text is of my own writing, but I'm open to switching to some >>>>> better-known copyright such as GPL or even some non-viral version.

    Ah, OK. I may have a look at it to see how well it works and
    how much work to port to ARM objects. Given that you've not
    done so, already (despite your familiarity with the codebase),
    I suspect that's not a trivial task?

    There is an ARM TDMI port (32-bit ARM code and THUMB, pre Cortex), but of course that architecture is out of date. I started a port to ARM Cortex-M, but did not (yet) finish it for the reasons described on the status page.
    Note that the tool is implemented in Ada.

    OK. It's been ~20 years since I wrote a line of Ada code but I doubt it
    will take much to "refresh" those memory cells. Esp as this likely doesn't have synchronization issues, RT constraints, etc.

    Right. The only use of tasking features is to set an upper limit on the execution time of the analysis, and that feature is easy to remove if needed.

    The code mostly relies on the 1995 Ada standard, with some use of Ada 2005 additions.

    OK. I'll have to build a more current version of GNAT.

    Porting to a new target processor means writing procedures to decode the machine instructions and translate them to the internal representation used by the analysis.

    That shouldn't be hard -- except for the "timing uncertainties" in the
    processor. (Did you rely on some blanket generalization -- like all
    caches cold -- as you are looking for WORST case timing? OTOH, a
    loop known to fit in a cache line *should* expect that speedup... :< )

    At present the tool assumes that every instruction and control transfer can be
    assigned its own WCET, in machine cycles, independent of other context or data

    So, you just sum worst case times? You don't, for example, alter
    times based on conditionals?

    I assume this is semi table-driven (?) -- as a preface to inquiring as to
    how you develop (and verify!) a test suite?

    values. The only dynamic aspects of instruction execution time that are modelled are the possible pipeline stalls due to read-after-write interlocks. As a special case, in the version for SPARC, the concurrent execution of the Integer Unit and the Floating Point Unit is modelled to estimate where and for
    how long the IU has to wait for the FPU to complete an operation.

    Good algorithms for computing WCET bounds for most kinds of instruction caches
    are known, but I did not get around to implementing any of those, so only cache-less processors are supported now. If you need WCET analysis of caches, the AbsInt tool is best.

    It's arguable whether (and to what extent) you should model I-cache effects. E.g., in a multithreaded environment, the cache can be purged at some
    frequency that a tool (or the developer, for that matter -- at least not
    wrt the instruction stream) can't know. Aim high and be pleasantly surprised...

    Data caches are still a hard problem, where the WCET analyses tend to be quite
    pessimistic. Moreover, the increasing use of eviction algorithms other than LRU
    (for example, pseudo-random eviction) lessens the precision of the cache analyses, even for instruction caches.

    As would handling page faults or other NUMA issues.

    [How do you handle accesses to memories with different *basic* access times?]

    Sometimes it is also necessary to modify or extend the tool parts that read the executable files and especially the debugging information -- some of that may be compiler-specific, unfortunately. It is a fair amount of work and requires a good understanding both of the target processor and of the Bound-T internal representation.

    The latter will likely be the bigger effort. It requires trying to infer
    your mindset when you began the design.

    You could start by reading http://bound-t.com/whats-inside.html.

    [I now include "rationales" in my documentation for exactly this purpose;
    to let those "following" understand why particular choices were made
    instead of choices that may (in hindsight) appear better]

    Me too. Ceterum censeo that calling the in-code documentation "comments" was a
    very poor choice of term and very misleading for what such documentation should
    contain and its importance.

    A "rationale" lets you talk to the next developer(s) informally. You can
    point out areas for likely optimization, areas fraught with dragons, etc. Otherwise, relying on leafing through hundreds of pages of code in the
    hope of stumbling on some "XXX ..." annotation is wasteful.

    And, *not* being tied to a "text only" presentation means you can explain things in ways that would be difficult to do (unambiguously) with prose.

    [How do I audibly explain the difference between front vowels and back
    vowels to a non-native speaker/listener?]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to Don Y on Fri Aug 6 22:58:06 2021
    On 2021-08-06 11:09, Don Y wrote:
    On 8/5/2021 8:39 AM, Niklas Holsti wrote:
    On 2021-08-05 16:16, Don Y wrote:
    On 8/5/2021 3:43 AM, Niklas Holsti wrote:
    On 2021-08-05 1:46, Don Y wrote:

    The source code is downloadable; see
    http://bound-t.com/download/src. The copyright text is of my own
    writing, but I'm open to switching to some better-known copyright
    such as GPL or even some non-viral version.

    Ah, OK.  I may have a look at it to see how well it works and
    how much work to port to ARM objects.  Given that you've not
    done so, already (despite your familiarity with the codebase),
    I suspect that's not a trivial task?

    There is an ARM TDMI port (32-bit ARM code and THUMB, pre Cortex),
    but of course that architecture is out of date. I started a port to
    ARM Cortex-M, but did not (yet) finish it for the reasons described
    on the status page.

    Note that the tool is implemented in Ada.

    OK.  It's been ~20 years since I wrote a line of Ada code but I doubt it will take much to "refresh" those memory cells.  Esp as this likely doesn't
    have synchronization issues, RT constraints, etc.

    Right. The only use of tasking features is to set an upper limit on
    the execution time of the analysis, and that feature is easy to remove
    if needed.

    The code mostly relies on the 1995 Ada standard, with some use of Ada
    2005 additions.

    OK.  I'll have to build a more current version of GNAT.


    For playing around, I would just use the GNAT Community Edition. Or the
    FSF GNAT that comes with MinGW32.


    Porting to a new target processor means writing procedures to decode
    the machine instructions and translate them to the internal
    representation used by the analysis.

    That shouldn't be hard -- except for the "timing uncertainties" in the
    processor.  (Did you rely on some blanket generalization -- like all
    caches cold -- as you are looking for WORST case timing?   OTOH, a
    loop known to fit in a cache line *should* expect that speedup...  :< )

    At present the tool assumes that every instruction and control
    transfer can be assigned its own WCET, in machine cycles, independent
    of other context or data

    So, you just sum worst case times?  You don't, for example, alter
    times based on conditionals?


    (Note: I say "WCET" below, but really I should always say "upper bound
    on the WCET".)

    The analysis works on the control-flow graph (per subprogram). The WCET
    for each basic block is the sum of the WCETs of the instructions in that
    block. The WCET for an execution path through the graph (from entry to
    return) is the sum of the WCETs of the basic blocks in the path, plus
    the WCETs assigned to each edge (control transfer) between basic blocks
    in the path. The WCET for the whole graph is computed by a pessimisation
    over all syntactically possible paths through the graph, using an
    Integer Linear Programming solver (lp_solve, for Bound-T). This method
    is common in WCET analysis and is called IPET, for Implicit Path
    Enumeration Technique.

    By "syntactically possible" I mean that the tool generally does not
    attempt to find semantically or logically infeasible paths. For example,
    if a subprogram has this form:

    if boo then
        perform computation A;
    else
        perform computation B;
    end if;
    perform computation C;

    then the WCET for the subprogram is estimated as max(WCET(A),WCET(B)) + WCET(C). But if the subprogram has this form:

    if boo then
        perform computation A;
    end if;
    perform computation C;
    if not boo then
        perform computation B;
    end if;

    then the WCET for the subprogram is estimated as WCET(A) + WCET(C) +
    WCET(B). In other words, the analysis (usually) does not discover that
    the conditions "boo" and "not boo" are mutually exclusive.
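
    To make the "syntactically possible paths" bound concrete, here is a purely illustrative Python sketch (not Bound-T code; the graph, names and cycle counts are invented). It maximises block-plus-edge WCETs over all structural paths of a tiny acyclic control-flow graph; the real tool hands the same problem, with loop bounds, to an ILP solver (IPET):

        # blocks: name -> WCET in cycles; edges: (from, to) -> extra cycles
        blocks = {"entry": 2, "A": 10, "B": 7, "C": 5, "exit": 1}
        edges = {("entry", "A"): 1, ("entry", "B"): 1,   # taken / fall-through
                 ("A", "C"): 0, ("B", "C"): 0, ("C", "exit"): 0}

        def successors(node):
            return [dst for (src, dst) in edges if src == node]

        def wcet_bound(node):
            """Worst-case cycles from node to exit, ignoring infeasible paths."""
            if node == "exit":
                return blocks[node]
            return blocks[node] + max(edges[(node, nxt)] + wcet_bound(nxt)
                                      for nxt in successors(node))

        print(wcet_bound("entry"))   # 19 = entry + taken edge + max(A, B) + C + exit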


    I assume this is semi table-driven (?)


    I don't understand that question. Please clarify.


    -- as a preface to inquiring as to
    how you develop (and verify!) a test suite?


    In the same way as for any complex program. Final validation by running
    and measuring test code on a real processor or a cycle-accurate simulation.


    Good algorithms for computing WCET bounds for most kinds of
    instruction caches are known, but I did not get around to implementing
    any of those, so only cache-less processors are supported now. If you
    need WCET analysis of caches, the AbsInt tool is best.

    It's arguable whether (and to what extent) you should model I-cache
    effects.
    E.g., in a multithreaded environment, the cache can be purged at some frequency that a tool (or the developer, for that matter -- at least not
    wrt the instruction stream) can't know.  Aim high and be pleasantly surprised...


    Such Cache-Related Preemption Delays (CRPDs) are a much-studied problem
    in WCET analysis. The most common solution is to first find out how the
    various tasks and interrupts can access memory blocks which potentially
    may collide in the caches, and then to take the worst possible
    interference into account in the schedulability analysis, which does
    know (or assumes) the worst case frequency of preemptions.

    For processors where cache misses are much slower than cache hits (which
    is fast coming to mean almost all processors) IMO an I-cache analysis is necessary for static WCET analysis to be useful.


    [How do you handle accesses to memories with different *basic* access
    times?]


    For code fetch the address is always statically known, so the correct
    access time can be selected and included in the WCET for the instruction
    when the instruction is decoded.

    For data accesses, if different instructions are used for different
    memories (as in the 8051 architecture, for example), the proper access
    time can be included in the WCET for the instruction when the
    instruction is decoded.

    If the same instructions are used to access memories with different
    access times, depending on the address, and the memory addresses are dynamically determined at run time, the practical but pessimistic way is
    to use the largest of the access times, that is, assume that the slowest
    memory is accessed. In principle the tool can try to analyze the address computations and might find sufficient bounds on the address to ensure
    that a faster memory is accessed, but the analysis methods currently
    available in Bound-T seldom manage that, and can be quite slow.

    That said, for some processors it is easy to recognize at decode-time
    most of the instructions that access the stack, and some versions of
    Bound-T let one specify different access times for stack accesses and
    for general (unclassified) accesses. That can be useful if the stack is
    located in fast memory, but other data are in slower memory.
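
    Purely as an illustration of that "pick the access time by address, otherwise assume the slowest memory" rule (nothing from the tool; the regions and cycle counts below are invented):

        # Hypothetical memory map: (start, end, cycles per access).
        MEMORY_MAP = [
            (0x0000_0000, 0x0000_FFFF, 1),   # TCM / fast SRAM
            (0x2000_0000, 0x2003_FFFF, 2),   # on-chip SRAM
            (0x6000_0000, 0x601F_FFFF, 8),   # external DRAM
        ]
        SLOWEST = max(cycles for _, _, cycles in MEMORY_MAP)

        def access_cycles(addr=None):
            """Access time for a statically known address, else the pessimistic bound."""
            if addr is None:                     # address unknown to the analysis
                return SLOWEST
            for start, end, cycles in MEMORY_MAP:
                if start <= addr <= end:
                    return cycles
            return SLOWEST

        print(access_cycles(0x2000_1000), access_cycles())   # 2 8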

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Don Y on Fri Aug 6 16:06:45 2021
    On 8/6/2021 4:04 PM, Don Y wrote:
    That said, for some processors it is easy to recognize at decode-time most of
    the instructions that access the stack, and some versions of Bound-T let one specify different access times for stack accesses and for general
    (unclassified) accesses. That can be useful if the stack is located in fast memory, but other data are in slower memory.

    I'm thinking, specifically, about I/Os -- which are increasingly memory mapped (including register spaces).

    Sorry, I should be more clear. I'm looking at the issues that would affect *my* needs in *my* environment -- realizing these may be different than
    those you've previously targeted. (e.g., VMM, FPU emulation, etc.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Niklas Holsti on Fri Aug 6 16:04:21 2021
    On 8/6/2021 12:58 PM, Niklas Holsti wrote:

    The source code is downloadable; see http://bound-t.com/download/src. The copyright text is of my own writing, but I'm open to switching to some better-known copyright such as GPL or even some non-viral version.
    Ah, OK. I may have a look at it to see how well it works and
    how much work to port to ARM objects. Given that you've not
    done so, already (despite your familiarity with the codebase),
    I suspect that's not a trivial task?

    There is an ARM TDMI port (32-bit ARM code and THUMB, pre Cortex), but of course that architecture is out of date. I started a port to ARM Cortex-M,
    but did not (yet) finish it for the reasons described on the status page.
    Note that the tool is implemented in Ada.

    OK. It's been ~20 years since I wrote a line of Ada code but I doubt it will take much to "refresh" those memory cells. Esp as this likely doesn't
    have synchronization issues, RT constraints, etc.

    Right. The only use of tasking features is to set an upper limit on the
    execution time of the analysis, and that feature is easy to remove if needed.

    The code mostly relies on the 1995 Ada standard, with some use of Ada 2005 additions.

    OK. I'll have to build a more current version of GNAT.

    For playing around, I would just use the GNAT Community Edition. Or the FSF GNAT that comes with MinGW32.

    Ugh! No, I'll just build a fresh copy. I do all my development
    under NetBSD so have everything I want/need ('cept the Ada tools)
    there, already.

    Porting to a new target processor means writing procedures to decode the machine instructions and translate them to the internal representation used by the analysis.

    That shouldn't be hard -- except for the "timing uncertainties" in the processor. (Did you rely on some blanket generalization -- like all
    caches cold -- as you are looking for WORST case timing? OTOH, a
    loop known to fit in a cache line *should* expect that speedup... :< )
    At present the tool assumes that every instruction and control transfer can be assigned its own WCET, in machine cycles, independent of other context or
    data

    So, you just sum worst case times? You don't, for example, alter
    times based on conditionals?

    (Note: I say "WCET" below, but really I should always say "upper bound on the WCET".)

    Understood.

    The analysis works on the control-flow graph (per subprogram). The WCET for each basic block is the sum of the WCETs of the instructions in that block. The
    WCET for an execution path through the graph (from entry to return) is the sum
    of the WCETs of the basic blocks in the path, plus the WCETs assigned to each edge (control transfer) between basic blocks in the path.

    So, the costs of conditional control transfers are handled separately (?)

    The WCET for the
    whole graph is computed by a pessimisation over all syntactically possible paths through the graph, using an Integer Linear Programming solver (lp_solve,
    for Bound-T). This method is common in WCET analysis and is called IPET, for Implicit Path Enumeration Technique.

    By "syntactically possible" I mean that the tool generally does not attempt to
    find semantically or logically infeasible paths. For example, if a subprogram has this form:

    if boo then
        perform computation A;
    else
        perform computation B;
    end if;
    perform computation C;

    then the WCET for the subprogram is estimated as max(WCET(A),WCET(B)) + WCET(C). But if the subprogram has this form:

    if boo then
        perform computation A;
    end if;
    perform computation C;
    if not boo then
        perform computation B;
    end if;

    then the WCET for the subprogram is estimated as WCET(A) + WCET(C) + WCET(B). In other words, the analysis (usually) does not discover that the conditions "boo" and "not boo" are mutually exclusive.

    OK. This is different from how a DSE-based tool would approach the problem.
    It would walk through the code and "solve" for each variable that affects control flow or execution cost. So, on one "probe", it would set boo
    to "TRUE" and evaluate ALL of the consequences of that choice.

    Then, would repeat the exercise with boo as FALSE.

    (hence, it is more costly -- though could be made less so if it
    wasn't actively probing for new cases along the way)

    I assume this is semi table-driven (?)

    I don't understand that question. Please clarify.

    A large number of opcodes, each with particular costs.
    Do you build a jungle of conditionals to subdivide the
    "opcode space" into groups of similar-cost operations?
    Or, do you just have a table of:

    {opcode, mask, type[1], cost}

    [1] used to apply some further heuristics to your
    refinement of "cost"

    -- as a preface to inquiring as to
    how you develop (and verify!) a test suite?

    In the same way as for any complex program. Final validation by running and measuring test code on a real processor or a cycle-accurate simulation.

    Ah, OK. So, you could "can" that particular exemplar and use it
    to test a modification to the code (without having to run the app
    on "real hardware", again)

    I don't rely on "live" tests for my code. Rather, I use tools to generate
    good test coverage and then verify the results are what I expect. I find this easier and more easily extensible (I can test ARM code by running an x86
    port of that code)

    Good algorithms for computing WCET bounds for most kinds of instruction
    caches are known, but I did not get around to implementing any of those, so only cache-less processors are supported now. If you need WCET analysis of caches, the AbsInt tool is best.

    It's arguable whether (and to what extent) you should model I-cache effects. E.g., in a multithreaded environment, the cache can be purged at some
    frequency that a tool (or the developer, for that matter -- at least not
    wrt the instruction stream) can't know. Aim high and be pleasantly
    surprised...

    Such Cache-Related Preemption Delays (CRPDs) are a much-studied problem in WCET
    analysis. The most common solution is to first find out how the various tasks and interrupts can access memory blocks which potentially may collide in the caches, and then to take the worst possible interference into account in the schedulability analysis, which does know (or assumes) the worst case frequency
    of preemptions.

    For processors where cache misses are much slower than cache hits (which is fast coming to mean almost all processors) IMO an I-cache analysis is necessary
    for static WCET analysis to be useful.

    I look at it the other way around. Assume there is NO cache. You know that your code WILL run in less time than this case. Regardless of the frequency
    or presence of competing events.

    Processors are fast enough, now, that you can usually afford to "step up"
    in capability for comparably little cost.

    And, if you've minimized the number of things that *must* meet specific timing/performance goals -- and left everything else as a lower priority/importance secondary goal -- then cache just increases the
    performance of those "less important" goals, pleasantly.

    [How do you handle accesses to memories with different *basic* access times?]

    For code fetch the address is always statically known, so the correct access time can be selected and included in the WCET for the instruction when the instruction is decoded.

    So, you know the access characteristics of each block of memory in which code may reside.

    For data accesses, if different instructions are used for different memories (as in the 8051 architecture, for example), the proper access time can be included in the WCET for the instruction when the instruction is decoded.

    If the same instructions are used to access memories with different access times, depending on the address, and the memory addresses are dynamically determined at run time, the practical but pessimistic way is to use the largest
    of the access times, that is, assume that the slowest memory is accessed. In principle the tool can try to analyze the address computations and might find sufficient bounds on the address to ensure that a faster memory is accessed, but the analysis methods currently available in Bound-T seldom manage that, and
    can be quite slow.

    It would also require the user to be more specific about his execution environment.

    That said, for some processors it is easy to recognize at decode-time most of the instructions that access the stack, and some versions of Bound-T let one specify different access times for stack accesses and for general (unclassified) accesses. That can be useful if the stack is located in fast memory, but other data are in slower memory.

    I'm thinking, specifically, about I/Os -- which are increasingly memory
    mapped (including register spaces).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to niklas.holsti@tidorum.invalid on Fri Aug 6 22:55:34 2021
    On Thu, 5 Aug 2021 18:39:51 +0300, Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    :
    At present the tool assumes that every instruction and control transfer
    can be assigned its own WCET, in machine cycles, independent of other
    context or data values. The only dynamic aspects of instruction
    execution time that are modelled are the possible pipeline stalls due to read-after-write interlocks. As a special case, in the version for
    SPARC, the concurrent execution of the Integer Unit and the Floating
    Point Unit is modelled to estimate where and for how long the IU has to
    wait for the FPU to complete an operation.

    Good algorithms for computing WCET bounds for most kinds of instruction caches are known, but I did not get around to implementing any of those,
    so only cache-less processors are supported now. If you need WCET
    analysis of caches, the AbsInt tool is best.

    Data caches are still a hard problem, where the WCET analyses tend to be quite pessimistic. Moreover, the increasing use of eviction algorithms
    other than LRU (for example, pseudo-random eviction) lessens the
    precision of the cache analyses, even for instruction caches.
    :

    How do you handle branch/jump mispredicts?

    More importantly, how do you handle chains of conditional branches
    and/or switch/case constructs which can mispredict at every decision
    point? The branch targets may not be in cache - neither mispredicted
    targets nor the actual one. Worst case for a horribly mispredicted
    switch/case can be absolutely dreadful.

    Similar problem for table driven code: unless the jump is always made
    through the /same/ table element, then to a 1st approximation the jump
    will mispredict /every/ time.

    Late binding OO code is a problem too, though not to the same degree: a
    lot of code really is monomorphic, with a given call usually or always
    targeting the same class and method. But real polymorphic code (like
    table driven) suffers greatly when dealing with heterogeneous objects.


    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to George Neuner on Sat Aug 7 13:40:19 2021
    On 2021-08-07 5:55, George Neuner wrote:
    On Thu, 5 Aug 2021 18:39:51 +0300, Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    :
    At present the tool assumes that every instruction and control transfer
    can be assigned its own WCET, in machine cycles, independent of other
    context or data values. The only dynamic aspects of instruction
    execution time that are modelled are the possible pipeline stalls due to
    read-after-write interlocks.
    ...

    How do you handle branch/jump mispredicts?


    See above re the assumptions of the tool.

    The WCET analysis function of the Bound-T tool is (at present) suited
    only for simple, deterministic processors. The tool development started
    in the late 1990s when the company I then worked for developed
    applications mainly for such processors.

    If the processor uses a static branch prediction scheme (for example,
    predict forward branches as not taken and backward branches as taken),
    the mispredict penalty can be included in the WCET assigned to the non-predicted edge in the control-flow graph at instruction decode-time.

    For dynamic (history-dependent) branch predictors, various models for
    static analysis have been proposed and published, but they have the same problems as cache-analysis models: the HW microarchitectures are
    becoming more and more complex and varied, making the modelling effort
    more expensive, both in implementation effort and in analysis time, and reducing the customer base for each specific model.

    The WCET-analysis tools from AbsInt include detailed signal-by-signal simulations of the microarchitectures. Unfortunately such details are
    seldom documented well, and I believe AbsInt has had to enter special agreements with processor suppliers for access to such information. And
    the details can change with each processor revision...

    Speculative executions such as branch prediction can create so-called
    "timing anomalies" which make the WCET analysis much harder. A timing
    anomaly is any property of a processor that makes it invalid to estimate
    the WCET of a sequence of instructions by assuming that the first
    instruction encounters its worst case (for example, a cache miss instead
    of a cache hit) and then continuing the analysis of the rest of the
    sequence while propagating that assumption forward as an assumption on
    the state of the processor.

    The canonical example of timing anomalies are processors where a cache
    hit at one instruction changes the processor state in such a way that a
    later instruction takes much longer than it would have taken, had the
    first instruction encountered a cache miss instead of a hit. Propagating
    the "first instruction missed" assumption to the context of the later instruction may make the analysis under-estimate the execution time of
    that later instruction. To handle timing anomalies, the analysis must
    consider all possible combinations: first instruction hits or misses,
    second instruction hits or misses, etc. Exponential complexity.

    It amuses me that such processor properties, which are/were the bane of
    static WCET analysis, are now in the limelight as the origin of the side channels for various malware, such as Spectre exploits.


    More importantly, how do you handle chains of conditional branches
    and/or switch/case constructs which can mispredict at every decision
    point? The branch targets may not be in cache - neither mispredicted
    targets nor the actual one. Worst case for a horribly mispredicted switch/case can be absolutely dreadful.


    I don't know if those questions are addressed to me (re the Bound-T
    tool) or to WCET analysis in general. The advanced WCET tools, like the
    ones from AbsInt, do try to cover such cases by their cache analysis to
    provide a safe upper bound on the WCET, but the degree of pessimism (over-estimation) increases with the complexity of the code.

    That said, in many programs most of the processing time is taken by
    relatively simple loops, for which the present I-cache analyses work
    quite well, and for which even D-cache analysis can (I believe) work satisfactorily if the memory addressing patterns are regular.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to Don Y on Sat Aug 7 14:09:23 2021
    On 2021-08-07 2:04, Don Y wrote:
    On 8/6/2021 12:58 PM, Niklas Holsti wrote:

    OK.  I'll have to build a more current version of GNAT.

    For playing around, I would just use the GNAT Community Edition. Or
    the FSF GNAT that comes with MinGW32.

    Ugh!  No, I'll just build a fresh copy.  I do all my development
    under NetBSD so have everything I want/need ('cept the Ada tools)
    there, already.


    You are braver than I am. Note that the front-end of GNAT is implemented
    in Ada, so it has to be bootstrapped with an Ada compiler. And the
    current version of GNAT requires quite an up-to-date Ada compiler for
    the bootstrap. In practice, I think people have found that only GNAT can
    build GNAT.

    Many Linux distributions come with GNAT pre-built. Debian-derived
    distributions have good support, I hear.


    The analysis works on the control-flow graph (per subprogram). The
    WCET for each basic block is the sum of the WCETs of the instructions
    in that block. The WCET for an execution path through the graph (from
    entry to return) is the sum of the WCETs of the basic blocks in the
    path, plus the WCETs assigned to each edge (control transfer) between
    basic blocks in the path.

    So, the costs of conditional control transfers are handled separately (?)


    I'm not sure I understand your question. Each edge in the control-flow
    graph, that is, each transition from one instruction or one basic block
    to a possible next instruction or block, is assigned a WCET value when
    the instructions are decoded and entered in the control-flow graph.
    Usually this WCET is zero for a fall-through edge from a non-branch
    instruction to the next, and is non-zero only for edges from branch instructions, whether conditional or unconditional. For a conditional
    branch, the "not taken" edge usually has a zero WCET and the "taken"
    edge has some non-zero WCET, as specified in the processor documentation (remember we are assuming a simple processor).


    I assume this is semi table-driven (?)

    I don't understand that question. Please clarify.

    A large number of opcodes, each with particular costs.
    Do you build a jungle of conditionals to subdivide the
    "opcode space" into groups of similar-cost operations?
    Or, do you just have a table of:

    {opcode, mask, type[1], cost}

    [1] used to apply some further heuristics to your
    refinement of "cost"


    I would say that the approach to translating instructions into their
    analysis models (elements in the control-flow graph) is usually not table-driven, but emulates the field-by-field decoding and case analysis
    that would be done by a disassembler or emulator (and Bound-T can
    produce a disassembly of the code). The internal model includes not only
    the cost of the instruction, but also its effect on the computation. For example, an instruction like "increment the accumulator", INC A, would
    be decoded into a model that says

    WCET: 1 cycle.
    Control flow: to next instruction, unconditional.
    Effect: A := A + 1.

    and also the effect, if any, on condition flags.
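
    To show the shape of such a per-instruction model, here is a small illustrative Python sketch (not Bound-T's Ada data structures; the record layout, the two opcodes and the cycle counts are invented): each decoded instruction carries its cycle cost, its possible control-flow edges with their extra costs, and a symbolic effect on the machine state:

        from dataclasses import dataclass, field

        @dataclass
        class Step:
            wcet: int                     # cycles for the instruction itself
            edges: list                   # (target address, extra cycles) pairs
            effect: dict = field(default_factory=dict)   # e.g. {"A": "A + 1"}

        def decode(addr, opcode):
            """Field-by-field decode of a made-up two-instruction ISA."""
            if opcode == 0x3C:            # INC A
                return Step(1, [(addr + 1, 0)], {"A": "A + 1"})
            if (opcode & 0xF0) == 0x20:   # conditional branch, 4-bit offset
                return Step(1, [(addr + 1, 0),                 # not taken: free
                                (addr + (opcode & 0x0F), 2)])  # taken: +2 cycles
            raise ValueError("unknown opcode")

        print(decode(0x100, 0x3C))        # Step(wcet=1, edges=[(257, 0)], effect={'A': 'A + 1'})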


    -- as a preface to inquiring as to
    how you develop (and verify!) a test suite?

    In the same way as for any complex program. Final validation by
    running and measuring test code on a real processor or a
    cycle-accurate simulation.

    Ah, OK.  So, you could "can" that particular exemplar and use it
    to test a modification to the code (without having to run the app
    on "real hardware", again)


    Yep.


    I don't rely on "live" tests for my code. Rather, I use tools to
    generate good test coverage and then verify the results are what I
    expect. I find this easier and more easily extensible (I can test
    ARM code by running an x86 port of that code)


    At my former job, implementing SW for ESA satellite applications, we
    almost always tested the code first on normal PCs, in a simulated I/O environment, and then on the embedded target. Easier for Ada than for C.


    For processors where cache misses are much slower than cache hits
    (which is fast coming to mean almost all processors) IMO an I-cache
    analysis is necessary for static WCET analysis to be useful.

    I look at it the other way around. Assume there is NO cache. You
    know that your code WILL run in less time than this case. Regardless
    of the frequency or presence of competing events.

    Processors are fast enough, now, that you can usually afford to "step up"
    in capability for comparably little cost.


    Sometimes, but not always. The ESA applications we made were usually
    running (worst case) at about 70% processor load, and faster processors
    for space applications were very expensive, not only in euros but also
    in power, volume and mass, which are scarce resources in space.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to Don Y on Sat Aug 7 14:20:44 2021
    On 2021-08-07 2:06, Don Y wrote:
    On 8/6/2021 4:04 PM, Don Y wrote:
    That said, for some processors it is easy to recognize at decode-time
    most of the instructions that access the stack, and some versions of
    Bound-T let one specify different access times for stack accesses and
    for general (unclassified) accesses. That can be useful if the stack
    is located in fast memory, but other data are in slower memory.

    I'm thinking, specifically, about I/Os -- which are increasingly memory
    mapped (including register spaces).

    Sorry, I should be more clear.  I'm looking at the issues that would affect *my* needs in *my* environment -- realizing these may be different than
    those you've previously targeted.  (e.g., VMM, FPU emulation, etc.)


    Ok. If you describe your needs, perhaps I can comment.

    In my experience, access to memory-mapped registers tends to be not much
    slower than access to memory. Also, quite often the addresses in MMIO
    accesses are static or easy to derive in static analysis, so the WCET
    tool could choose the proper access time, if the tool knows enough about
    the system architecture.

    If by "FPU emulation" you mean SW-implemented FP instructions, of course
    we have encountered those. They are often hand-coded assembler with very complex and tricky control flow, which makes it hard to find loop bounds automatically, so manual annotations must be used instead.

    What is VMM?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Niklas Holsti on Sat Aug 7 08:09:43 2021
    On 8/7/2021 4:09 AM, Niklas Holsti wrote:
    On 2021-08-07 2:04, Don Y wrote:
    On 8/6/2021 12:58 PM, Niklas Holsti wrote:

    OK. I'll have to build a more current version of GNAT.

    For playing around, I would just use the GNAT Community Edition. Or the FSF GNAT that comes with MinGW32.

    Ugh! No, I'll just build a fresh copy. I do all my development
    under NetBSD so have everything I want/need ('cept the Ada tools)
    there, already.

    You are braver than I am. Note that the front-end of GNAT is implemented in Ada, so it has to be bootstrapped with an Ada compiler. And the current version
    of GNAT requires quite an up-to-date Ada compiler for the bootstrap. In practice, I think people have found that only GNAT can build GNAT.

    Many Linux distributions come with GNAT pre-built. Debian-derived distributions
    have good support, I hear.

    I don't run Linux. I already support Solaris/SPARC, Windows and NetBSD hosts. Adding yet another just makes things harder.

    I build all of the "programs" ("packages") on my NetBSD boxes. In the past, the userland was built by folks of dubious abilities. They'd just try to
    "get it to compile" (not even "compile without warnings"). So, there would often be bugs and failures in the ported code.

    [I recall a prebuilt gnuplot that didn't even successfully pass the COMPREHENSIVE test suite -- because the person porting it didn't know
    what certain functions should look like, graphically!]

    Building myself lets me see the warnings/errors thrown. And, also lets
    me inspect the sources as, often, the configuration options are not
    completely documented (but apparent *if* you look at the sources).

    As building package Q may rely on package B and L, which, in turn rely on
    F, G and Z, this is often a bit involved. But, if those dependencies
    can be used by other packages, that's less work, later.

    [I keep track of what I build, the order I build them, any patches I
    make to the sources, etc.]

    The analysis works on the control-flow graph (per subprogram). The WCET for each basic block is the sum of the WCETs of the instructions in that block. The WCET for an execution path through the graph (from entry to return) is the sum of the WCETs of the basic blocks in the path, plus the WCETs
    assigned to each edge (control transfer) between basic blocks in the path.
    So, the costs of conditional control transfers are handled separately (?)

    I'm not sure I understand your question. Each edge in the control-flow graph, that is, each transition from one instruction or one basic block to a possible
    next instruction or block, is assigned a WCET value when the instructions are decoded and entered in the control-flow graph. Usually this WCET is zero for a
    fall-through edge from a non-branch instruction to the next, and is non-zero only for edges from branch instructions, whether conditional or unconditional.
    For a conditional branch, the "not taken" edge usually has a zero WCET and the
    "taken" edge has some non-zero WCET, as specified in the processor documentation (remember we are assuming a simple processor).

    OK, that's what I was addressing; namely, that the two outcomes of a (conditional) branch can have different costs -- yet those appeared to be "outside" the blocks that you were describing.

    I assume this is semi table-driven (?)

    I don't understand that question. Please clarify.

    A large number of opcodes, each with particular costs.
    Do you build a jungle of conditionals to subdivide the
    "opcode space" into groups of similar-cost operations?
    Or, do you just have a table of:

    {opcode, mask, type[1], cost}

    [1] used to apply some further heuristics to your
    refinement of "cost"

    I would say that the approach to translating instructions into their analysis models (elements in the control-flow graph) is usually not table-driven, but emulates the field-by-field decoding and case analysis that would be done by a
    disassembler or emulator (and Bound-T can produce a disassembly of the code).

    Oh! I would have opted for a (possibly big) table. But, that's just personal preference (I like table driven algorithms)
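
    For illustration, such a table-driven decode could look like the following sketch (entirely hypothetical; the opcodes, masks and cycle counts are made up): the first row whose mask/pattern pair matches classifies the instruction.

        DECODE_TABLE = [
            # (pattern, mask, kind,        worst-case cycles)
            (0x3C,      0xFF, "alu",        1),   # e.g. INC A
            (0x20,      0xF0, "branch",     3),   # conditional branch, taken
            (0x80,      0xC0, "load/store", 2),   # memory access group
        ]

        def classify(opcode):
            """Return (kind, cycles) for the first matching table row."""
            for pattern, mask, kind, cycles in DECODE_TABLE:
                if (opcode & mask) == pattern:
                    return kind, cycles
            return "unknown", None

        print(classify(0x27))   # ('branch', 3)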

    The internal model includes not only the cost of the instruction, but also its
    effect on the computation. For example, an instruction like "increment the accumulator", INC A, would be decoded into a model that says

    WCET: 1 cycle.
    Control flow: to next instruction, unconditional.
    Effect: A := A + 1.

    and also the effect, if any, on condition flags.

    Why do you care about the "effect"? Are you trying to determine which branches will be taken? How many iterations through loops? etc.

    Doesn't this require a fairly complete model of the processor?

    I don't rely on "live" tests for my code. Rather, I use tools to
    generate good test coverage and then verify the results are what I
    expect. I find this easier and more easily extensible (I can test
    ARM code by running an x86 port of that code)

    At my former job, implementing SW for ESA satellite applications, we almost always tested the code first on normal PCs, in a simulated I/O environment, and
    then on the embedded target. Easier for Ada than for C.

    Historically, it's been hard for me to have access to real hardware.
    So, my development style pushes all the I/Os out to the fringes of
    the algorithms; so there is little more than a "copy" that occurs
    between the algorithm and the I/O.

    So, I can debug most of my code (95+%) just by simulating input
    values and capturing output values. It's no worse than testing
    a math library.

    Then, to move to real hardware, just ensure that the transfers in and
    out are atomic/intact/correct.

    E.g., for my speech synthesizers, I just had them generate D/AC values
    that I captured using some test scaffolding. Then, massaged the captured
    data into a form that a media player could process and "played" them
    on a PC.

    For my gesture recognizer, I captured/simulated input from the various
    types of input devices in a form that the algorithms expected. Then,
    just passed the data to the algorithms and noted their "conclusions".

    For processors where cache misses are much slower than cache hits (which is >>> fast coming to mean almost all processors) IMO an I-cache analysis is
    necessary for static WCET analysis to be useful.

    I look at it the other way around. Assume there is NO cache. You
    know that your code WILL run in less time than this case. Regardless
    of the frequency or presence of competing events.

    Processors are fast enough, now, that you can usually afford to "step up"
    in capability for comparably little cost.

    Sometimes, but not always. The ESA applications we made were usually running (worst case) at about 70% processor load, and faster processors for space applications were very expensive, not only in euros but also in power, volume and mass, which are scarce resources in space.

    Yes, of course. I'm speaking of the types of applications that I have dealt with, not "universally".

    My "cost constrained" days were back in the days of the "small/cheap CPU"
    where you had to predict performance accurately -- but, doing so was
    relatively easy (because processors had very predictable instruction
    execution times).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to Don Y on Sat Aug 7 19:50:18 2021
    On 2021-08-07 18:09, Don Y wrote:
    On 8/7/2021 4:09 AM, Niklas Holsti wrote:


    (On how the Bound-T tool models instructions for stack and WCET analysis:)


    The internal model includes not only the cost of the instruction, but
    also its effect on the computation. For example, an instruction like
    "increment the accumulator", INC A, would be decoded into a model that
    says

         WCET: 1 cycle.
         Control flow: to next instruction, unconditional.
         Effect: A := A + 1.

    and also the effect, if any, on condition flags.

    Why do you care about the "effect"? Are you trying to determine
    which branches will be taken? How many iterations through loops?
    etc.

    The effect is necessary for any analysis that depends on run-time
    values. Loop bounds are the prime example for WCET analysis, but
    stack-usage analysis also needs it. For example, the effect of a push instruction is to increment the "local stack height" model variable by
    some number that reflects the size of the pushed data (which size can,
    in principle, be non-static and computed at run time).

    The stack-usage analysis finds the maximum possible value of the local
    stack height in each subprogram, both globally and at each call site
    within the subprogram, then adds up the local stack heights along each
    possible call path to find the total usage for that call path, and then
    finds and reports the path with the largest total usage.
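
    As a rough illustration of that summation (a sketch, not the tool's actual algorithm; the subprogram names and byte counts are invented), the worst-case total is a longest-path computation over the call graph, given each subprogram's maximum local stack height and the local height at each call site:

        # frame[f] : max local stack height of subprogram f (bytes)
        # calls[f] : (callee, local stack height at that call site) pairs
        # Assumes an acyclic call graph (no recursion).
        frame = {"main": 32, "isr": 16, "fmt": 48, "putc": 8}
        calls = {"main": [("fmt", 24), ("putc", 8)],
                 "fmt":  [("putc", 40)],
                 "isr":  [("putc", 12)],
                 "putc": []}

        def worst_usage(f):
            """Worst-case stack usage of f plus anything it may call."""
            deepest_call = max((h + worst_usage(callee) for callee, h in calls[f]),
                               default=0)
            return max(frame[f], deepest_call)

        for root in ("main", "isr"):        # one bound per entry point / task
            print(root, worst_usage(root))  # main 72, isr 20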


    Doesn't this require a fairly complete model of the processor?


    Yes indeed. The model could be somewhat simpler for stack-usage analysis
    than for WCET analysis. For stack usage, the relevant effect of an
    instruction is just to increase or decrease the local stack height by a
    static constant. If push/pop happens in loops they are usually balanced
    so that a loop iteration has no net effect on stack usage, making loop iteration bounds unnecessary.


    My "cost constrained" days were back in the days of the "small/cheap
    CPU" where you had to predict performance accurately -- but, doing so
    was relatively easy (because processors had very predictable
    instruction execution times).

    And that is the kind of processor that Bound-T was designed to support
    for WCET analysis. Stack-usage analysis is not so sensitive.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to Niklas Holsti on Sat Aug 7 20:02:52 2021
    Oops, a correction:

    On 2021-08-07 19:50, Niklas Holsti wrote:

    For stack usage, the relevant effect of an instruction is just to
    increase or decrease the local stack height by a static constant.
    I meant to say that the change in local stack height is _often_ a static constant, not that it _always_ is static.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Niklas Holsti on Sat Aug 7 10:26:32 2021
    On 8/7/2021 4:20 AM, Niklas Holsti wrote:
    On 2021-08-07 2:06, Don Y wrote:
    On 8/6/2021 4:04 PM, Don Y wrote:
    That said, for some processors it is easy to recognize at decode-time most of the instructions that access the stack, and some versions of Bound-T let
    one specify different access times for stack accesses and for general
    (unclassified) accesses. That can be useful if the stack is located in fast
    memory, but other data are in slower memory.

    I'm thinking, specifically, about I/Os -- which are increasingly memory
    mapped (including register spaces).

    Sorry, I should be more clear. I'm looking at the issues that would affect >> *my* needs in *my* environment -- realizing these may be different than
    those you've previously targeted. (e.g., VMM, FPU emulation, etc.)


    Ok. If you describe your needs, perhaps I can comment.

    I'm targeting some of the bigger ARM offerings -- A53/55. My impression is they go through more "contortions" to eke out additional performance than some of the smaller processors.

    In my experience, access to memory-mapped registers tends to be not much slower
    than access to memory. Also, quite often the addresses in MMIO accesses are static or easy to derive in static analysis, so the WCET tool could choose the
    proper access time, if the tool knows enough about the system architecture.

    Yes, that last phrase being the kicker. Can this be simplified to something
    as crude as a "memory map"?

    If by "FPU emulation" you mean SW-implemented FP instructions, of course we have encountered those. They are often hand-coded assembler with very complex and tricky control flow, which makes it hard to find loop bounds automatically,
    so manual annotations must be used instead.

    What is VMM?

    Virtual Memory Management. I.e., when an opcode fetch (or argument reference) can not only take longer than a cache miss... but *considerably* longer as
    the physical memory is mapped in while the instruction stream "stalls".

    [Note that a page fault need not map physical memory in the traditional sense. It can also cause some "special" function to be invoked to provide the requisite data/access. So, the cost of a fault can vary depend on
    *what* is faulting and which pager is handling that fault]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Niklas Holsti on Sat Aug 7 10:34:39 2021
    On 8/7/2021 9:50 AM, Niklas Holsti wrote:
    On 2021-08-07 18:09, Don Y wrote:
    On 8/7/2021 4:09 AM, Niklas Holsti wrote:


    (On how the Bound-T tool models instructions for stack and WCET analysis:)

    The internal model includes not only the cost of the instruction, but also >>> its effect on the computation. For example, an instruction like "increment >>> the accumulator", INC A, would be decoded into a model that says

    WCET: 1 cycle.
    Control flow: to next instruction, unconditional.
    Effect: A := A + 1.

    and also the effect, if any, on condition flags.

    Why do you care about the "effect"? Are you trying to determine
    which branches will be taken? How many iterations through loops?
    etc.

    The effect is necessary for any analysis that depends on run-time values. Loop
    bounds are the prime example for WCET analysis, but stack-usage analysis also needs it. For example, the effect of a push instruction is to increment the "local stack height" model variable by some number that reflects the size of the pushed data (which size can, in principle, be non-static and computed at run time).
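
    A rough sketch of the kind of per-instruction record such a model implies
    (purely illustrative -- not Bound-T's actual data structures) could be:

        /* Hypothetical abstract model of one decoded instruction. */
        typedef struct {
            unsigned wcet_cycles;    /* worst-case execution time of the instruction */
            unsigned next[2];        /* control flow: fall-through / branch target addresses */
            int      stack_delta;    /* change in local stack height, when static */
            /* plus a symbolic "effect" on registers and flags, e.g. A := A + 1 */
        } insn_model;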

    OK. But, you're not trying to emulate/simulate the complete algorithm;
    just handle the side effects of value changes.

    But wouldn't you *have* to do a more thorough emulation?

    void foo(int count) {
        for (int i = 0; i < 5 * count; i++) {
            diddle();
        }
    }

    The stack-usage analysis finds the maximum possible value of the local stack height in each subprogram, both globally and at each call site within the subprogram, then adds up the local stack heights along each possible call path
    to find the total usage for that call path, and then finds and reports the path
    with the largest total usage.

    Doesn't this require a fairly complete model of the processor?

    Yes indeed. The model could be somewhat simpler for stack-usage analysis than for WCET analysis. For stack usage, the relevant effect of an instruction is just to increase or decrease the local stack height by a static constant. If pushes and pops happen in loops, they are usually balanced so that a loop iteration has no net effect on stack usage, making loop iteration bounds unnecessary.
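
    As a rough illustration of that path-summing step, combining per-function
    local stack heights (such as the numbers -fstack-usage reports) over a
    call graph -- a sketch with made-up names, assuming no recursion and no
    calls through function pointers:

        #include <stddef.h>

        typedef struct func {
            const char   *name;
            size_t        local_stack;   /* local stack height of this function */
            struct func **callees;       /* NULL-terminated list of callees */
        } func;

        /* Worst-case stack usage of f: its own frame plus the deepest callee. */
        static size_t worst_stack(const func *f)
        {
            size_t worst_callee = 0;
            for (func **c = f->callees; c && *c; c++) {
                size_t s = worst_stack(*c);
                if (s > worst_callee)
                    worst_callee = s;
            }
            return f->local_stack + worst_callee;
        }

        /* e.g. main -> a -> b with local stacks 32, 48 and 16 bytes
           gives worst_stack(&main_node) == 96. */

    A real tool additionally has to account for recursion, indirect calls,
    interrupt frames, and the stack effect of the call/return instructions
    themselves.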

    void reverse(int count, char *string) {
        while (count-- > 0) {
            reverse(count - 1, &string[1]);
            emit(string[0]);
        }
    }

    No, that's a shitty example. I'll have to think on it when I have more time...
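
    A case that does make the point is recursion whose depth depends on a
    run-time value -- there, no purely static analysis can bound the stack
    without a known or asserted limit on the input (or on the recursion
    depth). A minimal, hypothetical sketch:

        void countdown(int n)
        {
            if (n > 0)
                countdown(n - 1);   /* depth -- and stack use -- grows with n */
        }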

    My "cost constrained" days were back in the days of the "small/cheap
    CPU" where you had to predict performance accurately -- but, doing so
    was relatively easy (because processors had very predictable
    instruction execution times).

    And that is the kind of processor that Bound-T was designed to support for WCET
    analysis. Stack-usage analysis is not so sensitive.

    Ah, OK.

    Time to get my neighbor's lunch ready...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to Don Y on Sat Aug 7 20:46:19 2021
    On 2021-08-07 20:26, Don Y wrote:
    On 8/7/2021 4:20 AM, Niklas Holsti wrote:


    (In a discussion about WCET analysis more than stack analysis:)


    Ok. If you describe your needs, perhaps I can comment.

    I'm targeting some of the bigger ARM offerings -- A53/55. My
    impression is they go through more "contortions" to eke out
    additional performance than some of the smaller processors.

    Out of scope for Bound-T, I'm sure. Even the AbsInt tool has no support
    for static WCET analysis of those processors and their ARM page suggests
    a hybrid tool instead (https://www.absint.com/ait/arm.htm).


    In my experience, access to memory-mapped registers tends to be not
    much slower than access to memory. Also, quite often the addresses in
    MMIO accesses are static or easy to derive in static analysis, so the
    WCET tool could choose the proper access time, if the tool knows
    enough about the system architecture.

    Yes, that last phrase being the kicker. Can this be simplified to
    something as crude as a "memory map"?


    It seems so, if the access time is simply a function of the accessed
    address.
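
    A minimal sketch of such a "memory map" for a timing model -- the address
    ranges and cycle counts below are made up, purely for illustration:

        #include <stdint.h>

        typedef struct {
            uint32_t base, limit;   /* address range [base, limit) */
            unsigned cycles;        /* worst-case access time in cycles */
        } mem_region;

        static const mem_region memory_map[] = {
            { 0x00000000u, 0x00080000u, 2 },   /* flash             */
            { 0x20000000u, 0x20020000u, 1 },   /* SRAM              */
            { 0x40000000u, 0x60000000u, 4 },   /* memory-mapped I/O */
        };

        /* Access cost for a statically known address; otherwise fall back
           to a default (unclassified) cost. */
        static unsigned access_cycles(uint32_t addr, unsigned default_cycles)
        {
            for (unsigned i = 0; i < sizeof memory_map / sizeof memory_map[0]; i++)
                if (addr >= memory_map[i].base && addr < memory_map[i].limit)
                    return memory_map[i].cycles;
            return default_cycles;
        }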


    What is VMM?

    Virtual Memory Management. I.e., when an opcode fetch (or argument reference) can not only take longer than a cache miss... but
    *considerably* longer as the physical memory is mapped in while the instruction stream "stalls".


    I don't recall seeing any static WCET analysis for page faults, but
    there may be some for Translation Look-aside Buffer misses. Out of my competence, and out of scope for Bound-T, certainly.


    [Note that a page fault need not map physical memory in the
    traditional sense. It can also cause some "special" function to be
    invoked to provide the requisite data/access. So, the cost of a
    fault can vary depending on *what* is faulting and which pager is
    handling that fault]

    You may be able to map some of that into a very capable schedulability analyzer, one that can handle chains of "tasks" passing data/messages to
    each other. But translating the application logic and system behaviour
    into a model for such a schedulability analyzer is not trivial.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to Don Y on Sat Aug 7 21:10:09 2021
    On 2021-08-07 20:34, Don Y wrote:
    On 8/7/2021 9:50 AM, Niklas Holsti wrote:
    On 2021-08-07 18:09, Don Y wrote:
    On 8/7/2021 4:09 AM, Niklas Holsti wrote:


    (On how the Bound-T tool models instructions for stack and WCET
    analysis:)

    The internal model includes not only the cost of the instruction,
    but also its effect on the computation. For example, an instruction
    like "increment the accumulator", INC A, would be decoded into a
    model that says

         WCET: 1 cycle.
         Control flow: to next instruction, unconditional.
         Effect: A := A + 1.

    and also the effect, if any, on condition flags.

    Why do you care about the "effect"?  Are you trying to determine
    which branches will be taken?  How many iterations through loops?
    etc.

    The effect is necessary for any analysis that depends on run-time
    values. Loop bounds are the prime example for WCET analysis, but
    stack-usage analysis also needs it. For example, the effect of a push
    instruction is to increment the "local stack height" model variable by
    some number that reflects the size of the pushed data (which size can,
    in principle, be non-static and computed at run time).

    OK.  But, you're not trying to emulate/simulate the complete algorithm;
    just handle the side effects of value changes.

    But wouldn't you *have* to do a more thorough emulation?

    void foo(int count) {
        for (int i = 0; i < 5 * count; i++) {
            diddle();
        }
    }


    The point and aim of _static_ analysis is to abstract the computation to
    model only the aspects that are relevant for the goals of the analysis,
    and especially to avoid any step-by-step emulation or interpretation --
    that would become "dynamic analysis".

    For that example, analysing the subprogram foo stand-alone, the WCET
    analysis in Bound-T would conclude that, in the loop:

    - The variable "i" is an induction variable that increases by one on
    each iteration. This looks promising, and means that Bound-T can model
    the value of "i" as the initial value (zero) plus the loop iteration
    number (starting the count at iteration number zero).

    - But the analysis will not find a numeric upper bound on the iteration
    number, because the value of "count" is unknown.

    If this analysis of foo occurs while analysing some higher-level
    subprogram that calls foo, Bound-T would next try to compute bounds for
    the actual value of "count" in each such call. Say that the call is simply

    foo (33);

    This would make Bound-T reanalyze foo in the context count=33, which
    would show that the loop can be repeated only under the condition

        (iteration number) < 5 * 33 = 165

    which bounds the number of
    loop iterations and lets Bound-T produce a WCET bound for this specific
    call of foo (assuming that Bound-T can produce a WCET bound for the
    "diddle" subprogram too).

    Such repeated analyses down a call-path with increasing amount of
    context will of course increase the analysis time.
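
    (For instance, a second call such as foo(7) would presumably be
    re-analyzed in its own context, giving (iteration number) < 5 * 7 = 35,
    i.e. at most 35 iterations for that call -- each distinct call context
    gets its own bound, and its own share of the analysis time.)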

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Niklas Holsti on Sat Aug 7 15:14:52 2021
    On 8/7/2021 10:46 AM, Niklas Holsti wrote:
    On 2021-08-07 20:26, Don Y wrote:
    On 8/7/2021 4:20 AM, Niklas Holsti wrote:

    (In a discussion about WCET analysis more than stack analysis:)

    Ok. If you describe your needs, perhaps I can comment.

    I'm targeting some of the bigger ARM offerings -- A53/55. My
    impression is they go through more "contortions" to eke out
    additional performance than some of the smaller processors.

    Out of scope for Bound-T, I'm sure. Even the AbsInt tool has no support for static WCET analysis of those processors and their ARM page suggests a hybrid tool instead (https://www.absint.com/ait/arm.htm).

    You can see why I favor hand-waving away all of the "tricks" the processors
    can play to improve performance! Granted, the numbers that result are TRULY "worst case" -- and likely significantly inflated!

    But, if you size for that, then all of the "noncritical" stuff runs "for free"!

    In my experience, access to memory-mapped registers tends to be not much
    slower than access to memory. Also, quite often the addresses in MMIO
    accesses are static or easy to derive in static analysis, so the WCET tool
    could choose the proper access time, if the tool knows enough about the
    system architecture.

    Yes, that last phrase being the kicker. Can this be simplified to something
    as crude as a "memory map"?

    It seems so, if the access time is simply a function of the accessed address.

    What is VMM?

    Virtual Memory Management. I.e., when an opcode fetch (or argument
    reference) can not only take longer than a cache miss... but
    *considerably* longer as the physical memory is mapped in while the
    instruction stream "stalls".

    I don't recall seeing any static WCET analysis for page faults, but there may be some for Translation Look-aside Buffer misses. Out of my competence, and out
    of scope for Bound-T, certainly.

    Unlike a cache miss, it's not easy to predict those costs without an
    intimate model of the software.

    E.g., two "closely located" addresses can have entirely different
    behaviors (timing) on a page fault. And, you likely can't predict where
    those addresses will reside as that's a function of how they are mapped
    at runtime.

    Unlike a conventional VMM system (faulting in/out pages from a secondary/disk store), I rely on the mechanism heavily for communication and process container hacks.

    I.e., there's a good reason to over-specify the hardware -- so you can be inefficient in your use of it! :-/

    [Note that a page fault need not map physical memory in the
    traditional sense. It can also cause some "special" function to be
    invoked to provide the requisite data/access. So, the cost of a
    fault can vary depending on *what* is faulting and which pager is
    handling that fault]

    You may be able to map some of that into a very capable schedulability analyzer, one that can handle chains of "tasks" passing data/messages to each other. But translating the application logic and system behaviour into a model
    for such a schedulability analyzer is not trivial.

    As my workload is dynamically defined (can grow or shrink, algorithmically),
    I don't really fret this. It's not like a closed system where what's inside *has* to work. If something isn't working, I can shed load (or, move it to another processor). And, if something is underutilized, *add* load (possibly moved from another processor).

    You typically do this intuitively on your workstation; if things start
    to run sluggishly, you kill off some applications (and make a mental note
    to run them, again, later). And, if the workstation isn't "doing anything", you *add* application processes.

    If you needed to "make world", you might power up another workstation so
    it didn't impede the activities of the "normal" workstation you're using.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to niklas.holsti@tidorum.invalid on Sun Aug 8 18:34:25 2021
    On Sat, 7 Aug 2021 13:40:19 +0300, Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2021-08-07 5:55, George Neuner wrote:

    More importantly, how do you handle chains of conditional branches
    and/or switch/case constructs which can mispredict at every decision
    point? The branch targets may not be in cache - neither mispredicted
    targets nor the actual one. Worst case for a horribly mispredicted
    switch/case can be absolutely dreadful.


    I don't know if those questions are addressed to me (re the Bound-T
    tool) or to WCET analysis in general.

    A little of both really. I was hoping you had some smart(er) approach
    to estimating misprediction effects in systems that use dynamic
    prediction - even if it's just heuristic.

    A lot of more powerful chips now are being used even in 'small'
    systems, and some of them do have dynamic prediction. Note that Don
    is talking about Cortex A53, A55, etc.


    The advanced WCET tools, like the ones from AbsInt, do try to cover such
    cases by their cache analysis to provide a safe upper bound on the WCET,
    but the degree of pessimism (over-estimation) increases with the
    complexity of the code.

    That said, in many programs most of the processing time is taken by
    relatively simple loops, for which the present I-cache analyses work
    quite well, and for which even D-cache analysis can (I believe) work
    satisfactorily if the memory addressing patterns are regular.

    Dynamic prediction handles loop control quite well ... it's all the
    other branching code that is the problem.

    WCET has use outside the 'embedded' world also. A lot of my work was
    in QA/QC vision systems: tons of code, tons of features, workstation
    class processors, and still having to be /hard real time/.


    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to George Neuner on Mon Aug 9 10:50:00 2021
    On 2021-08-09 1:34, George Neuner wrote:
    On Sat, 7 Aug 2021 13:40:19 +0300, Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2021-08-07 5:55, George Neuner wrote:

    More importantly, how do you handle chains of conditional branches
    and/or switch/case constructs which can mispredict at every decision
    point? The branch targets may not be in cache - neither mispredicted
    targets nor the actual one. Worst case for a horribly mispredicted
    switch/case can be absolutely dreadful.


    I don't know if those questions are addressed to me (re the Bound-T
    tool) or to WCET analysis in general.

    A little of both really. I was hoping you had some smart(er) approach
    to estimating misprediction effects in systems that use dynamic
    prediction - even if it's just heuristic.


    Sorry, but no.


    A lot of more powerful chips now are being used even in 'small'
    systems, and some of them do have dynamic prediction. Note that Don
    is talking about Cortex A53, A55, etc.


    Indeed, and therefore static WCET analysis is waning, and hybrid (partly measurement-based) WCET estimation is waxing.


    Dynamic prediction handles loop control quite well ... it's all the
    other branching code that is the problem.

    WCET has use outside the 'embedded' world also. A lot of my work was
    in QA/QC vision systems: tons of code, tons of features, workstation
    class processors, and still having to be /hard real time/.


    (I would call that "embedded", because it is a computerized system used
    for a single purpose.)

    One could say, perhaps meanly, that that is an ill-posed problem, with
    the wrong choice of processors. But cost matters, of course...

    If you are not satisfied with Don's approach (extreme over-provision of processor power) you could try the hybrid WCET-estimation tools
    (RapiTime or TimeWeaver) which do not need to model the processors, but
    need to measure fine-grained execution times (on the basic-block level).
    The problem with such tools is that they cannot guarantee to produce an
    upper bound on the WCET, only a bound that holds with high probability.
    And, AIUI, at present that probability cannot be computed, and certainly depends on the test suite being measured. For example, on whether those
    tests lead to mispredictions in chains of conditional branches.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Niklas Holsti on Mon Aug 9 12:04:56 2021
    On 8/9/2021 12:50 AM, Niklas Holsti wrote:
    If you are not satisfied with Don's approach (extreme over-provision of processor power)

    It's not necessarily "extreme". I only grossly underestimate the processor's ability to do the "must do" aspects of the design. This ensures that they actually *will* get done.

    And, what's *left* will likely exceed the needs of what I'd *like* to (also) get done -- but, as that is (by definition) not a "must do" aspect of the design, there are no criteria as to HOW MUCH must get done (including
    "none" and "all")

    In my world, I can defer execution of some tasks, *move* tasks from an overburdened processor to a less-burdened one, *add* physical processors
    to the mix (on demand), etc.

    So, trying to match (at design time) the load to the capabilities is
    neither necessary nor desirable/economical.

    There's no (a priori) "hard limit" on the number/types of applications that
    you can run on your PC, is there? Instead, you dynamically review the performance you are experiencing and adjust your expectations, accordingly.

    Maybe kill off some application that isn't *as* useful to you (at the
    moment) as some other. Or, defer invoking an application until some
    *other* has exceeded its utility.

    You could, of course, just buy a bigger/faster PC! Yet, you don't (up
    to a point). Because that changes the economics of the solution.
    And, because you can "make do" with your current investment just by
    more intelligently scheduling its use!

    By minimizing the "must do" aspect of the problem, you give yourself
    the most flexibility in how you address that/those requirements.
    The rest is "free", figuratively speaking.

    For example, one of my applications handles telephony. It screens
    incoming callers, does speaker identification ("who is calling"),
    etc. To train the recognizer, I have it "listen in" on every call
    given that it now KNOWS who I am talking with and can extract/update
    speech parameters from the additional "training material" that is
    present, FOR FREE, in the ongoing conversation. (instead of explicitly requiring callers to "train" the recognizer in a "training session")

    But, there's no reason that this has to happen:
    - in real time
    - while the conversation is in progress
    - anytime "soon"

    So, if you *record* the conversation (relatively inexpensive as you
    are already moving the audio through the processor), you can pick
    some *later* time to run the model update task. Maybe in the middle
    of the night when there are fewer demands (i.e., no one telephoning
    you at 3AM!)

    And, if something happens to "come up" that is more important than
    this activity, you can checkpoint the operation, kill off the process
    and let the more important activity have access to those resources.

    Yes, it would be a much simpler design if you could just say:
    "I *need* the following resources to be able to update the
    recognizer models AS THE CONVERSATION WAS HAPPENING". But,
    when the conversation was *over*, you'd have all those extra
    re$ource$ sitting idle.

    Turn "hard" requirements into *soft* ones -- and then shift them,
    in time, to periods of lower resource utilization.

    [The "hard real-time is hard but soft real-time is HARDER" idiom]

    you could try the hybrid WCET-estimation tools (RapiTime or
    TimeWeaver) which do not need to model the processors, but need to measure fine-grained execution times (on the basic-block level). The problem with such
    tools is that they cannot guarantee to produce an upper bound on the WCET, only
    a bound that holds with high probability. And, AIUI, at present that probability cannot be computed, and certainly depends on the test suite being measured. For example, on whether those tests lead to mispredictions in chains
    of conditional branches.

    The equivalent mechanism in my world is monitoring *actual* load.
    As it increases, you have a mismatch between needs and capabilities.
    So, adjust one, the other, or both!

    Shed load (remove processes from the current node)

    and/or

    Add capacity (*move* process to another node, including bringing
    that other node "on-line" to handle the load!)

    If you keep in mind the fact that you only have to deal with
    the MUST DO tasks, this is a lot easier to wrap your head
    around. E.g., eventually, there *will* be a situation where
    what you *want* to do exceeds the capabilities that you
    have available -- period. So, you have to remember that only
    some of those are MUST DOs.

    Yeah, it would be nice to have 3000 applications loaded and ready
    at a mouse-click... but, that wouldn't be worth the cost of
    the system required to support it!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul Rubin@21:1/5 to Niklas Holsti on Mon Aug 9 11:26:02 2021
    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
    And, AIUI, at present that probability cannot be
    computed, and certainly depends on the test suite being measured. For example, on whether those tests lead to mispredictions in chains of conditional branches.

    Maybe architectural simulation of the target cpu can help, if the
    architecture is known (i.e. exact workings of the pipelines, branch
    predictors etc). And maybe forthcoming RISC-V cpus will be more open
    about this than the current ARM stuff is.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to Paul Rubin on Mon Aug 9 22:59:36 2021
    On 2021-08-09 21:26, Paul Rubin wrote:
    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
    And, AIUI, at present that probability cannot be
    computed, and certainly depends on the test suite being measured. For
    example, on whether those tests lead to mispredictions in chains of
    conditional branches.

    Maybe architectural simulation of the target cpu can help, if the architecture is known (i.e. exact workings of the pipelines, branch predictors etc).


    For timing, one needs *micro*-architectural simulation (as the term is commonly used, for example in comp.arch). But I think you mean that, so
    this is a terminological quibble.


    And maybe forthcoming RISC-V cpus will be more open about this than
    the current ARM stuff is.

    I suspect that the microarchitecture will be where the various RISC-V implementors will compete (that, and peripherals and packaging), so I'm
    not optimistic that they will be very open about their
    microarchitectures. However, I don't know how far into the micro-level
    the RISC-V standardization and open-source licensing extends.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)