arm gcc and Cortex-Mx MCUs embedded systems.
Is there a compilation-time (static) tool for stack analysis that really works?
The best I could find is -fstack-usage and avstack.pl Perl script, but I guess there's another and better way.
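For reference, this is the kind of data -fstack-usage gives you per function (a minimal sketch; the numbers are invented and depend on target, flags and ABI):

/* foo.c -- e.g.: arm-none-eabi-gcc -Os -c -fstack-usage foo.c */
int bar(int x);                 /* defined elsewhere */

int foo(int x)
{
    int buf[8];                 /* 32 bytes of locals */
    buf[0] = bar(x);
    return buf[0] + x;
}

/* gcc then writes foo.su with one line per function, something like
 *   foo.c:4:5:foo    40    static
 * where "static" means a fixed frame size ("dynamic" appears with
 * alloca()/VLAs).  These are per-function numbers only; avstack.pl
 * adds them up along the call graph it recovers from the objects.
 */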
On 7/29/2021 9:22 AM, pozz wrote:
arm gcc and Cortex-Mx MCUs embedded systems.
Is there a compilation-time (static) tool for stack analysis that
really works?
What do you mean by "really works"? It's a (potentially) unbounded problem as the compiler can't know what the inputs will drive the code to do, in practice.
There are general tools that you can probably instrument to coax a
number out of the source (if you don't have access to the source, you're typically SoL) but they will be expensive in time or money.
On 2021-08-03 3:56, Don Y wrote:
On 7/29/2021 9:22 AM, pozz wrote:
arm gcc and Cortex-Mx MCUs embedded systems.
Is there a compilation-time (static) tool for stack analysis that really works?
What do you mean by "really works"? It's a (potentially) unbounded problem as the compiler can't know what the inputs will drive the code to do, in practice.
There are practical static-analysis methods to compute an upper bound on the stack usage that work well for many embedded programs. An upper bound is usually all one needs. The exact worst case is less interesting.
Some problems arise for programs that branch to dynamically computed addresses,
for example by calling functions via function pointers. Static analyses may not
be able to find the set of possible target addresses. When that happens you have to guide the analysis by specifying where such control transfers can end up. For typical embedded code that is not hard to do, but a program that relies
extensively on virtual function calls may be hard to analyze.
There are general tools that you can probably instrument to coax a
number out of the source (if you don't have access to the source, you're
typically SoL) but they will be expensive in time or money.
The exact stack usage of a subprogram can't be derived from source code without
knowing what machine code the compiler and linker will produce. The static stack-analysis tools typically work on the final executable code (which also means that they can be independent of the programming language).
But (as I said in an earlier post) the tools I am aware of, for the OP's target, are not free.
On 7/29/2021 9:22 AM, pozz wrote:
arm gcc and Cortex-Mx MCUs embedded systems.
Is there a compilation-time (static) tool for stack analysis that
really works?
What do you mean by "really works"? It's a (potentially) unbounded problem as the compiler can't know what the inputs will drive the code to do, in practice.
Your best practical option is to tease a callgraph (tree) out and use your own knowledge of what the code WANTS to do to identify the worst-case
stack penetration path. This is also a great way to get a feel for if your code might benefit from a good refactoring!
The best I could find is -fstack-usage and avstack.pl Perl script, but
I guess there's another and better way.
There are general tools that you can probably instrument to coax a
number out of the source (if you don't have access to the source, you're typically SoL) but they will be expensive in time or money.
On 03/08/2021 02:56, Don Y wrote:
On 7/29/2021 9:22 AM, pozz wrote:
arm gcc and Cortex-Mx MCUs embedded systems.
Is there a compilation-time (static) tool for stack analysis that really works?
What do you mean by "really works"? It's a (potentially) unbounded problem as the compiler can't know what the inputs will drive the code to do, in practice.
I mean a simple tool that can be instructed, even manually, to produce a good result.
-fstack-usage is not usable "by hand".
You need at least a call-graph (generated by a tool), fill each branch (function) with its stack usage (generated by the compiler), and fill every branch that is not known at compilation time by the tools (calls through function pointers, recursion, and so on).
It's surprising to me that there isn't a single inexpensive tool that helps with this, considering there is a multitude of good free tools for developers.
Your best practical option is to tease a callgraph (tree) out and use your own knowledge of what the code WANTS to do to identify the worst-case stack penetration path. This is also a great way to get a feel for if your code might benefit from a good refactoring!
The best I could find is -fstack-usage and avstack.pl Perl script, but I guess there's another and better way.
There are general tools that you can probably instrument to coax a
number out of the source (if you don't have access to the source, you're
typically SoL) but they will be expensive in time or money.
On 8/3/2021 2:49 AM, pozz wrote:
On 03/08/2021 02:56, Don Y wrote:
On 7/29/2021 9:22 AM, pozz wrote:
arm gcc and Cortex-Mx MCUs embedded systems.
Is there a compilation-time (static) tool for stack analysis that
really works?
What do you mean by "really works"? It's a (potentially) unbounded problem as the compiler can't know what the inputs will drive the code to do, in practice.
I mean a simple tool that can be instructed, even manually, to produce
a good result.
What's "good"? Does it have to solve *most* peoples' needs? Or, just
yours?
As I said, the first step is understanding what the dynamics of execution
in your task happen to be. You may be surprised to see functions being dragged in that you'd not have thought were part of the mix! ("why is printf() being called, here?? and, couldn't I use something like itoa() instead?")
The compiler won't know anything about your run-time environment. It won't know if you have a separate stack for ISRs, if the function you are
analyzing
runs as the "root" of a particular process tree, etc. Do you know how much work is done BEFORE the first line of your function is executed?
It also won't know which execution paths *will* be exercised in practice.
You may have paths that can't be executed (given a particular set of inputs). How will you know to recognize their impact on any figures reported?
-fstack-usage is not usable "by hand".
No, but you can use that data *with* the call tree to get an idea of
where your maximum lies -- assuming no recursion and worst-case path coverage.
You can also fill the stack space with 0x2B00B1E5, run your COMPREHENSIVE test case suite that exercises ALL paths through the code and check to see what the high-water mark was. (you *are* testing the code, right?)
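A minimal sketch of that high-water trick, assuming a single downward-growing stack delimited by linker symbols (the symbol names here are illustrative and must match your linker script):

#include <stddef.h>
#include <stdint.h>

extern uint32_t _sstack[];          /* lowest stack address  (from the linker script) */
extern uint32_t _estack[];          /* initial top of stack  (from the linker script) */

#define STACK_FILL 0x2B00B1E5u

void stack_paint(void)              /* call very early, e.g. from the reset handler */
{
    /* leave a small margin for the frames already in use */
    for (uint32_t *p = _sstack; p < _estack - 16; p++)
        *p = STACK_FILL;
}

size_t stack_high_water(void)       /* bytes ever used since stack_paint() */
{
    const uint32_t *p = _sstack;
    while (p < _estack && *p == STACK_FILL)
        p++;
    return (size_t)((const uint8_t *)_estack - (const uint8_t *)p);
}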
Hint: you *really* want tasks to be of the lowest practical complexity
that can meet your requirements -- even if that means splitting tasks
into more subunits.
You need at least a call-graph (generated by a tool), fill each branch
(function) with a stack usage (generated by the compiler),
These can be done with a script (as I suggested below)
fill every branch that is not known at compilation time by the tools
(call functions by pointers, recursive, and so on).
Ah, well... there's the rub! How smart should this "free" tool be?
And, how should it handle cases that can't be known at compile time
("Sorry, your codebase appears to need AT LEAST X bytes of stack; but,
I can't tell you how much more!" -- what use, that?)
It's surprising to me that there isn't a single inexpensive tool that
helps with this, considering there is a multitude of good free tools
for developers.
It costs money to engage in any business (non-hobbyist) endeavor.
Expecting all of your tools to be "free" is a bit naive.
There are some absolutely *amazing* tools available that can save
man-years of development effort and greatly improve the quality/correctness of code. You can buy them -- or, spend man-hours trying to do what they
do. Hopefully, without error.
And, having *done* that, do it all over again on your NEXT project!
<frown>
Your best practical option is to tease a callgraph (tree) out and use
your
own knowledge of what the code WANTS to do to identify the worst-case
stack penetration path. This is also a great way to get a feel for
if your
code might benefit from a good refactoring!
The best I could find is -fstack-usage and avstack.pl Perl script,
but I guess there's another and better way.
There are general tools that you can probably instrument to coax a
number out of the source (if you don't have access to the source, you're typically SoL) but they will be expensive in time or money.
*Try* building the call tree (there are some graphic tools) and
have a look at it. Is it what you expected? Do you understand
*why* each function is being invoked in each edge?
Then, try to *guess* where the "stack pigs" are located.
Finally, see what the compiler tells you for each of them and
how well that fits with your "understanding" of the code.
[The reason you want to do this is so you have a better feel
for the costs of particular approaches -- instead of just
reacting to the "final assessment". For example, most newbies
are stunned to discover how expensive an off-the-shelf
printf() can be!]
On 2021-08-03 12:49, pozz wrote:
On 03/08/2021 02:56, Don Y wrote:
On 7/29/2021 9:22 AM, pozz wrote:
arm gcc and Cortex-Mx MCUs embedded systems.
Is there a compilation-time (static) tool for stack analysis that
really works?
What do you mean by "really works"? It's a (potentially) unbounded
problem as the compiler can't know what the inputs will drive the
code to do, in practice.
I mean a simple tool that can be instructed, even manually, to
produce a good result.
-fstack-usage is not usable "by hand".
You need at least a call-graph (generated by a tool), fill each
branch (function) with a stack usage (generated by the compiler),
fill every branch that is not known at compilation time by the tools
(call functions by pointers, recursive, and so on).
It's surprising to me that there isn't a single inexpensive tool that
helps with this, considering there is a multitude of good free tools
for developers.
One reason is that for good results, such a tool has to analyze the executable
code, and therefore must be target-specific, or at least have ports to the various targets, increasing the effort to implement and maintain the tool. The
gnatstack tool gets around that, to some extent, by relying on stack-usage and
call information from the compiler (gcc).
Note that a source-level call graph
will not show calls to the various support routines (run-time routines) inserted by the compiler, but those calls do use stack.
I am the main author of a WCET-analysis tool, Bound-T, that also does stack analysis and is now free. However, there is (as yet) no port for ARM Cortex-M.
See http://www.bound-t.com/.
As I said, the first step is understanding what the dynamics of execution
in your task happen to be. You may be surprised to see functions being
dragged in that you'd not have thought were part of the mix! ("why is
printf() being called, here?? and, couldn't I use something like itoa()
instead?")
Oh yes, but a simple tool that automatically generates a call graph could be very useful for this.
The compiler won't know anything about your run-time environment. It won't know if you have a separate stack for ISRs, if the function you are analyzing
runs as the "root" of a particular process tree, etc. Do you know how much work is done BEFORE the first line of your function is executed?
It also won't know which execution paths *will* be exercised in practice.
You may have paths that can't be executed (given a particular set of
inputs). How will you know to recognize their impact on any figures
reported?
I know that the compiler can't know everything, but *I* can instruct the tool with
that kind of info.
-fstack-usage is not usable "by hand".
No, but you can use that data *with* the call tree to get an idea of
where your maximum lies -- assuming no recursion and worst-case path
coverage.
I think this job could be done by a single tool that creates a call graph and fills in the values from stack usage.
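A hand-worked illustration of exactly that combination (all numbers invented):

  .su data (from -fstack-usage):     call graph (from a tool or by hand):
    main       16  static              main -> parse_msg -> { checksum | log_error }
    parse_msg  56  static
    checksum   24  static
    log_error 120  static

  Deepest path: main -> parse_msg -> log_error = 16 + 56 + 120 = 192 bytes,
  plus whatever the interrupt frames can add on top at the worst moment.
  This per-path addition over the .su numbers is what avstack.pl automates.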
You can also fill the stack space with 0x2B00B1E5, run your COMPREHENSIVE
test case suite that exercises ALL paths through the code and check to see what the high-water mark was. (you *are* testing the code, right?)
This is the dynamic approach, I was exploring the static approach.
It costs money to engage in any business (non-hobbyist) endeavor.
Expecting all of your tools to be "free" is a bit naive.
Anyway, there is plenty of complex and good free and open-source software (gcc is one example).
So it's strange that there isn't a similar tool for stack analysis. That's all; it just seems strange to me, but I don't pretend all my preferred tools must be free.
On 8/4/2021 10:00 AM, Niklas Holsti wrote:
I am the main author of a WCET-analysis tool, Bound-T, that also does
stack analysis and is now free. However, there is (as yet) no port for
ARM Cortex-M. See http://www.bound-t.com/.
But not free as in open (?)
On 8/3/2021 1:43 AM, Niklas Holsti wrote:
On 2021-08-03 3:56, Don Y wrote:
On 7/29/2021 9:22 AM, pozz wrote:
arm gcc and Cortex-Mx MCUs embedded systems.
Is there a compilation-time (static) tool for stack analysis that
really works?
What do you mean by "really works"? It's a (potentially) unbounded problem as the compiler can't know what the inputs will drive the code to do, in practice.
There are practical static-analysis methods to compute an upper bound
on the stack usage that work well for many embedded programs. An upper
bound is usually all one needs. The exact worst case is less interesting.
That depends on the actual algorithm being implemented, the style of the developer and, often, the inputs provided.
For example, I often use recursive algorithms for pattern matching (because they are easier to "get right"). Looking at the algorithm, you can't know how deep the recursion will be -- without seeing the input it will be fed.
Some problems arise for programs that branch to dynamically computed
addresses, for example by calling functions via function pointers.
Static analyses may not be able to find the set of possible target
addresses. When that happens you have to guide the analysis by
specifying where such control transfers can end up. For typical
embedded code that is not hard to do, but a program that relies
extensively on virtual function calls may be hard to analyze.
The same is true of a program that is driven by external events.
The code doesn't (typically) "know" the constraints placed on those
events. (one can argue as to whether or not this is "good practice")
So, it can't "find the end".
There are general tools that you can probably instrument to coax a
number out of the source (if you don't have access to the source, you're typically SoL) but they will be expensive in time or money.
The exact stack usage of a subprogram can't be derived from source
code without knowing what machine code the compiler and linker will
produce. The static stack-analysis tools typically work on the final
executable code (which also means that they can be independent of the
programming language).
If you don't have the sources, you can't *do* anything with the results
of the analysis -- other than define a minimum stack size (which may
be a surprise to you *and* your hardware!)
If you don't have the sources, you likely haven't a clue as to how
the code behaves, under the complete set of I/O conditions.
But (as I said in an earlier post) the tools I am aware of, for the
OP's target, are not free.
You can possibly instrument some DSE-like tools.
How efficient are those you've used? Do they operate in "near compile
time"?
I.e., are they practical to use iteratively? On an entire
application?
On 2021-08-03 12:28, Don Y wrote:
On 8/3/2021 1:43 AM, Niklas Holsti wrote:
On 2021-08-03 3:56, Don Y wrote:
On 7/29/2021 9:22 AM, pozz wrote:
arm gcc and Cortex-Mx MCUs embedded systems.
Is there a compilation-time (static) tool for stack analysis that really works?
What do you mean by "really works"? It's a (potentially) unbounded problem as the compiler can't know what the inputs will drive the code to do, in practice.
There are practical static-analysis methods to compute an upper bound on the
stack usage that work well for many embedded programs. An upper bound is usually all one needs. The exact worst case is less interesting.
That depends on the actual algorithm being implemented, the style of the
developer and, often, the inputs provided.
For example, I often use recursive algorithms for pattern matching (because they are easier to "get right"). Looking at the algorithm, you can't know how deep the recursion will be -- without seeing the input it will be fed.
Well, most embedded programmers avoid recursion. And if you implement recursion
in a memory-constrained system, your design and code should set a hard limit on
the recursion depth and reject input that would require deeper recursion.
Some (admittedly complex) static analyses can discover such limits from the machine code, in the same way as they discover loop iteration limits for WCET analysis. (In fact, I believe that the AbsInt tools translate loops into recursions before the analysis.)
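A minimal sketch of the kind of hard depth limit mentioned above (illustrative, not from any particular project):

#include <stdbool.h>

#define MAX_MATCH_DEPTH 32   /* chosen so the worst-case frames fit the stack budget */

/* Tiny glob-style matcher ('*' only).  Input that would need deeper
 * recursion is rejected instead of being allowed to overflow the stack.
 */
static bool match(const char *pat, const char *txt, unsigned depth)
{
    if (depth > MAX_MATCH_DEPTH)
        return false;                          /* reject, don't overflow */
    if (*pat == '\0')
        return *txt == '\0';
    if (*pat == '*')
        return match(pat + 1, txt, depth + 1) ||
               (*txt != '\0' && match(pat, txt + 1, depth + 1));
    return *pat == *txt && match(pat + 1, txt + 1, depth + 1);
}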
Some problems arise for programs that branch to dynamically computed
addresses, for example by calling functions via function pointers. Static analyses may not be able to find the set of possible target addresses. When that happens you have to guide the analysis by specifying where such control transfers can end up. For typical embedded code that is not hard to do, but a program that relies extensively on virtual function calls may be hard to analyze.
The same is true of a program that is driven by external events.
The code doesn't (typically) "know" the constraints placed on those
events. (one can argue as to whether or not this is "good practice")
So, it can't "find the end".
Again, if some sequences of external events might use an unbounded amount of stack, the design and code should set a hard limit. I have not encountered such
programs.
The typical design is to allow some fixed maximum nesting of event processing (eg. a finite set of interrupt priority levels) and otherwise enqueue the events for sequential processing, which does not require an unbounded stack.
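A minimal sketch of that pattern -- events queued from interrupt level and handled one at a time from the main loop, so handler frames never pile up on the stack (names and sizes illustrative; single producer, single consumer):

#include <stdbool.h>
#include <stdint.h>

typedef struct { uint8_t type; uint8_t data; } event_t;

#define QLEN 16u

static volatile event_t queue[QLEN];
static volatile uint8_t head, tail;          /* single-byte indices: atomic on Cortex-M */

bool event_put(event_t e)                    /* called from one ISR (the producer) */
{
    uint8_t next = (uint8_t)((head + 1u) % QLEN);
    if (next == tail)
        return false;                        /* queue full: drop and count the overrun */
    queue[head] = e;
    head = next;
    return true;
}

bool event_get(event_t *e)                   /* called from the main loop (the consumer) */
{
    if (tail == head)
        return false;                        /* nothing pending */
    *e = queue[tail];
    tail = (uint8_t)((tail + 1u) % QLEN);
    return true;
}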
There are general tools that you can probably instrument to coax a
number out of the source (if you don't have access to the source, you're typically SoL) but they will be expensive in time or money.
The exact stack usage of a subprogram can't be derived from source code
without knowing what machine code the compiler and linker will produce. The static stack-analysis tools typically work on the final executable code
(which also means that they can be independent of the programming language).
If you don't have the sources, you can't *do* anything with the results
of the analysis -- other than define a minimum stack size (which may
be a surprise to you *and* your hardware!)
If the program is single-threaded, the allocated stack size is usually set in the linker command file. Sure, you need the source for that file if you want to
change the stack size. Of course you are screwed if your machine does not have
enough memory for the size of stack you want.
If you don't have the sources, you likely haven't a clue as to how
the code behaves, under the complete set of I/O conditions.
I've analyzed the stack usage (upper bounds) of many libraries without access to source code. Typically embedded libraries have call-graphs that can be extracted from the machine code without assumptions about the I/O or other run-time values. But that may no longer be the case for libraries written in C++ or other languages with run-time binding of calls.
I agree that for complete programs one sometimes needs more insight, in particular if there are calls through function pointers or local variables with
dynamic (data-dependent) size.
But (as I said in an earlier post) the tools I am aware of, for the OP's target, are not free.
You can possibly instrument some DSE-like tools.
Sorry, what is a "DSE-like" tool? DSE = ?
How efficient are those you've used? Do they operate in "near compile time"?
For stack-usage analysis the tools are quite fast, typically, because embedded
programs tend not to allocate dynamically-sized local variables. The main time
is spent in reading the executable and decoding the instructions.
For programs that allocate local variables of dynamic size the analysis becomes
much more complex, can need lots of time and memory, and can often fail.
I.e., are they practical to use iteratively? On an entire
application?
Yes, for many applications. And if you are implementing a safety-critical system, you design your application to be analysable.
On 2021-08-04 20:50, Don Y wrote:
On 8/4/2021 10:00 AM, Niklas Holsti wrote:
...
I am the main author of a WCET-analysis tool, Bound-T, that also does stack analysis and is now free. However, there is (as yet) no port for ARM
Cortex-M. See http://www.bound-t.com/.
But not free as in open (?)
The source code is downloadable; see http://bound-t.com/download/src. The copyright text is of my own writing, but I'm open to switching to some better-known copyright such as GPL or even some non-viral version.
However, you may want to read about the state of the tool at http://bound-t.com/status.html.
On 8/4/2021 7:43 AM, pozz wrote:
As I said, the first step is understanding what the dynamics of
execution in your task happen to be. You may be surprised to see
functions being dragged in that you'd not have thought were part
of the mix! ("why is printf() being called, here?? and,
couldn't I use something like itoa() instead?")
Oh yes, but a simple tool that generates automatically a call graph
could be very useful for this.
How "simple" does it have to be? There are already tools that will
do this for you. Even GUI out. How "pretty" is a matter of debate
given that it's hard to make an algorithm that can find the "prettiest"
way of drawing such a graph.
This is the dynamic approach, I was exploring the static approach.
My point is that you still have to perform tests that exercise every path through your code.
What's available is what SOMEONE decided they wanted to develop *and*
offer up to others. You typically have fewer choices for "free" than
"for pay". Developing any tool (and maintaining it -- or, providing enough information that OTHERS can maintain it without you) isn't a trivial job.
On 8/4/2021 10:00 AM, Niklas Holsti wrote:
On 2021-08-03 12:49, pozz wrote:
On 03/08/2021 02:56, Don Y wrote:
On 7/29/2021 9:22 AM, pozz wrote:
arm gcc and Cortex-Mx MCUs embedded systems.
Is there a compilation-time (static) tool for stack analysis that
really works?
What do you mean by "really works"? It's a (potentially) unbounded
problem as the compiler can't know what the inputs will drive the
code to do, in practice.
I mean a simple tool that can be instructed, even manually, to
produce a good result.
-fstack-usage is not usable "by hand".
You need at least a call-graph (generated by a tool), fill each
branch (function) with a stack usage (generated by the compiler),
fill every branch that is not known at compilation time by the tools
(call functions by pointers, recursive, and so on).
It's surprising to me that there isn't a single inexpensive tool that
helps with this, considering there is a multitude of good free tools
for developers.
One reason is that for good results, such a tool has to analyze the
executable code, and therefore must be target-specific, or at least
have ports to the various targets, increasing the effort to implement
and maintain the tool. The gnatstack tool gets around that, to some
extent, by relying on stack-usage and call information from the
compiler (gcc).
I am seeing an increasing number of tools relying on intermediate
encodings (e.g., via LLVM) to give more portability to their
application.
On 8/4/2021 11:50 AM, Niklas Holsti wrote:
On 2021-08-04 20:50, Don Y wrote:
On 8/4/2021 10:00 AM, Niklas Holsti wrote:
...
I am the main author of a WCET-analysis tool, Bound-T, that also
does stack analysis and is now free. However, there is (as yet) no
port for ARM Cortex-M. See http://www.bound-t.com/.
But not free as in open (?)
The source code is downloadable; see http://bound-t.com/download/src.
The copyright text is of my own writing, but I'm open to switching to
some better-known copyright such as GPL or even some non-viral version.
Ah, OK. I may have a look at it to see how well it works and
how much work to port to ARM objects. Given that you've not
done so, already (despite your familiarity with the codebase),
I suspect that's not a trivial task?
However, you may want to read about the state of the tool at
http://bound-t.com/status.html.
Hmmm... that sounds ominous! :> (or, am I just too much of a Cynic?)
On 2021-08-04 20:50, Don Y wrote:
On 8/4/2021 10:00 AM, Niklas Holsti wrote:
On 2021-08-03 12:49, pozz wrote:
On 03/08/2021 02:56, Don Y wrote:
On 7/29/2021 9:22 AM, pozz wrote:
arm gcc and Cortex-Mx MCUs embedded systems.
Is there a compilation-time (static) tool for stack analysis that
really works?
What do you mean by "really works"? It's a (potentially) unbounded
problem as the compiler can't know what the inputs will drive the
code to do, in practice.
I mean a simple tool that can be instructed, even manually, to
produce a good result.
-fstack-usage is not usable "by hand".
You need at least a call-graph (generated by a tool), fill each
branch (function) with a stack usage (generated by the compiler),
fill every branch that is not known at compilation time by the tools
(call functions by pointers, recursive, and so on).
It's surprising to me that there isn't a single inexpensive tool that
helps with this, considering there is a multitude of good free tools
for developers.
One reason is that for good results, such a tool has to analyze the
executable code, and therefore must be target-specific, or at least have ports to the various targets, increasing the effort to implement and
maintain the tool. The gnatstack tool gets around that, to some extent, by relying on stack-usage and call information from the compiler (gcc).
I am seeing an increasing number of tools relying on intermediate
encodings (e.g., via LLVM) to give more portability to their
application.
LLVM IR and other similar program representations are a good "level" for some semantic analyses, such as value analysis (finding possible ranges of variable
values) and control-flow analysis.
But it is typically too high a level for
analysing machine resources such as stack usage and WCET.
If one can set up a reliable mapping between the IR entities (variables and control flow) and the machine-level entities (registers, memory locations, branch instructions) the two levels of analysis can support each other. Unfortunately that mapping is defined by the compiler and linker and is usually
described only incompletely in the debugging information emitted from the compiler and linker.
On 2021-08-04 21:06, Don Y wrote:
This is the dynamic approach, I was exploring the static approach.
My point is that you still have to perform tests that exercise every path
through your code.
That (full path coverage) is almost never done, because it is exponential in the number of branches.
In case the sizes of local variables depend on input data, it may be very hard
to find the input values that lead to the actual worst-case stack-usage path. Safe upper bounds are easier to find by analysis.
What's available is what SOMEONE decided they wanted to develop *and*
offer up to others. You typically have fewer choices for "free" than
"for pay". Developing any tool (and maintaining it -- or, providing enough >> information that OTHERS can maintain it without you) isn't a trivial job.
My experience is that rather few developers favour the static-analysis approach
to resource analysis -- most just instrument and test. There are a few application domains that require better guarantees -- aerospace, automotive, nuclear -- and they are also prepared to pay (a bit) for the tools. However, tools for such domains usually need some kind of formal qualification or certification, which can be expensive.
Also, while static analysis is still possible for stack usage, it is becoming impossible for WCET, because of the complex, dynamic architecture of current high-end embedded processors (out-of-order execution, complex caches, multi-core with shared-resource conflicts, and so on and on). The field seems to be moving towards hybrid analysis methods that combine measured execution times with static flow analysis, for example the tools from Rapita (https://www.rapitasystems.com/).
On 2021-08-05 1:46, Don Y wrote:
The source code is downloadable; see http://bound-t.com/download/src. The copyright text is of my own writing, but I'm open to switching to some
better-known copyright such as GPL or even some non-viral version.
Ah, OK. I may have a look at it to see how well it works and
how much work to port to ARM objects. Given that you've not
done so, already (despite your familiarity with the codebase),
I suspect that's not a trivial task?
There is an ARM TDMI port (32-bit ARM code and THUMB, pre Cortex), but of course that architecture is out of date. I started a port to ARM Cortex-M, but
did not (yet) finish it for the reasons described on the status page.
Note that the tool is implemented in Ada.
Porting to a new target processor means writing procedures to decode the machine instructions and translate them to the internal representation used by
the analysis.
Sometimes it is also necessary to modify or extend the tool parts
that read the executable files and especially the debugging information -- some
of that may be compiler-specific, unfortunately. It is a fair amount of work and requires a good understanding both of the target processor and of the Bound-T internal representation.
And Ada, of course, but every programmer
should know Ada, IMO :-)
However, you may want to read about the state of the tool at
http://bound-t.com/status.html.
Hmmm... that sounds ominous! :> (or, am I just too much of a Cynic?)
The problems described on the status page are more relevant to WCET analysis than to stack analysis, but can affect stack analysis too, in some cases.
On 8/5/2021 3:43 AM, Niklas Holsti wrote:
On 2021-08-05 1:46, Don Y wrote:
The source code is downloadable; see
http://bound-t.com/download/src. The copyright text is of my own
writing, but I'm open to switching to some better-known copyright
such as GPL or even some non-viral version.
Ah, OK. I may have a look at it to see how well it works and
how much work to port to ARM objects. Given that you've not
done so, already (despite your familiarity with the codebase),
I suspect that's not a trivial task?
There is an ARM TDMI port (32-bit ARM code and THUMB, pre Cortex), but
of course that architecture is out of date. I started a port to ARM
Cortex-M, but did not (yet) finish it for the reasons described on the
status page.
Note that the tool is implemented in Ada.
OK. It's been ~20 years since I wrote a line of Ada code but I doubt it
will take much to "refresh" those memory cells. Esp as this likely doesn't have synchronization issues, RT constraints, etc.
Porting to a new target processor means writing procedures to decode
the machine instructions and translate them to the internal
representation used by the analysis.
That shouldn't be hard -- except for the "timing uncertainties" in the processor. (Did you rely on some blanket generalization -- like all
caches cold -- as you are looking for WORST case timing? OTOH, a
loop known to fit in a cache line *should* expect that speedup... :< )
Sometimes it is also necessary to modify or extend the tool parts that
read the executable files and especially the debugging information --
some of that may be compiler-specific, unfortunately. It is a fair
amount of work and requires a good understanding both of the target
processor and of the Bound-T internal representation.
The latter will likely be the bigger effort. It requires trying to infer your mindset when you began the design.
[I now include "rationales" in my documentation for exactly this purpose;
to let those "following" understand why particular choices were made
instead of choices that may (in hindsight) appear better]
On 2021-08-05 16:16, Don Y wrote:
On 8/5/2021 3:43 AM, Niklas Holsti wrote:
On 2021-08-05 1:46, Don Y wrote:
The source code is downloadable; see http://bound-t.com/download/src. The copyright text is of my own writing, but I'm open to switching to some better-known copyright such as GPL or even some non-viral version.
Ah, OK. I may have a look at it to see how well it works and
how much work to port to ARM objects. Given that you've not
done so, already (despite your familiarity with the codebase),
I suspect that's not a trivial task?
There is an ARM TDMI port (32-bit ARM code and THUMB, pre Cortex), but of course that architecture is out of date. I started a port to ARM Cortex-M, but did not (yet) finish it for the reasons described on the status page.
Note that the tool is implemented in Ada.
OK. It's been ~20 years since I wrote a line of Ada code but I doubt it
will take much to "refresh" those memory cells. Esp as this likely doesn't have synchronization issues, RT constraints, etc.
Right. The only use of tasking features is to set an upper limit on the execution time of the analysis, and that feature is easy to remove if needed.
The code mostly relies on the 1995 Ada standard, with some use of Ada 2005 additions.
Porting to a new target processor means writing procedures to decode the machine instructions and translate them to the internal representation used by the analysis.
That shouldn't be hard -- except for the "timing uncertainties" in the
processor. (Did you rely on some blanket generalization -- like all
caches cold -- as you are looking for WORST case timing? OTOH, a
loop known to fit in a cache line *should* expect that speedup... :< )
At present the tool assumes that every instruction and control transfer can be
assigned its own WCET, in machine cycles, independent of other context or data
values. The only dynamic aspects of instruction execution time that are modelled are the possible pipeline stalls due to read-after-write interlocks. As a special case, in the version for SPARC, the concurrent execution of the Integer Unit and the Floating Point Unit is modelled to estimate where and for
how long the IU has to wait for the FPU to complete an operation.
Good algorithms for computing WCET bounds for most kinds of instruction caches
are known, but I did not get around to implementing any of those, so only cache-less processors are supported now. If you need WCET analysis of caches, the AbsInt tool is best.
Data caches are still a hard problem, where the WCET analyses tend to be quite
pessimistic. Moreover, the increasing use of eviction algorithms other than LRU
(for example, pseudo-random eviction) lessens the precision of the cache analyses, even for instruction caches.
Sometimes it is also necessary to modify or extend the tool parts that read the executable files and especially the debugging information -- some of that may be compiler-specific, unfortunately. It is a fair amount of work and requires a good understanding both of the target processor and of the Bound-T internal representation.
The latter will likely be the bigger effort. It requires trying to infer
your mindset when you began the design.
You could start by reading http://bound-t.com/whats-inside.html.
[I now include "rationales" in my documentation for exactly this purpose;
to let those "following" understand why particular choices were made
instead of choices that may (in hindsight) appear better]
Me too. Ceterum censeo that calling the in-code documentation "comments" was a
very poor choice of term and very misleading for what such documentation should
contain and its importance.
On 8/5/2021 8:39 AM, Niklas Holsti wrote:
On 2021-08-05 16:16, Don Y wrote:
On 8/5/2021 3:43 AM, Niklas Holsti wrote:
On 2021-08-05 1:46, Don Y wrote:
The source code is downloadable; see
http://bound-t.com/download/src. The copyright text is of my own
writing, but I'm open to switching to some better-known copyright
such as GPL or even some non-viral version.
Ah, OK. I may have a look at it to see how well it works and
how much work to port to ARM objects. Given that you've not
done so, already (despite your familiarity with the codebase),
I suspect that's not a trivial task?
There is an ARM TDMI port (32-bit ARM code and THUMB, pre Cortex),
but of course that architecture is out of date. I started a port to
ARM Cortex-M, but did not (yet) finish it for the reasons described
on the status page.
Note that the tool is implemented in Ada.
OK. It's been ~20 years since I wrote a line of Ada code but I doubt it will take much to "refresh" those memory cells. Esp as this likely doesn't have synchronization issues, RT constraints, etc.
Right. The only use of tasking features is to set an upper limit on
the execution time of the analysis, and that feature is easy to remove
if needed.
The code mostly relies on the 1995 Ada standard, with some use of Ada
2005 additions.
OK. I'll have to build a more current version of GNAT.
Porting to a new target processor means writing procedures to decode
the machine instructions and translate them to the internal
representation used by the analysis.
That shouldn't be hard -- except for the "timing uncertainties" in the
processor. (Did you rely on some blanket generalization -- like all
caches cold -- as you are looking for WORST case timing? OTOH, a
loop known to fit in a cache line *should* expect that speedup... :< )
At present the tool assumes that every instruction and control
transfer can be assigned its own WCET, in machine cycles, independent
of other context or data
So, you just sum worst case times? You don't, for example, alter
times based on conditionals?
I assume this is semi table-driven (?)
-- as a preface to inquiring as to
how you develop (and verify!) a test suite?
Good algorithms for computing WCET bounds for most kinds of
instruction caches are known, but I did not get around to implementing
any of those, so only cache-less processors are supported now. If you
need WCET analysis of caches, the AbsInt tool is best.
It's arguable whether (and to what extent) you should model I-cache
effects.
E.g., in a multithreaded environment, the cache can be purged at some frequency that a tool (or the developer, for that matter -- at least not
wrt the instruction stream) can't know. Aim high and be pleasantly surprised...
[How do you handle accesses to memories with different *basic* access
times?]
That said, for some processors it is easy to recognize at decode-time most of the instructions that access the stack, and some versions of Bound-T let one specify different access times for stack accesses and for general (unclassified) accesses. That can be useful if the stack is located in fast memory, but other data are in slower memory.
I'm thinking, specifically, about I/Os -- which are increasingly memory mapped (including register spaces).
The source code is downloadable; see http://bound-t.com/download/src. The copyright text is of my own writing, but I'm open to switching to some better-known copyright such as GPL or even some non-viral version.
Ah, OK. I may have a look at it to see how well it works and how much work to port to ARM objects. Given that you've not done so, already (despite your familiarity with the codebase), I suspect that's not a trivial task?
There is an ARM TDMI port (32-bit ARM code and THUMB, pre Cortex), but of course that architecture is out of date. I started a port to ARM Cortex-M, but did not (yet) finish it for the reasons described on the status page.
Note that the tool is implemented in Ada.
OK. It's been ~20 years since I wrote a line of Ada code but I doubt it will take much to "refresh" those memory cells. Esp as this likely doesn't have synchronization issues, RT constraints, etc.
Right. The only use of tasking features is to set an upper limit on the
execution time of the analysis, and that feature is easy to remove if needed.
The code mostly relies on the 1995 Ada standard, with some use of Ada 2005 additions.
OK. I'll have to build a more current version of GNAT.
For playing around, I would just use the GNAT Community Edition. Or the FSF GNAT that comes with MinGW32.
Porting to a new target processor means writing procedures to decode the machine instructions and translate them to the internal representation used by the analysis.
That shouldn't be hard -- except for the "timing uncertainties" in the processor. (Did you rely on some blanket generalization -- like all caches cold -- as you are looking for WORST case timing? OTOH, a loop known to fit in a cache line *should* expect that speedup... :< )
At present the tool assumes that every instruction and control transfer can be assigned its own WCET, in machine cycles, independent of other context or data
So, you just sum worst case times? You don't, for example, alter times based on conditionals?
(Note: I say "WCET" below, but really I should always say "upper bound on the WCET".)
The analysis works on the control-flow graph (per subprogram). The WCET for each basic block is the sum of the WCETs of the instructions in that block. The
WCET for an execution path through the graph (from entry to return) is the sum
of the WCETs of the basic blocks in the path, plus the WCETs assigned to each edge (control transfer) between basic blocks in the path.
The WCET for the
whole graph is computed by a pessimisation over all syntactically possible paths through the graph, using an Integer Linear Programming solver (lp_solve,
for Bound-T). This method is common in WCET analysis and is called IPET, for Implicit Path Enumeration Technique.
By "syntactically possible" I mean that the tool generally does not attempt to
find semantically or logically infeasible paths. For example, if a subprogram has this form:
if boo then
perform computation A;
else
perform computation B;
end if;
perform computation C;
then the WCET for the subprogram is estimated as max(WCET(A),WCET(B)) + WCET(C). But if the subprogram has this form:
if boo then
perform computation A;
end if;
perform computation C;
if not boo then
perform computation B;
end if;
then the WCET for the subprogram is estimated as WCET(A) + WCET(C) + WCET(B). In other words, the analysis (usually) does not discover that the conditions "boo" and "not boo" are mutually exclusive.
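To make the IPET step concrete (my notation, a textbook sketch rather than a quote from the tool's documentation): let x_i be the execution count of basic block i and c_i its WCET, and likewise x_e and c_e for each edge e. The ILP is

  maximize    sum_i c_i*x_i + sum_e c_e*x_e
  subject to  x_entry = 1,
              for each block i: (sum of x_e over incoming edges) = x_i = (sum of x_e over outgoing edges),
              plus loop-bound constraints such as x_backedge <= N * x_loopentry,

and the maximum of the objective is the WCET bound for the subprogram.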
I assume this is semi table-driven (?)
I don't understand that question. Please clarify.
-- as a preface to inquiring as to
how you develop (and verify!) a test suite?
In the same way as for any complex program. Final validation by running and measuring test code on a real processor or a cycle-accurate simulation.
Good algorithms for computing WCET bounds for most kinds of instruction caches are known, but I did not get around to implementing any of those, so only cache-less processors are supported now. If you need WCET analysis of caches, the AbsInt tool is best.
It's arguable whether (and to what extent) you should model I-cache effects.
E.g., in a multithreaded environment, the cache can be purged at some frequency that a tool (or the developer, for that matter -- at least not wrt the instruction stream) can't know. Aim high and be pleasantly surprised...
Such Cache-Related Preemption Delays (CRPDs) are a much-studied problem in WCET
analysis. The most common solution is to first find out how the various tasks and interrupts can access memory blocks which potentially may collide in the caches, and then to take the worst possible interference into account in the schedulability analysis, which does know (or assumes) the worst case frequency
of preemptions.
For processors where cache misses are much slower than cache hits (which is fast coming to mean almost all processors) IMO an I-cache analysis is necessary
for static WCET analysis to be useful.
[How do you handle accesses to memories with different *basic* access times?]
For code fetch the address is always statically known, so the correct access time can be selected and included in the WCET for the instruction when the instruction is decoded.
For data accesses, if different instructions are used for different memories (as in the 8051 architecture, for example), the proper access time can be included in the WCET for the instruction when the instruction is decoded.
If the same instructions are used to access memories with different access times, depending on the address, and the memory addresses are dynamically determined at run time, the practical but pessimistic way is to use the largest
of the access times, that is, assume that the slowest memory is accessed. In principle the tool can try to analyze the address computations and might find sufficient bounds on the address to ensure that a faster memory is accessed, but the analysis methods currently available in Bound-T seldom manage that, and
can be quite slow.
That said, for some processors it is easy to recognize at decode-time most of the instructions that access the stack, and some versions of Bound-T let one specify different access times for stack accesses and for general (unclassified) accesses. That can be useful if the stack is located in fast memory, but other data are in slower memory.
On Thu, 5 Aug 2021 18:39:51 +0300, Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:
   :
At present the tool assumes that every instruction and control transfer can be assigned its own WCET, in machine cycles, independent of other context or data values. The only dynamic aspects of instruction execution time that are modelled are the possible pipeline stalls due to read-after-write interlocks. As a special case, in the version for SPARC, the concurrent execution of the Integer Unit and the Floating Point Unit is modelled to estimate where and for how long the IU has to wait for the FPU to complete an operation.
Good algorithms for computing WCET bounds for most kinds of instruction caches are known, but I did not get around to implementing any of those, so only cache-less processors are supported now. If you need WCET analysis of caches, the AbsInt tool is best.
Data caches are still a hard problem, where the WCET analyses tend to be quite pessimistic. Moreover, the increasing use of eviction algorithms other than LRU (for example, pseudo-random eviction) lessens the precision of the cache analyses, even for instruction caches.
   :
How do you handle branch/jump mispredicts?
More importantly, how do you handle chains of conditional branches
and/or switch/case constructs which can mispredict at every decision
point? The branch targets may not be in cache - neither mispredicted
targets nor the actual one. Worst case for a horribly mispredicted switch/case can be absolutely dreadful.
On 8/6/2021 12:58 PM, Niklas Holsti wrote:
OK. I'll have to build a more current version of GNAT.
For playing around, I would just use the GNAT Community Edition. Or
the FSF GNAT that comes with MinGW32.
Ugh! No, I'll just build a fresh copy. I do all my development
under NetBSD so have everything I want/need ('cept the Ada tools)
there, already.
The analysis works on the control-flow graph (per subprogram). The
WCET for each basic block is the sum of the WCETs of the instructions
in that block. The WCET for an execution path through the graph (from
entry to return) is the sum of the WCETs of the basic blocks in the
path, plus the WCETs assigned to each edge (control transfer) between
basic blocks in the path.
So, the costs of conditional control transfers are handled separately (?)
I assume this is semi table-driven (?)
I don't understand that question. Please clarify.
A large number of opcodes, each with particular costs.
Do you build a jungle of conditionals to subdivide the
"opcode space" into groups of similar-cost operations?
Or, do you just have a table of:
{opcode, mask, type[1], cost}
[1] used to apply some further heuristics to your
refinement of "cost"
-- as a preface to inquiring as to
how you develop (and verify!) a test suite?
In the same way as for any complex program. Final validation by
running and measuring test code on a real processor or a
cycle-accurate simulation.
Ah, OK. So, you could "can" that particular exemplar and use it
to test a modification to the code (without having to run the app
on "real hardware", again)
I don't rely on "live" tests for my code. Rather, I use tools to
generate good test coverage and then verify the results are what I
expect. I find this easier and more easily extensible (I can test
ARM code by running an x86 port of that code)
For processors where cache misses are much slower than cache hits
(which is fast coming to mean almost all processors) IMO an I-cache
analysis is necessary for static WCET analysis to be useful.
I look at it the other way around. Assume there is NO cache. You
know that your code WILL run in less time than this case. Regardless
of the frequency or presence of competing events.
Processors are fast enough, now, that you can usually afford to "step up"
in capability for comparably little cost.
On 8/6/2021 4:04 PM, Don Y wrote:
That said, for some processors it is easy to recognize at decode-time
most of the instructions that access the stack, and some versions of
Bound-T let one specify different access times for stack accesses and
for general (unclassified) accesses. That can be useful if the stack
is located in fast memory, but other data are in slower memory.
I'm thinking, specifically, about I/Os -- which are increasingly memory
mapped (including register spaces).
Sorry, I should be more clear. I'm looking at the issues that would affect *my* needs in *my* environment -- realizing these may be different than
those you've previously targeted. (e.g., VMM, FPU emulation, etc.)
On 2021-08-07 2:04, Don Y wrote:
On 8/6/2021 12:58 PM, Niklas Holsti wrote:
OK. I'll have to build a more current version of GNAT.
For playing around, I would just use the GNAT Community Edition. Or the FSF GNAT that comes with MinGW32.
Ugh! No, I'll just build a fresh copy. I do all my development
under NetBSD so have everything I want/need ('cept the Ada tools)
there, already.
You are braver than I am. Note that the front-end of GNAT is implemented in Ada, so it has to be bootstrapped with an Ada compiler. And the current version
of GNAT requires quite an up-to-date Ada compiler for the bootstrap. In practice, I think people have found that only GNAT can build GNAT.
Many Linux distributions come with GNAT pre-built. Debian-derived distributions
have good support, I hear.
The analysis works on the control-flow graph (per subprogram). The WCET for each basic block is the sum of the WCETs of the instructions in that block. The WCET for an execution path through the graph (from entry to return) is the sum of the WCETs of the basic blocks in the path, plus the WCETs assigned to each edge (control transfer) between basic blocks in the path.
So, the costs of conditional control transfers are handled separately (?)
I'm not sure I understand your question. Each edge in the control-flow graph, that is, each transition from one instruction or one basic block to a possible
next instruction or block, is assigned a WCET value when the instructions are decoded and entered in the control-flow graph. Usually this WCET is zero for a
fall-through edge from a non-branch instruction to the next, and is non-zero only for edges from branch instructions, whether conditional or unconditional.
For a conditional branch, the "not taken" edge usually has a zero WCET and the
"taken" edge has some non-zero WCET, as specified in the processor documentation (remember we are assuming a simple processor).
I assume this is semi table-driven (?)
I don't understand that question. Please clarify.
A large number of opcodes, each with particular costs.
Do you build a jungle of conditionals to subdivide the
"opcode space" into groups of similar-cost operations?
Or, do you just have a table of:
{opcode, mask, type[1], cost}
[1] used to apply some further heuristics to your
refinement of "cost"
I would say that the approach to translating instructions into their analysis models (elements in the control-flow graph) is usually not table-driven, but emulates the field-by-field decoding and case analysis that would be done by a
disassembler or emulator (and Bound-T can produce a disassembly of the code).
The internal model includes not only the cost of the instruction, but also its
effect on the computation. For example, an instruction like "increment the accumulator", INC A, would be decoded into a model that says
WCET: 1 cycle.
Control flow: to next instruction, unconditional.
Effect: A := A + 1.
and also the effect, if any, on condition flags.
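As a very rough sketch in C of what one such decoded entry could carry (hypothetical types and names -- Bound-T itself is written in Ada and its real model is richer, carrying a symbolic effect like "A := A + 1" rather than a summary):

#include <stdint.h>

typedef enum { FLOW_FALLTHROUGH, FLOW_BRANCH, FLOW_CALL, FLOW_RETURN } flow_t;

/* Hypothetical, simplified model of one decoded instruction. */
typedef struct {
    uint32_t wcet_cycles;   /* cost of the instruction itself               */
    flow_t   flow;          /* how control can leave the instruction        */
    int32_t  stack_delta;   /* net change in local stack height, in bytes   */
    /* the symbolic effect on registers and flags would also live here      */
} insn_model_t;

/* e.g. "INC A" on a simple core might decode to:
 *   (insn_model_t){ .wcet_cycles = 1, .flow = FLOW_FALLTHROUGH, .stack_delta = 0 }
 */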
I don't rely on "live" tests for my code. Rather, I use tools to
generate good test coverage and then verify the results are what I
expect. I find this easier and more easily extensible (I can test
ARM code by running an x86 port of that code)
At my former job, implementing SW for ESA satellite applications, we almost always tested the code first on normal PCs, in a simulated I/O environment, and
then on the embedded target. Easier for Ada than for C.
For processors where cache misses are much slower than cache hits (which is fast coming to mean almost all processors) IMO an I-cache analysis is necessary for static WCET analysis to be useful.
I look at it the other way around. Assume there is NO cache. You
know that your code WILL run in less time than this case. Regardless
of the frequency or presence of competing events.
Processors are fast enough, now, that you can usually afford to "step up"
in capability for comparably little cost.
Sometimes, but not always. The ESA applications we made were usually running (worst case) at about 70% processor load, and faster processors for space applications were very expensive, not only in euros but also in power, volume and mass, which are scarce resources in space.
On 8/7/2021 4:09 AM, Niklas Holsti wrote:
The internal model includes not only the cost of the instruction, but
also its effect on the computation. For example, an instruction like
"increment the accumulator", INC A, would be decoded into a model that
says
WCET: 1 cycle.
Control flow: to next instruction, unconditional.
Effect: A := A + 1.
and also the effect, if any, on condition flags.
Why do you care about the "effect"? Are you trying to determine
which branches will be taken? How many iterations through loops?
etc.
Doesn't this require a fairly complete model of the processor?
My "cost constrained" days were back in the days of the "small/cheap
CPU" where you had to predict performance accurately -- but, doing so
was relatively easy (because processors had very predictable
instruction execution times).
For stack usage, the relevant effect of an instruction is just to increase or decrease the local stack height by a static constant.
I meant to say that the change in local stack height is _often_ a static constant, not that it _always_ is static.
On 2021-08-07 2:06, Don Y wrote:
On 8/6/2021 4:04 PM, Don Y wrote:
That said, for some processors it is easy to recognize at decode-time most of the instructions that access the stack, and some versions of Bound-T let
one specify different access times for stack accesses and for general
(unclassified) accesses. That can be useful if the stack is located in fast
memory, but other data are in slower memory.
I'm thinking, specifically, about I/Os -- which are increasingly memory
mapped (including register spaces).
Sorry, I should be more clear. I'm looking at the issues that would affect *my* needs in *my* environment -- realizing these may be different than
those you've previously targeted. (e.g., VMM, FPU emulation, etc.)
Ok. If you describe your needs, perhaps I can comment.
In my experience, access to memory-mapped registers tends to be not much slower
than access to memory. Also, quite often the addresses in MMIO accesses are static or easy to derive in static analysis, so the WCET tool could choose the
proper access time, if the tool knows enough about the system architecture.
If by "FPU emulation" you mean SW-implemented FP instructions, of course we have encountered those. They are often hand-coded assembler with very complex and tricky control flow, which makes it hard to find loop bounds automatically,
so manual annotations must be used instead.
What is VMM?
On 2021-08-07 18:09, Don Y wrote:
On 8/7/2021 4:09 AM, Niklas Holsti wrote:
(On how the Bound-T tool models instructions for stack and WCET analysis:)
The internal model includes not only the cost of the instruction, but also its effect on the computation. For example, an instruction like "increment the accumulator", INC A, would be decoded into a model that says
WCET: 1 cycle.
Control flow: to next instruction, unconditional.
Effect: A := A + 1.
and also the effect, if any, on condition flags.
Why do you care about the "effect"? Are you trying to determine
which branches will be taken? How many iterations through loops?
etc.
The effect is necessary for any analysis that depends on run-time values. Loop
bounds are the prime example for WCET analysis, but stack-usage analysis also needs it. For example, the effect of a push instruction is to increment the "local stack height" model variable by some number that reflects the size of the pushed data (which size can, in principle, be non-static and computed at run time).
The stack-usage analysis finds the maximum possible value of the local stack height in each subprogram, both globally and at each call site within the subprogram, then adds up the local stack heights along each possible call path
to find the total usage for that call path, and then finds and reports the path
with the largest total usage.
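(A toy C sketch of the path summation just described: each subprogram has a local maximum stack height plus the stack height at each of its call sites, and the worst case is taken over all call paths. The structure and names are invented for illustration; real tools work this out from the executable's call graph.)

    #define MAX_CALLS 4

    struct subprogram {
        const char *name;
        int local_max;                        /* max local stack height       */
        int n_calls;
        int call_height[MAX_CALLS];           /* local stack height at call i */
        const struct subprogram *callee[MAX_CALLS];
    };

    int worst_case_stack(const struct subprogram *s) {
        int worst = s->local_max;
        for (int i = 0; i < s->n_calls; i++) {
            int via_call = s->call_height[i] + worst_case_stack(s->callee[i]);
            if (via_call > worst)
                worst = via_call;
        }
        return worst;   /* assumes an acyclic call graph (no recursion) */
    }

Starting the walk from each task's entry subprogram would give the per-task figure to compare against its allocated stack.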
Doesn't this require a fairly complete model of the processor?
Yes indeed. The model could be somewhat simpler for stack-usage analysis than for WCET analysis. For stack usage, the relevant effect of an instruction is just to increase or decrease the local stack height by a static constant. If push/pop happens in loops they are usually balanced so that a loop iteration has no net effect on stack usage, making loop iteration bounds unnecessary.
My "cost constrained" days were back in the days of the "small/cheap
CPU" where you had to predict performance accurately -- but, doing so
was relatively easy (because processors had very predictable
instruction execution times).
And that is the kind of processor that Bound-T was designed to support for WCET
analysis. Stack-usage analysis is not so sensitive.
On 8/7/2021 4:20 AM, Niklas Holsti wrote:
Ok. If you describe your needs, perhaps I can comment.
I'm targeting some of the bigger ARM offerings -- A53/55. My
impression is they go through more "contortions" to eke out
additional performance than some of the smaller processors.
In my experience, access to memory-mapped registers tends to be not
much slower than access to memory. Also, quite often the addresses in
MMIO accesses are static or easy to derive in static analysis, so the
WCET tool could choose the proper access time, if the tool knows
enough about the system architecture.
Yes, that last phrase being the kicker. Can this be simplified to
something as crude as a "memory map"?
What is VMM?
Virtual Memory Management. I.e., when an opcode fetch (or argument reference) can not only take longer than a cache miss... but
*considerably* longer as the physical memory is mapped in while the instruction stream "stalls".
[Note that a page fault need not map physical memory in the
traditional sense. It can also cause some "special" function to be
invoked to provide the requisite data/access. So, the cost of a
fault can vary depending on *what* is faulting and which pager is
handling that fault]
On 8/7/2021 9:50 AM, Niklas Holsti wrote:
On 2021-08-07 18:09, Don Y wrote:
On 8/7/2021 4:09 AM, Niklas Holsti wrote:
(On how the Bound-T tool models instructions for stack and WCET
analysis:)
The internal model includes not only the cost of the instruction,
but also its effect on the computation. For example, an instruction
like "increment the accumulator", INC A, would be decoded into a
model that says
WCET: 1 cycle.
Control flow: to next instruction, unconditional.
Effect: A := A + 1.
and also the effect, if any, on condition flags.
Why do you care about the "effect"? Are you trying to determine
which branches will be taken? How many iterations through loops?
etc.
The effect is necessary for any analysis that depends on run-time
values. Loop bounds are the prime example for WCET analysis, but
stack-usage analysis also needs it. For example, the effect of a push
instruction is to increment the "local stack height" model variable by
some number that reflects the size of the pushed data (which size can,
in principle, be non-static and computed at run time).
OK. But, you're not trying to emulate/simulate the complete algorithm;
just handle the side effects of value changes.
But wouldn't you *have* to do a more thorough emulation?
void diddle(void);                 /* some work done each iteration */

void foo(int count) {
    /* The loop bound depends on the run-time value of count. */
    for (int i = 0; i < 5 * count; i++) {
        diddle();
    }
}
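(An illustration of the question, not a claim about what Bound-T actually does: whether such a loop can be bounded automatically depends on what the analysis can learn about count.)

    void foo(int count);   /* from the example above */

    /* If every call site passes a known constant, a value analysis can in
       principle derive the bound (here 5 * 3 = 15 iterations of diddle());
       if count comes from external input, a manual annotation or assumed
       limit on count would be needed instead.  Illustrative only. */
    void caller(void) {
        foo(3);
    }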
On 2021-08-07 20:26, Don Y wrote:
On 8/7/2021 4:20 AM, Niklas Holsti wrote:
(In a discussion about WCET analysis more than stack analysis:)
Ok. If you describe your needs, perhaps I can comment.
I'm targeting some of the bigger ARM offerings -- A53/55. My
impression is they go through more "contortions" to eke out
additional performance than some of the smaller processors.
Out of scope for Bound-T, I'm sure. Even the AbsInt tool has no support for static WCET analysis of those processors and their ARM page suggests a hybrid tool instead (https://www.absint.com/ait/arm.htm).
In my experience, access to memory-mapped registers tends to be not much slower than access to memory. Also, quite often the addresses in MMIO accesses are static or easy to derive in static analysis, so the WCET tool could choose the proper access time, if the tool knows enough about the
system architecture.
Yes, that last phrase being the kicker. Can this be simplified to something as crude as a "memory map"?
It seems so, if the access time is simply a function of the accessed address.
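(A minimal sketch in C of such a "crude memory map" -- address ranges mapped to access times. The ranges and cycle counts are made up, not taken from any particular chip or tool.)

    #include <stddef.h>
    #include <stdint.h>

    struct region { uint32_t base, limit; int access_cycles; };

    /* Hypothetical map: the analysis looks up the (statically known)
       address and charges the corresponding cost. */
    static const struct region memory_map[] = {
        { 0x00000000u, 0x0003FFFFu, 1 },   /* on-chip flash   (example) */
        { 0x20000000u, 0x2000FFFFu, 1 },   /* on-chip SRAM    (example) */
        { 0x40000000u, 0x5FFFFFFFu, 3 },   /* peripheral MMIO (example) */
        { 0x60000000u, 0x9FFFFFFFu, 8 },   /* external RAM    (example) */
    };

    int access_cost(uint32_t addr) {
        for (size_t i = 0; i < sizeof memory_map / sizeof memory_map[0]; i++)
            if (addr >= memory_map[i].base && addr <= memory_map[i].limit)
                return memory_map[i].access_cycles;
        return 8;   /* pessimistic default for unclassified addresses */
    }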
What is VMM?
Virtual Memory Management. I.e., when an opcode fetch (or argument
reference) can not only take longer than a cache miss... but
*considerably* longer as the physical memory is mapped in while the
instruction stream "stalls".
I don't recall seeing any static WCET analysis for page faults, but there may be some for Translation Look-aside Buffer misses. Out of my competence, and out
of scope for Bound-T, certainly.
[Note that a page fault need not map physical memory in the
traditional sense. It can also cause some "special" function to be
invoked to provide the requisite data/access. So, the cost of a
fault can vary depending on *what* is faulting and which pager is
handling that fault]
You may be able to map some of that into a very capable schedulability analyzer, one that can handle chains of "tasks" passing data/messages to each other. But translating the application logic and system behaviour into a model
for such a schedulability analyzer is not trivial.
On 2021-08-07 5:55, George Neuner wrote:
More importantly, how do you handle chains of conditional branches
and/or switch/case constructs which can mispredict at every decision
point? The branch targets may not be in cache - neither mispredicted
targets nor the actual one. Worst case for a horribly mispredicted
switch/case can be absolutely dreadful.
I don't know if those questions are addressed to me (re the Bound-T
tool) or to WCET analysis in general.
The advanced WCET tools, like the
ones from AbsInt, do try to cover such cases by their cache analysis to provide a safe upper bound on the WCET, but the degree of pessimism (over-estimation) increases with the complexity of the code.
That said, in many programs most of the processing time is taken by relatively simple loops, for which the present I-cache analyses work quite well, and for which even D-cache analysis can (I believe) work satisfactorily if the memory addressing patterns are regular.
On Sat, 7 Aug 2021 13:40:19 +0300, Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:
On 2021-08-07 5:55, George Neuner wrote:
More importantly, how do you handle chains of conditional branches
and/or switch/case constructs which can mispredict at every decision
point? The branch targets may not be in cache - neither mispredicted
targets nor the actual one. Worst case for a horribly mispredicted
switch/case can be absolutely dreadful.
I don't know if those questions are addressed to me (re the Bound-T
tool) or to WCET analysis in general.
A little of both really. I was hoping you had some smart(er) approach
to estimating misprediction effects in systems that use dynamic
prediction - even if it's just heuristic.
A lot of more powerful chips now are being used even in 'small'
systems, and some of them do have dynamic prediction. Note that Don
is talking about Cortex A53, A55, etc.
Dynamic prediction handles loop control quite well ... it's all the
other branching code that is the problem.
WCET has use outside the 'embedded' world also. A lot of my work was
in QA/QC vision systems: tons of code, tons of features, workstation
class processors, and still having to be /hard real time/.
If you are not satisfied with Don's approach (extreme over-provision of processor power)
you could try the hybrid WCET-estimation tools (RapiTime or
TimeWeaver) which do not need to model the processors, but need to measure fine-grained execution times (on the basic-block level). The problem with such
tools is that they cannot guarantee to produce an upper bound on the WCET, only
a bound that holds with high probability. And, AIUI, at present that probability cannot be computed, and certainly depends on the test suite being measured. For example, on whether those tests lead to mispredictions in chains
of conditional branches.
And, AIUI, at present that probability cannot be
computed, and certainly depends on the test suite being measured. For example, on whether those tests lead to mispredictions in chains of conditional branches.
Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
And, AIUI, at present that probability cannot be
computed, and certainly depends on the test suite being measured. For
example, on whether those tests lead to mispredictions in chains of
conditional branches.
Maybe architectural simulation of the target CPU can help, if the architecture is known (i.e., the exact workings of the pipelines, branch predictors, etc.).
And maybe forthcoming RISC-V CPUs will be more open about this than the current ARM stuff is.