Stacks - Myths and Run Time Monitoring
Stacks, the forgotten part of embedded programming
Yes, that important part of applications and operating systems that the vast majority of people are either totally dismissive of or
ignorant of. Some developers, like Java developers, get used to stack traces as a way of decoding what went wrong.
However, most embedded developers just rely on the chip/board manufacturers' linker scripts and libraries to sort it out for
them. Many get into trouble later as they have not planned their system or do not know how to handle fault conditions internal or
In this article I am looking at embedded systems that are SMALL, as in they have -
- NO operating system
- NO memory management or protection mechanisms
- NO message queues
- NO multiple threads
- USE simple methods for timed events
- USE 'super loop' or simple time driven co-operative scheduling of tasks/events/processes
Review of stack contents
People forget that the stack stores more than addresses, it stores data like -
- Saved registers from upper level so registers can be used for parameter passing
- Parameters for the function beyond what can be stored in registers
- Local variables in your function or block of code
- Often, for object orientated code, the object pointer used for 'this->'
- Other compiler specific details that compiler may need for handling the return from the function
- For interrupts most processors also save the processor status and sometimes other details as well as the return address
Stack operations are usually done in multiples of address width and/or register width, so even if you save an 8 bit value on the
stack, on a 32 bit processor it takes up the space of 32 bits as that is the width of the registers and address bus.
Myths about Stacks that most people have
Because most developers take the stack for granted I see some common myths repeated -
- Making a local variable uint8_t will save space.
A local variable NOT defined as static lives on the stack (or in a register the compiler may have to save), which allows
recursion and easy disposal, freeing up the space later.
As the stack on 32 bit systems uses 32 bit operations, this does not save space but may add overhead for three other bytes
to be zeroed. It may be useful for other aspects of type matching and range limits built into the compiler.
On some 8 bit systems like AVR the AVR-GCC port makes all variables 16 bit and uses register pairs as 16 bit for all
- Number of parameters passed does not matter
If lots of your functions have more than about FOUR parameters then, unless you ABSOLUTELY KNOW that THAT compiler uses
lots of registers for parameter passing, you will be passing many of the parameters on the stack. Even if your compiler
uses up to EIGHT registers for parameter passing, many of those registers may well have to be saved before calling
your function and restored afterwards, adding calling overhead to your execution time.
An example of bad parameter passing is from an actual chip manufacturer library as follows (wrapped from one long line
to split lines) -
static USBH_PIPE_HANDLE USBH_PipeCreate (uint8_t dev_addr,
uint8_t dev_speed, uint8_t hub_addr, uint8_t hub_port,
uint8_t ep_addr, uint8_t ep_type, uint16_t ep_max_packet_size,
Yes that is EIGHT parameters, mainly 8 bit or 16 bit values, however on a 32 bit processor that is 32 bits in either a
register or on the stack for EACH parameter, plus all the calling overhead to handle them. The worst part of that example is
that the three parameters on the second line are passed in but were NEVER used.
- Registers are always used for parameter passing so we do not need a large Stack.
Registers are used for the first few parameters, but as mentioned above you still need space to save old contents of registers
before the function and restore them later.
Hopefully whenever your application runs it does not end up going down many layers of software, meaning you have a very large
function call depth. That is functions, calling functions, calling functions, calling functions ... ad infinitum.
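As a minimal sketch of one way to cut down a long parameter list, related values can be grouped in a structure and passed by
pointer, which typically costs one register or stack slot per call. The names ep_config_t and pipe_create are hypothetical,
chosen for illustration only, and are not from any manufacturer's library.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical endpoint description; the field names are illustrative only. */
typedef struct {
    uint8_t  dev_addr;
    uint8_t  ep_addr;
    uint8_t  ep_type;
    uint16_t ep_max_packet_size;
} ep_config_t;

/* One pointer parameter instead of several small values: the caller fills
   the structure once and passes only its address. */
static int pipe_create(const ep_config_t *cfg)
{
    if (cfg == NULL || cfg->ep_max_packet_size == 0u) {
        return -1;                  /* reject an unusable description */
    }
    /* A real driver would program the hardware here; the sketch just
       returns a value derived from the input so the call can be checked. */
    return (int)cfg->ep_addr;
}
```

The structure itself still has to live somewhere, often on the caller's stack, so the saving comes from building it once and
handing the same pointer down through several call levels rather than copying every value at each level.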
Common Problems with Stacks
Lots of things go wrong in systems, often due to circumstances not expected or, worse still, not even thought of or tested for in
the original design. Some things like cosmic radiation bit flips, faulty memory locations/bits or processor internal faults are
beyond the realm of this article. What we will cover here is hardware that is working internally, and design issues.
Stack problems often appear at random times and are often seen as 'odd' or 'strange' behaviour of the software, beyond what is
expected, or as the system appearing to 'hang'. These can be caused by other issues as well, but often the working system is now
off doing something arbitrary, possibly even executing data as code.
These situations are usually because -
- Something else has written over the Stack
- The Stack has grown too much and is now writing over another part of memory (stack over run)
The first can be for many reasons, most commonly a memory leak causing the heap to grow too far and overwrite the Stack. Sometimes
it is due to design and testing issues we cannot cover here, such as software bugs writing over random memory areas or incorrectly
set up DMA controllers. If lucky we might be able to catch some events and see if something can be done.
Some Causes of Stack Corruption and Corrupting Elsewhere
- Too Small
Someone just made it too small in the first place, not understanding function call depth and simultaneous interrupt
requirements. It could also have worked before, but a new person is now adding code in which each function needs at least 12
parameters. Some actual monitoring statistics can go a long way to finding out what you need and how much.
- Too many interrupts
Either you planned the system to take 'x' interrupts a second and they are all arriving within 1 ms instead of spread over the
whole second, or there is a lack of proper handling of things like FIFOs to reduce interrupt overhead.
This could also be faulty inputs interrupting too often, but to analyse this in software you need a way of counting how many
interrupts a second you actually have, to determine that this is a fault condition. For example a car door sensor is toggling,
saying the door is being opened and closed 100 times a second, and is adding extra load to the system.
- Bad recursion
We have all seen it: runaway recursion that just gobbles up stack.
You need a way to catch this externally (see later).
- Stack Placement
This can be a secondary problem to the other causes listed. When the stack grows too big (over run), what does it then write
over? The TWO worst placements of Stack I have seen are -
- Above the interrupt vectors, so the vector table is completely scrambled. On the next interrupt ANYTHING could happen.
- At absolute bottom of physical RAM, so once an over run occurs no useful data is saved on the stack and any references to
the stack contents will get RANDOM results. Watchdog reset is the only thing that might eventually save you,
hopefully before any controlled machinery does any damage.
The main thing I always note in these situations is that there is usually -
- Poor or no system design or stack requirements specified (it just happens)
- Lack of resource maps of how many devices interrupt, at what priority, with what expected ranges of interrupt frequency, and
which interrupts block all other interrupts.
Also a lack of resource maps showing the types of interrupt in use. I often see vectors set up and interrupt routines for
peripherals that are actually turned off, polled or just using DMA.
- Lack of Stack monitoring
- Lack of interrupt activity level monitoring
The first two this article can do little about, as that is usually down to an organisation's management and to actually doing
engineering level reviews and design of software, which many do not do.
Monitoring of Stack and Interrupts
- Ability to monitor
- Ability to log and save the state across reset or power cycle
- Action plan for what to do in the different circumstances the information reveals
So let's start with interrupts and then go on to Stack monitoring.
The easiest way to monitor interrupts is an array of counters, 8 bit or 16 bit depending on your interrupt frequency per SECOND.
Each counter should -
- Be separate for every interrupt source (even one for EACH bit of GPIO interrupt); include any timer ticks here as well
- Count up to its maximum, incrementing by 1 for each interrupt
- NOT wrap
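A minimal sketch of such counters follows, assuming eight interrupt sources and 8 bit counters. The names irq_count and
irq_count_hit and the value of IRQ_SOURCES are illustrative assumptions, not taken from any particular system.

```c
#include <stdint.h>

#define IRQ_SOURCES 8u   /* assumed number of interrupt sources */

/* One counter per interrupt source, written from interrupt context. */
static volatile uint8_t irq_count[IRQ_SOURCES];

/* Call at the top of each interrupt handler with that handler's own index. */
static void irq_count_hit(uint8_t source)
{
    if (source < IRQ_SOURCES && irq_count[source] < UINT8_MAX) {
        irq_count[source]++;        /* stops at 255, never wraps to 0 */
    }
}
```

Because every source has its own counter, one handler never touches another handler's count, which keeps the increment cheap.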
Once a second you can then decide what to do before resetting ALL counters, such as -
- If any counter is above a set threshold, save the whole table
- Save the total of interrupts, if nothing else as a RAM variable holding the maximum interrupt frequency seen.
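The once a second pass could be sketched as below, under the assumption of 8 bit counters held in an array. The function name
irq_review, its parameters and the variable irq_max_total are hypothetical. On a real target the read and clear of each counter
would normally be protected by briefly disabling interrupts, which is left out here.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

static uint32_t irq_max_total;   /* highest per-second total seen so far */

/* Run once a second from the main loop. Copies every counter into
   'snapshot', clears it, and tracks the highest total seen. Returns true
   when any counter was above 'threshold', which is the caller's cue to
   save the snapshot table. */
static bool irq_review(volatile uint8_t *count, uint8_t *snapshot,
                       size_t n, uint8_t threshold)
{
    uint32_t total = 0u;
    bool over = false;

    for (size_t i = 0u; i < n; i++) {
        uint8_t c = count[i];
        count[i] = 0u;              /* reset ready for the next second */
        snapshot[i] = c;
        total += c;
        if (c > threshold) {
            over = true;
        }
    }
    if (total > irq_max_total) {
        irq_max_total = total;
    }
    return over;
}
```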
When logging, if you have some spare EEPROM or external Flash space, consider saving the table there, preferably with two copies:
one copy for the first time saved and the other always overwritten with the latest HIGHEST value. If you can save timestamps and
the total along with anything else useful, do so.
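The two copy idea can be sketched as follows. Two RAM structures stand in for the EEPROM or Flash records, and only a timestamp
and total are kept instead of the whole table. The type irq_log_t and the names log_first, log_highest and irq_log_event are
assumptions made for this illustration.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     valid;
    uint32_t timestamp;   /* whatever time base the system has */
    uint32_t total;       /* total interrupts in the logged second */
} irq_log_t;

/* Stand-ins for two records in EEPROM or external Flash. */
static irq_log_t log_first;
static irq_log_t log_highest;

static void irq_log_event(uint32_t timestamp, uint32_t total)
{
    irq_log_t rec = { true, timestamp, total };

    if (!log_first.valid) {
        log_first = rec;            /* written once, then left alone */
    }
    if (!log_highest.valid || total > log_highest.total) {
        log_highest = rec;          /* only replaced by a higher total */
    }
}
```

Writing the second record only when the total is higher also keeps the number of non-volatile writes down.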
What actions you take depends on your application and on each interrupt source, as in can you disable that source and carry on, or
do you need to reset a peripheral, or the whole system.
Some actions could be -
- If total above a threshold save table and increase threshold.
- If certain ones beyond reasonable thresholds log as fault on that input and take suitable remedial safe action. Like car sensors.
- Log problem and reset peripheral
- Put system in safe state and stop
- Perform watchdog reset (watching for watchdog loops)
Stack monitoring has some similar attributes and needs a few values saved (preferably in your data section), but adds very little
overhead, most of which should be done in your 'idle task' or whatever you do when waiting to be able to start anything.
The assumption here is that you have a fairly standard memory layout with vectors, data, bss and heap starting at the bottom of
RAM and the Stack starting at the top of RAM.
First make sure you can access or create linker symbols for -
- Start of Stack
- End of Stack (lowest stack address)
- Size of Stack
- End of heap
Usually at start up there is a memory gap between end of heap and end of stack.
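As a rough sketch of how these values relate, the structure below holds the three addresses and derives the stack size and the
heap to stack gap from them. On a target the addresses would come from the linker symbols, whose names vary between linker
scripts, so they are deliberately not shown. The names mem_map_t, map_stack_size and map_gap_size are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Addresses that on a target would come from linker symbols; the field
   names are illustrative, not taken from any particular linker script. */
typedef struct {
    uintptr_t stack_start;  /* highest address: initial stack pointer */
    uintptr_t stack_end;    /* lowest address the stack may use */
    uintptr_t heap_end;     /* first address past the heap */
} mem_map_t;

static size_t map_stack_size(const mem_map_t *m)
{
    return (size_t)(m->stack_start - m->stack_end);
}

/* Free gap between heap and stack; 0 if they already touch or overlap. */
static size_t map_gap_size(const mem_map_t *m)
{
    return (m->stack_end > m->heap_end)
         ? (size_t)(m->stack_end - m->heap_end) : 0u;
}
```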
Create some variables for -
- Current Stack maximum size
- Clean Heap/gap before stack
- Four or more threshold values for stack actions as in examples of
These can easily be created and initialised at start up.
If you are going to log (dump stack and variables) somewhere, make sure that even at emergency action level there is enough stack
depth left to disable interrupts and call all the necessary function levels to save all the data you need to some non-volatile
storage. Then do even the worst case action of either halt or watchdog reset.
Similar to interrupts, it is best if you can have a storage area for at least TWO copies, so the first time is always saved, then
any others are saved as the latest HIGHEST problem, so even after a power cycle you will still have the first time it happened.
Right so these are some of the things you need, how do you use them?
The first thing is that ALL timer tick like functions should do simple stack threshold tests, to catch runaway recursion or nested
interrupts being continuously called, then take appropriate action. Once a threshold has been reached, it should obviously only
act again at the next threshold level, to avoid overloading the system.
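A sketch of such a threshold test is shown below, assuming a stack that grows downwards and four threshold addresses set up at
start up. How the stack pointer is read (a compiler intrinsic, a line of assembler, or the address of a local variable as an
approximation) is target specific and left to the caller. The names stack_threshold, stack_level_reported and stack_check are
assumptions for illustration.

```c
#include <stdint.h>

#define STACK_LEVELS 4u

/* Threshold addresses, highest first (stack grows downwards): each lower
   address is a more serious level. Filled in at start up. */
static uintptr_t stack_threshold[STACK_LEVELS];
static uint8_t   stack_level_reported;   /* 0 = nothing reported yet */

/* Returns the newly reached level (1..STACK_LEVELS), or 0 when the stack
   pointer has not crossed a level beyond the one already reported. */
static uint8_t stack_check(uintptr_t sp)
{
    uint8_t level = 0u;

    while (level < STACK_LEVELS && sp <= stack_threshold[level]) {
        level++;
    }
    if (level > stack_level_reported) {
        stack_level_reported = level;   /* act once per level only */
        return level;
    }
    return 0u;
}
```

A timer tick would call stack_check with the current stack pointer and act only on a non-zero result, so a stack that sits at
the same depth does not trigger the same action on every tick.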
The second monitoring method requires -
- At start up, fill the bottom half of the stack, and the same size again below the stack, with a fixed 32 bit pattern like
0xDEADFACE or 0x5555AAAA. Set your variables for current maximum stack at half stack, and the clean heap before stack to the
lowest point filled with the pattern.
- As an 'idle task' or as part of your 'super loop' when waiting, you can then see if the pattern still exists at both locations,
and if the stack pointer is at any of the thresholds.
- If either end has been overwritten you can, by a binary chop search for the pattern, find where the new points are and log them
for the next pass and for analysis.
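The fill and the binary chop for the stack end might look like the sketch below. It assumes a downward growing stack and that
every word above the deepest point reached has been overwritten with something other than the pattern, which is what makes a
binary chop valid; a stack frame that leaves words untouched, or data that happens to equal the pattern, can fool it. The
function names are hypothetical. The heap end is the mirror image, searching upwards for the first clean word.

```c
#include <stdint.h>
#include <stddef.h>

#define FILL_PATTERN 0x5555AAAAu

/* Fill 'words' 32 bit words starting at 'base' (the lowest address). */
static void stack_fill(uint32_t *base, size_t words)
{
    for (size_t i = 0u; i < words; i++) {
        base[i] = FILL_PATTERN;
    }
}

/* Binary chop for the lowest overwritten word of a downward growing stack.
   Returns its index from 'base', or 'words' if the area is still clean. */
static size_t stack_low_mark(const uint32_t *base, size_t words)
{
    size_t lo = 0u;
    size_t hi = words;              /* answer lies in [lo, hi] */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2u;
        if (base[mid] == FILL_PATTERN) {
            lo = mid + 1u;          /* still clean: mark is higher up */
        } else {
            hi = mid;               /* overwritten: mark is here or lower */
        }
    }
    return lo;
}
```

The chop only needs to run when the quick check of the previously logged point shows the pattern has gone, so the normal cost
in the idle loop is a single compare.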
This gives the software data that can be used to log, raise alarms or do other actions for such events as -
- Heap starting to approach stack
- Heap at different thresholds near stack
- Heap starting to over write stack area
- Stack approaching bottom and various thresholds
- Stack overwriting heap area
- Both ends have MET!!!
This becomes defensive programming and can give you data analysis. So a couple of pointers can tell you a lot of data about what
happened to your stack and you have the ability to save the whole stack and other system status to work out what is happening,
before it becomes destructive. All with a small amount of code and very few variables.
As with everything else, what actions you take at each step is a matter best decided for your application and how safely it has
to operate, or whether a watchdog reset is sufficient.