Calculation of stack size in FreeRTOS or TI-RTOS

Recently I was working with an RTOS and created some tasks to perform the required actions. Every time I create a new task with xTaskCreate or through the TI GUI configuration, I simply make the stack as large as I can so that it does not overflow.
Is there any way to calculate the maximum stack size used by my task, taking the following into account?
1. Stack used by global and local variables
2. Stack used by the maximum recursion depth of the function
3. Interrupt context switching

The compiler, compiler optimisation level, CPU architecture, local variable allocations and function call nesting depth all have a large impact on the stack size. The RTOS has minimal impact. For example, FreeRTOS will add approximately 60 bytes to the stack on a Cortex-M - which is used to store the task's context when the task is not running. Whichever method you use to calculate stack usage in your non-RTOS project can be used in your RTOS project too - then add approximately 60 bytes.
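As a rough illustration of how those contributions add up (a sketch only - every figure and name below is a placeholder, not a measured value):
#include "FreeRTOS.h"
#include "task.h"

/* All figures below are illustrative assumptions - replace them with numbers
   from your own compiler's stack-usage / call-graph report.                  */
#define APP_WORST_CASE_BYTES   400u   /* deepest call chain of the task code           */
#define ISR_STACKING_BYTES      32u   /* 8 words the Cortex-M core stacks on the task  */
                                      /* stack on interrupt entry (no FPU)             */
#define RTOS_CONTEXT_BYTES      64u   /* ~60-byte FreeRTOS task context, rounded up    */

/* xTaskCreate() takes the depth in words, not bytes */
#define myTASK_STACK_WORDS  ((APP_WORST_CASE_BYTES + ISR_STACKING_BYTES + \
                              RTOS_CONTEXT_BYTES) / sizeof(StackType_t))

void vMyTask(void *pvParameters)      /* hypothetical task used only for this example */
{
    (void) pvParameters;
    for (;;) { /* application work */ }
}

void vCreateMyTask(void)
{
    xTaskCreate(vMyTask, "MyTask", myTASK_STACK_WORDS, NULL, 1, NULL);
}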
You can calculate these things, and that can be important in safety critical applications, but in other cases a more pragmatic approach is to try it and see - use the features of the RTOS to measure how much stack is actually being used and use the stack overflow detection - then adjust until you find something optimal.
http://www.freertos.org/Stacks-and-stack-overflow-checking.html
http://www.freertos.org/uxTaskGetStackHighWaterMark.html
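For the overflow-detection side, a minimal sketch of the hook described on the first page (assuming configCHECK_FOR_STACK_OVERFLOW is set to 1 or 2 in FreeRTOSConfig.h; the exact pcTaskName type varies between FreeRTOS versions):
/* FreeRTOSConfig.h:  #define configCHECK_FOR_STACK_OVERFLOW 2 */

#include "FreeRTOS.h"
#include "task.h"

/* Called by the kernel when it detects that a task's stack has overflowed.
   What you do here is application-specific - this sketch just traps.        */
void vApplicationStackOverflowHook(TaskHandle_t xTask, char *pcTaskName)
{
    (void) xTask;
    (void) pcTaskName;
    taskDISABLE_INTERRUPTS();
    for (;;) { /* break here in the debugger to see which task overflowed */ }
}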

I used this code:
TaskHandle_t cipTask = NULL;   /* NULL queries the calling task; or assign the handle from xTaskCreate */
UBaseType_t uxHighWaterMark;

/* Periodically print the minimum amount of stack that has remained unused
   (the "high water mark"), in words, since the task started. */
for (;;) {
    uxHighWaterMark = uxTaskGetStackHighWaterMark(cipTask);
    Serial.println(uxHighWaterMark);
    vTaskDelay(pdMS_TO_TICKS(1000));   /* avoid starving lower-priority tasks */
}

Related

All blocks read the same global memory location. What is the fastest method?

I am writing an algorithm in which all blocks read the same address. For example, we have a list = [1, 2, 3, 4], and all blocks read it and store it in their own shared memory... My test shows that the more blocks read it, the slower it gets... I guess no broadcast happens here? Any idea how I can make it faster? Thank you!
I learned from a previous post that this can be broadcast within one warp, but it seems it cannot happen across different warps... (Actually, in my case the threads within one warp are not reading the same location...)
Once a list element has been accessed by the first warp of an SM, the second warp in the same SM gets it from the cache and it is broadcast to all SIMT lanes. But a warp on another SM may not have it in its L1 cache, so it fetches from L2 into L1 first.
It is similar with __constant__ memory, but that requires the same address to be accessed by all threads; its latency is then closer to register access. __constant__ memory is like the instruction cache: you get more performance when all threads do the same thing.
For example, if you have a Gaussian filter that iterates over the same list of filter coefficients in all threads, it is better to use constant memory. Using shared memory does not have much advantage, as the filter array is not scanned randomly. Shared memory is better when the filter array content is different per block, or when it needs random access.
You can also combine constant memory and shared memory: get half of the list from constant memory and the other half from shared memory. This should let 1024 threads hide the latency of one memory type behind the other.
If the list is small enough, you can use registers directly (the indices have to be known at compile time). But this increases register pressure and may decrease occupancy, so be careful.
Some old CUDA architectures (in the case of the fma operation) required one operand fetched from constant memory and the other from a register to achieve better performance in compute-bottlenecked algorithms.
In a test with 12,000 floats as a filter applied to all threads' inputs, the shared-memory version with 128 threads per block completed the work in 330 milliseconds, while the constant-memory version completed it in 260 milliseconds - and L1 access performance was the real bottleneck in both versions, so the real constant-memory performance is even better, as long as all threads access a similar index.

FreeRTOS - The reason for the stack increase?

I have created several tasks whose body is the same function. Inside the function there is a delay that is the same for every task. When this delay is large enough, each task's stack fills up 6 words less than when the delay is shorter.
As far as I understand, a task's stack grows when several tasks are in the Ready state at the same time.
1. In this situation, is some additional 6-word context written to the task stack?
2. In my example it comes to 6 words (24 bytes); can this value change?
3. What else can increase the stack usage, for example jumping into an interrupt handler?
int main(void){
    xTaskCreate(Task_PrintCountString, "Task_1", mySTACK_SIZE, &param1, 1, NULL);
    xTaskCreate(Task_PrintCountString, "Task_2", mySTACK_SIZE, &param2, 1, NULL);
    xTaskCreate(Task_PrintCountString, "Task_3", mySTACK_SIZE, &param3, 1, NULL);
    vTaskStartScheduler();
}

#define myDELAY 100

void Task_PrintCountString(void *pParams){
    uint16_t c = 0;
    for(;;){
        if(xSemaphoreTake(WriteCountMutex, portMAX_DELAY) == pdTRUE){
            PrintCountString(*(uint8_t *)pParams, c++);
            xSemaphoreGive(WriteCountMutex);
        }
        vTaskDelay(myDELAY/portTICK_PERIOD_MS); /* When myDELAY=1, the task stack usage is 6 words more than when myDELAY=100! */
    }
}
Address 0x20000158 is the top of the stack for one of the tasks; the others are similar.
The only function, PrintCountString, always has the same call depth. Yet the stack can grow to its maximum value in several stages, over dozens of iterations of the task loop! It turns out that not only the context is saved to the stack, but something else as well?
P.S.
I use the ARM CM0 port and heap_1.
I made an observation:
In the PrintCountString function there was a busy-wait on an SPI flag - a while() loop. I switched to DMA transfers and removed the while() loop. After that, the stacks of both tasks filled up to their maximum value immediately, regardless of the delays.
FreeRTOS is just C code so, with one exception, the stack usage is determined by the compiler, including the selected optimisation level. The one exception being that when a task is not running its context is saved to its stack. The context size is fixed in all cases that matter. The size of the context depends on which FreeRTOS port you are using (there are more than 40).
There are a few things in your post I'm not sure about.
when this delay is large enough, the stack of each task is filled with
6 words less than when the delay is less
As above, the C code doesn't change depending on how long a delay you have - the compiler knows nothing about FreeRTOS or the concept of a delay. You may see different stack depths depending on where the code was in the call tree when an interrupt occurs (including interrupts that cause context switches).
I look at the top of the stack and count the remaining number of bytes
to be 0xA5
You don't say which port you are using, so I don't know if the stack grows up or down. In any case the stack is filled with 0xa5 when the task is created but never touched by the kernel again, so the number of 0xa5s left shows how much of the stack has never been used - and from the total stack size you can work out the maximum amount the task has used since it was created. That unused amount (in words) is what uxTaskGetStackHighWaterMark() returns.
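For reference, a minimal sketch of reading that value back (assuming INCLUDE_uxTaskGetStackHighWaterMark is set to 1; note the result is in words, not bytes, on 32-bit ports):
#include <stdio.h>
#include "FreeRTOS.h"
#include "task.h"

/* Prints the smallest amount of free stack the calling task has ever had. */
void vReportStackHeadroom(void)
{
    UBaseType_t uxFreeWords = uxTaskGetStackHighWaterMark(NULL);  /* NULL = calling task */
    printf("min free stack: %lu words (%lu bytes)\r\n",
           (unsigned long) uxFreeWords,
           (unsigned long) (uxFreeWords * sizeof(StackType_t)));
}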
In general the stack memory is used by the microcontroller itself. There are several instructions which write/read data to/from the stack.
The PUSH instruction copies the content of one register to the stack and modifies the stack pointer register.
The POP instruction reads a word from the stack into a register and modifies the stack pointer register.
Both instructions are managed by the compiler. If a function is executed, the compiler emits some PUSH instructions to "free" some registers to work with. At the end of the function the original register state is restored using POP instructions to guarantee the upcoming control flow.
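As a rough illustration (the exact instructions depend entirely on the compiler, its options and the architecture), a small function might get a prologue/epilogue along these lines:
int accumulate(int a, int b)
{
    /* Typical Cortex-M0 prologue (illustrative only):
         PUSH {r4, lr}        ; save a callee-saved register and the return address
         SUB  sp, sp, #8      ; reserve space for locals                            */
    int local[2];
    local[0] = a;
    local[1] = b;
    /* Typical epilogue (illustrative only):
         ADD  sp, sp, #8
         POP  {r4, pc}        ; restore and return                                  */
    return local[0] + local[1];
}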
and each branch instruction stores several register values to the stack
The amount of used stack memory depends on the call hierarchy.
According to your example, if the xSemaphoreTake function is called and finds the semaphore available, the call hierarchy is different from the situation where the semaphore is taken and the task has to block. This results in different stack usage.

Is the maximum call stack size in WebKit three times greater than that of V8? Why?

I see from this blog post the following function to compute the max call stack size:
function computeMaxCallStackSize() {
    try {
        return 1 + computeMaxCallStackSize();
    } catch (e) {
        // Call stack overflow
        return 1;
    }
}
If run from Chrome's console, 12530 is the result (for V8).
Assuming that the same function computes the same result for WebKit, why, when run in Safari, is the result 36243? That is roughly three times the size. The only time I've run into this error is when creating a good ole infinite loop. Is this an arbitrary decision? Does a larger stack size confer larger benefits?
The maximum number of (recursive or not, doesn't matter) calls is determined by (1) the size of available stack space, divided by (2) the size of each stack frame of the active function(s).
(1) has an upper limit imposed by the operating system; on systems I'm aware of it's usually between 1MB and 8MB. Below that limit, a JavaScript engine can set its own limit. V8 sets a limit of a bit less than one megabyte on all platforms, in order to have different platforms behave as similar to each other as possible. I don't know what Safari/JavaScriptCore does.
(2) depends on the implementation details of the JavaScript engine (specifically, how many slots in each stack frame it uses for internal data), as well as on the number of local variables in each of the functions that are involved.
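As a purely illustrative back-of-the-envelope calculation (the 80-byte frame size is an assumption, not a measured value): with roughly 1 MB of usable stack and about 80 bytes per frame of computeMaxCallStackSize(), you get 1,000,000 / 80 ≈ 12,500 calls, which is the order of magnitude Chrome reports. Safari's larger count then implies some combination of a larger stack limit and smaller frames for this particular function.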
As you observe, one usually only runs into the stack limit in case of accidental infinite recursion. So for most real applications, the specific value of the limit doesn't matter, and a larger stack provides no benefit.
Note that stack space is unrelated to maximum memory consumption (aka heap space). You can have gigabytes of heap with only a megabyte of stack.

FreeRTOS configMINIMAL_STACK_SIZE

In some of the demos for FreeRTOS on Cortex-M0 MCUs configMINIMAL_STACK_SIZE is set to 60, while in some others it is set to 70. Using the STM32Cube software it is set to 128.
My question is what is actually the MINIMAL stack size?
Looking in the STM32 Cortex-M0 programming manual I see that the processor registers are
R0-R12, MSP, PSP, LR, PC, PSR, APSR, IPSR, EPSR, PRIMASK, CONTROL. Wouldn't that mean that the MINIMAL stack size is just 23 words? Or is there more info that needs to be saved for a context switch?
As per the description here: http://www.freertos.org/a00110.html#configMINIMAL_STACK_SIZE as far as the RTOS is concerned the constant does nothing more than set the size of the stack used by the idle task.
The stack has to be large enough to hold the context of the task, as well as any normal stack items used by the task (local variables, function call overhead, etc.) so the actual size required depends on what the idle task is doing - and will be at its very minimum if the idle task is doing nothing. If on the other hand there is an idle task hook function in use (http://www.freertos.org/a00016.html) then the required stack size will depend on what the hook function is doing (its function call depth, etc.).
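For example, a minimal sketch of such a hook (assuming configUSE_IDLE_HOOK is set to 1 in FreeRTOSConfig.h):
/* FreeRTOSConfig.h:  #define configUSE_IDLE_HOOK 1 */

#include "FreeRTOS.h"
#include "task.h"

/* Runs in the idle task's context, so everything it does - locals, nested calls,
   printf, etc. - consumes the configMINIMAL_STACK_SIZE stack. It must never block. */
void vApplicationIdleHook(void)
{
    /* e.g. kick a watchdog or enter a low-power sleep mode here */
}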
The constant is also used by the demo tasks as a convenient way of being able to use the same demo tasks on multiple architectures, but that does not affect the RTOS; it is just demo code.

How does a stackless language work?

I've heard of stackless languages. However I don't have any idea how such a language would be implemented. Can someone explain?
The modern operating systems we have (Windows, Linux) operate with what I call the "big stack model". And that model is wrong, sometimes, and motivates the need for "stackless" languages.
The "big stack model" assumes that a compiled program will allocate "stack frames" for function calls in a contiguous region of memory, using machine instructions to adjust registers containing the stack pointer (and optional stack frame pointer) very rapidly. This leads to fast function call/return, at the price of having a large, contiguous region for the stack. Because 99.99% of all programs run under these modern OSes work well with the big stack model, the compilers, loaders, and even the OS "know" about this stack area.
One common problem all such applications have is, "how big should my stack be?". With memory being dirt cheap, mostly what happens is that a large chunk is set aside for the stack (MS defaults to 1Mb), and typical application call structure never gets anywhere near to using it up. But if an application does use it all up, it dies with an illegal memory reference ("I'm sorry Dave, I can't do that"), by virtue of reaching off the end of its stack.
Most so-called "stackless" languages aren't really stackless. They just don't use the contiguous stack provided by these systems. What they do instead is allocate a stack frame from the heap on each function call. The cost per function call goes up somewhat; if functions are typically complex, or the language is interpretive, this additional cost is insignificant. (One can also determine call DAGs in the program call graph and allocate a heap segment to cover the entire DAG; this way you get both heap allocation and the speed of classic big-stack function calls for all calls inside the call DAG).
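A toy sketch in C of that idea (the names and layout are invented for illustration - no real runtime does it exactly like this): the "call stack" becomes a linked list of heap-allocated frames, pushed and popped explicitly instead of via the machine stack.
#include <stdio.h>
#include <stdlib.h>

/* A heap-allocated activation record; the "call stack" is a linked list of these. */
struct frame {
    struct frame *caller;   /* link back to the calling frame */
    int           n;        /* argument of this "call"        */
};

/* factorial evaluated without using the machine stack for the calls:
   "calling" pushes a malloc'd frame, "returning" pops it.            */
static int fact(int n)
{
    struct frame *top = NULL;

    /* calls: push one frame per pending multiplication */
    while (n > 1) {
        struct frame *f = malloc(sizeof *f);
        f->caller = top;
        f->n      = n;
        top       = f;
        n--;
    }

    /* returns: pop frames, folding in each pending multiplication */
    int result = 1;
    while (top != NULL) {
        struct frame *f = top;
        result *= f->n;
        top     = f->caller;
        free(f);
    }
    return result;
}

int main(void)
{
    printf("%d\n", fact(10));   /* prints 3628800 */
    return 0;
}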
There are several reasons for using heap allocation for stack frames:
1. The program does deep recursion that depends on the specific problem it is solving. It is very hard to preallocate a "big stack" area in advance because the needed size isn't known. One can awkwardly arrange function calls to check whether there is enough stack left and, if not, reallocate a bigger chunk, copy the old stack and readjust all the pointers into the stack; that's so awkward that I don't know of any implementations. Allocating stack frames means the application never has to say it's sorry until there is literally no allocatable memory left.
2. The program forks subtasks. Each subtask requires its own stack, and therefore can't use the one "big stack" provided. So, one needs to allocate stacks for each subtask. If you have thousands of possible subtasks, you might now need thousands of "big stacks", and the memory demand suddenly gets ridiculous. Allocating stack frames solves this problem. Often the subtask "stacks" refer back to the parent tasks to implement lexical scoping; as subtasks fork, a tree of "substacks" is created, called a "cactus stack".
3. Your language has continuations. These require that the data in lexical scope visible to the current function somehow be preserved for later reuse. This can be implemented by copying parent stack frames, climbing up the cactus stack, and proceeding.
The PARLANSE programming language I implemented does 1) and 2). I'm working on 3). It is amusing to note that PARLANSE allocates stack frames from a very fast-access heap-per-thread; it typically costs 4 machine instructions. The current implementation is x86 based, and the allocated frame is placed in the x86 EBP/ESP registers much like other conventional x86 based language implementations. So it does use the hardware "contiguous stack" (including pushing and popping), just in chunks. It also generates "frame local" subroutine calls that don't switch stacks for lots of generated utility code where the stack demand is known in advance.
Stackless Python still has a Python stack (though it may have tail call optimization and other call frame merging tricks), but it is completely divorced from the C stack of the interpreter.
Haskell (as commonly implemented) does not have a call stack; evaluation is based on graph reduction.
There is a nice article about the language framework Parrot. Parrot does not use the stack for calling and this article explains the technique a bit.
In the stackless environments I'm more or less familiar with (Turing machine, assembly, and Brainfuck), it's common to implement your own stack. There is nothing fundamental about having a stack built into the language.
In the most practical of these, assembly, you just choose a region of memory available to you, set the stack register to point to the bottom, then increment or decrement to implement your pushes and pops.
EDIT: I know some architectures have dedicated stacks, but they aren't necessary.
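In C terms, "rolling your own stack" is nothing more than this (a toy sketch with made-up names and no overflow checking):
#include <stdio.h>

/* A hand-rolled stack: a region of memory plus an index playing the role
   of the stack-pointer register.                                          */
#define STACK_WORDS 64
static int stack_mem[STACK_WORDS];
static int sp = 0;                        /* index of the next free slot    */

static void push(int value) { stack_mem[sp++] = value; }
static int  pop(void)       { return stack_mem[--sp]; }

int main(void)
{
    push(2);
    push(3);
    printf("%d\n", pop() + pop());   /* prints 5 */
    return 0;
}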
Call me ancient, but I can remember when the FORTRAN standards and COBOL did not support recursive calls, and therefore didn't require a stack. Indeed, I recall the implementations for CDC 6000 series machines where there wasn't a stack, and FORTRAN would do strange things if you tried to call a subroutine recursively.
For the record, instead of a call stack, the CDC 6000 series instruction set used the RJ instruction to call a subroutine. This saved the current PC value at the call target location and then branched to the location following it. At the end, the subroutine would perform an indirect jump through the call target location. That reloaded the saved PC, effectively returning to the caller.
Obviously, that does not work with recursive calls. (And my recollection is that the CDC FORTRAN IV compiler would generate broken code if you did attempt recursion ...)
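A toy C model of why that convention breaks recursion (invented names; the real mechanism was machine instructions, not C): each subroutine owns a single static slot for its return point, so a nested activation overwrites the outer one's.
#include <stdio.h>

/* Toy model of the CDC 6000 RJ convention: ONE static slot per subroutine
   holds the caller's return point, instead of a per-call stack frame.      */
static int sub_return_point;        /* the word at the "call target location" */

static void sub(int return_point, int recurse)
{
    sub_return_point = return_point;   /* what RJ would store                 */
    if (recurse)
        sub(99, 0);                    /* nested call OVERWRITES the slot     */
    printf("jumping back to %d\n", sub_return_point);
}

int main(void)
{
    sub(1, 0);   /* prints "jumping back to 1" - fine                           */
    sub(1, 1);   /* the outer activation now prints 99: its return point is lost */
    return 0;
}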
There is an easy to understand description of continuations on this article: http://www.defmacro.org/ramblings/fp.html
Continuations are something you can pass into a function in a stack-based language, but which can also be used by a language's own semantics to make it "stackless". Of course the stack is still there, but as Ira Baxter described, it's not one big contiguous segment.
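To make that concrete, a tiny continuation-passing sketch in C (the helper names are invented): instead of returning a value, each function hands its result to an explicit "rest of the computation".
#include <stdio.h>

/* A continuation: "what to do with the result", passed around explicitly. */
typedef void (*cont_t)(int result);

/* add() never "returns" a value - it invokes its continuation instead. */
static void add(int a, int b, cont_t k)
{
    k(a + b);
}

static void print_result(int result)
{
    printf("result = %d\n", result);
}

int main(void)
{
    /* "add 1 and 2, then continue with print_result" */
    add(1, 2, print_result);   /* prints: result = 3 */
    return 0;
}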
Say you wanted to implement stackless C. The first thing to realize is that this doesn't need a stack:
a == b
But, does this?
isequal(a, b) { return a == b; }
No. Because a smart compiler will inline calls to isequal, turning them into a == b. So, why not just inline everything? Sure, you will generate more code but if getting rid of the stack is worth it to you then this is easy with a small tradeoff.
What about recursion? No problem. A recursive function like:
bang(x) { return x == 1 ? 1 : x * bang(x-1); }
Can still be flattened, because really it's just a for loop in disguise:
bang(x) {
    int r = 1;
    for(int i = x; i >= 1; i--) r *= i;
    return r;
}
In theory a really smart compiler could figure that out for you. But a less-smart one could still flatten it as a goto:
ax = x;
NOTDONE:
if(ax > 1) {
    x = x*(--ax);
    goto NOTDONE;
}
There is one case where you have to make a small trade off. This can't be inlined:
fib(n) { return n <= 2 ? n : fib(n-1) + fib(n-2); }
Stackless C simply cannot do this. Are you giving up a lot? Not really. This is something normal C can't do very well either. If you don't believe me, just call fib(1000) and see what happens to your precious computer.
Please feel free to correct me if I'm wrong, but I would think that allocating memory on the heap for each function call frame would cause extreme memory thrashing. The operating system does, after all, have to manage this memory. I would think that the way to avoid this thrashing would be a cache for call frames. So if you need a cache anyway, we might as well make it contiguous in memory and call it a stack.
