The stack an interrupt handler uses on VxWorks (PowerPC)

Does an interrupt handler use the stack of the interrupted task, or does it have a separate stack of its own? (PowerPC, VxWorks)

This is architecture dependent. From the VxWorks Kernel Programmer's Guide (v6.8):
All ISRs use the same interrupt stack. [...]
CAUTION: Some architectures do not permit using a separate interrupt stack, and
ISRs use the stack of the interrupted task. [...] See
the VxWorks reference for your BSP to determine whether your architecture
supports a separate interrupt stack.
In your case, PowerPC does support a separate shared interrupt stack (per core).

In VxWorks, there is a specific stack for interrupts. All interrupt handlers share that same stack, which is located just above where the VxWorks image is loaded.
I believe the default stack size is 5K, but can easily be changed with the kernel configurator.
The ISR mechanism works roughly this way:
You can think of VxWorks as installing an assembly-language wrapper around your ISR code.
On entry, the wrapper automatically saves the general-purpose registers (on the interrupt stack) so that the state of the executing context (another ISR or a task) is preserved.
On exit, the registers are restored; in addition, the OS scheduler is called to check whether the ISR that just finished made a higher-priority task ready. If so, that higher-priority task resumes; otherwise the interrupted task is restored.
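For concreteness, here is a minimal sketch of how an ISR is typically installed with the VxWorks intLib API. The interrupt number and the argument value are purely illustrative, and intEnable() is BSP-dependent, so check your BSP documentation:

    /* Hypothetical example: install myIsr on interrupt number 5.      */
    /* The ISR runs on the shared interrupt stack, not a task stack,   */
    /* so keep locals small and never call anything that can block.    */
    #include <vxWorks.h>
    #include <intLib.h>
    #include <iv.h>
    #include <logLib.h>

    #define MY_DEV_INT_NUM  5   /* illustrative only; use your BSP's value */

    static void myIsr (int arg)
    {
        /* logMsg() is interrupt-safe; printf() and semTake() are not. */
        logMsg ("device interrupt, arg=%d\n", arg, 0, 0, 0, 0, 0);
    }

    STATUS myDevIntInstall (void)
    {
        if (intConnect (INUM_TO_IVEC (MY_DEV_INT_NUM),
                        (VOIDFUNCPTR) myIsr, 42) != OK)
            return (ERROR);

        return (intEnable (MY_DEV_INT_NUM));   /* BSP-dependent */
    }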

There is a pretty good description of how interrupts work in the VxWorks Programmer's Guide, section 2.6. If you don't have a copy, it is available online from many sources.

Related

Calculating the Stack usage in RTOS application

I am currently working on a project to develop an application on an STM32 microcontroller using an RTOS (Micrium).
Are there any tools to calculate the stack usage of a particular thread in an RTOS application?
No tools I know of. However, two simple methods to estimate stack usage have always worked for me.
1. Fill all RAM with a known pattern such as 0x55 or 0xAA. Run the program long enough, exercising as many of the device's features as possible to get good code-execution coverage. Then stop (under a debugger) and examine RAM to see where the pattern has been overwritten. That gives a good approximation, and it works with or without an OS. A minimal sketch of this check follows below.
2. Modify the OS slightly so that on every task switch it records, in a global array, the lowest stack pointer seen so far for each task (by comparing the current stack pointer with the previously recorded value for that task). After running the application long enough, as in method 1, examine the recorded values. There is no guarantee that a task switch happens at the exact moment of a task's deepest stack use, but statistically, after enough time under preemptive switching, the recorded values will be accurate enough.
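A minimal sketch of method 1, assuming a descending stack and a region whose bounds you know from your linker script (the pattern value and function names here are illustrative):

    #include <stddef.h>
    #include <stdint.h>

    #define STACK_FILL 0xA5u   /* any recognizable pattern works */

    /* Paint the stack region once at startup, before it is used. */
    static void stack_paint(uint8_t *stack_base, size_t stack_size)
    {
        for (size_t i = 0; i < stack_size; i++)
            stack_base[i] = STACK_FILL;
    }

    /* Later (from a debugger or a monitor task): count how many bytes
     * at the low end of the region still hold the pattern. On a
     * descending stack, everything above that point was actually used. */
    static size_t stack_high_water(const uint8_t *stack_base, size_t stack_size)
    {
        size_t untouched = 0;
        while (untouched < stack_size && stack_base[untouched] == STACK_FILL)
            untouched++;
        return stack_size - untouched;   /* worst-case bytes used so far */
    }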
If you are using GCC or Clang, the -fstack-usage compiler switch generates a stack-frame size for each function. You need to combine that information with the call-graph information generated by the linker to find the deepest stack usage starting from a specific function. Starting at main(), at a task entry point, or at an ISR then gives you the worst-case usage for that thread.
Helpfully the work to create such a tool has been done for you as discussed here, using a Perl script from here.
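As a rough illustration of the -fstack-usage workflow (the function, the file name, and the frame size shown in the comments are made up, but the one-line-per-function .su format is the one GCC documents):

    /* stack_demo.c -- compile with:  gcc -c -fstack-usage stack_demo.c   */
    /* GCC then writes stack_demo.su next to the object file, one line    */
    /* per function: <file:line:col:function>  <bytes>  <qualifier>.      */
    #include <string.h>

    int process(const char *msg)
    {
        char buf[64];                      /* dominates this stack frame */
        strncpy(buf, msg, sizeof buf - 1);
        buf[sizeof buf - 1] = '\0';
        return (int) strlen(buf);
    }

    /* stack_demo.su might then contain something like:                   */
    /*     stack_demo.c:6:5:process   80   static                         */
    /* "static" means a fixed frame size; "dynamic" or "bounded" flags     */
    /* alloca()/VLA use. Combine the per-function sizes with the call      */
    /* graph to get the worst-case depth for each thread entry point.     */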
ARM's armcc compiler v5 and earlier (v6 is Clang/LLVM based) has this functionality built in and can include detailed stack analysis in the link map, including the worst-case call path and warnings about non-deterministic stack usage (due to recursion or callbacks through function pointers, for example). You may be using armcc if, for instance, you are using Keil MDK-ARM. Again, for multi-threaded systems (tasks/ISRs) you need to look at the stack usage for each thread entry point.
Note also that on ARM Cortex-M, the "system stack" is shared by the main() thread and all ISRs, and if you use ISR preemption priorities, multiple interrupts may be active simultaneously. So in theory the worst-case stack usage is the sum of the stack usage of main() and of every ISR that may be active concurrently. While it is good practice to keep ISRs short and simple, beware of third-party code: ST's USB library, for example, runs the entire USB device stack in ISR context!

Does ISR (Interrupt Service Routine) have a separate stack?

When using an RTOS (e.g. FreeRTOS), we have separate stack spaces for each thread. So what about ISRs (interrupt service routines): do they have a separate stack in memory? Or is this configurable?
If they don't have a stack, where do the local variables declared in an ISR get stored?
I have the exact same question and a lot of searching leads me to this conclusion: the answer is dependent on your chip and how the OS you use configures that chip.
So, looking at one of my favorite chips, the ARM Cortex-M3 (on which interrupts are a form of exception), the documentation reads in various places:
Operating Modes
The Cortex-M3 supports Privileged and User (non-privileged) execution.
Code run as Privileged has full access rights whereas code executed as
User has limited access rights. The limitations include restrictions
on instruction use such as MSR fields, access to memory and
peripherals based on system design, and restrictions imposed by the
MPU configuration.
The processor supports two operation modes, Thread mode and Handler
mode. Thread mode is entered on reset and normally on return from an
exception. When in Thread mode, code can be executed as either
Privileged or Unprivileged.
Handler mode will be entered as a result of an exception. Code in
Handler mode is always executed as Privileged, therefore the core will
automatically switch to Privileged mode when exceptions occur. You can
change between Privileged Thread mode and User Thread mode when
returning from an exception by modifying the EXC_RETURN value in the
link register (R14). You can also change from Privileged Thread to
User Thread mode by clearing CONTROL[0] using an MSR instruction.
However, you cannot directly change to privileged mode from
unprivileged mode without going through an exception, for example an
SVC.
Main and Process Stacks
The Cortex-M3 supports two different stacks, a main stack and a
process stack. To support this the Cortex-M3 has two stack pointers
(R13). One of these is banked out depending on the stack in use. This
means that only one stack pointer at a time is visible as R13.
However, both stack pointers can be accessed using the MRS and MSR
instructions. The main stack is used at reset, and is always used in
Handler mode (when entering an exception handler). The process stack
pointer is only available as the current stack pointer when in Thread
mode. You can select which stack pointer (main or process) is used in
Thread mode in one of two ways, either by using the EXC_RETURN value
when exiting from Handler Mode or while in Thread Mode by writing to
CONTROL[1] using an MSR instruction.
And...
When the processor takes an exception, unless the exception is a
tail-chained or a late-arriving exception, the processor pushes
information onto the current stack. This operation is referred to as
stacking and the structure of eight data words is referred as the
stack frame. ...
Immediately after stacking, the stack pointer indicates the lowest
address in the stack frame
From the book "The Definitive Guide to the ARM Cortex-M3":
The MSP, also called SP_main in ARM documentation, is the default SP
after power-up; it is used by kernel code and exception handlers. The
PSP, or SP_process in ARM documentation, is typically used by thread
processes in system with embedded OS running.
Because exception handlers always use the Main Stack Pointer, the main
stack memory should contain enough space for the largest number of
nesting interrupts.
When an exception takes place, the registers R0–R3, R12, LR, PC,
and Program Status (PSR) are pushed to the stack. If the code that is
running uses the Process Stack Pointer (PSP), the process stack will
be used; if the code that is running uses the Main Stack Pointer
(MSP), the main stack will be used. Afterward, the main stack will
always be used during the handler, so all nested interrupts will use
the main stack.
UPDATE 6/2017:
My previous answer was incorrect. I have since analyzed FreeRTOS for Cortex processors and rewritten my answer:
The standard FreeRTOS port for the Cortex-M3 does in fact configure and use both the MSP and the PSP. When the very first task is started, the port resets the MSP to the initial stack value held in the first entry of the vector table (at address 0x00000000), which typically points to the very last word of SRAM. It then triggers a system call; in the system-call exception handler it sets the PSP to the first task's stack and modifies the exception-return (LR) value to mean "return to Thread mode and use the process stack on return".
This means that the interrupt-service-routine (i.e. exception handler) stack grows down from the address held in that first vector-table entry.
You can configure your linker and startup code to locate the exception-handler stack wherever you like; just make sure that your heap and other memory areas do not overlap it and that it is large enough.
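As a small sketch of how you can observe this from code on a Cortex-M3, using the CMSIS-Core intrinsics (the device header name below is just an example; include whatever header your vendor provides):

    #include "stm32f10x.h"   /* example vendor header; pulls in core_cm3.h */
    #include <stdbool.h>
    #include <stdint.h>

    /* True if the calling code runs on the process stack (PSP).
     * Inside an exception handler this is always false, because
     * Handler mode always uses the main stack (MSP). */
    static bool running_on_psp(void)
    {
        return (__get_CONTROL() & 0x2u) != 0u;   /* CONTROL bit 1 = SPSEL */
    }

    static void inspect_stack_pointers(void)
    {
        uint32_t msp = __get_MSP();   /* main stack pointer (ISR stack)   */
        uint32_t psp = __get_PSP();   /* process stack pointer (task)     */
        (void)msp;                    /* inspect in a debugger, or log    */
        (void)psp;
        (void)running_on_psp();
    }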
The answer for other chips and operating systems could be completely different!
To help ensure your application has appropriate space on the ISR stack (MSP), here is some additional code to check actual ISR stack use. Use it in addition to the checking you are already doing on FreeRTOS task stack use:
https://sourceforge.net/p/freertos/discussion/382005/thread/8418dd523e/
Update: I've posted my version of port.c that includes the ISR stack use check on github:
https://github.com/DRNadler/FreeRTOS_helpers

Using pthread_yield to return control over the CPU back to Kernel

Consider the following scenario:
In a POSIX system, some thread from a user program is running and the timer interrupt has been disabled.
To my understanding, unless it terminates, the currently running thread won't willingly give up control of the CPU.
My question is as follows: will calling pthread_yield() from within the thread give the kernel control over the CPU?
Any help with this question would be greatly appreciated.
Turning off the operating system's timer interrupt would change it into a cooperative multitasking system. This is how Windows 1, 2, and 3 and classic Mac OS (through Mac OS 9) worked: the running task only changed when the program made a system call.
Since pthread_yield results in a system call, yes, the kernel would get control back from the program.
If you are writing programs on a cooperative multitasking system, it is very important to not hog the CPU. If your program does hog the CPU the entire system comes to a halt.
This is why Windows MFC has the idle message in its message loop. Programs with long-running work would do it in that message handler, processing one or two items at a time and then returning to the operating system so it could check whether the user had clicked on something.
A thread can easily relinquish control by issuing system calls that perform blocking inter-thread communication or request I/O. The timer interrupt is not absolutely required for a multithreaded OS, though it is very useful for providing timeouts on such system calls and for helping out when the system is overloaded with ready threads.
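A minimal sketch (note that pthread_yield() is non-standard; sched_yield() is the POSIX-standard equivalent, and on glibc pthread_yield() is simply an alias for it):

    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 5; i++) {
            printf("worker: finished chunk %d\n", i);
            /* Trap into the kernel and let it schedule another runnable
             * thread (or return immediately if none is ready). */
            sched_yield();
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        pthread_join(t, NULL);    /* build with: cc -pthread yield.c */
        return 0;
    }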

Consistency Rules for cudaHostAllocMapped

Does anyone know of documentation on the memory-consistency guarantees for a memory region allocated with cudaHostAlloc(..., cudaHostAllocMapped)? For instance, it would be useful to know when writes from the device become visible to reads from the host (only after the kernel completes, at the earliest possible time during kernel execution, etc.).
Writes from the device are guaranteed to be visible on the host (or on peer devices) after the performing thread has executed a __threadfence_system() call (which is only available on compute capability 2.0 or higher).
They are also visible after the kernel has finished, i.e. after a cudaDeviceSynchronize() or after one of the other synchronization methods listed in the "Explicit Synchronization" section of the Programming Guide has been successfully completed.
Mapped memory should never be modified from the host while a kernel using it is or could be running, as CUDA currently does not provide any way of synchronization in that direction.
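A minimal sketch of the fence/flag pattern and the post-synchronization visibility point described above, using zero-copy mapped host memory (error checking omitted; the data/flag pair is just for illustration):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void produce(volatile int *data, volatile int *flag)
    {
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            *data = 42;               // write into mapped host memory
            __threadfence_system();   // make the write visible to the host
            *flag = 1;                // only then publish the "ready" flag
        }
    }

    int main()
    {
        cudaSetDeviceFlags(cudaDeviceMapHost);   // before any other CUDA call

        int *h_data, *h_flag;
        cudaHostAlloc((void **)&h_data, sizeof(int), cudaHostAllocMapped);
        cudaHostAlloc((void **)&h_flag, sizeof(int), cudaHostAllocMapped);
        *h_data = 0;
        *h_flag = 0;

        int *d_data, *d_flag;
        cudaHostGetDevicePointer((void **)&d_data, h_data, 0);
        cudaHostGetDevicePointer((void **)&d_flag, h_flag, 0);

        produce<<<1, 32>>>(d_data, d_flag);

        // Visibility point: after cudaDeviceSynchronize() every device
        // write to the mapped region is guaranteed visible on the host.
        cudaDeviceSynchronize();
        printf("data=%d flag=%d\n", *h_data, *h_flag);

        cudaFreeHost(h_data);
        cudaFreeHost(h_flag);
        return 0;
    }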

Are cuda kernel calls synchronous or asynchronous

I read that one can use kernel launches to synchronize different blocks, i.e., if I want all blocks to complete operation 1 before they go on to operation 2, I should place operation 1 in one kernel and operation 2 in another kernel. This way, I can achieve global synchronization between blocks. However, the CUDA C Programming Guide says that kernel calls are asynchronous, i.e., the CPU does not wait for the first kernel call to finish, and thus the CPU could call the second kernel before the first has finished. If that is true, then we cannot use kernel launches to synchronize blocks. Please let me know where I am going wrong.
Kernel calls are asynchronous from the point of view of the CPU, so if you call two kernels in succession the second one will be called without waiting for the first one to finish. It only means that control returns to the CPU immediately.
On the GPU side, if you haven't specified different streams for the kernels, they will be executed in the order they were called (if you don't specify a stream, they both go to the default stream and are executed serially). The second kernel will only execute after the first one has finished.
This behavior holds for devices with compute capability 2.x, which support concurrent kernel execution; on other devices, even though kernel calls are still asynchronous, kernel execution is always sequential.
Check section 3.2.5 of the CUDA C Programming Guide, which every CUDA programmer should read.
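A small sketch of the pattern in question: two kernels launched back to back into the default stream. Both launches return to the CPU immediately, but on the device step2 does not start until every block of step1 has finished, which is exactly the inter-block synchronization being asked about:

    #include <cuda_runtime.h>

    __global__ void step1(int *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = i;          // "operation 1"
    }

    __global__ void step2(int *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2;         // sees results from ALL step1 blocks
    }

    int main()
    {
        const int n = 1 << 20;
        int *d;
        cudaMalloc((void **)&d, n * sizeof(int));

        step1<<<(n + 255) / 256, 256>>>(d, n);   // returns immediately to CPU
        step2<<<(n + 255) / 256, 256>>>(d, n);   // queued behind step1 on GPU

        cudaDeviceSynchronize();                  // CPU blocks here, not above
        cudaFree(d);
        return 0;
    }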
The accepted answer is not always correct.
In most cases a kernel launch is asynchronous, but in the following cases it is synchronous, and these are easy to overlook:
the environment variable CUDA_LAUNCH_BLOCKING is set to 1;
a profiler (nvprof) is used without concurrent kernel profiling enabled;
a memcpy involves host memory that is not page-locked (see the sketch after the quoted guide text below).
Programmers can globally disable asynchronicity of kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only and should not be used as a way to make production software run reliably.
Kernel launches are synchronous if hardware counters are collected via a profiler (Nsight, Visual Profiler) unless concurrent kernel profiling is enabled. Async memory copies will also be synchronous if they involve host memory that is not page-locked.
From the NVIDIA CUDA C Programming Guide (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#concurrent-execution-host-device).
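A small sketch of the page-locked point (the third case above): the same cudaMemcpyAsync call behaves differently depending on how the host buffer was allocated:

    #include <cstdlib>
    #include <cuda_runtime.h>

    int main()
    {
        const size_t bytes = 64u * 1024 * 1024;
        int *d;
        cudaMalloc((void **)&d, bytes);

        // Pageable host memory: the "async" copy may block the calling
        // thread while the data is staged through an internal pinned buffer.
        int *pageable = (int *)malloc(bytes);
        cudaMemcpyAsync(d, pageable, bytes, cudaMemcpyHostToDevice);

        // Page-locked (pinned) host memory: the copy is genuinely
        // asynchronous and control returns to the CPU immediately.
        int *pinned;
        cudaMallocHost((void **)&pinned, bytes);
        cudaMemcpyAsync(d, pinned, bytes, cudaMemcpyHostToDevice);

        cudaDeviceSynchronize();   // wait for both transfers before freeing
        cudaFreeHost(pinned);
        free(pageable);
        cudaFree(d);
        return 0;
    }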
Concurrent kernel execution is supported since compute capability 2.0.
In addition, control can return to the CPU before all of the kernel's warps have finished executing.
In that case, you can provide the synchronization yourself (for example with cudaDeviceSynchronize()).
