Stack management in the CLR

I understand the basic concept of the stack and the heap, but it would be great if anyone could clear up the following confusions:
Is there a single stack for the entire application process, or is a new stack created for each thread started in a project?
Is there a single heap for the entire application process, or is a new heap created for each thread started in a project?
If a stack is created for each thread, how does the process manage the sequential flow of threads (and hence stacks)?

There is a separate stack for every thread. This is true not only for CLR, and not only for Windows, but pretty much for every OS or platform out there.
There is a single heap for every Application Domain. A single process may run several app domains at once. A single app domain may run several threads.
To be more precise, there are usually two heaps per domain: one regular and one for really large objects (in the CLR, objects of 85,000 bytes or more, such as a big array).
I don't understand what you mean by "sequential flow of threads".

One stack for each thread, all threads share the same heaps.
There is no 'sequential flow' of threads. A thread is an operating system object that stores a copy of the processor state. The processor state includes the register values. One of them is ESP, the stack pointer. Another really important one is EIP, the instruction pointer. When the operating system switches between threads, it stores the processor state in the current thread object and reloads the state from the thread object for the thread that was selected to run next. The processor now simply continues executing where it left off previously.
Getting a thread started is perhaps now easy to understand as well. The operating system allocates memory for the stack (one megabyte by default on Windows), initializes the ESP register value to point to that memory, and sets the value of the EIP register to the address of the method where the thread should start executing. In C#, that method is the target of the ThreadStart delegate.
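As a rough illustration of "one stack per thread, shared heap", here is a minimal sketch in Python rather than C#. The names `depth`, `worker` and `run` are made up for the example. Each thread recurses to a different depth using its own call stack, while all of them append to one shared list.

```python
import threading

def depth(n):
    # Every call adds a frame to the call stack of whichever thread runs it.
    if n == 0:
        return 0
    return 1 + depth(n - 1)

def worker(name, n, shared, lock):
    result = depth(n)          # locals and return addresses are per thread
    with lock:                 # the list is shared, so access is synchronized
        shared.append((name, result))

def run():
    shared = []                # one object on the heap, visible to all threads
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(f"t{i}", 10 * (i + 1), shared, lock))
        for i in range(3)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(shared)
```

The recursion depths never interfere with each other because each thread unwinds its own stack; only the shared list needs a lock.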

Each thread must have its own stack; that's where local variables and parameters are held, along with the return addresses of the calling functions.

Why is there only limited usage of thread pools in TensorFlow-Federated?

TFF's threading libraries start a new thread from ThreadRun by default, and the only usage (as of TFF 0.42.0) of the optional ThreadPool parameter is in the implementation of a single executor. Why is this the case?
After conferring with some people who were close to the implementation, the understanding we came to was:
The issue with totally general usage of thread pools in TFF is that, if used incorrectly, we may be courting deadlock. We need FIFO scheduling in the thread pool itself, and FIFO-compatible usage in the runtime (if you need the result of a computation, you need to know it will be started before you start).
When implementing the first usages of thread pools in the TF executor, we reasoned our way to believing that the following statement is true: at the leaf executors (that is, so long as an executor doesn't have any children), this FIFO-compatible programming is guaranteed by the stateful executor interface. That is, if you need a value, you know it has already been created (otherwise the executor wouldn't be able to resolve it), so as long as the thread pool is FIFO, it will be ready before you execute. Either the creating function already pushed a function onto this FIFO queue, or it just created the value directly, so you can push yourself onto the FIFO queue, no sweat.
Due to the difficulty, we haven't really tried to reason too hard about how or whether we might be able to make similar statements about executors which have children (and these children may be pushing work onto the queue; AFAIK we don't currently make any guarantees about how we do this, but I could imagine reasoning about a similar invariant step by step 'up the stack'). Thus we have so far only considered it safe to inject thread pool usage at leaf executors. The fact that we don't have this in the XLAExecutor yet is simply due to lack of use.
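To make the FIFO-compatibility argument concrete, here is a small sketch using Python's standard `concurrent.futures`, not TFF's own threading code. The function name `run_fifo_compatible` is hypothetical. With a single worker, tasks run strictly in submission order, so a task may safely wait on any task submitted before it.

```python
from concurrent.futures import ThreadPoolExecutor

def run_fifo_compatible():
    # One worker makes the pool strictly FIFO: tasks run in submission order.
    with ThreadPoolExecutor(max_workers=1) as pool:
        # The value is "created" first, so its task sits ahead in the queue.
        create_value = pool.submit(lambda: 21)
        # The consumer is queued afterwards; by the time it runs, the task
        # it depends on has already finished, so result() does not block.
        use_value = pool.submit(lambda: create_value.result() * 2)
        return use_value.result()
```

If a task instead waited on something queued behind it on the same single worker, that later task could never start, which is the deadlock the answer describes.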

Operating Systems: Processes, Paging and Memory Allocation doubts

I have several doubts about processes and memory management. I list the main ones below. I'm slowly trying to solve them by myself, but I would still like some help from you experts =).
I understood that the data structures associated with a process are more or less these:
text, data, stack, kernel stack, heap, PCB.
If the process is created but the LTS decides to send it to secondary memory, are all the data structures copied for example on SSD or maybe just text and data (and PCB in kernel space)?
Paging allows you to allocate processes in a non-contiguous way:
How does the kernel know if the process is trying to access an illegal memory area? After not finding the index on the page table, does the kernel realize that it is not even in virtual memory (secondary memory)? If so, is an interrupt (or exception) thrown? Is it handled immediately or later (maybe there was a process switch)?
If the processes are allocated non-contiguously, how does the kernel realize that there has been a stack overflow since the stack typically grows down and the heap up? Perhaps the kernel uses virtual addresses in PCBs as memory pointers that are contiguous for each process so at each function call it checks if the VIRTUAL pointer to the top of the stack has touched the heap?
How do programs generate their internal addresses? For example, in the case of virtual memory, everyone assumes starting from the address 0x0000 ... up to the address 0xffffff ... and is it then up to the kernel to proceed with the mapping?
How do processes end? Is the system call exit called both in case of normal termination (finished last instruction) and in case of killing (by the parent process, kernel, etc.)? Does the process itself enter kernel mode and free up its associated memory?
Kernel schedulers (LTS, MTS, STS) when are they invoked? From what I understand there are three types of kernels:
separate kernel, below all processes.
the kernel runs inside the processes (they only change modes) but there are "process switching functions".
the kernel itself is based on processes but still everything is based on process switching functions.
I guess the number of pages allocated to the text and data depends on the "length" of the code and the "global" data. On the other hand, is the number of pages allocated to the heap and stack variable for each process? For example, I remember that the JVM allows you to change the size of the stack.
When a running process wants to write n bytes in memory, does the kernel try to fill a page already dedicated to it and a new one is created for the remaining bytes (so the page table is lengthened)?
I really thank those who will help me.
Have a good day!
I think you have lots of misconceptions. Let's try to clear some of these.
If the process is created but the LTS decides to send it to secondary memory, are all the data structures copied for example on SSD or maybe just text and data (and PCB in kernel space)?
I don't know what you mean by LTS. The kernel can decide to send some pages to secondary memory, but only at page granularity. That means it won't send a whole text segment or a complete data segment, but only a page or a few pages to the disk. Yes, the PCB is stored in kernel space and never swapped out (see here: Do Kernel pages get swapped out?).
How does the kernel know if the process is trying to access an illegal memory area? After not finding the index on the page table, does the kernel realize that it is not even in virtual memory (secondary memory)? If so, is an interrupt (or exception) thrown? Is it handled immediately or later (maybe there was a process switch)?
On x86-64, each page table entry has 12 bits reserved for flags. The first (right-most) bit is the present bit. On access to the page referenced by this entry, it tells the processor whether it should raise a page fault. If the present bit is 0, the processor raises a page fault and calls a handler defined by the OS in the IDT (interrupt 14). Virtual memory is not secondary memory; they are not the same thing. Virtual memory doesn't have a physical medium of its own. It is a concept that is implemented in hardware, but with logic rather than with a storage medium. The kernel holds a memory map of the process in the PCB. On a page fault, if the access was not within this memory map, the kernel will kill the process.
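The flag check described above can be sketched as plain bit tests. This is only an illustration of the entry layout (present, writable and user bits in the low 12 bits, frame number in bits 12 to 51); the names `decode_pte` and `access` are invented for the example and nothing here touches real page tables.

```python
PRESENT = 1 << 0    # bit 0: page is present in physical memory
WRITABLE = 1 << 1   # bit 1: writes are allowed
USER = 1 << 2       # bit 2: user-mode access is allowed
FRAME_MASK = 0x000F_FFFF_FFFF_F000  # bits 12..51 hold the physical frame

def decode_pte(entry):
    """Split a 64-bit page table entry into its main fields."""
    return {
        "present": bool(entry & PRESENT),
        "writable": bool(entry & WRITABLE),
        "user": bool(entry & USER),
        "frame": (entry & FRAME_MASK) >> 12,
    }

def access(entry):
    """Mimic what the processor decides on a memory access."""
    if not entry & PRESENT:
        return "page fault (vector 14)"
    return "ok"
```

For example, `access(0x12345006)` reports a page fault because bit 0 is clear, while `decode_pte(0x12345007)` shows a present, writable, user-accessible page in frame `0x12345`.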
If the processes are allocated non-contiguously, how does the kernel realize that there has been a stack overflow since the stack typically grows down and the heap up? Perhaps the kernel uses virtual addresses in PCBs as memory pointers that are contiguous for each process so at each function call it checks if the VIRTUAL pointer to the top of the stack has touched the heap?
The processes are allocated contiguously in virtual memory but not in physical memory. See my answer here for more info: Each program allocates a fixed stack size? Who defines the amount of stack memory for each application running?. I think stack overflow is checked with a guard page. The stack has a maximum size (typically 8 MB by default on Linux) and one page marked not present is left underneath it, so that if this page is accessed, the kernel is notified via a page fault that it should kill the process. A stack overflow in one user-mode process cannot reach another process, because the paging mechanism already isolates different processes via the page tables. The heap has a very large portion of virtual memory reserved. It can thus grow according to how much physical space you actually have to back it, that is, the size of the swap file plus RAM.
How do programs generate their internal addresses? For example, in the case of virtual memory, everyone assumes starting from the address 0x0000 ... up to the address 0xffffff ... and is it then up to the kernel to proceed with the mapping?
The programs assume an address (often 0x400000) for the base of the executable. Today, you also have ASLR, where the executable is built to be position-independent and its base address is chosen at load time. Many current toolchains build executables this way by default.
How do processes end? Is the system call exit called both in case of normal termination (finished last instruction) and in case of killing (by the parent process, kernel, etc.)? Does the process itself enter kernel mode and free up its associated memory?
The kernel has a memory map for each process. When the process dies, whether normally or via abnormal termination, the kernel walks that memory map and releases everything the process was using.
Kernel schedulers (LTS, MTS, STS) when are they invoked?
Most of these assumptions are off. The kernel isn't a process. There can be kernel threads, but the kernel itself mostly runs in response to interrupts and system calls. The scheduler is mainly driven by a timer interrupt: the kernel starts a timer at boot and, when there is a timer interrupt, the kernel calls the scheduler. The scheduler also runs when the current thread blocks or yields inside a system call.
I guess the number of pages allocated to the text and data depends on the "length" of the code and the "global" data. On the other hand, is the number of pages allocated to the heap and stack variable for each process? For example, I remember that the JVM allows you to change the size of the stack.
The heap and stack have portions of virtual memory reserved for them. The text/data segments start at 0x400000 and end wherever they need to. The space reserved for heap and stack is really big in virtual memory, so they are limited by the amount of physical memory available to back them. The JVM is another thing. The JVM creates its own threads and chooses their stack size when it creates them, which is why a JVM option can change it.
When a running process wants to write n bytes in memory, does the kernel try to fill a page already dedicated to it and a new one is created for the remaining bytes (so the page table is lengthened)?
The kernel doesn't do that. On Linux, the C/C++ implementation (libc/libstdc++) does that instead. When you allocate memory dynamically, the C/C++ implementation keeps track of the allocated space so that it won't request a new page for a small allocation.
EDIT
Do compiled (and interpreted?) Programs only work with virtual addresses?
Yes they do. Everything is a virtual address once paging is enabled. Enabling paging is done via a control register set at boot by the kernel. The MMU of the processor will automatically read the page tables (among which some are cached) and will translate these virtual addresses to physical ones.
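As a sketch of what the MMU does with a virtual address, assuming standard x86-64 4-level paging with 4 KiB pages, the address can be split into four 9-bit table indices and a 12-bit page offset. The function name `split_virtual_address` is made up for illustration.

```python
def split_virtual_address(va):
    """Split a 48-bit x86-64 virtual address into page-table indices."""
    return {
        "pml4": (va >> 39) & 0x1FF,   # index into the PML4 table
        "pdpt": (va >> 30) & 0x1FF,   # index into the page directory pointer table
        "pd": (va >> 21) & 0x1FF,     # index into the page directory
        "pt": (va >> 12) & 0x1FF,     # index into the page table
        "offset": va & 0xFFF,         # byte offset inside the 4 KiB page
    }
```

For the common executable base `0x400000`, every index is 0 except the page directory index, which is 2 (4 MiB divided by the 2 MiB each page directory entry covers).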
So do pointers inside PCBs also use virtual addresses?
Yes. For example, the PCB on Linux is the task_struct. Through its mm field it points to an mm_struct, which holds a field called pgd of type pgd_t*. It holds a virtual address and, when dereferenced, it returns the first entry of the PML4 on x86-64.
And since the virtual memory of each process is contiguous, the kernel can immediately recognize stack overflows.
The kernel doesn't recognize stack overflows. It will simply not allocate more pages to the stack than the maximum stack size, which on Linux is a resource limit (RLIMIT_STACK). The stack is used with push and pop instructions. A single push cannot write more than 8 bytes, so it is simply a matter of reserving a guard page below the stack to create page faults on access.
However, from what I understand, the scheduler is invoked (at least in modern systems) with timer mechanisms (like round robin). Is that correct?
Round-robin is not a timer mechanism. The timer is interacted with using memory mapped registers. These registers are detected using the ACPI tables at boot (see my answer here: https://cs.stackexchange.com/questions/141870/when-are-a-controllers-registers-loaded-and-ready-to-inform-an-i-o-operation/141918#141918). It works similarly to the answer I provided for USB (on the link I provided here). Round-robin is a scheduling policy, often called naive because it simply gives every process a time slice and executes them in order. It is not the default policy in the Linux kernel (I think).
I did not understand the last point. How is the allocation of new memory managed.
The allocation of new memory is done with a system call. See my answer here for more info: Who sets the RIP register when you call the clone syscall?.
The user mode process jumps into a handler for the system call by executing the syscall instruction. It jumps to an address specified at boot by the kernel in the LSTAR model-specific register (IA32_LSTAR). Then the kernel's assembly entry code jumps to a function that does what the user mode process requires and returns to the user mode process. This is often not done by the programmer but by the C/C++ implementation (often called the standard library), a user mode library that is linked against dynamically.
The C++/C standard library will keep track of the memory it allocated by, itself, allocating some memory and by keeping records. Then, if you ask for a small allocation, it will use the pages it already allocated instead of requesting new ones using mmap (on Linux).
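The bookkeeping idea can be sketched with a toy allocator. This is a simulation only, not how glibc's malloc actually works: `TinyAllocator` and `_request_page` are hypothetical names, and the page request merely counts where a real library would issue a system call such as mmap.

```python
PAGE_SIZE = 4096

class TinyAllocator:
    """Toy bump allocator: asks for a new page only when the current one is full."""

    def __init__(self):
        self.pages_requested = 0
        self.free_in_current_page = 0

    def _request_page(self):
        # Stand-in for a system call such as mmap; here we only count it.
        self.pages_requested += 1
        self.free_in_current_page = PAGE_SIZE

    def allocate(self, size):
        if size > PAGE_SIZE:
            raise ValueError("this sketch only handles sub-page allocations")
        if size > self.free_in_current_page:
            self._request_page()          # not enough room: go to the "kernel"
        self.free_in_current_page -= size  # otherwise reuse the page we have
```

Ten allocations of 100 bytes cost one page request in this model; a following 4000-byte allocation no longer fits in the remainder and triggers a second one.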

How do modern OS's achieve idempotent cleanup functions for process deaths?

Let's say that I have an OS that implements malloc by storing a list of segments that the process points to in a process control block. I grab my memory from a free list and give it to the process.
If that process dies, I simply remove the reference to the segment from the process control block, and move the segment back to my free list.
Is it possible to create an idempotent function that does this process cleanup? How is it possible to create a function such that it can be called again, regardless of whether it was called many times before or if previous calls died in the middle of executing the cleanup function? It seems to me that you can't execute two move commands atomically.
How do modern OS's implement the magic involved in culling memory from processes that randomly die? How do they implement it so that it's okay for even the process performing the cull to randomly die, or is this a false assumption that I made?
I'll assume your question boils down to how the OS culls a process's memory if that process crashes.
Although I'm self educated in these matters, I'll give you two ways an OS can make sure any memory used by a process is reclaimed if the process crashes.
In a typical modern CPU and modern OS with virtual memory:
You have two layers of allocation. Whenever the process calls malloc, malloc tries to satisfy the request from already available memory pages the kernel gave the process. If not enough pages are available, malloc asks the kernel to allocate more pages.
In this case, whenever a process crashes or even if it exits normally, the kernel doesn't care what malloc did, or what memory the process forgot to release. It only needs to free all the pages it gave the process.
In a simpler OS that doesn't care much about performance, memory fragmentation or virtual memory and maybe not even about memory protection:
Malloc/free is implemented completely on the kernel side (e.g. as system calls). Whenever a process calls malloc/free, the kernel does all the work, and therefore knows about all the memory that needs to be freed. Once the process crashes or exits, the kernel can clean up. Since the kernel is never supposed to crash, and keeps a record of all the allocated memory per process, it's trivial.
Like I said, I'm self educated, and I didn't check how for example Linux or Windows implement it.
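For the first approach (the kernel only tracks the pages it handed out), a cleanup routine that is safe to call repeatedly might look like the following toy model. `ToyKernel`, `grant_page` and `reap` are invented names. The sketch does not model a crash in the middle of the cleanup itself, only repeated calls.

```python
class ToyKernel:
    """Toy model: the kernel only tracks which pages each process owns."""

    def __init__(self, total_pages):
        self.free_pages = set(range(total_pages))
        self.owned = {}                     # pid -> set of page numbers

    def grant_page(self, pid):
        page = self.free_pages.pop()
        self.owned.setdefault(pid, set()).add(page)
        return page

    def reap(self, pid):
        # pop() with a default makes a second call for the same pid a no-op,
        # so calling reap() again after it already ran is harmless.
        pages = self.owned.pop(pid, set())
        self.free_pages |= pages
        return len(pages)
```

Whatever malloc did inside those pages is irrelevant here; reclaiming the process's page set is all the model needs.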

Does ISR (Interrupt Service Routine) have a separate stack?

When using an RTOS (e.g. FreeRTOS), we have separate stack spaces for each thread. So what about ISRs (Interrupt Service Routines): do they have a separate stack in memory? Or is this configurable?
If they don't have a stack, where do the local variables declared in an ISR get stored?
I have the exact same question and a lot of searching leads me to this conclusion: the answer is dependent on your chip and how the OS you use configures that chip.
So looking at one of my favorite chips ARM Cortex-M3 (for which interrupts are a form of exception), the documentation at various spots reads:
Operating Modes
The Cortex-M3 supports Privileged and User (non-privileged) execution.
Code run as Privileged has full access rights whereas code executed as
User has limited access rights. The limitations include restrictions
on instruction use such as MSR fields, access to memory and
peripherals based on system design, and restrictions imposed by the
MPU configuration.
The processor supports two operation modes, Thread mode and Handler
mode. Thread mode is entered on reset and normally on return from an
exception. When in Thread mode, code can be executed as either
Privileged or Unprivileged.
Handler mode will be entered as a result of an exception. Code in
Handler mode is always executed as Privileged, therefore the core will
automatically switch to Privileged mode when exceptions occur. You can
change between Privileged Thread mode and User Thread mode when
returning from an exception by modifying the EXC_RETURN value in the
link register (R14). You can also change from Privileged Thread to
User Thread mode by clearing CONTROL[0] using an MSR instruction.
However, you cannot directly change to privileged mode from
unprivileged mode without going through an exception, for example an
SVC.
Main and Process Stacks
The Cortex-M3 supports two different stacks, a main stack and a
process stack. To support this the Cortex-M3 has two stack pointers
(R13). One of these is banked out depending on the stack in use. This
means that only one stack pointer at a time is visible as R13.
However, both stack pointers can be accessed using the MRS and MSR
instructions. The main stack is used at reset, and is always used in
Handler mode (when entering an exception handler). The process stack
pointer is only available as the current stack pointer when in Thread
mode. You can select which stack pointer (main or process) is used in
Thread mode in one of two ways, either by using the EXC_RETURN value
when exiting from Handler Mode or while in Thread Mode by writing to
CONTROL[1] using an MSR instruction.
And...
When the processor takes an exception, unless the exception is a
tail-chained or a late-arriving exception, the processor pushes
information onto the current stack. This operation is referred to as
stacking and the structure of eight data words is referred as the
stack frame. ...
Immediately after stacking, the stack pointer indicates the lowest
address in the stack frame
From the book "The Definitive Guide to the ARM Cortex-M3":
The MSP, also called SP_main in ARM documentation, is the default SP
after power-up; it is used by kernel code and exception handlers. The
PSP, or SP_process in ARM documentation, is typically used by thread
processes in system with embedded OS running.
Because exception handlers always use the Main Stack Pointer, the main
stack memory should contain enough space for the largest number of
nesting interrupts.
When an exception takes place, the registers R0–R3, R12, LR, PC,
and Program Status (PSR) are pushed to the stack. If the code that is
running uses the Process Stack Pointer (PSP), the process stack will
be used; if the code that is running uses the Main Stack Pointer
(MSP), the main stack will be used. Afterward, the main stack will
always be used during the handler, so all nested interrupts will use
the main stack.
UPDATE 6/2017:
My previous answer was incorrect, I have analyzed FreeRTOS for cortex processors and rewritten my answer to:
The standard FreeRTOS version for the Cortex-M3 does in fact configure and use both the MSP and PSP. When the very first task runs, it modifies MSP to point to the first address specified in the vector table (0x00000000); this tends to be the very last word in SRAM. It then triggers a system call. In the system call exception handler, it sets the PSP to the next task stack location, then it modifies the exception LR value to mean "return to thread mode and on return use the process stack".
This means that the interrupt service routine (AKA exception handler) stack grows down from the address specified in the vector table.
You can configure your linker and startup code to locate the exception handler stack wherever you like, make sure your heap or other memory areas do not overlap the exception handler area and make sure the area is large enough.
The answer for other chips and operating systems could be completely different!
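The "exception LR value" mentioned above is the EXC_RETURN code. As a small reference sketch, these are the three values defined for the Cortex-M3 (no floating-point extension) and what each one selects on return; the helper name `decode_exc_return` is made up for the example.

```python
# EXC_RETURN values on Cortex-M3: loaded into the PC on exception return.
EXC_RETURN = {
    0xFFFFFFF1: ("Handler", "MSP"),  # back to a nested handler, main stack
    0xFFFFFFF9: ("Thread", "MSP"),   # back to thread mode, main stack
    0xFFFFFFFD: ("Thread", "PSP"),   # back to thread mode, process stack
}

def decode_exc_return(value):
    mode, stack = EXC_RETURN[value]
    return f"return to {mode} mode using {stack}"
```

The value the port code needs for "return to thread mode and use the process stack" is therefore `0xFFFFFFFD`.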
To help ensure your application has appropriate space on the ISR stack (MSP), here's some additional code to check actual ISR stack use. Use it in addition to the checking you're already doing on FreeRTOS task stack use:
https://sourceforge.net/p/freertos/discussion/382005/thread/8418dd523e/
Update: I've posted my version of port.c that includes the ISR stack use check on github:
https://github.com/DRNadler/FreeRTOS_helpers

Way to clean up a child thread's address space even while it's still alive

In my process I have created 10 threads and will use those threads for as long as my application is alive. Each thread performs some file input and output operation every time. The problem is that every time a thread starts executing, my process's virtual memory increases.
My analysis is that when a file input/output task is allocated to a thread, the file is loaded into the thread's address space when the thread starts to copy the file. After the copy is completed, the thread's address space is not cleared because the thread has not exited. So if I assign another task to the thread, the new file is loaded into the thread's address space as well.
Hence the main process's virtual memory address space will increase. Please correct me if I am wrong, and also help me understand whether this causes a problem if the process runs for a long time.
A few things here.
1) Threads do not have their own memory address space. Processes do. (However, threads do get their own thread local storage.)
2) In managed languages, objects are not cleaned up and the heap compacted until the garbage collector is run. The garbage collector is not run until it needs to (e.g. the program is close to running out of memory). As long as the object has no strong references to it (nothing running can reach it) then the object will get cleaned up when the program needs it to be cleaned up, and you don't need to do anything else. If you want the garbage collector to run early, however, tell it to.
By the way, if resources are needed commonly amongst many different threads, you could consider having some sort of global cache for them. However, early optimization is a grievous sin, so don't go to all that effort until you've determined it solves a REAL problem.
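The "tell it to" remark can be illustrated in Python, which stands in here because the question does not say which runtime it uses. CPython frees most objects immediately through reference counting and needs its collector only for reference cycles, so the sketch builds a cycle on purpose; `Buffer` and `collect_cycle` are hypothetical names.

```python
import gc
import weakref

class Buffer:
    """Stand-in for some object a worker thread allocated."""

def collect_cycle():
    gc.disable()                     # keep automatic collection out of the demo
    try:
        a, b = Buffer(), Buffer()
        a.other, b.other = b, a      # reference cycle: refcounts never hit zero
        probe = weakref.ref(a)       # lets us observe whether a is still alive
        del a, b                     # unreachable now, but not yet reclaimed
        alive_before = probe() is not None
        gc.collect()                 # explicitly ask the collector to run
        alive_after = probe() is not None
        return alive_before, alive_after
    finally:
        gc.enable()
```

Under CPython this returns `(True, False)`: the unreachable objects linger until a collection happens, and the explicit call reclaims them. Other managed runtimes expose comparable requests, though how promptly memory is returned differs.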

Resources