I'd like to know how to increase the range of POSIX thread priorities beyond 1-99 for SCHED_RR. I've called sched_get_priority_min and sched_get_priority_max to verify the 1-99 range for SCHED_RR, but I'm porting code written for another operating system which uses more priority levels. I want each thread to have the same relative priority but do not want to force threads to share the same priority when they should be different.
You have probably read them already but the man pages seem clear enough:
The range of scheduling priorities
may vary on other POSIX systems, thus
it is a good idea for portable
applications to use a virtual priority
range and map it to the interval given
by sched_get_priority_max() and
sched_get_priority_min(). POSIX.1-2001
requires a spread of at least 32
between the maximum and the minimum
values for SCHED_FIFO and SCHED_RR.
I rather doubt there is a knob to twist to change this so my guess is you are stuck with either changing the kernel or scaling the priorities as suggested. I frankly have to wonder if more than 99 priorities really amount to a hill of beans in actual performance.
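The scaling the man page suggests can be sketched roughly as below. This is only an illustration: map_priority and map_priority_rr are made-up names, and the virtual range (for example 0-255) is whatever the original operating system used. Note that when the virtual range has more levels than the system range, some neighbouring virtual priorities will inevitably land on the same system priority; the mapping only guarantees that the ordering is never inverted.

```c
#include <assert.h>
#include <sched.h>

/* Map a priority from a virtual range [virt_min, virt_max] onto the
 * system range [sys_min, sys_max]. The mapping is monotonic: a higher
 * virtual priority never maps below a lower one. */
int map_priority(int virt, int virt_min, int virt_max,
                 int sys_min, int sys_max)
{
    assert(virt_max > virt_min && sys_max >= sys_min);
    if (virt < virt_min) virt = virt_min;   /* clamp out-of-range input */
    if (virt > virt_max) virt = virt_max;
    long span_virt = (long)virt_max - virt_min;
    long span_sys  = (long)sys_max - sys_min;
    return sys_min + (int)(((long)(virt - virt_min) * span_sys) / span_virt);
}

/* Hypothetical convenience wrapper: ask the system for the SCHED_RR
 * limits instead of hard-coding 1-99. */
int map_priority_rr(int virt, int virt_min, int virt_max)
{
    int lo = sched_get_priority_min(SCHED_RR);
    int hi = sched_get_priority_max(SCHED_RR);
    return map_priority(virt, virt_min, virt_max, lo, hi);
}
```

The result of map_priority_rr would then go into the sched_priority field of the struct sched_param passed to pthread_setschedparam or pthread_attr_setschedparam.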
Recently I was working with an RTOS and created some tasks to perform the actions I need. Every time I create a new task, whether with xTaskCreate or through the TI GUI configuration, I simply make the stack as large as I can so that it cannot overflow.
Is there any way to calculate the maximum stack size used by my task, taking the following into account?
1. Stack used by global and local variables
2. Stack used by the maximum recursion depth of a function
3. Stack used for interrupt context switching
The compiler, compiler optimisation level, CPU architecture, local variable allocations and function call nesting depth all have a large impact on the stack size. The RTOS has minimal impact. For example, FreeRTOS will add approximately 60 bytes to the stack on a Cortex-M - which is used to store the task's context when the task is not running. Whichever method you use to calculate stack usage in your non-RTOS project can be used in your RTOS project too - then add approximately 60 bytes.
You can calculate these things, and that can be important in safety critical applications, but in other cases a more pragmatic approach is to try it and see - use the features of the RTOS to measure how much stack is actually being used and use the stack overflow detection - then adjust until you find something optimal.
http://www.freertos.org/Stacks-and-stack-overflow-checking.html
http://www.freertos.org/uxTaskGetStackHighWaterMark.html
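I used this code:
TaskHandle_t cipTask; /* handle of the task to inspect; must have been set, e.g. by xTaskCreate() */
UBaseType_t uxHighWaterMark;
/* Print the minimum amount of stack (in words) that has remained unused since the task started */
for (;;) {
    uxHighWaterMark = uxTaskGetStackHighWaterMark(cipTask);
    Serial.println(uxHighWaterMark);
}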
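For anyone wondering how such a high-water mark can be measured at all, the usual technique is "stack painting": fill the stack with a known pattern before the task runs, then count how much of the pattern is still intact. The sketch below shows the idea in plain C on an ordinary array. It is not FreeRTOS source code; paint_stack, stack_high_water_mark and FILL_WORD are names invented for this example, and it assumes a stack that grows downward from the highest index.

```c
#include <stddef.h>
#include <stdint.h>

#define FILL_WORD 0xA5A5A5A5u  /* assumed fill pattern for this sketch */

/* Paint an entire stack buffer with the known pattern before use. */
void paint_stack(uint32_t *stack, size_t words)
{
    for (size_t i = 0; i < words; i++)
        stack[i] = FILL_WORD;
}

/* Count the words, starting from the far end of a downward-growing
 * stack (index 0), that still hold the pattern. This is the smallest
 * amount of free stack seen so far, in words. */
size_t stack_high_water_mark(const uint32_t *stack, size_t words)
{
    size_t untouched = 0;
    while (untouched < words && stack[untouched] == FILL_WORD)
        untouched++;
    return untouched;
}
```

The stack actually used is then the total size minus the returned value, multiplied by the word size if you want bytes.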
In some of the FreeRTOS demos for Cortex-M0 MCUs, configMINIMAL_STACK_SIZE is set to 60, while in others it is set to 70. With the STM32Cube software it is set to 128.
My question is what is actually the MINIMAL stack size?
Looking in the STM32 Cortex-M0 programming manual I see that the processor registers are
R0-R12, MSP, PSP, LR, PC, PSR, APSR, IPSR, EPSR, PRIMASK, CONTROL. Wouldn't that mean that the MINIMAL stack size is just 23 words? Or is there more info that needs to be saved for a context switch?
As per the description here: http://www.freertos.org/a00110.html#configMINIMAL_STACK_SIZE as far as the RTOS is concerned the constant does nothing more than set the size of the stack used by the idle task.
The stack has to be large enough to hold the context of the task, as well as any normal stack items used by the task (local variables, function call overhead, etc.) so the actual size required depends on what the idle task is doing - and will be at its very minimum if the idle task is doing nothing. If on the other hand there is an idle task hook function in use (http://www.freertos.org/a00016.html) then the required stack size will depend on what the hook function is doing (its function call depth, etc.).
The constant is also used by the demo tasks as a convenient way of being able to use the same demo tasks on multiple architectures, but that does not affect the RTOS, it is just demo code.
We are taught that the abstraction of RAM is a long array of bytes, and that the CPU takes the same amount of time to access any part of it. What is the device that can access any byte out of the 4 gigabytes (on my computer) in the same time? This does not seem like a trivial task to me.
I have asked colleagues and my professors, but nobody can pinpoint how this can be achieved with simple logic gates, and if it isn't just a tricky combination of logic gates, then what is it?
My personal guess is that you could access any memory location in O(log(n)) time, where n is the size of the memory, because each gate would split the memory in two and pass your access request on to the next gate, which splits its half in two again. But that requires a lot of gates. I can't come up with any other educated guess, and I don't even know the name of the device that I should look up on Google.
Please help my anguished curiosity, and thanks in advance.
edit<
This is what I learned!
Quoting one of the answers: "the RAM can send the value from cell addressed X to some output pins". This is where everyone (again) skips the part that is not trivial to me. The way I see it, in order to build a circuit that uses 64 pins to decide which byte out of 2^64 to fetch, each pin needs to split the remaining range of memory in two. If the bit at index 0 is 0, then the address is in 0 to 2^64/2, otherwise it is in 2^64/2 to 2^64, and so on. The number of gates (let's call them that) that a memory fetch passes through is then 64, a constant. However, the total number of gates needed is N, where N is the number of bytes of memory.
Having 64 pins doesn't by itself mean that you can decode the address into a single fetch from a range of 2^64. Does 4 gigabytes of memory come with 4 gigabytes of gates in the memory controller???
Now, this can be improved. As I read furiously more and more about how this memory is designed, I learned that if you arrange the memory as a matrix with sqrt(N) rows and sqrt(N) columns, the number of gates that a fetch needs to pass through is O(log(sqrt(N))*2) and the number of gates required is 2*sqrt(N), which is much better, and I think that it's probably a trade secret.
/edit<
What the heck, I might as well make this an answer.
Yes, in the physical world, memory access cannot be constant time.
But it cannot even be logarithmic time. The O(log n) circuit you have in mind ultimately involves some sort of binary (or whatever) tree, and you cannot make a binary tree with constant-length wires in a 3D universe.
Whatever the "bits per unit volume" capacity of your technology is, storing n bits requires a sphere with radius O(n^(1/3)). Since information can only travel at the speed of light, accessing a bit at the other end of the sphere requires time O(n^(1/3)).
But even this is wrong. If you want to talk about actual limitations of our universe, our physics friends say the absolute maximum number of bits you can store in any sphere is proportional to the sphere's surface area, not its volume. So the actual radius of a minimal sphere containing n bits of information is O(sqrt(n)).
As I mentioned in my comment, all of this is pretty much moot. The models of computation we generally use to analyze algorithms assume constant-access-time RAM, which is close enough to the truth in practice and allows a fair comparison of competing algorithms. (Although good engineers working on high-performance code are very concerned about locality and the memory hierarchy...)
Let's say your RAM has 2^64 cells (places where it is possible to store a single value, let's say 8-bit). Then it needs 64 pins to address every cell with a different number. When at the input pins of your RAM there 'appears' a binary number X the RAM can send the value from cell addressed X to some output pins, and your CPU can get the value from there. In hardware the addressing can be done quite easily, for example by using multiple NAND gates (such 'addressing device' from some logic gates is called a decoder).
So it all happens at the hardware level; this is just direct addressing. If the CPU is able to provide 64 bits to the 64 pins of your RAM, it can address every single memory cell (as 64 bits are enough to represent any number up to 2^64 - 1). The only reason you do not get the value immediately is a kind of 'propagation time', that is, the time it takes for the signal to pass through all the logic gates in the circuit.
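To make the decoder idea a bit more concrete, here is a tiny software model of one, kept to at most 32 output lines so that the result fits in one integer. It is only a sketch of the behaviour, not of the gate-level circuit, and decode_one_hot is a name made up for this example: for an n-bit address, exactly one of the 2^n select lines goes high.

```c
#include <assert.h>
#include <stdint.h>

/* n-to-2^n decoder: bit k of the result is select line k, and exactly
 * one line is high for any valid address. */
uint32_t decode_one_hot(unsigned address, unsigned address_bits)
{
    assert(address_bits <= 5);                /* at most 32 select lines here */
    assert(address < (1u << address_bits));   /* address must fit in the pins */
    return UINT32_C(1) << address;
}
```

For example, with 3 address pins the address 5 raises only line 5, i.e. the returned value is 0x20.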
The component responsible for dealing with memory accesses is the memory controller. It is used by the CPU to read from and write to memory.
The access time is constant because memory words are truly laid out in a matrix form (thus, the "byte array" abstraction is very realistic), where you have rows and columns. To fetch a given memory position, the desired memory address is passed on to the controller, which then activates the right row and column.
From http://computer.howstuffworks.com/ram1.htm:
Memory cells are etched onto a silicon wafer in an array of columns
(bitlines) and rows (wordlines). The intersection of a bitline and
wordline constitutes the address of the memory cell.
So, basically, the answer to your question is: the memory controller figures it out. Of course, given a memory address, the mapping to row and column must be easy to calculate in constant time.
To fully understand this topic, I recommend reading this guide on how memory works: http://computer.howstuffworks.com/ram.htm
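As a rough illustration of why that mapping is cheap, one simple scheme is to use the high bits of the address as the row number and the low bits as the column number. Which bits go where in a real chip is design specific, so treat the split below as an assumption; cell_position and locate_cell are names invented for this sketch.

```c
#include <stdint.h>

struct cell_position {
    uint32_t row;
    uint32_t column;
};

/* Split a flat address: the high bits pick the row (wordline), the
 * low column_bits bits pick the column (bitline). column_bits must be
 * smaller than 32. */
struct cell_position locate_cell(uint32_t address, unsigned column_bits)
{
    struct cell_position pos;
    pos.row    = address >> column_bits;
    pos.column = address & ((UINT32_C(1) << column_bits) - 1u);
    return pos;
}
```

With a 16-bit address and 8 column bits, address 0x1234 maps to row 0x12 and column 0x34, and the array needs 256 row lines plus 256 column lines instead of 65536 individual select lines.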
There are so many concepts to master that it is difficult to explain it all in one answer.
I read through your comments and questions before answering. I think you are on the right track, but there is some confusion here. The random access you are implying doesn't exist in quite the way you think it does.
Reading, writing, and refreshing are done in a continuous cycle. A particular cell in memory is only read or written in a certain interval if a signal is detected to do so in that cycle. There is going to be support circuitry that includes "sense amplifiers to amplify the signal or charge detected on a memory cell."
Unless I am misunderstanding what you are implying, your confusion is about how easy it is to read or write a cell. It differs depending on the chip design, but there IS a minimum number of cycles it takes to read or write data to a cell.
These are my sources:
http://www.doc.ic.ac.uk/~dfg/hardware/HardwareLecture16.pdf
http://www.electronics.dit.ie/staff/tscarff/memory/dram_cycles.htm
http://www.ece.cmu.edu/~ece548/localcpy/dramop.pdf
To avoid a humungous answer, I left most of the detail out but all three of these will describe the process you are looking for.
How do I determine the lower bound for the JVM option Xmx or otherwise economize on memory without a trial and error process? I happen to set Xms and Xmx to be the same amount, which I assume helps to economize on execution time. If I set Xmx to 7G, and likewise Xms, it will happily report that all of it is being used. I use the following query:
Runtime.getRuntime().totalMemory()
If I set it to less than that, say 5GB, all of it will likewise be used. Only when I provide very much less, say 1GB, is there an out-of-heap exception. Since my execution times are typically 10 hours or more, I need to avoid trial and error processes.
I'd execute the program with plenty of heap while monitoring heap usage with JConsole. Take note of the highest memory use after a major garbage collection, and set the maximum heap size about 50% to 100% higher than that amount to avoid frequent garbage collection.
As an aside, totalMemory reports the size of the heap, not how much of it is presently used. If you set minimum and maximum heap size to the same number, totalMemory will be the same irrespective of what your program does ...
Using Xms256M and Xmx512M, and a trivial program, freeMemory is 244M, totalMemory is 245M and maxMemory is 455M. Using Xms512M and Xmx512M, the amounts are 488M, 490M, and 490M. This suggests that totalMemory can vary if Xms is less than Xmx. That in turn suggests the answer to the question is to set Xms to a small amount and monitor the high-water mark of totalMemory. It also suggests maxMemory is the ultimate heap size that cannot be exceeded by the total of current and future objects.
Once the highwater mark is known, set Xmx to be somewhat more than that to be prudent -- but not excessively more because this is an economization effort -- and set Xms to be the same amount to get the time efficiency that is evidently preferred.
I was wondering about the parameters of two APIs of epoll.
epoll_create (int size) - in this API, size is defined as the size of the event pool. But it seems that having more events than the size still works. (I set the size to 2 and forced the event pool to hold 3 events... but it still works!?) So I was wondering what this parameter actually means, and I am curious about its maximum value.
epoll_wait (int maxevents) - for this API, the definition of maxevents is straightforward. However, I see a lack of information or advice on how to determine this parameter. I would expect it to depend on the size of the epoll event pool. Any suggestions or advice would be great. Thank you!
1.
"man epoll_create"
DESCRIPTION
...
The size is not the maximum size of the backing store but just a hint
to the kernel about how to dimension internal structures. (Nowadays,
size is unused; see NOTES below.)
NOTES
Since Linux 2.6.8, the size argument is unused, but must be greater
than zero. (The kernel dynamically sizes the required data structures
without needing this initial hint.)
2.
Pick a number that suits your workload, but be aware that
a small value may reduce efficiency a little,
because the smaller the number passed as "maxevents", the more often you may have to call epoll_wait() to consume all the events already queued on the epoll instance.
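Both points can be seen in a small experiment, sketched below for Linux. add_ready_pipe, poll_once and MAX_BATCH are names made up for this example; the pipes are only there to produce descriptors that are already readable, and error handling is minimal.

```c
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_BATCH 64  /* upper bound on events fetched per call in this sketch */

/* Create a pipe, make its read end readable, and register it with epfd.
 * Returns the read end, or -1 on failure. */
int add_ready_pipe(int epfd)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;
    if (write(fds[1], "x", 1) != 1)
        return -1;
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[0] };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) == -1)
        return -1;
    return fds[0];
}

/* One non-blocking epoll_wait call; returns how many events it reported. */
int poll_once(int epfd, int maxevents)
{
    struct epoll_event events[MAX_BATCH];
    if (maxevents > MAX_BATCH)
        maxevents = MAX_BATCH;
    return epoll_wait(epfd, events, maxevents, 0);
}
```

If you create the instance with epoll_create(2) and call add_ready_pipe three times, all three registrations succeed, which matches the man page note that size is only a hint. poll_once(epfd, 1) then reports a single event per call even though three are ready, while poll_once(epfd, 8) reports all three at once, so a small maxevents simply means more calls for the same work.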