Need to improve the Linux performance for embedded system - image-processing

I have a ARM OMAP based embedded system with 1 GHZ processor running Linux 2.6.33 cross compiled as CONFIG_PREEMPT. One of the Processes (process 1) is critical and need to run every 4 or 8 milli sec which is configurable. There is another process's (process 2) thread which transfers image to FTP or any other configured application. To trigger the time critical process 1 i use a high resolution timer as a seperate thread (FIFO, say 60) with highest Real time priority in the system. Process 2 is having lower RT priority (RR 20) than process 1 (RR 50).
If there is no image transfer enabled or configured i dont see any timeouts for the critical process (process 1) mentioned above. But if i enable any image transfer then the process 1 will timeout or the image transfer fails due to some error and one of these process dies and then other process runs fine.
I see that if the image resolution is higher then the timing out of process 1 is faster.
With higher resolution of image (say SXGA) the NET_RX ethernet interrupt holds the CPU for long time and by the time it gives up CPU, process 1 timesout. It looks like NET_RX interrupt is having highest priority than timer interrupt used for process 1 and it doesn't give the CPU.
I want to make sure both process running and process 1 should not miss the deadline.
How to debug the system that where it is exactly waiting so that i can remove those waits or atleast avoid those if possible.
How can i achieve this ? Please help.

Linux is not a real-time operating system. It offers no guarantees other than "best efforts" scheduling.
If you have a task which has to run at a particular rate all the time, you need to run that task under a proper RTOS which can make those sorts of guarantees.
Otherwise you have to relax your constraints to "runs every 4ms, mostly".

You may want to check "http://www.techonline.com/electrical-engineers/education-training/tech-papers/4402454/Challenges-in-Using-Linux-for-CPU-intensive-real-time-networking-products". It describes network performance in PREEMPT_RT

I found the solution to this performance issue by modifying priority of the thread sending image data to a SCHED_NORMAL and re arranging the source code avoiding unnecessary loops. Now i see that the image transfer is not affecting the performance of the whole system.

Related

If I run out of memory, which process will be killed?

I have an important process which I don't want to be killed. This process is using about 10GB of RAM. I have 32GB available. I want to run another process which will take up 18.2GB of RAM. There should be some room left. What happens if I hit the full 32GB? Will the last program I called be killed? That wouldn't be so bad, but the important one cannot die.
It is a likely chance that one of your programs will be moved to your disk but i'm not sure which one it is so I recommend batch processing.

How to determine concurrency (threads) while using shoryuken for background jobs?

In my Ruby on Rails application, I'm using shoryouken for background processing. I've many sqs queues (6-7) in my application. One of the queue has 2000-3000 jobs and it takes around 3 hours for the worker to process these 2-3k jobs with a default concurrency of 25. So based on what factors can we decide to increase the concurrency (which is the number of threads to process jobs). Please do comment if anything is unclear in the question.
Concurrency defaults to 25, but can be changed by altering your shoryuken.yml configuration (see below) or by adding the concurrency argument as so: shoryuken -c {desiredCount}
concurrency: 25 # Update with your desired value.
delay: 25 # The delay in seconds to pause a queue when it's empty. Default 0
queues:
- [high_priority, 6]
- [default, 2]
- [low_priority, 1]
You will need to test the optimal value for performance as you'll run into I/O and CPU bottlenecks as number of concurrent threads rises. Once you've reached the optimal value for your instance(s), you'll need to either increase the number of instances running this job or upgrade the instance(s).
If the bottleneck exists instead on your DB or other resource, you'll need to adjust it accordingly. (Not likely to be the case, but included for thoroughness' sake)
EDIT: Optimizing Performance
In response to your question on optimizing the thread count, the quickest/best way to determine the optimal concurrency value is to change concurrency and measure real-world throughput. There's other approaches, but the golden rule for performance is always to measure in a live production environment. Synthetic benchmarks are only helpful to the extent that they mirror real-time performance. (See also: premature optimization).
This is a case where you can easily end up overthinking things (then again, overthinking things is a perennial problem in development). Just measure with the appropriate metrics (CPU utilization, memory utilization, number of jobs completed per minute), and change the number of threads until you either maximize throughput or run into a bottleneck.
If your tasks are CPU bound you'll see your CPU utilization maxing out. If your tasks are I/O bound you'll see that after some point an increase in concurrent threads does not translate to an increase in throughput even though your CPU utilization fails to rise.
An I/O bottleneck can happen when any of the resources you're reading/writing are unable to keep up with your CPU demands. This includes system resources (memory, disk space), your database performance (DB CPU utilization, read/write limits), as well as other APIs you're connecting with. Network capacity is also a theoretical bottleneck but if it was you'd be big enough to have hired someone with experience in this area. Because there's so many different ways for this to happen, the only real way to figure it out what the bottlenecks are is to have your monitoring in place.
Re: formula, the short answer is that there's no one formula that you can use in this case. The long answer is probably yes, but you'd arrive at the optimum value in the course of collecting all the values you'd need to calculate it.
EDIT 2 : Concurrency, Latency, and Throughput
I realized I forgot to add one more piece of advice. When you're working with background tasks that users are not waiting for, your throughput (jobs per unit of time) is the only thing you want to optimize. Do not optimize for individual job time. It also means you cannot profile the current (and presumably un-bound) performance and get useful data because bottlenecks/constraints are target dependent. The constraints that exist for throughput will NOT be the same as the constraints that exist for individual task time.
(Technically speaking, your concurrency setting is your current constraint)
Three main factors are
Number of Cores
Type of Job - I/O or CPU bound
Is there another application or process running on server
Ideally for a cpu bound task keep number of thread to number of cpu cores.
For I/O bound task it requires benchmarking and calculating wait time for an I/O, and then you can decide the optimal value. For rough estimate if you have 4 cores than for I/O bound task you must keep at max 8 threads.
If you have your rails app running on the same then you will need to reduce number of cores.
Increasing the number of cores will not increase your performance if your system doesnt support.
Refer : http://baddotrobot.com/blog/2013/06/01/optimum-number-of-threads/

Execution window time

I've read an article in the book elixir in action about processes and scheduler and have some questions:
Each process get a small execution window, what is does mean?
Execution windows is approximately 2000 function calls?
What is a process implicitly yield execution?
Let's say you have 10,000 Erlang/Elixir processes running. For simplicity, let's also say your computer only has a single process with a single core. The processor is only capable of doing one thing at a time, so only a single process is capable of being executed at any given moment.
Let's say one of these processes has a long running task. If the Erlang VM wasn't capable of interrupting the process, every single other process would have to wait until that process is done with its task. This doesn't scale well when you're trying to handle tens of thousands of requests.
Thankfully, the Erlang VM is not so naive. When a process spins up, it's given 2,000 reductions (function calls). Every time a function is called by the process, it's reduction count goes down by 1. Once its reduction count hits zero, the process is interrupted (it implicitly yields execution), and it has to wait its turn.
Because Erlang/Elixir don't have loops, iterating over a large data structure must be done recursively. This means that unlike most languages where loops become system bottlenecks, each iteration uses up one of the process' reductions, and the process cannot hog execution.
The rest of this answer is beyond the scope of the question, but included for completeness.
Let's say now that you have a processor with 4 cores. Instead of only having 1 scheduler, the VM will start up with 4 schedulers (1 for each core). If you have enough processes running that the first scheduler can't handle the load in a reasonable amount of time, the second scheduler will take control of the excess processes, executing them in parallel to the first scheduler.
If those two schedulers can't handle the load in a reasonable amount of time, the third scheduler will take on some of the load. This continues until all of the processors are fully utilized.
Additionally, the VM is smart enough not to waste time on processes that are idle - i.e. just waiting for messages.
There is an excellent blog post by JLouis on How Erlang Does Scheduling. I recommend reading it.

When is it appropriate to increase the async-thread size from zero?

I have been reading the documentation trying to understand when it makes sense to increase the async-thread pool size via the +A N switch.
I am perfectly prepared to benchmark, but I was wondering if there were a rule-of-thumb for when one ought to suspect that growing the pool size from 0 to N (or N to N+M) would be helpful.
Thanks
The BEAM runs Erlang code in special threads it calls schedulers. By default it will start a scheduler for every core in your processor. This can be controlled and start up time, for instance if you don't want to run Erlang on all cores but "reserve" some for other things. Normally when you do a file I/O operation then it is run in a scheduler and as file I/O operations are relatively slow they will block that scheduler while they are running. Which can affect the real-time properties. Normally you don't do that much file I/O so it is not a problem.
The asynchronous thread pool are OS threads which are used for I/O operations. Normally the pool is empty but if you use the +A at startup time then the BEAM will create extra threads for this pool. These threads will then only be used for file I/O operations which means that the scheduler threads will no longer block waiting for file I/O and the real-time properties are improved. Of course this costs as OS threads aren't free. The threads don't mix so scheduler threads are just scheduler threads and async threads are just async threads.
If you are writing linked-in drivers for ports these can also use the async thread pool. But you have to detect when they have been started yourself.
How many you need is very much up to your application. By default none are started. Like #demeshchuk I have also heard that Riak likes to have a large async thread pool as they open many files. My only advice is to try it and measure. As with all optimisation?
By default, the number of threads in a running Erlang VM is equal to the number of processor logical cores (if you are using SMP, of course).
From my experience, increasing the +A parameter may give some performance improvement when you are having many simultaneous file I/O operations. And I doubt that increasing +A might increase the overall processes performance, since BEAM's scheduler is extremely fast and optimized.
Speaking of the exact numbers – that totally depends on your application I think. Say, in case of Riak, where the maximum number of opened files is more or less predictable, you can set +A to this maximum, or several times less if it's way too big (by default it's 64, BTW). If your application contains, like, millions of files, and you serve them to web clients – that's another story; most likely, you might want to run some benchmarks with your own code and your own environment.
Finally, I believe I've never seen +A more than a hundred. Doesn't mean you can't set it, but there's likely no point in it.

speed up php-cli

Why is a php cli process using 25% of CPU, is there a way to reduce this? Right now I'm running 3 instances but obviously I would like to run much more to finish the job faster.
Background info: I'm moving data from a transbase db to mysql db.
EDIT: If I run this in a browser there isn't such a noticeable load on the CPU.
More processes doesn't mean faster processing. The PHP process takes as much CPU as it can to finisgh the task as quick as possible. It's probably 25% because you got a quad-core processor and it's a single threaded task.
Ideally, you would need 4 processes if you could assign each of them to a different code. Also, because of waiting for database or disk-I/O, a single thread cannot fully use all CPU power all the time, so go ahead and run more processes. It's not that a 5th processes will crash because all CPU power is used up; it will just take its share, while the OS divides processing power to all running processes.
Just dont' start too many; every process has a little overhead, and you won't benefit from having 200 simultaneous processes.

Resources