Why does a FireMonkey application use no more than 20% of CPU? - delphi

I have a large binary file (approximately 700 MB) which I load into a TMemoryStream. After that I read it with TMemoryStream.Read() and perform some simple calculations, but the application never takes more than 20% of CPU. My PC has an i7 processor.
Is there any way to increase the CPU usage and speed up the reading process without using threads?

As far as I know, the only way to utilise the power of multiple CPU cores with Delphi is to use threads.
If you do choose to use threads in your application, there are a couple of libraries that may ease development. See: How Do I Choose Between the Various Ways to do Threading in Delphi?

Adding on to Shannon's answer: on an i7 processor with multiple cores, one thread will only utilize one core, because a single thread cannot run on more than one processor core at a time. On a typical i7 with four cores (eight hardware threads), one fully busy thread therefore shows up as roughly 12-25% of total CPU, which matches the ~20% you are seeing. So if you wish to utilize multiple cores, you need to create multiple threads to handle the various tasks. Using threads isn't as simple as saying 'do this in that thread', though; there is a lot to know about multi-threading. For example, your application has one main GUI thread, while one thread might be dedicated to performing some long calculation, another might be updating a caption with real-time data, and so on.
Windows automatically decides which core to assign a thread to, and usually divides the work up fairly. So if you have 8 processor cores and 16 threads, each core would (presumably) get 2 threads, and since each core executes independently of the others, more than one thread can literally be running at the same time (as opposed to a single core, which time-slices each 'tick' between threads).
So to answer your question: if you had five threads each performing something big at the same time (each using ~20% of total CPU, as your single thread does), you would see 100% processor usage.
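To make the idea concrete, here is a minimal sketch of the pattern (in Java rather than Delphi, since the pattern itself is language-agnostic): split the loaded buffer into one slice per core and scan the slices on parallel worker threads. The file name and the byte-summing work are hypothetical stand-ins for your file and your calculation.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelScan {
    public static void main(String[] args) throws Exception {
        // Load the whole file into memory, like TMemoryStream does.
        byte[] data = Files.readAllBytes(Paths.get("big.bin")); // hypothetical file
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);

        int chunk = (data.length + cores - 1) / cores;
        List<Future<Long>> parts = new ArrayList<>();
        for (int i = 0; i < cores; i++) {
            final int from = i * chunk;
            final int to = Math.min(from + chunk, data.length);
            // Each worker scans only its own slice: no shared mutable state.
            parts.add(pool.submit(() -> {
                long sum = 0;
                for (int j = from; j < to; j++) sum += data[j] & 0xFF;
                return sum;
            }));
        }
        long total = 0;
        for (Future<Long> f : parts) total += f.get(); // combine partial results
        pool.shutdown();
        System.out.println("checksum = " + total);
    }
}
```

With one busy thread per core, the machine should show close to 100% CPU instead of ~20%.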

Related

Is Javonet a threadsafe library, and more importantly, does it allow usage of all threads?

Is Javonet threadsafe? I couldn't find any documentation one way or the other. Even if it is threadsafe, is there some sort of "mutex" that's preventing full usage of all threads?
When I tried to run Javonet in parallel, it did work, but the CPU usage did not significantly increase above the sequential load (i.e. on a 10-CPU system, the CPU usage hovered around 20% for the parallel load, which was merely double the sequential CPU load of 10%); however, if I ran 10 instances of the exact same sequential code (that used Javonet), I achieved 100% CPU usage... so it "feels" like Javonet must have some built-in mutexes that are preventing full parallel usage.
Javonet is thread safe. You just need to follow standard practices for writing multi-threaded applications and Javonet will take care of executing your code properly.
Javonet creates a corresponding .NET thread for each calling Java thread. The same applies the other way around: for callbacks, events, and delegates called from another thread, Javonet will create the corresponding thread on the Java side. Once the calling thread completes, Javonet will close the thread on the other side.
If the corresponding thread already exists, Javonet will rejoin that existing thread.
Javonet does use internal mutexes / read-write locks while accessing object instances, some caching collections, and types, which, depending on your Java code, might affect its parallelization capabilities.
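As an illustration of the "standard practices" the answer mentions, here is a sketch of fanning work out over independent Java threads so that Javonet can give each one its own corresponding thread on the other side. The callIntoJavonet() helper is a hypothetical placeholder; the real Javonet API is not shown here.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelBridgeCalls {
    // Hypothetical placeholder for whatever bridged Javonet call you make.
    static int callIntoJavonet(int input) {
        return input * 2; // stand-in work
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        // Each task runs on its own Java thread, so the bridge can map each
        // one to its own corresponding thread on the other side.
        List<Future<Integer>> results = IntStream.range(0, 100)
                .mapToObj(i -> pool.submit(() -> callIntoJavonet(i)))
                .collect(Collectors.toList());
        for (Future<Integer> f : results) f.get(); // wait for all calls
        pool.shutdown();
    }
}
```

If CPU usage still hovers near the sequential level with a structure like this, the internal locks mentioned above are the likely bottleneck.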

Should I programmatically put computation-heavy tasks on a separate thread on iOS to utilize multiple cores?

I am making a real-time image processing app on iOS with my team. I am handling the custom computation kernel (mostly on the CPU rather than the GPU) and my teammates deal with the GUI. When I tested my kernel in a toy app, the kernel (ignoring any IO overhead) ran steadily at 100 ms per image. However, when put into the full-functioning app, it slowed down to 500 ms per image.
I have checked that the data is pretty much the same, and I am only measuring time consumed within the kernel, on the same iPhone 6. There is hardly any other computation in the full-functioning app, so I am not sure what is dragging it down. Though GPU processing is definitely an alternative and I am working on it, I would like to know if there are any tricks I can use for now.
Currently there is no explicit multi-threading in the computation part, so my simple guess is: should I programmatically put the computation part on a separate thread so the second core can be utilized?
[Update]
It turns out that I made some mistakes in packaging my code as a library, as copying over the source code directly works out nicely. I have not figured out my problem yet and am going to post it as a separate question.
GPU Acceleration
This massively depends on the tasks you're performing; the GPU is good at a specific subset of tasks, and simply utilising it can sometimes even slow things down. Check this out.
A lot of image-based tasks that are part of the Quartz framework etc. are GPU-accelerated (like blurring). Also, if you use a library like OpenCV you get GPU acceleration on certain tasks out of the box.
Unless you're a real pro I would avoid using the GPU specifically and let the frameworks and libraries you use do that for you.
Concurrency
It will certainly help to put intensive tasks on a background thread. Just be aware of what that entails (e.g. you can't make any UIKit calls from a background thread).
The answer heavily depends on how you do the processing. Some methods in the SDK perform their job in a background thread, while others require the caller to create and use one.
In general, in the case of drawing, most methods require you to create one explicitly. This is especially important for the ones that perform their work on the CPU (e.g. using CoreGraphics to draw within a drawRect method). If you're using methods that use the GPU for processing, then creating threads won't be of much use, since the CPU won't be the bottleneck.
If you want to determine why your app slows down, use Instruments. (Time Profiler for CPU and Core Animation for drawing)
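Since the advice above is about the pattern rather than a specific API, here is the general shape of it sketched in Java: the heavy kernel runs on a background executor, and only the finished result is handed back to the UI thread. The updateUi() hook is a hypothetical stand-in for your toolkit's main-thread dispatch (DispatchQueue.main.async on iOS).

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BackgroundWork {
    static final ExecutorService background = Executors.newSingleThreadExecutor();

    // Hypothetical stand-in for dispatching back to the UI/main thread.
    static void updateUi(String result) {
        System.out.println("UI update: " + result);
    }

    static String processImage() {
        // Simulated heavy, CPU-bound kernel.
        long sum = 0;
        for (int i = 0; i < 50_000_000; i++) sum += i;
        return "done, sum=" + sum;
    }

    public static void main(String[] args) {
        // Heavy work runs off the main thread; only the result touches the UI.
        CompletableFuture.supplyAsync(BackgroundWork::processImage, background)
                .thenAccept(BackgroundWork::updateUi)
                .join(); // join only so this demo doesn't exit early
        background.shutdown();
    }
}
```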

Reasons for sub-linear speedup in parallel programs

What are the reasons a parallelized program doesn't achieve the ideal speedup?
For example, I have thought about data dependencies, the cost of data transfer between threads (or actors), and synchronisation for access to the same data structures. Any other ideas (or subcategories of the reasons I mentioned)?
I'm particularly interested in problems occurring in the Erlang actor model, but any other issues are welcome.
A few in no particular order:
Cache-line sharing (false sharing) - multiple variables on the same cache line can incur overhead between processors, even if the theoretical model says they should be independent.
Context-switch overhead - if you have more threads than cores, there will be overhead from context switching.
Kernel scalability issues - the kernel may be fine at, say, 4 cores, but less efficient at 8.
Lock convoying - threads queuing up behind the same contended lock end up effectively serialized.
Amdahl's law - the achievable speedup is limited by the proportion of the program that can be parallelized; the serial remainder caps it.
One reason is that parallelizing a program is often more difficult than one imagines and there are many subtle problems which can occur. For a very good discussion on this see Amdahl's Law.
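For concreteness, Amdahl's law in its usual form, with a worked example (the 90% parallel fraction is purely illustrative):

```latex
% Speedup on N cores when a fraction p of the work can be parallelized:
S(N) = \frac{1}{(1 - p) + p/N}
% Example: p = 0.9 on N = 8 cores gives
% S(8) = 1 / (0.1 + 0.9/8) \approx 4.7,
% and even with infinitely many cores the speedup can never exceed
% 1/(1 - p) = 10.
```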
The main problem in the Erlang actor model is that each process has its own heap of memory, and messages passed between processes are copied. Contrast this with the usual shared-memory approach, where you can pass a pointer to a structure between processes.
In a shared-memory environment, it is up to the programmer to ensure that only a single process/thread operates on a piece of memory at a time. That is, some process is designated as the owner and has responsibility for doing the right thing with that memory area. Not so in Erlang: by design, one process can't rummage around in other processes' memory areas, and you must copy values to other processes. This is tremendously powerful when we consider the robustness of programs, but not so much if we consider the speed at which the program executes. On the other hand, if we want a distributed environment of multiple computers, copying reigns king and is the only way to transfer data between machines.
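The trade-off can be sketched outside Erlang too. Here is a rough Java illustration of the two styles: handing a receiver its own copy of a message (Erlang-style isolation) versus sharing the underlying structure (cheap, but requiring coordination on every access).

```java
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class CopyVsShare {
    public static void main(String[] args) throws Exception {
        BlockingQueue<int[]> mailbox = new ArrayBlockingQueue<>(16);
        int[] data = {1, 2, 3};

        // Erlang-style: hand the receiver its own copy; the sender's
        // array can never be mutated underneath it.
        mailbox.put(Arrays.copyOf(data, data.length));

        Thread receiver = new Thread(() -> {
            try {
                int[] msg = mailbox.take();
                msg[0] = 99; // mutates only the receiver's copy
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        receiver.start();
        receiver.join();

        // Shared-memory style would have passed `data` itself: no copy cost,
        // but then both sides must coordinate every access.
        System.out.println(data[0]); // still 1: the copy isolated us
    }
}
```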
Amdahl's law comes into play because parts of your program may be impossible to spread out over multiple cores. Some problems are inherently serial in nature: you have no hope of ever speeding them up. Usually they are iterative, where each new iteration depends on the previous one, so you can't compute the next step ahead of time.

Do asynchronous controllers in ASP.NET 4 make sense?

ASP.NET 2 had 12 worker threads by default.
Now ASP.NET 4 has 5000. Do we still need async controllers?
Do we still need async controllers?
Yes. Async controllers are useful in situations where you have lengthy operations such as network calls and you don't want to monopolize worker threads for them. The fact that there are 5000 worker threads by default doesn't mean that you have to waste them. Just because you're a millionaire doesn't mean you should give your money away.
Obviously if you don't use async controllers correctly they will do more harm than good.
MVC 4/Dev11 makes Async controllers more appealing than previous versions. Add to that WebAPI making it easy to create web services.
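The same principle can be sketched outside ASP.NET. Here is a rough Java servlet version of an "async controller": startAsync() releases the container's worker thread while a slow call is in flight, and the connection is completed from a continuation. The /report URL and slowRemoteCall() are hypothetical.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import javax.servlet.AsyncContext;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet(urlPatterns = "/report", asyncSupported = true)
public class AsyncReportServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
        // Detach from the container's worker thread for the slow part.
        AsyncContext ctx = req.startAsync();
        CompletableFuture
                .supplyAsync(AsyncReportServlet::slowRemoteCall)
                .thenAccept(body -> {
                    try {
                        ctx.getResponse().getWriter().write(body);
                    } catch (IOException ignored) {
                    } finally {
                        ctx.complete(); // hand the connection back
                    }
                });
    }

    // Hypothetical stand-in for a lengthy network or database call.
    private static String slowRemoteCall() {
        try { Thread.sleep(2000); } catch (InterruptedException ignored) {}
        return "report";
    }
}
```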
Start of Levi's comments, so they won't be missed (under Darin Dimitrov's excellent answer):
Expanding on Darin's answer a bit - asynchronous I/O operations (which is what AsyncController is intended for) operate using IOCP, not ThreadPool threads. This is important, as each ThreadPool thread has an associated 1 MB stack (plus other overhead), so if you're using 5000 ThreadPool threads, you're automatically losing 5 GB of memory just due to overhead! IOCP continuations have nowhere near as much overhead, so it's possible to juggle greater numbers of them at any given time. ThreadPool threads are pooled and removed when no longer needed - so you're only taking the hit for threads which are currently active. But if you're doing a lot of concurrent CPU-bound work with ThreadPool, you're very rapidly going to start hitting memory issues. This is precisely one of the reasons the C# / VB teams released the Async CTP a few months ago - to try to solve this issue.
Async for web services often makes sense - see Should my database calls be Asynchronous Part II:
For database applications, using async operations to reduce the number of blocked threads on the web server is almost always a complete waste of time. A small web server can easily handle way more simultaneous blocking requests than your database back-end can process concurrently. Instead, make sure your service calls are cheap at the database, and limit the number of concurrently executing requests to a number that you have tested to work correctly and to maximize overall transaction throughput.
See Should my database calls be Asynchronous?
4 or 5000, it doesn't matter; it is just a setting. You can set it to 1 billion if you want, but that won't make your application more scalable. In the end your machine only has 4 cores (or 8, or 2, but not 5000). Always keep in mind that you can only ever have as many threads running at the same time as you have cores. Every thread in excess of your number of cores is just overhead: it causes more context switches, consumes CPU, and occupies more memory.
IO (database access, web services, file access...) does not take up any CPU. If you do it synchronously, it will block a thread for the length of the operation. If you have a lengthy operation (5 seconds) and a load of 1,000 requests per second, you will be permanently blocking 5,000 threads. So you are already starving the thread pool (with a setting of 5,000). But what is worse is that you will be thrashing your machine with context switches. If you do it asynchronously, no thread is blocked, no resource is held, and there is no limit on the number of concurrent IO operations you can have in flight.
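That 5,000 figure is just Little's law applied to the numbers above:

```latex
% Little's law: requests in flight = arrival rate x time each request holds a thread
L = \lambda W = 1000\ \text{req/s} \times 5\ \text{s} = 5000\ \text{blocked threads}
```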
Adding more threads to the thread pool is a quick and dirty hack for when you can't afford to rewrite your application to use asynchronous IO. It is not a clean solution.

Cooperative memory usage across threads?

I have an application with multiple threads processing work from a todo queue. I have no influence over what gets into the queue or in what order (it is fed externally by the user). A single work item from the queue may take anywhere from a couple of seconds to several hours of runtime and should not be interrupted while processing. Also, a single work item may consume between a couple of megabytes and around 2 GB of memory. The memory consumption is my problem. I'm running as a 64-bit process on an 8 GB machine with 8 parallel threads. If each of them hits a worst-case work item at the same time, I run out of memory. I'm wondering about the best way to work around this. My options:
1. Plan conservatively and run only 4 threads. The worst case shouldn't be a problem anymore, but we waste a lot of parallelism, making the average case a lot slower.
2. Make each thread check available memory (or rather, the total memory allocated by all threads) before starting on a new item. Only start when more than 2 GB of memory are left. Recheck periodically, hoping that other threads will finish their memory hogs so we may eventually start.
3. Try to predict how much memory items from the queue will need (hard) and plan accordingly. We could reorder the queue (overriding the user's choice) or simply adjust the number of running worker threads.
4. More ideas?
I'm currently leaning towards number 2 because it seems simple to implement and solves most cases. However, I'm still wondering what standard ways of handling situations like this exist. The operating system must do something very similar at the process level, after all...
regards,
Sören
So your current worst-case memory usage is 16 GB. With only 8 GB of RAM, you'd be lucky to have 6 or 7 GB left after the OS and system processes take their share. So on average you're already going to be thrashing memory on a moderately loaded system. How many cores does the machine have? Do you have 8 worker threads because it is an 8-core machine?
Basically you can either reduce memory consumption or increase available memory. Your option 1, running only 4 threads, under-utilizes the CPU resources, which could halve your throughput - definitely sub-optimal.
Option 2 is possible, but risky. Memory management is very complex, and querying for available memory is no guarantee that you will be able to go ahead and allocate that amount (without causing paging). A burst of disk I/O could cause the system to increase the cache size, a background process could start up and swap in its working set, and any number of other factors. For these reasons, the smaller the available memory, the less you can rely on it. Also, over time memory fragmentation can cause problems too.
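One way to make option 2 less racy is to gate admission on a self-managed budget rather than querying the OS for free memory. A minimal sketch, assuming a 6 GB budget and a per-item size estimate (both the budget and the estimatedMb() helper are assumptions based on the question, not a standard API):

```java
import java.util.concurrent.Semaphore;

public class MemoryBudget {
    // Self-managed budget in MB: 6 GB, leaving headroom for the OS (assumed).
    private final Semaphore budgetMb = new Semaphore(6 * 1024);

    public void process(WorkItem item) throws InterruptedException {
        // Estimated footprint in MB; worst case ~2 GB per the question.
        int estimate = item.estimatedMb();
        budgetMb.acquire(estimate); // blocks until enough budget is free
        try {
            item.run();
        } finally {
            budgetMb.release(estimate); // always give the budget back
        }
    }

    interface WorkItem {
        int estimatedMb();
        void run();
    }
}
```

Note that a large acquire can wait behind a stream of small ones; a fair semaphore (new Semaphore(n, true)) trades a little throughput to avoid starving the big items.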
Option 3 is interesting, but could easily lead to under-loading the CPU. If you have a run of jobs that have high memory requirements, you could end up running only a few threads, and be in the same situation as option 1, where you are under-loading the cores.
So taking the "reduce consumption" strategy, do you actually need to have the entire data set in memory at once? Depending on the algorithm and the data access pattern (e.g. random versus sequential) you could progressively load the data. More esoteric approaches might involve compression, depending on your data and the algorithm (but really, it's probably a waste of effort).
Then there's "increase available memory". In terms of price/performance, you should seriously consider simply purchasing more RAM. Sometimes, investing in more hardware is cheaper than the development time to achieve the same end result. For example, you could put in 32GB of RAM for a few hundred dollars, and this would immediately improve performance without adding any complexity to the solution. With the performance pressure off, you could profile the application to see just where you can make the software more efficient.
I have continued the discussion on Herb Sutter's blog and provoked some very helpful reader comments. Head over to Sutter's Mill if you are interested.
Thanks for all the suggestions so far!
Sören
Difficult to propose solutions without knowing exactly what you're doing, but how about considering:
See if your processing algorithm can access the data in smaller sections without loading the whole work item into memory.
Consider developing a service-based solution so that the work is carried out by another process (possibly a web service). This way you could scale the solution to run over multiple servers, perhaps using a load balancer to distribute the work.
Are you persisting the incoming work items to disk before processing them? If not, they probably should be anyway, particularly if it may be some time before the processor gets to them.
Is the memory usage proportional to the size of the incoming work item, or otherwise easy to calculate? Knowing this would help to decide how to schedule processing.
Hope that helps?!
