SIGPROF on darwin - profiler

setitimer + SIGPROF doesn't work on 64-bit Darwin.
References:
http://openradar.appspot.com/9336975
http://lists.apple.com/archives/Unix-porting/2007/Aug/msg00000.html
Given this, what's the recommended way to build a time-based sampling profiler on Darwin in a multithreaded environment?

I ended up creating a separate profiler thread that sleeps for an interval and then pthread_kills all the other threads with SIGPROF. The SIGPROF handler looks at how much CPU time the thread has consumed and punts if it's too little.
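For reference, here is a minimal C sketch of that approach. It assumes a hypothetical registry of application threads (app_threads / app_thread_count) maintained elsewhere, accepts that calling thread_info() from a signal handler is not formally async-signal-safe, and omits error handling:

```c
/* Sketch of the approach described above -- not production code. */
#include <mach/mach.h>   /* thread_info: per-thread CPU time on Darwin */
#include <pthread.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define SAMPLE_INTERVAL_US 10000  /* 10 ms between sampling rounds */
#define MIN_CPU_DELTA_US    1000  /* punt if the thread used less CPU than this */

static pthread_t app_threads[64];     /* hypothetical thread registry */
static int       app_thread_count;

static __thread uint64_t last_cpu_us; /* CPU time seen at the previous sample */

/* User + system CPU time of the current thread, in microseconds. */
static uint64_t thread_cpu_us(void)
{
    thread_basic_info_data_t info;
    mach_msg_type_number_t count = THREAD_BASIC_INFO_COUNT;
    mach_port_t port = mach_thread_self();

    thread_info(port, THREAD_BASIC_INFO, (thread_info_t)&info, &count);
    mach_port_deallocate(mach_task_self(), port);

    return (uint64_t)(info.user_time.seconds + info.system_time.seconds) * 1000000
         + info.user_time.microseconds + info.system_time.microseconds;
}

/* SIGPROF handler: only record a sample if the thread actually burned CPU. */
static void sigprof_handler(int sig)
{
    (void)sig;
    uint64_t now = thread_cpu_us();
    if (now - last_cpu_us < MIN_CPU_DELTA_US)
        return;                       /* thread was mostly idle: punt */
    last_cpu_us = now;
    /* capture_backtrace();            record the sample here */
}

/* Dedicated profiler thread: sleep, then poke every registered thread. */
static void *profiler_thread(void *arg)
{
    (void)arg;
    for (;;) {
        usleep(SAMPLE_INTERVAL_US);
        for (int i = 0; i < app_thread_count; i++)
            pthread_kill(app_threads[i], SIGPROF);
    }
    return NULL;
}

void start_profiler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = sigprof_handler;
    sa.sa_flags = SA_RESTART;
    sigaction(SIGPROF, &sa, NULL);

    pthread_t tid;
    pthread_create(&tid, NULL, profiler_thread, NULL);
}
```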

Related

Why should I use Erlang's gen_server:call for a synchronized call?

I am tuning the performance of my Erlang project and running into an odd scenario. Because I want to use the full power of the CPU (multi-core), in some cases the CPU usage is around 100% in my testing.
Then I found something strange. There is some code that must be synchronized, so I use gen_server:call for that task. When the CPU is busy, the gen_server's actual work takes very little time (it reads a value from ETS), yet the overall gen_server:call takes a long time. After profiling, I found that most of the time spent in gen_server:call is pending (90%+). In my case the task runs 3000 times; the total task time is about 100 ms, but the calls to gen_server:call cost about 3 seconds.
In this case, why should I use gen_server:call, if it is so expensive, just to implement the gen_server behaviour? It looks like it would be better to just do the task directly.
Does anyone have an idea as to why gen_server:call might be taking so long?
Thank you,
Eric

Is Javonet a thread-safe library, and more importantly, does it allow usage of all threads?

Is Javonet thread-safe? I couldn't find any documentation one way or the other. Even if it is thread-safe, is there some sort of "mutex" that's preventing full usage of all threads?
When I tried to run Javonet in parallel, it did work, but the CPU usage did not increase significantly above the sequential load (i.e. on a 10-CPU system, the CPU usage hovered around 20% for the parallel load, which was merely double the sequential CPU load of 10%); however, if I ran 10 copies of the exact same sequential code (that used Javonet), I achieved 100% CPU usage... so it "feels" like Javonet must have some built-in mutexes that are preventing full parallel usage.
Javonet is thread safe. You just need to follow standard practices for writing multi-threaded applications and Javonet will take care of executing your code properly.
Javonet creates a corresponding new .NET thread for each calling Java thread. It also works the other way around: for callbacks, events and delegates called from another thread, Javonet will create the corresponding thread on the Java side. Once the calling thread completes, Javonet will close the thread on the other side.
If the corresponding thread already exists, Javonet will rejoin the existing, valid thread.
Javonet does use internal mutexes / read-write locks while accessing object instances, some caching collections and types, which, depending on your Java code, might affect the parallelization capabilities.
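As a generic illustration (not Javonet-specific code, since its internals aren't documented here), this C sketch shows how a single library-internal lock can cap parallel CPU usage the way the question describes: ten threads are running, but the useful work is serialized, so adding threads adds little throughput.

```c
/* Generic illustration of a library-internal lock serializing "work". */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t internal_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long long counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000000; i++) {
        pthread_mutex_lock(&internal_lock);   /* stand-in for an internal mutex */
        counter++;                            /* the "real work" happens while locked */
        pthread_mutex_unlock(&internal_lock);
    }
    return NULL;
}

int main(void)
{
    enum { NTHREADS = 10 };
    pthread_t t[NTHREADS];

    /* Ten threads, but the lock serializes the work: throughput barely
     * improves over a single thread, and overall CPU usage stays low. */
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    printf("counter = %llu\n", counter);
    return 0;
}
```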

Which factors affect the speed of cpu tracing?

When I use YJP to do a CPU-tracing profile of our own product, it is really slow.
The product runs on a 16-core machine with an 8 GB heap, and I use Grinder to run a small load test (e.g. 10 Grinder threads) with about 7-10 steps during the profiling. I have a script that starts the product with the profiler, starts profiling (using the controller API) and then starts Grinder to emulate user operations. When all the operations finish, the script tells the profiler to stop profiling and save a snapshot.
During the profiling, each step in the Grinder test takes more than 1 million ms to finish. The whole profiling often takes more than 10 hours with just 10 Grinder threads, each running the test 10 times. Without the profiler, it finishes within 500 ms.
So... besides the problems with the product to be profiled, is there anything else that affects the performance of the cpu tracing process itself?
The last time I used YourKit (v7.5.11, which is pretty old; the current version is 12) it had two CPU profiling settings: sampling and tracing, the latter being much slower and more accurate. Since tracing is supposed to be more accurate I used it myself and also observed a huge slowdown, in spite of the statement that the slowdown would be "average". Yet it was far less than your results: from 2 seconds to 10 minutes. My code is a fragment of a calculation engine: virtually no I/O, no waits on anything, just reading an input, calculating and outputting the result to the console - so the whole slowdown comes from the profiler, with no external influences.
Back to your question: the option mentioned - sampling vs. tracing - will affect the performance, so you may try sampling.
Now that I think of it: YourKit can be set up so that it does things automatically, like making snapshots periodically or on low memory, profiling memory usage and object allocations; each of these measures will make profiling slower. Perhaps you should run an online session instead of a script-controlled one, to see what it really does.
According to the YourKit docs:
Although tracing provides more information, it has its drawbacks. First, it may noticeably slow down the profiled application, because the profiler executes special code on each enter to and exit from the methods being profiled. The greater the number of method invocations in the profiled application, the lower its speed when tracing is turned on.
The second drawback is that, since this mode affects the execution speed of the profiled application, the CPU times recorded in this mode may be less adequate than times recorded with sampling. Please use this mode only if you really need method invocation counts.
Also:
When sampling is used, the profiler periodically queries stacks of running threads to estimate the slowest parts of the code. No method invocation counts are available, only CPU time.
Sampling is typically the best option when your goal is to locate and discover performance bottlenecks. With sampling, the profiler adds virtually no overhead to the profiled application.
Also, it's a little confusing what the doc means by "CPU time", because it also talks about "wall-clock time".
If you are doing any I/O, waits, sleeps, or any other kind of blocking, it is important to get samples on wall-clock time, not CPU-only time, because it's dangerous to assume that blocked time is either insignificant or unavoidable.
Fortunately, that appears to be the default (though it's still a little unclear):
The default configuration for CPU sampling is to measure wall time for I/O methods and CPU time for all other methods. "Use Preconfigured Settings..." allows to choose this and other presents. (sic)
If your goal is to make the code as fast as possible, don't be concerned with invocation counts and measurement "accuracy"; do find out which lines of code are on the stack a large fraction of the time, and why.
More on all that.
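To make the tracing overhead quoted above concrete, here is an analogy outside the JVM (YourKit instruments bytecode; this is only an illustration of the same mechanism): GCC/Clang's -finstrument-functions injects a hook on every function entry and exit, just as tracing does for every method call, so the cost scales with the number of invocations.

```c
/* Illustration of entry/exit instrumentation overhead.
 * Compile with:  cc -finstrument-functions trace_demo.c */
#include <stdio.h>

static unsigned long long enter_count, exit_count;

/* Hooks called on every entry/exit of instrumented functions.
 * They must not be instrumented themselves, hence the attribute. */
void __attribute__((no_instrument_function))
__cyg_profile_func_enter(void *fn, void *call_site)
{
    (void)fn; (void)call_site;
    enter_count++;        /* a real tracer would record fn and a timestamp */
}

void __attribute__((no_instrument_function))
__cyg_profile_func_exit(void *fn, void *call_site)
{
    (void)fn; (void)call_site;
    exit_count++;
}

static int add(int a, int b) { return a + b; }

int main(void)
{
    long sum = 0;
    for (int i = 0; i < 1000000; i++)
        sum += add(i, i); /* every call pays the enter/exit hook cost */
    printf("sum=%ld, hooks fired %llu times\n", sum, enter_count + exit_count);
    return 0;
}
```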

Why does a FireMonkey application use no more than 20% of the CPU?

I have a large binary file (approximately 700 MB) which I load into a TMemoryStream. After that I perform the reading with TMemoryStream.Read() and make some simple calculations, but the application never uses more than 20% of the CPU. My PC has an i7 processor.
Is there any way to increase the CPU usage and speed up the reading process without using threads?
As far as I know, the only way to utilise the power of multiple CPU cores with Delphi is to use threads.
If you do choose to use threads in your application, there are a couple of libraries that may ease development. See: How Do I Choose Between the Various Ways to do Threading in Delphi?
Adding on to Shannon's answer: on an i7 processor with multiple cores, one thread will only be utilizing one core; one thread cannot run on more than one processor core. Therefore, if you wish to utilize multiple cores, you need to create multiple threads to handle the various tasks. Creating a thread isn't necessarily as simple as saying "do this in that thread"; there's a lot to know about multi-threading. For example, your application has one main GUI thread, then one thread might be dedicated to performing some long calculation, another thread might be updating a caption with real-time data, and so on.
Windows automatically decides which core to assign a thread to, and usually divides the work up fairly. So, if you have 8 processor cores and 16 threads, each core would get 2 threads (presumably), and since each core executes independently of the others, more than one thread can literally be running at the same time (as opposed to a single core, which divides its time slices between the threads).
So to answer your question, if you had 5 threads performing something big at the same time, then you would see 100% processor usage.
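To illustrate the idea, here is a sketch in C with pthreads (in Delphi the equivalent would be TThread descendants or a parallel-for library; process_chunk, NUM_WORKERS and the buffer size are placeholders): the buffer is split into one slice per worker thread so the calculation can occupy several cores at once.

```c
/* Split a buffer across worker threads so the work spreads over cores. */
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

#define NUM_WORKERS 8

typedef struct {
    const unsigned char *data;   /* start of this worker's slice */
    size_t               len;    /* length of the slice */
    unsigned long long   result; /* per-worker partial result */
} chunk_t;

static unsigned long long process_chunk(const unsigned char *p, size_t n)
{
    unsigned long long sum = 0;  /* stand-in for the real calculation */
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}

static void *worker(void *arg)
{
    chunk_t *c = (chunk_t *)arg;
    c->result = process_chunk(c->data, c->len);
    return NULL;
}

static unsigned long long process_parallel(const unsigned char *buf, size_t len)
{
    pthread_t threads[NUM_WORKERS];
    chunk_t   chunks[NUM_WORKERS];
    size_t    per = len / NUM_WORKERS;

    for (int i = 0; i < NUM_WORKERS; i++) {
        chunks[i].data = buf + (size_t)i * per;
        chunks[i].len  = (i == NUM_WORKERS - 1) ? len - (size_t)i * per : per;
        pthread_create(&threads[i], NULL, worker, &chunks[i]);
    }

    unsigned long long total = 0;
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_join(threads[i], NULL);
        total += chunks[i].result;
    }
    return total;
}

int main(void)
{
    static unsigned char buf[1 << 20];   /* stand-in for the loaded file */
    printf("total = %llu\n", process_parallel(buf, sizeof buf));
    return 0;
}
```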

NSThread, NSOperation or GCD for CoreMotion and accurate timing purposes?

I'm looking to do some high-precision Core Motion reading (>=100 Hz if possible) and motion analysis on the iPhone 4+, which will run continuously for the duration of the main part of the app. It's imperative that the motion response and the signals that the analysis code sends out are as lag-free as possible.
My original plan was to launch a dedicated NSThread based on the code in the metronome project as referenced here: Accurate timing in iOS, along with a protocol for motion analysers to link in and use the thread. I'm wondering whether GCD or NSOperation queues might be better?
My impression after copious reading is that they are designed to handle a quantity of discrete, one-off operations rather than a small number of operations performed over and over again on a regular interval and that using them every millisecond or so might inadvertently create a lot of thread creation/destruction overhead. Does anyone have any experience here?
I'm also wondering about the performance implications of an endless while loop in a thread (such as in the code in the above link). Does anyone know more about how things work under the hood with threads? I know that the iPhone 4 (and earlier) has a single-core processor and uses some sort of intelligent, pre-emptive multitasking which switches threads based on various timing and I/O demands to create the effect of parallelism...
If you have a thread that has a simple "while" loop running endlessly but only doing any additional work every millisecond or so, does the processor's switching algorithm consider the endless loop a "high demand" on resources thus hogging them from other threads or will it be smart enough to allocate resources more heavily towards other threads in the "downtime" between additional code execution?
Thanks in advance for the help and expertise...
IMO the bottleneck is rather the sensors. The actual update frequency is most often not equal to what you have specified. See "update frequency set for deviceMotionUpdateInterval it's the actual frequency?" and "Actual frequency of device motion updates lower than expected, but scales up with setting".
Some time ago I made a couple of measurements using Core Motion and the raw sensor data as well. I needed a high update rate too because I was doing a Simpson integration and thus wanted to minimise errors. It turned out that the real frequency is always lower and that there is a limit at about 80 Hz. It was an iPhone 4 running iOS 4. But as long as you don't need this for scientific purposes, in most cases 60-70 Hz should fit your needs anyway.
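If GCD ends up being the choice for the original question, a repeating dispatch timer source avoids both an endless while loop and per-tick thread creation, since GCD reuses its worker threads. A minimal plain-C libdispatch sketch follows (process_motion_sample is a placeholder for the Core Motion read and analysis; in a real iOS app you would drive this from the app lifecycle rather than dispatch_main()):

```c
/* Repeating ~100 Hz GCD timer on a dedicated serial queue. */
#include <dispatch/dispatch.h>

static void process_motion_sample(void *ctx)
{
    (void)ctx;
    /* read the latest device-motion data and run the analysis here */
}

int main(void)
{
    dispatch_queue_t queue =
        dispatch_queue_create("com.example.motion", DISPATCH_QUEUE_SERIAL);

    dispatch_source_t timer =
        dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, 0, 0, queue);

    /* 10 ms interval (~100 Hz) with 1 ms of allowed leeway. */
    dispatch_source_set_timer(timer,
                              dispatch_time(DISPATCH_TIME_NOW, 0),
                              10 * NSEC_PER_MSEC,
                              1 * NSEC_PER_MSEC);
    dispatch_source_set_event_handler_f(timer, process_motion_sample);
    dispatch_resume(timer);

    dispatch_main();   /* park the main thread and let the queue run */
}
```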
