I found that a process's reduction count is large in our production environment, and its message queue is not shrinking.
FYI, the reduction count was 10831243888178 and then 10838818431635 five minutes later. The message_queue_len was 1012 and then 1014 at the same two points.
I expected the messages returned from process_info(Pid) to be consumed within those 5 minutes, but they were not. Can I say that the process was blocked by some messages?
I read on the web that one reduction can be regarded as one function call, but I don't fully understand it. I'd appreciate it if someone could tell me more about "reductions".
Reductions are a way to measure the work done by a process.
Every scheduled process is given a number of reductions to spend before it is preempted, in other words before it has to let other processes execute. Calling a function spends 1 reduction, that is right, but it is not the only thing that spends them; many more reductions are spent inside that function call too.
It seems that the numbers you gave are the accumulated reductions spent by the process. A big number by itself does not mean anything. A big increase, however, means that the process is doing some hard work. If this workhorse is not consuming its message queue, there is a good chance it is stuck inside one very long, or even unending, computation.
You may try to inspect it further with process_info(Pid, current_function) or process_info(Pid, current_stacktrace).
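To put the figures from the question in perspective, here is a small arithmetic sketch (in Python, purely as a calculator; the variable names are made up for illustration). It only restates the two samples quoted above as a delta and an average rate.

```python
# Figures quoted in the question, sampled 5 minutes apart.
reductions_before = 10831243888178
reductions_after = 10838818431635
queue_before, queue_after = 1012, 1014
interval_s = 5 * 60

# Work done between the two samples, and the average rate.
reduction_delta = reductions_after - reductions_before
reductions_per_second = reduction_delta / interval_s

# Change in the message queue length over the same interval.
queue_delta = queue_after - queue_before

print(f"reductions spent in the interval: {reduction_delta}")
print(f"average rate: {reductions_per_second:,.0f} reductions/s")
print(f"queue growth: {queue_delta} messages")
```

So the process spent roughly 7.6 billion reductions in those 5 minutes while its queue grew by 2 messages, which fits the "busy but not consuming messages" picture described above.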
Related
I am tuning the performance of my Erlang project and am running into an odd scenario. Because I want to use the full power of the CPU (multi-core) in some cases, CPU usage is around 100% in my testing.
Then I found something strange. Some code must be synchronized, so I use gen_server:call for that task. The gen_server's actual work takes very little time (it reads a value from ETS), but when the CPU is busy the gen_server:call itself takes a long time. After profiling, I found that most of the time spent in gen_server:call is pending (90%+). In my case, the task runs 3000 times. The total task time is about 100 ms, but the calls to gen_server:call cost about 3 seconds.
In this case, why should I use gen_server:call if it is so expensive, just to follow the gen_server behavior? It looks like it would be better to do the task directly.
Does anyone have an idea as to why gen_server:call might be taking so long?
Thank you,
Eric
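For reference, the numbers in this question work out as follows. This is only arithmetic on the quoted figures (Python used as a calculator, variable names hypothetical); it says nothing about where the waiting time actually goes.

```python
calls = 3000
total_work_ms = 100.0    # time spent on the actual task, summed over all calls
total_call_ms = 3000.0   # time spent inside gen_server:call, summed over all calls

# Per-call figures in microseconds.
work_per_call_us = total_work_ms / calls * 1000
latency_per_call_us = total_call_ms / calls * 1000

# Share of the call time that is not the task itself.
waiting_fraction = (total_call_ms - total_work_ms) / total_call_ms

print(f"work per call:    {work_per_call_us:.1f} us")
print(f"latency per call: {latency_per_call_us:.1f} us")
print(f"waiting share:    {waiting_fraction:.1%}")
```

That is about 33 microseconds of work inside a call that takes about 1 ms, so close to 97% of each call is spent waiting, consistent with the "90%+" pending figure from the profile.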
I have profiled an app on an iPhone 4 using "Time Profiler" and "CPU Monitor" and am trying to make sense of the results.
Given that execution time is 8 minutes, CPU "Running Time" is around 2 minutes.
About 67% of that is on the main thread, out of which 52% is coming from "own code".
Now, I can see the majority of time being spent in enumerating over arrays (and associated work), UIKit operations, etc.
The problem is, how do I draw any meaningful conclusions out of this data? i.e. there is something wrong going on here that needs fixing.
I can see a lot of CPU load over that running time (median at 70%) that isn't "justifiable" given the nature of the app.
Having said that, there are some things that do stand out. Parsing HTTP responses on the main thread, creating objects eagerly (backed up by memory profiling as well).
However, what I am looking for here is offending code along with useful conclusions solely based on CPU running time. i.e. spending "too much" time here.
Update
Let me try and elaborate in order to give a better picture.
Based on the functional requirements of this app, I can't see why it shouldn't be able to run on an iPhone 3G. A median CPU usage of around 70%, with a peak of 97% only looks like a red flag on an iPhone 4.
The most obvious response to this is to investigate the code and draw conclusions from that.
What I am hoping for is a categorical answer of the following form
if you spend anywhere between 25% - 50% of your time on CA, there is something wrong with your animations
if you spend 1000ms on anything related to UIKit, better check your processing
Then again, maybe there aren't any answers, only indications of things being off when it comes to running time and CPU usage.
The answer to the question "is there something wrong going on here that needs fixing" is simple: do you see a problem while using the application? If yes (you see glitches in animation, or the app hangs for a while), you probably want to fix it. If not, you may be optimizing prematurely.
Nonetheless, parsing HTTP responses on the main thread may be a bad idea.
In dev presentations Apple have pointed out that whilst CPU usage is not an accurate indicator in the simulator, it is something to take stock of when profiling on device. Personally I would consider any thread that takes significant CPU time without good reason a problem that needs to be resolved.
Find the time sinks, prioritise by percentage, and start working through them. These may not be visible problems now but they will begin to, if they have not already, degrade the user's experience of the app and potentially the device too.
Check out their documentation on how to effectively use CPU profiling for some handy hints.
If enumeration of arrays is taking a lot of time then I would suggest that dictionaries or other more effective caches could be appropriate, assuming you can spare some memory to ease CPU.
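As a rough illustration of that trade (memory for CPU), here is a minimal sketch in Python rather than Objective-C, with a made-up Item type and made-up data. The same idea would apply to replacing repeated NSArray enumeration with an NSDictionary keyed on whatever you look up by.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: int
    name: str


# Hypothetical data set standing in for the arrays being enumerated.
items = [Item(i, f"item-{i}") for i in range(1000)]


def find_by_scan(item_id):
    """Enumerate the array on every lookup: O(n) per call."""
    for item in items:
        if item.item_id == item_id:
            return item
    return None


# Build the index once; it costs extra memory but each lookup is O(1) on average.
items_by_id = {item.item_id: item for item in items}


def find_by_index(item_id):
    """Look the item up in the prebuilt dictionary."""
    return items_by_id.get(item_id)


print(find_by_scan(742) == find_by_index(742))
```

Whether this pays off depends on how often the lookup runs compared with how often the underlying array changes, since the index has to be kept in sync.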
An effective approach may be to remove all business logic from the main thread (a given) and make a good boundary layer between the app and the parsing / business logic. From here you can better hook in some test suites that could better tell you if the code is at fault or if it's simply the significant requirements of the app UI itself...
Eight minutes?
Without beating around the bush, you want to make your application faster, right?
Forget looking at CPU load and wondering if it's the right amount.
Forget guessing if it's HTTP parsing. Maybe it is, but guessing won't tell you.
Forget rummaging around in the code timing things in hopes that you will find the problem(s).
You can find out directly why it is spending so much time.
Here's the method I use,
and here's an (amateurish) video of it.
Here's what will happen if you do that.
First you will find something you would never have guessed, and when you fix it you will lop a big chunk off that 8 minutes, like maybe down to 6 minutes.
Then you do it again, and lop off another big chunk.
You repeat until you can't find anything to fix, and then it will be much faster than your 8 minutes.
OK, now the ball is in your court.
I want to improve the running time of some code.
In order to do that, I first time all the relevant code, using code like this:
before:= rdtsc;
myobject.run;
after:= rdtsc;
Then I zoom in and time a relevant part, like so:
procedure myobject.part;
begin
StartTime:= rdtsc;
...
EndTime:= rdtsc;
inc(TotalTime, (EndTime- StartTime));
end;
I have some code to copy paste the timings into Excel, a typical outcome would look like:
(the 89.8% and 10.2% adding up to 100% is a coincidence and has nothing to do with the data or the question)
(when the data shows 1 it means 0 to avoid divide by zero errors)
Note the difference between run A and run B.
I have not changed anything yet so run A and B should give the same running time.
Further note that I know that on both runs procedure part was invoked exactly the same number of times (the data is the same and the algorithm is deterministic).
The running time of procedure part is very short (it is just called many times).
If there was some way to block out other processes during these short bursts of runtime (less than 700 CPU cycles) my timings would be much more accurate.
How do I get these timings to be more reliable?
Is there a way to monopolize the CPU to only run my task when timing and nothing else?
Note that I'm not looking for obvious answers like:
- Close other running programs
- Disable the virus scanner etc...
I've tagged the question Delphi because I'm using Delphi right now (and there may be some Delphi specific option to achieve this result).
I've also tagged it language-agnostic because there may be some more general way.
Update
Because I'm using the CPU instruction RDTSC I'm not affected by CPU throttling. If the CPU slows down, the number of cycles stays the same.
Update2
I have 2 answers, but neither answers the question...
The question is how do I prevent these changes in running time?
Do I have to run the code 20x and always compare the lowest running time out of the 20 runs?
Or do I set my program priority to realtime?
Or is there some other trick to use so my code sample does not get interrupted?
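The "run it 20x and keep the lowest time" idea mentioned above can be sketched as follows. This is Python, not Delphi, and part() is a hypothetical stand-in for the short routine; the point is only the shape of the measurement. Each run times many calls in a batch, and the minimum over the runs is kept, on the assumption that interruptions can only add time, never remove it.

```python
import timeit


def part():
    # Hypothetical stand-in for the short routine being timed.
    return sum(i * i for i in range(100))


def best_of(func, runs=20, calls_per_run=1000):
    """Time `runs` batches of `calls_per_run` calls and keep the fastest batch."""
    timings = timeit.repeat(func, repeat=runs, number=calls_per_run)
    # The minimum is the batch least disturbed by other activity on the machine.
    return min(timings) / calls_per_run, timings


best, timings = best_of(part)
print(f"best time per call: {best * 1e6:.2f} us over {len(timings)} runs")
```

This does not stop the code from being interrupted; it only makes the reported figure less sensitive to interruptions, so that run A and run B become comparable.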
I want to improve the running time of some code.
In order to do that, I first time all the relevant code, ...
OK, I'm a bit of a stuck record on this subject, but lots of people think that to improve running time requires first measuring it accurately.
Not So.
Improving running time requires finding out what's taking a large fraction of time (the exact fraction does not matter) and doing it differently or maybe not at all.
What it's doing is often not revealed by timing individual routines.
Here's the method I use,
and here's a very amateur video of it.
The problem with profiling your code like that, by sticking special statements into it, is that those special statements themselves take time to run. And since the things taking the most time are likely to be things happening in tight loops, the more they run, the more they distort your timings. What you need for good information is something that will observe your program from outside, without modifying the executing code.
In other words, you need a sampling profiler. And there just happens to be a very good one for Delphi available for free, by the rather descriptive name of Sampling Profiler. It runs your program and watches what it's doing, then correlates that against the map file (make sure to set up your project options to generate a Detailed map file) to give you an intelligible readout on what your program is spending its time on.
And if you want to narrow things down, you can use OutputDebugString to output profiling commands to make it only pay attention to specific parts of your code. It's got instructions in the help file.
I've used a lot of different methods, and this is the most useful way I've found to figure out what Delphi programs are spending their time on. And it's free. Give it a try.
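To make the idea of "observing from outside" concrete, here is a toy sampler in Python. It is not how the Delphi Sampling Profiler is implemented; it is a minimal sketch that assumes CPython (it relies on sys._current_frames, which is CPython-specific), and all function names are invented for the example. A background thread periodically looks at the stack of the thread being profiled and counts which functions it finds there, while the profiled code itself is left unmodified.

```python
import collections
import sys
import threading
import time


def profile(func, interval=0.001):
    """Run func() while a background thread samples the caller's stack."""
    counts = collections.Counter()
    samples = 0
    stop = threading.Event()
    target_id = threading.get_ident()

    def sampler():
        nonlocal samples
        while not stop.is_set():
            frame = sys._current_frames().get(target_id)
            on_stack = set()
            while frame is not None:
                on_stack.add(frame.f_code.co_name)
                frame = frame.f_back
            counts.update(on_stack)  # one hit per function present in this sample
            samples += 1
            time.sleep(interval)

    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    try:
        func()
    finally:
        stop.set()
        thread.join()
    return counts, samples


def slow_part():
    # Busy loop for about 0.2 s, standing in for the real time sink.
    deadline = time.perf_counter() + 0.2
    while time.perf_counter() < deadline:
        pass


def fast_part():
    return sum(range(1000))


def run():
    fast_part()
    slow_part()


counts, samples = profile(run)
for name in ("run", "slow_part", "fast_part"):
    print(f"{name}: on the stack in {counts[name]} of {samples} samples")
```

The fraction of samples in which a function appears approximates the fraction of time it is responsible for, and the cost of the sampling does not grow with the number of calls the program makes.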
I have a background worker in my rails project that executes a lot of complicated data aggregation in-memory in ruby. I'm seeing a strange behavior. When I boot up a process for executing the jobs (thousands), performance degrades over time. In the beginning a job takes around 300ms to complete, but after processing around 10,000 jobs the execution time will gradually have increased to around 2000ms. This is a big problem for me and I'm puzzled about how this can possibly happen. I see no memory leaks (RAM usage is pretty stable), and I see no errors. What might cause this on a low level, and where should I start looking?
Background facts:
Among the things the job does, it does a lot of regexp comparisons of a lot of strings. There are no external database calls made except for read/write operations to a redis instance.
I have tried to execute the same on different servers/computers, and the symptoms are all the same.
If I restart the process when it starts to perform too badly, the performance is good again immediately afterwards.
I'm running ruby 1.9.3p194, rails 3.2, and sidekiq 2.9.0 as the job processor.
It is difficult to tell from the limited description of your service, but the behaviour is consistent with a small (i.e. not leaky) cache of data that either has poor lookup performance, or that you are relying on very heavily, and that is growing at just a modest rate. A contrived example might be a list of "jobs done so far by this worker" which is being sorted on demand at a few points in the code.
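The contrived example can be made concrete with a small sketch (in Python rather than Ruby, with invented names, and scaled down to 2,000 jobs so it runs quickly). Each job checks a "jobs done so far" list with a linear scan before adding itself, and the sketch counts how many elements each scan touches instead of timing it.

```python
def scan_cost(done, job_id):
    """Linear scan of the 'done' list; returns how many elements were examined."""
    steps = 0
    for seen in done:
        steps += 1
        if seen == job_id:
            break
    return steps


done = []    # small in memory, but consulted on every job
costs = []   # elements examined per job

for job_id in range(2_000):
    costs.append(scan_cost(done, job_id))
    done.append(job_id)

print(f"first job examined {costs[0]} entries, last job examined {costs[-1]}")
```

The list holds only a few thousand integers, so it would not register as a memory leak, yet the work per job grows steadily with the number of jobs already processed, and a restart resets it to zero.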
One such cache is out of your direct control: Ruby's symbol table. Finding a Symbol is something like O(log(n)) on number of symbols in the system, which is good. But this could still impact you if you handle a lot of symbols, and each iteration of your worker can generate new symbols (for instance if keys in an input hash can be arbitrary data, and you use a symbolize_keys method or call to_sym on a lot of varying strings). Symbols are cached permanently in the Ruby process. In theory a few million would not show up as a memory leak. But if your code can go from say 10,000 symbols to 1,000,000 in total, all the symbol generating and checking code would slow down by a small fixed amount. If you are doing that a lot, it could potentially explain a few hundred ms.
If hunting through suspect code is getting you nowhere, your best bet to find the problem is to use a profiler. You should collect a profile of the code behaving well, and behaving badly, and compare the two.
When I use YJP to do a CPU-tracing profile of our own product, it is really slow.
The product runs in a 16 core machine with 8GB heap, and I use grinder to run a small load test (e.g. 10 grinder threads) which have about 7~10 steps during the profiling. I have a script to start the product with profiler, start profiling (using controller api) and then start grinder to emulate user operations. When all the operations finish, the script tells the profiler to stop profiling and save snapshot.
During the profiling, for each step in the grinder test, it takes more than 1 million ms to finish. The whole profiling often takes more than 10 hours with just 10 grinder threads, and each runs the test 10 times. Without profiler, it finishes within 500 ms.
So... besides the problems with the product to be profiled, is there anything else that affects the performance of the cpu tracing process itself?
Last I used YourKit (v7.5.11, which is pretty old; the current version is 12) it had two CPU profiling settings: sampling and tracing, the former being much faster and less accurate. Since tracing is supposed to be more accurate I used it myself and also observed a huge slowdown, in spite of the statement that the slowdown was "average". Yet it was far less than in your results: from 2 seconds to 10 minutes. My code is a fragment of a calculation engine, with virtually no IO and no waits on anything, just reading an input, calculating, and writing the result to the console, so the whole slowdown comes from the profiler, with no external influences.
Back to your question: the option mentioned, sampling vs tracing, will affect the performance, so you may try sampling.
Now that I think of it: YourKit can be set up so that it does things automatically, like taking snapshots periodically or on low memory, profiling memory usage, or recording object allocations. Each of these measures will make profiling slower. Perhaps you should run an online session instead of a script-controlled one, to see what it really does.
According to some Yourkit Doc:
Although tracing provides more information, it has its drawbacks.
First, it may noticeably slow down the profiled application, because
the profiler executes special code on each enter to and exit from the
methods being profiled. The greater the number of method invocations
in the profiled application, the lower its speed when tracing is
turned on.
The second drawback is that, since this mode affects the execution
speed of the profiled application, the CPU times recorded in this mode
may be less adequate than times recorded with sampling. Please use
this mode only if you really need method invocation counts.
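The "special code on each enter to and exit from the methods" can be illustrated with a toy tracer. This is Python's sys.setprofile hook, not YourKit's mechanism, and the function names are invented; it is only meant to show why tracing yields exact invocation counts and why its cost scales with the number of calls.

```python
import sys
from collections import Counter

call_counts = Counter()


def tracer(frame, event, arg):
    # Invoked by the interpreter on every function call and return.
    if event == "call":
        call_counts[frame.f_code.co_name] += 1


def leaf(x):
    return x + 1


def workload(n):
    total = 0
    for _ in range(n):
        total = leaf(total)
    return total


sys.setprofile(tracer)
try:
    result = workload(10_000)
finally:
    sys.setprofile(None)

print(f"leaf was called {call_counts['leaf']} times, result {result}")
```

Here the hook runs tens of thousands of times for a workload that does almost nothing per call, which is the kind of situation where the quoted slowdown becomes most visible.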
Also:
When sampling is used, the profiler periodically queries stacks of
running threads to estimate the slowest parts of the code. No method
invocation counts are available, only CPU time.
Sampling is typically the best option when your goal is to locate and
discover performance bottlenecks. With sampling, the profiler adds
virtually no overhead to the profiled application.
Also, it's a little confusing what the doc means by "CPU time", because it also talks about "wall-clock time".
If you are doing any I/O, waits, sleeps, or any other kind of blocking, it is important to get samples on wall-clock time, not CPU-only time, because it's dangerous to assume that blocked time is either insignificant or unavoidable.
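The difference between the two clocks is easy to see in a small sketch. It uses Python's standard timers (time.perf_counter for wall-clock time, time.process_time for CPU time), and the sleep is a stand-in for I/O or any other blocking call.

```python
import time


def blocked_work():
    # Stand-in for I/O, a lock wait, or any other blocking call.
    time.sleep(0.2)


wall_start = time.perf_counter()   # wall-clock time
cpu_start = time.process_time()    # CPU time used by this process

blocked_work()

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

print(f"wall-clock: {wall_elapsed:.3f} s, CPU: {cpu_elapsed:.3f} s")
```

A profiler that looks only at CPU time would report this function as nearly free, even though the user waits the full 0.2 seconds for it.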
Fortunately, that appears to be the default (though it's still a little unclear):
The default configuration for CPU sampling is to measure wall time for
I/O methods and CPU time for all other methods.
"Use Preconfigured Settings..." allows to choose this and other
presents. (sic)
If your goal is to make the code as fast as possible, don't be concerned with invocation counts and measurement "accuracy"; do find out which lines of code are on the stack a large fraction of the time, and why.
More on all that.