While looking at traces of a slow response (3.1 s) from a generally fast API (50 ms average, 200 ms at the 99th percentile), I found this weird behaviour: there is no recorded execution for almost 2.9 s after the before_action callback.
Any idea why this would be happening?
I expected a trace of execution covering the whole 3.1 seconds, but for the majority of that time the process seems to be waiting for CPU.
This is a follow-up to this question about using the DX11VideoRenderer sample (a replacement for the EVR that uses DirectX 11 instead of the EVR's DirectX 9).
I've been trying to track down why it uses so much more CPU than the EVR. Task Manager shows me that most of that time is kernel mode.
Using profiling tools, I see that a LOT of time is being spent in numerous calls to NtDelayExecution (aka Sleep). How many calls? ~100,000 over the course of ~12 seconds. Ok, yeah, I'm sending a lot of frames in those 12 seconds, but that's still a lot of calls, every one of which requires a kernel mode transition.
The call stack shows the last call in "my" code is to IDXGISwapChain1::Present(0, 0). The actual call seems to be Sleep(0) and comes from nvwgf2umx.dll (which is why this question is tagged NVidia: hopefully someone there can call up the code and see what the logic is behind such frequent calls).
I couldn't quite figure out why it would need to do /any/ Sleeping during Present. It's not like we wait for vertical retrace anymore, is it? But the other reason to use Sleep has to do with yielding to other threads. Which led me to a serious clue:
If I use D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS, the CPU utilization drops. Along with some other fixes, the DX11 version is now faster and uses less CPU time than the DX9 version (which is what I would hope/expect). Profiling shows that Sleep has dropped from >30% to <1%.
Unfortunately, this page tells me:
This flag is not recommended for general use.
Oh.
So, any ideas on how to get decent performance without using debug flags?
I'm tuning the performance of my Erlang project and have run into an odd scenario. Because I want to use the full power of the CPU (multi-core), in some cases CPU usage reaches around 100% in my testing.
Then I noticed something strange. Some code must be synchronized, so I use gen_server:call for that task. The gen_server's actual work is very quick (reading a value from ETS), but when the CPU is busy, gen_server:call takes a long time to return. Profiling shows that 90%+ of the time spent in gen_server:call is pending. In my case the task runs 3000 times; the total task time is about 100 ms, but the gen_server:call invocations cost about 3 seconds.
In this case, why should I use gen_server:call if it is this expensive just to get the gen_server behaviour? It looks like it would be better to just do the task directly.
Does anyone have an idea as to why gen_server:call might be taking so long?
Thank you,
Eric
I have an iPad program designed to torture the thread-creation process, which fails unexpectedly after a few thousand iterations of a process like this:
pthread_create [a thread that does nothing and exits very quickly]
sleep(100);
Eventually, pthread_create fails with error 35 "too many threads"
The failure occurs, eventually, no matter how long the sleep interval.
Is there some arbitrary limit to threads over time, or is there some
resource I'm likely to be consuming without realizing it?
You need to either call pthread_join from another thread to reclaim the resources, or call pthread_detach to indicate that the resources should be reclaimed automatically. Like Unix processes, a thread has a result that needs to be reaped before it can completely go away.
I have a background worker in my Rails project that executes a lot of complicated data aggregation in-memory in Ruby. I'm seeing strange behaviour: when I boot up a process for executing the jobs (thousands of them), performance decreases over time. In the beginning a job completes in around 300 ms, but after processing around 10,000 jobs the execution time has gradually risen to around 2000 ms. This is a big problem for me and I'm puzzled about how it can happen. I see no memory leaks (RAM usage is pretty stable), and I see no errors. What might cause this at a low level, and where should I start looking?
Background facts:
Among other things, the job does a lot of regexp comparisons on a lot of strings. There are no external database calls made except for read/write operations to a Redis instance.
I have tried to execute the same on different servers/computers, and the symptoms are all the same.
If I restart the process when it starts to perform too badly, performance is good again immediately afterwards.
I'm running Ruby 1.9.3p194, Rails 3.2, and Sidekiq 2.9.0 as the job processor.
It is difficult to tell from the limited description of your service, but the behaviour is consistent with a small (i.e. not leaky) cache of data that either has poor lookup performance, or that you are relying on very heavily, and that is growing at just a modest rate. A contrived example might be a list of "jobs done so far by this worker" which is being sorted on demand at a few points in the code.
One such cache is out of your direct control: Ruby's symbol table. Finding a Symbol is something like O(log(n)) in the number of symbols in the system, which is good. But this could still impact you if you handle a lot of symbols and each iteration of your worker can generate new ones (for instance, if keys in an input hash can be arbitrary data and you use a symbolize_keys method or call to_sym on a lot of varying strings). Symbols are cached permanently in the Ruby process. In theory a few million would not show up as a memory leak. But if your code can go from, say, 10,000 symbols to 1,000,000 in total, all the symbol-generating and symbol-checking code would slow down by a small fixed amount. If you are doing that a lot, it could potentially explain a few hundred ms.
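A quick way to see this effect for yourself (the key names here are made up for illustration) is to watch Symbol.all_symbols grow as varying strings get interned:

```ruby
# Count interned symbols before and after converting many distinct strings.
before = Symbol.all_symbols.size

# Simulate a worker that symbolizes arbitrary input keys on every job.
1_000.times { |i| "dynamic_key_#{i}".to_sym }

after = Symbol.all_symbols.size
puts "symbol table grew by #{after - before}"  # roughly 1000 new entries
```

On Ruby 1.9.3 (the version in the question) those symbols are never collected, so the table only ever grows; Ruby 2.2+ can garbage-collect dynamically created symbols, which removes this particular failure mode.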
If hunting through suspect code is getting you nowhere, your best bet to find the problem is to use a profiler. You should collect a profile of the code behaving well, and behaving badly, and compare the two.
When I use YJP to do a CPU-tracing profile of our own product, it is really slow.
The product runs on a 16-core machine with an 8 GB heap, and I use Grinder to run a small load test (e.g. 10 Grinder threads) with about 7-10 steps during the profiling. I have a script that starts the product with the profiler attached, starts profiling (using the controller API), and then starts Grinder to emulate user operations. When all the operations finish, the script tells the profiler to stop profiling and save a snapshot.
During profiling, each step in the Grinder test takes more than 1 million ms to finish. The whole profiling run often takes more than 10 hours with just 10 Grinder threads, each running the test 10 times. Without the profiler, it finishes within 500 ms.
So... besides the problems with the product to be profiled, is there anything else that affects the performance of the cpu tracing process itself?
The last time I used YourKit (v7.5.11, which is pretty old; the current version is 12) it had two CPU profiling settings: sampling and tracing, the former being much faster and less accurate. Since tracing is supposed to be more accurate, I used it myself and also observed a huge slowdown, in spite of the statement that the slowdown would be "average". Yet it was far less than your results: from 2 seconds to 10 minutes. My code is a fragment of a calculation engine: virtually no I/O, no waits on anything, just reading input, calculating, and printing the result to the console - so the whole slowdown comes from the profiler, with no external influences.
Back to your question: the option mentioned - sampling vs. tracing - will affect the performance, so you may try sampling.
Now that I think of it: YourKit can be set up to do things automatically, like taking snapshots periodically or on low memory, profiling memory usage, or tracking object allocations; each of these will slow profiling down further. Perhaps you should run an online session instead of a script-controlled one, to see what it really does.
According to the YourKit docs:
Although tracing provides more information, it has its drawbacks.
First, it may noticeably slow down the profiled application, because
the profiler executes special code on each enter to and exit from the
methods being profiled. The greater the number of method invocations
in the profiled application, the lower its speed when tracing is
turned on.
The second drawback is that, since this mode affects the execution
speed of the profiled application, the CPU times recorded in this mode
may be less adequate than times recorded with sampling. Please use
this mode only if you really need method invocation counts.
Also:
When sampling is used, the profiler periodically queries stacks of
running threads to estimate the slowest parts of the code. No method
invocation counts are available, only CPU time.
Sampling is typically the best option when your goal is to locate and
discover performance bottlenecks. With sampling, the profiler adds
virtually no overhead to the profiled application.
Also, it's a little confusing what the doc means by "CPU time", because it also talks about "wall-clock time".
If you are doing any I/O, waits, sleeps, or any other kind of blocking, it is important to get samples on wall-clock time, not CPU-only time, because it's dangerous to assume that blocked time is either insignificant or unavoidable.
Fortunately, that appears to be the default (though it's still a little unclear):
The default configuration for CPU sampling is to measure wall time for
I/O methods and CPU time for all other methods.
"Use Preconfigured Settings..." allows to choose this and other
presents. (sic)
If your goal is to make the code as fast as possible, don't be concerned with invocation counts and measurement "accuracy"; do find out which lines of code are on the stack a large fraction of the time, and why.
More on all that.
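The "which lines are on the stack a large fraction of the time" question can even be answered by hand, without a profiler. This is an illustrative sketch of that idea - not YourKit's API - using plain Thread.getStackTrace to take wall-clock samples and tally top-of-stack frames:

```java
import java.util.HashMap;
import java.util.Map;

// "Poor man's sampler": take periodic wall-clock samples of a thread's
// stack and count how often each top frame appears. Frames that dominate
// the counts are where the thread spends its wall-clock time, including
// blocked time that CPU-only measurement would hide.
public class StackSampler {

    // Take n samples of target's stack, ~10 ms apart.
    static Map<String, Integer> sampleTopFrames(Thread target, int n)
            throws InterruptedException {
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < n; i++) {
            StackTraceElement[] stack = target.getStackTrace();
            if (stack.length > 0) {
                hits.merge(stack[0].toString(), 1, Integer::sum);
            }
            Thread.sleep(10);
        }
        return hits;
    }

    public static void main(String[] args) throws InterruptedException {
        // A worker that spends nearly all its time blocked in sleep().
        Thread worker = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try { Thread.sleep(50); } catch (InterruptedException e) { return; }
            }
        }, "worker");
        worker.start();

        Map<String, Integer> hits = sampleTopFrames(worker, 20);
        worker.interrupt();
        worker.join();

        // A sleep frame should dominate: that is where the worker's
        // wall-clock time actually goes.
        System.out.println(hits);
    }
}
```

Even 10-20 such samples are usually enough to spot a line that is on the stack most of the time, which is exactly the signal you need to make the code faster.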