I am running into the following issue while profiling an application under VC6. When I profile the application, the profiler is indicating that a simple getter method similar to the following is being called many hundreds of thousands of times:
int SomeClass::getId() const
{
    return m_iId;
}
The problem is, this method is not called anywhere in the test app. When I change the code to the following:
int SomeClass::getId() const
{
    std::cout << "Is this method REALLY being called?" << std::endl;
    return m_iId;
}
The profiler never includes getId in the list of invoked functions. Comment out the cout and I'm right back to where I started, 130+ thousand calls! Just to be sure it wasn't some cached profiler data or corrupted function lookup table, I'm doing a clean and rebuild between each test. Still the same results!
Any ideas?
I'd guess that what's happening is that the compiler and/or the linker is 'coalescing' this very simple function with one or more other functions whose generated code is identical (the code generated for return m_iId is likely exactly the same as that of many other getters that happen to return a member at the same offset).
Essentially, a bunch of different functions that happen to have identical machine-code implementations are all resolved to the same address, which confuses the profiler.
You may be able to stop this from happening (if this is indeed the problem) by turning off optimizations, or by disabling the linker's identical-function folding if your toolchain exposes a switch for it.
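As a rough illustration of the idea (the class names here are made up, and the exact behavior depends on compiler and linker settings), two unrelated getters can compile to byte-for-byte identical code:

class Foo { int m_iId;  public: int getId() const  { return m_iId; } };
class Bar { int m_iAge; public: int getAge() const { return m_iAge; } };
// Both bodies are "load the int at offset 0 of 'this' and return it".
// If the linker folds identical functions, both symbols end up at the same
// address, and the profiler attributes every hit at that address to
// whichever symbol it happens to pick, e.g. SomeClass::getId.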
I assume you are profiling because you want to find out if there are ways to make the program take less time, right? You're not just profiling because you like to see numbers.
There's a simple, old-fashioned, tried-and-true way to find performance problems. While the program is running, just hit the "pause" button and look at the call stack. Do this several times, like from 5 to 20 times. The bigger a problem is, the fewer samples you need to find it.
Some people ask if this isn't basically what profilers do, and the answer is that only very few of them work this way. Most profilers fall for one or more common myths, with the result that your speedup is limited because they don't find all the problems:
Some programs are spending unnecessary time in "hotspots". When that is the case, you will see that the code at the "end" of the stack (where the program counter is) is doing needless work.
Some programs do more I/O than necessary. If so, you will see that they are in the process of doing that I/O.
Large programs are often slow because their call trees are needlessly bushy, and need pruning. If so, you will see the unnecessary function calls mid-stack.
Any code you see on some percentage of stacks will, if removed, save that percentage of execution time (more or less). You can't go wrong. Here's an example, over several iterations, of saving over 97%.
I have been struggling with parallel and async constructs in F# for the last couple days and not sure where to go at this point. I have been programming with F# for about 4 months - certainly no expert - and I currently have a series of calculations that are implemented in F# (asp.net 4.5) and are working correctly when executed sequentially. I am running the calculations on a multi-core server and since there are millions of inputs to perform the same calculation on, I am hoping to take advantage of parallelism to speed it up.
The calculations are extremely data parallel - basically the exact same calculation on different input data. I have tried a number of different avenues and I continually run into the same issue - it seems as if the parallel looping never gets to the end of the input data set. I have tried the TPL, ConcurrentQueues, Array.Parallel.map/iter, all with the same result: the program starts out fine and then somewhere in the middle (indeterminate) it just hangs and never completes. For simplicity I actually removed the calculation from the program and I am just calling a print method. Here is where the code currently stands:
let runParallel =
    let ids = query { for c in db.CustTable do select c.id } |> Seq.take 5
    let customerInputArray = getAllObservations ids
    Array.Parallel.iter (fun c -> testParallel c) customerInputArray
    let key = System.Console.ReadKey()
    0
A few points...
I limited the results above to only 5 just for debugging. The actual program does not apply the Take(5).
The testParallel method is just a printfn "test".
The customerInputArray is a complex data type. It is a tuple of lists that contain records. So I am pretty sure my problem must be there...but I added exception handling and no exception is getting raised, so I have no idea how to go about finding the problem.
Any help is appreciated. Thanks in advance.
EDIT: Thanks for the advice...I think it is definitely a deadlock. When I remove all of the printfn, sprintfn, and string concat operations, it completes. (Of course, I need those things in there.)
Are printfn, sprintfn, and string operations not thread-safe?
Another EDIT: Iteration always stops on the last item. So if my input array has 15 items, the processing stops on item 14, or seems to never get to item 15. Then everything just hangs. It does not matter what the size of the input array is. Any ideas what could be causing this? I even switched over to Parallel.ForEach (instead of Array.Parallel) and saw the same behavior.
Update on the situation and how I resolved this issue.
I was unable to upload code from my example due to my company's firewall policy, so in the end my question did not have enough details. I failed to mention that I was using a type provider which was important information in this situation. But here is what I figured out.
I am using the F# type provider for SQL Server and was passing around its ServiceTypes, which I suspect are not thread-safe. When I replaced the ServiceTypes with plain old F# records, the code worked fine - no more deadlocks and everything completed without error.
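For anyone hitting the same thing, here is a minimal sketch of what the fix looked like. The Customer record, its fields, and processCustomer are made-up names; the idea is just to copy the provider-backed rows into plain immutable records on one thread before going parallel:

type Customer = { Id : int; Name : string }

let customers =
    query { for c in db.CustTable do select c }
    |> Seq.map (fun c -> { Id = c.id; Name = c.name })  // copy out of the provider's ServiceTypes
    |> Seq.toArray                                       // materialize sequentially, before the parallel step

Array.Parallel.iter (fun c -> processCustomer c) customers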
I'm writing a package which makes heavy use of buffers internally for temporary storage. I have a single global (but not exported) byte slice which I start with 1024 elements and grow by doubling as needed.
However, it's very possible that a user of my package would use it in a way that causes a large buffer to be allocated and then stop using the package, wasting a large amount of allocated heap space, and I would have no way of knowing whether to free the buffer (or, since this is Go, to drop the reference so it can be GC'd).
I've thought of three possible solutions, none of which is ideal. My question is: are any of these solutions, or maybe ones I haven't thought of, standard practice in situations like this? Is there any standard practice? Any other ideas?
Screw it.
Oh well. It's too hard to deal with this, and leaving allocated memory lying around isn't so bad.
The problem with this approach is obvious: it doesn't solve the problem.
Exported "I'm done" or "Shrink internal memory usage" function.
Export a function which the user can call (and calling it intelligently is obviously up to them) which will free the internal storage used by the package.
The problem with this approach is twofold. First, it makes for a more complex, less clean interface to the user. Second, it may not be possible or practical for the user to know when calling such a function is wise, so it may be useless anyway.
Run a goroutine which frees the buffer after a certain period of the package going unused, or which shrinks the buffer (perhaps halving the length) whenever its size hasn't been increased in a while.
The problem with this approach is primarily that it puts unnecessary strain on the scheduler. Obviously a single goroutine isn't so bad, but if this were accepted practice, it wouldn't scale well if every package you imported were doing this under the hood. Also, if you have a time-sensitive application, you may not want code running when you're not aware of it (that is, you may assume that the package isn't doing any work when its functions are not being called - a reasonable assumption, I'd say).
So... any ideas?
NOTE: You can see the existing project here (the relevant code is only a few tens of lines).
A common approach to this is letting the client pass an existing []byte (or whatever) as an argument to some call/function/method. For example:
// The returned slice may be a sub-slice of dst if dst was large enough
// to hold the entire encoded block. Otherwise, a newly allocated slice
// will be returned. It is valid to pass a nil dst.
func Foo(dst []byte, whatever Bar) (ret []byte, err error)
(Example)
Another approach is to get a new []byte from, for example, a cache and/or a pool (if you prefer the latter name for that concept) and rely on clients to return used buffers to that "recycle bin".
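In current Go, sync.Pool is a ready-made version of that recycle bin. A minimal sketch (mypkg, getBuf and putBuf are placeholder names, not anything from the question's package):

package mypkg

import "sync"

// bufPool hands out reusable byte slices; anything left in the pool is
// eligible for garbage collection, unlike a permanent global buffer.
var bufPool = sync.Pool{
    New: func() interface{} { return make([]byte, 0, 1024) },
}

func getBuf() []byte  { return bufPool.Get().([]byte)[:0] }
func putBuf(b []byte) { bufPool.Put(b) }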
BTW: You're doing it right by thinking about this. Where it's possible to reasonably reuse []byte buffers, there's potential for lowering the GC load and thus making your program perform better. Sometimes the difference can be critical.
You could reslice your buffer at the end of every operation.
buffer = buffer[:0]
Then your function extendAndSliceBuffer would have the original backing array most likely available if it needs to grow. If not, you would suffer a new allocation, which you might get anyway when you do extendAndSliceBuffer.
Overall, I think a cleaner solution is to do like #jnml said and let the users pass their own buffer if they care about performance. If they don't care about performance, then you should not use a global var and simply allocate the buffer as you need and let it go when it gets out of scope.
I have a single global (but not exported) byte slice which I start with 1024 elements and grow by doubling as needed.
And there's your problem. You shouldn't have a global like this in your package.
Generally the best approach is to have an exported struct type with methods attached. The buffer should reside in this struct, unexported. That way the user can instantiate it and let the garbage collector clean it up when they let go of it.
You also want to avoid requiring globals like this as it can hamper unit tests. A unit test should be able to instantiate the exported struct, as the user can, and do it each time for every test.
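A minimal sketch of that shape (assuming the same kind of hypothetical package as above; Encoder and Encode are placeholder names, not your package's real API):

package mypkg

// Each Encoder owns its own unexported buffer; when the caller drops the
// Encoder, the buffer becomes garbage along with it.
type Encoder struct {
    buf []byte
}

func (e *Encoder) Encode(p []byte) []byte {
    e.buf = e.buf[:0]           // reuse the backing array across calls
    e.buf = append(e.buf, p...) // stand-in for the real encoding work
    return e.buf
}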
Also, depending on what kind of buffer you need, bytes.Buffer may be useful, as it already implements io.Reader and io.Writer. bytes.Buffer also automatically grows and shrinks its buffer. In buffer.go you'll see various calls to b.Truncate(0) that do the shrinking, with the comment "reset to recover space".
It's generally really really bad form to write Go code that is not thread-safe. If two different goroutines call functions that modify the buffer at the same time, who knows what state the buffer will be in when they finish? Just let the user provide a scratch-space buffer if they decide that the allocation performance is a bottleneck.
I've tried every possible field but cannot find the number of times functions are called.
Besides, I don't get Self and # Self. What do these two numbers mean?
There are several other ways to accomplish this. One is obviously to create a static hit counter and an NSLog line that emits and increments it. This is intrusive, though, and I found a way to do it with lldb.
1. Set a breakpoint.
2. Execute the program until you hit the breakpoint the first time and note the breakpoint number on the right-hand side of the line you hit (e.g. "Thread 1: breakpoint 7.1"; note the 7.1).
3. Context-click on the breakpoint and choose "Edit Breakpoint".
4. Leave the condition blank and choose "Add Action".
5. Choose "Debugger Command".
6. In the command box, enter "breakpoint list 7.1" (using the breakpoint number for your breakpoint from step 2). I believe you can use "info break" if you are using gdb.
7. Check the option "Automatically continue after evaluating".
8. Continue.
Now, instead of stopping, lldb will emit info about the breakpoint, including the number of times it has been passed.
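If you'd rather do the same thing from the lldb console instead of the breakpoint editor, the equivalent setup looks roughly like this (the method name is a placeholder, and I haven't verified it on every Xcode version):

(lldb) breakpoint set --name "-[MyClass doWork]"
(lldb) breakpoint command add 1
> breakpoint list 1
> continue
> DONE

Each time the breakpoint is passed, lldb prints the breakpoint info (including its hit count) and then continues.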
As for the discussion between Glenn and Mike on the previous answer, I'll describe a performance problem where function execution count was useful: I had a particular action in my app where performance degraded considerably with each execution of the action. The Instruments time profiler showed that each time the action was executed, a particular function was taking twice as long as the time before until quickly the app would hang if the action was performed repeatedly. With the count, I was able to determine that with each execution, the function was called twice as many times as it was during the previous execution. It was then pretty easy to look for the reason, which turned out to be that someone was re-registering for a notification in NotificationCenter on each event execution. This had the effect of doubling the number of response handler calls on each execution and thus doubling the "cost" of the function each time. Knowing that it was doubling because it was called twice as many times and not because the performance was just getting worse caused me to look at the calling sequence rather than for reasons the function itself could be degrading over time.
While it's interesting, knowing the number of times functions are called doesn't tell you anything about how much time is spent in them, which is what Time Profiler is all about. In fact, since it works by sampling, it cannot tell you how many times a function was called.
It seems you cannot use Time Profiler for counting function calls. This question seems to address potential methods for counting.
With respect to Self and # Self:
Self is "The number of times the symbol calls itself." according to the Apple Docs on the Time Profiler.
From the way the numbers look, though, it seems Self is the summed duration of the samples that had this symbol at the bottom of their stack trace. That would make:
# Self: the number of samples where this symbol was at the bottom of the stack trace.
% Self: the percentage of Self samples relative to the total samples of the currently displayed call tree (i.e. # Self / total samples).
So this wouldn't tell you how many times a method was called. But it would give you an idea how much time is spent in a method or lower in the call tree.
NOTE: I too am unsure about the various 'self' meanings though. Would love to see someone answer this authoritatively. Arrived here searching for that...
If your objective is to find out what you need to fix to make the program as fast as possible, then the number of calls and self time may be interesting, but they are irrelevant.
Look at my answer to this question, in particular points 6 and 8.
EDIT: To clarify the point further, picture the timeline of execution of the program. Some of that time (in this case about 50%) is spent in an activity that can be removed, if you know what it is, such as needless buried I/O, excessive calls to new, runaway notifications, or "insignificant" data validation. If a random-time sample is taken, it has a 50% chance of occurring in that activity, and an examination of the call stack and/or program variables shows that it is doing something that can be removed. Then, if 10 such samples are taken, the activity will be seen on roughly 5 of them, regardless of whether the activity occurs in a few large chunks of time or many small ones. The activity may be a few lines of code in a function doing something unnecessary, or it may be something much more generalized. Regardless, you recognize it, fix it, and get roughly a factor of 2 speedup. Call counts and self time contribute nothing to this process.
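To put rough numbers on that: with p = 0.5 and n = 10 samples, the expected number of samples landing in the removable activity is n x p = 5, and the probability of missing it entirely is (1 - 0.5)^10 ≈ 0.001, or about 0.1%, so a handful of samples is almost guaranteed to expose it.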
I have a somewhat similar question as:
Mathematica running out of memory
I am interested in something like this:
ParallelTable[F[i], {i, 0, 14.9, 0.001}]
where F[i] is a complicated numerical integral (I haven't yet found an easy way to reproduce the problem without page-filling definitions for the integral).
My problem is that the subkernels blow up in memory, and I have to stop the evaluation if I don't want the machine to start swapping.
But even after I have stopped the evaluation, the kernels won't free their occupied memory. I have tried
ClearSystemCache[]
and even
ParallelEvaluate[ClearSystemCache[]]
but
ParallelEvaluate[MemoryInUse[]]
stays at
{823185944, 833146832, 812429208, 840150336, 850057024, 834441704,
847068768, 850424224}
It seems that all memory control only works for the main kernel?
So far the only way I have found is to shut down all the kernels and launch them again.
I really hope there are some solutions out there...
Thanks a lot.
Memory control works for the kernel where control expressions involving such functions as MemoryConstrained, MemoryInUse, Clear, Unset, Remove, $HistoryLength, ClearSystemCache etc. are evaluated. It seems that in your case the source of the memory leaks is not due to Mathematica's internal caching mechanism (thanks for the link, BTW!).
Have you tried evaluating $HistoryLength=0; in all subkernels before using them for computations? If you have not yet, I strongly recommend trying it.
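For example, evaluating this in the main kernel before the parallel run pushes the setting to every subkernel (a one-line sketch using only functions already mentioned above):

ParallelEvaluate[$HistoryLength = 0];
ParallelEvaluate[MemoryInUse[]] (* check whether subkernel memory now stays bounded *)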
Since you are working with numerical integration functions, I also suggest trying to optimize how you use them. For example, if you do numerical integration with NDSolve and need only a limited set of calculated points (or even a single point), you should use the form NDSolve[eqns,y,{x,x_needed_min,x_needed_max}] (or even NDSolve[eqns,y,{x,x_max,x_max}]) instead of NDSolve[eqns,y,{x,x_min,x_max}] or NDSolve[eqns,y,{x,0,x_max}]. This can dramatically reduce memory usage in some cases! You can also use EventLocator for memory control.
I was (am?) having the exact same problem, almost word for word. I just had some good luck with adding the following option to the problem integral:
Method -> {"GlobalAdaptive", "SymbolicProcessing" -> False}
You can probably choose any other method if you'd like, but I had success with this within the last few minutes. Also, a lot of nasty inconsistencies I used to be getting are gone, and integration proceeds MUCH faster.
I've been writing some scripts for a game, the scripts are written in Lua. One of the requirements the game has is that the Update method in your lua script (which is called every frame) may take no longer than about 2-3 milliseconds to run, if it does the game just hangs.
I solved this problem with coroutines, all I have to do is call Multitasking.RunTask(SomeFunction) and then the task runs as a coroutine, I then have to scatter Multitasking.Yield() throughout my code, which checks how long the task has been running for, and if it's over 2 ms it pauses the task and resumes it next frame. This is ok, except that I have to scatter Multitasking.Yield() everywhere throughout my code, and it's a real mess.
Ideally, my code would automatically yield when it's been running too long. So, Is it possible to take a Lua function as an argument, and then execute it line by line (maybe interpreting Lua inside Lua, which I know is possible, but I doubt it's possible if all you have is a function pointer)? In this way I could automatically check the runtime and yield if necessary between every single line.
EDIT:: To be clear, I'm modding a game, that means I only have access to Lua. No C++ tricks allowed.
Check lua_sethook in the Debug Interface.
I haven't actually tried this solution myself yet, so I don't know for sure how well it will work.
debug.sethook(coroutine.yield,"",10000);
I picked the number arbitrarily; it will have to be tweaked until it's roughly the time limit you need. Keep in mind that time spent in C functions etc will not increase the instruction count value, so a loop will reach this limit far faster than calls to long-running C functions. It may be viable to set a far lower value and instead provide a function that sees how much os.clock() or similar has increased.
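If that works in your game's Lua runtime (yielding from a count hook requires Lua 5.2 or later, and I haven't been able to verify what the game embeds), a sketch of wrapping a task so it yields automatically might look like this. The onUpdate framing and the 10000-instruction budget are placeholders to tune:

-- Wrap a function in a coroutine whose count hook yields every N instructions.
local function makeAutoYieldingTask(fn, instructionBudget)
    return coroutine.create(function(...)
        -- The hook is set inside the coroutine, so it suspends the task,
        -- not the game's main loop.
        debug.sethook(coroutine.yield, "", instructionBudget or 10000)
        return fn(...)
    end)
end

local task = makeAutoYieldingTask(SomeFunction)

-- Call this once per frame: each resume runs the task until its hook yields.
local function onUpdate()
    if coroutine.status(task) ~= "dead" then
        local ok, err = coroutine.resume(task)
        if not ok then error(err) end
    end
end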