Tracking memory allocation in Clojure

In my program, all the state is held in a giant map inside an atom, which is updated by a load of pure functions on each iteration. I have determined that the heap size is increasing; how do I find the code that's responsible? I tried VisualVM, but it only gives generic information, and I can't tell which part of my state is growing or which function is causing it to grow.

Look for common gotchas like forgetting to use with-open, hanging onto the head of a sequence, etc.
Isolate smaller segments of your code and see if you still get the same kind of memory growth using JVisualVM. If knocking out or mocking some piece makes no difference, put it back; if it does make a difference, you can focus on that piece and figure out what is going on.
I don't know of any silver-bullet tool or technique; it's just a process of divide and conquer, and of thinking about what you are doing in your code.
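To make the divide-and-conquer step a bit more concrete, one cheap trick is to snapshot a rough size metric for each top-level key of the state between iterations and log which keys keep growing. Here is a minimal sketch of that idea, written in Python purely for illustration; step, the shape of the state, and the element-count heuristic are hypothetical stand-ins, and on the JVM you would apply the same idea to the map held in your atom:

def size_hint(value):
    # Cheap proxy for "how big is this part of the state":
    # element count for collections, 1 for scalars.
    try:
        return len(value)
    except TypeError:
        return 1

def report_growth(state, previous):
    # Compare per-key size hints against the last snapshot,
    # print the keys that grew, and return the new snapshot.
    current = {key: size_hint(value) for key, value in state.items()}
    for key in sorted(current):
        delta = current[key] - previous.get(key, 0)
        if delta > 0:
            print(f"{key}: {current[key]} elements (+{delta})")
    return current

# Hypothetical main loop:
# snapshot = {}
# while running:
#     state = step(state)                    # your pure update functions
#     snapshot = report_growth(state, snapshot)

A key whose count climbs on every iteration is the part of the state to chase; the function that writes that key is the likely culprit.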

Related

Find out what is taking my memory in torch7

I have written a rather complex Torch application and it works quite well, that is, if it doesn't run out of memory. I have tried to see what sort of inputs or situations cause it to seemingly randomly run out of memory, but so far I have had little to no success. So now I'm looking for a way to check which variables take how much (V)RAM.
With a simple statement I can switch between running my code on caffe:cuda or caffe:cl, which changes whether my program runs in RAM or on the GPU; I imagine such a switch will make validating my memory usage a lot easier.
I have already tried using print(collectgarbage("count")*1024) to check how much memory is in use at a given point in time, but this does not clearly show me where the memory is being used, perhaps because the program is relatively complex (although there are a few variables I suspect are hogging a lot of memory: neural networks, large matrices and such).
I already know that once I have identified what is hogging my memory, I can assign nil to it and call the garbage collector to free it.
So, in short: is there a program or a tool that lets me run a Torch program and then list each variable and its memory usage?
I don't know if you tried google :)
But here you are:
Torch7-profiling
Neural Model profiler script
"How to Profile a Lua Script using Pepperfish"
Easy Lua Profiling
To be honest, I've never had memory issues with Torch7, so it might just be that your implementation is not optimal. It might be a loop missing a collectgarbage call somewhere it should have one, e.g. in a training loop or between epochs.

Reading the code profiler (gprof)

I'm trying to implement someone's Quadtree system, but it's insanely slow, so I'm attempting to figure out why. I've done some tests, and it takes a staggering two seconds to query 1000 objects against 1000 with it. It's far faster to just iterate through the whole thing instead.
I used the Code::Blocks code profiler to try to find out why, but there's quite a lot in there I don't understand, namely the pthread things. I do use multithreading for networking, but the Quadtree system never touches it, as far as I can tell.
Image of the readout below.
http://imgur.com/a/i1PnH
I have a feeling I'm just using it wrong, but it doesn't hurt to learn what all those pthread and unwind things are. I've seen them often, but they've never been at the top so prominently before this Quadtree thing.
You could learn some things about performance.
One is that there is code you can change (sometimes called "your code") and code you can't change, like library code including all this pthread and unwind stuff.
The most the latter can do is maybe give you a distant clue about what in your code can be taking time.
It is more likely to either leave you mystified or exploring a wrong path.
Another is that "self time" is seldom useful. In all but the simplest toy programs, the computer's instruction pointer does not go very far in your code before calling into a subroutine (function, method, whatever it is called).
Then in that routine it doesn't go very far before calling yet another.
So you're going to see that if there's much self time in any routine, it's not likely to be in your code.
Another is that accuracy of measurement doesn't really matter.
If a program is some factor, like 100, slower than it will be after it is fixed, what does that mean?
That means that typically only 1 out of every 100 nanoseconds is spent making progress toward the finish line.
The other 99 are spent doing something useless - something that, if you knew what it was, you would easily recognize as useless and get rid of.
So, if you run the program under an IDE or debugger and simply click the "pause" button to make it stop at a random time, what is the chance you will catch it doing the useless thing?
Practically certain: the chance that the pause lands in the useful 1% is only 1%.
If you're not sure whether what you caught is really the problem, pause it again.
The chance that both samples land in the useful 1% is 1% squared, or 0.0001.
So after two samples, the chance that you've seen the speed bug is all but certain.
That's two samples, not thousands.
To generalize, suppose only 30% is being wasted.
Then the number of samples, on average, needed to see the problem two times is 2 / 30% or 6.67 samples.
Typically there are multiple problems, so each time you fix one you get a speedup factor, and the percent taken by the others grows by that factor.
That means you eventually get to the ones that were small at first.
That's the principle behind random pausing, the method I and many others use, and it is very effective.
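To make the arithmetic concrete, here is a small simulation sketch in Python; the 99%/1% split, the two-sample rule, and the 30% figure are taken from the reasoning above, not from any profiler:

import random

WASTE_FRACTION = 0.99   # fraction of run time spent on the useless work

def random_pause():
    # One random pause: True if it lands somewhere in the wasteful 99%.
    return random.random() < WASTE_FRACTION

def chance_both_samples_catch_it(samples=2, trials=100_000):
    # Monte Carlo estimate of the probability that every pause catches the waste.
    hits = sum(all(random_pause() for _ in range(samples)) for _ in range(trials))
    return hits / trials

print(chance_both_samples_catch_it())   # about 0.98, i.e. 0.99 squared
print(2 / 0.30)                         # about 6.67 samples to see a 30% problem twice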

Callstack sampling in Erlang

I am currently investigating a performance issue within a large Erlang application. The application exhibits larger-than-expected CPU load. To get a first grasp which parts of the system are responsible for the load, I'd like to perform callstack sampling as described in this answer.
Is there a better way to do this than calling erlang:process_info(Pid, backtrace) repeatedly and grepping the functions from that output?
Note that the system is too large for fprof, and etop did not point me in the right direction either. Using fprof on only parts of the system is not possible right now either, as I first need to pinpoint the general location of the performance issue.
A simple way to get the actual size of a process's stack is process_info(Pid, stack_size). While this only returns the size of the stack in words, it is a very simple and efficient way to see which processes have large stacks.
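For what it's worth, the sampling approach described in the question - grab backtraces at random moments and count which functions keep showing up - is language-agnostic. Here is a minimal sketch of the same idea in Python, purely for illustration; busy is a hypothetical workload, and in Erlang you would drive the loop with erlang:process_info(Pid, backtrace) as the question suggests:

import collections, sys, threading, time, traceback

def sample_stacks(duration=1.0, interval=0.01):
    # Poor-man's callstack sampling: periodically grab every other thread's
    # stack and count which function names show up most often.
    me = threading.get_ident()
    counts = collections.Counter()
    end = time.monotonic() + duration
    while time.monotonic() < end:
        for ident, frame in sys._current_frames().items():
            if ident == me:
                continue
            for entry in traceback.extract_stack(frame):
                counts[entry.name] += 1
        time.sleep(interval)
    return counts.most_common(10)

def busy():
    # Hypothetical workload standing in for the code being hunted.
    total = 0
    for i in range(20_000_000):
        total += i * i
    return total

worker = threading.Thread(target=busy)
worker.start()
print(sample_stacks())   # 'busy' should dominate the counts
worker.join()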

Memory efficiency vs Processor efficiency

In general use, should I bet on memory efficiency or processor efficiency?
In the end, I know it must depend on the software and hardware specs, but I think there's a general rule for when there are no constraints.
Example 01 (memory efficiency):
int n = 0;
if (n < getRndNumber())    /* no temporary, but getRndNumber() may be called twice */
    n = getRndNumber();
Example 02 (processor efficiency):
int n = 0, aux = 0;
aux = getRndNumber();      /* one extra variable, but only one call */
if (n < aux)
    n = aux;
They're just simple examples; I wrote them to show what I mean. Better examples will be well received.
Thanks in advance.
I'm going to wheel out the universal performance question trump card and say "neither, bet on correctness".
Write your code in the clearest possible way, set specific measurable performance goals, measure the performance of your software, profile it to find the bottlenecks, and then if necessary optimise knowing whether processor or memory is your problem.
(As if to make the case in point, your two 'simple examples' have different behaviour, assuming getRndNumber() does not return a constant value. If you'd written it in the simplest way, something like n = max(0, getRndNumber()), it might be less efficient but it would be more readable and more likely to be correct.)
Edit:
To answer Dervin's criticism below, I should probably state why I believe there is no general answer to this question.
A good example is taking a random sample from a sequence. For sequences small enough to be copied into another contiguous memory block, a partial Fisher-Yates shuffle which favours computational efficiency is the fastest approach. However, for very large sequences where insufficient memory is available to allocate, something like reservoir sampling that favours memory efficiency must be used; this will be an order of magnitude slower.
So what is the general case here? For sampling a sequence should you favour CPU or memory efficiency? You simply cannot tell without knowing things like the average and maximum sizes of the sequences, the amount of physical and virtual memory in the machine, the likely number of concurrent samples being taken, the CPU and memory requirements of the other code running on the machine, and even things like whether the application itself needs to favour speed or reliability. And even if you do know all that, then you're still only guessing, you don't really know which one to favour.
Therefore the only reasonable thing to do is implement the code in a manner favouring clarity and maintainability (taking factors you know into account, and assuming that clarity is not at the expense of gross inefficiency), measure it in a real-life situation to see whether it is causing a problem and what the problem is, and then if so alter it. Most of the time you will not have to change the code as it will not be a bottleneck. The net result of this approach is that you will have a clear and maintainable codebase overall, with the small parts that particularly need to be CPU and/or memory efficient optimised to be so.
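For reference, here is what the memory-frugal side of that trade-off looks like: a minimal reservoir-sampling sketch in Python (Algorithm R), which keeps only k items in memory no matter how long the input stream is, at the cost of one random draw per element:

import random

def reservoir_sample(stream, k):
    # Algorithm R: uniform sample of k items from an iterable of unknown
    # length, using only O(k) memory.
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)   # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: sample 5 numbers from a stream too large to copy wholesale.
print(reservoir_sample(range(10_000_000), 5))

The partial Fisher-Yates alternative mentioned above instead copies the sequence and shuffles just the first k slots, which is faster when the copy fits in memory.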
You think one is unrelated to the other? Why do you think that? Here are two examples of bottlenecks that often go unconsidered.
Example 1
You design a DB-related software system and find that I/O is slowing you down as you read one of the tables. Instead of allowing multiple queries that result in multiple I/O operations, you ingest the entire table first. Now all the rows of the table are in memory and the only limitation should be the CPU. Patting yourself on the back, you wonder why your program becomes hideously slow on memory-poor computers. Oh dear, you've forgotten about virtual memory, swapping, and such.
Example 2
You write a program whose methods create many small objects but run in O(1), O(log n), or at worst O(n) time. You've optimized for speed, yet you see that your application takes a long time to run. Curious, you profile to discover what the culprit could be. To your chagrin you discover that all those small objects add up fast. Your code is being held back by the GC.
You have to decide based on the particular application, usage, etc. In your example above, both memory and processor usage are trivial, so it's not a good example.
A better example might be the use of history tables in chess search. This method caches previously searched positions in the game tree in case they are re-searched in other branches of the game tree or on the next move.
However, it does cost space to store them, and space also costs time: if you use up too much memory you might end up hitting virtual memory, which will be slow.
Another example might be caching in a database server. Clearly it is faster to access a cached result from main memory, but then again it would not be a good idea to keep loading and freeing from memory data that is unlikely to be re-used.
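A common middle ground is to cap the cache, trading a bounded amount of memory for time. Here is a minimal Python sketch using functools.lru_cache; evaluate and its body are hypothetical stand-ins for something genuinely expensive, such as searching a game-tree position or running a database query:

from functools import lru_cache

@lru_cache(maxsize=100_000)        # bound how much memory the cache may hold
def evaluate(position):
    # Hypothetical expensive computation standing in for a deep search.
    return sum(ord(c) for c in position) % 1000

# Repeated positions are answered from memory; cold ones pay the full price.
for pos in ["e2e4", "d2d4", "e2e4"]:
    evaluate(pos)
print(evaluate.cache_info())       # hits=1, misses=2, maxsize=100000

Once the cache is full, the least recently used entries are evicted, which is exactly the point above about not keeping data that is unlikely to be re-used.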
In other words, you can't generalize. You can't even make a decision based on the code - sometimes the decision has to be made in the context of likely data and usage patterns.
In the past ten years, main memory has increased in speed hardly at all, while processors have continued to race ahead. There is no reason to believe this is going to change.
Edit: Incidentally, in your example, aux will most likely end up in a register and never make it to memory at all.
Without context, I think optimising for anything other than readability and flexibility is a mistake.
So the only general rule I could agree with is: "Optimise for readability, while bearing in mind that at some point you may have to optimise for either memory or processor efficiency."
Sorry it isn't quite as catchy as you would like...
In your example, version 2 is clearly better, even though version 1 is prettier to me, since, as others have pointed out, calling getRndNumber() multiple times requires more knowledge of getRndNumber() to follow.
It's also worth considering the scope of the operation you are looking to optimize; if the operation is time sensitive, say part of a web request or GUI update, it might be better to err on the side of completing it faster than saving memory.
Processor efficiency. Memory is egregiously slow compared to your processor. See this link for more details.
Although, in your example, the two would likely be optimized to be equivalent by the compiler.

What is the optimal trade off between refactoring and increasing the call stack?

I'm looking at refactoring a lot of large (1000+ line) methods into nice chunks that can then be unit tested as appropriate.
This started me thinking about the call stack, as many of my refactored blocks have other refactored blocks within them, and my large methods may well have been called by other large methods.
I'd like to open this for discussion to see if refactoring can lead to call stack issues. I doubt it will in most cases, but wondered about refactored recursive methods and whether it would be possible to cause a stack overflow without creating an infinite loop?
Excluding recursion, I wouldn't worry about call stack issues until they appear (which they likely won't).
Regarding recursion: it must be carefully implemented and carefully tested no matter how it's done so this would be no different.
I guess it's technically possible. But not something that I would worry about unless it actually happens when I test my code.
When I was a kid, and computers had 64K of RAM, the call stack size mattered.
Nowadays, it's hardly worth discussing. Memory is huge, stack frames are small, a few extra function calls are hardly measurable.
As an example, Python has an artificially small call stack so it detects infinite recursion promptly. The default size is 1000 frames, but this is adjustable with a simple API call.
The only way to run afoul of the stack in Python is to tackle Project Euler problems without thinking. Even then, you typically run out of time before you run out of stack. (100 trillion loops would take far longer than a human lifespan.)
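For the record, that limit is visible and adjustable through the sys module. A small sketch; the figure of 1000 is CPython's default, and deep_count is just a toy function written to trip the limit:

import sys

def deep_count(n):
    # Toy recursion: one stack frame per step, counting down to zero.
    return 0 if n == 0 else 1 + deep_count(n - 1)

print(sys.getrecursionlimit())     # 1000 by default in CPython

try:
    deep_count(5000)
except RecursionError as exc:
    print("hit the limit:", exc)

sys.setrecursionlimit(10_000)      # the "simple API call" mentioned above
print(deep_count(5000))            # now fine: prints 5000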
I think it's highly unlikely for you to get a stack overflow without recursion when refactoring. The only way I can see that happening is if you are allocating and/or passing a lot of data between methods on the stack itself.
