I have some constraints that z3 takes a long time to solve. I am aware of the "-st" command-line flag, which prints statistics but only at the very end, and of the TRACE facility for printing internal data structure values. Is there a way to get diagnostic information from within z3 (e.g. to monitor memory usage continuously) while it is running from the command line? External tools like ps are not always convenient and do not always serve the purpose. Thanks.
You can use the option -v:100, which sets the verbosity level to 100. Even then, it may not display memory usage as often as you want.
Another option is to add the following line of code at appropriate places.
timeit tt(get_verbosity_level() >= 3, "report");
It will display memory usage if the verbosity level is >= 3.
For example, a good place is at the beginning of the method lbool context::bounded_search() in src/smt/smt_context.cpp. This method is executed after each restart.
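For illustration, here is roughly what that placement could look like. This is only a sketch; the exact body of bounded_search() depends on the Z3 version you have checked out.
// src/smt/smt_context.cpp (sketch; surrounding code will differ between Z3 versions)
lbool context::bounded_search() {
    // When tt goes out of scope it reports the elapsed time and memory usage,
    // but only if the verbosity level is at least 3 (e.g. run z3 with -v:3 or higher).
    timeit tt(get_verbosity_level() >= 3, "report");

    // ... original body of bounded_search() ...
}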
I use recon_alloc:memory(allocated_types) and get output like the following.
34> recon_alloc:memory(allocated_types).
[{binary_alloc,1546650440},
{driver_alloc,21504840},
{eheap_alloc,28704768840},
{ets_alloc,526938952},
{fix_alloc,145359688},
{ll_alloc,403701800},
{sl_alloc,688968},
{std_alloc,67633992},
{temp_alloc,21504840}]
The eheap_alloc is using 28G. But summing up the heap_size of all processes:
>lists:sum([begin {_, X}=process_info(P, heap_size), X end || P <- processes()]).
683197586
Only 683M! Any idea where the 28G is?
You are not comparing the right values. From erlang:process_info
{heap_size, Size}
Size is the size in words of youngest heap generation of the
process. This generation currently include the stack of the process.
This information is highly implementation dependent, and may change if
the implementation change.
recon_alloc:memory(allocated_types) is in bytes by default; you can change the unit with set_unit. It is not the memory that is currently used, but the memory reserved by the VM, grouped into the different allocators. You can use recon_alloc:memory(used) instead. More details are in allocator() - Recon Library.
Searching through the Erlang source code for the eheap_alloc keyword I didn't come up with much. The most relevant piece of code was this XML code from erts_alloc.xml (https://github.com/erlang/otp/blob/172e812c491680fbb175f56f7604d4098cdc9de4/erts/doc/src/erts_alloc.xml#L46):
<tag><c>eheap_alloc</c></tag>
<item>Allocator used for Erlang heap data, such as Erlang process heaps.</item>
This says that process heaps are stored in eheap_alloc, but it doesn't say what else is stored there. The eheap_alloc stores everything your application needs to run, along with some additional space, so the VM doesn't have to request more memory from the OS every time something needs to be added. There are things the VM must keep in memory that aren't associated with a specific process. For example, large binaries, even though they may be used within a process, are not stored inside that process's heap. They are stored in a shared binary heap called binary_alloc. The binary heap, along with the process heaps and some extra memory, is what makes up eheap_alloc.
In your case it looks like you have a lot of memory in your binary_alloc. binary_alloc is probably using a significant portion of your eheap_alloc.
For more details on binary handling checkout these pages:
http://blog.bugsense.com/post/74179424069/erlang-binary-garbage-collection-a-love-hate
http://www.erlang.org/doc/efficiency_guide/binaryhandling.html#id65224
Update: The while() condition below gets optimized out by the compiler, so both threads just skip the condition and enter the C.S. even with the -O0 flag. Does anyone know why the compiler is doing this? By the way, declaring the global variables volatile causes the program to hang for some odd reason...
I read the CUDA programming guide but I'm still a bit unclear on how CUDA handles memory consistency with respect to global memory. (This is different from the memory hierarchy) Basically, I am running tests trying to break sequential consistency. The algorithm I am using is Peterson's algorithm for mutual exclusion between two threads inside the kernel function:
flag[threadIdx.x] = 1; // both these are global
turn = 1-threadIdx.x;
while(flag[1-threadIdx.x] == 1 && turn == (1- threadIdx.x));
shared_global_variable_x++;
flag[threadIdx.x] = 0;
This is fairly straightforward. Each thread asks for the critical section by setting its flag to one, and is nice by giving the turn to the other thread. At the evaluation of the while(), if the other thread did not set its flag, the requesting thread can enter the critical section safely. Now a subtle problem with this approach arises if the compiler re-orders the writes so that the write to turn executes before the write to flag. If this happens, both threads end up in the C.S. at the same time. This is fairly easy to prove with normal Pthreads, since most processors don't implement sequential consistency. But what about GPUs?
Both of these threads will be in the same warp. And they will execute their statements in lock-step mode. But when they reach the turn variable they are writing to the same variable so the intra-warp execution becomes serialized (doesn't matter what the order is). Now at this point, does the thread that wins proceed onto the while condition, or does it wait for the other thread to finish its write, so that both can then evaluate the while() at the same time? The paths again will diverge at the while(), because only one of them will win while the other waits.
After running the code, I am getting it to consistently break SC. The value I read is ALWAYS 1, which means that both threads somehow are entering the C.S. every single time. How is this possible (GPUs execute instructions in order)? (Note: I have compiled it with -O0, so no compiler optimization, and hence no use of volatile).
Edit: Since you have only two threads and 1-threadIdx.x works, you must be using thread IDs 0 and 1. Threads 0 and 1 will always be part of the same warp on all current NVIDIA GPUs. Warps execute instructions in SIMD fashion, with a thread execution mask for divergent conditions. Your while loop is a divergent condition.
When turn and flags are not volatile, the compiler probably reorders the instructions and you see the behavior of both threads entering the C.S.
When turn and flags are volatile, you see a hang. The reason is that one of the threads will succeed at writing turn, so turn will be either 0 or 1. Let's assume turn==0: If the hardware chooses to execute thread 0's part of the divergent branch, then all is OK. But if it chooses to execute thread 1's part of the divergent branch, then it will spin on the while loop and thread 0 will never get its turn, hence the hang.
You can probably avoid the hang by ensuring that your two threads are in different warps, but I think that the warps must be concurrently resident on the SM so that instructions can issue from both and progress can be made. (Might work with concurrent warps on different SMs, since this is global memory; but that might require __threadfence() and not just __threadfence_block().)
In general this is a great example of why code like this is unsafe on GPUs and should not be used. I realize though that this is just an investigative experiment. In general CUDA GPUs do not—as you mention most processors do not—implement sequential consistency.
Original Answer
The variables turn and flag need to be volatile, otherwise the load of flag will not be repeated and the condition turn == 1-threadIdx.x will not be re-evaluated but instead will be taken as true.
There should be a __threadfence_block() between the store to flag and store to turn to get the right ordering.
There should be a __threadfence_block() before the shared variable increment (which should also be declared volatile). You may also want a __syncthreads() or at least __threadfence_block() after the increment to ensure it is visible to other threads.
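For reference, an untested sketch of the kernel with these changes applied might look like the following. The names follow the question, and putting flag, turn and the shared counter in __device__ globals is my assumption; this only illustrates the fixes, it is not an endorsement of Peterson's algorithm on GPUs.
// Untested sketch; launch with a single block of two threads (IDs 0 and 1).
__device__ volatile int flag[2];               // volatile so loads are repeated
__device__ volatile int turn;
__device__ volatile int shared_global_variable_x;

__global__ void peterson_test()
{
    int me    = threadIdx.x;                   // assumes exactly two threads: 0 and 1
    int other = 1 - me;

    flag[me] = 1;
    __threadfence_block();                     // order the store to flag before the store to turn
    turn = other;

    while (flag[other] == 1 && turn == other)
        ;                                      // spin until the critical section is free

    __threadfence_block();                     // see the latest value before updating
    shared_global_variable_x++;                // critical section
    __threadfence_block();                     // make the update visible before releasing

    flag[me] = 0;
}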
I have a hunch that even after making these fixes you may still run into trouble, though. Let us know how it goes.
BTW, you have a syntax error in this line, so it's clear this isn't exactly your real code:
while(flag[1-threadIdx.x] == 1 and turn==[1- threadIdx.x]);
In the absence of extra memory barriers such as __threadfence(), sequential consistency of global memory is enforced only within a given thread.
I have a question somewhat similar to:
Mathematica running out of memory
I am interested in something like this:
ParallelTable[F[i], {i, 0, 14.9, 0.001}]
where F[i] is a complicated numerical integral (I haven't yet found an easy way to reproduce the problem without page-filling definitions for the integral).
My problem is that the subkernels blow up in memory, and I have to stop the evaluation if I don't want the machine to start swapping.
But even after I have stopped the evaluation, the kernels won't free the memory they occupy. I have tried
ClearSystemCache[]
I have even tried
ParallelEvaluate[ClearSystemCache[]]
but
ParallelEvaluate[MemoryInUse[]]
stays at
{823185944, 833146832, 812429208, 840150336, 850057024, 834441704,
847068768, 850424224}
It seems that all memory control only works for the main kernel?
So far the only way I have found is to shut down all the kernels and launch them again.
I really hope there are some solutions out there...
Thanks a lot.
Memory control works for the kernel in which control expressions involving functions such as MemoryConstrained, MemoryInUse, Clear, Unset, Remove, $HistoryLength, ClearSystemCache etc. are evaluated. It seems that in your case the source of the memory leak is not Mathematica's internal caching mechanism (thanks for the link, BTW!).
Have you tried evaluating $HistoryLength=0; in all subkernels before using them for computations? If you have not yet, I strongly recommend trying it.
Since you are working with numerical integration functions, I also suggest trying to optimize how you use them. For example, if you do numerical integration using NDSolve and need only a limited set of calculated points (or even a single point), you should use the form NDSolve[eqns,y,{x,x_needed_min,x_needed_max}] (or even NDSolve[eqns,y,{x,x_max,x_max}]) instead of NDSolve[eqns,y,{x,x_min,x_max}] or NDSolve[eqns,y,{x,0,x_max}]. This can dramatically reduce memory usage in some cases! You can also use EventLocator for memory control.
I was (am?) having the exact same problem, almost word for word. I just had some good luck with adding the following option to the problem integral:
Method-> {"GlobalAdaptive", "SymbolicProcessing"->False}
You can probably choose another method if you'd like, but I had success with this one within the last few minutes. Also, a lot of nasty inconsistencies I used to get are gone, and integration proceeds MUCH faster.
I'm using valgrind to find and trace memory issues. Now I want to do something like this:
before = getValgrindState();
do_something_curious();
after = getValgrindState();
difference = after - before;
std::cout << difference;
Is something like this possible with valgrind?
The MS Visual C++ runtime provides the following functions:
_CrtMemCheckpoint (to gather the current state of the allocated memory)
_CrtMemDifference (to calculate the difference between two states)
And I would like to know if there's a way to implement similar functionality with valgrind.
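For reference, the MSVC pattern I have in mind looks roughly like this (debug CRT builds only; untested sketch):
#include <crtdbg.h>

void do_something_curious();   // the code under investigation (from the snippet above)

void measure()
{
    _CrtMemState before, after, diff;

    _CrtMemCheckpoint(&before);                    // snapshot of the debug CRT heap
    do_something_curious();
    _CrtMemCheckpoint(&after);                     // second snapshot

    if (_CrtMemDifference(&diff, &before, &after)) // nonzero if the states differ
        _CrtMemDumpStatistics(&diff);              // print the difference
}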
A primitive/destructive way to do what you want is to use _exit() (note the underscore) to avoid calling any of the destructors.
1) Run valgrind/memcheck against your code with _exit() placed before do_something_curious();
2) run valgrind/memcheck again with _exit() placed after do_something_curious();
3) compare the results to see what do_something_curious() has left around (a rough sketch follows below).
[I couldn't figure out how massif would do what you want (is there a way to have massif keep track of free/delete operations and reconcile with malloc/new operations that I missed?)]
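A minimal sketch of the idea, assuming a trivial stand-in for do_something_curious(); the only change between the two runs is where _exit() is called:
#include <cstdlib>
#include <unistd.h>   // _exit()

// Stand-in for the code under investigation (name taken from the question).
static void do_something_curious() {
    (void)std::malloc(1024);   // deliberately "leaked" so the two reports differ
}

int main() {
    // Run 1: uncomment the next line, run under valgrind --leak-check=full,
    // and note the leak summary as it stands *before* do_something_curious().
    // _exit(0);

    do_something_curious();

    // Run 2: exit here instead; the difference between the two leak summaries
    // is what do_something_curious() left behind.
    _exit(0);
}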
What do you want to measure? What is "difference" supposed to be? If you want to examine memory usage, try valgrind's massif tool. Massif Visualizer is useful for interpreting the results.
I have an MS Visual Studio 2005 workspace containing all C code. This application (exe) allocates memory dynamically from the heap using malloc and realloc. I want to calculate the maximum size allocated on the heap via malloc/realloc by this application when I run a particular test case.
I do not want to change the code by noting the malloc sizes and accumulating them, because:
a) There can be a scenario where 1 KB is malloc'ed, then freed, and then 2 KB is malloc'ed. The maximum is 2 KB, which is the value I need, not 1 + 2 = 3 KB.
So I would really have to see everywhere malloc/free happens in this code and add code for it, which I want to avoid.
1) Are there any tools (freeware or licensed) to find the maximum or total memory allocated dynamically using malloc/realloc?
2) Does MS Visual Studio 2005/2008 itself provide anything of this sort?
thanks,
-AD
If you statically link with the CRT, you can 'overrule' the implementations of malloc, realloc and free (in fact, all functions that appear in malloc.c, realloc.c, free.c and/or dbgheap.c in the CRT). It is doable but may require some iterations before you get the full set of functions that need to be overruled.
If you dynamically link with the CRT, you can redefine malloc, realloc and free like this:
#define malloc(s) mymalloc(s)
#define realloc(p,s) myrealloc(p,s)
#define free(p) myfree(p)
The implementations of mymalloc, myrealloc and myfree can then simply use malloc, realloc and free (be sure not to use the #define in the source file that implements mymalloc, ...) or you could use the native Windows functions.
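For illustration, here is a minimal, untested sketch of what such wrappers could look like if the goal is to track the peak number of bytes allocated (not thread-safe, and it ignores alignment subtleties):
/* Compile this file WITHOUT the #defines above so the real malloc/free are used here. */
#include <stdlib.h>
#include <string.h>

static size_t g_current = 0;   /* bytes currently allocated through the wrappers */
static size_t g_peak    = 0;   /* highest value g_current has reached            */

void *mymalloc(size_t s)
{
    /* prefix each block with its size so myfree/myrealloc can update the totals */
    size_t *p = (size_t *)malloc(s + sizeof(size_t));
    if (!p) return NULL;
    *p = s;
    g_current += s;
    if (g_current > g_peak) g_peak = g_current;
    return p + 1;
}

void myfree(void *p)
{
    if (!p) return;
    size_t *h = (size_t *)p - 1;
    g_current -= *h;
    free(h);
}

void *myrealloc(void *p, size_t s)
{
    void *q = mymalloc(s);
    if (q && p) {
        size_t old = *((size_t *)p - 1);
        memcpy(q, p, old < s ? old : s);
        myfree(p);
    }
    return q;
}

size_t my_peak_heap_usage(void) { return g_peak; }   /* query the peak at any time */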
Memory Validator can do this.
There are several different reports that you will find useful:
Running Totals. This is presented as a dialog box and provides current, cumulative and total values for each of the main memory allocators (C runtime, HeapAlloc, LocalAlloc, GlobalAlloc, CoTaskMemAlloc, etc.).
Objects. This is one of the main tabs and displays object type, size, count, cumulative.
Also subtabs for per-thread and per-dll values.
Sizes. This is one of the main tabs and displays size, count, cumulative.
Also subtabs for per-thread and per-dll values.
Virtual. This displays a graphical view of memory (one pixel == one page of memory) and has subtabs showing detailed virtual memory data for virtual memory pages and virtual memory paragraphs.
Full disclosure: I am part of the Memory Validator team.
I would recommend the following:
If you do have access to the source code that you wish to analyze, replace all malloc/realloc calls with calls to your OWN function that performs the analysis.
If you don't have access to the source code, you can use the Detours library from Microsoft. The library intercepts calls to arbitrary functions and redirects them to the custom-made implementation. In this implementation you can perform the analysis and then fall back to standard malloc/realloc.
VS has a number of heap debugging tools such as _heapwalk, which will let you walk through the heap and get information about blocks on the heap. Most of what you need to do is figure out when your heap is at maximum usage, so you know when to walk it and find its size.
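A rough, MSVC-specific sketch of totalling the in-use blocks with _heapwalk; deciding when and how often to call it is the hard part:
#include <malloc.h>
#include <stdio.h>

/* Returns the number of bytes in blocks currently marked as in use on the CRT heap. */
static size_t heap_in_use(void)
{
    _HEAPINFO info;
    size_t total = 0;
    int status;

    info._pentry = NULL;                       /* start the walk at the beginning of the heap */
    while ((status = _heapwalk(&info)) == _HEAPOK) {
        if (info._useflag == _USEDENTRY)       /* skip free blocks */
            total += info._size;
    }
    if (status != _HEAPEND && status != _HEAPEMPTY)
        fprintf(stderr, "_heapwalk stopped with status %d\n", status);
    return total;
}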