What is the unit of memory usage in Z3 statistics?

What is the unit in which memory usage is measured in Z3 statistics? Is it MB or KB?
And what does the memory figure mean exactly? Is it the maximum memory usage or the aggregate sum of all allocations during execution?

It's an approximation of the maximum heap size during execution and it is added to the statistics object through the following function in cmd_context.cpp:
void cmd_context::display_statistics(...) {
    statistics st;
    ...
    unsigned long long mem = memory::get_max_used_memory();
    ...
    st.update("memory", static_cast<double>(mem)/static_cast<double>(1024*1024));
    ...
}
Thus it is in MB. It is only an approximation though, because the counters are not updated at every allocation; see the following comment in memory_manager.cpp:
// We only integrate the local thread counters with the global one
// when the local counter > SYNCH_THRESHOLD
#define SYNCH_THRESHOLD 100000
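If you want to read that figure programmatically instead of from the textual output, the C++ API exposes the same statistics as key/value pairs. Below is a minimal sketch; the constraint x > 0 is only a placeholder, and exactly which keys appear depends on the solver/tactic used, but the "memory" entry is the MB value described above:
#include <iostream>
#include "z3++.h"

int main() {
    z3::context ctx;
    z3::solver s(ctx);
    z3::expr x = ctx.int_const("x");
    s.add(x > 0);                      // placeholder problem
    s.check();

    z3::stats st = s.statistics();     // key/value statistics, including "memory"
    for (unsigned i = 0; i < st.size(); ++i) {
        std::cout << st.key(i) << " = ";
        if (st.is_double(i))
            std::cout << st.double_value(i);   // "memory" shows up here, in MB
        else
            std::cout << st.uint_value(i);
        std::cout << "\n";
    }
}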

Related

OpenCL kernel out of resources based on number of loop iterations INSIDE the kernel. Can the compiled kernel be too large to fit on the GPU?

TL;DR I have an OpenCL kernel that loops for a large number of iterations and calls several user-made functions. The kernel works fine for a few iterations, but increasing the number of iterations causes a CL_OUT_OF_RESOURCES (-5) error. If the same kernel is executed on a better GPU it is able to loop for more iterations without the error. What can be causing this error based on the number of iterations? Is it possible that the loops are being unrolled and generating code larger than the GPU can hold?
I am developing an OpenCL kernel to run on a GPU that computes a very complex function. To keep things organized, I have a "kernel.cl" file with the main kernel (the __kernel void function) and an "aux_functions.cl" file with ~20 auxiliary functions (they are of type int, int2, int16, but not __kernel) that are called several times by the kernel and by one another.
The problem specification is roughly as follows (justification for so many loops):
I have two arrays representing full HD images (1920x1080 integers)
For each 128x128 patch of one image, I must find the value of 4 parameters that optimize a given function (the second image is used to evaluate how good it is)
For the same 128x128 patch and the same 4 parameters, each 4x4 sub-patch is transformed slightly differently based on its position inside the larger 128x128 patch
And I tried to model the execution as follows:
Each workgroup will compute the kernel for one 128x128 patch (I started by processing only the first 10 patches -- 10 workgroups)
Each workgroup is composed of 256 workitems
Each workitem will test a distinct set of values (a fraction of a predefined set) for the 4 parameters based on their IDs
The main structure of the kernel is as follows:
__kernel void funct(__global int *referenceFrameSamples, __global int *currentFrameSamples, const int frameWidth, const int frameHeight, __global int *result){
    // Initialize some variables and get global and local IDs
    for(executed X times){ // These 4 outer loops are used to test different combinations of parameters to be applied in a given function in the inner loops
        for(executed Y times){
            for(executed Z times){
                for(executed W times){
                    // Simple assignments based on the outer for loops
                    for(executed 32x){ // Each execution of the inner loop applies a function to a 4x4 patch
                        for(executed 32x){
                            // The relevant computation is performed here
                            // Calls a couple of lightweight functions using only the private variables
                            // Calls a complex function that uses the __global int *referenceFrameSamples variable
                        }
                    }
                    // Compute something and use select() to update the best value
                }
            }
        }
    }
    // Write the best value to __global *results buffer
}
The problem is that when the 4 outer loops are repeated only a few times the kernel runs fine, but if I increase the iterations the kernel crashes with the error ERROR! clWaitForEvents returned CL_OUT_OF_RESOURCES (-5). I am testing it on a notebook with a GeForce 940MX GPU with 2 GB, and the kernel starts crashing when X * Y * Z * W = 1024.
The clEnqueueNDRangeKernel() call reports no error; only the clWaitForEvents() call after it returns an error. I am using CLIntercept to profile the errors and running time of the kernel. Also, when the kernel runs smoothly I can measure the execution time correctly (shown next), but when it crashes, the "measured" execution time is ridiculously wrong (billions of milliseconds) even though it crashes within the first seconds.
cl_ulong time_start;
cl_ulong time_end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
double nanoSeconds = time_end - time_start;
printf("OpenCL execution time is: %0.3f milliseconds\n", nanoSeconds / 1000000.0);
What I tested:
Improve the complex auxiliary function that used the __global variable: instead of passing the __global pointer, I read the relevant part of the array into a private array and passed it as an argument. Outcome: improved running time in the successful cases, but it still fails in the same case
Reduce workgroups and workitems: even using 1 workgroup and 1 workitem (the absolute minimum) with the same number of iterations yields the same error. For a smaller number of iterations, running time decreases with fewer workitems/workgroups
Running the same kernel on a better GPU: after making the previous 2 modifications (improved function and reduced workitems) I launched the kernel on a desktop equipped with a Titan V GPU with 12 GB. It is able to compute the kernel with a larger number of iterations (I tried up to 1 million iterations) without giving the CL_OUT_OF_RESOURCES error, and the running time seems to increase linearly with the iterations. Although this is the computer that will actually run the kernel over a dataset to solve my problem, it is a server that must be accessed remotely. I would prefer to do the development on my notebook and deploy the final code on the server.
My guess: I know that function calls are inlined on the GPU. Since the program crashes based on the number of iterations, my only guess is that these for loops are being unrolled and, with the inlined functions, the compiled kernel becomes too big to fit on the GPU (even with a single workitem). This also explains why using a better GPU allows increasing the number of iterations.
Question: What could be causing this CL_OUT_OF_RESOURCES error based on the number of iterations?
Of course I could reduce the number of iterations in each workitem, but then I would need multiple workgroups to process the same set of data (the same 128x128 patch) and would need to access global memory to select the best result between workgroups of the same patch. I may end up proceeding in this direction, but I would really like to know what is happening with the kernel at this moment.
Update after @doqtor's comment:
Using -cl-nv-verbose when building the program reports the following resource usage. Strangely, these values do not change with the number of iterations, whether the program runs successfully or crashes.
ptxas info : 0 bytes gmem, 532 bytes cmem[3]
ptxas info : Compiling entry function 'naive_affine_2CPs_CU_128x128_SR_64x64' for 'sm_50'
ptxas info : Function properties for naive_affine_2CPs_CU_128x128_SR_64x64
ptxas . 66032 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 140 registers, 360 bytes cmem[0]
Running clinfo reports that my GPU has
Registers per block (NV) 65536
Global memory size 2101870592 (1.958GiB)
Max memory allocation 525467648 (501.1MiB)
Local memory size 49152 (48KiB)
It seems that I am not using too many registers, but I don't know how those stack frame, cmem[0] and cmem[3] relate to the memory information reported by clinfo.
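For reference, similar per-kernel figures can also be queried at run time with clGetKernelWorkGroupInfo. A minimal host-side sketch follows, assuming kernel and device are the handles already created elsewhere in the host code (whether CL_KERNEL_PRIVATE_MEM_SIZE corresponds exactly to the ptxas stack frame is an assumption about the NVIDIA driver, not something the OpenCL standard guarantees):
size_t wg_size = 0;
cl_ulong local_bytes = 0, private_bytes = 0;
// Largest work-group size the driver accepts for this kernel on this device
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(wg_size), &wg_size, NULL);
// __local memory used by the kernel
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                         sizeof(local_bytes), &local_bytes, NULL);
// Private (per work-item) memory, i.e. registers/stack spilled to memory
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                         sizeof(private_bytes), &private_bytes, NULL);
printf("max WG size %zu, local %llu B, private %llu B per work-item\n",
       wg_size, (unsigned long long)local_bytes, (unsigned long long)private_bytes);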
Is it possible that the loops are being unrolled and generating code larger than the GPU can hold?
Yes, that is part of the problem. The compiler sees that you have loops with a fixed, small range and automatically unrolls them. With six nested loops the assembly blows up, which causes register spilling into global memory and makes the application very slow.
However, even if the compiler does not unroll the loops, every thread does X*Y*Z*W*32*32 iterations of "the relevant computation", which takes an eternity. The system thinks the GPU has frozen and you get CL_OUT_OF_RESOURCES.
Can you really not parallelize any of these six nested loops? The best solution would be to parallelize them all, that is, include them in the kernel range and launch a few hundred million threads that do "the relevant computation" without any loops. You should have as many independent threads/workgroups as possible to get the best performance (saturate the GPU); a host-side sketch of this mapping follows at the end of this answer.
Remember, your GPU has thousands of cores grouped into warps of 32 threads and into SMs of 2 or 4 warps' worth of cores; if you launch only a single workgroup, it will run on a single SM with 64 or 128 cores and the remaining cores stay idle.
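A minimal host-side sketch of that idea, assuming the four parameter ranges X, Y, Z, W and the patch count numPatches are flattened into the NDRange (the variable names and the exact mapping are placeholders, not the asker's real code):
// Hypothetical mapping: dim 0 enumerates the X*Y*Z*W parameter combinations,
// dim 1 the 32*32 4x4 sub-patches, dim 2 the 128x128 patches.
// In OpenCL 1.x the global sizes must be multiples of the local sizes,
// so either pad them or pass NULL and let the driver pick the work-group size.
size_t global[3] = { (size_t)(X * Y * Z * W), 32 * 32, (size_t)numPatches };
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 3, NULL, global, NULL,
                                    0, NULL, &event);
// Inside the kernel, get_global_id(0) is decoded back into the four parameters,
// get_global_id(1) into the 4x4 sub-patch and get_global_id(2) into the patch;
// the per-patch "best value" reduction is then done in local memory and/or a
// separate reduction kernel instead of in the outer loops.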

CUDA Local memory register spilling overhead

I have a kernel which uses a lot of registers and spills them into local memory heavily.
4688 bytes stack frame, 4688 bytes spill stores, 11068 bytes spill loads
ptxas info : Used 255 registers, 348 bytes cmem[0], 56 bytes cmem[2]
The spillage seems high enough that I believe it goes past the L1 and possibly even the L2 cache. Since local memory is private to each thread, how are accesses to local memory coalesced by the compiler? Is this memory read in 128-byte transactions like global memory? With this amount of spillage I am getting low memory bandwidth utilisation (50%); I have similar kernels without the spillage that obtain up to 80% of the peak memory bandwidth.
EDIT
I've extracted some more metrics with the nvprof tool. If I understand the technique mentioned here correctly, then I have a significant amount of memory traffic due to register spilling (4 * L1 local load hits and misses / sum of all read queries across the 4 sectors of L2 = (4 * (45936 + 4278911)) / (5425005 + 5430832 + 5442361 + 5429185) = 79.6%). Could somebody verify whether I am right here?
Invocations Event Name Min Max Avg
Device "Tesla K40c (0)"
Kernel: mulgg(double const *, double*, int, int, int)
30 l2_subp0_total_read_sector_queries 5419871 5429821 5425005
30 l2_subp1_total_read_sector_queries 5426715 5435344 5430832
30 l2_subp2_total_read_sector_queries 5438339 5446012 5442361
30 l2_subp3_total_read_sector_queries 5425556 5434009 5429185
30 l2_subp0_total_write_sector_queries 2748989 2749159 2749093
30 l2_subp1_total_write_sector_queries 2748424 2748562 2748487
30 l2_subp2_total_write_sector_queries 2750131 2750287 2750205
30 l2_subp3_total_write_sector_queries 2749187 2749389 2749278
30 l1_local_load_hit 45718 46097 45936
30 l1_local_load_miss 4278748 4279071 4278911
30 l1_local_store_hit 0 1 0
30 l1_local_store_miss 1830664 1830664 1830664
EDIT
I've realised that the transactions I was thinking of are 128 bytes, not 128 bits.
According to
Local Memory and Register Spilling
the performance impact of register spills involves more than just the coalescing decided at compile time; more importantly, reading from and writing to the L2 cache is already quite expensive and you want to avoid it.
The presentation suggests that, using a profiler, you can count at run time the number of L2 queries caused by local memory (LMEM) accesses, check whether they make up a major share of all L2 queries, and then shift the shared memory/L1 split in favour of L1 through a single host call, for example
cudaDeviceSetCacheConfig( cudaFuncCachePreferL1 );
Hope this helps.
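For completeness, a minimal sketch of the device-wide call and of the per-kernel variant (the mulgg signature is taken from the profiler output above; the kernel body and error handling are placeholders):
#include <cstdio>
#include <cuda_runtime.h>

__global__ void mulgg(const double *a, double *b, int m, int n, int k) {
    /* ... the spilling kernel ... */
}

int main() {
    // Prefer a larger L1 for every kernel launched by this host thread...
    cudaError_t err = cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
    // ...or only for the kernel that spills heavily:
    // err = cudaFuncSetCacheConfig(mulgg, cudaFuncCachePreferL1);
    if (err != cudaSuccess)
        printf("cache config failed: %s\n", cudaGetErrorString(err));
    // ... allocate, copy, launch mulgg<<<grid, block>>>(...), and so on ...
    return 0;
}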

CUDA: are access times for texture memory similar to coalesced global memory?

My kernel threads access a linear character array in a coalesced fashion. If I map the array to texture I don't see any speedup. The running times are almost the same. I'm working on a Tesla C2050 with compute capability 2.0 and read somewhere that global accesses are cached. Is that true? Perhaps that is why I am not seeing a difference in the running time.
The array in the main program is
char *dev_database = NULL;
cudaMalloc( (void**) &dev_database, JOBS * FRAGMENTSIZE * sizeof(char) );
and I bind it to texture texture<char> texdatabase with
cudaBindTexture(NULL, texdatabase, dev_database, JOBS * FRAGMENTSIZE * sizeof(char) );
Each thread then reads a character ch = tex1Dfetch(texdatabase, p + id) where id is threadIdx.x + blockIdx.x * blockDim.x and p is an offset.
I'm binding only once and dev_database is a large array. Actually I found that if the size is too large the bind fails. Is there a limit on the size of the array to bind? Thanks very much.
There are several possibilities for why you don't see any difference in performance, but the most likely is that this memory access is not your bottleneck. If it is not your bottleneck, making it faster will have no effect on performance.
Regarding caching: in this case, since you are reading only bytes, each warp will read 32 bytes, which means each group of 4 warps maps to one cache line. So assuming few cache conflicts, you will get up to 4x reuse from the cache, and if this memory access is a bottleneck, it is conceivable that the texture cache might not benefit you more than the general-purpose cache.
You should first determine if you are bandwidth bound and if this data access is the culprit. Once you have done that, then optimize your memory accesses. Another tactic to consider is to access 4 to 16 chars per thread per load (using a char4 or int4 struct with byte packing/unpacking) rather than one per thread to increase the number of memory transactions in flight at a time -- this can help to saturate the global memory bus.
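A minimal sketch of the char4 variant, assuming the array length is a multiple of 4 and process_byte stands in for whatever each thread actually does with a character (both names are hypothetical):
__device__ void process_byte(char c) { /* placeholder for the real per-character work */ }

__global__ void scan4(const char4 *db, int n4) {   // n4 = number of char4 elements
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        char4 v = db[i];   // one 4-byte load instead of four 1-byte loads
        process_byte(v.x);
        process_byte(v.y);
        process_byte(v.z);
        process_byte(v.w);
    }
}

// Host side, e.g.: scan4<<<(n/4 + 255)/256, 256>>>((const char4 *)dev_database, n/4);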
There is a good presentation by Paulius Micikevicius from GTC 2010 that you might want to watch. It covers both analysis-driven optimization and the specific concept of memory transactions in flight.

Problem with the timings of a program that uses 1-8 threads on a server that has 4 dual-core CPUs?

I am running a program on a server at my university that has 4 Dual-Core AMD Opteron(tm) 2210 HE processors, and the O.S. is Linux version 2.6.27.25-78.2.56.fc9.x86_64. My program implements Conway's Game of Life and it runs using pthreads and OpenMP. I timed the parallel part of the program with gettimeofday() using 1-8 threads, but the timings don't seem right. I get the biggest time with 1 thread (as expected), then the time gets smaller, but the smallest time is with 4 threads.
Here is an example using a 1000x1000 array:
1 thread ~ 9.62 sec, 2 threads ~ 4.73 sec, 3 ~ 3.64 sec, 4 ~ 2.99 sec, 5 ~ 4.19 sec, 6 ~ 3.84 sec, 7 ~ 3.34 sec, 8 ~ 3.12 sec.
The above timings are with pthreads. With OpenMP the timings are smaller but follow the same pattern.
I expected the time to keep decreasing from 1 to 8 threads because of the 4 dual-core CPUs: since there are 4 CPUs with 2 cores each, 8 threads should be able to run at the same time. Does it have to do with the operating system that the server runs?
I also tested the same programs on another server that has 7 Dual-Core AMD Opteron(tm) 8214 processors and runs Linux version 2.6.18-194.3.1.el5. There the timings are what I expected: they get smaller from 1 thread (the biggest) to 8 (the smallest execution time).
The program implements the Game of Life correctly, both with pthreads and OpenMP; I just can't figure out why the timings look like the example I posted. So, in conclusion, my questions are:
1) Does the number of threads that can run at the same time on a system depend on the number of cores per CPU? Does it depend only on the number of CPUs even though each CPU has more than one core? Or on all of the above plus the operating system?
2) Does it have to do with the way I divide the 1000x1000 array among the threads? But if it did, wouldn't the OpenMP code give a different pattern of timings?
3) What is the reason I might get such timings?
This is the code I use with OpenMP:
#include <stdio.h>
#include <sys/time.h>
#include <omp.h>

#define Row (1000+2)
#define Col (1000+2)

int num;
int (*temp)[Col];
int (*a1)[Col];
int (*a2)[Col];
int main() {
    int i, j, l, sum;
    int array1[Row][Col], array2[Row][Col];
    struct timeval tim;
    struct tm *tm;
    double start, end;
    int st, en;

    for (i = 0; i < Row; i++)
        for (j = 0; j < Col; j++) {
            array1[i][j] = 0;
            array2[i][j] = 0;
        }

    array1[3][16] = 1;
    array1[4][16] = 1;
    array1[5][15] = 1;
    array1[6][15] = 1;
    array1[6][16] = 1;
    array1[7][16] = 1;
    array1[5][14] = 1;
    array1[4][15] = 1;

    a1 = array1;
    a2 = array2;

    printf("\nGive number of threads:");
    scanf("%d", &num);

    gettimeofday(&tim, NULL);
    start = tim.tv_sec + (tim.tv_usec / 1000000.0);
    omp_set_num_threads(num);

    #pragma omp parallel private(l, i, j, sum)
    {
        printf("Number of Threads:%d\n", omp_get_num_threads());
        for (l = 0; l < 100; l++) {
            #pragma omp for
            for (i = 1; i < (Row - 1); i++) {
                for (j = 1; j < (Col - 1); j++) {
                    sum = a1[i-1][j-1] + a1[i-1][j] + a1[i-1][j+1] + a1[i][j-1] + a1[i][j+1] + a1[i+1][j-1] + a1[i+1][j] + a1[i+1][j+1];
                    if ((a1[i][j] == 1) && (sum == 2 || sum == 3))
                        a2[i][j] = 1;
                    else if ((a1[i][j] == 1) && (sum < 2))
                        a2[i][j] = 0;
                    else if ((a1[i][j] == 1) && (sum > 3))
                        a2[i][j] = 0;
                    else if ((a1[i][j] == 0) && (sum == 3))
                        a2[i][j] = 1;
                    else if (a1[i][j] == 0)
                        a2[i][j] = 0;
                } // end of iteration J
            } // end of iteration I
            #pragma omp barrier
            #pragma omp single
            {
                temp = a1;
                a1 = a2;
                a2 = temp;
            }
            #pragma omp barrier
        } // end of iteration L
    } // end of parallel region

    gettimeofday(&tim, NULL);
    end = tim.tv_sec + (tim.tv_usec / 1000000.0);
    printf("\nTime Elapsed:%.6lf\n", end - start);
    printf("all ok\n");
    return 0;
}
Timings with the OpenMP code:
a) System with 7 dual-core CPUs:
1 thread ~ 7.72 sec, 2 threads ~ 4.53 sec, 3 threads ~ 3.64 sec, 4 threads ~ 2.24 sec, 5 ~ 2.02 sec, 6 ~ 1.78 sec, 7 ~ 1.59 sec, 8 ~ 1.44 sec
b) System with 4 dual-core CPUs:
1 thread ~ 9.06 sec, 2 threads ~ 4.86 sec, 3 threads ~ 3.49 sec, 4 threads ~ 2.61 sec, 5 ~ 3.98 sec, 6 ~ 3.53 sec, 7 ~ 3.48 sec, 8 ~ 3.32 sec
Above are the timings I get.
One thing you have to remember is that you're doing this on a shared-memory architecture. The more loads/stores you try to do in parallel, the more likely you are to hit contention on memory access, which is a relatively slow operation. So, in my experience, typical applications don't benefit from more than 6 cores. (This is anecdotal; I could go into a lot of detail, but suffice it to say, take these numbers with a grain of salt.)
Try instead to minimize access to shared resources if possible and see what that does to your performance. Otherwise, optimize for what you've got, and remember this:
Throwing more cores at a problem does not mean it will go quicker. As with taxation, there's a curve past which adding cores becomes a detriment to getting the most performance out of your program. Find that "sweet spot" and use it.
You write
The above timings are with pthreads. With OpenMP the timings are smaller but follow the same pattern.
Congratulations, you have discovered the pattern that all parallel programs follow! If you plot execution time against the number of processors, the curve eventually flattens out and starts to rise; you reach a point where adding more processors slows things down.
The interesting question is how many processors you can profitably use, and the answer depends on many factors. @jer has pointed out some of the factors that affect the scalability of programs on shared-memory computers. Other factors, principally the ratio of communication to computation, ensure that the shape of the performance curve will be the same on distributed-memory computers too.
The other factor that matters when measuring the parallel scalability of your program is the problem size(s) you use. How does your performance curve change when you try a grid of 1414 x 1414 cells? I would expect that curve to lie below the curve for the 1000 x 1000 problem and to flatten out later.
For further reading, Google Amdahl's Law and Gustafson's Law.
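For reference, Amdahl's Law bounds the speedup on N processors when a fraction p of the runtime can be parallelized:
S(N) = 1 / ((1 - p) + p / N)
so even with p = 0.95 the speedup can never exceed 1/0.05 = 20, no matter how many cores you add.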
It could be that your sysadmin is controlling how many threads you can execute simultaneously or how many cores you run on. I don't know if that is possible at the sysadmin level, but it certainly is possible to tell a process that.
Or your algorithm could be using the L2 cache poorly. Hyper-threading, or whatever they call it now, works best when one thread is doing something that takes a long time and the other thread is not. Accessing memory that is not in the L2 cache is SLOW and the thread doing so will stall while it waits. This is just one example of where the time to run multiple threads on a single core comes from. A quad-core memory bus might allow each core to access some of the RAM at the same time, but not each thread in each core. If both threads go for RAM then they basically run sequentially. That could be where your 4 comes from.
You might look at whether you can change your loops so they operate on contiguous RAM. If you break the problem into small blocks of data that fit in your L2 cache and iterate through those blocks, you might get the full 8x; a blocked version of the inner loops is sketched below. If you search for the Intel machine language programmer's guides for their latest processors... they talk about these issues.
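A minimal sketch of that blocking idea applied to the inner update, assuming a tile size BLOCK tuned so a few tiles of a1 and a2 fit in L2 (the value 64 is only a placeholder; the update rule is a condensed but equivalent form of the if/else chain in the question):
#define BLOCK 64   /* placeholder tile size; tune so the working set fits in L2 */

#pragma omp for
for (int ii = 1; ii < Row - 1; ii += BLOCK)
    for (int jj = 1; jj < Col - 1; jj += BLOCK)
        for (int i = ii; i < ii + BLOCK && i < Row - 1; i++)
            for (int j = jj; j < jj + BLOCK && j < Col - 1; j++) {
                int sum = a1[i-1][j-1] + a1[i-1][j] + a1[i-1][j+1]
                        + a1[i][j-1]                + a1[i][j+1]
                        + a1[i+1][j-1] + a1[i+1][j] + a1[i+1][j+1];
                a2[i][j] = (sum == 3) || (a1[i][j] == 1 && sum == 2);
            }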

Matlab: Free memory is lost after calling a function

I have some trouble with memory management in Matlab. Eventually it leads to not enough free memory and an error. I tried to pinpoint the problem and found one interesting "feature": somehow I lose free memory in Matlab.
I do the following:
1) Start Matlab
2) typing "memory" I get: Maximum possible array: 1293 mb, Memory available for all arrays: 1456 mb
3) I'll call a function. The function is rather long, so it's hard to paste it here. But basically it loads 5 ca. 300mb mat files (sequentially), picks some few values and returns them. The returned matrix is ca. 1,2mb (4650x35 double)
4) I clear all variables in workspace ("clear all")
5) typing "memory" I get: Maximum possible array: 759 mb, Memory available for all arrays: 1029 mb
If I repeat steps 3) to 5) the memory numbers are constant.
So what is wrong here? Where do I lose the 400 MB of free space? The memory used by Matlab is constant at around 330 MB.
Does anyone have an idea what is wrong here? Or is this completely normal and I am just missing something?
Thanks
Thomas
Ps: I use Matlab 2010a and Win 7 pro 32bit.
A good part of this "lost" memory is probably due to memory fragmentation. As Matlab allocates and frees arrays over the course of a session, the memory gets broken up into smaller areas, and some is lost to overhead in the memory manager, at both the Matlab and the underlying C levels. The overhead is not counted as "used" by Matlab because it's not being used to hold M-code array values. Some memory may also be consumed by Matlab loading additional M-files and libraries, allocating internal buffers or structures, or by expansion of the Java heap in Matlab's embedded JVM. This is normal. After doing some work, Matlab won't have as much memory available as it did in a fresh session.
AFAIK, once low-level fragmentation occurs, there's nothing you can do to eliminate it aside from restarting Matlab. Allocating lots of small arrays can accelerate fragmentation. This sometimes happens if you use large cellstrs or large arrays of objects. So if you are having problems, you may need to reduce your peak memory usage in the function by breaking the work into smaller chunks, reducing cell usage, and so on. And if you have big cellstr arrays in the MAT files, convert them to char. The "high water mark" of allocation is what governs fragmentation, so if you can break your data set into smaller chunks, you can fit it in less memory.
Inside your function, clear as much as you can from one MAT file before moving on to the next. One way to do this implicitly is to move the per-file processing into a subfunction if it's currently sitting in a loop in your main function.
To help debug, do a "dbstop if all error", which will get triggered by the OOM. From there, you can use whos and the debugger to find out where the space is being taken up when you exhaust memory. That might reveal temp variables that need to be cleaned up, or suggest ways of chunking the work.
If you'd like to experiment to see what fragmentation looks like and how it affects memory()'s output, here's a function that will just create some fragmentation.
function fragmem(nbytes, chunksize)
%FRAGMEM Fragment the Matlab session's memory
if nargin < 2; chunksize = 1*2^10; end
nbytes = nbytes - rem(nbytes, chunksize);
nsteps = 100; % to make initial input relatively small
c = cell([1 nsteps]);
stepsize = nbytes / nsteps;
chunksperstep = ceil(stepsize / chunksize);
fprintf('Fragmenting %d MB memory into %d KB chunks (%d steps of %d chunks)\n',...
round(nbytes/2^20), round(chunksize/2^10), nsteps, chunksperstep);
x = zeros([1 chunksperstep * chunksize], 'uint8');
colsizes = repmat(chunksize, [1 chunksperstep]);
for i = 1:nsteps
c{i} = mat2cell(x, 1, colsizes);
end
Fragmenting 300 MB into 1 KB chunks on my win32 machine reproduces a "loss" about the size you're seeing.
>> memory
Maximum possible array: 1384 MB (1.451e+009 bytes) *
Memory available for all arrays: 1552 MB (1.627e+009 bytes) **
Memory used by MATLAB: 235 MB (2.463e+008 bytes)
Physical Memory (RAM): 3311 MB (3.472e+009 bytes)
>> fragmem(300*2^20)
Fragmenting 300 MB memory into 1 KB chunks (100 steps of 3072 chunks)
>> memory
Maximum possible array: 1009 MB (1.059e+009 bytes) *
Memory available for all arrays: 1175 MB (1.232e+009 bytes) **
Memory used by MATLAB: 257 MB (2.691e+008 bytes)
Physical Memory (RAM): 3311 MB (3.472e+009 bytes)
>>
