CUDA: maximum number of blocks and memory allocation management

I'm writing a CUDA kernel that has to run on this device:
name: GeForce GTX 480
CUDA capability: 2.0
Total global mem: 1610285056
Total constant Mem: 65536
Shared mem per mp: 49152
Registers per mp: 32768
Threads in warp: 32
Max threads per block: 1024
Max thread dimensions: (1024, 1024, 64)
Max grid dimensions: (65535, 65535, 65535)
The kernel, in minimal form, is:
__global__ void CUDAvegas( ... )
{
    devParameters p;
    extern __shared__ double shared[];
    int width = Ndim * Nbins;
    int ltid = p.lId;
    while (ltid < 2 * Ndim) {
        shared[ltid + 2 * width] = ltid;
        ltid += p.lOffset; // offset inside a block
    }
    __syncthreads();

    din2Vec<double> lxp(Ndim, Nbins);
    __syncthreads();

    for (int i = 0; i < Ndim; i++) {
        for (int j = 0; j < Nbins; j++) {
            lxp.v[i][j] = shared[i*Nbins + j];
        }
    }
}// end kernel
where Ndim=2, Nbins=128, devParameters is a class whose member p.lId gives the local thread id (inside a block), and din2Vec is a class that creates a vector of dimension Ndim*Nbins with the new command (its destructor implements the corresponding delete[]).
The nvcc output is:
nvcc -arch=sm_20 --ptxas-options=-v file.cu -o file.x
ptxas info : Compiling entry function '_Z9CUDAvegas4LockidiiPdS0_S0_P7sumAccuP17curandStateXORWOWS0_i' for 'sm_20'
ptxas info : Function properties for _Z9CUDAvegas4LockidiiPdS0_S0_P7sumAccuP17curandStateXORWOWS0_i
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 22 registers, 116 bytes cmem[0], 51200 bytes cmem[2]
The number of threads is compatible with the multiprocessor limits: maximum shared memory, maximum registers per thread and per MP, and warps per MP.
If I launch 64 threads x 30 blocks (shared memory per block is 4128 bytes), everything is fine, but if I use more than 30 blocks I get this error:
cudaCheckError() failed at file.cu:508 : unspecified launch failure
========= Invalid __global__ read of size 8
========= at 0x000015d0 in CUDAvegas
========= by thread (0,0,0) in block (1,0,0)
========= Address 0x200ffb428 is out of bounds
I think it's a problem with allocating each thread's memory, but I don't understand what my limit is per MP and for the total number of blocks...
Can someone help me, or point me to the right topic?
PS: I know the kernel presented does nothing; it's just there so I can understand my limit problems.

I think the error you receive is self-explanatory. It points out an out-of-bounds global read of a datatype of size 8 (a double). The thread responsible for the out-of-bounds read is thread (0,0,0) in block (1,0,0). I suspect the responsible instruction is lxp.v[i][j] = shared[i*Nbins+j]; in the last nested for loop. Probably you allocate an amount of global memory that is not related to the number of blocks you launch, so that when you launch too many blocks you get such an error.
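If that is the cause, the fix is to derive every global allocation from the launch configuration. A minimal sketch (the kernel and names here are hypothetical stand-ins, not the actual CUDAvegas code):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void writeSlice(double *buf) // stand-in kernel: one slot per thread
{
    size_t tid = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    buf[tid] = (double)tid;
}

int main()
{
    const int blocks = 40, threads = 64;  // whatever you actually launch
    double *buf_d = NULL;
    // size the allocation from the launch configuration, not a fixed constant
    cudaMalloc((void **)&buf_d, (size_t)blocks * threads * sizeof(double));
    writeSlice<<<blocks, threads>>>(buf_d);
    printf("%s\n", cudaGetErrorString(cudaDeviceSynchronize()));
    cudaFree(buf_d);
    return 0;
}

This way, increasing the block count past 30 grows the buffer with it instead of running off its end.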

Related

malloc kernel panics instead of returning NULL

I'm attempting to do an exercise from "Expert C Programming" where the point is to see how much memory a program can allocate. It hinges on malloc returning NULL when it cannot allocate anymore.
#include <stdio.h>
#include <stdlib.h>

int main() {
    int totalMB = 0;
    int oneMeg = 1 << 20;
    while (malloc(oneMeg)) {
        ++totalMB;
    }
    printf("Allocated %d Mb total \n", totalMB);
    return 0;
}
Rather than printing the total, I get a kernel panic after allocating ~8GB on my 16GB MacBook Pro.
Kernel panic log:
Anonymous UUID: 0B87CC9D-2495-4639-EA18-6F1F8696029F
Tue Dec 13 23:09:12 2016
*** Panic Report ***
panic(cpu 0 caller 0xffffff800c51f5a4): "zalloc: zone map exhausted while allocating from zone VM map entries, likely due to memory leak in zone VM map entries (6178859600 total bytes, 77235745 elements allocated)"#/Library/Caches/com.apple.xbs/Sources/xnu/xnu-3248.50.21/osfmk/kern/zalloc.c:2628
Backtrace (CPU 0), Frame : Return Address
0xffffff91f89bb960 : 0xffffff800c4dab12
0xffffff91f89bb9e0 : 0xffffff800c51f5a4
0xffffff91f89bbb10 : 0xffffff800c5614e0
0xffffff91f89bbb30 : 0xffffff800c5550e2
0xffffff91f89bbba0 : 0xffffff800c554960
0xffffff91f89bbd90 : 0xffffff800c55f493
0xffffff91f89bbea0 : 0xffffff800c4d17cb
0xffffff91f89bbf10 : 0xffffff800c5b8dca
0xffffff91f89bbfb0 : 0xffffff800c5ecc86
BSD process name corresponding to current thread: a.out
Mac OS version:
15F34
I understand that this can easily be fixed by the doctor's cliché of "It hurts when you do that? Then don't do that", but I want to understand why malloc isn't working as expected.
OS X 10.11.5
For the definitive answer to that question, you can look at the source code, which you'll find here:
zalloc.c source in XNU
In that source file, find the function zalloc_internal(). This is the function that gives the kernel panic.
In that function you'll find a "for (;;) {" loop, which basically tries to allocate the memory you're requesting in the specified zone. If there isn't enough space, it immediately tries again. If that fails, it does a zone_gc() (garbage collect) to try to reclaim memory. If that also fails, it simply kernel panics - effectively halting the computer.
If you want to understand how zalloc.c works, look up zone-based memory allocators.
Your program is making the kernel run out of space in the zone called "VM map entries", which is a predefined zone allocated at boot. You could probably get the result you are expecting from your program, without a kernel panic, if you allocated more than 1 MB at a time.
In essence, it is not really a problem for the kernel to allocate you several gigabytes of memory. However, making thousands of smaller allocations that sum up to those gigabytes is much harder.
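To illustrate, here is a minimal variation of the original program, assuming the zone-exhaustion explanation above: allocating 1 GB at a time creates far fewer VM map entries per gigabyte, so the loop should get much further, or fail cleanly, rather than panicking.

#include <stdio.h>
#include <stdlib.h>

int main() {
    int totalGB = 0;
    size_t oneGig = (size_t)1 << 30; /* fewer, larger VM map entries */
    while (malloc(oneGig)) {
        ++totalGB;
    }
    printf("Allocated %d GB total\n", totalGB);
    return 0;
}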

Process memory limit and address space for UNIX and Linux and Windows

What is the maximum amount of memory for a single process in UNIX, Linux, and Windows? How do you calculate that? How much user address space and kernel address space is there for 4 GB of RAM?
How much user address space and kernel address space for 4 GB of RAM?
The address space of a process is divided into two parts,
User space: On the standard 32-bit x86 architecture, the maximum addressable memory is 4 GB, out of which addresses from 0x00000000 to 0xbfffffff (3 GB) are meant for code and data segments. This region can be addressed when the user process is executing in either user or kernel mode.
Kernel space: Similarly, addresses from 0xc0000000 to 0xffffffff (1 GB) are meant for the virtual address space of the kernel and can only be addressed when the process executes in kernel mode.
This particular address-space split on x86 is determined by the value of PAGE_OFFSET. Referring to Linux 3.11.1 (page_32_types.h and page_64_types.h), the page offset is defined as below:
#define __PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL)
where Kconfig defines a default value of 0xC0000000, with other address-split options available.
Similarly, for 64-bit:
#define __PAGE_OFFSET _AC(0xffff880000000000, UL).
On 64-bit architectures the 3G/1G split no longer holds due to the huge address space; as per the sources above, the latest Linux versions use the offset shown.
On my 64-bit x86_64 machine, a 32-bit process can have the entire 4 GB of user address space, and the kernel holds the address range above 4 GB. Interestingly, on modern 64-bit x86_64 CPUs not all address lines are enabled (the address bus is not large enough) to provide 2^64 = 16 exabytes of virtual address space; AMD64/x86_64 implementations typically enable only the lower 48 (or 42) bits, giving 2^48 = 256 TB (or 2^42 = 4 TB) of address space. This definitely improves performance with large amounts of RAM; at the same time, the question arises of how it is efficiently managed within the OS limitations.
In Linux there's a way to find out the limit of address space you can have, using the rlimit structure:
struct rlimit {
    rlim_t rlim_cur; // current (soft) limit
    rlim_t rlim_max; // ceiling for rlim_cur (hard limit)
};
rlim_t is an unsigned long type.
and you can have something like:
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

// Bytes to gigabytes
static inline unsigned long btogb(unsigned long bytes) {
    return bytes / (1024 * 1024 * 1024);
}

// Bytes to exabytes
static inline double btoeb(double bytes) {
    return bytes / (1024.00 * 1024.00 * 1024.00 * 1024.00 * 1024.00 * 1024.00);
}

int main() {
    printf("\n");
    struct rlimit rlim_addr_space;
    rlim_t addr_space;
    /*
     * Here we call getrlimit() with RLIMIT_AS (Address Space) and
     * a pointer to our instance of the rlimit struct.
     */
    int retval = getrlimit(RLIMIT_AS, &rlim_addr_space);
    // getrlimit() returns 0 if it succeeded; let's check that.
    if (!retval) {
        addr_space = rlim_addr_space.rlim_cur;
        fprintf(stdout, "Current address_space: %lu Bytes, or %lu GB, or %f EB\n",
                addr_space, btogb(addr_space), btoeb((double)addr_space));
    } else {
        fprintf(stderr, "Couldn't get address space current limit.");
        return 1;
    }
    return 0;
}
I ran this on my computer and... prrrrrrrrrrrrrrrrr tsk!
Output: Current address_space: 18446744073709551615 Bytes, or 17179869183 GB, or 16.000000 EB
I have 16 ExaBytes of max address space available on my Linux x86_64.
Here's getrlimit()'s definition;
it also lists the other constants you can pass to getrlimit() and introduces getrlimit()'s sister, setrlimit(). That is where the rlim_max member of rlimit becomes really important: you should always check that you don't exceed this value, so the kernel doesn't punch your face, drink your coffee, and steal your papers.
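For example, a minimal sketch (the 1 GB figure is purely illustrative) that lowers the soft address-space limit while respecting rlim_max:

#include <stdio.h>
#include <sys/resource.h>

int main() {
    struct rlimit rl;
    if (getrlimit(RLIMIT_AS, &rl) != 0)
        return 1;
    rlim_t wanted = (rlim_t)1 << 30; // request a 1 GB soft limit
    if (wanted > rl.rlim_max)        // never exceed the hard ceiling
        wanted = rl.rlim_max;
    rl.rlim_cur = wanted;
    if (setrlimit(RLIMIT_AS, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }
    return 0;
}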
PS: please excuse my sorry excuse for a drum roll ^_^
On Linux systems, see man ulimit
(UPDATED)
It says:
The ulimit builtin is used to set the resource usage limits of the
shell and any processes spawned by it. If a new limit value is
omitted, the current value of the limit of the resource is printed.
ulimit -a prints all the current values with their switch options; other switches, e.g. ulimit -n, print individual limits such as the maximum number of open files.
Unfortunately, "max memory size" says "unlimited", which means it is not limited by the system administrator.
You can view the memory size with
cat /proc/meminfo
which outputs something like:
MemTotal: 4048744 kB
MemFree: 465504 kB
Buffers: 316192 kB
Cached: 1306740 kB
SwapCached: 508 kB
Active: 1744884 kB
(...)
So, if ulimit says "unlimited", the MemFree is all yours. Almost.
Don't forget that malloc() (and the new operator, which calls malloc()) is a stdlib function, so if you call malloc(100) 10 times there will be a lot of "slack"; follow the link to learn why.
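To see that slack directly, here is a minimal sketch using malloc_usable_size(), which is a glibc extension (so this is illustrative, not portable):

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h> /* malloc_usable_size() (glibc-specific) */

int main() {
    /* each small allocation carries header and alignment overhead */
    for (size_t req = 1; req <= 1000; req *= 10) {
        void *p = malloc(req);
        printf("requested %4zu bytes, usable %4zu bytes\n",
               req, malloc_usable_size(p));
        free(p);
    }
    return 0;
}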

Get Memory Size With INT 12

I want to get the memory size in assembly with INT 12h, but when I call this interrupt it only gives 639. What does 639 mean? (I converted the integer to a string.)
Ex:
bits 16
org 0x0

start:
    int 12h             ; GET MEMORY SIZE INTO AX (KB)
    mov bx, ax          ; BX = AX
    call int_to_str     ; IN: BX (INT) - OUT: BX (STRING)
    mov si, bx          ; SI = BX
    call print_string   ; PRINT SI
    mov ax, 10h         ; KEY STROKE
    int 16h
    ret
This code gives only 639. I don't understand it yet. Please help. Thanks!
INT 12h only reports conventional memory, i.e. at most the first 640 KB. One 1 KB block at the top of that range is typically reserved (e.g. for the Extended BIOS Data Area), hence it returned 639. Getting the total available memory is a little bit tricky. For details see here.

Can I resize the max size of the stack when using lua_pushnumber

We have a problem in our project: we use Lua 5.1 as our scripting language. When we use lua_pushnumber to pass too much data from C++ to Lua in one function, the Lua stack seems to overflow and other parts of the C++ memory get overwritten, which crashes our system when the callback returns to C++. I want to know whether there are parameters that control the size of the Lua stack. I tried changing LUA_MINSTACK, which is defined in lua.h, but it doesn't seem to work. I also tried using lua_checkstack() to avoid pushing too many numbers onto the Lua stack, but that doesn't work either.
int getNineScreenEntity(lua_State* L)
{
    DWORD screenid = GET_LUA_VALUE(DWORD, 1);
    struct EntryCallback : public ScreenEntityCallBack
    {
        EntryCallback() { }
        bool exec(ScreenEntity* entity)
        {
            list.push_back(entity);
            return true;
        }
        std::vector<ScreenEntity*> list;
    };
    EntryCallback exec;
    Screen* screen = ScreenManager::getScreenByID(screenid);
    if (!screen)
        return 0;
    screen->execAllOfScreenEntity(exec);
    int size = 0;
    std::vector<ScreenEntity*>::iterator vit = exec.list.begin();
    for (; vit != exec.list.end(); ++vit)
    {
        lua_pushnumber(L, (*vit)->id);
        ++size;
    }
    return size;
}
It seems that when there are too many entities in one screen, our program crashes.
Maybe this will help (from the Lua 5.2 manual):
int lua_checkstack (lua_State *L, int extra);
"Ensures that there are at least 'extra' free stack slots in the stack. It returns false if it cannot fulfill the request, because it would cause the stack to be larger than a fixed maximum size (typically at least a few thousand elements) or because it cannot allocate memory for the new stack size. This function never shrinks the stack; if the stack is already larger than the new size, it is left unchanged."
Here is an example C function...
static int l_test1 (lua_State *L) {
    int i;
    printf("test1: on the way in"); stackDump(L);
    int cnt = lua_tointeger(L, 1);
    printf("push %d items onto stack\n", cnt);
    printf("try to grow stack: %d\n", lua_checkstack(L, cnt));
    for (i = 0; i < cnt; i++) {
        lua_pushinteger(L, i); /* loop -- push integer */
    }
    printf("test1: on the way out"); stackDump(L);
    return 1;
}
This code:
dumps the stack on the way into the function (1);
tries to expand the stack to have 'cnt' free slots (it returns either true, it worked, or false, it didn't);
pushes 'cnt' integers onto the stack;
dumps the stack on the way out.
$ lua demo.lua
running stack test with 10 pushes
test1: on the way in
---1--
[1] 10
-----
push 10 items onto stack
test1: on the way out
---11--
[11] 9
[10] 8
[9] 7
[8] 6
[7] 5
[6] 4
[5] 3
[4] 2
[3] 1
[2] 0
[1] 10
-----
running stack test with 1000 pushes
test1: on the way in
---1--
[1] 1000
-----
push 1000 items onto stack
try to grow stack: 1
test1: on the way out
---1001--
[1001] 999
[1000] 998
[999] 997
[998] 996
[997] 995
[996] 994
...
When the code above doesn't have the lua_checkstack() call, we get an error trying to push 1000 items onto the stack.
running stack test with 1000 pushes
test1: on the way in
---1--
[1] 1000
-----
push 1000 items onto stack
Segmentation fault (core dumped)
$
(1) stackDump() is similar to what appears in PiL 3rd ed. for dumping stack contents.
You should check the Lua stack every time before pushing onto it.
The stack size is LUA_MINSTACK (20) by default; it must be manually enlarged if more space is required.
Lua manual
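Applied to the function from the question, a minimal sketch of that advice (a fragment meant to replace the counting loop in getNineScreenEntity, reusing the question's types; luaL_error comes from lauxlib.h) would grow the stack once, before the push loop:

// fragment: reserve one Lua stack slot per entity before pushing
int size = (int)exec.list.size();
if (!lua_checkstack(L, size))
    return luaL_error(L, "too many screen entities: %d", size); // fail cleanly
for (std::vector<ScreenEntity*>::iterator vit = exec.list.begin();
     vit != exec.list.end(); ++vit)
    lua_pushnumber(L, (*vit)->id);
return size;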

Why is cudaMalloc giving me an error when I know there is sufficient memory space?

I have a Tesla C2070 that is supposed to have 5636554752 bytes of memory.
However, this gives me an error:
int *buf_d = NULL;
err = cudaMalloc((void **)&buf_d, 1000000000*sizeof(int));
if (err != cudaSuccess)
{
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return EXIT_ERROR;
}
How is this possible? Does this have something to do with the maximum memory pitch? Here are the GPU's specs:
Device 0: "Tesla C2070"
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 5636554752 bytes
Multiprocessors x Cores/MP = Cores: 14 (MP) x 32 (Cores/MP) = 448 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
As for the machine I'm running on, it has 24 Intel® Xeon® X565 processors and runs the Linux distribution Rocks 5.4 (Maverick).
Any ideas? Thanks!
The basic problem is in your question title - you don't actually know that you have sufficient memory, you are assuming you do. The runtime API includes the cudaMemGetInfo function, which will return how much free memory there is on the device. When a context is established on a device, the driver must reserve space for device code, local memory for each thread, FIFO buffers for printf support, stack for each thread, and heap for in-kernel malloc/new calls (see this answer for further details). All of this can consume rather a lot of memory, leaving you with much less than the maximum available memory (after ECC reservations) that you are assuming is available to your code. The API also includes cudaDeviceGetLimit, which you can use to query the amounts of memory that on-device runtime support is consuming. There is also a companion call, cudaDeviceSetLimit, which allows you to change the amount of memory each component of runtime support will reserve.
Even after you have tuned the runtime memory footprint to your tastes and have the actual free memory value from the driver, there are still page-size granularity and fragmentation considerations to contend with. Rarely is it possible to allocate every byte of what the API will report as free. Usually, I would do something like this when the objective is to try to allocate every available byte on the card:
const size_t Mb = 1<<20; // Assuming a 1MB page size here
size_t available, total;
cudaMemGetInfo(&available, &total);
int *buf_d = 0;
size_t nwords = total / sizeof(int);
size_t words_per_Mb = Mb / sizeof(int);
while (cudaMalloc((void**)&buf_d, nwords * sizeof(int)) == cudaErrorMemoryAllocation)
{
    nwords -= words_per_Mb;
    if (nwords < words_per_Mb)
    {
        // signal no free memory
        break;
    }
}
// leaves int buf_d[nwords] on the device or signals no free memory
(Note: this has never been near a compiler and is only safe on CUDA 3 or later.) It is implicitly assumed that none of the obvious sources of problems with big allocations apply here (a 32-bit host operating system, a WDDM Windows platform without TCC mode enabled, known issues in older drivers).
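As a companion, here is a minimal sketch (assuming a CUDA 4.0 or later runtime, where cudaDeviceGetLimit is available) that prints the free memory reported by the driver and the runtime reservations mentioned above:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);               // what the driver says is free
    printf("free: %zu of %zu bytes\n", free_b, total_b);

    size_t v = 0;
    cudaDeviceGetLimit(&v, cudaLimitStackSize);      // per-thread stack reservation
    printf("per-thread stack:   %zu bytes\n", v);
    cudaDeviceGetLimit(&v, cudaLimitPrintfFifoSize); // printf FIFO buffer
    printf("printf FIFO:        %zu bytes\n", v);
    cudaDeviceGetLimit(&v, cudaLimitMallocHeapSize); // in-kernel malloc/new heap
    printf("device malloc heap: %zu bytes\n", v);
    return 0;
}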
