Optimum buffer to memory ratio

I am trying to build a DAQ using Sparrow's Kmax. I have a ready-made template in which the total memory is 16 MB.
static final int evSize = 4;                          // Number of parameters per event of this type
static final int BUF_SIZE = evSize*1000;              // Buffer size  <-- why pick this value?
static final int LP_MEM_TOP = 0xFFFF00;               // Memory size, 16 MB
static final int READ_START = LP_MEM_TOP - BUF_SIZE;  // Start the read/write pointer one buffer before the end
In the above code you can see that the buffer is very small compared to the total memory. From what I know, the buffer is the temporary memory where data is stored before being sent to the computer.
In my case I am using a SCSI bus to transfer the data, and the system is really slow. What can I do with the buffer to increase the speed or the performance? Is there a particular reason to have such a small buffer? I am not sure I have understood what exactly the memory and the buffer do.
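For a sense of scale (assuming BUF_SIZE and LP_MEM_TOP are counted in the same units), the numbers in the template work out to:

LP_MEM_TOP / BUF_SIZE = 0xFFFF00 / 4000 ≈ 4194 buffers
READ_START = 0xFFFF00 - 4000 = 16772960 = 0xFFEF60

so the buffer covers only about 0.02% of the 16 MB memory.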
Any help is more than welcome!!!

Related

CUDA "out of memory" with plenty of memory in the VRAM [duplicate]

Seems like there are a lot of questions on here about moving double (or int, or float, etc.) 2D arrays from host to device. This is NOT my question.
I have already moved all of the data onto the GPU, and the __global__ kernel calls several __device__ functions.
In these device kernels, I have tried the following:
To allocate:
__device__ double** matrixCreate(int rows, int cols, double initialValue)
{
    // Note: none of the malloc return values are checked here.
    double** temp = (double**)malloc(rows * sizeof(double*));
    for (int j = 0; j < rows; j++)
    {
        temp[j] = (double*)malloc(cols * sizeof(double));
    }
    // Set initial values
    for (int i = 0; i < rows; i++)
    {
        for (int j = 0; j < cols; j++)
        {
            temp[i][j] = initialValue;
        }
    }
    return temp;
}
To deallocate:
__device__ void matrixDestroy(double** temp, int rows)
{
    for (int j = 0; j < rows; j++)
    {
        free(temp[j]);
    }
    free(temp);
}
For single-dimension arrays the __device__ mallocs work great, but I can't seem to keep it stable in the multidimensional case. By the way, the variables are sometimes used like this:
double** z=matrixCreate(2,2,0);
double* x=z[0];
However, care is always taken to ensure no calls to free are made on active data. The code is actually an adaptation of CPU-only code, so I know nothing funny is going on with the pointers or memory. Basically I'm just re-defining the allocators and throwing a __device__ on the serial portions. I just want to run the whole serial bit 10000 times, and the GPU seems like a good way to do it.
+++ UPDATE +++
Problem solved by Vyas. As per the CUDA specifications, the device heap size is initially set to 8 MB; if your mallocs exceed this, NSIGHT will not launch and the kernel crashes. Use the following in host code.
float increaseHeap=10;
cudaDeviceSetLimit(cudaLimitMallocHeapSize, size[0]*increaseHeap);
Worked for me!
The GPU-side malloc() is a suballocator from a limited heap. Depending on the number of allocations, it is possible the heap is being exhausted. You can change the size of the backing heap using cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t size). For more info see the CUDA programming guide.
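For illustration, a minimal self-contained sketch of that fix (scratchKernel, its launch configuration, and the 64 MB figure are invented for the example; cudaDeviceSetLimit and cudaDeviceGetLimit are the actual runtime calls):

#include <cstdio>

// Hypothetical kernel: each thread grabs scratch space from the device heap.
__global__ void scratchKernel(int n)
{
    double *scratch = (double *)malloc(n * sizeof(double));
    if (scratch == NULL)
    {
        printf("device malloc failed in thread %d\n", threadIdx.x);
        return;
    }
    for (int i = 0; i < n; ++i)
        scratch[i] = 0.0;
    free(scratch);
}

int main()
{
    // Raise the device heap from the 8 MB default to 64 MB. This must be
    // done before the heap is first used by a kernel.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);

    size_t heapSize = 0;
    cudaDeviceGetLimit(&heapSize, cudaLimitMallocHeapSize);
    printf("device heap is now %zu bytes\n", heapSize);

    scratchKernel<<<64, 256>>>(32);
    cudaDeviceSynchronize();
    return 0;
}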

OpenCL slow memory access in for loop

I have a program that I built in OpenCL, in which each kernel accesses a read-only buffer located in global memory. At some point each kernel needs to copy some data from global memory into a temporary buffer, so I wrote a for loop that copies a region of global memory byte by byte into the temporary buffer. I execute the kernel with the clEnqueueNDRangeKernel command, which sits inside a while loop. To measure how fast the clEnqueueNDRangeKernel command is, I added a counter called ups (updates per second) which is incremented at the end of each while-loop iteration. Once every second I print the value of the counter and reset it to zero.
I noticed that my program was running slowly, at about 53 ups. After some investigation I found out that the problem was the memory copying loop that was described above. This is the code:
typedef uchar byte;

byte tempBuffer[128];
byte* destPtr = (byte*)&tempBuffer[0];
__global const byte* srcPtr = (__global const byte*)globalMemPtr;

for (size_t k = 0; k < regionSize; ++k)
{
    destPtr[k] = srcPtr[k];
}
The variable globalMemPtr is a pointer to the region of global memory that needs to be copied into the temporary buffer, and tempBuffer is the temporary buffer. The variable regionSize holds the size of the region to be copied in bytes; in this case its value is 12.
What I noticed was that if I replace regionSize with the literal 12, the kernel runs much faster, at about 90 ups. My assumption is that the OpenCL compiler can optimize the for loop to copy memory much faster when 12 is used, but it can't when regionSize is used.
Does anyone know what is happening? Can anyone help me?
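One way to test that assumption (a sketch only; REGION_SIZE is a name invented here, and program/device are assumed to have been created in the usual way from this kernel source) is to inject the size as a compile-time constant through clBuildProgram's options string, so the compiler sees a literal just like the hard-coded 12:

#include <CL/cl.h>

// Kernel source with the copy size left as a preprocessor constant.
const char *kernelSource =
    "typedef uchar byte;                                          \n"
    "__kernel void copyRegion(__global const byte *globalMemPtr,  \n"
    "                         __global byte *out)                 \n"
    "{                                                            \n"
    "    byte tempBuffer[128];                                    \n"
    "    // REGION_SIZE is a literal here, so the compiler can    \n"
    "    // unroll the loop just as it did with the constant 12.  \n"
    "    for (size_t k = 0; k < REGION_SIZE; ++k)                 \n"
    "        tempBuffer[k] = globalMemPtr[k];                     \n"
    "    for (size_t k = 0; k < REGION_SIZE; ++k)                 \n"
    "        out[k] = tempBuffer[k];                              \n"
    "}                                                            \n";

// The -D option defines REGION_SIZE for the kernel at build time.
cl_int err = clBuildProgram(program, 1, &device, "-DREGION_SIZE=12", NULL, NULL);
if (err != CL_SUCCESS)
{
    /* inspect the build log with clGetProgramBuildInfo */
}

If regionSize must remain a runtime value, one workaround is to rebuild the program whenever the size changes.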

OS memory allocation addresses

Quick curious question: are memory allocation addresses chosen by the language compiler, or is it the OS that chooses the addresses for the requested memory?
This comes from a doubt about virtual memory, which can be quickly summarized as "let the process think it owns all the memory". But what happens on 64-bit architectures, where only 48 bits are used for memory addresses, if the process wants a higher address?
Let's say you do int *a = malloc(sizeof(int)); and there is no memory left from the previous system call, so you need to ask the OS for more memory. Is the compiler the one that determines the memory address at which to allocate this variable, or does it just ask the OS for memory, which then allocates it at an address of its choosing?
It would not be the compiler, especially since this is dynamic memory allocation. Compilation is done well before you actually execute your program.
Memory reservation for static variables happens at compile time, but the static memory allocation happens at start-up, before the user-defined main.
Static variables can be given space in the executable file itself; this would then be memory-mapped into the process address space. This is one of the few times(?) I can imagine the compiler actually "deciding" on an address.
During dynamic memory allocation your program would ask the OS for some memory and it is the OS that returns a memory address. This address is then stored in a pointer for example.
Dynamic memory allocation in C/C++ is simply done by runtime library functions. Those functions can do pretty much as they please as long as their behavior is standards-compliant. A trivial implementation of a compliant but useless malloc() looks like this:
#include <stddef.h>

void * malloc(size_t size) {
    return NULL;
}
The requirements are fairly relaxed: the pointer has to be suitably aligned, and the pointers must be unique unless they have previously been free()d. You could have a rather silly, somewhat portable, and absolutely not thread-safe memory allocator written the way shown below. There, the addresses come from a pool that was decided upon by the compiler.
#include "stdint.h"
// 1/4 of available address space, but at most 2^30.
#define HEAPSIZE (1UL << ( ((sizeof(void*)>4) ? 4 : sizeof(void*)) * 2 ))
// A pseudo-portable alignment size for pointerĊšbwitary types. Breaks
// when faced with SIMD data types.
#define ALIGNMENT (sizeof(intptr_t) > sizeof(double) ? sizeof(intptr_t) : siE 1Azeof(double))
void * malloc(size_t size)
{
static char buffer[HEAPSIZE];
static char * next = NULL;
void * result;
if (next == NULL) {
uintptr_t ptr = (uintptr_t)buffer;
ptr += ptr % ALIGNMENT;
next = (char*)ptr;
}
if (size == 0) return NULL;
if (next-buffer > HEAPSIZE-size) return NULL;
result = next;
next += size;
next += size % ALIGNMENT;
return result;
}
void free(void * ptr)
{}
Practical memory allocators don't depend upon such static memory pools, but rather call the OS to provide them with newly mapped memory.
The proper way of thinking about it is: you don't know what particular pointer you are going to get from malloc(). You can only know that it's unique and points to properly aligned memory if you've called malloc() with a non-zero argument. That's all.
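Tying both points together, here is a minimal sketch (POSIX-only; MAP_ANONYMOUS is the Linux/BSD spelling of the anonymous-mapping flag) of how a practical allocator obtains newly mapped memory from the OS, with the OS choosing the address:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t size = 1 << 20;  // ask the OS for 1 MiB of fresh pages

    // Passing NULL as the first argument means "let the kernel choose the
    // address"; neither the compiler nor the program decides where the
    // mapping lands.
    void *block = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (block == MAP_FAILED)
    {
        perror("mmap");
        return 1;
    }
    printf("the OS chose address %p\n", block);

    munmap(block, size);
    return 0;
}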

Large allocation - anything like contiguous virtual memory?

In my program I create a large (~10 million element) list of objects, where each object is about 500 bytes. Currently the allocation is like this:
const int N = 10000000;
object_type ** list = malloc( N * sizeof * list );
for (int i=0; i < N; i++)
list[i] = malloc( sizeof * list[i]);
This works OK, but I have discovered that with the high number of small allocations a significant part of the run time goes to the malloc() and subsequent free() calls. I am therefore about to change the implementation to allocate larger chunks. The simplest approach for me would be to allocate everything as one large chunk.
Now I know there is at least one level of virtualization between the user-space memory model and actual physical memory, but is there still a risk that I will run into problems getting such a large "contiguous" block of memory?
Contiguous virtual does not imply contiguous physical. If your process can allocate N pages individually, it will also be able to allocate them all in one call (and it is actually better from many points of view to do it in one call). On old 32-bit architectures the limited size of the virtual address space was a problem, but on 64-bit it is no longer an issue. Besides, even on 32-bit, if you could allocate 10MM elements individually, you should be able to allocate the same 10MM in one single call.
That being said, you probably need to carefully revisit your design and reconsider why you need to allocate 10MM elements in memory.
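For reference, a minimal sketch of the single-chunk layout described in the question (object_type here is a dummy 500-byte struct standing in for the real element type):

#include <stdlib.h>

// Dummy element type standing in for the question's object_type.
typedef struct { char payload[500]; } object_type;

int main(void)
{
    const int N = 10000000;

    // One allocation for the pointer table and one ~5 GB contiguous chunk
    // for all N elements (so this needs a 64-bit build).
    object_type **list = malloc(N * sizeof *list);
    object_type *pool = malloc((size_t)N * sizeof *pool);
    if (list == NULL || pool == NULL)
        return 1;

    // Point each list entry into the pool; no per-element malloc or free.
    for (int i = 0; i < N; i++)
        list[i] = &pool[i];

    // ... use list as before ...

    free(pool);  // releases every element at once
    free(list);
    return 0;
}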

Binding texture memory to a GPU allocated matrix

I created a floating-point matrix on the GPU of size (p7P_NXSTATES)x(p7P_NXTRANS) like so:
// Special transitions
// Host pointer to an array of device pointers
float **tmp_xsc = (float**)malloc(p7P_NXSTATES * sizeof(float*));

// For every alphabet in the scoring profile...
for (i = 0; i < p7P_NXSTATES; i++)
{
    // Allocate device memory for every alphabet letter in the protein sequence
    cudaMalloc((void**)&(tmp_xsc[i]), p7P_NXTRANS * sizeof(float));
    // Copy over the arrays
    cudaMemcpy(tmp_xsc[i], gm.xsc[i], p7P_NXTRANS * sizeof(float), cudaMemcpyHostToDevice);
}
// Copy device pointers to array of device pointers on GPU (matrix)
float **dev_xsc;
cudaMalloc((void***)&dev_xsc, p7P_NXSTATES * sizeof(float*));
cudaMemcpy(dev_xsc, tmp_xsc, p7P_NXSTATES * sizeof(float*), cudaMemcpyHostToDevice);
This memory, once copied over to the GPU, is never changed and is only read from, so I've decided to bind it to texture memory. The problem is that when working with 2D texture memory, the memory being bound is really just an array that uses offsets to function as a matrix.
I'm aware I need to use cudaBindTexture2D() and cudaCreateChannelDesc() to bind this 2D memory in order to access it as such
tex2D(texXSC,x,y)
-- but I'm just not sure how. Any ideas?
The short answer is that you cannot bind arrays of pointers to textures. You can either create a CUDA array and copy data to it from linear source memory, or use pitched linear memory directly bound to a texture. But an array of pointers will not work.
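For what it's worth, a sketch of the pitched-linear-memory route (texXSC and readKernel are names invented for the example, and rows/cols stand in for p7P_NXSTATES/p7P_NXTRANS):

#include <cstdio>

// Texture reference at file scope, as the legacy texture API requires.
texture<float, 2, cudaReadModeElementType> texXSC;

__global__ void readKernel(float *out, int cols)
{
    int x = threadIdx.x;  // column index
    int y = threadIdx.y;  // row index
    out[y * cols + x] = tex2D(texXSC, x, y);
}

int main()
{
    const int rows = 5;   // p7P_NXSTATES in the question
    const int cols = 4;   // p7P_NXTRANS in the question

    // One pitched block instead of an array of row pointers.
    float *devPtr;
    size_t pitch;
    cudaMallocPitch((void**)&devPtr, &pitch, cols * sizeof(float), rows);

    // Copy the host matrix row by row into the pitched allocation
    // (fill host[][] with the real xsc scores).
    float host[rows][cols] = {0};
    cudaMemcpy2D(devPtr, pitch, host, cols * sizeof(float),
                 cols * sizeof(float), rows, cudaMemcpyHostToDevice);

    // Bind the pitched linear memory, not pointers, to the 2D texture.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaBindTexture2D(NULL, texXSC, devPtr, desc, cols, rows, pitch);

    float *devOut;
    cudaMalloc((void**)&devOut, rows * cols * sizeof(float));
    readKernel<<<1, dim3(cols, rows)>>>(devOut, cols);
    cudaDeviceSynchronize();

    cudaUnbindTexture(texXSC);
    cudaFree(devPtr);
    cudaFree(devOut);
    return 0;
}

(Later CUDA releases deprecate texture references in favor of cudaTextureObject_t, but the binding idea is the same: the source must be a CUDA array or pitched linear memory.)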
