Binding texture memory to a GPU allocated matrix

I created a floating-point matrix on the GPU of size (p7P_NXSTATES)x(p7P_NXTRANS) like so:
// Special Transitions
// Host pointer to array of device pointers
float **tmp_xsc = (float**)(malloc(p7P_NXSTATES * sizeof(float*)));
// For every row of the matrix (one per special state)...
for(i = 0; i < p7P_NXSTATES; i++)
{
    // Allocate device memory for this row of transition scores
    cudaMalloc((void**)&(tmp_xsc[i]), p7P_NXTRANS * sizeof(float));
    // Copy the host row over to the device
    cudaMemcpy(tmp_xsc[i], gm.xsc[i], p7P_NXTRANS * sizeof(float), cudaMemcpyHostToDevice);
}
// Copy device pointers to array of device pointers on GPU (matrix)
float **dev_xsc;
cudaMalloc((void**)&dev_xsc, p7P_NXSTATES * sizeof(float*));
cudaMemcpy(dev_xsc, tmp_xsc, p7P_NXSTATES * sizeof(float*), cudaMemcpyHostToDevice);
This memory, once copied over to the GPU, is never changed and is only read from, so I've decided to bind it to texture memory. The problem is that the memory bound to a 2D texture is really just a single array that uses offsets to function as a matrix.
I'm aware I need to use cudaBindTexture2D() and cudaCreateChannelDesc() to bind this 2D memory in order to access it as such, but I'm just not sure how. Any ideas?

The short answer is that you cannot bind arrays of pointers to textures. You can either create a CUDA array and copy data to it from linear source memory, or use pitched linear memory directly bound to a texture. But an array of pointers will not work.
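As a rough illustration of the "pitched linear memory" layout the answer refers to, here is a host-only C++ sketch that flattens an array of row pointers into one contiguous buffer. The helper names flattenRows and elementAt are hypothetical, and the pitch is counted in floats for simplicity; in real CUDA code the pitch is a byte count chosen by cudaMallocPitch, and none of the CUDA calls are shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copies `rows` separately allocated rows of `cols`
// floats into one contiguous buffer whose rows start every `pitch` floats.
std::vector<float> flattenRows(const float* const* src, std::size_t rows,
                               std::size_t cols, std::size_t pitch)
{
    assert(pitch >= cols);  // padding may follow each row, rows never overlap
    std::vector<float> flat(rows * pitch, 0.0f);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            flat[r * pitch + c] = src[r][c];
    return flat;
}

// Reads element (r, c) from the flattened matrix using the row pitch.
float elementAt(const std::vector<float>& flat, std::size_t pitch,
                std::size_t r, std::size_t c)
{
    return flat[r * pitch + c];
}
```

A buffer laid out this way is a single allocation addressed by offsets, which is the shape of data that a 2D copy or a 2D texture binding expects, unlike the pointer-to-pointer matrix in the question.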


CUDA "out of memory" with plenty of memory in the VRAM [duplicate]

There seem to be a lot of questions on here about moving double (or int, or float, etc.) 2D arrays from host to device. This is NOT my question.
I have already moved all of the data onto the GPU, and the __global__ kernel calls several __device__ functions.
In these device functions, I have tried the following:
To allocate:
__device__ double** matrixCreate(int rows, int cols, double initialValue) {
    // Allocate the array of row pointers, then each row
    double** temp = (double**)malloc(rows * sizeof(double*));
    for (int j = 0; j < rows; j++) { temp[j] = (double*)malloc(cols * sizeof(double)); }
    // Set initial values
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            temp[i][j] = initialValue;
    return temp;
}
To deallocate:
__device__ void matrixDestroy(double** temp, int rows) {
    for (int j = 0; j < rows; j++) { free(temp[j]); }
    free(temp);  // also release the array of row pointers
}
For single-dimension arrays the __device__ mallocs work great, but I can't seem to keep it stable in the multidimensional case. By the way, the variables are sometimes used like this:
double** z=matrixCreate(2,2,0);
double* x=z[0];
However, care is always taken to ensure that free is never called on data still in use. The code is actually an adaptation of CPU-only code, so I know nothing funny is going on with the pointers or memory. Basically I'm just redefining the allocators and putting __device__ on the serial portions. I just want to run the whole serial bit 10000 times, and the GPU seems like a good way to do it.
UPDATE
Problem solved by Vyas. Per the CUDA specification, the device heap is initially set to 8 MB; if your device-side mallocs exceed this, Nsight will not launch and the kernel crashes. Use the following in the host code, before launching the kernel:
float increaseHeap=10;
cudaDeviceSetLimit(cudaLimitMallocHeapSize, size[0]*increaseHeap);
Worked for me!
The GPU-side malloc() is a suballocator that draws from a limited heap. Depending on the number and size of the allocations, the heap may be exhausted. You can change the size of the backing heap using cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t size). For more information, see the CUDA programming guide.
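To get a feel for when the default 8 MB heap runs out, here is a small host-only C++ estimate. It is only a sketch: the function names are made up for illustration, it assumes every thread holds one matrix built the way matrixCreate does, and it ignores allocator overhead and alignment, so the real requirement is higher than what it reports.

```cpp
#include <cassert>
#include <cstddef>

// Bytes requested by one matrixCreate(rows, cols, ...) call:
// one array of row pointers plus `rows` arrays of `cols` doubles.
std::size_t matrixBytes(std::size_t rows, std::size_t cols)
{
    return rows * sizeof(double*) + rows * cols * sizeof(double);
}

// Lower bound on heap usage if `threads` threads each hold one such matrix
// at the same time (allocator bookkeeping is not counted).
std::size_t heapBytesNeeded(std::size_t threads, std::size_t rows, std::size_t cols)
{
    return threads * matrixBytes(rows, cols);
}

// True if the estimate fits in a heap of `heapBytes` bytes.
bool fitsInHeap(std::size_t needed, std::size_t heapBytes)
{
    return needed <= heapBytes;
}
```

A value somewhat above the estimate is what one would pass as the size argument to cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...).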

Running compute kernel on portion of MTLBuffer?

I am populating an MTLBuffer with float2 vectors. The buffer is being created and populated like this:
struct Particle {
    var position: float2
}
let particleCount = 100000
let bufferSize = MemoryLayout<Particle>.stride * particleCount
particleBuffer = device.makeBuffer(length: bufferSize)!
var pointer = particleBuffer.contents().bindMemory(to: Particle.self, capacity: particleCount)
pointer = pointer.advanced(by: currentParticles)
pointer.pointee.position = [x, y]
In my Metal file the buffer is being accessed like this:
struct Particle {
    float2 position;
};
kernel void compute(device Particle *particles [[buffer(0)]],
uint id [[thread_position_in_grid]] … )
I need to be able to run the compute kernel on a given range of the MTLBuffer. For example, is it possible to run it starting from the 50,000th value and ending at the 75,000th value?
It seems like the offset parameter allows me to specify the start position, but it does not have a length parameter.
I see there is this call:
Does the range specify what portion of the buffer to run? It seems like the range specifies what buffers are used and not the range of values to use.
A compute shader doesn't run "on" a buffer (or portion). It runs on a grid, which is an abstract concept. As far as Metal is concerned, the grid isn't related to a buffer or anything else.
A buffer may be an input that your compute shader uses, but how it uses it is up to you. Metal doesn't know or care.
Here's my answer to a similar question.
So, the dispatch command you encode using a compute command encoder governs how many times your shader is invoked. It also dictates what thread_position_in_grid (and related) values each invocation receives. If your shader correlates each grid position to an element of an array backed by a buffer, then the number of threads you specify in your dispatch command governs how much of the buffer you end up accessing. (Again, this is not something Metal dictates; it's implicit in the way you code your shader.)
Now, to start at the 50,000th element, using an offset on the buffer binding to make that the effective start of the buffer from the point of view of the shader is a good approach. But it would also work to just add 50,000 to the index the shader computes when it accesses the buffer. And, if you only want to process 25,000 elements (75,000 minus 50,000), then just dispatch 25,000 threads.
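The relationship between the grid and the buffer can be mimicked on the CPU. The following C++ sketch is not Metal code: computeOne stands in for the shader body, its id parameter plays the role of thread_position_in_grid, and dispatchRange is a hypothetical stand-in for binding the buffer at an offset and dispatching a given number of threads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Particle { float x; float y; };

// Stand-in for the shader body: `id` plays the role of thread_position_in_grid.
void computeOne(Particle* particles, std::size_t id)
{
    particles[id].x += 1.0f;
}

// Stand-in for a dispatch of `threadCount` threads with the buffer bound
// so that element `first` appears as element 0 to the shader.
void dispatchRange(std::vector<Particle>& buffer, std::size_t first,
                   std::size_t threadCount)
{
    assert(first + threadCount <= buffer.size());
    Particle* base = buffer.data() + first;  // effect of the binding offset
    for (std::size_t id = 0; id < threadCount; ++id)
        computeOne(base, id);
}
```

Calling dispatchRange(buffer, 50000, 25000) touches elements 50,000 through 74,999 only. Note that in Metal the buffer binding offset is given in bytes, so the element index would have to be multiplied by the Particle stride.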

OpenCL slow memory access in for loop

I have a program that I built in OpenCL, in which each kernel accesses a read-only buffer located in global memory. At some point each kernel needs to copy some data from global memory into a temporary buffer. I wrote a for loop that copies a region of global memory byte by byte into the temporary buffer. I execute the kernel using the clEnqueueNDRangeKernel command, which is located inside a while loop. To measure how fast the clEnqueueNDRangeKernel command is, I added a counter called ups (Updates Per Second), which is incremented at the end of each iteration of the while loop. Once per second I print the value of the counter and reset it to zero.
I noticed that my program was running slowly, at about 53 ups. After some investigation I found out that the problem was the memory copying loop that was described above. This is the code:
typedef uchar byte;
byte tempBuffer[128];
byte* destPtr = (byte*)&tempBuffer[0];
__global const byte* srcPtr = (__global const byte*)globalMemPtr;
for(size_t k = 0; k < regionSize; ++k)
    destPtr[k] = srcPtr[k];
The variable globalMemPtr is a pointer to the region of global memory that needs to be copied into the temporary buffer, and tempBuffer is the temporary buffer. The variable regionSize holds the size of the region to be copied, in bytes. In this case its value is 12.
What I noticed was that if I replace regionSize with 12, the kernel runs much faster, at about 90 ups. My assumption is that the OpenCL compiler can optimize the for loop to copy memory much faster when 12 is used, but it can't when regionSize is used.
Does anyone know what is happening? Can anyone help me?

EmguCV - Mat.Data array is always null after image loading

When I read an image from a file, the Mat.Data array is always null after loading. But when I inspect the Mat object in the debugger, there is a byte array that holds all the data from the image.
Mat image1 = CvInvoke.Imread("minion.bmp", Emgu.CV.CvEnum.LoadImageType.AnyDepth);
Do you have any idea why?
I recognize this question is super old, but I hit the same issue and I suspect the answer lies in the Emgu wiki. Specifically:
Accessing the pixels from Mat
Unlike the Image<,> class, where memory are pre-allocated and fixed, the memory of Mat can be automatically re-allocated by Open CV function calls. We cannot pre-allocate managed memory and assume the same memory are used through the life time of the Mat object. As a result, Mat class do not contains a Data property like the Image<,> class, where the pixels can be access through a managed array. To access the data of the Mat, there are a few possible choices.
The easy way and safe way that cost an additional memory copy
The first option is to copy the Mat to an Image<,> object using the Mat.ToImage function. e.g.
Image<Bgr, Byte> img = mat.ToImage<Bgr, Byte>();
The pixel data can then be accessed using the Image<,>.Data property.
You can also convert the Mat to an Matrix<> object. Assuming the Mat contains 8-bit data,
Matrix<Byte> matrix = new Matrix<Byte>(mat.Rows, mat.Cols, mat.NumberOfChannels);
mat.CopyTo(matrix);
Note that you should create Matrix<> with a matching type to the Mat object. If the Mat contains 32-bit floating point value, you should replace Matrix<Byte> in the above code with Matrix<float>. The pixel data can then be accessed using the Matrix<>.Data property.
The fastest way with no memory copy required. Be caution!!!
The second option is a little bit tricky, but will provide the best performance. This will usually require you to know the size of the Mat object before it is created. So you can allocate managed data array, and create the Mat object by forcing it to use the pinned managed memory. e.g.
//load your 3 channel bgr image here
Mat m1 = ...;
//3 channel bgr image data, if it is single channel, the size should be m1.Width * m1.Height
byte[] data = new byte[m1.Width * m1.Height * 3];
GCHandle handle = GCHandle.Alloc(data, GCHandleType.Pinned);
using (Mat m2 = new Mat(m1.Size, DepthType.Cv8U, 3, handle.AddrOfPinnedObject(), m1.Width * 3))
    CvInvoke.BitwiseNot(m1, m2);
handle.Free(); // release the pinned handle once m2 has been disposed
At this point the data array contains the pixel data of the inverted image. Note that if the Mat m2 was allocated with the wrong size, data[] array will contains all 0s, and no exception will be thrown. So be really careful when performing the above operations.
TL;DR: You can't use the Data object in the way you're hoping to (as of version 3.2 at least). You must copy it to another object which allows use of the Data object.

HLSL: Index to unaligned/packed floats

I have a vertex shader (2.0) doing some instancing - each vertex specifies an index into an array.
If I have an array like this:
float instanceData[100];
The compiler allocates it 100 constant registers. Each constant register is a float4, so it's allocating 4 times as much space as is needed.
I need a way to make it allocate just 25 constant registers and store four values in each of them.
Ideally I'd like a method where it still looks like a float[] on both the CPU and GPU (right now I am calling EffectParameter.SetValue(Single[]); I'm using XNA). But manually packing and unpacking a float4[] is an option, too.
Also: what are the performance implications for doing this? Is it actually worth it? (For me, this will save about one batch in every four or five).
Does this help?
float4 packedInstanceData[25];
float data = packedInstanceData[index / 4][index % 4];
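For the CPU side of the manual packing option, here is a minimal C++ sketch of the same index arithmetic. Float4, packFloats and unpackFloat are illustrative names, not part of XNA or HLSL; the packed array would still have to be uploaded through whatever float4-array setter the effect API offers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Float4 = std::array<float, 4>;

// Packs a flat float array into groups of four, zero-padding the last group.
std::vector<Float4> packFloats(const std::vector<float>& values)
{
    std::vector<Float4> packed((values.size() + 3) / 4);  // zero-initialised
    for (std::size_t i = 0; i < values.size(); ++i)
        packed[i / 4][i % 4] = values[i];
    return packed;
}

// Mirrors the shader-side lookup packedInstanceData[index / 4][index % 4].
float unpackFloat(const std::vector<Float4>& packed, std::size_t index)
{
    return packed[index / 4][index % 4];
}
```

With 100 input values, packFloats produces 25 groups, matching the 25 constant registers the question asks for.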
