Fast streaming to framebuffer

What is the best method to write a stream/array of raw RGB pixels (in the format required by Xorg) to a window of fixed size? No synchronisation is required, nor are there any timing requirements; the goal is just minimal average CPU usage.
Is it necessary to copy the data to a memory location managed by Xorg, or can one just pass a pointer?
If it is necessary to perform a full copy of the data, is there also a portable way (Linux only, but portable between Intel, Nvidia and AMD) of doing this with GPU hardware acceleration?
How to write existing data to the screen in a fast way?
typedef struct {
    unsigned char r;
    unsigned char g;
    unsigned char b;
} pixel_t; // other format/padding also possible, if required

typedef struct {
    pixel_t data[WIDTH * HEIGHT];
} frame_t;

frame_t *frame = get_next_frame_from_stream(); // <= already optimized
set_as_xorg_framebuffer(frame);                // <= I'm searching for this
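One widely used approach on Linux is the MIT-SHM extension: the X server reads pixels directly out of a shared memory segment, so each frame costs one XShmPutImage request instead of pushing the whole frame through the socket. The following is only a sketch, assuming a 24-bit TrueColor visual and that dpy, win and gc come from the usual Xlib setup; note the server generally wants 32-bit BGRX pixels rather than packed 24-bit RGB, so a conversion step into img->data is still needed.

#include <X11/Xlib.h>
#include <X11/extensions/XShm.h>
#include <sys/ipc.h>
#include <sys/shm.h>

XShmSegmentInfo shminfo;
XImage *img = XShmCreateImage(dpy, DefaultVisual(dpy, DefaultScreen(dpy)),
                              24, ZPixmap, NULL, &shminfo, WIDTH, HEIGHT);

/* Back the XImage with a System V shared memory segment. */
shminfo.shmid = shmget(IPC_PRIVATE, img->bytes_per_line * img->height,
                       IPC_CREAT | 0600);
shminfo.shmaddr = img->data = shmat(shminfo.shmid, NULL, 0);
shminfo.readOnly = True;
XShmAttach(dpy, &shminfo);

/* Per frame: convert/copy the RGB data into img->data, then: */
XShmPutImage(dpy, win, gc, img, 0, 0, 0, 0, WIDTH, HEIGHT, False);
XSync(dpy, False);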


CUDA "out of memory" with plenty of memory in the VRAM [duplicate]

Seems like there are a lot of questions on here about moving double (or int, or float, etc.) 2D arrays from host to device. This is NOT my question.
I have already moved all of the data onto the GPU, and the __global__ kernel calls several __device__ functions.
In these device functions, I have tried the following:
To allocate:
__device__ double **matrixCreate(int rows, int cols, double initialValue)
{
    double **temp = (double **)malloc(rows * sizeof(double *));
    for (int j = 0; j < rows; j++) {
        temp[j] = (double *)malloc(cols * sizeof(double));
    }
    // Set initial values
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            temp[i][j] = initialValue;
        }
    }
    return temp;
}
To deallocate:
__device__ void matrixDestroy(double **temp, int rows)
{
    for (int j = 0; j < rows; j++) { free(temp[j]); }
    free(temp);
}
For single-dimension arrays the __device__ mallocs work great, but I can't seem to keep it stable in the multidimensional case. By the way, the variables are sometimes used like this:
double **z = matrixCreate(2, 2, 0);
double *x = z[0];
However, care is always taken to ensure no calls to free are made on active data. The code is actually an adaptation of CPU-only code, so I know nothing funny is going on with the pointers or memory. Basically I'm just redefining the allocators and throwing a __device__ on the serial portions. I just want to run the whole serial bit 10000 times, and the GPU seems like a good way to do it.
UPDATE

Problem solved by Vyas. Per the CUDA specification, the device heap size is initially set to 8 MB; if your mallocs exceed this, the kernel crashes and NSIGHT will not launch. Use the following in host code:
float increaseHeap = 10;
cudaDeviceSetLimit(cudaLimitMallocHeapSize, size[0] * increaseHeap);
Worked for me!
The GPU-side malloc() is a suballocator drawing from a limited heap. Depending on the number of allocations, it is possible the heap is being exhausted. You can change the size of the backing heap using cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t size). For more information, see the CUDA programming guide.
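A minimal host-side sketch of that fix (myKernel, grid and block are placeholder names); the key detail is that the limit must be raised before launching any kernel that uses the device-side malloc():

size_t heapBytes = 0;
cudaDeviceGetLimit(&heapBytes, cudaLimitMallocHeapSize);         // default: 8 MB
cudaDeviceSetLimit(cudaLimitMallocHeapSize, (size_t)128 << 20);  // e.g. raise to 128 MB
myKernel<<<grid, block>>>(/* ... */);                            // now safe to launch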

How could I vectorize this for loop?

I have this loop
void f1(unsigned char *data, unsigned int size) {
    unsigned int A[256] = {0u};
    for (register unsigned int i = 0u; i < size; i++) {
        ++A[data[i]];
    }
    ...
Is there any way to vectorize it manually?
Since multiple entries in data might contain the same value, I don't see how this could be vectorized simply, since there can be race conditions. The point of vectorization is that each element is independent of the others and so can be computed in parallel, but your algorithm doesn't allow that. "Vectorize" is not the same thing as "make go faster."
What you seem to be building here is a histogram, and iOS has built-in, optimized support for that. You can create a single-channel, single-row image and use vImageHistogramCalculation_Planar8 like this:
#include <Accelerate/Accelerate.h>

void f1(unsigned char *data, unsigned int size) {
    unsigned long A[256] = {0u};
    vImage_Buffer src = { data, 1, size, size };
    vImage_Error err = vImageHistogramCalculation_Planar8(&src, A, kvImageDoNotTile);
    if (err != kvImageNoError) {
        // handle error
    }
    ...
}
Be careful about assuming this is always a win, though. It depends on the size of your data. Making a function call is very expensive, so it can take several million bytes of data to make it worth it. If you're computing this on smaller sets than that, then a simple, compiler-optimized loop is often the best approach. You need to profile this on real devices to see which is faster for your purposes.
Just make sure to allow the compiler to apply all vectorizing optimizations by turning on -Ofast (Fastest, Aggressive). That won't matter in this case, because your loop can't be simply vectorized. But in general, -Ofast allows the compiler to apply vectorizing optimizations even in cases where they might slightly grow code size (which isn't allowed under the default -Os). -Ofast also allows a little sloppiness in how floating-point math is performed, so it should not be used in cases where strict IEEE floating-point conformance is required (but this is almost never the case for iOS apps, so -Ofast is almost always the correct setting).
The optimisation the compiler would attempt to do here is to parallelize ++A[data[i]]. It cannot do so because the contents of A depend on the previous iteration of the loop.
You could break this dependency by using one frequency array (A) per way of parallelism, and then computing the sum of these at the end. I assume here you've got two ways of parallelism and that the size is even.
void f1(const unsigned char * const data, unsigned int size) {
    unsigned int A0[256] = {0u};
    unsigned int A1[256] = {0u};
    for (unsigned int i = 0u; i < size / 2u; i++) {
        ++A0[data[2 * i]];
        ++A1[data[2 * i + 1]];
    }
    for (unsigned i = 0u; i < 256; ++i) {
        A0[i] = A0[i] + A1[i];
    }
}
Does this win you much? There's only one way to find out: try it and measure the results. I suspect that the Accelerate framework will do much better than this, even for relatively small values of size. It's also optimised at run time for the target architecture.
Compilers are pretty smart, but there are things you can do in C or C++ to help the compiler:
Apply const wherever possible: it's then obvious which data is invariant.
Identify pointers to non-overlapping memory regions with the restrict (__restrict in C++) qualifier; a small sketch follows this list. Without knowing this, the compiler must assume a write through one pointer potentially alters data that could be read through another. clang will in fact generate run-time checks and code paths for both the overlapping and non-overlapping cases, but there are limits to this, and you can probably reduce code size by being explicit.
I doubt the register qualifier for i makes any difference.
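A small illustration of the restrict point (the function and its names are made up for this sketch): promising the compiler that the two buffers never overlap lets it vectorize without emitting run-time overlap checks.

void scale(float *restrict out, const float *restrict in, unsigned n) {
    /* With restrict, clang/gcc can use SIMD loads and stores directly. */
    for (unsigned i = 0u; i < n; i++) {
        out[i] = 2.0f * in[i];
    }
}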

OS memory allocation addresses

Quick curious question: are memory allocation addresses chosen by the language compiler, or is it the OS that chooses the addresses for the requested memory?
This comes from a doubt about virtual memory, which can be quickly explained as "let the process think it owns all the memory", but what happens on 64-bit architectures, where only 48 bits are used for memory addresses, if the process wants a higher address?
Let's say you do int *a = malloc(sizeof(int)); and there is no memory left from the previous system call, so you need to ask the OS for more. Is the compiler the one that determines the memory address at which to allocate this variable, or does it just ask the OS for memory and allocate it at whatever address is returned?
It would not be the compiler, especially since this is dynamic memory allocation. Compilation is done well before you actually execute your program.
Memory reservation for static variables happens at compile time, but the static memory allocation itself happens at start-up, before the user-defined main.
Static variables can be given space in the executable file itself, which is then memory-mapped into the process address space. This is perhaps one of the few times I can imagine the compiler actually "deciding" on an address.
During dynamic memory allocation, your program asks the OS for some memory, and it is the OS that returns a memory address. This address is then stored in a pointer, for example.
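A tiny illustration of the two cases above (the names are made up, and the printed addresses will differ per run and per system): the static variable's location comes from mapping the executable image at load time, while the dynamic address is produced at run time by the allocator/OS.

#include <stdio.h>
#include <stdlib.h>

int global_var; /* space reserved in the executable, mapped at load time */

int main(void) {
    int *dyn = malloc(sizeof *dyn); /* address chosen at run time */
    printf("static: %p, dynamic: %p\n", (void *)&global_var, (void *)dyn);
    free(dyn);
    return 0;
}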
Dynamic memory allocation in C/C++ is simply done by runtime library functions. Those functions can do pretty much as they please, as long as their behavior is standards-compliant. A trivial, compliant but useless implementation of malloc() looks like this:
#include <stddef.h>

void * malloc(size_t size) {
    return NULL;
}
The requirements are fairly relaxed: the pointer has to be suitably aligned, and the pointers must be unique unless they've been previously free()d. You could have a rather silly, somewhat portable, and absolutely not thread-safe memory allocator done the way below. There, the addresses come from a pool that was decided upon by the compiler.
#include "stdint.h"
// 1/4 of available address space, but at most 2^30.
#define HEAPSIZE (1UL << ( ((sizeof(void*)>4) ? 4 : sizeof(void*)) * 2 ))
// A pseudo-portable alignment size for pointerĊšbwitary types. Breaks
// when faced with SIMD data types.
#define ALIGNMENT (sizeof(intptr_t) > sizeof(double) ? sizeof(intptr_t) : siE 1Azeof(double))
void * malloc(size_t size)
{
static char buffer[HEAPSIZE];
static char * next = NULL;
void * result;
if (next == NULL) {
uintptr_t ptr = (uintptr_t)buffer;
ptr += ptr % ALIGNMENT;
next = (char*)ptr;
}
if (size == 0) return NULL;
if (next-buffer > HEAPSIZE-size) return NULL;
result = next;
next += size;
next += size % ALIGNMENT;
return result;
}
void free(void * ptr)
{}
Practical memory allocators don't depend upon such static memory pools, but rather call the OS to provide them with newly mapped memory.
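For instance, here is a minimal sketch of how an allocator obtains fresh memory from the Linux kernel; the address is picked by the OS, not by the compiler:

#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    /* Ask the kernel to map 1 MiB of anonymous memory anywhere it likes. */
    void *block = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (block == MAP_FAILED) { perror("mmap"); return 1; }
    printf("the kernel chose %p\n", block);
    munmap(block, 1 << 20);
    return 0;
}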
The proper way of thinking about it is: you don't know what particular pointer you are going to get from malloc(). You can only know that it's unique and points to properly aligned memory if you've called malloc() with a non-zero argument. That's all.

Binding texture memory to a GPU allocated matrix

I created a floating-point matrix on the GPU of size (p7P_NXSTATES)x(p7P_NXTRANS) like so:
// Special Transitions
// Host pointer to array of device pointers
float **tmp_xsc = (float **)malloc(p7P_NXSTATES * sizeof(float *));

// For every alphabet in scoring profile...
for (i = 0; i < p7P_NXSTATES; i++)
{
    // Allocate device memory for every alphabet letter in protein sequence
    cudaMalloc((void **)&(tmp_xsc[i]), p7P_NXTRANS * sizeof(float));
    // Copy over arrays
    cudaMemcpy(tmp_xsc[i], gm.xsc[i], p7P_NXTRANS * sizeof(float), cudaMemcpyHostToDevice);
}

// Copy device pointers to array of device pointers on GPU (matrix)
float **dev_xsc;
cudaMalloc((void **)&dev_xsc, p7P_NXSTATES * sizeof(float *));
cudaMemcpy(dev_xsc, tmp_xsc, p7P_NXSTATES * sizeof(float *), cudaMemcpyHostToDevice);
This memory, once copied over to the GPU, is never changed and is only read from, so I've decided to bind it to texture memory. The problem is that when working with 2D texture memory, the memory being bound to it is really just a linear array that uses offsets to function as a matrix.
I'm aware I need to use cudaBindTexture2D() and cudaCreateChannelDesc() to bind this 2D memory in order to access it as tex2D(texXSC, x, y), but I'm just not sure how. Any ideas?
The short answer is that you cannot bind arrays of pointers to textures. You can either create a CUDA array and copy data to it from linear source memory, or use pitched linear memory directly bound to a texture. But an array of pointers will not work.
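A sketch of the second option (pitched linear memory), reusing the names from the question where possible; dev_xsc_pitched replaces the array-of-pointers dev_xsc with one contiguous allocation. This assumes the legacy texture-reference API that cudaBindTexture2D() belongs to, and error checking is omitted.

// Texture reference at file scope, as the question's texXSC implies.
texture<float, 2, cudaReadModeElementType> texXSC;

float *dev_xsc_pitched;
size_t pitch;
cudaMallocPitch((void **)&dev_xsc_pitched, &pitch,
                p7P_NXTRANS * sizeof(float), p7P_NXSTATES);

// gm.xsc[i] are the separate host rows from the question; copy them
// one by one into the pitched rows of the single allocation.
for (int i = 0; i < p7P_NXSTATES; i++)
    cudaMemcpy((char *)dev_xsc_pitched + i * pitch, gm.xsc[i],
               p7P_NXTRANS * sizeof(float), cudaMemcpyHostToDevice);

cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaBindTexture2D(NULL, texXSC, dev_xsc_pitched, desc,
                  p7P_NXTRANS, p7P_NXSTATES, pitch);

// In the kernel, element (row y, column x) is then read as:
//     float v = tex2D(texXSC, x, y);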

Question about cl_mem in OpenCL

I have been using cl_mem in some of my OpenCL boilerplate code, but I have been using it from context rather than from a sharp understanding of what exactly it is. I have been using it as a type for the memory I push onto and off of the board, which has so far been floats. I tried looking at the OpenCL docs, but cl_mem doesn't show up (does it?). Is there any documentation on it, or is it simple enough that someone can explain it?
The cl_mem type is a handle to a "Memory Object" (as described in Section 3.5 of the OpenCL 1.1 Spec). These are essentially inputs and outputs for OpenCL kernels, and they are returned by OpenCL API calls in host code, such as clCreateBuffer:
cl_mem clCreateBuffer(cl_context context,
                      cl_mem_flags flags,
                      size_t size,
                      void *host_ptr,
                      cl_int *errcode_ret)
The memory areas represented can be permitted different access patterns (e.g. read-only) or be allocated in different memory regions, depending on the flags set in the create-buffer call.
The handle is typically stored to allow a later call to release the memory, e.g.:
cl_int clReleaseMemObject (cl_mem memobj)
In short, it provides an abstraction over where the memory actually is: you can copy data into the associated memory or back out via the OpenCL APIs clEnqueueWriteBuffer and clEnqueueReadBuffer, but the OpenCL implementation can allocate the space where it wants.
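A minimal sketch of that lifecycle (context and queue setup omitted; ctx and queue are assumed to exist):

#include <CL/cl.h>

cl_int err;
float host_data[1024] = {0};

/* Create a read-only buffer; cl_mem is just the handle we get back. */
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY,
                            sizeof(host_data), NULL, &err);

/* Copy host data into whatever memory the implementation chose. */
err = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0,
                           sizeof(host_data), host_data, 0, NULL, NULL);

/* ... set it as a kernel argument, run the kernel, read results back ... */

clReleaseMemObject(buf); /* release via the stored handle */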
To the computer, a cl_mem is just a number (like a file handle on Linux) that is reserved for use as a "memory identifier": the API/driver stores information about your memory under this number, so it knows what the buffer holds, how big it is, and so on.
