cuda: involuntary memory changes during kernels [closed]

I'm a beginner CUDA programmer, trying to build an application similar to the NVIDIA particle-system example (many balls in a cube).
I have a kernel launcher function as below:
void Ccuda::sort_Particles_And_Find_Cell_Start (int *Cell_Start,     // output
                                                int *Cell_End,       // output
                                                float3 *Sorted_Pos,  // output
                                                float3 *Sorted_Vel,  // output
                                                int *Particle_Cell,  // input
                                                int *Particle_Index, // input
                                                float3 *Old_Pos,
                                                float3 *Old_Vel,
                                                int Num_Particles,
                                                int Num_Cells)
{
    int numThreads, numBlocks;
    /*Cell_Start = (int*) cudaAlloc (Num_Cells, sizeof(int));
    Cell_End   = (int*) cudaAlloc (Num_Cells, sizeof(int));
    Sorted_Pos = (float3*) cudaAlloc (Num_Particles, sizeof(int));
    Sorted_Vel = (float3*) cudaAlloc (Num_Particles, sizeof(int));*/

    int *h_p_cell = (int *) malloc (Num_Particles * sizeof (int));
    cudaMemcpy (h_p_cell, Particle_Cell, Num_Particles * sizeof(int), cudaMemcpyDeviceToHost);
    free (h_p_cell);

    computeGridSize(Num_Particles, 512, numBlocks, numThreads);
    sort_Particles_And_Find_Cell_StartD<<<numBlocks, numThreads>>>(Cell_Start, Cell_End, Sorted_Pos, Sorted_Vel, Particle_Cell, Particle_Index, Old_Pos, Old_Vel, Num_Particles);

    h_p_cell = (int *) malloc (Num_Particles * sizeof (int));
    cudaMemcpy (h_p_cell, Particle_Cell, Num_Particles * sizeof(int), cudaMemcpyDeviceToHost);
    free (h_p_cell);
}
And this is the global kernel function:
__global__ void sort_Particles_And_Find_Cell_StartD(int *Cell_Start,     // output
                                                    int *Cell_End,       // output
                                                    float3 *Sorted_Pos,  // output
                                                    float3 *Sorted_Vel,  // output
                                                    int *Particle_Cell,  // input
                                                    int *Particle_Index, // input
                                                    float3 *Old_Pos,
                                                    float3 *Old_Vel,
                                                    int Num_Particles)
{
    int hash;
    extern __shared__ int Shared_Hash[]; // blockSize + 1 elements
    int index = blockIdx.x*blockDim.x + threadIdx.x;

    if (index < Num_Particles)
    {
        hash = Particle_Cell[index];
        Shared_Hash[threadIdx.x+1] = hash;
        if (index > 0 && threadIdx.x == 0)
        {
            // first thread in block loads the previous particle's hash
            Shared_Hash[0] = Particle_Cell[index-1];
        }
    }
    __syncthreads();

    if (index < Num_Particles)
    {
        // If this particle has a different cell index to the previous
        // particle then it must be the first particle in the cell,
        // so store the index of this particle in the cell.
        // As it isn't the first particle, it must also be the cell end of
        // the previous particle's cell
        if (index == 0 || hash != Shared_Hash[threadIdx.x]) // first thread in the grid, or this particle's cell index differs from that of the previous neighboring thread
        {
            Cell_Start[hash] = index;
            if (index > 0)
                Cell_End[Shared_Hash[threadIdx.x]] = index;
        }
        if (index == Num_Particles - 1)
        {
            Cell_End[hash] = index + 1;
        }

        // Now use the sorted index to reorder the pos and vel data
        int Sorted_Index = Particle_Index[index];
        //float3 pos = FETCH(Old_Pos, Sorted_Index); // macro does either global read or texture fetch
        //float3 vel = FETCH(Old_Vel, Sorted_Index); // see particles_kernel.cuh
        float3 pos = Old_Pos[Sorted_Index];
        float3 vel = Old_Vel[Sorted_Index];
        Sorted_Pos[index] = pos;
        Sorted_Vel[index] = vel;
    }
}
During execution I get the debug error message R6010, saying that abort() has been called.
As you may see in the launcher function (the first one), I use int *h_p_cell to inspect the contents of Particle_Cell before and after the kernel execution, and it seems the contents have changed, although there is no assignment to Particle_Cell inside the kernel.
The Particle_Cell memory is allocated on the device and filled with cudaMemcpy during the program's init().
I have been trying to solve this issue for a few days, without success.
Can anyone help?

Your kernel is expecting dynamically allocated shared memory:
extern __shared__ int Shared_Hash[]; // blockSize + 1 elements
But you aren't allocating any in your kernel invocation:
sort_Particles_And_Find_Cell_StartD<<<numBlocks, numThreads>>>(Cell_Start,Cell_End, Sorted_Pos, Sorted_Vel, Particle_Cell, Particle_Index, Old_Pos, Old_Vel, Num_Particles);
                                                           ^
                                                           |
                                                           missing shared memory size parameter
You should provide a shared memory amount in your launch configuration. You probably want something like this:
sort_Particles_And_Find_Cell_StartD<<<numBlocks, numThreads, ((numThreads+1)*sizeof(int))>>>(Cell_Start,Cell_End, Sorted_Pos, Sorted_Vel, Particle_Cell, Particle_Index, Old_Pos, Old_Vel, Num_Particles);
This error will cause your kernel to abort when it tries to access shared memory.
You should also do cuda error checking on all cuda API calls and kernel calls. I don't see any evidence of that in your code.
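For example, a minimal checking macro might look like this (just a sketch; the CUDA_CHECK name and the exact wrapping are mine, not part of your code):

#include <cstdio>
#include <cstdlib>

// Hypothetical helper (not from your code): wrap every CUDA runtime call
// and abort with a readable message if the call returns an error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMemcpy(h_p_cell, Particle_Cell, Num_Particles*sizeof(int), cudaMemcpyDeviceToHost));
// sort_Particles_And_Find_Cell_StartD<<<numBlocks, numThreads, (numThreads+1)*sizeof(int)>>>(/* arguments as above */);
// CUDA_CHECK(cudaGetLastError());        // catches launch-configuration errors
// CUDA_CHECK(cudaDeviceSynchronize());   // catches errors raised during kernel execution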
Once you have all the API errors sorted out, run your code with cuda-memcheck. The unexpected writes to Particle_Cell may be due to out-of-bounds accesses from your kernel, which will become evident with cuda-memcheck.

Related

How to make an operation similar to _mm_extract_epi8 with non-immediate input?

What I want is to extract a value from a vector using a variable (non-immediate) scalar index.
Like _mm_extract_epi8 / _mm256_extract_epi8 but with non-immediate input.
(The vector holds several candidate results; the one at the given index turns out to be the true result, and the rest are discarded.)
In particular, if index is in a GPR, the easiest way is probably to store val to memory and then movzx it into another GPR. A sample implementation in C:
uint8_t extract_epu8var(__m256i val, int index) {
    union {
        __m256i m256;
        uint8_t array[32];
    } tmp;
    tmp.m256 = val;
    return tmp.array[index];
}
Godbolt translation (note that a lot of overhead happens for stack alignment -- if you don't have an aligned temporary storage area, you could just use vmovdqu instead of vmovdqa): https://godbolt.org/z/Gj6Eadq9r
So far the best option seems to be using _mm_shuffle_epi8 for SSE:
uint8_t extract_epu8var(__m128i val, int index) {
    return (uint8_t)_mm_cvtsi128_si32(
        _mm_shuffle_epi8(val, _mm_cvtsi32_si128(index)));
}
Unfortunately this does not scale well for AVX. vpshufb does not shuffle across lanes. There is a cross-lane shuffle, _mm256_permutevar8x32_epi32, but the result seems to be complicated:
uint8_t extract_epu8var(__m256i val, int index) {
    int index_low = index & 0x3;
    int index_high = (index >> 2);
    return (uint8_t)(_mm256_cvtsi256_si32(_mm256_permutevar8x32_epi32(
        val, _mm256_zextsi128_si256(_mm_cvtsi32_si128(index_high))))
        >> (index_low << 3));
}
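For what it's worth, here is a small self-contained usage sketch of the store-and-reload version above (my own test harness, not part of the answer; compile with AVX enabled, e.g. -mavx):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

// Store-and-reload variant from above.
uint8_t extract_epu8var(__m256i val, int index) {
    union {
        __m256i m256;
        uint8_t array[32];
    } tmp;
    tmp.m256 = val;
    return tmp.array[index];
}

int main(void) {
    // Fill a vector with the bytes 0..31, so element i should read back as i.
    uint8_t bytes[32];
    for (int i = 0; i < 32; ++i) bytes[i] = (uint8_t)i;
    __m256i v = _mm256_loadu_si256((const __m256i *)bytes);

    int index = 17;                                  // runtime (non-immediate) index
    printf("%d\n", (int)extract_epu8var(v, index));  // expected: 17
    return 0;
}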

Is it possible to use the GPU just for calculations without copying the variables to the GPU? Can we use RAM or even the hard disk for variable storage? [duplicate]

I tried the code in this link: Is CUDA pinned memory zero-copy?
The person who asked claims the program worked fine for him, but it does not work the same way for me: the values do not change when I manipulate them in the kernel.
Basically my problem is that my GPU memory is not enough, but I want to do calculations which require more memory. I want my program to use RAM (host memory) and still be able to use CUDA for the calculations. The program in the link seemed to solve my problem, but the code does not give the output shown by that poster.
Any help or any working example of zero-copy memory would be useful.
Thank you
__global__ void testPinnedMemory(double * mem)
{
    double currentValue = mem[threadIdx.x];
    printf("Thread id: %d, memory content: %f\n", threadIdx.x, currentValue);
    mem[threadIdx.x] = currentValue + 10;
}

void test()
{
    const size_t THREADS = 8;
    double * pinnedHostPtr;
    cudaHostAlloc((void **)&pinnedHostPtr, THREADS, cudaHostAllocDefault);

    //set memory values
    for (size_t i = 0; i < THREADS; ++i)
        pinnedHostPtr[i] = i;

    //call kernel
    dim3 threadsPerBlock(THREADS);
    dim3 numBlocks(1);
    testPinnedMemory<<< numBlocks, threadsPerBlock>>>(pinnedHostPtr);

    //read output
    printf("Data after kernel execution: ");
    for (int i = 0; i < THREADS; ++i)
        printf("%f ", pinnedHostPtr[i]);
    printf("\n");
}
First of all, to allocate ZeroCopy memory, you have to specify cudaHostAllocMapped flag as an argument to cudaHostAlloc.
cudaHostAlloc((void **)&pinnedHostPtr, THREADS * sizeof(double), cudaHostAllocMapped);
Still, pinnedHostPtr can be used to access the mapped memory from the host side only. To access the same memory from the device, you have to get the device-side pointer to the memory like this:
double* dPtr;
cudaHostGetDevicePointer(&dPtr, pinnedHostPtr, 0);
Pass this pointer as the kernel argument:
testPinnedMemory<<< numBlocks, threadsPerBlock>>>(dPtr);
Also, you have to synchronize the kernel execution with the host to read the updated values. Just add cudaDeviceSynchronize after the kernel call.
The code in the linked question is working, because the person who asked the question is running the code on a 64 bit OS with a GPU of Compute Capability 2.0 and TCC enabled. This configuration automatically enables the Unified Virtual Addressing feature of the GPU in which the device sees host + device memory as a single large memory instead of separate ones and host pointers allocated using cudaHostAlloc can be passed directly to the kernel.
In your case, the final code will look like this:
#include <cstdio>

__global__ void testPinnedMemory(double * mem)
{
    double currentValue = mem[threadIdx.x];
    printf("Thread id: %d, memory content: %f\n", threadIdx.x, currentValue);
    mem[threadIdx.x] = currentValue + 10;
}

int main()
{
    const size_t THREADS = 8;
    double * pinnedHostPtr;
    cudaHostAlloc((void **)&pinnedHostPtr, THREADS * sizeof(double), cudaHostAllocMapped);

    //set memory values
    for (size_t i = 0; i < THREADS; ++i)
        pinnedHostPtr[i] = i;

    double* dPtr;
    cudaHostGetDevicePointer(&dPtr, pinnedHostPtr, 0);

    //call kernel
    dim3 threadsPerBlock(THREADS);
    dim3 numBlocks(1);
    testPinnedMemory<<< numBlocks, threadsPerBlock>>>(dPtr);
    cudaDeviceSynchronize();

    //read output
    printf("Data after kernel execution: ");
    for (int i = 0; i < THREADS; ++i)
        printf("%f ", pinnedHostPtr[i]);
    printf("\n");
    return 0;
}
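One additional note (my addition, not part of the original answer): on some configurations, mapped (zero-copy) host memory also has to be enabled before the CUDA context is created. If the code above still does not print the updated values, it may be worth calling cudaSetDeviceFlags before any other CUDA call:

cudaSetDeviceFlags(cudaDeviceMapHost);  // must run before any other CUDA API call creates the context
cudaHostAlloc((void **)&pinnedHostPtr, THREADS * sizeof(double), cudaHostAllocMapped);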

ID3D11DeviceContext::DrawIndexed() Failed

My program is a DirectX program that draws a container cube with smaller cubes inside it; these smaller cubes fall over time. I hope you understand what I mean.
The program isn't complete yet. It should draw the container only, but it draws nothing; only the background color is visible. I only included what I think is needed.
These are the routines that initialize the program:
bool Game::init(HINSTANCE hinst, HWND _hw){
    Directx11::init(hinst, _hw);
    return LoadContent();
}
Directx11::init()
bool Directx11::init(HINSTANCE hinst, HWND hw){
    _hinst = hinst;
    _hwnd = hw;
    RECT rc;
    GetClientRect(_hwnd, &rc);
    height = rc.bottom - rc.top;
    width  = rc.right - rc.left;
    UINT flags = 0;
#ifdef _DEBUG
    flags |= D3D11_CREATE_DEVICE_DEBUG;
#endif
    HR(D3D11CreateDevice(0, _driverType, 0, flags, 0, 0, D3D11_SDK_VERSION, &d3dDevice, &_featureLevel, &d3dDeviceContext));
    if (d3dDevice == 0 || d3dDeviceContext == 0)
        return 0;

    DXGI_SWAP_CHAIN_DESC sdesc;
    ZeroMemory(&sdesc, sizeof(DXGI_SWAP_CHAIN_DESC));
    sdesc.Windowed = true;
    sdesc.BufferCount = 1;
    sdesc.BufferDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
    sdesc.BufferDesc.Height = height;
    sdesc.BufferDesc.Width = width;
    sdesc.BufferDesc.Scaling = DXGI_MODE_SCALING_UNSPECIFIED;
    sdesc.BufferDesc.ScanlineOrdering = DXGI_MODE_SCANLINE_ORDER_UNSPECIFIED;
    sdesc.OutputWindow = _hwnd;
    sdesc.BufferDesc.RefreshRate.Denominator = 1;
    sdesc.BufferDesc.RefreshRate.Numerator = 60;
    sdesc.Flags = 0;
    sdesc.BufferUsage = DXGI_USAGE_RENDER_TARGET_OUTPUT;
    if (m4xMsaaEnable)
    {
        sdesc.SampleDesc.Count = 4;
        sdesc.SampleDesc.Quality = m4xMsaaQuality - 1;
    }
    else
    {
        sdesc.SampleDesc.Count = 1;
        sdesc.SampleDesc.Quality = 0;
    }

    IDXGIDevice *Device = 0;
    HR(d3dDevice->QueryInterface(__uuidof(IDXGIDevice), reinterpret_cast<void**>(&Device)));
    IDXGIAdapter *Ad = 0;
    HR(Device->GetParent(__uuidof(IDXGIAdapter), reinterpret_cast<void**>(&Ad)));
    IDXGIFactory *fac = 0;
    HR(Ad->GetParent(__uuidof(IDXGIFactory), reinterpret_cast<void**>(&fac)));
    fac->CreateSwapChain(d3dDevice, &sdesc, &swapchain);
    ReleaseCOM(Device);
    ReleaseCOM(Ad);
    ReleaseCOM(fac);

    ID3D11Texture2D *back = 0;
    HR(swapchain->GetBuffer(0, __uuidof(ID3D11Texture2D), reinterpret_cast<void**>(&back)));
    HR(d3dDevice->CreateRenderTargetView(back, 0, &RenderTarget));

    D3D11_TEXTURE2D_DESC Tdesc;
    ZeroMemory(&Tdesc, sizeof(D3D11_TEXTURE2D_DESC));
    Tdesc.BindFlags = D3D11_BIND_DEPTH_STENCIL;
    Tdesc.ArraySize = 1;
    Tdesc.Format = DXGI_FORMAT_D24_UNORM_S8_UINT;
    Tdesc.Height = height;
    Tdesc.Width = width;
    Tdesc.Usage = D3D11_USAGE_DEFAULT;
    Tdesc.MipLevels = 1;
    if (m4xMsaaEnable)
    {
        Tdesc.SampleDesc.Count = 4;
        Tdesc.SampleDesc.Quality = m4xMsaaQuality - 1;
    }
    else
    {
        Tdesc.SampleDesc.Count = 1;
        Tdesc.SampleDesc.Quality = 0;
    }
    HR(d3dDevice->CreateTexture2D(&Tdesc, 0, &depthview));
    HR(d3dDevice->CreateDepthStencilView(depthview, 0, &depth));
    d3dDeviceContext->OMSetRenderTargets(1, &RenderTarget, depth);

    D3D11_VIEWPORT vp;
    vp.TopLeftX = 0.0f;
    vp.TopLeftY = 0.0f;
    vp.Width  = static_cast<float>(width);
    vp.Height = static_cast<float>(height);
    vp.MinDepth = 0.0f;
    vp.MaxDepth = 1.0f;
    d3dDeviceContext->RSSetViewports(1, &vp);
    return true;
}
SetBuild() prepares the matrices inside the container for the smaller cubes; I didn't program it to draw the smaller cubes yet.
And this is the function that draws the scene:
void Game::Render(){
    d3dDeviceContext->ClearRenderTargetView(RenderTarget, reinterpret_cast<const float*>(&Colors::LightSteelBlue));
    d3dDeviceContext->ClearDepthStencilView(depth, D3D11_CLEAR_DEPTH | D3D11_CLEAR_STENCIL, 1.0f, 0);
    d3dDeviceContext->IASetInputLayout(_layout);
    d3dDeviceContext->IASetPrimitiveTopology(D3D10_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
    d3dDeviceContext->IASetIndexBuffer(indices, DXGI_FORMAT_R32_UINT, 0);
    UINT strides = sizeof(Vertex), off = 0;
    d3dDeviceContext->IASetVertexBuffers(0, 1, &vertices, &strides, &off);

    D3DX11_TECHNIQUE_DESC des;
    Tech->GetDesc(&des);
    Floor *Lookup; /* used to look up inside the matrices structure (Floor contains XMMATRIX Pieces[9]) */
    std::vector<XMFLOAT4X4> filled; // saves the matrices of the smaller cubes
    XMMATRIX V = XMLoadFloat4x4(&View), P = XMLoadFloat4x4(&Proj);
    XMMATRIX vp = V * P;
    XMMATRIX wvp;

    for (UINT i = 0; i < des.Passes; i++)
    {
        d3dDeviceContext->RSSetState(BuildRast);
        wvp = XMLoadFloat4x4(&(B.Memory[0].Pieces[0])) * vp; // loading the matrix at translation (0,0,0)
        HR(ShadeMat->SetMatrix(reinterpret_cast<float*>(&wvp)));
        HR(Tech->GetPassByIndex(i)->Apply(0, d3dDeviceContext));
        d3dDeviceContext->DrawIndexed(build_ind_count, build_ind_index, build_vers_index);

        d3dDeviceContext->RSSetState(PieseRast);
        UINT r1 = B.GetSize(), r2 = filled.size();
        for (UINT j = 0; j < r1; j++)
        {
            Lookup = &B.Memory[j];
            for (UINT r = 0; r < Lookup->filledindeces.size(); r++)
            {
                filled.push_back(Lookup->Pieces[Lookup->filledindeces[r]]);
            }
        }
        for (UINT j = 0; j < r2; j++)
        {
            ShadeMat->SetMatrix(reinterpret_cast<const float*>(&filled[i]));
            Tech->GetPassByIndex(i)->Apply(0, d3dDeviceContext);
            d3dDeviceContext->DrawIndexed(piese_ind_count, piese_ind_index, piese_vers_index);
        }
    }
    HR(swapchain->Present(0, 0));
}
Thanks in advance.
One bug in your program appears to be that you're using i, the index of the current pass, as an index into the filled vector, when you should apparently be using j.
Another apparent bug is that in the loop where you are supposed to be iterating over the elements of filled, you're not iterating over all of them. The value r2 is set to the size of filled before you append anything to it during that pass. During the first pass this means that nothing will be drawn by this loop. If your technique only has one pass then this means that the second DrawIndexed call in your code will never be executed.
It also appears you should be only adding matrices to filled once, regardless of the number of the passes the technique has. You should consider if your code is actually meant to work with techniques with multiple passes.
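Putting those two fixes together (and leaving aside the separate issue of re-appending the matrices on every pass), the second inner loop could look roughly like this; it is a sketch reusing the names from your code, untested:

// Take the size of 'filled' AFTER the matrices have been appended,
// and index 'filled' with the inner counter j rather than the pass index i.
UINT r2 = filled.size();
for (UINT j = 0; j < r2; j++)
{
    ShadeMat->SetMatrix(reinterpret_cast<const float*>(&filled[j]));
    Tech->GetPassByIndex(i)->Apply(0, d3dDeviceContext);
    d3dDeviceContext->DrawIndexed(piese_ind_count, piese_ind_index, piese_vers_index);
}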

OpenCV - Display percentage instead of line [closed]

This is an application that can recognize the emotions happy/sad/anger/neutral/disgust/surprise.
A line grows based on the current emotion that it recognizes.
The link below is an image of my current work, but I want to display a percentage instead of the line, and I am not sure how.
http://postimg.org/image/j6hhe4dy7/
Anyone who can help me with this, I would really appreciate it.
Here is the relevant part of the code:
if(showTrackerGui) {
    CvScalar expColor = cvScalar(0,254,0);
    cvFlip(img, NULL, 1);
    int start = 12, step = 15, current;
    current = start;
    for(int i = 0; i < N_EXPRESSIONS; i++, current += step)
    {
        // this line displays the emotion label
        cvPutText(img, EXP_NAMES[i], cvPoint(5, current), &font, expColor);
    }
    expressions = get_class_weights(features);
    current = start - 3;
    for(int i = 0; i < N_EXPRESSIONS; i++, current += step)
    {
        // this draws the green line, but I want to display a percentage instead
        cvLine(img, cvPoint(80, current), cvPoint((int)(80 + expressions.at<double>(0,i)*50), current), expColor, 2);
    }
    current += step + step;
    if(showFeatures == 1)
    {
        for(int i = 0; i < N_FEATURES; i++)
        {
            current += step;
            char buf[4];
            sprintf(buf, "%.2f", features.at<float>(0,i));
            cvPutText(img, buf, cvPoint(5, current), &font, expColor);
        }
    }
}
}
Well, you can look at the line that prints the emotion titles, which shows you exactly the method you want to call:
cvPutText(img, EXP_NAMES[i], cvPoint(5, current), &font, expColor);
And the line you want to change is:
cvLine(img, cvPoint(80, current), cvPoint((int)(80+expressions.at<double>(0,i)*50), current), expColor, 2);
So try replacing the cvLine() call with the following:
// Assuming expressions.at<double>(0, i) returns a value between 0.0 and
// 1.0, convert it to 0.0 to 100.0.
double percentage = expressions.at<double>(0, i) * 100.0;
// Format the string buffer to hold "{percentage} %" (e.g., "50 %").
char buf[6]; // 3 bytes for the percentage, 1 for the space, 1 for the "%", 1 for the null byte.
sprintf(buf, "%3.f %%", percentage);
// Display percentage.
cvPutText(img, buf, cvPoint(80, current), &font, expColor);

Print cv::Mat opencv

I am trying to print a cv::Mat which contains my image. However, whenever I print the Mat using cout, a 2D array is printed into my text file. I want to print only one pixel per line. How can I print the pixels of a cv::Mat line by line?
Here is a generic for_each loop; you could use it to print your data:
/**
 * @brief implementation details of for_each_channel; users should not call this function directly
 */
template<typename T, typename UnaryFunc>
UnaryFunc for_each_channel_impl(cv::Mat &input, int channel, UnaryFunc func)
{
    int const rows = input.rows;
    int const cols = input.cols;
    int const channels = input.channels();
    for(int row = 0; row != rows; ++row){
        auto *input_ptr = input.ptr<T>(row) + channel;
        for(int col = 0; col != cols; ++col){
            func(*input_ptr);
            input_ptr += channels;
        }
    }
    return func;
}
Use it like this:
for_each_channel_impl<uchar>(input, 0, [](uchar a){ std::cout<<(size_t)a<<", "; });
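Since the question asks for one pixel value per line, you could simply print a newline instead of the comma (a small variation of the call above; shown here for a single-channel image):

for_each_channel_impl<uchar>(input, 0, [](uchar a){ std::cout << (size_t)a << "\n"; });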
You could add an optimization for continuous data; then it may look like this:
/**
 * @brief apply an STL-like for_each algorithm on a channel
 *
 * @param T       : the type of the channel (e.g. uchar, float, double and so on)
 * @param channel : the channel on which to apply the for_each algorithm
 * @param func    : unary function that accepts an element in the range as argument
 *
 * @return : returns func
 */
template<typename T, typename UnaryFunc>
inline UnaryFunc for_each_channel(cv::Mat &input, int channel, UnaryFunc func)
{
    if(input.channels() == 1 && input.isContinuous()){
        return for_each_continuous_channels<T>(input, func);
    }else{
        return for_each_channel_impl<T>(input, channel, func);
    }
}
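Note that for_each_continuous_channels is referenced above but not shown in the answer. A minimal sketch of what it might look like, based on its name and on how it is called (my assumption, not the original author's code; it would need to be defined before for_each_channel):

// Hypothetical sketch: for a continuous, single-channel cv::Mat,
// walk the data as one flat array.
template<typename T, typename UnaryFunc>
UnaryFunc for_each_continuous_channels(cv::Mat &input, UnaryFunc func)
{
    T *begin = input.ptr<T>(0);
    T *end   = begin + input.total();   // single-channel, continuous data
    for(T *it = begin; it != end; ++it){
        func(*it);
    }
    return func;
}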
This kind of generic loop saves me a lot of time, and I hope you find it helpful. If there are any bugs, or you have a better idea, please tell me.
I would like to design some generic algorithms for OpenCL too, but sadly it does not support templates; I hope one day CUDA will become an open standard, or OpenCL will support templates.
This works for any number of channels as long as the channel type is byte-based; non-byte channels may not work.
