Why does the NVIDIA profiler not show unified memory information?

I have a Titan Xp installed in a 64-bit Windows 10 machine with CUDA 9.2 and the NVIDIA driver 398.82 (398.82-desktop-win10-64bit-international-whql). I have a simple program that uses unified memory, shown below.
#include <cmath>
#include <iostream>

// CUDA kernel to add elements of two arrays
__global__ void add(int n, float *x, float *y)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main(void)
{
    int N = 1 << 20;
    float *x, *y;
    // Allocate Unified Memory -- accessible from CPU or GPU
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));
    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    // Launch kernel on 1M elements on the GPU
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    add <<< numBlocks, blockSize >>>(N, x, y);
    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();
    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i] - 3.0f));
    std::cout << "Max error: " << maxError << std::endl;
    // Free memory
    cudaFree(x);
    cudaFree(y);
    return 0;
}
I compile this code using Visual Studio 2017 community, and run it in the command prompt window with no error.
When I profile it in Nvidia Profiler, it gives me a "Warning" message as below.
"==852== Warning: Unified Memory Profiling is not supported on the
current configuration because a pair of devices without peer-to-peer
support is detected on this multi-GPU setup. When peer mappings are
not available, system falls back to using zero-copy memory. It can
cause kernels, which access unified memory, to run slower. More
details can be found at:
I am pretty sure I have only one GPU installed in this computer, so why can't I get the unified memory profiling information?
By the way, I ran exactly the same experiment on another machine with the same software environment and the same GPU, and there the profiler does show the unified memory information. Is there something wrong with this specific computer? Is there any hardware-related configuration or setting I need to change in order to enable the unified memory feature?

I faced this problem in the past; after I updated my driver to the latest version (released on 19/9/2018, if I am not mistaken), the problem was resolved.
I hope it resolves your problem as well.
Let me know if it does.

I installed the new CUDA SDK 10, and it works fine now.
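Since the warning talks about a multi-GPU setup without peer-to-peer support, it can help to ask the CUDA runtime what it actually sees on the affected machine. The following is a minimal sketch, not part of the original thread, that lists the visible devices and queries peer access for every pair using cudaGetDeviceCount, cudaGetDeviceProperties and cudaDeviceCanAccessPeer. If it reports a device you did not expect, that would at least be consistent with the warning.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("CUDA devices visible to the runtime: %d\n", count);

    // Name and managed-memory support of every visible device
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess) continue;
        std::printf("  device %d: %s (managedMemory=%d)\n",
                    i, prop.name, prop.managedMemory);
    }

    // Peer-to-peer capability for every ordered pair of devices
    for (int i = 0; i < count; ++i) {
        for (int j = 0; j < count; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            std::printf("  peer access %d -> %d: %s\n",
                        i, j, canAccess ? "yes" : "no");
        }
    }
    return 0;
}
```

If more than one device shows up, one thing to try is restricting the process to a single GPU through the CUDA_VISIBLE_DEVICES environment variable before profiling again. Whether that helps in this particular case is an assumption, not something the thread confirms.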


Explicit memory prefetching for Intel Compilers

I have two functions, one which calculates the difference between successive elements of a row and the second calculates the successive difference between values in a column. Therefore one would calculate M[i][j+1] -M[i][j] and second would do M[i+1][j] - M[i][j], M being the matrix. I implement them as follows -
inline void firstFunction(uchar* input, uchar* output, size_t M, size_t N){
    for(int i=0; i < M; i++){
        for(int j=0; j <=N - 33; j+=32){
            auto pos = i*N + j;
            _mm256_storeu_epi8(output + pos, _mm256_sub_epi8(_mm256_loadu_epi8(input + pos + 1), _mm256_loadu_epi8(input + pos)));
        }
    }
}
void secondFunction(uchar* input, uchar* output, size_t M, size_t N){
    for(int i = 0; i < M-1; i++){
        //#pragma prefetch input : (i+1)*N : (i+1)*N + N
        for(int j = 0; j <N-33; j+=32){
            auto idx = i * N + j;
            auto idx_1 = (i+1)*N + j;
            _mm256_storeu_epi8(output + idx, _mm256_sub_epi8(_mm256_loadu_epi8(input + idx_1), _mm256_loadu_epi8(input + idx)));
        }
    }
}
However, when I benchmark them, the average runtimes of the first and second functions are as follows:
firstFunction = 21.1432ms
secondFunction = 166.851ms
where the size of the matrix is M = 9024 and N = 12032.
This is a huge increase in the runtime for a similar operation. I suspect this has something to do with memory accesses and caching, where way more cycles are spent in getting the memory from another row in the second case.
So my question has two parts.
Is my reasoning for the difference in runtimes correct?
How do I alleviate it? My first idea is to prefetch the next row before it is needed, but I am not able to prefetch a dynamically calculated position. Would _mm_prefetch help if the issue is indeed memory access time?
I am using the dpcpp compiler with the compile options -g -O3 -fsycl -fsycl-targets=spir64 -mavx512f -mavx512vl -mavx512bw -qopenmp -liomp5 -lpthread. This compiler has a prefetch pragma, but it does not allow prefetch addresses calculated at run time. I would really appreciate a solution that is not specific to this compiler; one specific to GCC would also be fine.
Edit1 - I just tried _mm_prefetch, but the call _mm_prefetch(input + (i+1) * N, N); also fails with error: argument to '__builtin_prefetch' must be a constant integer. So, an additional question: is there any way to prefetch memory locations calculated at run time?
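On the Edit1 error: in _mm_prefetch(p, hint) the first argument is the address and may be computed at run time, while the second argument is a locality hint such as _MM_HINT_T0 and must be a compile-time constant. Passing N in that position is what triggers the diagnostic. The sketch below shows the same idea with the GCC/Clang builtin __builtin_prefetch(addr, rw, locality), whose last two arguments are likewise constants. The function name, the scalar loop and the 64-byte cache line are assumptions made for illustration; this is not the poster's AVX code, and whether prefetching helps at all has to be measured, since hardware prefetchers usually handle sequential streams well already.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the original post): computes
// output[i][j] = input[i+1][j] - input[i][j] (mod 256) for rows 0..M-2
// and prefetches the next row at an address calculated at run time.
void columnDiffPrefetch(const std::uint8_t* input, std::uint8_t* output,
                        std::size_t M, std::size_t N) {
    constexpr std::size_t kLine = 64;  // assumed cache-line size in bytes
    for (std::size_t i = 0; i + 1 < M; ++i) {
        const std::uint8_t* row  = input + i * N;
        const std::uint8_t* next = input + (i + 1) * N;
        for (std::size_t j = 0; j < N; ++j) {
            // Once per cache line, request the line after the current one.
            // The address is dynamic; rw=0 (read) and locality=3 are constants.
            if (j % kLine == 0 && j + kLine < N) {
                __builtin_prefetch(next + j + kLine, 0, 3);
            }
            output[i * N + j] = static_cast<std::uint8_t>(next[j] - row[j]);
        }
    }
}
```

With the intrinsic form, the equivalent call would be _mm_prefetch(reinterpret_cast<const char*>(next + j + kLine), _MM_HINT_T0) after including <immintrin.h>.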

CUDA kernel with global memory vs CUDA kernel with constant memory

I have two kernels for doing a matrix multiplication: one uses global memory and the other uses constant memory. I wanted to use the CUDA profiler to test the speed of both kernels.
I tested both on a compute capability 1.3 device and on a 2.0 device. I was expecting the constant-memory kernel to be faster on the 1.3 device and the global-memory kernel to be faster on the 2.0 device, because global memory is cached on 2.0 devices, but I found that the global-memory kernel is faster on both. Is this due to memory coalescing on global memory? If so, is there a way to make the constant-memory kernel faster?
I'm using 80x80 matrices and a block size of 16.
Here is the global memory kernel
__global__ void MatMulGlobKernel(const Matriz A, const Matriz B, Matriz C) {
    float Cvalor = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    // Skip threads that fall outside the result matrix
    if(row >= A.height || col >= B.width) return;
    for (int e = 0; e < A.width; ++e)
        Cvalor += A.valores[row * A.width + e] * B.valores[e * B.width + col];
    C.valores[row * C.width + col] = Cvalor;
}
A.valores, B.valores and C.valores reside in global memory.
Now here is the constant memory kernel.
__global__ void MatMulConstKernel(const Matriz A, const Matriz B, Matriz C) {
    float Cvalor = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    // Skip threads that fall outside the result matrix
    if(row >= A.height || col >= B.width) return;
    for (int e = 0; e < A.width; ++e)
        Cvalor += A_const_valores[row * A.width + e] * B_const_valores[e * B.width + col];
    C.valores[row * C.width + col] = Cvalor;
}
A_const_valores and B_const_valores reside in constant memory while C.valores resides in global memory.
This is the profiler result for the 1.3 device (Tesla M1060)
Const kernel 101.70us
Global kernel 51.424us
and for the 2.0 device (GTX 650)
Const kernel 178.05us
Global kernel 58.144us
Matrix multiplication usually has some components where adjacent threads are accessing adjacent values from memory. Your kernels have a load that behaves this way:
B.valores[e * B.width + col];
When reading from global memory, this load can be serviced in a single cycle (from the L1 or L2 cache) to the warp. Yes, this is a coalesced load.
Constant memory, on the other hand, can only serve one 32-bit quantity per cycle. Therefore the constant cache will take 32 cycles to deliver the same requested data to the warp.
This would not be a typical use case for constant memory. Constant memory is best used when every thread in the warp is requesting the same location in memory.
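To make the serialization argument concrete, here is a small host-side model (plain C++, no GPU required) that counts how many distinct addresses one warp requests for each operand at a given loop iteration. It assumes that a warp is formed from 32 consecutive threads of the block with threadIdx.x varying fastest, and it looks only at block (0,0); the enum and function names are made up for this sketch. Under those assumptions, the 16x16 blocks from the question give 16 distinct addresses for the B load and 2 for the A load, while a 32x32 block would give the worst case of 32 for B and a single broadcast address for A.

```cpp
#include <cstddef>
#include <set>

// Hypothetical host-side model, not part of the original kernels.
enum class Operand { A, B };

// Number of distinct element indices requested by warp `warp` of a
// blockDim x blockDim thread block (block (0,0)) at loop iteration `e`.
std::size_t distinctAddressesInWarp(Operand op, int blockDim, int width,
                                    int warp, int e) {
    const int warpSize = 32;
    std::set<int> addresses;
    for (int lane = 0; lane < warpSize; ++lane) {
        int linear = warp * warpSize + lane;  // thread id within the block
        int tx = linear % blockDim;           // threadIdx.x -> col
        int ty = linear / blockDim;           // threadIdx.y -> row
        int index = (op == Operand::A) ? ty * width + e   // A[row * width + e]
                                       : e * width + tx;  // B[e * width + col]
        addresses.insert(index);
    }
    return addresses.size();
}
```

A constant-cache read costs roughly one step per distinct address in the warp, so the smaller this count, the better suited the access pattern is to __constant__ memory.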
As an experiment, you might see what kind of results you get if you keep the A matrix in __constant__ memory and the B matrix in global memory.
If you really want fast matrix multiply, however, use CUBLAS.
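The experiment suggested above, with A in __constant__ memory and B in global memory, might look like the following sketch. It assumes the 80x80 matrices and 16x16 blocks from the question and uses plain float arrays instead of the Matriz struct; the names A_const and MatMulMixedKernel are made up for illustration. Timing it against the two original kernels is left to the profiler.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int N = 80;      // matrix dimension from the question
constexpr int BLOCK = 16;  // block size from the question

// A lives in constant memory: 80 * 80 * 4 bytes = 25,600 bytes, below the 64 KB limit.
__constant__ float A_const[N * N];

__global__ void MatMulMixedKernel(const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;
    float acc = 0.0f;
    for (int e = 0; e < n; ++e)
        // Threads sharing a row read the same A element (few distinct addresses);
        // adjacent threads read adjacent B elements from global memory (coalesced).
        acc += A_const[row * n + e] * B[e * n + col];
    C[row * n + col] = acc;
}

int main() {
    std::vector<float> hA(N * N, 1.0f), hB(N * N, 2.0f), hC(N * N, 0.0f);
    const size_t bytes = N * N * sizeof(float);
    float *dB = nullptr, *dC = nullptr;

    cudaMemcpyToSymbol(A_const, hA.data(), bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(BLOCK, BLOCK);
    dim3 grid((N + BLOCK - 1) / BLOCK, (N + BLOCK - 1) / BLOCK);
    MatMulMixedKernel<<<grid, block>>>(dB, dC, N);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    // With A filled with 1 and B filled with 2, every entry should be 80 * 1 * 2 = 160.
    std::printf("C[0] = %f\n", hC[0]);

    cudaFree(dB);
    cudaFree(dC);
    return 0;
}
```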

Using HLSL to invisibly stress a graphics card - How to stress the memory?

For a while I've been developing an invisible (read: doesn't produce any visual output) stressor to test the capabilities of my graphics card (and as an exploration of DirectCompute in general, which I'm pretty new to). I've got the following code right now that I'm pretty proud of:
RWStructuredBuffer<uint> BufferOut : register(u0);
[numthreads(1, 1, 1)]
void CSMain( uint3 DTid : SV_DispatchThreadID )
{
    uint total = 0;
    float p = 0;
    while(p++ < 40.0){
        float s= 4.0;
        float M= pow(2.0,p) - 1.0;
        for(uint i=0; i <= p - 2; i++)
            s=((s*s) - 2) % M;
        if(s < 1.0) total++;
    }
    BufferOut[DTid.x] = total;
}
This runs the Lucas-Lehmer test for the first 40 powers of two. When I dispatch this code in a timed loop and look at my graphics card's stats using GPU-Z, my GPU load shoots to 99% for the duration. I'm pretty happy with this, but I also notice that the heat generated by a fully loaded GPU is actually pretty minimal (I'm getting about a 5 to 10 degree Celsius jump, nowhere near the heat jump I get when running, say, Borderlands 2). My thought is that most of my heat comes from memory accesses, so I would need to include consistent memory accesses across the run. My initial code looked like this:
RWStructuredBuffer<uint> BufferOut : register(u0);
groupshared float4 memory_buffer[1024];
[numthreads(1, 1, 1)]
void CSMain( uint3 DTid : SV_DispatchThreadID )
uint total = 0;
float p = 0;
while(p++ < 40.0){
[fastop] // to lower compile times - Code efficiency is strangely not what Im looking for right now.
for(uint i = 0; i < 1024; ++i)
float s= 4.0;
float M= pow(2.0,p) - 1.0;
for(uint i=0; i <= p - 2; i++)
s=((s*s) - 2) % M;
if(s < 1.0) total++;
BufferOut[DTid.x] = total;
Read a lot of non-coherent samples in large textures. Try both DXT1 compressed and non-compressed values. And use render to texture. And MRT. All will beat on the GPU memory systems.

How to solve CUDA Thrust library - for_each synchronization error?

I'm trying to modify a simple dynamic vector in CUDA using the Thrust library, but I'm getting a "launch_closure_by_value" error on the screen, indicating that the error is related to some synchronization process.
Because of this error, even a simple modification of a 1D dynamic array is not possible.
The code segment that causes the error is as follows.
From a .cpp file I call setIndexedGridInfo, which is defined in System.cu:
float* a= (float*)(malloc(8*sizeof(float)));
a[0]= 0; a[1]= 1; a[2]= 2; a[3]= 3; a[4]= 4; a[5]= 5; a[6]= 6; a[7]= 7;
float* b = (float*)(malloc(8*sizeof(float)));
The code segment at System.cu:
setIndexedGridInfo(float* a, float*b)
thrust::device_ptr<float> d_oldData(a);
thrust::device_ptr<float> d_newData(b);
float c = 0.0;
grid_functor is defined in _kernel.cu
struct grid_functor
{
    float a;
    __host__ __device__
    grid_functor(float grid_Info) : a(grid_Info) {}
    template <typename Tuple>
    void operator()(Tuple t)
    {
        volatile float data = thrust::get<0>(t);
        float pos = data + 0.1;
        thrust::get<1>(t) = pos;
    }
};
I also get these on the Output window (I use Visual Studio):
First-chance exception at 0x000007fefdc7cacd in Particles.exe:
Microsoft C++ exception: cudaError_enum at memory location
0x0029eb60.. First-chance exception at 0x000007fefdc7cacd in
smokeParticles.exe: Microsoft C++ exception:
thrust::system::system_error at memory location 0x0029ecf0.. Unhandled
exception at 0x000007fefdc7cacd in Particles.exe: Microsoft C++
exception: thrust::system::system_error at memory location
What is causing the problem?
You are trying to use host memory pointers in functions expecting pointers in device memory. This code is the problem:
float* a= (float*)(malloc(8*sizeof(float)));
a[0]= 0; a[1]= 1; a[2]= 2; a[3]= 3; a[4]= 4; a[5]= 5; a[6]= 6; a[7]= 7;
float* b = (float*)(malloc(8*sizeof(float)));
thrust::device_ptr<float> d_oldData(a);
thrust::device_ptr<float> d_newData(b);
The thrust::device_ptr is intended for "wrapping" a device memory pointer allocated with the CUDA API so that thrust can use it. You are trying to treat a host pointer directly as a device pointer. That is illegal. You could modify your setIndexedGridInfo function like this:
void setIndexedGridInfo(float* a, float*b, const int n)
thrust::device_vector<float> d_oldData(a,a+n);
thrust::device_vector<float> d_newData(b,b+n);
float c = 0.0;
The device_vector constructor will allocate device memory and then copy the contents of your host memory to the device. That should fix the error you are seeing, although I am not sure what you are trying to do with the for_each iterator and whether the functor you have written is correct.
Here is a complete, compilable, runnable version of your code:
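#include <cstdlib>
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/copy.h>
#include <thrust/tuple.h>
#include <thrust/iterator/zip_iterator.h>

struct grid_functor
{
    float a;
    __host__ __device__
    grid_functor(float grid_Info) : a(grid_Info) {}
    // Reads element 0 of the zipped tuple and writes that value + 0.1 to element 1
    template <typename Tuple>
    __host__ __device__
    void operator()(Tuple t)
    {
        volatile float data = thrust::get<0>(t);
        float pos = data + 0.1f;
        thrust::get<1>(t) = pos;
    }
};

void setIndexedGridInfo(float* a, float*b, const int n)
{
    // The device_vector constructors allocate device memory and copy the host data
    thrust::device_vector<float> d_oldData(a,a+n);
    thrust::device_vector<float> d_newData(b,b+n);
    float c = 0.0;
    // Apply the functor to every (old, new) pair
    thrust::for_each(
        thrust::make_zip_iterator(thrust::make_tuple(d_oldData.begin(), d_newData.begin())),
        thrust::make_zip_iterator(thrust::make_tuple(d_oldData.end(), d_newData.end())),
        grid_functor(c));
    // Copy the results back to the host
    thrust::copy(d_newData.begin(), d_newData.end(), b);
}

int main(void)
{
    const int n = 8;
    float* a= (float*)(malloc(n*sizeof(float)));
    a[0]= 0; a[1]= 1; a[2]= 2; a[3]= 3; a[4]= 4; a[5]= 5; a[6]= 6; a[7]= 7;
    float* b = (float*)(malloc(n*sizeof(float)));
    setIndexedGridInfo(a,b,n);
    for(int i=0; i<n; i++) {
        fprintf(stdout, "%d (%f,%f)\n", i, a[i], b[i]);
    }
    return 0;
}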
I can compile and run this code on an OS X 10.6.8 host with CUDA 4.1 like this:
$ nvcc -Xptxas="-v" -arch=sm_12 -g -G thrustforeach.cu
./thrustforeach.cu(18): Warning: Cannot tell what pointer points to, assuming global memory space
./thrustforeach.cu(20): Warning: Cannot tell what pointer points to, assuming global memory space
./thrustforeach.cu(18): Warning: Cannot tell what pointer points to, assuming global memory space
./thrustforeach.cu(20): Warning: Cannot tell what pointer points to, assuming global memory space
ptxas info : Compiling entry function '_ZN6thrust6detail7backend4cuda6detail23launch_closure_by_valueINS2_18for_each_n_closureINS_12zip_iteratorINS_5tupleINS0_15normal_iteratorINS_10device_ptrIfEEEESB_NS_9null_typeESC_SC_SC_SC_SC_SC_SC_EEEEi12grid_functorEEEEvT_' for 'sm_12'
ptxas info : Used 14 registers, 160+0 bytes lmem, 16+16 bytes smem, 4 bytes cmem[1]
ptxas info : Compiling entry function '_ZN6thrust6detail7backend4cuda6detail23launch_closure_by_valueINS2_18for_each_n_closureINS_12zip_iteratorINS_5tupleINS0_15normal_iteratorINS_10device_ptrIfEEEESB_NS_9null_typeESC_SC_SC_SC_SC_SC_SC_EEEEj12grid_functorEEEEvT_' for 'sm_12'
ptxas info : Used 14 registers, 160+0 bytes lmem, 16+16 bytes smem, 4 bytes cmem[1]
$ ./a.out
0 (0.000000,0.100000)
1 (1.000000,1.100000)
2 (2.000000,2.100000)
3 (3.000000,3.100000)
4 (4.000000,4.100000)
5 (5.000000,5.100000)
6 (6.000000,6.100000)
7 (7.000000,7.100000)
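For contrast, the intended use of thrust::device_ptr described above, wrapping a pointer that really was allocated on the device with the CUDA API, can be sketched as follows. This is a minimal illustration rather than a rewrite of the poster's code; the variable names and the fill value are arbitrary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/fill.h>
#include <thrust/copy.h>

int main() {
    const int n = 8;
    float* raw = nullptr;

    // Device allocation made with the CUDA runtime API
    if (cudaMalloc(&raw, n * sizeof(float)) != cudaSuccess) {
        std::printf("cudaMalloc failed\n");
        return 1;
    }

    // Wrap the raw DEVICE pointer so Thrust algorithms run on the GPU
    thrust::device_ptr<float> d_data(raw);
    thrust::fill(d_data, d_data + n, 1.5f);

    // Copy device -> host and print
    float h_data[n];
    thrust::copy(d_data, d_data + n, h_data);
    for (int i = 0; i < n; ++i)
        std::printf("%d %f\n", i, h_data[i]);

    cudaFree(raw);
    return 0;
}
```

The key difference from the failing code is that raw comes from cudaMalloc, not malloc, so the address that device_ptr wraps is valid on the device.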

CUDA memory limitations

If I try to send my CUDA device a struct that is larger than the available memory, will CUDA give me any kind of warning or error?
I'm asking because my GPU has 1024 MBytes (1073414144 bytes) of total global memory, and I don't know how I should handle a potential problem.
That's my code:
#define VECSIZE 2250000
#define WIDTH 1500
#define HEIGHT 1500
// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.width + col)
struct Matrix
{
    int width;
    int height;
    int* elements;
};
int main()
{
    Matrix M;
    M.width = WIDTH;
    M.height = HEIGHT;
    M.elements = (int *) calloc(VECSIZE,sizeof(int));
    int row, col;
    // define Matrix M
    // Matrix generator:
    for (int i = 0; i < M.height; i++)
    {
        for(int j = 0; j < M.width; j++)
        {
            row = i;
            col = j;
            if (i == j)
                M.elements[row * M.width + col] = INFINITY;
            else
            {
                M.elements[row * M.width + col] = (rand() % 2); // because 'rand() % 1' just does not seem to work at all.
                if (M.elements[row * M.width + col] == 0) // can't have zero weight.
                    M.elements[row * M.width + col] = INFINITY;
                else if (M.elements[row * M.width + col] == 2)
                    M.elements[row * M.width + col] = 1;
            }
        }
    }
    // Declare & send device Matrix to Device.
    Matrix d_M;
    d_M.width = M.width;
    d_M.height = M.height;
    size_t size = M.width * M.height * sizeof(int);
    cudaMalloc(&d_M.elements, size);
    cudaMemcpy(d_M.elements, M.elements, size, cudaMemcpyHostToDevice);
    int *d_k= (int*) malloc(sizeof(int));
    cudaMalloc((void**) &d_k, sizeof (int));
    int *d_width=(int*)malloc(sizeof(int));
    cudaMalloc((void**) &d_width, sizeof(int));
    unsigned int *width=(unsigned int*)malloc(sizeof(unsigned int));
    width[0] = M.width;
    cudaMemcpy(d_width, width, sizeof(int), cudaMemcpyHostToDevice);
    int *d_height=(int*)malloc(sizeof(int));
    cudaMalloc((void**) &d_height, sizeof(int));
    unsigned int *height=(unsigned int*)malloc(sizeof(unsigned int));
    height[0] = M.height;
    cudaMemcpy(d_height, height, sizeof(int), cudaMemcpyHostToDevice);
    /* et cetera .. */
}
While you may not currently be sending enough data to the GPU to max out its memory, when you do, cudaMalloc will return the error code cudaErrorMemoryAllocation, which, per the CUDA API docs, signals that the memory allocation failed. I note that your example code does not check the return values of the CUDA calls. These return codes need to be checked to make sure your program is running correctly. The CUDA API does not throw exceptions: you must check the return codes. See this article for information on checking for errors and getting meaningful error messages.
If you are using cutil.h, then it provides two very useful macros:
CUDA_SAFE_CALL (used while issuing functions like cudaMalloc, cudaMemcpy etc.)
CUT_CHECK_ERROR (used after executing a kernel to check for errors in kernel execution).
They take care of the errors, if any, by using the error checking mechanism detailed in the article provided by flipchart.
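If cutil.h is not available, the same pattern can be written by hand. The following is a minimal sketch under that assumption: CHECK_CUDA is a made-up macro name, and fillOnes is a placeholder kernel. It also requests more memory than the device reports, to show that the failure arrives as a return code (expected to be cudaErrorMemoryAllocation) rather than as a crash.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: report file, line and message, then exit.
#define CHECK_CUDA(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__,      \
                         __LINE__, #call, cudaGetErrorString(err_));      \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Placeholder kernel used only to demonstrate post-launch checks
__global__ void fillOnes(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 1;
}

int main() {
    size_t freeBytes = 0, totalBytes = 0;
    CHECK_CUDA(cudaMemGetInfo(&freeBytes, &totalBytes));
    std::printf("free %zu of %zu bytes\n", freeBytes, totalBytes);

    // Deliberately ask for more than the device has; this is expected to
    // fail with an error code, which we print instead of aborting.
    int* tooBig = nullptr;
    cudaError_t err = cudaMalloc(&tooBig, totalBytes + (size_t(1) << 30));
    std::printf("oversized cudaMalloc: %s\n", cudaGetErrorString(err));
    cudaGetLastError();  // reset the recorded error before continuing

    // A request of the size used in the question (1500 x 1500 ints)
    const int n = 1500 * 1500;
    int* d = nullptr;
    CHECK_CUDA(cudaMalloc(&d, n * sizeof(int)));
    fillOnes<<<(n + 255) / 256, 256>>>(d, n);
    CHECK_CUDA(cudaGetLastError());        // errors from the launch itself
    CHECK_CUDA(cudaDeviceSynchronize());   // errors during kernel execution
    CHECK_CUDA(cudaFree(d));
    return 0;
}
```

Querying cudaMemGetInfo before a large allocation is optional, but it gives the program a chance to shrink or split the request instead of failing.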
