Does CUDA somehow block and transfer all allocated managed memory to the GPU when a kernel is launched? I just played with unified memory and got results that seem strange, at least from my point of view.
I created two arrays and passed A to the kernel; B is untouched by the kernel call, yet it cannot be accessed. The program simply crashes when I touch B:
0 0 0 here1
If I comment out the b[0] = 1; line, the code runs fine:
0 0 0 here1 after1 0 here2 1 after2
Why is this happening?
#include <iostream>

__global__ void kernel(int* t)
{
t[0]++;
}
int main()
{
int* a;
int* b;
std::cout << cudaMallocManaged(&a,sizeof(int)*100) << std::endl;
std::cout << cudaMallocManaged(&b,sizeof(int)*100) << std::endl;
std::cout << b[0] << std::endl;
kernel<<<1,1,0,0>>>(a);
std::cout << "here1" << std::endl;
b[0] = 1;
std::cout << "after1" << std::endl;
cudaDeviceSynchronize();
std::cout << b[0] << std::endl;
std::cout << "here2" << std::endl;
std::cout << a[0] << std::endl;
std::cout << "after2" << std::endl;
return 0;
}
Does CUDA somehow block and transfer all allocated managed memory to the GPU when a kernel is launched?
Yes, provided your device is of compute capability less than 6.0.
On these devices, managed memory works by copying every managed allocation to the GPU before a kernel launch and copying it back to the host on synchronization. Between those two points, any access to managed memory from the host leads to a segmentation fault.
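Concretely, the launch makes every managed allocation inaccessible from the host until the device is synchronized, even allocations the kernel never touches. A minimal sketch of the simplest workaround on such a device (assuming compute capability below 6.0): synchronize before the first host access.

#include <cuda_runtime.h>
#include <iostream>

__global__ void kernel(int* t)
{
    t[0]++;
}

int main()
{
    int *a, *b;
    cudaMallocManaged(&a, sizeof(int) * 100);
    cudaMallocManaged(&b, sizeof(int) * 100);

    kernel<<<1, 1>>>(a);
    cudaDeviceSynchronize(); // managed memory becomes host-accessible again only here

    b[0] = 1;                // safe now; before the synchronize this would fault
    std::cout << a[0] << " " << b[0] << std::endl;

    cudaFree(a);
    cudaFree(b);
    return 0;
}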
You can be more specific about which memory to copy for a given kernel by attaching it to a stream using cudaStreamAttachMemAsync() and launching the kernel into that stream.
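A minimal sketch of that approach (again assuming a pre-6.0 device; the stream handling is illustrative, not taken from the question): attach a to a dedicated stream and b to the host, so b stays accessible on the CPU while the kernel runs.

#include <cuda_runtime.h>
#include <iostream>

__global__ void kernel(int* t)
{
    t[0]++;
}

int main()
{
    int *a, *b;
    cudaMallocManaged(&a, sizeof(int) * 100);
    cudaMallocManaged(&b, sizeof(int) * 100);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaStreamAttachMemAsync(stream, a, 0, cudaMemAttachSingle); // a is used only by kernels in this stream
    cudaStreamAttachMemAsync(stream, b, 0, cudaMemAttachHost);   // b is used only by the host
    cudaStreamSynchronize(stream);                               // let the attachments take effect

    kernel<<<1, 1, 0, stream>>>(a);
    b[0] = 1; // no fault: b is not migrated for this launch

    cudaStreamSynchronize(stream);
    std::cout << a[0] << " " << b[0] << std::endl;

    cudaStreamDestroy(stream);
    cudaFree(a);
    cudaFree(b);
    return 0;
}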
Related
I'm trying to use Crypto++ to perform Diffie-Hellman key exchange. I have written a simple program to check if this is working. As you can guess, it is not.
This program was written based on the wiki article https://www.cryptopp.com/wiki/Diffie-Hellman. It generates public and private keys and then uses them to call CryptoPP::DH::Agree. It worked when I was using the same pair of keys for both sides, as on the wiki, but that does not make much practical sense. However, when I try to use different keys, CryptoPP::DH::Agree returns false.
I suspect that I'm doing something incorrectly, but I have no idea what.
#include <crypto++/cryptlib.h>
#include <crypto++/dh.h>
#include <cryptopp/dh2.h>
#include <crypto++/osrng.h>
#include <crypto++/integer.h>
#include <crypto++/nbtheory.h>
#include <iostream>
static CryptoPP::AutoSeededRandomPool rnd;
static CryptoPP::DH dhA, dhB;
static CryptoPP::SecByteBlock privKeyA, pubKeyA, privKeyB, pubKeyB;
static void createDomainParameters(CryptoPP::DH &dh)
{
CryptoPP::PrimeAndGenerator pg;
pg.Generate(1, rnd, 512, 511);
const CryptoPP::Integer p = pg.Prime();
const CryptoPP::Integer q = pg.SubPrime();
const CryptoPP::Integer g = pg.Generator();
std::cout << "P: " << p << '\n';
std::cout << "Q: " << q << '\n';
std::cout << "G: " << g << '\n';
dh = CryptoPP::DH(p, q, g);
}
static void createAsymetricKey(const CryptoPP::DH &dh, CryptoPP::SecByteBlock &privKey, CryptoPP::SecByteBlock &pubKey)
{
privKey = CryptoPP::SecByteBlock(dh.PrivateKeyLength());
pubKey = CryptoPP::SecByteBlock(dh.PublicKeyLength());
dh.GenerateKeyPair(rnd, privKey, pubKey);
CryptoPP::Integer a, b;
a.Decode(privKey.BytePtr(), privKey.SizeInBytes());
std::cout << "privKey: " << a << std::endl;
b.Decode(pubKey.BytePtr(), pubKey.SizeInBytes());
std::cout << "pubKey: " << b << std::endl;
}
static void createSymetricKey(const CryptoPP::DH &dh, const CryptoPP::SecByteBlock &privKey, const CryptoPP::SecByteBlock &pubKey)
{
CryptoPP::SecByteBlock shared(dh.AgreedValueLength());
if(!dh.Agree(shared, privKey, pubKey))
throw std::runtime_error("Failed to reach shared secret");
CryptoPP::Integer x;
x.Decode(shared.BytePtr(), shared.SizeInBytes());
std::cout << "shared: " << x << std::endl;
}
int main()
{
std::cout << std::hex;
createDomainParameters(dhA);
std::cout << std::endl;
createDomainParameters(dhB);
std::cout << "\n------------------------------\n" << std::endl;
createAsymetricKey(dhA, privKeyA, pubKeyA);
std::cout << std::endl;
createAsymetricKey(dhB, privKeyB, pubKeyB);
if(dhA.AgreedValueLength() != dhB.AgreedValueLength())
throw std::runtime_error("Shared secret size mismatch");
std::cout << "\n------------------------------\n" << std::endl;
createSymetricKey(dhA, privKeyA, pubKeyB);
std::cout << std::endl;
createSymetricKey(dhB, privKeyB, pubKeyA);
return 0;
}
When I change the calls to createSymetricKey so that each one uses keys from the same pair, it works:
createSymetricKey(dhA, privKeyA, pubKeyA);
std::cout << std::endl;
createSymetricKey(dhB, privKeyB, pubKeyB);
AFAIK that makes no sense, though. What is the correct way to use CryptoPP::DH::Agree?
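For what it's worth, my guess (an assumption on my part, not something stated in the question) is that Agree fails because createDomainParameters runs twice, so dhA and dhB end up with different domain parameters (p, q, g); Diffie-Hellman agreement only works when both sides share the same domain. A sketch of main using the question's own helpers, but a single domain:

int main()
{
    std::cout << std::hex;

    CryptoPP::DH dh;
    createDomainParameters(dh); // generate (p, q, g) once, shared by both parties

    createAsymetricKey(dh, privKeyA, pubKeyA); // Alice's key pair in that domain
    createAsymetricKey(dh, privKeyB, pubKeyB); // Bob's key pair in the same domain

    createSymetricKey(dh, privKeyA, pubKeyB); // Alice: her private key + Bob's public key
    createSymetricKey(dh, privKeyB, pubKeyA); // Bob: his private key + Alice's public key
    // Both calls should now print the same shared value.
    return 0;
}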
I have two GPUs: Intel HD and nVidia Quadro. Using GPU Caps Viewer, I can detect both GPUs in the OpenCL tab. However, when executing this code I only get the Intel one:
cv::ocl::setUseOpenCL(true);
if (!cv::ocl::haveOpenCL()) {
std::cout << "OpenCL is not available..." << std::endl;
}
cv::ocl::Context context;
if (!context.create(cv::ocl::Device::TYPE_ALL)) {
std::cout << "Failed creating the context..." << std::endl;
}
std::cout << context.ndevices() << " GPU devices are detected." << std::endl;
for (int i = 0; i < context.ndevices(); i++) {
cv::ocl::Device device = context.device(i);
std::cout << "name: " << device.name() << std::endl;
std::cout << "available: " << device.available() << std::endl;
std::cout << "imageSupport: " << device.imageSupport() << std::endl;
std::cout << "OpenCL_C_Version: " << device.OpenCL_C_Version() << std::endl;
std::cout << std::endl;
}
Results:
1 GPU devices are detected.
name: Intel(R) HD Graphics P530
available: 1
imageSupport: 1
OpenCL_C_Version: OpenCL C 2.0
Information:
Windows 10
OpenCV 3.1
Visual studio 2013
nVidia Quadro M4000M
Notes:
I am able to call my nVidia GPU directly using the OpenCV CUDA interface.
I have just installed the latest driver from nVidia website.
The solution in my case was to add this environment variable:
OPENCV_OPENCL_DEVICE=NVIDIA:GPU:
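If you would rather set it from code than from the system environment, something along these lines should work (a sketch; it must run before OpenCV touches OpenCL for the first time, and putenv/setenv availability depends on the platform):

#include <cstdlib>
#include <opencv2/core/ocl.hpp>

int main()
{
    // Must be set before the first OpenCL use inside OpenCV.
#ifdef _WIN32
    _putenv("OPENCV_OPENCL_DEVICE=NVIDIA:GPU:");
#else
    setenv("OPENCV_OPENCL_DEVICE", "NVIDIA:GPU:", 1);
#endif
    cv::ocl::setUseOpenCL(true);
    // ... continue with the cv::UMat / context code from the question ...
    return 0;
}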
As is known, OpenCV 3.0 supports the new class cv::UMat, which provides the Transparent API (TAPI) to use OpenCL automatically when it can: http://code.opencv.org/projects/opencv/wiki/Opencv3#tapi
There are two introductions to the cv::Umat and TAPI:
Intel: https://software.intel.com/en-us/articles/opencv-30-architecture-guide-for-intel-inde-opencv
AMD: http://developer.amd.com/community/blog/2014/10/15/opencv-3-0-transparent-api-opencl-acceleration/
But if I have:
Intel CPU Core i5 (Haswell) 4xCores (OpenCL Intel CPUs with SSE 4.1, SSE 4.2 or AVX support)
Intel Integrated HD Graphics which supports OpenCL 1.2
1st nVidia GPU GeForce GTX 970 (Maxwell), which supports OpenCL 1.2 and CUDA
2nd nVidia GPU GeForce GTX 970 ...
If I turn on OpenCL in OpenCV, how can I change the device on which the OpenCL code will be executed: on the 8 cores of the CPU, on the integrated HD Graphics, on the 1st nVidia GPU, or on the 2nd nVidia GPU?
How can I select any one of these 4 devices on which to run OpenCL for parallel execution of algorithms with cv::UMat?
For example, how can I use OpenCL acceleration on the 4 cores of the Core i5 CPU with cv::UMat?
I use something like this to check the versions and hardware being used for OpenCL support:
ocl::setUseOpenCL(true);
if (!ocl::haveOpenCL())
{
cout << "OpenCL is not available..." << endl;
//return;
}
cv::ocl::Context context;
if (!context.create(cv::ocl::Device::TYPE_GPU))
{
cout << "Failed creating the context..." << endl;
//return;
}
cout << context.ndevices() << " GPU devices are detected." << endl; //This bit provides an overview of the OpenCL devices you have in your computer
for (int i = 0; i < context.ndevices(); i++)
{
cv::ocl::Device device = context.device(i);
cout << "name: " << device.name() << endl;
cout << "available: " << device.available() << endl;
cout << "imageSupport: " << device.imageSupport() << endl;
cout << "OpenCL_C_Version: " << device.OpenCL_C_Version() << endl;
cout << endl;
}
Then you can set your preferred hardware to use with this:
cv::ocl::Device(context.device(1));
Hope this helps you.
You can also set a desired OpenCL device from within your code using the environment variable method, as follows (the example selects the first GPU device):
if (putenv("OPENCV_OPENCL_DEVICE=:GPU:0") != 0 || !cv::ocl::useOpenCL())
{
std::cerr << "Failed to set a desired OpenCL device" << std::endl;
std::cerr << "Press any key to exit..." << std::endl;
getchar();
return 1;
}
The call to cv::ocl::useOpenCL() forces OpenCV to set the default OpenCL device to the one specified in the environment variable OPENCV_OPENCL_DEVICE, which is set up prior to that call.
I checked that this actually happens by setting a break-point at opencv_core310d.dll!cv::ocl::selectOpenCLDevice() Line 2256 (opencv\source\modules\core\src\ocl.cpp):
static cl_device_id selectOpenCLDevice()
{
std::string platform, deviceName;
std::vector<std::string> deviceTypes;
const char* configuration = getenv("OPENCV_OPENCL_DEVICE");
if (configuration &&
(strcmp(configuration, "disabled") == 0 ||
!parseOpenCLDeviceConfiguration(std::string(configuration), platform, deviceTypes, deviceName)
))
return NULL;
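Judging from that parser, the variable has the form <Platform>:<DeviceType>:<DeviceNameOrIndex>, so, as an untested illustration, each of the four devices from the question might be selected with values along these lines (the platform names are guesses and may need to match your actual platform strings):

OPENCV_OPENCL_DEVICE=Intel:CPU:     (OpenCL on the CPU cores)
OPENCV_OPENCL_DEVICE=Intel:GPU:     (the integrated HD Graphics)
OPENCV_OPENCL_DEVICE=NVIDIA:GPU:0   (the 1st GeForce GTX 970)
OPENCV_OPENCL_DEVICE=NVIDIA:GPU:1   (the 2nd GeForce GTX 970)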
I recently upgraded to a GPU card with OpenCL 2.0 (R9 390), from one with only OpenCL 1.2 on it. To start using it with OpenCV I created some basic calls to determine what hardware each library thought I had.
cout << "Equipment according to OpenCV:" << endl;
//Setup OpenCV first
cv::ocl::setUseOpenCL(true);
//OpenCV: Platform Info
std::vector<cv::ocl::PlatformInfo> platforms;
cv::ocl::getPlatfomsInfo(platforms);
//OpenCV Platforms
cv::ocl::Device current_device; // declared outside the loop so it stays in scope afterwards
for (size_t i = 0; i < platforms.size(); i++)
{
const cv::ocl::PlatformInfo* platform = &platforms[i];
//Platform Name
std::cout << "Platform Name: " << platform->name().c_str() << "\n";
//Access known device
for (int j = 0; j < platform->deviceNumber(); j++)
{
//Access Device
platform->getDevice(current_device, j);
std::cout << "Device Name: " << current_device.name().c_str() << "\n";
}
}
cv::ocl::Device(current_device); // Required?
cv::ocl::Context cvContext;
cvContext.create(cv::ocl::Device::TYPE_GPU); // context that is queried below
cout << cvContext.ndevices() << " GPU devices are detected." << endl;
for (int i = 0; i < cvContext.ndevices(); i++)
{
cv::ocl::Device device = cvContext.device(i);
cout << "name: " << device.name() << endl;
cout << "available: " << device.available() << endl;
cout << "imageSupport: " << device.imageSupport() << endl;
cout << "OpenCL_C_Version: " << device.OpenCL_C_Version() << endl;
cout << "Use OpenCL: " << cv::ocl::useOpenCL() << endl;
cout << endl;
}
cv::ocl::Device(cvContext.device(0)); //Here is where you change which GPU to use (e.g. 0 or 1)
// Setup OpenCL
cout << "Equipment according to OpenCL:" << endl;
vector<cl::Platform> clPlatforms;
vector<cl::Device> clPlatformDevices, clAllDevices, clCTXdevices;
cl::Context clContext;
string clPlatform_name, clDevice_name;
cl_uint i;
cl::Platform::get(&clPlatforms);
for(i=0; i<clPlatforms.size();i++)
{
clPlatform_name = clPlatforms[i].getInfo<CL_PLATFORM_NAME>();
cout<< "Platform: " <<clPlatform_name.c_str()<<endl;
clPlatforms[i].getDevices(CL_DEVICE_TYPE_ALL, &clPlatformDevices);
// Create context and access device names
clContext = cl::Context(clPlatformDevices);
clCTXdevices = clContext.getInfo<CL_CONTEXT_DEVICES>();
for(cl_uint j=0; j<clCTXdevices.size(); j++) { // separate index so the outer platform loop is not clobbered
clDevice_name = clCTXdevices[j].getInfo<CL_DEVICE_NAME>();
cout << "Device: " << clDevice_name.c_str() << endl;
}
}
cout << "OpenCL Version: "<<clPlatforms[0].getInfo<CL_PLATFORM_VERSION>().c_str() <<endl;
cout << "Vendor: "<<clPlatforms[0].getInfo<CL_PLATFORM_VENDOR>().c_str() <<endl;
cout << "Extensions: "<<clPlatforms[0].getInfo<CL_PLATFORM_EXTENSIONS>().c_str() <<endl;
and the output:
Equipment according to OpenCV:
Platform Name: AMD Accelerated Parallel Processing
Device Name: Hawaii
Device Name: Intel(R) Core(TM)2 Quad CPU Q9450 @ 2.66GHz
1 GPU devices are detected.
name: Hawaii
available: 1
imageSupport: 1
OpenCL_C_Version: OpenCL C 1.2
Use OpenCL: 1
Equipment according to OpenCL:
Platform: AMD Accelerated Parallel Processing
Device: Hawaii
Device: Intel(R) Core(TM)2 Quad CPU Q9450 @ 2.66GHz
OpenCL Version: OpenCL 2.0 AMD-APP (1729.3)
Vendor: Advanced Micro Devices, Inc.
Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
So OpenCV thinks I have OpenCL 1.2, while OpenCL is a little smarter and returns 2.0...
Any ideas why they would not return the same version of OpenCL? I'm wondering if I need to re-compile OpenCV so it can recognize that there is a newer version of OpenCL available to it? Is OpenCV 3.0 limited to using OpenCL 1.2 calls?
Thanks!
I started completely over and used Ubuntu 14.04 64-bit instead. Now when I run the same code, the OpenCV library does indeed recognize the GPU as OpenCL 2.0.
I wrote a small CUDA program to understand global-memory-to-shared-memory transfer transactions. The code is as follows:
#include <iostream>
#include <string>
using namespace std;
__global__ void readUChar4(uchar4* c, uchar4* o){
extern __shared__ uchar4 gc[];
int tid = threadIdx.x;
gc[tid] = c[tid];
o[tid] = gc[tid];
}
int main(){
string a = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
uchar4* c;
cudaError_t e1 = cudaMalloc((void**)&c, 128*sizeof(uchar4));
if(e1==cudaSuccess){
uchar4* o;
cudaError_t e11 = cudaMalloc((void**)&o, 128*sizeof(uchar4));
if(e11 == cudaSuccess){
cudaError_t e2 = cudaMemcpy(c, a.c_str(), 128*sizeof(uchar4), cudaMemcpyHostToDevice);
if(e2 == cudaSuccess){
readUChar4<<<1,128, 128*sizeof(uchar4)>>>(c, o);
uchar4* oFromGPU = (uchar4*)malloc(128*sizeof(uchar4));
cudaError_t e22 = cudaMemcpy(oFromGPU, o, 128*sizeof(uchar4), cudaMemcpyDeviceToHost);
if(e22 == cudaSuccess){
for(int i =0; i < 128; i++){
cout << oFromGPU[i].x << " ";
cout << oFromGPU[i].y << " ";
cout << oFromGPU[i].z << " ";
cout << oFromGPU[i].w << " " << endl;
}
}
else{
cout << "Failed to copy from GPU" << endl;
}
}
else{
cout << "Failed to copy" << endl;
}
}
else{
cout << "Failed to allocate output memory" << endl;
}
}
else{
cout << "Failed to allocate memory" << endl;
}
return 0;
}
This code simply copies data from device memory to shared memory and back to device memory. I have the following three questions:
1. Is the transfer from device memory to shared memory in this case guaranteed to take 4 memory transactions? I believe it depends on how cudaMalloc allocates memory; if the memory is allocated in a haphazard manner such that the data is scattered over memory, then it will take more than 4 memory transactions. However, if cudaMalloc allocates memory in 128-byte chunks or allocates it contiguously, then it should not take more than 4 memory transactions.
2. Does the above logic also hold for writing data from shared memory to device memory, i.e. will the transfer complete in 4 memory transactions?
3. Can this code cause bank conflicts? I believe this code will not cause bank conflicts if threads are assigned IDs sequentially. However, if threads 32 and 64 are scheduled to run in the same warp, then this code can cause bank conflicts.
In the code you provided (repeated here), the compiler will completely remove the shared memory store and load, since they don't do anything necessary or beneficial for the code.
__global__ void readUChar4(uchar4* c, uchar4* o){
extern __shared__ uchar4 gc[];
int tid = threadIdx.x;
gc[tid] = c[tid];
o[tid] = gc[tid];
}
Assuming you did something with the shared memory so it was not eliminated, then:
1. The loads from and stores to global memory in this code take ONE transaction per warp (assuming a Fermi or later GPU), since they are only 32 bits (uchar4 = 4 * 8 bits) per thread, for a total of 128 bytes per warp. cudaMalloc allocates memory contiguously.
2. The answer from 1. applies to the stores as well, yes.
3. There are no bank conflicts in this code. Threads in a warp are always contiguous, with the index of the first thread a multiple of the warp size, so threads 32 and 64 will never be in the same warp. And since you are loading and storing a 32-bit data type and the banks are 32 bits wide, there are no conflicts.
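For completeness, here is a sketch of a variant where the shared memory actually does something and therefore cannot be eliminated (my example, not from the question): each thread writes its own element and then reads a different thread's element back, which keeps the global accesses coalesced and still avoids bank conflicts.

__global__ void reverseUChar4(uchar4* c, uchar4* o)
{
    extern __shared__ uchar4 gc[];
    int tid = threadIdx.x;
    gc[tid] = c[tid];                   // still one 128-byte global load per warp
    __syncthreads();                    // make every store visible to the whole block
    o[tid] = gc[blockDim.x - 1 - tid];  // consecutive threads hit consecutive 32-bit banks: no conflicts
}

// launched exactly like the original kernel:
// reverseUChar4<<<1, 128, 128 * sizeof(uchar4)>>>(c, o);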