I am trying to build OpenCV 2.4.10 on a Windows 8.1 machine with CUDA 6.5. My other third-party libraries installed successfully. I ran a simple GPU-based program and got the error "No GPU found or the library was compiled without GPU support". I also ran the sample executables built during the installation, such as performance_gpu.exe, and got the same error. The WITH_CUDA flag was checked. These are the CUDA-related flags that were set during the CMake configuration:
WITH_CUDA : Checked
WITH_CUBLAS : Checked
WITH_CUFFT : Checked
CUDA_ARCH_BIN : 1.1 1.2 1.3 2.0 2.1(2.0) 3.0 3.5
CUDA_ARCH_PTX : 3.0
CUDA_FAST_MATH : Checked
CUDA_GENERATION : Auto
CUDA_HOST_COMPILER : $(VCInstallDir)bin
CUDA_SEPARABLE_COMPILATION : Unchecked
CUDA_TOOLKIT_ROOT_DIR : C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v6.5
Another thing: in some posts I have read that building with CUDA takes a lot of time. My build takes about 3 hours, most of it spent compiling the .cu files, and as far as I can tell no errors occurred while compiling them.
Some posts also mention a directory named gpu inside the build directory, but I don't see one in mine!
I am using Visual Studio 2013.
What could be the issue? Please help!
UPDATE:
I tried to build OpenCV again, and this time I added the CUDA bin, lib and include directories before starting the build. After the build, in E:\opencv\build\bin\Release, I ran gpu_perf4au.exe and got this output:
[----------]
[ INFO ] Implementation variant: cuda.
[----------]
[----------]
[ GPU INFO ] Run test suite on GeForce GTX 860M GPU.
[----------]
Time compensation is 0
OpenCV version: 2.4.10
OpenCV VCS version: unknown
Build type: release
Parallel framework: tbb
CPU features: sse sse2 sse3 ssse3 sse4.1 sse4.2 avx avx2
[----------]
[ GPU INFO ] Run on OS Windows x64.
[----------]
*** CUDA Device Query (Runtime API) version (CUDART static linking) ***
Device count: 1
Device 0: "GeForce GTX 860M"
CUDA Driver Version / Runtime Version 6.50 / 6.50
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 2048 MBytes (2147483648 bytes)
GPU Clock Speed: 1.02 GHz
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.50, CUDA Runtime Version = 6.50, NumDevs = 1
I thought everything was fine, but then I ran this program (with all the OpenCV and CUDA directories included in its property sheets):
#include <cv.h>
#include <highgui.h>
#include <iostream>
#include <opencv2\opencv.hpp>
#include <opencv2\gpu\gpu.hpp>

using namespace std;
using namespace cv;

char key;

Mat thresholder(Mat input) {
    gpu::GpuMat dst, src;
    src.upload(input);
    gpu::threshold(src, dst, 128.0, 255.0, CV_THRESH_BINARY);
    Mat result_host(dst);
    return result_host;
}

int main(int argc, char* argv[]) {
    cvNamedWindow("Camera_Output", 1);
    CvCapture* capture = cvCaptureFromCAM(CV_CAP_ANY);
    while (1) {
        IplImage* frame = cvQueryFrame(capture);
        IplImage* gray_frame = cvCreateImage(cvGetSize(frame), IPL_DEPTH_8U, 1);
        cvCvtColor(frame, gray_frame, CV_RGB2GRAY);
        Mat temp(gray_frame);
        Mat thres_temp;
        thres_temp = thresholder(temp);
        //cvShowImage("Camera_Output", frame); //Show image frames on created window
        imshow("Camera_Output", thres_temp);
        key = cvWaitKey(10);
        if (char(key) == 27) {
            break; //If you hit ESC key loop will break.
        }
    }
    cvReleaseCapture(&capture);
    cvDestroyWindow("Camera_Output");
    return 0;
}
I got the error:
OpenCV Error: No GPU support (The library is compiled without CUDA support) in EmptyFuncTable::mallocPitch, file C:\builds\2_4_PackSlave-win64-vc12-shared\opencv\modules\dynamicuda\include\opencv2/dynamicuda/dynamicuda.hpp, line 126
Thanks to @BeRecursive for giving me a lead that helped solve my issue. The CMake build log lists three unavailable OpenCV modules, namely androidcamera, dynamicuda and viz. I could not find any information on dynamicuda, i.e. the module whose unavailability might have caused the error mentioned in the question, so instead I looked at the viz module and checked how it is installed.
After going through some blogs and forums I found out that the viz module is not included in the pre-built versions of OpenCV and that it is recommended to build version 2.4.9 from source. I gave it a try with VS 2013 and CMake 3.0.1, but there were many build failures and warnings. On further searching I found that CMake 3.0.x versions are not recommended for building OpenCV, as they produce many warnings.
At last I decided to switch to VS 2010 and CMake 2.8.12.2. After building the source I got no errors, and luckily, after adding all the executables, libraries and DLLs to the PATH, the program mentioned above ran without errors, but it ran very slowly! So I ran this program:
#include <cv.h>
#include <highgui.h>
#include <iostream>
#include <opencv2\opencv.hpp>
#include <opencv2\core\core.hpp>
#include <opencv2\gpu\gpu.hpp>
#include <opencv2\highgui\highgui.hpp>

using namespace std;
using namespace cv;

Mat thresholder(Mat input) {
    cout << "Beginning thresholding using GPU" << endl;
    gpu::GpuMat dst, src;
    src.upload(input);
    cout << "upload done ..." << endl;
    gpu::threshold(src, dst, 128.0, 255.0, CV_THRESH_BINARY);
    Mat result_host(dst);
    cout << "Thresholding complete!" << endl;
    return result_host;
}

int main(int argc, char** argv) {
    Mat image, gray_image;
    image = imread("desert.jpg", CV_LOAD_IMAGE_COLOR); // Read the file
    if (!image.data) {
        cout << "Could not open or find the image" << endl;
        return -1;
    }
    cout << "Original image loaded ..." << endl;
    cvtColor(image, gray_image, CV_BGR2GRAY);
    cout << "Original image converted to Grayscale" << endl;
    Mat thres_image;
    thres_image = thresholder(gray_image);
    namedWindow("Original Image", WINDOW_AUTOSIZE); // Create a window for display.
    namedWindow("Gray Image", WINDOW_AUTOSIZE);
    namedWindow("GPU Threshed Image", WINDOW_AUTOSIZE);
    imshow("Original Image", image);
    imshow("Gray Image", gray_image);
    imshow("GPU Threshed Image", thres_image);
    waitKey(0);
    return 0;
}
Later I even tested the build on VS 2013 and it also worked.
The GPU-based programs are slow for the reasons mentioned here.
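(A common source of this slowness with the 2.4 gpu module is the one-time CUDA context creation and kernel loading triggered by the first GPU call. A minimal sketch to observe it, using the same 2.4 API as the programs above; the image filename is just a placeholder.)

#include <iostream>
#include <opencv2/opencv.hpp>
#include <opencv2/gpu/gpu.hpp>

using namespace std;
using namespace cv;

int main() {
    Mat gray = imread("desert.jpg", CV_LOAD_IMAGE_GRAYSCALE); // placeholder image
    if (gray.empty()) return -1;

    gpu::GpuMat src(gray), dst;

    // First call: pays for CUDA context creation and module loading.
    int64 t0 = getTickCount();
    gpu::threshold(src, dst, 128.0, 255.0, CV_THRESH_BINARY);
    cout << "first call:  " << (getTickCount() - t0) / getTickFrequency() << " s" << endl;

    // Second call: just the kernel, usually much faster.
    t0 = getTickCount();
    gpu::threshold(src, dst, 128.0, 255.0, CV_THRESH_BINARY);
    cout << "second call: " << (getTickCount() - t0) / getTickFrequency() << " s" << endl;

    return 0;
}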
So there are three important things I want to point out:
BUILD from source only
Use a slightly older version of CMake (I used 2.8.12.2)
Prefer VS 2010 for building the binaries.
NOTE:
This might sound weird, but all of my first builds failed with linker errors. I don't know whether this is a workaround or not, but try building opencv_gpu before anything else, then the other modules one by one, and then the ALL_BUILD and INSTALL projects.
When you build this way in Debug mode you might get an error about "python27_d.lib" if you are building OpenCV with Python support; otherwise all projects should build successfully.
WEB SOURCES:
The following web sources helped me solve my problem:
http://answers.opencv.org/question/32502/opencv-249-viz-module-not-there/
http://home.eps.hw.ac.uk/~cgb7/opencv/opencv_tutorial.pdf
http://perso.uclouvain.be/allan.barrea/opencv/opencv.html
http://eavise.wikispaces.com/Building+OpenCV+yourself+on+Windows+7+x64+with+OpenCV+2.4.5+and+CUDA+5.0
https://devtalk.nvidia.com/default/topic/767647/how-do-i-enable-cuda-when-installing-opencv-/
So that is a runtime error being thrown by OpenCV. If you take a look at the CMake log from your previous question, you can see that one of the unavailable packages was dynamicuda, which appears to be what that error is complaining about.
However, I don't have a lot of experience with OpenCV on Windows, so that could be a red herring. My gut feeling is that you don't have all the libraries correctly on the path. Have you made sure that the CUDA lib/include/bin directories are on the PATH? Have you made sure that your OpenCV build lib/include directories are on the path? Windows has a very simple linking order that essentially just includes the current directory, anything on the PATH, and the main Windows directories. So I would try making sure everything is correctly on the PATH and that you have copied all the correct libraries into the folder.
A note: this is different from a compiling/linking error because it is at RUNTIME. So setting the compiler paths will not help with runtime linking errors.
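(A quick runtime check, using only the 2.4 gpu API, of which binaries are actually being loaded: if the OpenCV DLLs found on the PATH were built without CUDA, the reported device count comes back 0. A minimal sketch:)

#include <iostream>
#include <opencv2/gpu/gpu.hpp>

int main() {
    // 0 means the loaded OpenCV binaries have no CUDA support
    // (or no CUDA-capable device / driver is available).
    int n = cv::gpu::getCudaEnabledDeviceCount();
    std::cout << "CUDA-enabled devices: " << n << std::endl;

    if (n > 0) {
        // Prints name, compute capability, memory, etc. for device 0.
        cv::gpu::printShortCudaDeviceInfo(0);
    }
    return 0;
}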
Related
I've recently been looking into using OpenCL to reduce a bottleneck in our application, which is a call to cv::HOGDescriptor::compute(). Using the OpenCL version seems to be a simple case of using UMat instead of Mat and I've already had some success with the following:
// Replaced the following...
// _descriptor_calculator.compute(image, descriptor_flat);
// with...
cv::UMat input = image.getUMat(cv::ACCESS_RW);
_descriptor_calculator.compute(input, descriptor_flat);
This works for images with a resolution up to and including 2048x1536. However, for an image of 2592x1936 I get an unhandled exception, and my debugger breaks in gshandlereh.c:
GSUnwindInfo = *(PULONG)GSHandlerData;
if (IS_DISPATCHING(ExceptionRecord->ExceptionFlags)
        ? (GSUnwindInfo & UNW_FLAG_EHANDLER)
        : (GSUnwindInfo & UNW_FLAG_UHANDLER))
{
    Disposition = __CxxFrameHandler3( // <--------- breaks here
                      ExceptionRecord,
                      EstablisherFrame,
                      ContextRecord,
                      DispatcherContext
                  );
}
else
{
    Disposition = ExceptionContinueSearch;
}
Two important points to mention:
I can fix the error by downsampling the image, so it appears to be related to the size of the image
The CPU version of the code works just fine with 2592x1936 images, so I'm reasonably confident that there isn't an error with any of the input parameters
Downsampling larger images is my fallback option, but I find the conclusion that the OpenCL version of the HOG descriptor compute function can't handle images beyond a certain size a bit unsatisfactory - the fact that there is an unhandled exception rather than an assert and a sensible error message leads me to believe it is more complicated/sinister than this!
Thanks in advance
Can anybody shed any light on this? Any input would be appreciated.
EDIT:
As requested, a self-contained example. I was able to reproduce the problem by feeding in a random google image that was big enough (searched for jpgs of size 2592x1936):
#include <opencv2/opencv.hpp>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    cv::UMat img = cv::imread("image3.jpg", cv::IMREAD_COLOR).getUMat(cv::ACCESS_RW);
    uint16_t image_width = img.cols;
    uint16_t image_height = img.rows;
    int down_scale = 1;
    uint16_t _image_width = (uint16_t)ceil((float)(image_width / down_scale) / 16) * 16;
    uint16_t _image_height = (uint16_t)ceil((float)(image_height / down_scale) / 16) * 16;
    std::cout << _image_height << " x " << _image_width << std::endl;
    std::cout << (_image_width - 16) % 16 << std::endl;
    std::cout << (_image_height - 16) % 16 << std::endl;
    auto descriptor_calculator = cv::HOGDescriptor(cv::Size(_image_width, _image_height), cv::Size(16, 16), cv::Size(16, 16), cv::Size(16, 16), 9);
    std::vector<float> descriptor_flat;
    descriptor_calculator.compute(img, descriptor_flat);
    return 0;
}
Similar to the original problem, it works for small enough images and fails for large enough images. Unfortunately I did not manage to glean much more information from running outside of the IDE or setting the IDE to break on all exceptions; the only clue I have is that the exception is an access violation. I wondered if the call to ceil was causing the access violation, but the resolution is divisible by 16, so unsurprisingly changing it to floor made no difference.
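(A hedged diagnostic sketch, not verified against this crash: query which OpenCL device OpenCV selected and its allocation limits, since HOG on a 2592x1936 window creates very large intermediate buffers that could exceed the device's maximum single allocation.)

#include <iostream>
#include <opencv2/core/ocl.hpp>

int main() {
    if (!cv::ocl::haveOpenCL() || !cv::ocl::useOpenCL()) {
        std::cout << "OpenCL is not available/enabled; UMat falls back to the CPU." << std::endl;
        return 0;
    }
    cv::ocl::Device dev = cv::ocl::Device::getDefault();
    std::cout << "Device:           " << dev.name() << std::endl;
    std::cout << "Global memory:    " << dev.globalMemSize() / (1024 * 1024) << " MB" << std::endl;
    std::cout << "Max single alloc: " << dev.maxMemAllocSize() / (1024 * 1024) << " MB" << std::endl;
    return 0;
}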
Additional info as requested:
Running on Windows
Compiler/IDE is Visual Studio 2015
OpenCV version is 3.4.0
Thanks for the help, happy to provide more info
I have been using OpenCV and ImageMagick for some throughput benchmarking, and I am not finding the GPU to be much faster than CPUs. Our use case in production is to dynamically resize a master copy to the size requested in a service call, and I am trying to evaluate whether having a GPU makes sense for resizing dynamically per service call.
Sharing the code I wrote for OpenCV. I run the following function serially for all the images stored in a folder, and ultimately I run N such processes to achieve X image resizes. I want to understand whether my evaluation approach is incorrect, or whether the use case doesn't fit typical GPU use cases, and what exactly might be limiting GPU performance. I am not even getting utilization anywhere close to 100%.
resizeGPU.cpp:
// Note: the includes, the function signature and the CUDA stream are not shown
// in the original snippet; they are filled in here so the fragment is self-contained.
#include <chrono>
#include <iostream>
#include <string>
#include <opencv2/opencv.hpp>
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudawarping.hpp>

using namespace cv;
using namespace std;

void resizeGPU(const string& input_file)
{
    static cuda::Stream stream;  // assumed: declared elsewhere in the original code
    cv::Mat::setDefaultAllocator(cv::cuda::HostMem::getAllocator(cv::cuda::HostMem::AllocType::PAGE_LOCKED));
    auto t_start = std::chrono::high_resolution_clock::now();
    Mat src = imread(input_file, CV_LOAD_IMAGE_COLOR);
    auto t_end_read = std::chrono::high_resolution_clock::now();
    if (!src.data) {
        std::cout << "Image Not Found: " << input_file << std::endl;
        return;
    }
    cuda::GpuMat d_src;
    d_src.upload(src, stream);
    auto t_end_h2d = std::chrono::high_resolution_clock::now();
    cuda::GpuMat d_dst;
    cuda::resize(d_src, d_dst, Size(400, 400), 0, 0, CV_INTER_AREA, stream);
    auto t_end_resize = std::chrono::high_resolution_clock::now();
    Mat dst;
    d_dst.download(dst, stream);
    auto t_end_d2h = std::chrono::high_resolution_clock::now();
    std::cout << "read," << std::chrono::duration<double, std::milli>(t_end_read - t_start).count()
              << ",host2device," << std::chrono::duration<double, std::milli>(t_end_h2d - t_end_read).count()
              << ",resize," << std::chrono::duration<double, std::milli>(t_end_resize - t_end_h2d).count()
              << ",device2host," << std::chrono::duration<double, std::milli>(t_end_d2h - t_end_resize).count()
              << ",total," << std::chrono::duration<double, std::milli>(t_end_d2h - t_start).count() << endl;
}
resizeCPU.cpp:
auto t_start = std::chrono::high_resolution_clock::now();
Mat src = imread(input_file, CV_LOAD_IMAGE_COLOR);
auto t_end_read = std::chrono::high_resolution_clock::now();
if (!src.data) {
    std::cout << "Image Not Found: " << input_file << std::endl;
    return;
}
Mat dst;
resize(src, dst, Size(400, 400), 0, 0, CV_INTER_AREA);
auto t_end_resize = std::chrono::high_resolution_clock::now();
std::cout << "read," << std::chrono::duration<double, std::milli>(t_end_read - t_start).count()
          << ",resize," << std::chrono::duration<double, std::milli>(t_end_resize - t_end_read).count()
          << ",total," << std::chrono::duration<double, std::milli>(t_end_resize - t_start).count() << endl;
Compiling: g++ -std=c++11 resizeCPU.cpp -o resizeCPU `pkg-config --cflags --libs opencv`
I run N instances of each program, controlled by the following script (runMultipleGPU.sh):
#!/bin/bash
echo $1
START=1
END=$1
for (( c=$START; c<=$END; c++ ))
do
./resizeGPU "$c" & #>/dev/null #&disown;
done
wait
echo All done
Run: ./runMultipleGPU.sh <number_of_processes>
The timers above produce the following aggregate data:
No_processes resizeCPU resizeGPU memcpyGPU totalresizeGPU
1 1.51 0.55 2.13 2.68
10 5.67 0.37 2.43 2.80
15 6.35 2.30 12.45 14.75
20 6.30 2.05 10.56 12.61
30 8.09 4.57 23.97 28.55
No of images run per process : 267
Average size of the image: 624Kb
According to the data above, as we increase the number of processes (leading to more simultaneous resizes), the resize time on the GPU (which includes the actual resize plus the host-to-device and device-to-host copies) increases much more sharply than on the CPU.
I got similar results with ImageMagick, which uses OpenCL underneath.
Code :
setenv("MAGICK_OCL_DEVICE","OFF",1); //Turn it ON to use GPU acceleration
Image image;
auto t_start_read = std::chrono::high_resolution_clock::now();
image.read( full_path );
auto t_end_read = std::chrono::high_resolution_clock::now();
image.resize( Geometry(400,400) );
auto t_end_resize = std::chrono::high_resolution_clock::now();
Results :
No_procs resizeCPU resizeGPU
1 63.23 8.54
10 76.16 31.04
15 76.56 50.79
20 76.58 71.68
30 86.29 140.17
Test Machine configuration:
4 GPUs (Tesla P100) - but the test only utilizes 1 GPU
64 CPU cores (Intel Xeon 2680 v4 CPUs)
OpenCV version : 3.4.0
ImageMagick version : 6.9.9-26 Q16 x86_64 2018-01-17
Cuda Toolkit : 9.0
This is very probably too late to help you; however, for people looking at this answer, this is my suggestion to improve performance. The way you are setting up pinned memory does not give you the boost you are looking for.
That is, using:
//method 1
cv::Mat::setDefaultAllocator(cv::cuda::HostMem::getAllocator(cv::cuda::HostMem::AllocType::PAGE_LOCKED));
In the comments of this discussion, somebody suggested doing it the way you do, and the person answering said that it was slower. I timed an implementation of Sobel derivatives close to the one in the Coldvision.io Sobel example. The main steps are:
read the color image
Gaussian blurring of the color image using a radius of 3 and a delta of 1
grayscale conversion
computing the x and y gradients
merging them into the final output image
Instead I implemented a version swapping the order of steps 2 and 3: converting to grayscale first and then denoising the result with a Gaussian blur.
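(The following is only an illustrative sketch of that swapped pipeline using the OpenCV 3.x CUDA modules, not the exact code that was timed; filenames and filter parameters are placeholders.)

#include <opencv2/opencv.hpp>
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaimgproc.hpp>
#include <opencv2/cudafilters.hpp>
#include <opencv2/cudaarithm.hpp>

int main() {
    cv::Mat input = cv::imread("input.jpg", cv::IMREAD_COLOR); // placeholder image
    if (input.empty()) return -1;

    cv::cuda::GpuMat d_color(input), d_gray, d_blur, d_gx, d_gy, d_out;

    // Step 1: grayscale conversion on the GPU.
    cv::cuda::cvtColor(d_color, d_gray, cv::COLOR_BGR2GRAY);

    // Step 2: denoise the grayscale image with a Gaussian filter.
    cv::Ptr<cv::cuda::Filter> gauss =
        cv::cuda::createGaussianFilter(d_gray.type(), d_gray.type(), cv::Size(3, 3), 1.0);
    gauss->apply(d_gray, d_blur);

    // Step 3: x and y gradients with Sobel filters.
    cv::Ptr<cv::cuda::Filter> sobelX = cv::cuda::createSobelFilter(d_blur.type(), CV_16S, 1, 0);
    cv::Ptr<cv::cuda::Filter> sobelY = cv::cuda::createSobelFilter(d_blur.type(), CV_16S, 0, 1);
    sobelX->apply(d_blur, d_gx);
    sobelY->apply(d_blur, d_gy);

    // Step 4: merge the gradients into the output image.
    cv::cuda::addWeighted(d_gx, 0.5, d_gy, 0.5, 0.0, d_out);

    cv::Mat result, result8u;
    d_out.download(result);
    cv::convertScaleAbs(result, result8u); // back to 8-bit for saving
    cv::imwrite("sobel.png", result8u);
    return 0;
}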
I was running OpenCV 3.4 on Windows 10 with CUDA 9.0. My CPU is an i7-6820HQ and my GPU is a Quadro M1200.
I tried your method and this one:
//Method 2
//allocate pinned memory (siz holds the image dimensions, known beforehand)
cv::cuda::HostMem memory(siz.height, siz.width, CV_8U, cv::cuda::HostMem::PAGE_LOCKED);
//Read input image from the disk
Mat input = imread(input_file, CV_LOAD_IMAGE_COLOR);
if (input.empty())
{
    std::cout << "Image Not Found: " << input_file << std::endl;
    return;
}
input.copyTo(memory);
// copy the input image from CPU to GPU memory
cuda::GpuMat gpuInput;
cv::cuda::Stream stream;
gpuInput.upload(memory, stream);
//Do your processing here, producing a GpuMat named gpuOutput...
//allocate pinned memory for the output
cv::cuda::HostMem outMemory(siz.height, siz.width, CV_8U, cv::cuda::HostMem::PAGE_LOCKED);
gpuOutput.download(outMemory, stream);
cv::Mat output = outMemory.createMatHeader();
I calculated the gain as (t1-t2)/t1*100, where t1 is the time running the code normally and t2 is the time running it using pinned memory. Negative values mean the method is slower than running with non-pinned memory.
image size Gain % Method 1 Gain % Method 2
800x600 2.9 8.2
1280x1024 2.5 15.3
1600x1200 0.2 7.0
2048x1536 -2.3 14.6
4096x3072 -1.0 17.2
I am working with OpenCV 3.4 and CLion 2017.3. I have built OpenCV with MinGW and CMake, and I can use the library in my code without problems. However, when I try to run this test code in main.cpp, which just prints a matrix, I get a crash at some step of the << operator's execution:
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <iostream>

using namespace cv;
using namespace std;

int main() {
    Mat img(7, 6, CV_8UC3, Scalar(126, 0, 255));
    cout << img << endl << endl << endl;
    cout << "end!";
    return 0;
}
If I run the code more than once I get these results:
If I increase the dimensions to some big number, the code crashes at around line ~30 of the output, or earlier.
EDIT: Apparently the problem is not related to OpenCV; I get this output when printing the numbers from 1 to 1000 in a loop:
After googling anything console-related about CLion, I found other reports of corrupted console output.
The proposed solution was to tweak the run.processes.with.pty setting in the idea.properties file; see this YouTrack answer for details.
That solved my issue, too.
I've been using OpenNI+PrimeSense+NiTE with OpenCV in my project to segment objects according to their distances. However, I meant to deploy it on an NVIDIA Jetson TX1 board, and I couldn't manage to compile OpenNI+PrimeSense+NiTE with OpenCV on it.
I ended up with libfreenect. However, the depth map provided by libfreenect is very, very wrong. I'll share some examples.
Here is the working depth map of OpenNI:
OpenNI Depth Map
The libfreenect wrong depth map is here: Libfreenect Depth Map
I based my libfreenect code on the default C++ wrapper on the OpenKinect website.
Can someone help me here? Thank you so much.
Well, for those working with libfreenect on ARM or AArch64 architectures (mainly Jetson TX1) because OpenNI and SensorKinect are problematic to build there: I made some adjustments to the OpenNI and SensorKinect sources so they run on AArch64, which avoids having to use libfreenect.
Links: OpenNI for TX1 and SensorKinect for TX1
It looks like the two libraries use different mappings of the depth data. You can try to put the libfreenect data into a cv::Mat and scale that:
const float scaleFactor = 0.05f;
depth.convertTo(depthMat8UC1, CV_8UC1, scaleFactor);
imshow("depth gray",depthMat8UC1);
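(The snippet above assumes the depth data is already in a cv::Mat called depth. A hedged sketch of that wrapping step, where the buffer pointer is a hypothetical stand-in for the 640x480 16-bit frame handed out by the libfreenect depth callback or C++ wrapper:)

#include <cstdint>
#include <opencv2/core/core.hpp>

// depth_buffer: hypothetical pointer to one 640x480 16-bit depth frame
// delivered by libfreenect; the Mat header wraps it without copying.
cv::Mat wrapFreenectDepth(uint16_t* depth_buffer)
{
    cv::Mat depth(cv::Size(640, 480), CV_16UC1, depth_buffer);
    // Clone so the data outlives libfreenect's internal buffer.
    return depth.clone();
}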
You can also check out this article on building OpenNI2 on a Jetson TK1.
Once you have OpenNI set up and working, you should be able to compile OpenCV from source with WITH_OPENNI enabled in CMake. After that you should be able to grab the depth data directly in OpenCV:
#include "opencv2/core/core.hpp"
#include "opencv2/highgui/highgui.hpp"
#include "opencv2/imgproc/imgproc.hpp"
#include <iostream>

using namespace cv;
using namespace std;

const float scaleFactor = 0.05f;

int main(){
    cout << "opening device(s)" << endl;

    VideoCapture sensor;
    sensor.open(CV_CAP_OPENNI);

    if( !sensor.isOpened() ){
        cout << "Can not open capture object 1." << endl;
        return -1;
    }

    for(;;){
        Mat depth, depthScaled;
        if( !sensor.grab() ){
            cout << "Sensor1 can not grab images." << endl;
            return -1;
        }else if( sensor.retrieve( depth, CV_CAP_OPENNI_DEPTH_MAP ) ) {
            depth.convertTo(depthScaled, CV_8UC1, scaleFactor);
            imshow("depth", depth);
            imshow("depth scaled", depthScaled);
        }
        if( waitKey( 30 ) == 27 ) break;
    }
}
I am trying to learn how to use the GPU functions in OpenCV. I have built everything with CUDA, and if I run
cout << " Number of devices " << cv::gpu::getCudaEnabledDeviceCount() << endl;
I get the answer 1 device, so at least something seems to work. However, when I try the following piece of code, it just prints out the message and then nothing happens. It gets stuck on
cv::gpu::cvtColor(input_gpu, output_gpu, CV_BGR2GRAY);
Here is the code
#include <iostream>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <opencv2/opencv.hpp>

using std::cout;
using std::endl;

int main(void){
    cv::Mat input = cv::imread("image.jpg");
    if (input.empty()){
        cout << "Image Not Found" << endl;
        return -1;
    }

    cv::Mat output;

    // Declare the input and output GpuMat
    cv::gpu::GpuMat input_gpu;
    cv::gpu::GpuMat output_gpu;

    cout << "Number of devices: " << cv::gpu::getCudaEnabledDeviceCount() << endl;

    // Copy the input cv::Mat to device.
    // Device memory will be allocated automatically according to the parameters of the input image
    input_gpu.upload(input);

    // Convert the input image to grayscale on the GPU
    cv::gpu::cvtColor(input_gpu, output_gpu, CV_BGR2GRAY);

    // Copy the result from GPU back to host
    output_gpu.download(output);

    cv::imshow("Input", input);
    cv::imshow("Output", output);
    cv::waitKey(0);
    return 0;
}
I just found this issue, and it seems to be a problem with the Maxwell architecture, but that post is over a year old. Has anybody else experienced the same problem? I am using Windows 7, Visual Studio 2013 and an NVIDIA GeForce GTX 770.
/ Erik
OK, I do not really know what the problem was, but in CMake I changed CUDA_GENERATION to Kepler, which is the microarchitecture of my GPU, recompiled, and now the code works as it should.
Interestingly, there were only Fermi and Kepler to choose from, so I do not know whether one would run into problems with Maxwell.
/ Erik
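(For anyone hitting something similar: a hedged sketch, using the 2.4 gpu API, of how to check at runtime whether the OpenCV binaries actually contain code for the device's compute capability, which is what CUDA_ARCH_BIN / CUDA_ARCH_PTX / CUDA_GENERATION control.)

#include <iostream>
#include <opencv2/gpu/gpu.hpp>

int main() {
    if (cv::gpu::getCudaEnabledDeviceCount() == 0) {
        std::cout << "No CUDA support in this build, or no CUDA device." << std::endl;
        return -1;
    }

    cv::gpu::DeviceInfo info(0);
    std::cout << info.name() << " has compute capability "
              << info.majorVersion() << "." << info.minorVersion() << std::endl;

    // True only if the library was built with binary or PTX code usable by this device.
    std::cout << "Compatible with this OpenCV build: "
              << (info.isCompatible() ? "yes" : "no") << std::endl;
    return 0;
}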