dgemm nvblas gpu offload - armadillo

I have a test application that performs matrix multiplication, and I tried to offload it to the GPU with NVBLAS.
#include <armadillo>
#include <cstdlib>
#include <iostream>
using namespace arma;
using namespace std;
int main(int argc, char *argv[]) {
    // Expect four arguments: m, k, n and the number of repetitions t.
    if (argc < 5) {
        cerr << "usage: " << argv[0] << " m k n t" << endl;
        return 1;
    }
    int m = atoi(argv[1]);
    int k = atoi(argv[2]);
    int n = atoi(argv[3]);
    int t = atoi(argv[4]);
    std::cout << "m::" << m << "::k::" << k << "::n::" << n << std::endl;
    mat A;
    A = randu<mat>(m, k);
    mat B;
    B = randu<mat>(k, n);
    mat C;
    C.zeros(m, n);
    cout << "norm c::" << arma::norm(C, "fro") << std::endl;
    tic();
    // Repeat the multiplication t times and report the average time.
    for (int i = 0; i < t; i++) {
        C = A * B;
    }
    cout << "time taken ::" << toc()/t << endl;
    cout << "norm c::" << arma::norm(C, "fro") << std::endl;
}
I compiled the code as follows.
CPU
g++ testmm.cpp -I$ARMADILLO_INCLUDE_DIR -lopenblas -L$OPENBLAS_ROOT/lib/ -std=c++11 -o a.cpu.out
GPU
g++ testmm.cpp -I$ARMADILLO_INCLUDE_DIR -lopenblas -L$OPENBLAS_ROOT/lib/ -std=c++11 -lnvblas -L$CUDATOOLKIT_HOME/lib64 -o a.cuda.out
When I run a.cpu.out and a.cuda.out with 4096 4096 4096, both take about the same time, around 11 seconds. I am not seeing any reduction in time with a.cuda.out. In nvblas.conf, I left everything at the defaults except (a) changing the path to OpenBLAS and (b) enabling auto-pinned memory. nvblas.log says it is using "Devices 0" and shows no other output. nvidia-smi shows no increase in GPU activity, and nvprof shows a bunch of cudaMalloc, cudaMemcpy, device capability query calls, etc., but no gemm call.
ldd on a.cuda.out shows it is linked with nvblas, cublas, cudart and the CPU OpenBLAS library. Am I making a mistake here?

The link order was the problem. It was resolved when I built the GPU version as follows, with -lnvblas placed before -lopenblas:
GPU
g++ testmm.cpp -lnvblas -L$CUDATOOLKIT_HOME/lib64 -I$ARMADILLO_INCLUDE_DIR -lopenblas -L$OPENBLAS_ROOT/lib/ -std=c++11 -o a.cuda.out
With the above, dumping the symbol table gives the following output.
nm a.cuda.out | grep -is dgemm
U cblas_dgemm
U dgemm_@@libnvblas.so.9.1 <-- this shows correct linking and the ability to offload to the GPU.
If it is not linked properly, the output looks like this instead.
nm a.cuda.out | grep -is dgemm
U cblas_dgemm
U dgemm_ <-- no libnvblas version tag here, which indicates the problem.
ldd shows nvblas, cublas, cudart and openblas in both cases, but in this second case dgemm always resolves to OpenBLAS when the program runs.
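
The effect of the link order can be pictured with a toy model. The sketch below is not how the real linker or loader is implemented. It only assumes that an undefined symbol binds to the first library on the command line that exports it. The Library type, resolveSymbol and the two export lists are hypothetical stand-ins (NVBLAS exporting dgemm_, OpenBLAS exporting both dgemm_ and cblas_dgemm).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of a shared library: a name plus the symbols it exports.
// The export lists used below are illustrative assumptions, not real symbol tables.
struct Library {
    std::string name;
    std::vector<std::string> exports;
};

// Return the name of the first library, in link order, that exports `symbol`.
// An empty string means the symbol stays unresolved in this model.
std::string resolveSymbol(const std::vector<Library>& linkOrder,
                          const std::string& symbol) {
    assert(!symbol.empty());
    for (const Library& lib : linkOrder) {
        for (const std::string& exported : lib.exports) {
            if (exported == symbol) {
                return lib.name;
            }
        }
    }
    return "";
}

// Hypothetical stand-ins for the two libraries from the post.
inline Library nvblasModel() { return {"libnvblas", {"dgemm_"}}; }
inline Library openblasModel() { return {"libopenblas", {"dgemm_", "cblas_dgemm"}}; }
```

With the order {nvblasModel(), openblasModel()} the model binds dgemm_ to libnvblas, and with the order reversed it binds dgemm_ to libopenblas. That is consistent with the two nm outputs above, where cblas_dgemm stays untagged in both cases.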

Data transfer between LibTorch C++ and Eigen

Hello all,
I'm developing a set of data transfer tools for C++ linear algebra libraries, as you can see here:
https://github.com/andrewssobral/dtt
(considering bi-dimensional arrays or matrices)
and I'm wondering if you can help me with the following code for data transfer between LibTorch and Eigen:
std::cout << "Testing LibTorch to Eigen:" << std::endl;
// LibTorch
torch::Device device(torch::cuda::is_available() ? torch::kCUDA : torch::kCPU);
torch::Tensor T = torch::rand({3, 3});
std::cout << "LibTorch:" << std::endl;
std::cout << T << std::endl;
// Eigen
float* data = T.data_ptr<float>();
Eigen::Map<Eigen::MatrixXf> E(data, T.size(0), T.size(1));
std::cout << "EigenMat:\n" << E << std::endl;
// re-check after changes
E(0,0) = 0;
std::cout << "EigenMat:\n" << E << std::endl;
std::cout << "LibTorch:" << std::endl;
std::cout << T << std::endl;
This is the output of the code:
--------------------------------------------------
Testing LibTorch to Eigen:
LibTorch:
0.6232 0.5574 0.6925
0.7996 0.9860 0.1471
0.4431 0.5914 0.8361
[ Variable[CPUFloatType]{3,3} ]
EigenMat (after data transfer):
0.6232 0.7996 0.4431
0.5574 0.986 0.5914
0.6925 0.1471 0.8361
# Modifying EigenMat, set element at (0,0) = 0
EigenMat:
0 0.7996 0.4431
0.5574 0.986 0.5914
0.6925 0.1471 0.8361
# Now, the LibTorch matrix was also modified (OK), but the rows and columns were switched.
LibTorch:
0.0000 0.5574 0.6925
0.7996 0.9860 0.1471
0.4431 0.5914 0.8361
[ Variable[CPUFloatType]{3,3} ]
Does anyone know what's happening?
Is there a better way to do this?
I need also to do the same for Armadillo, ArrayFire and OpenCV (cv::Mat).
Thanks in advance!
The reason for the switched rows and columns is that LibTorch (apparently) uses row-major storage, while Eigen by default uses column-major storage. I don't know if you can change the behavior of LibTorch, but with Eigen you can also use row-major storage, like so:
typedef Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> MatrixXf_rm; // same as MatrixXf, but with row-major memory layout
and then use it like this:
Eigen::Map<MatrixXf_rm> E(data, T.size(0), T.size(1));
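
The layout mismatch can be reproduced without either library. The following is a minimal sketch in plain C++ that reads one flat buffer under both conventions. The helper names rowMajorAt and colMajorAt are made up for illustration and are not part of Eigen or LibTorch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element (r, c) of a rows x cols matrix whose buffer is stored row by row
// (the layout the answer above attributes to LibTorch).
float rowMajorAt(const std::vector<float>& buf, std::size_t rows,
                 std::size_t cols, std::size_t r, std::size_t c) {
    assert(buf.size() == rows * cols && r < rows && c < cols);
    return buf[r * cols + c];
}

// Element (r, c) of the same buffer read column by column
// (Eigen's default layout).
float colMajorAt(const std::vector<float>& buf, std::size_t rows,
                 std::size_t cols, std::size_t r, std::size_t c) {
    assert(buf.size() == rows * cols && r < rows && c < cols);
    return buf[c * rows + r];
}
```

For a square n x n buffer, colMajorAt(buf, n, n, r, c) returns the same value as rowMajorAt(buf, n, n, c, r), which is the transpose seen in the output above. Mapping the data with Eigen::RowMajor makes Eigen index the buffer the first way.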

<< operator is crashing program with exit code 0 at some random step of printing

I am working with OpenCV 3.4 and CLion 2017.3. I have built OpenCV with MinGW and CMake, and I can use the library in my code without problems. However, when I try to run this test code in main.cpp, which just prints a matrix, I get a crash at some point during execution of the << operator:
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <iostream>
using namespace cv;
using namespace std;
int main() {
    // 7x6 three-channel image filled with a constant colour
    Mat img(7, 6, CV_8UC3, Scalar(126, 0, 255));
    cout << img << endl << endl << endl;
    cout << "end!";
    return 0;
}
If I run the code more than once I get these results:
If I increase the dimensions to some large number, the code crashes at around line 30 of the output or earlier.
EDIT: Apparently the problem is not related to OpenCV; I get this output when printing the numbers from 1 to 1000 in a loop:
After googling anything console-related about CLion, I found other reports of corrupted console output.
The proposed solution was to tweak the run.processes.with.pty setting in the idea.properties file. See this YouTrack answer for details.
That solved my issue, too.

gpu::cvtColor(input_gpu, output_gpu, CV_BGR2GRAY)

I am trying to learn how to use the GPU programs in OpenCV. I have built everything with CUDA and if I run
cout << " Number of devices " << cv::gpu::getCudaEnabledDeviceCount() << endl;
I get the answer 1 device, so at least something seems to work. However, when I try the following piece of code, it just prints out the message and then nothing happens. It gets stuck on
cv::gpu::cvtColor(input_gpu, output_gpu, CV_BGR2GRAY);
Here is the code
#include<iostream>
#include<opencv2/core/core.hpp>
#include<opencv2/highgui/highgui.hpp>
#include<opencv2/imgproc/imgproc.hpp>
#include<opencv2/gpu/gpu.hpp>
#include <opencv2/opencv.hpp>
using std::cout;
using std::endl;
int main(void){
cv::Mat input = cv::imread("image.jpg");
if (input.empty()){
cout << "Image Not Found" << endl;
return -1;
}
cv::Mat output;
// Declare the input and output GpuMat
cv::gpu::GpuMat input_gpu;
cv::gpu::GpuMat output_gpu;
cout << "Number of devices: " << cv::gpu::getCudaEnabledDeviceCount() << endl;
// Copy the input cv::Mat to device.
// Device memory will be allocated automatically according to the parameters of input image
input_gpu.upload(input);
// Convert the input image to grayScale on GPU
cv::gpu::cvtColor(input_gpu, output_gpu, CV_BGR2GRAY);
//// Copy the result from GPU back to host
output_gpu.download(output);
cv::imshow("Input", input);
cv::imshow("Output", output);
cv::waitKey(0);
return 0;
}
I just found this issue and it seems to be a problem with the Maxwell architecture, but that post is over a year old. Has anybody else experienced the same problem? I am using windows 7, Visual Studio 2013 and an Nvidia Geforce GTX 770.
/ Erik
Ok, I do not really know what the problem was, but in CMake, I changed the CUDA_GENERATION to Kepler, which is the micro architecture of my GPU, then I recompiled it, and now the code works as it should.
Interestingly, there were only Fermi and Kepler to choose from, so I do not know whether one will run into problems with Maxwell.
/ Erik

CUDA not running in OpenCV even after successful build

I am trying to build OpenCV 2.4.10 on a Win 8.1 machine with CUDA 6.5. I have other third-party libraries as well and they installed successfully. I ran a simple GPU-based program and got the error "No GPU found or the library was compiled without GPU support". I also ran the sample exe files, such as performance_gpu.exe, that were built during the installation and got the same error. I had the WITH_CUDA flag checked. The following CUDA-related flags were set during the CMake build.
WITH_CUDA : Checked
WITH_CUBLAS : Checked
WITH_CUFFT : Checked
CUDA_ARCH_BIN : 1.1 1.2 1.3 2.0 2.1(2.0) 3.0 3.5
CUDA_ARCH_PTX : 3.0
CUDA_FAST_MATH : Checked
CUDA_GENERATION : Auto
CUDA_HOST_COMPILER : $(VCInstallDir)bin
CUDA_SEPARABLE_COMPILATION : Unchecked
CUDA_TOOLKIT_ROOT_DIR : C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v6.5
Another thing: in some posts I have read that the build takes a lot of time with CUDA enabled. My build takes about 3 hours, most of which is spent compiling the .cu files. As far as I know, I did not get any errors while those files were compiled.
In some posts I have seen people mention a directory named gpu inside the build directory, but I don't see one in mine!
I am using Visual Studio 2013.
What could be the issue? Please help!
UPDATE:
I again tried to build opencv and this time before starting the build I added the bin, lib and include directories of CUDA. After the build in E:\opencv\build\bin\Release I ran gpu_perf4au.exe and I got this output
[----------]
[ INFO ] Implementation variant: cuda.
[----------]
[----------]
[ GPU INFO ] Run test suite on GeForce GTX 860M GPU.
[----------]
Time compensation is 0
OpenCV version: 2.4.10
OpenCV VCS version: unknown
Build type: release
Parallel framework: tbb
CPU features: sse sse2 sse3 ssse3 sse4.1 sse4.2 avx avx2
[----------]
[ GPU INFO ] Run on OS Windows x64.
[----------]
*** CUDA Device Query (Runtime API) version (CUDART static linking) ***
Device count: 1
Device 0: "GeForce GTX 860M"
CUDA Driver Version / Runtime Version 6.50 / 6.50
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 2048 MBytes (2147483648 bytes)
GPU Clock Speed: 1.02 GHz
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.50, CUDA Runtime Version = 6.50, NumDevs = 1
I thought that everything was fine, but after running this program, for which I had included all OpenCV and CUDA directories in its property files,
#include <cv.h>
#include <highgui.h>
#include <iostream>
#include <opencv2\opencv.hpp>
#include <opencv2\gpu\gpu.hpp>
using namespace std;
using namespace cv;
char key;
Mat thresholder (Mat input) {
gpu::GpuMat dst, src;
src.upload(input);
gpu::threshold(src, dst, 128.0, 255.0, CV_THRESH_BINARY);
Mat result_host(dst);
return result_host;
}
int main(int argc, char* argv[]) {
cvNamedWindow("Camera_Output", 1);
CvCapture* capture = cvCaptureFromCAM(CV_CAP_ANY);
while (1){
IplImage* frame = cvQueryFrame(capture);
IplImage* gray_frame = cvCreateImage(cvGetSize(frame), IPL_DEPTH_8U, 1);
cvCvtColor(frame, gray_frame, CV_RGB2GRAY);
Mat temp(gray_frame);
Mat thres_temp;
thres_temp = thresholder(temp);
//cvShowImage("Camera_Output", frame); //Show image frames on created window
imshow("Camera_Output", thres_temp);
key = cvWaitKey(10);
if (char(key) == 27){
break; //If you hit ESC key loop will break.
}
}
cvReleaseCapture(&capture);
cvDestroyWindow("Camera_Output");
return 0;
}
I got the error:
OpenCV Error: No GPU support (The library is compiled without CUDA support) in EmptyFuncTable::mallocPitch, file C:\builds\2_4_PackSlave-win64-vc12-shared\opencv\modules\dynamicuda\include\opencv2/dynamicuda/dynamicuda.hpp, line 126
Thanks to @BeRecursive for giving me a lead to solve my issue. The CMake build log lists three unavailable OpenCV modules, namely androidcamera, dynamicuda and viz. I could not find any information on dynamicuda, i.e. the module whose unavailability might have caused the error mentioned in the question. Instead I searched for the viz module and checked how it is installed.
After going through some blogs and forums I found out that the viz module is not included in the pre-built versions of OpenCV. It was recommended to build version 2.4.9 from source. I gave it a try with VS 2013 and CMake 3.0.1, but there were many build failures and warnings. Upon further searching I found that CMake 3.0.x versions aren't recommended for building OpenCV because they produce many warnings.
At last I decided to switch to VS 2010 and CMake 2.8.12.2. After building from source I got no errors and luckily, after adding all executables, libraries and DLLs to the PATH, the program mentioned above ran without errors, but it was running very slowly! So I ran this program:
#include <cv.h>
#include <highgui.h>
#include <iostream>
#include <opencv2\opencv.hpp>
#include <opencv2\core\core.hpp>
#include <opencv2\gpu\gpu.hpp>
#include <opencv2\highgui\highgui.hpp>
using namespace std;
using namespace cv;
Mat thresholder(Mat input) {
cout << "Beginning thresholding using GPU" << endl;
gpu::GpuMat dst, src;
src.upload(input);
cout << "upload done ..." << endl;
gpu::threshold(src, dst, 128.0, 255.0, CV_THRESH_BINARY);
Mat result_host(dst);
cout << "Thresolding complete!" << endl;
return result_host;
}
int main(int argc, char** argv) {
Mat image, gray_image;
image = imread("desert.jpg", CV_LOAD_IMAGE_COLOR); // Read the file
if (!image.data) {
cout << "Could not open or find the image" << endl;
return -1;
}
cout << "Orignal image loaded ..." << endl;
cvtColor(image, gray_image, CV_BGR2GRAY);
cout << "Original image converted to Grayscale" << endl;
Mat thres_image;
thres_image = thresholder(gray_image);
namedWindow("Original Image", WINDOW_AUTOSIZE);// Create a window for display.
namedWindow("Gray Image", WINDOW_AUTOSIZE);
namedWindow("GPU Threshed Image", WINDOW_AUTOSIZE);
imshow("Original Image", image);
imshow("Gray Image", gray_image);
imshow("GPU Threshed Image", thres_image);
waitKey(0);
return 0;
}
Later I even tested the build on VS 2013 and it also worked.
The GPU based programs are slow due to reasons mentioned here.
So there are three important things I want to point out:
Build from source only.
Use a slightly older version of CMake.
Prefer VS 2010 for building the binaries.
NOTE:
This might sound weird, but all my first builds failed due to a linker error. I don't know whether this is a proper workaround, but try building opencv_gpu before anything else, then all other modules one by one, and finally the ALL_BUILD and INSTALL projects.
When you build this way in Debug mode you might get an error about "python27_d.lib" if you are building OpenCV with Python support; otherwise all projects will build successfully.
WEB SOURCES:
Following are web sources that helped me in solving my problem:
http://answers.opencv.org/question/32502/opencv-249-viz-module-not-there/
http://home.eps.hw.ac.uk/~cgb7/opencv/opencv_tutorial.pdf
http://perso.uclouvain.be/allan.barrea/opencv/opencv.html
http://eavise.wikispaces.com/Building+OpenCV+yourself+on+Windows+7+x64+with+OpenCV+2.4.5+and+CUDA+5.0
https://devtalk.nvidia.com/default/topic/767647/how-do-i-enable-cuda-when-installing-opencv-/
So that is a runtime error, thrown by OpenCV. If you take a look at the CMake log from your previous question, you can see that one of the unavailable packages was dynamicuda, which appears to be what that error is complaining about.
However, I don't have a lot of experience with OpenCV on Windows, so that could be a red herring. My gut feeling says that you don't have all the libraries correctly on the path. Have you made sure that you have the CUDA lib/include/bin directories on the PATH? Have you made sure that your OpenCV build lib/include directories are on the PATH? Windows has a very simple library search order that essentially just covers the current directory, anything on the PATH and the main Windows directories. So I would make sure that everything is correctly on the PATH, or that you have copied all the correct libraries into the folder.
A note: this is different from a compiling/linking error because it is at RUNTIME. So setting the compiler paths will not help with runtime linking errors.

Opencv Error on Ubuntu Webcam (Logitech C270) Capture -> HIGHGUI ERROR: V4L/V4L2: VIDIOC_S_CROP

This error message appears when running a simple camera capture on Ubuntu with a Logitech C270 (OpenCV 2.4.2/C++):
HIGHGUI ERROR: V4L/V4L2: VIDIOC_S_CROP
and further:
Corrupt JPEG data: 2 extraneous bytes before marker 0xd1
Corrupt JPEG data: 1 extraneous bytes before marker 0xd6
Corrupt JPEG data: 1 extraneous bytes before marker 0xd0
Corrupt JPEG data: 1 extraneous bytes before marker 0xd0
I get frames, but the frame width and height appear swapped when writing to a Mat object, see below:
Mat frame;
videoCapture = new VideoCapture(camId);
if(!videoCapture->isOpened()) throw Exception();
cout << "Frame width: " << videoCapture->get(CV_CAP_PROP_FRAME_WIDTH) << endl;
cout << "Frame height: " << videoCapture->get(CV_CAP_PROP_FRAME_HEIGHT) << endl;
(*videoCapture) >> frame;
cout << "Mat width: " << frame.rows << endl;
cout << "Mat height: " << frame.cols << endl;
Output:
Frame width: 640
Frame height: 480
Mat width: 480
Mat height: 640
If you don't feel like debugging the problem, and the frames from your webcam are being displayed without any issues, your option is to just shoot the messenger. The instructions below work if you have built OpenCV from source, as opposed to installing pre-built binaries.
Start with grep -R "Corrupt JPEG data" ~/src/opencv-2.4.4/ and go deeper into the rabbit hole until you find what you want. In my case the culprit is at opencv-2.4.4/thirdparty/libjpeg/jdmarker.c:908:
if (cinfo->marker->discarded_bytes != 0) {
WARNMS2(cinfo, JWRN_EXTRANEOUS_DATA, cinfo->marker->discarded_bytes, c);
cinfo->marker->discarded_bytes = 0;
}
The WARNMS2 macro is what's causing the error messages about extraneous data to be printed. Just comment it out, rebuild OpenCV and carry on with your work. I also have a C270, run Ubuntu 12.04, and experienced the same nagging error message until I did what I described above.
About the issue:
Corrupt JPEG data: 2 extraneous bytes before marker 0xd1
Corrupt JPEG data: 1 extraneous bytes before marker 0xd6
Corrupt JPEG data: 1 extraneous bytes before marker 0xd0
Corrupt JPEG data: 1 extraneous bytes before marker 0xd0
It looks like the issue is in the libjpeg library. For some unknown reason it does not work correctly under OpenCV. I compiled OpenCV without JPEG support and that solved the issue.
cmake -D CMAKE_BUILD_TYPE=Release -D CMAKE_INSTALL_PREFIX=/usr/local -D BUILD_SHARED_LIBS=OFF -D BUILD_EXAMPLES=OFF -D BUILD_TESTS=OFF -D BUILD_PERF_TESTS=OFF -D WITH_JPEG=OFF -D WITH_IPP=OFF ..
You can find all details in my blog:
http://privateblog.by/linux/opencv-i-corrupt-jpeg-data-na-linux/
The width of an image is given by its number of columns. Your code should be
cout << "Mat width: " << frame.cols << endl;
cout << "Mat height: " << frame.rows << endl;
So there is no swap between width and height.
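
As a reminder of the convention, here is a small sketch using a hypothetical ImageShape struct. It is not an OpenCV type and only mirrors the rows and cols fields of cv::Mat. The offset helper assumes a single-channel buffer stored row by row without padding.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the shape of a cv::Mat; only rows and cols are modelled.
struct ImageShape {
    int rows;  // number of pixel rows, i.e. the image height
    int cols;  // number of pixel columns, i.e. the image width

    int width() const { return cols; }
    int height() const { return rows; }

    // Offset of pixel (x, y) in a single-channel, row-major buffer without padding.
    std::size_t offset(int x, int y) const {
        assert(x >= 0 && x < cols && y >= 0 && y < rows);
        return static_cast<std::size_t>(y) * static_cast<std::size_t>(cols)
               + static_cast<std::size_t>(x);
    }
};
```

For the 640x480 capture above, ImageShape{480, 640} reports a width of 640 and a height of 480, matching the values returned by the capture properties.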
I would post this as a comment (not enough reputation), still, I got stuck here and the solution I found, though not elegant, was:
python my_app.py 2>&1 | grep -v "Corrupt JPEG data"
Note: To replicate normal python print statements behaviour I'm using os.system(f'echo {my_string}')
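
For a C++ application, the same filtering idea can be sketched as a small line filter placed at the end of a pipe. This only illustrates what grep -v does to the merged stream. The function name dropMatchingLines is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Copy every line from `in` to `out`, except lines that contain `needle`.
// Returns the number of lines that were dropped.
std::size_t dropMatchingLines(std::istream& in, std::ostream& out,
                              const std::string& needle) {
    assert(!needle.empty());
    std::size_t dropped = 0;
    std::string line;
    while (std::getline(in, line)) {
        if (line.find(needle) != std::string::npos) {
            ++dropped;
        } else {
            out << line << '\n';
        }
    }
    return dropped;
}

// Convenience wrapper for in-memory text: returns the text without matching lines.
std::string dropMatchingLines(const std::string& text, const std::string& needle) {
    std::istringstream in(text);
    std::ostringstream out;
    dropMatchingLines(in, out, needle);
    return out.str();
}
```

A filter program would call dropMatchingLines(std::cin, std::cout, "Corrupt JPEG data") in its main function and receive the application's output after the shell has merged stderr into stdout with 2>&1.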
If you just want to get rid of the output quickly and grep -v Corrupt somehow does not work for you (as it did not for me), you could also redirect stderr to nothing, e.g.
./my_app 2> /dev/null
python my_app.py 2> /dev/null
This will of course hide other error messages, too.
