Does the Tegra K1 support RenderScript on the GPU? I used a Mi Pad, wrote a sample RS kernel, and ran it, but CPU usage averages around 95%. The kernel looks like this:
#pragma version(1)
#pragma rs java_package_name(com.example.android.rs.hellocomputendk)
#pragma rs_fp_relaxed
void root(const uchar4 *v_in, uchar4 *v_out) {
    // Plain copy: write the input pixel straight to the output.
    v_out->xyzw = v_in->xyzw;
}
The allocation's usage flags look like this:
RS_ALLOCATION_USAGE_SHARED | RS_ALLOCATION_USAGE_SCRIPT,
NVIDIA's official PDF says the Tegra K1 GPU supports RenderScript, so I don't know where I am going wrong.
Thanks
Did you check the GPU utilization? You could try NVIDIA Nsight Tegra.
Is the high CPU utilization per core or across the whole processor? If it is across the processor, that might indicate that RS has parallelized the task among the CPU cores.
Are you using the Tegra Android Development Pack?
It may be that NVIDIA supports RenderScript only on the CPU side. Since the K1 has a CUDA-based GPU, the logic for offloading this kind of code to the GPU may not be implemented.
The GPU may be used for kernels that do image-processing work, like the example here.
Related
When I tried to verify that the GPU outperforms the CPU on matrix operations, I got unexpected results: the CPU performs better than the GPU in my experiment, which confuses me.
I used the CPU and the GPU to do matrix multiplication respectively. The programming environment is MXNet with CUDA 10.1.
with gpu:
import mxnet as mx
from mxnet import nd
x = nd.random.normal(shape=(100000,100000),ctx=mx.gpu())
y = nd.random.normal(shape=(100000,100000),ctx=mx.gpu())
%timeit nd.dot(x,y)
50.8 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
with cpu:
x1 = nd.random.normal(shape=(100000,100000),ctx=mx.cpu())
y1 = nd.random.normal(shape=(100000,100000),ctx=mx.cpu())
%timeit nd.dot(x1,y1)
33.4 µs ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Why is the CPU faster? My CPU is an i5-6300HQ and my GPU is an NVIDIA GTX 950M.
TLDR: Your matrix multiplication is actually not running :)
MXNet is an asynchronous framework: it piles work requests into a queue that its execution engine processes on a need-to-run basis. So what you're measuring is only the time it took to enqueue the request, not to execute it. That's why the measurement is so small (microseconds for a 100k×100k matrix multiplication would be surprisingly fast) and roughly equal for CPU and GPU. To force execution, you need to add a call that forces a result to be produced, for example a print or nd.dot(x, y).wait_to_read(). See a benchmark very similar to yours here: https://github.com/ThomasDelteil/MXNetParisWorkshop/blob/master/FromNDArrayToTrainedModel.ipynb
Extra comments:
The gain of using a GPU over a CPU comes from the size of the parallelism opportunity. On simple tasks, that gain can be small to nonexistent. CPU core frequencies are actually 2 to 3 times higher than GPU frequencies (your i5-6300HQ runs at 2.3 GHz with a 3.2 GHz boost, while your GTX 950M runs at 0.9 GHz with a 1.1 GHz boost).
MXNet's ndarray is very fast at matrix algebra on the CPU, because (1) its asynchronous paradigm optimizes the order of computation, (2) its C++ backend runs things in parallel, and (3) I believe the default MXNet build comes with Intel MKL, which significantly boosts the algebra capabilities of Intel CPUs (https://medium.com/apache-mxnet/mxnet-boosts-cpu-performance-with-mkl-dnn-b4b7c8400f98). Its ability to run compute on the GPU within the same API is also a big strength over NumPy, for example.
I don't think your test will even run on the GPU: instantiating such a big matrix on an NVIDIA Tesla V100 (16 GB of memory, 4x more than a GTX 950M) fails with a "large tensor size error".
I don't know the module you're using, but your CPU can access main memory much faster and also keeps a lot of data in cache. Your GPU needs extra time to load the data into GPU memory and also incurs latency when being invoked from the CPU.
That's always the downside of GPU computation. Only when you can load a large chunk of data into GPU memory is there a good chance of it being faster.
By the way, that's why deep learning frameworks work in batches. When you can't batch your work, I'd always use the CPU. You also have some potential for performance improvements through multiprocessing.
I am testing the performance of some samples in the OpenCV source tree depending on whether Halide is used or not.
Surprisingly, the performance is worse when Halide is used for the computation:
squeezenet_halide: ~24 ms with Halide and ~16 ms without Halide.
resnet_ssd_face: ~84 ms with Halide and ~36 ms without Halide.
I compiled Halide and OpenCV following the instructions in this tutorial. The OpenCV code was downloaded from the master branch of the OpenCV git repository.
I tested the performance using the sample files 'resnet_ssd_face.cpp' and 'squeezenet_halide.cpp'. In both cases I add one of these lines just before the call to 'forward', to enable or disable Halide:
net.setPreferableBackend(DNN_BACKEND_HALIDE); // use Halide
net.setPreferableBackend(DNN_BACKEND_DEFAULT); // NOT use Halide
The time is measured with this code just after the call to the 'forward' function:
std::vector<double> layersTimings;
double freq = cv::getTickFrequency() / 1000;
double time = net.getPerfProfile(layersTimings) / freq;
std::cout << "Time: " << time << " ms" << std::endl;
Is there anything missing in the tutorial? Should Halide be compiled with different parameters?
My setup is:
OS: Linux (Ubuntu 16.04)
CPU: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
GPU: NVIDIA GeForce GT 730 (Driver Version: 384.90)
Cuda: CUDA Version 9.0.176
Taking into account the comment by Dmitry Kurtaev and looking at the wiki of the OpenCV GitHub account, I found a page that includes a benchmark comparing different approaches (I had missed the links in the tutorial).
Also, there is a merge request that includes a similar benchmark.
In both of them, the time measurements show that the performance using Halide is worse than with the original C++ approach.
I can assume that the Halide integration is at an early stage. Moreover, as Zalman Stern comments, the Halide scheduling is a work in progress, and the original optimizations in the dnn module of OpenCV could be better tuned than the schedules shipped for Halide.
I hope these measurements improve in future versions of OpenCV, but for now, this is the performance.
My answer is slightly unrelated, but may be helpful.
For face detection + face alignment:
Normal SSD detection time: 50-55 ms
Using the OpenVINO inference engine: 40-45 ms
My CPU is an Intel Core 2 Duo T5550 and my GPU is a GeForce 8400M G. CUDA version is 5.5.22 and OpenCV version is 2.4.8.
The test code is as follows:
double t = (double)getTickCount();
gpu::threshold(src, dst, thres, binMax, THRESH_BINARY);
t = ((double)getTickCount() - t)/getTickFrequency();
cout << "Times passed in seconds: " << t << endl;
For a 3648×2736 image, the result is
CPU: Times passed in seconds: 0.0136336
GPU: Times passed in seconds: 0.0217714
Thanks!
Perhaps this is not surprising.
Your GeForce 8400M G is an old mobile card with only 8 CUDA cores (see the GeForce 8M series specifications), so you cannot extract much parallelism out of it.
Brutally speaking, GPUs are advantageous over multicore CPUs only when you can extract massive parallelism across a large number of cores. In other words, to quickly build an Egyptian pyramid with slow slaves (GPU cores), you need a very large number of slaves. If you have only a few slow slaves (8 in your case), then it may be better to have even fewer (2 CPU cores, for example) but much faster ones.
EDIT
I just remembered bumping into this post:
Finding minimum in GPU slower than CPU
which may help convince you that bad implementations (as underlined by Abid Rahman and Mailerdaimon) can lead to GPU code that is slower than the CPU version. The situation is even worse if, as pointed out in the answer to the post above, you are also hosting the X display on your already limited GeForce 8400M G card.
In addition to what @JackOLantern said:
Every copy operation involving the GPU takes time, a lot of time compared to just computing on the CPU. This is why @Abid Rahman K's comment is a good idea: he suggested testing again with more complex code. The advantage of the GPU is fast parallel processing; one of its disadvantages is the relatively slow transfer rate when copying data to and from the GPU.
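To make that concrete, here is a minimal sketch against the OpenCV 2.4 gpu API (file name and threshold values are illustrative) that times the transfers and the kernel separately; note that the very first CUDA call also pays a one-time context-initialization cost, so warm up before timing:
#include <iostream>
#include <opencv2/opencv.hpp>
#include <opencv2/gpu/gpu.hpp>

int main()
{
    cv::Mat src = cv::imread("input.jpg", 0); // illustrative 8-bit input
    cv::gpu::GpuMat d_src, d_dst;
    cv::Mat dst;

    int64 t0 = cv::getTickCount();
    d_src.upload(src);                        // host -> device copy
    int64 t1 = cv::getTickCount();
    cv::gpu::threshold(d_src, d_dst, 128, 255, cv::THRESH_BINARY); // compute
    int64 t2 = cv::getTickCount();
    d_dst.download(dst);                      // device -> host copy
    int64 t3 = cv::getTickCount();

    double f = cv::getTickFrequency();
    std::cout << "upload:   " << (t1 - t0) / f << " s\n"
              << "compute:  " << (t2 - t1) / f << " s\n"
              << "download: " << (t3 - t2) / f << " s" << std::endl;
    return 0;
}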
On my system, for a 5 MP image with a large window size (75 px), adaptiveThreshold takes a whopping 140 ms to complete (roughly 20 times as long as linear operations), and I am looking to optimize it. I noticed that the OpenCV gpu module does not implement a GPU version of adaptiveThreshold, so I have been thinking of implementing that algorithm for the GPU myself.
Can I hope for any speedup if I implement an adaptive threshold algorithm in CUDA, given a large window size (50 px+) and a large image (5 MP+), ignoring the overhead of loading memory into the GPU?
adaptiveThreshold documentation on opencv.org:
http://docs.opencv.org/modules/imgproc/doc/miscellaneous_transformations.html#adaptivethreshold
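For reference, the CPU call I am trying to speed up looks roughly like this (the offset constant 5 is illustrative; blockSize must be odd):
#include <opencv2/opencv.hpp>

int main()
{
    cv::Mat src = cv::imread("input.png", CV_LOAD_IMAGE_GRAYSCALE); // 5 MP, 8-bit
    cv::Mat dst;
    // 75x75 neighbourhood mean, then binarize at (mean - 5)
    cv::adaptiveThreshold(src, dst, 255, cv::ADAPTIVE_THRESH_MEAN_C,
                          cv::THRESH_BINARY, 75, 5);
    return 0;
}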
Building on Eric's answer:
The NPP CUDA library does not implement adaptiveThreshold, but it is quite useful for getting an adaptive threshold in a VERY straightforward way (just tested it, and it anecdotally works):
1. Run a box filter on src (i.e. compute the mean window value for every pixel) and store it in an intermediate image tmp.
2. Subtract a constant K from each pixel of tmp.
3. Run a compare function between src and tmp into dst. Done.
The code may look like this (here K = 0, so the 2nd step is omitted):
nppiFilterBox_8u_C1R(oDeviceSrc.data(), oDeviceSrc.pitch(),
                     oDeviceIntermediate.data(), oDeviceIntermediate.pitch(),
                     oSizeROI, oAdapThreshWindowSize, oAnchor);
nppiCompare_8u_C1R(oDeviceSrc.data(), oDeviceSrc.pitch(),
                   oDeviceIntermediate.data(), oDeviceIntermediate.pitch(),
                   oDeviceResult.data(), oDeviceResult.pitch(),
                   oSizeROI, NPP_CMP_LESS);
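If you need a nonzero K, step 2 can be done in place between the two calls; a sketch, assuming NPP's usual in-place constant-subtraction entry point:
// Step 2 for K > 0: subtract K from the box-filtered image in place,
// with saturation; the final 0 is nScaleFactor (no result scaling).
const Npp8u K = 5; // illustrative offset
nppiSubC_8u_C1IRSfs(K, oDeviceIntermediate.data(), oDeviceIntermediate.pitch(),
                    oSizeROI, 0);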
Also, Wikipedia claims that applying a box filter 3 times in a row approximates a Gaussian filter to 97% accuracy.
Yes, this algorithm can be optimized on the GPU. I would expect to see an excellent speedup.
For ADAPTIVE_THRESH_MEAN_C, you could use a standard parallel reduction to calculate the arithmetic mean. For ADAPTIVE_THRESH_GAUSSIAN_C, you might use a kernel that performs per-pixel Gaussian attenuation combined with a standard parallel reduction for the sum.
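For concreteness, here is a naive CUDA sketch of the mean variant (one thread per pixel summing its window directly; the kernel name and launch parameters are illustrative, and a tuned version would use the separable approach from the white paper linked below):
// Naive sketch of ADAPTIVE_THRESH_MEAN_C with THRESH_BINARY: each thread
// averages its (2r+1)x(2r+1) window, then thresholds its own pixel.
__global__ void adaptiveThreshMean(const unsigned char *src, unsigned char *dst,
                                   int width, int height, int r, float C,
                                   unsigned char maxVal)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int dy = -r; dy <= r; ++dy)
        for (int dx = -r; dx <= r; ++dx) {
            int sx = min(max(x + dx, 0), width - 1);  // clamp at the borders
            int sy = min(max(y + dy, 0), height - 1);
            sum += src[sy * width + sx];
        }
    float mean = sum / ((2 * r + 1) * (2 * r + 1));
    dst[y * width + x] = (src[y * width + x] > mean - C) ? maxVal : 0;
}

// Launch, e.g. for a 75x75 window (r = 37) over device buffers d_src/d_dst:
//   dim3 block(16, 16);
//   dim3 grid((width + 15) / 16, (height + 15) / 16);
//   adaptiveThreshMean<<<grid, block>>>(d_src, d_dst, width, height, 37, 5.0f, 255);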
A CUDA implementation should give you a satisfying performance gain.
Since your window size is large, this operation should be compute-bound. The theoretical peak time for a 5 MP image with a 75 px window on a Tesla K20X GPU is about
5e6 × 75 × 75 / 3.95 Tflop/s ≈ 7 ms
Here's a white paper about image convolution. It shows how to implement a high-performance box filter with CUDA.
http://docs.nvidia.com/cuda/samples/3_Imaging/convolutionSeparable/doc/convolutionSeparable.pdf
The NVIDIA NPP library also provides a function, nppiFilterBox(), which can be used to implement ADAPTIVE_THRESH_MEAN_C directly.
http://docs.nvidia.com/cuda/cuda-samples/index.html#box-filter-with-npp
For ADAPTIVE_THRESH_GAUSSIAN_C, the function nppiFilter() with a proper mask could be used.
NPP documentation, p. 1009: http://docs.nvidia.com/cuda/pdf/NPP_Library.pdf
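A sketch of that, reusing the names from the box-filter snippet above with a small integer Gaussian mask; note that NPP expects the filter coefficients in device memory:
// 3x3 integer Gaussian mask [1 2 1; 2 4 2; 1 2 1]; nDivisor = 16 normalizes it.
const Npp32s hGauss[9] = {1, 2, 1, 2, 4, 2, 1, 2, 1};
Npp32s *dGauss = 0;
cudaMalloc((void**)&dGauss, sizeof(hGauss));
cudaMemcpy(dGauss, hGauss, sizeof(hGauss), cudaMemcpyHostToDevice);

NppiSize oKernelSize = {3, 3};
NppiPoint oGaussAnchor = {1, 1};
nppiFilter_8u_C1R(oDeviceSrc.data(), oDeviceSrc.pitch(),
                  oDeviceIntermediate.data(), oDeviceIntermediate.pitch(),
                  oSizeROI, dGauss, oKernelSize, oGaussAnchor, 16);
cudaFree(dGauss);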
I am trying to implement computer vision algorithms on my NVIDIA GPU with OpenCV. I am using OpenCV 2.4 and am currently writing very simple programs to get accustomed to it. I wrote simple code that transposes a matrix and also runs Canny edge detection on the GPU. The program runs perfectly, but I need to deallocate the memory on both the CPU and GPU. My code is below:
#include <cstdio>
#include <opencv2/opencv.hpp>
#include <opencv2/gpu/gpu.hpp>

int main(int argc, char *argv[])
{
    int k;
    cv::Mat src;
    cv::Mat dest;
    cv::Mat dest_1;
    cv::gpu::GpuMat im_source;
    cv::gpu::GpuMat im_dest;
    cv::gpu::GpuMat im_dest_1;

    k = cv::gpu::getCudaEnabledDeviceCount();
    printf("%d\n", k);
    src = cv::imread("lena.jpg", 0);
    cv::imshow("lena_org", src);
    im_source.upload(src);
    cv::gpu::transpose(im_source, im_dest);
    im_dest.download(dest);
    cv::imshow("lena_trans", dest);
    cv::gpu::Canny(im_source, im_dest_1, 100, 100, 3, false);
    im_dest_1.download(dest_1);
    cv::imshow("lena_edge", dest_1);
    cv::waitKey();
    return 0;
}
From the code above, I believe the memory is not getting freed on either the CPU or the GPU. Searching the internet, I came across cv::Mat::release for the CPU side and cv::gpu::GpuMat::release for the GPU side, but I don't understand how to use these functions in my code so that I can free both CPU and GPU memory. It would be very helpful if someone could guide me through the correct usage of the release APIs. Thanks for all your support.
The destructor for cv::Mat objects automatically frees the memory, making calls to the release function you describe. At the level of your code, you shouldn't have to worry about that. Once the matrix leaves scope, it gets destroyed.
If you want to manually drop your reference to the data, you can call, for example, src.release(). There is a good tutorial on memory management in the OpenCV documentation, available here.
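For illustration, explicit releases at the end of the main() above would look like this; the destructors do exactly the same when the objects go out of scope:
src.release();        // frees host (CPU) memory once no other Mat shares it
dest.release();
dest_1.release();
im_source.release();  // frees device (GPU) memory
im_dest.release();
im_dest_1.release();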