Why is the im2col + GEMM approach more efficient than a direct implementation with SIMD in CNNs? - memory

The convolutional layers are the most computationally intensive parts of convolutional neural networks (CNNs). Currently, the common approach to implementing convolutional layers is to expand the image into a column matrix (im2col) and perform Multiple Channel Multiple Kernel (MCMK) convolution using an existing parallel General Matrix Multiplication (GEMM) library. However, the im2col operation has to load and store the image data, and it also needs another memory block to hold the intermediate data.
If I had to optimize the convolutional implementation, I might choose a direct implementation with SIMD instructions. Such a method would not incur any memory-operation overhead.
Yet the following article claims, near the end, that "the benefits from the very regular patterns of memory access outweigh the wasteful storage costs":
https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
So I would like to know the reason. Do floating-point operations require more instruction cycles? Or is the input image not that large, so that it stays resident in the cache and the memory operations don't need to access DDR and cost fewer cycles?

Cache-blocking a GEMM is possible so you get mostly L1 cache hits (see also What Every Programmer Should Know About Memory?).
Fitting in the large shared L3 cache on typical x86 CPUs is not sufficient to make things efficient. The per-core L2 caches are typically 256kiB, and even that's slower than the 32kiB L1d cache.
Memory latency is very high compared to a CPU core clock cycle, but memory/cache bandwidth is not terrible these days with fast DDR4 or L3 cache hits. (But as I said, for a matmul with good cache blocking / loop tiling you can reuse data while it's still hot in L1d if you only transpose parts of the input matrix on the fly. Reducing off-core bandwidth requirements is also important for efficient matmul, not just transposing one matrix so its columns are sequential in memory.)
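To make the blocking idea concrete, here is an illustrative NumPy sketch of the loop-tiling structure (the block sizes are placeholders; a real GEMM does this in hand-tuned native code with SIMD micro-kernels, this only shows the access pattern):

import numpy as np

# Illustrative cache-blocking / loop-tiling structure for C = A @ B.
# Block sizes bm/bn/bk are chosen so that one block of A, B and C can stay
# hot in L1d/L2 while it is reused.
def blocked_matmul(A, B, bm=64, bn=64, bk=64):
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, bm):
        for j in range(0, N, bn):
            for k in range(0, K, bk):
                C[i:i+bm, j:j+bn] += A[i:i+bm, k:k+bk] @ B[k:k+bk, j:j+bn]
    return C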
Beyond that, sequential access to memory is essential for efficient SIMD (loading a vector of multiple contiguous elements, letting you multiply / add / whatever 4 or 8 packed float elements with one CPU instruction). Striding down columns in a row-major matrix would hurt throughput even if the matrix was small enough to fit in L1d cache (32kiB).
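And for reference, a minimal NumPy sketch of im2col itself (stride 1, no padding, a single image; not the exact layout of any particular library) shows both the extra buffer the question asks about and why the GEMM that follows streams through memory sequentially:

import numpy as np

def im2col(x, K):
    # x: one input feature map with shape (channels, height, width)
    C, H, W = x.shape
    out_h, out_w = H - K + 1, W - K + 1
    cols = np.empty((C * K * K, out_h * out_w), dtype=x.dtype)
    col = 0
    for i in range(out_h):
        for j in range(out_w):
            # Copy the C x K x K patch at (i, j) into one contiguous column.
            cols[:, col] = x[:, i:i + K, j:j + K].ravel()
            col += 1
    return cols

# The convolution then becomes a single GEMM with unit-stride inner loops:
#   weights: (num_filters, C*K*K), cols: (C*K*K, out_h*out_w)
#   output = weights @ cols        # shape (num_filters, out_h*out_w)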

Related

Analysis of 3d image data as 2d

I have a TIFF image of around ~10 GB. I need to perform object classification or pixel classification on this image. The image data has zyx dimension order. My voxel size is x=0.6, y=0.6 and z=1.2, where z is the depth of the object. My RAM cannot hold the whole image.
If I classify the pixels in each z-plane separately and then merge the results to get the final shape and volume of the object, would I lose any information, and would the final shape or volume of the object be wrong?
@ankit agrawal you have probably found the answer by now, but my advice would definitely not be to say you simply need more memory.
I have had a similar problem, and if anyone else comes across it, the options below will help.
Options
The answer about splitting into just z-planes is correct: you could lose information along z. The idea isn't a bad one, but you could instead take Regions of Interest (ROIs) / split your image into 3D chunks, say halving it along x, y and z, so that each chunk is manageable and fits in memory, and then stack the data back up later.
Use the library Dask (https://dask.org/); it handles all this for you. It's designed for parallelism and can be scaled on a single computer or a cluster. The dask.array module lets you create lots of chunked NumPy arrays. Even better, use dask-image (linked from the Dask docs), which wraps dask.array and many scipy.ndimage functions. Lastly, when the file is split appropriately, the computation can be faster because of parallelism - not always, but I have easily worked with 20 GB data sets on a laptop with 16 GB. The files were 8-bit, and many libraries and functions upcast to float, which blows up your memory; Dask lets you keep a handle on all of it. If you stick to the core functions it will work fine; it gets harder when you work with mapped blocks. A small sketch follows below.
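As a minimal dask.array sketch of the chunked approach (the file name, chunk sizes and the threshold-based "classification" are placeholders; it assumes a single multi-page TIFF that tifffile can memory-map):

import dask.array as da
import tifffile

# Memory-map the TIFF so nothing is read eagerly (file name is hypothetical).
vol = tifffile.memmap('volume.tif')                # shape (z, y, x)
dvol = da.from_array(vol, chunks=(64, 512, 512))   # tune chunk sizes to your RAM

# Example lazy computation, evaluated chunk by chunk: a crude global threshold.
mask = dvol > dvol.mean()
voxel_count = mask.sum().compute()
volume_um3 = voxel_count * 0.6 * 0.6 * 1.2         # voxel dimensions from the question
print(volume_um3)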
If you still have this issue: the problem with doing a classification in each z-plane individually is that you might not be able to classify objects with that sort of restricted information.
You can think of it the same way as a 2D face-detection problem in which you try to detect the face in each row individually: that is probably not going to be very robust, you will lose valuable spatial information, and in the end you'll probably have no detections to merge.
Solution proposal:
My advice would be to increase the size of your voxels until the volume can be processed by your processing unit, i.e. decrease the resolution of your data, and run a classification with a low confidence threshold. Then come back and run another classification on only the volumes that contain detections, this time aiming for a higher confidence threshold. This can be done iteratively as needed.
I think breaking the image along any (x/y/z) plane kind of defeats the point of the voxel concept, because the representation of the three-dimensional object is flattened and you lose the spatial relational data.
I think a couple of options are:
Use a distributed computing cluster, like Hadoop.
Look into storing that image in a geospatial database like GeoMesa, so it can be queried efficiently; then you can hold in memory only what you need to train on locally.
10GB isn't so large, so perhaps upgrade your memory capacity?

Split RNN Memory Consumption Evenly Between GPUs in TensorFlow

I'm trying to figure out the most strategic way to evenly split the memory load of a seq2seq network between two GPUs.
With convolutional networks, the task is much easier. However, I'm trying to figure out how to maximize the memory usage of 2 Titan X's. The goal is to build the largest network that the combined 24GB of memory will allow.
One idea was to place each RNN layer in a separate GPU.
GPU1 --> RNN Layer 1 & Backward Pass
GPU2 --> RNN Layer 2,3,4
However, the backprop computations require a significant amount of memory. Therefore, another idea is to do the entire forward pass on one GPU and the backward pass on a separate GPU.
GPU1 --> Forward Pass
GPU2 --> Backward Pass
(However, GPU2 still takes most of the memory load)
Is there any way to measure how much of the GPU memory is being used? This would allow us to figure out how to maximize each GPU before it's "filled up".
Once 2 GPUs are used, I would eventually want to use four. However, I think maximizing 2 GPUs is the first step.
Setting "colocate_gradients_with_ops" as TRUE maybe work. It allows GPU memory to allocated evenly.
optimizer = tf.train.AdamOptimizer(learning_rate)
gvs = optimizer.compute_gradients(loss, colocate_gradients_with_ops=True)
train_op = optimizer.apply_gradients(gvs, global_step=self.global_step)
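For the layer-per-GPU idea from the question, a rough TF 1.x graph-mode sketch (layer sizes, scope names and the input placeholder are made up) of pinning layers to devices; with colocate_gradients_with_ops=True the gradient ops then follow the same placement:

import tensorflow as tf  # TF 1.x graph-mode API assumed

inputs = tf.placeholder(tf.float32, [None, None, 256])  # (batch, time, features)

with tf.device('/gpu:0'):
    cell1 = tf.nn.rnn_cell.LSTMCell(1024)
    out1, _ = tf.nn.dynamic_rnn(cell1, inputs, dtype=tf.float32, scope='rnn1')

with tf.device('/gpu:1'):
    cell2 = tf.nn.rnn_cell.LSTMCell(1024)
    out2, _ = tf.nn.dynamic_rnn(cell2, out1, dtype=tf.float32, scope='rnn2')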

How do the cuSPARSE and cuBLAS libraries deal with memory allocated using cudaMallocPitch?

I am implementing a simple routine that performs sparse matrix - dense matrix multiplication using cusparseScsrmm from cuSPARSE. This is part of a bigger application that could allocate memory on GPU using cudaMalloc (more than 99% of the time) or cudaMallocPitch (very rarely used). I have a couple of questions regarding how cuSPARSE deals with pitched memory:
1) I passed pitched memory into the cuSPARSE routine, but the results were incorrect (as expected, since there is no way to pass the pitch as an argument). Is there a way to get these libraries working with memory allocated using cudaMallocPitch?
2) What is the best way to deal with this? Should I just add a check in the calling function, to enforce that the memory not be allocated using pitched mode?
For sparse matrix operations, the concept of pitched data has no relevance anyway.
For dense matrix operations most operations don't directly support a "pitch" to the data per se, however various operations can operate on a sub-matrix. With a particular caveat, it should be possible for such operations to handle pitched or unpitched data. Any time you see a CUBLAS (or CUSPARSE) operation that accepts "leading dimension" arguments, those arguments could be used to encompass a pitch in the data.
Since the "leading dimension" parameter is specified in matrix elements, and the pitch is (usually) specified in bytes, the caveat here is that the pitch is evenly divisible by the size of the matrix element in question, so that the pitch (in bytes) can be converted to a "leading dimension" parameter specified in matrix elements. I would expect that this would be typically possible for char, int, float, double and similar types, as I believe the pitch quantity returned by cudaMallocPitch will usually be evenly divisible by 16. But there is no stated guarantee of this, so proper run-time checking is advised, if you intend to use this approach.
For example, it should be possible to perform a CUBLAS matrix-matrix multiply (gemm) on pitched data, with appropriate specification of the lda, ldb and ldc parameters.
The operation you indicate does offer such leading dimension parameters for the dense matrices involved.
If 99% of your use-cases don't use pitched data, I would either not support pitched data at all, or else, for operations where no leading dimension parameters are available, copy the pitched data to an unpitched buffer for use in the desired operation. A device-to-device pitched to unpitched copy can run at approximately the rate of memory bandwidth, so it might be fast enough to not be a serious issue for 1% of the use cases.

Why doesn't using multiple kernels improve the accuracy when some of the kernels get weights?

Why doesn't using multiple kernels improve the accuracy when some of the kernels get weights? In other words, multiple kernels can't always improve the accuracy when we combine several kernels.
There are plenty of aspects to such a question; I will list a few of the most common reasons:
your data is too small to learn a complex function without heavy overfitting
your kernel-weight assignment technique (optimization) is overfitting, due to the nature of the optimization and/or too flexible a kernel-ensemble definition
not optimizing something (like weights or kernels) is actually a kind of regularization - if you start to fit this part, you increase the variance of your method, so you will need to compensate with some variance-reduction technique (like additional regularizers)
poor selection of kernels - too heavily correlated kernels, or kernels whose internal characteristics are not well suited to the data, can be a problem; the weight-assignment technique will struggle to remove redundant kernels
In general:
Each additional level of abstraction in the optimization process leads to non-linear growth in the problem complexity, thus in order to fit it correctly you need:
more data
better optimization techniques
better knowledge of underlying mathematical models
as compared to the use of a simpler model.

Are GPUs good for case-based image filtering?

I am trying to figure out whether a certain problem is a good candidate for using CUDA to put the problem on a GPU.
I am essentially doing a box filter that changes based on some edge detection. So there are basically 8 cases that are tested for each pixel, and then the rest of the operations happen - typical mean calculations and such. Is the presence of these switch statements in my loop going to make this problem a bad candidate to go to the GPU?
I am not sure really how to avoid the switch statements, because this edge detection has to happen at every pixel. I suppose the entire image could have the edge detection part split out from the processing algorithm, and you could store a buffer corresponding to which filter to use for each pixel, but that seems like it would add a lot of pre-processing to the algorithm.
Edit: Just to give some context - this algorithm is already written, and OpenMP has been used to pretty good effect at speeding it up. However, the 8 cores on my development box pales in comparison to the 512 in the GPU.
Edge detection, mean calculations and cross-correlation can be implemented as 2D convolutions. Convolutions can be implemented on GPU very effectively (speed-up > 10, up to 100 with respect to CPU), especially for large kernels. So yes, it may make sense rewriting image filtering on GPU.
Though I wouldn't use the GPU as a development platform for such a method.
Typically, unless you are on the new CUDA architecture, you will want to avoid branching. Because GPUs are basically SIMD machines, the pipeline is extremely vulnerable to, and suffers tremendously from, stalls caused by divergent branches.
If you think there are significant benefits to be garnered by using a GPU, do some preliminary benchmarks to get a rough idea.
If you want to learn a bit about how to write non-branching code, head over to http://cellperformance.beyond3d.com/ and have a look.
Further, investigating running this problem on multiple CPU cores might also be worth it; in that case you will probably want to look into either OpenCL or the Intel performance libraries (such as TBB).
Another go-to source for problems targeting the GPU, be it graphics, computational geometry or otherwise, is IDAV, the Institute for Data Analysis and Visualization: http://idav.ucdavis.edu
Branching is actually not that bad, if there is spatial coherence in the branching. In other words, if you are expecting chunks of pixels next to each other in the image to go through the same branch, the performance hit is minimized.
Using a GPU for processing can often be counter-intuitive; things that are obviously inefficient if done in normal serial code, are actually the best way to do it in parallel using the GPU.
The pseudo-code below looks inefficient (since it computes 8 filtered values for every pixel) but will run efficiently on a GPU:
# Compute the 8 possible filtered values for each pixel
for i = 1..8
    # filter[i] is the box filter that you want to apply
    # to pixels of the i'th edge-type
    result[i] = GPU_RunBoxFilter(filter[i], Image)
# Compute the edge type of each pixel
# This is the value you would normally use to 'switch' on
edge_type = GPU_ComputeEdgeType(Image)
# Set up an empty result image
final_result = zeros(sizeof(Image))
# For each possible switch value, replace all pixels of that edge-type
# with its corresponding filtered value
for i = 1..8
    final_result = GPU_ReplacePixelIfTrue(final_result, result[i], edge_type == i)
Hopefully that helps!
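In case a concrete version is useful, here is a NumPy sketch of the same select pattern (the edge detector and the idea of one box size per case are placeholders; the point is computing every candidate and selecting per pixel instead of branching):

import numpy as np
from scipy.ndimage import uniform_filter

def filter_by_edge_type(image, edge_type, num_cases=8):
    # One candidate result per case; here each case is simply a different box size.
    sizes = [3 + 2 * i for i in range(num_cases)]
    candidates = [uniform_filter(image, size=s) for s in sizes]

    # Per-pixel select instead of a per-pixel switch: no data-dependent branches.
    result = np.zeros_like(image)
    for i, cand in enumerate(candidates):
        result = np.where(edge_type == i, cand, result)
    return result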
Yep, control flow usually carries a performance penalty on the GPU, be it ifs, switches or ternary operators, because with control-flow operations the GPU can't run its threads optimally. So the usual tactic is to avoid branching as much as possible. In some cases ifs can be replaced by a formula, where the if-conditions map to formula coefficients. But the concrete solution/optimization depends on the concrete GPU kernel... Maybe you can show the exact code, so it can be analyzed further by the Stack Overflow community.
EDIT:
Just in case you are interested, here is a convolution pixel shader that I wrote.

Resources