Does a vector processor operate on all its elements at the same time? - vectorization

Does a vector processor operate on all vector elements at the same time?
If so, don't we need as many threads as the number of elements in the vector (VLEN)?
And if my understanding is wrong and the processor works on each data point consecutively, how is this any different from a scalar processor?

To answer with the same level of specificity as your question: yes, vectorization, in the SIMD sense, operates on all vector elements at the same time.
If so, don't we need as many threads as the number of elements in the vector (VLEN)?
Why would we? The operation happens within a single processing thread: the elements are packed into one wide vector register, and a single instruction operates on all of its lanes at once.
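As a minimal sketch (plain C with SSE intrinsics, assuming an x86 CPU; the function and array names are just for illustration), a single instruction below adds four packed floats at once, all on one thread:

#include <immintrin.h>   // SSE intrinsics

// Add two float arrays; four elements are processed per instruction.
void add_arrays(const float *a, const float *b, float *c, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);           // load 4 contiguous floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));  // 4 additions, one instruction
    }
    for (; i < n; i++)                             // scalar tail for leftovers
        c[i] = a[i] + b[i];
}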

Related

Why is the method of im2col with GEMM more efficient than a direct implementation with SIMD in a CNN?

The convolutional layers are the most computationally intense parts of convolutional neural networks (CNNs). Currently, the common approach to implementing convolutional layers is to expand the image into a column matrix (im2col) and perform Multiple Channel Multiple Kernel (MCMK) convolution using an existing parallel General Matrix Multiplication (GEMM) library. However, the im2col operation needs to load and store the image data, and also needs another memory block to hold the intermediate data.
If I need to optimize the convolutional implementation, I might choose a direct implementation with SIMD instructions. Such a method would not incur any memory-operation overhead.
the benefits from the very regular patterns of memory access outweigh the wasteful storage costs.
From the end of the following link:
https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
So I would like to know the reason. Do floating-point operations require more instruction cycles? Or is the input image not that large, so it may reside in the cache, and the memory operations don't need to access DDR and thus consume fewer cycles?
Cache-blocking a GEMM is possible so you get mostly L1 cache hits (see also What Every Programmer Should Know About Memory?).
Fitting in the large shared L3 cache on typical x86 CPUs is not sufficient to make things efficient. The per-core L2 caches are typically 256kiB, and even that's slower than the 32kiB L1d cache.
Memory latency is very high compared to a CPU core clock cycle, but memory/cache bandwidth is not terrible these days with fast DDR4 or L3 cache hits. (But like I said, for a matmul with good cache blocking / loop tiling you can reuse data while it's still hot in L1d if you only transpose parts of the input matrix on the fly. Reducing off-core bandwidth requirements is also important for an efficient matmul, not just transposing one input so its columns are sequential in memory.)
Beyond that, sequential access to memory is essential for efficient SIMD (loading a vector of multiple contiguous elements, letting you multiply / add / whatever 4 or 8 packed float elements with one CPU instruction). Striding down columns in a row-major matrix would hurt throughput even if the matrix was small enough to fit in L1d cache (32kiB).
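As a rough sketch of the cache-blocking idea in C (the block size, row-major layout and zero-initialized C are assumptions, not tuned values), note how the innermost loop walks contiguous memory so it can be vectorized with packed loads:

#define BLK 64   // assumed tile size; tune per cache size

// C must be zero-initialized; all matrices are n x n, row-major.
void matmul_blocked(const float *A, const float *B, float *C, int n)
{
    for (int ii = 0; ii < n; ii += BLK)
        for (int kk = 0; kk < n; kk += BLK)
            for (int jj = 0; jj < n; jj += BLK)
                // Work on one BLK x BLK tile while it is hot in L1d.
                for (int i = ii; i < ii + BLK && i < n; i++)
                    for (int k = kk; k < kk + BLK && k < n; k++) {
                        float a = A[i * n + k];
                        // Contiguous accesses to B and C: SIMD-friendly.
                        for (int j = jj; j < jj + BLK && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}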

How do the cuSPARSE and cuBLAS libraries deal with memory allocated using cudaMallocPitch?

I am implementing a simple routine that performs sparse matrix - dense matrix multiplication using cusparseScsrmm from cuSPARSE. This is part of a bigger application that could allocate memory on GPU using cudaMalloc (more than 99% of the time) or cudaMallocPitch (very rarely used). I have a couple of questions regarding how cuSPARSE deals with pitched memory:
1) I passed in pitched memory into the cuSPARSE routine but the results were incorrect (as expected, since there is no way to pass in the pitch as an argument). Is there a way to get these libraries working with memory allocated using cudaMallocPitch?
2) What is the best way to deal with this? Should I just add a check in the calling function, to enforce that the memory not be allocated using pitched mode?
For sparse matrix operations, the concept of pitched data has no relevance anyway.
For dense matrix operations most operations don't directly support a "pitch" to the data per se, however various operations can operate on a sub-matrix. With a particular caveat, it should be possible for such operations to handle pitched or unpitched data. Any time you see a CUBLAS (or CUSPARSE) operation that accepts "leading dimension" arguments, those arguments could be used to encompass a pitch in the data.
Since the "leading dimension" parameter is specified in matrix elements, and the pitch is (usually) specified in bytes, the caveat here is that the pitch is evenly divisible by the size of the matrix element in question, so that the pitch (in bytes) can be converted to a "leading dimension" parameter specified in matrix elements. I would expect that this would be typically possible for char, int, float, double and similar types, as I believe the pitch quantity returned by cudaMallocPitch will usually be evenly divisible by 16. But there is no stated guarantee of this, so proper run-time checking is advised, if you intend to use this approach.
For example, it should be possible to perform a CUBLAS matrix-matrix multiply (gemm) on pitched data, with appropriate specification of the lda, ldb and ldc parameters.
The operation you indicate does offer such leading dimension parameters for the dense matrices involved.
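As a rough illustration of that idea (sizes are placeholders, error checking and data initialization are omitted, and the fallback path is only hinted at), the pitch returned by cudaMallocPitch can be converted to a leading dimension for cublasSgemm after a run-time divisibility check:

#include <cuda_runtime.h>
#include <cublas_v2.h>

// Column-major matrices; each pitched "row" holds one matrix column,
// so width = rows * sizeof(float) and height = number of columns.
int gemm_on_pitched(cublasHandle_t handle, int m, int n, int k)
{
    float *dA, *dB, *dC;
    size_t pA, pB, pC;
    cudaMallocPitch((void**)&dA, &pA, m * sizeof(float), k);
    cudaMallocPitch((void**)&dB, &pB, k * sizeof(float), n);
    cudaMallocPitch((void**)&dC, &pC, m * sizeof(float), n);

    // The pitch (bytes) must divide evenly by the element size before it
    // can serve as a leading dimension (in elements).
    if (pA % sizeof(float) || pB % sizeof(float) || pC % sizeof(float))
        return -1;   // fall back to copying into an unpitched buffer instead

    int lda = (int)(pA / sizeof(float));
    int ldb = (int)(pB / sizeof(float));
    int ldc = (int)(pC / sizeof(float));

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, dA, lda, dB, ldb, &beta, dC, ldc);
    return 0;
}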
If 99% of your use-cases don't use pitched data, I would either not support pitched data at all, or else, for operations where no leading dimension parameters are available, copy the pitched data to an unpitched buffer for use in the desired operation. A device-to-device pitched to unpitched copy can run at approximately the rate of memory bandwidth, so it might be fast enough to not be a serious issue for 1% of the use cases.

Audio Processing with Accelerate and vDSP_desamp()

I am totally new to the vdsp framework and I am trying to learn by building. My goal is for the signal to be processed in the following way:
100th order Band Pass FIR
Downsampling by factor: 2
From what I could understand from Apple's documentation the function vDSP_desamp() is what I am looking for (it can do both steps at the same time, right?)
How would I use this correctly?
Here are my thoughts:
Given an AudioBufferList *audio and an array of filter coefficients filterCoeffs of length 101:
vDSP_desamp((float*)audio->mBuffers[0].mData, 2, &filterCoeffs, (float*)audio->mBuffers[0].mData, frames, 101);
would this be a correct use of the method?
Do I need to implement a circular buffer for this process?
Any guidance /direction /pointer to something to read would be very welcome.
thanks
Reading the documentation, vDSP_desamp() is indeed a compound decimation and FIR operation. Doing both together is a good idea as it reduces memory access and there is scope for eliminating a lot of computation.
The assumption here is that the FIR filter has been recast with a group delay of (P-1)/2. The consequence of this is that to calculate C(n) the function needs access to A(n*I + p) for p = 0..P-1 (the documented operation is C(n) = sum over p of A(n*I + p) * F(p)).
Where (using the terminology of the documentation):
`A[0..x-1]`: input sample array
`C[0..n-1]`: output sample array
`P`: number of filter coefficients
`I`: Decimation factor
Clearly, if you pass a CoreAudio buffer to this, it'll run off the end of the buffer by 200 input samples, at best yielding 100 garbage samples and at worst a SIGSEGV.
So, the simple answer is NO. You cannot use vDSP_desamp() alone.
Your options are:
Assemble the samples needed into a buffer and then call vDSP_desamp() for N output samples. This involves copying samples from two CoreAudio buffers. If you're worried about latency, recast the FIR to use 100 previous samples; alternatively, they could come from the next buffer. (A rough sketch of this option appears after the list.)
Use vDSP_desamp() for what you can, and calculate the more complex case yourself when the filter wraps over the two buffers.
Make two calls to vDSP_desamp(): one for the easy case, and another with an assembled input buffer whose samples wrap adjacent CoreAudio buffers.
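A rough sketch of the first option in C (my own illustration, not tested code: it assumes mono float samples, a decimation factor of 2, 101 taps, and a frame count that is a multiple of the decimation factor; history, scratch and processBuffer are made-up names):

#include <Accelerate/Accelerate.h>
#include <stdlib.h>
#include <string.h>

#define P  101                      // number of filter taps
#define DF 2                        // decimation factor

static float history[P - 1];        // tail of the previous CoreAudio buffer

void processBuffer(const float *in, vDSP_Length frames,
                   const float *filterCoeffs, float *out)
{
    // Assemble [previous tail | current buffer] so that every output
    // C(n) = sum over p of A(n*DF + p) * F(p) reads only valid input.
    vDSP_Length total = (P - 1) + frames;
    float *scratch = malloc(total * sizeof(float));
    memcpy(scratch, history, (P - 1) * sizeof(float));
    memcpy(scratch + (P - 1), in, frames * sizeof(float));

    vDSP_desamp(scratch, DF, filterCoeffs, out, frames / DF, P);

    // Carry the unconsumed tail over to the next call.
    memcpy(history, scratch + total - (P - 1), (P - 1) * sizeof(float));
    free(scratch);
}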
I don't see how you can use a circular buffer to solve this problem: you still have the case where the buffer wraps to deal with, and you still need to copy all the samples into it.
Which is faster rather depends on the size of the audio buffers presented by CoreAudio. My hunch is that for small buffers, and a small filter length, vDSP_desamp() possibly isn't worth it, but you're going to need to measure to be sure.
When I've implemented this kind of thing in the past on iOS, I've found a hand-rolled decimation and filter operation to be fairly insignificant in the grand scheme of things, and didn't bother optimizing further.

Why are the elements of the matrix and vector types in the F# Powerpack mutable?

F# is often promoted as a functional language where data is immutable by default; however, the elements of the matrix and vector types in the F# Powerpack are mutable. Why is this?
Furthermore, why are sparse matrices implemented as immutable, as opposed to normal matrices?
The standard array type ('T[]) in F# is also mutable. You're mostly correct: F# is a functional language where data immutability is encouraged, but not required. Basically, F# allows you to write both mutable/imperative code and immutable/functional code; it's up to you to decide the best way to implement the code for your specific application.
Another reason for having mutable arrays and matrices is performance. It is possible to implement very fast algorithms with immutable types, but users writing scientific computations usually only care about one thing: achieving maximum performance. That being the case, it follows that the arrays and matrices should be mutable.
For truly high performance, mutability is required in one specific case: provided that your code is perfectly optimized and that you master everything it is doing, down to the cache (L1, L2) access pattern of your program, then nothing beats a low-level, to-the-metal approach.
This happens mostly only when you have one well-specified problem that stays constant for 20 years, i.e. mostly in scientific tasks.
As soon as you depart from this specific case, in 99.99% of cases the bottlenecks arise from having too low-level a representation (induced by a low-level language) in which you can't express the final, real-world optimization trade-offs of the problem at hand.
Bottom line: for performance, the following approach is the only way (I think):
High-level / algorithmic optimization first
Once every high-level avenue has been explored, low-level optimization
You can see how, as a consequence of that:
You should never optimize anything without FIRST measuring the impact: improvements should only be made if they yield enormous performance gains and/or do not degrade your domain logic.
If your problem is stable and well defined, you will eventually reach the point where you have no choice but to go low level and play with memory/mutability.

Are GPUs good for case-based image filtering?

I am trying to figure out whether a certain problem is a good candidate for using CUDA to put the problem on a GPU.
I am essentially doing a box filter that changes based on some edge detection. So there are basically 8 cases that are tested for each pixel, and then the rest of the operations happen: typical mean calculations and such. Is the presence of these switch statements in my loop going to make this problem a bad candidate to go to the GPU?
I am not sure really how to avoid the switch statements, because this edge detection has to happen at every pixel. I suppose the entire image could have the edge detection part split out from the processing algorithm, and you could store a buffer corresponding to which filter to use for each pixel, but that seems like it would add a lot of pre-processing to the algorithm.
Edit: Just to give some context - this algorithm is already written, and OpenMP has been used to pretty good effect at speeding it up. However, the 8 cores on my development box pale in comparison to the 512 in the GPU.
Edge detection, mean calculations and cross-correlation can be implemented as 2D convolutions. Convolutions can be implemented on a GPU very effectively (speed-ups > 10x, up to 100x with respect to a CPU), especially for large kernels. So yes, it may make sense to rewrite the image filtering on the GPU.
Though I wouldn't use the GPU as a development platform for such a method.
Typically, unless you are on the new CUDA architecture, you will want to avoid branching. Because GPUs are basically SIMD machines, the pipeline is extremely vulnerable to, and suffers tremendously from, pipeline stalls due to branch misprediction.
If you think there are significant benefits to be garnered by using a GPU, do some preliminary benchmarks to get a rough idea.
If you want to learn a bit about how to write non-branching code, head over to http://cellperformance.beyond3d.com/ and have a look.
Further, investigating running this problem on multiple CPU cores might also be worth it; in that case you will probably want to look into either OpenCL or the Intel performance libraries (such as TBB).
Another go-to source for problems targeting the GPU, be it graphics, computational geometry or otherwise, is IDAV, the Institute for Data Analysis and Visualization: http://idav.ucdavis.edu
Branching is actually not that bad, if there is spatial coherence in the branching. In other words, if you are expecting chunks of pixels next to each other in the image to go through the same branch, the performance hit is minimized.
Using a GPU for processing can often be counter-intuitive; things that are obviously inefficient if done in normal serial code, are actually the best way to do it in parallel using the GPU.
The pseudo-code below looks inefficient (since it computes 8 filtered values for every pixel) but will run efficiently on a GPU:
# Compute the 8 possible filtered values for each pixel
for i = 1..8
    # filter[i] is the box filter that you want to apply
    # to pixels of the i'th edge-type
    result[i] = GPU_RunBoxFilter(filter[i], Image)

# Compute the edge type of each pixel
# This is the value you would normally use to 'switch' with
edge_type = GPU_ComputeEdgeType(Image)

# Setup an empty result image
final_result = zeros(sizeof(Image))

# For each possible switch value, replace all pixels of that edge-type
# with its corresponding filtered value
for i = 1..8
    final_result = GPU_ReplacePixelIfTrue(final_result, result[i], edge_type == i)
Hopefully that helps!
Yep, control flow usually has a performance penalty on the GPU, be it if's, switch'es or ternary operators, because with control-flow operations the GPU can't run threads optimally. So the usual tactic is to avoid branching as much as possible. In some cases IFs can be replaced by a formula, where the IF conditions map to formula coefficients. But the concrete solution/optimization depends on the concrete GPU kernel... Maybe you can show the exact code, to be analyzed further by the Stack Overflow community.
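As a tiny illustration of the "formula instead of IF" idea (plain C here, but the same pattern works inside a GPU kernel; the names and the gradient test are made up), the condition becomes a 0/1 coefficient that blends two precomputed values, so no divergent branch is needed:

// Blend a filtered and an original pixel value based on an edge test,
// without a per-pixel branch.
static inline float select_blend(float gradient, float threshold,
                                 float filtered, float original)
{
    float is_edge = (float)(gradient > threshold);  // predicate as 0/1 coefficient
    return is_edge * filtered + (1.0f - is_edge) * original;
}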
EDIT:
Just in case you are interested, here is the convolution pixel shader that I wrote.

Resources