SSE vector realign?

Is there a way to realign data that has been loaded into SSE/AVX vector registers (say to implement a sliding window)? Or do I need to shift the bytes myself and reload into vector registers from memory again?

For 128-bit vectors, SSSE3 / AVX [v]palignr xmm works for arbitrary byte windows on a pair of registers. For AVX2 ymm registers, the 2x 128-bit lane behaviour makes it nearly useless for this; see _mm_alignr_epi8 (PALIGNR) equivalent in AVX2.
Sometimes reloading from memory is better, though: 2/clock load throughput with no penalty if you don't cross a cache-line boundary (on Intel), vs. 1/clock shuffle throughput; and the throughput / latency penalty for cache-line splits isn't terrible. If one palignr is sufficient, use it; but for AVX2 it's usually better to do unaligned loads than to try to emulate palignr across lanes.
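A minimal sketch of both options for a 4-byte sliding window (the shift amount, names and buffer pointer are illustrative, not from the answer):

#include <immintrin.h>
#include <stdint.h>

// Bytes 4..19 of the concatenation [hi:lo], i.e. the 16-byte window starting
// 4 bytes into data already held in two xmm registers (SSSE3 palignr).
static inline __m128i window_palignr(__m128i lo, __m128i hi) {
    return _mm_alignr_epi8(hi, lo, 4);
}

// The same window obtained by simply reissuing an unaligned load from memory;
// often at least as fast, and the practical choice for 256-bit AVX2 vectors.
static inline __m128i window_reload(const uint8_t* p) {
    return _mm_loadu_si128((const __m128i*)(p + 4));
}

static inline __m256i window_reload_avx2(const uint8_t* p) {
    return _mm256_loadu_si256((const __m256i*)(p + 4));
}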

Related

Why is accessing non-naturally aligned memory not efficient?

Let's assume we have a 64-bit CPU which always reads 8 bytes of memory at a time, and I want to store a 4-byte int. According to the definition of natural alignment, a 4-byte object is aligned to an address that's a multiple of 4 (e.g. 0x0000, 0x0004). But here is the problem: why can't I store it at address 0x0001, for example? To my understanding, since the CPU always reads 8 bytes of data, reading from address 0x0000 can still get the int stored at 0x0001 in one go. So why is natural alignment needed in this case?
Modern CPUs (Intel, Arm) will quite happily read from unaligned addresses. The CPUs are typically architected to read much more than 8 bytes per cycle, perhaps 16 or 32 bytes, and their deep pipelines manage quite nicely to extract the wanted 8 bytes from arbitrary addresses without any visible penalty.
Often, but not always, algorithms can be written without much concern about the alignment of arrays (or the start of each row of 2-dimensional array).
The pipelined architectures possibly read aligned blocks of 16 bytes at a time, meaning that when 8 bytes are read from address 0x0009, the CPU actually needs to read two 16-byte blocks, combine them and extract the middle 8 bytes. Things become even more complicated when the memory is not available in the first-level cache and a full 64-byte cache line needs to be fetched from the next-level cache or from main memory.
In my experience (writing and optimising image-processing algorithms for SIMD), many Arm64 implementations hide the cost of loading from unaligned addresses almost perfectly for algorithms with simple, linear memory access. Things become worse if the algorithm needs to read heavily from many unaligned addresses, such as when filtering with a kernel of 3x3 or larger, or when calculating high-radix FFTs; then the CPU's capacity for transferring memory and combining the pieces is soon exhausted.
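For illustration only, this is how an unaligned scalar load is usually written in portable C/C++ (the memcpy idiom; compilers typically lower it to a single load instruction on x86-64 and Arm64):

#include <cstdint>
#include <cstring>

// Read a 4-byte int from an arbitrary, possibly unaligned address.
// memcpy avoids the undefined behaviour of dereferencing a misaligned
// pointer; on x86-64 and Arm64 it compiles to one ordinary load.
static inline std::int32_t load_i32_unaligned(const unsigned char* p) {
    std::int32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}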

Moving data between SSE and AVX512 registers?

I want to move four xmm registers into a zmm register, perform some computations using AVX512 instructions and get the result back to the XMM registers. What is the most efficient way to do so without going through memory?
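One plausible sketch with AVX-512F insert/extract intrinsics (illustrative only, not a benchmarked answer; the helper names are made up):

#include <immintrin.h>

// Pack four 128-bit registers into one zmm without going through memory.
static inline __m512i pack4_xmm_to_zmm(__m128i a, __m128i b, __m128i c, __m128i d) {
    __m512i z = _mm512_castsi128_si512(a);   // lane 0 = a, upper lanes undefined
    z = _mm512_inserti32x4(z, b, 1);         // lane 1 = b
    z = _mm512_inserti32x4(z, c, 2);         // lane 2 = c
    z = _mm512_inserti32x4(z, d, 3);         // lane 3 = d
    return z;
}

// Pull the four 128-bit lanes back out after the AVX-512 computation.
static inline void unpack_zmm_to_xmm(__m512i z, __m128i out[4]) {
    out[0] = _mm512_castsi512_si128(z);      // lane 0 is free (just a cast)
    out[1] = _mm512_extracti32x4_epi32(z, 1);
    out[2] = _mm512_extracti32x4_epi32(z, 2);
    out[3] = _mm512_extracti32x4_epi32(z, 3);
}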

Why is the method of im2col with GEMM more efficient than a direct implementation with SIMD in CNNs?

The convolutional layers are the most computationally intense parts of convolutional neural networks (CNNs). Currently the common approach to implementing convolutional layers is to expand the image into a column matrix (im2col) and perform Multiple Channel Multiple Kernel (MCMK) convolution using an existing parallel General Matrix Multiplication (GEMM) library. However, the im2col operation needs to load and store the image data, and also needs another memory block to hold the intermediate data.
If I need to optimize the convolutional implementation, I may choose a direct implementation with SIMD instructions. Such a method would not incur any memory-operation overhead.
From the end of the following post: https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
"the benefits from the very regular patterns of memory access outweigh the wasteful storage costs."
So I hope to know the reason. Do floating-point operations require more instruction cycles? Or is the input image not that large, so that it may stay resident in the cache, and the memory operations don't need to access DDR and consume fewer cycles?
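For context, a minimal single-channel, stride-1, no-padding sketch of the im2col expansion described above (names and simplifications are mine, not from the question):

#include <cstddef>
#include <vector>

// Expand an H x W image into a (K*K) x (outH*outW) matrix so that
// convolution with a K x K kernel becomes a single GEMM: each column
// holds the K*K input pixels read by one output position.
std::vector<float> im2col(const std::vector<float>& img, int H, int W, int K) {
    const int outH = H - K + 1, outW = W - K + 1;
    std::vector<float> cols((std::size_t)K * K * outH * outW);
    for (int ky = 0; ky < K; ++ky)
        for (int kx = 0; kx < K; ++kx)
            for (int y = 0; y < outH; ++y)
                for (int x = 0; x < outW; ++x)
                    cols[((std::size_t)(ky * K + kx) * outH + y) * outW + x] =
                        img[(std::size_t)(y + ky) * W + (x + kx)];
    return cols;
}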
Cache-blocking a GEMM is possible so you get mostly L1 cache hits (see also What Every Programmer Should Know About Memory?).
Fitting in the large shared L3 cache on typical x86 CPUs is not sufficient to make things efficient. The per-core L2 caches are typically 256kiB, and even that's slower than the 32kiB L1d cache.
Memory latency is very high compared to a CPU core clock cycle, but memory/cache bandwidth is not terrible these days with fast DDR4, or L3 cache hits. (But like I said, for a matmul with good cache blocking / loop tiling you can reuse data while it's still hot in L1d if you only transpose parts of the input matrix on the fly. Reducing off-core bandwidth requirements is also important for efficient matmul, not just transposing one matrix so its columns are sequential in memory.)
Beyond that, sequential access to memory is essential for efficient SIMD (loading a vector of multiple contiguous elements, letting you multiply / add / whatever 4 or 8 packed float elements with one CPU instruction). Striding down columns in a row-major matrix would hurt throughput even if the matrix was small enough to fit in L1d cache (32kiB).
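A bare-bones sketch of that cache-blocking / loop-tiling idea (tile sizes and names are illustrative; a real GEMM would also pack panels and vectorize the inner loop explicitly):

#include <algorithm>
#include <cstddef>

// C += A * B, all row-major, with simple loop tiling so each block of A and B
// is reused while it is still hot in cache. The innermost loop walks a row of
// B and a row of C sequentially, which is the access pattern SIMD loads want.
void gemm_blocked(const float* A, const float* B, float* C, int M, int N, int K) {
    const int BM = 64, BN = 64, BK = 64;   // tile sizes, tuned per cache level
    for (int i0 = 0; i0 < M; i0 += BM)
        for (int k0 = 0; k0 < K; k0 += BK)
            for (int j0 = 0; j0 < N; j0 += BN)
                for (int i = i0; i < std::min(i0 + BM, M); ++i)
                    for (int k = k0; k < std::min(k0 + BK, K); ++k) {
                        const float a = A[(std::size_t)i * K + k];
                        for (int j = j0; j < std::min(j0 + BN, N); ++j)
                            C[(std::size_t)i * N + j] += a * B[(std::size_t)k * N + j];
                    }
}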

Split RNN Memory Consumption Evenly Between GPUs in TensorFlow

I'm trying to figure out the most strategic way to evenly split the memory load of a seq2seq network between two GPUs.
With convolutional networks, the task is much easier. However, I'm trying to figure out how to maximize the memory usage of 2 Titan X's. The goal is to build the largest network that the combined 24GB of memory will allow.
One idea was to place each RNN layer in a separate GPU.
GPU1 --> RNN Layer 1 & Backward Pass
GPU2 --> RNN Layer 2,3,4
However, the backprop computations require a significant amount of memory. Therefore, another idea is to do the entire forward pass on one GPU and the backward pass on a separate GPU.
GPU1 --> Forward Pass
GPU2 --> Backward Pass
(However, GPU2 still takes most of the memory load)
Is there any way to measure how much of the GPU memory is being used? This would allow us to figure out how to maximize each GPU before it's "filled up".
Once 2 GPUs are used, I would eventually want to use four. However, I think maximizing 2 GPUs is the first step.
Setting colocate_gradients_with_ops to True may work. It allows GPU memory to be allocated more evenly.
# Place each gradient op on the same device as its corresponding forward op,
# which spreads gradient memory across the GPUs used by the forward pass.
optimizer = tf.train.AdamOptimizer(learning_rate)
gvs = optimizer.compute_gradients(loss, colocate_gradients_with_ops=True)
train_op = optimizer.apply_gradients(gvs, global_step=self.global_step)

How do the cuSPARSE and cuBLAS libraries deal with memory allocated using cudaMallocPitch?

I am implementing a simple routine that performs sparse matrix - dense matrix multiplication using cusparseScsrmm from cuSPARSE. This is part of a bigger application that could allocate memory on GPU using cudaMalloc (more than 99% of the time) or cudaMallocPitch (very rarely used). I have a couple of questions regarding how cuSPARSE deals with pitched memory:
1) I passed pitched memory into the cuSPARSE routine, but the results were incorrect (as expected, since there is no way to pass in the pitch as an argument). Is there a way to get these libraries working with memory allocated using cudaMallocPitch?
2) What is the best way to deal with this? Should I just add a check in the calling function, to enforce that the memory not be allocated using pitched mode?
For sparse matrix operations, the concept of pitched data has no relevance anyway.
For dense matrix operations most operations don't directly support a "pitch" to the data per se, however various operations can operate on a sub-matrix. With a particular caveat, it should be possible for such operations to handle pitched or unpitched data. Any time you see a CUBLAS (or CUSPARSE) operation that accepts "leading dimension" arguments, those arguments could be used to encompass a pitch in the data.
Since the "leading dimension" parameter is specified in matrix elements, and the pitch is (usually) specified in bytes, the caveat here is that the pitch must be evenly divisible by the size of the matrix element in question, so that the pitch (in bytes) can be converted to a "leading dimension" specified in matrix elements. I would expect this to typically be possible for char, int, float, double and similar types, as I believe the pitch quantity returned by cudaMallocPitch will usually be evenly divisible by 16. But there is no stated guarantee of this, so proper run-time checking is advised if you intend to use this approach.
For example, it should be possible to perform a CUBLAS matrix-matrix multiply (gemm) on pitched data, with appropriate specification of the lda, ldb and ldc parameters.
The operation you indicate does offer such leading dimension parameters for the dense matrices involved.
If 99% of your use-cases don't use pitched data, I would either not support pitched data at all, or else, for operations where no leading dimension parameters are available, copy the pitched data to an unpitched buffer for use in the desired operation. A device-to-device pitched to unpitched copy can run at approximately the rate of memory bandwidth, so it might be fast enough to not be a serious issue for 1% of the use cases.
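As a concrete illustration of the leading-dimension approach above, here is a sketch (assuming single precision, cuBLAS's column-major convention with one matrix column per pitched row, and omitting error checking and data initialization):

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

// C = A * B where each matrix is allocated with cudaMallocPitch, storing one
// column per pitched row: width = rows*sizeof(float), height = cols. The
// pitch (in bytes) must divide evenly by sizeof(float) so it can be passed
// to cuBLAS as a leading dimension measured in elements.
int main() {
    const int m = 128, k = 256, n = 64;
    float *dA, *dB, *dC;
    size_t pitchA, pitchB, pitchC;
    cudaMallocPitch((void**)&dA, &pitchA, m * sizeof(float), k);  // A: m x k
    cudaMallocPitch((void**)&dB, &pitchB, k * sizeof(float), n);  // B: k x n
    cudaMallocPitch((void**)&dC, &pitchC, m * sizeof(float), n);  // C: m x n

    if (pitchA % sizeof(float) || pitchB % sizeof(float) || pitchC % sizeof(float)) {
        std::fprintf(stderr, "pitch not element-aligned; fall back to an unpitched copy\n");
        return 1;
    }

    cublasHandle_t h;
    cublasCreate(&h);
    const float alpha = 1.0f, beta = 0.0f;
    // Leading dimensions are the pitches converted from bytes to elements.
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha,
                dA, (int)(pitchA / sizeof(float)),
                dB, (int)(pitchB / sizeof(float)), &beta,
                dC, (int)(pitchC / sizeof(float)));
    cublasDestroy(h);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}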
