Why is accessing non-naturally aligned memory not efficient? - memory

Let's assume we have a 64-bit CPU which will always read 8 bytes of memory at a time, and I want to store a 4-byte int. According to the definition of natural alignment, a 4-byte object is aligned to an address that's a multiple of 4 (e.g. 0x0000, 0x0004). But here is the problem: why can't I store it at address 0x0001, for example? To my understanding, since the CPU will always read 8 bytes of data, reading from address 0x0000 can still get the int stored at 0x0001 in one go. So why is natural alignment needed in this case?

Modern CPUs (Intel, Arm) will quite happily read from unaligned addresses. They are typically architected to read much more than 8 bytes per cycle, perhaps 16 or 32 bytes, and their deep pipelines manage quite nicely to extract the wanted 8 bytes from arbitrary addresses without any visible penalty.
Often, but not always, algorithms can be written without much concern for the alignment of arrays (or the start of each row of a 2-dimensional array).
The pipelined architectures possibly read aligned 16-byte blocks at a time, meaning that when 8 bytes are read from address 0x0009, the CPU actually needs to read two 16-byte blocks, combine them and extract the middle 8 bytes. Things become even more complicated when the memory is not available in the first-level cache and a full 64-byte cache line needs to be fetched from the next-level cache or from main memory.
In my experience (writing and optimising image processing algorithms for SIMD), many Arm64 implementations hide the cost of loading from unaligned addresses almost perfectly for algorithms with simple, linear memory access. Things get worse if the algorithm needs to read heavily from many unaligned addresses, such as when filtering with a 3x3 or larger kernel, or when computing high-radix FFTs; the CPU's capacity for fetching and combining memory blocks is soon exhausted.
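To make the question's scenario concrete, here is a minimal C sketch (the buffer, offset and value are hypothetical) of storing and reading back a 4-byte int at an unaligned offset. The memcpy form is the portable way to express an unaligned access; current x86-64 and Arm64 compilers typically lower it to a single unaligned load, and the machinery described above is what hides its cost.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    int main(void) {
        /* 16-byte buffer standing in for two aligned 8-byte blocks. */
        uint8_t buf[16] = {0};
        int32_t value = 0x11223344;

        /* Store the 4-byte int at offset 1, which is not a multiple of 4. */
        memcpy(buf + 1, &value, sizeof value);

        /* Portable unaligned read-back: memcpy avoids the undefined behaviour
           of dereferencing a misaligned int32_t pointer, and the compiler
           normally emits a single unaligned load for it. */
        int32_t readback;
        memcpy(&readback, buf + 1, sizeof readback);

        printf("0x%08x\n", (unsigned)readback);  /* prints 0x11223344 */
        return 0;
    }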

Related

Does packing BooleanTensors into ByteTensors affect training of an LSTM (or other ML models)?

I am working on an LSTM to generate music. My input data will be a BooleanTensor of size 88xLx3, 88 being the number of available notes, L being the length of each "piece", which will be on the order of 1k to 10k (TBD), and 3 being the parts for "lead melody", "accompaniment", and "bass". A value of 0 would mean that that specific note is not being played by that part (instrument) at that time, and a 1 would mean that it is.
The problem is that each entry of a BooleanTensor takes 1 byte of space in memory instead of 1 bit, which wastes a lot of valuable GPU memory.
As a solution I thought of packing each BooleanTensor to a ByteTensor (uint8) of size 11xLx3 or 88x(L/8)x3.
My question is: Would packing the data as such have an effect on the learning and generation of the LSTM or would the ByteTensor-based data and model be equivalent to their BooleanTensor-based counterparts in practice?
I wouldn't really worry about the input taking X instead of Y bits, at least when it comes to GPU memory. Most of it is occupied by the network's weights and intermediate outputs, which will likely be float32 anyway (maybe float16). There is active research on training with lower precision (even binary training), but based on your question it seems completely unnecessary. Lastly, you can always apply quantization to your production models if you really need it.
With regard to the packing: it can have an impact, especially if you do it naively. The grouping you're suggesting doesn't seem to be a natural one, so it may be harder to learn patterns from the grouped data than otherwise. There will always be workarounds, but then this answer becomes an opinion, because it is almost impossible to anticipate what could work, and opinion-based questions and answers are off-topic around here :)
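Purely to make the proposed 88-into-11 grouping concrete, here is a small C-style sketch (the function name and bit layout are hypothetical); whether a model can still learn from data packed this way is exactly the open question above.

    #include <stdint.h>

    /* Pack the 88 boolean note flags of one time step and one part into
       11 bytes: bit (note % 8) of packed[note / 8] holds note `note`.
       This is one possible layout, shown only for illustration. */
    void pack_notes(const uint8_t notes[88], uint8_t packed[11]) {
        for (int i = 0; i < 11; ++i)
            packed[i] = 0;
        for (int note = 0; note < 88; ++note)
            if (notes[note])
                packed[note / 8] |= (uint8_t)(1u << (note % 8));
    }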

How do the cuSPARSE and cuBLAS libraries deal with memory allocated using cudaMallocPitch?

I am implementing a simple routine that performs sparse matrix - dense matrix multiplication using cusparseScsrmm from cuSPARSE. This is part of a bigger application that could allocate memory on GPU using cudaMalloc (more than 99% of the time) or cudaMallocPitch (very rarely used). I have a couple of questions regarding how cuSPARSE deals with pitched memory:
1) I passed pitched memory into the cuSPARSE routine, but the results were incorrect (as expected, since there is no way to pass the pitch as an argument). Is there a way to get these libraries working with memory allocated using cudaMallocPitch?
2) What is the best way to deal with this? Should I just add a check in the calling function, to enforce that the memory not be allocated using pitched mode?
For sparse matrix operations, the concept of pitched data has no relevance anyway.
For dense matrix operations most operations don't directly support a "pitch" to the data per se, however various operations can operate on a sub-matrix. With a particular caveat, it should be possible for such operations to handle pitched or unpitched data. Any time you see a CUBLAS (or CUSPARSE) operation that accepts "leading dimension" arguments, those arguments could be used to encompass a pitch in the data.
Since the "leading dimension" parameter is specified in matrix elements, and the pitch is (usually) specified in bytes, the caveat here is that the pitch must be evenly divisible by the size of the matrix element in question, so that the pitch (in bytes) can be converted to a "leading dimension" parameter specified in matrix elements. I would expect this to be typically possible for char, int, float, double and similar types, as I believe the pitch quantity returned by cudaMallocPitch will usually be evenly divisible by 16. But there is no stated guarantee of this, so proper run-time checking is advised if you intend to use this approach.
For example, it should be possible to perform a CUBLAS matrix-matrix multiply (gemm) on pitched data, with appropriate specification of the lda, ldb and ldc parameters.
The operation you indicate does offer such leading dimension parameters for the dense matrices involved.
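As a sketch of that idea (the sizes and variable names are made up, and data initialisation and error checking are omitted), a pitched allocation can be handed to cublasSgemm by converting the byte pitch into an element-count leading dimension, with the divisibility check from the caveat above:

    #include <cuda_runtime.h>
    #include <cublas_v2.h>
    #include <stdio.h>

    int main(void) {
        /* Hypothetical column-major sizes: C (m x n) = A (m x k) * B (k x n). */
        const int m = 64, n = 32, k = 128;
        float *dA, *dB, *dC;
        size_t pitchA, pitchB, pitchC;

        /* One pitched "row" of m*sizeof(float) bytes holds one column of A, etc. */
        cudaMallocPitch((void **)&dA, &pitchA, m * sizeof(float), k);
        cudaMallocPitch((void **)&dB, &pitchB, k * sizeof(float), n);
        cudaMallocPitch((void **)&dC, &pitchC, m * sizeof(float), n);

        /* The caveat above: the byte pitch must divide evenly by the element size. */
        if (pitchA % sizeof(float) || pitchB % sizeof(float) || pitchC % sizeof(float)) {
            fprintf(stderr, "pitch not a multiple of sizeof(float); fall back to a copy\n");
            return 1;
        }
        int lda = (int)(pitchA / sizeof(float));
        int ldb = (int)(pitchB / sizeof(float));
        int ldc = (int)(pitchC / sizeof(float));

        /* (Filling A and B with data is omitted from this sketch.) */
        cublasHandle_t handle;
        cublasCreate(&handle);
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    &alpha, dA, lda, dB, ldb, &beta, dC, ldc);

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }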
If 99% of your use cases don't use pitched data, I would either not support pitched data at all or, for operations where no leading dimension parameters are available, copy the pitched data to an unpitched buffer for use in the desired operation. A device-to-device pitched-to-unpitched copy can run at approximately the rate of memory bandwidth, so it may well be fast enough not to be a serious issue for 1% of the use cases.
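A minimal sketch of that fallback (the helper name and the assumption that the buffer holds floats are mine): cudaMemcpy2D does the device-to-device repacking from a padded pitch to a tight one in a single call.

    #include <cuda_runtime.h>

    /* Repack a pitched device allocation into a contiguous device buffer so it
       can be passed to a routine that has no leading-dimension parameter.
       `width` and `height` describe the logical float array (elements per row, rows). */
    cudaError_t unpitch_copy(float *dst_contig, const float *src_pitched,
                             size_t src_pitch_bytes, size_t width, size_t height) {
        return cudaMemcpy2D(dst_contig, width * sizeof(float),   /* tight destination pitch  */
                            src_pitched, src_pitch_bytes,        /* padded source pitch      */
                            width * sizeof(float), height,       /* bytes per row, row count */
                            cudaMemcpyDeviceToDevice);
    }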

Difference between a Byte, Word, Long and a Long Word?

I'm aware that a Byte is 8 bits, but what do the others represent? I'm taking an assembly course which uses the Motorola 68k architecture, and I'm confused by the vocabulary used.
As mentioned on the first page of the operator's manual for the 68k Architecture, in your case a word is 16 bits and a long word is 32 bits.
In an assembly language, a word is the CPU's natural working size. Each instruction, as well as each address in memory, tends to be one word in length. Whereas a byte is always 8 bits, the size of a word depends on the architecture you're working in.
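If it helps to see the same vocabulary from the C side, here is a small sketch (the typedef names are mine); on the 68k these fixed-width types line up with the assembler's .b, .w and .l operand-size suffixes.

    #include <stdint.h>
    #include <assert.h>

    /* 68k size vocabulary, expressed as C fixed-width types. */
    typedef int8_t  byte_t;       /* .b : byte,       8 bits */
    typedef int16_t word_t;       /* .w : word,      16 bits */
    typedef int32_t long_word_t;  /* .l : long word, 32 bits */

    static_assert(sizeof(byte_t) == 1, "byte is 8 bits");
    static_assert(sizeof(word_t) == 2, "word is 16 bits");
    static_assert(sizeof(long_word_t) == 4, "long word is 32 bits");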

If each piece of data takes 128 bits or more, is there any advantage to grouping them in memory?

I've read in the CUDA Programming Guide that global memory in a CUDA device is accessed in transactions of 32, 64 or 128 bytes. Knowing that, is there any advantage to, say, having a set of float4 values (128 bits each) close together in memory? As I understand it, whether the float4 values are scattered in memory or laid out in sequence, the number of transactions will be the same. Or will all the accesses be coalesced into one gigantic transaction?
Coalescing refers to combining memory requests from individual threads in a warp into a single memory transaction.
A single memory transaction is typically a 128 byte cache line, therefore it would consist of eight 128 bit (e.g. float4) quantities.
So, yes, there is a benefit to having multiple threads requesting adjacent 128 bit quantities, because these can still be coalesced into a single (128 byte) cache line request to memory.
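A minimal CUDA sketch of the access pattern being described (the kernel and array names are hypothetical): consecutive threads read consecutive float4 elements, so eight neighbouring threads share one 128-byte cache line and a full 32-thread warp needs only four line-sized transactions instead of 32 scattered ones.

    #include <cuda_runtime.h>

    /* Thread i reads element i: eight consecutive float4 loads (8 x 16 bytes)
       fall into one 128-byte cache line, so the warp's requests coalesce.
       Launch e.g. as: sum_float4<<<(n + 255) / 256, 256>>>(d_in, d_out, n); */
    __global__ void sum_float4(const float4 *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float4 v = in[i];                 /* coalesced 128-bit load */
            out[i] = v.x + v.y + v.z + v.w;   /* coalesced 32-bit store */
        }
    }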

Best 8-bit supplemental checksum for CRC8-protected packet

I'm looking at designing a low-level radio communications protocol and am trying to decide what sort of checksum/CRC to use. The hardware provides a CRC-8; each packet has 6 bytes of overhead in addition to the data payload. One of the design goals is to minimize transmission overhead. For some types of data the CRC-8 should be adequate, but for other types it would be necessary to supplement it to avoid accepting erroneous data.
If I go with a single-byte supplement, what would be the pros and cons of using a CRC8 with a different polynomial from the hardware CRC-8, versus an arithmetic checksum, versus something else? What about for a two-byte supplement? Would a CRC-16 be a good choice, or given the existence of a CRC-8, would something else be better?
In 2004 Philip Koopman of CMU published a paper on choosing the most appropriate CRC: http://www.ece.cmu.edu/~koopman/crc/index.html
This paper describes a polynomial selection process for embedded network applications and proposes a set of good general-purpose polynomials. A set of 35 new polynomials in addition to 13 previously published polynomials provides good performance for 3- to 16-bit CRCs for data word lengths up to 2048 bits.
That paper should help you analyze how effective the hardware's 8-bit CRC actually is, and how much more protection you'll get from another 8 bits. A while back it helped me decide on a 4-bit CRC and a 4-bit packet header in a custom protocol between FPGAs.
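For concreteness, here is a bit-at-a-time CRC-8 in C. The polynomial 0x07 (the common CRC-8/CCITT generator) is only a placeholder; the point of Koopman's tables is to pick a polynomial whose error detection, at your payload lengths, complements the hardware CRC-8 you already have.

    #include <stdint.h>
    #include <stddef.h>

    /* Bitwise CRC-8: MSB-first, initial value 0x00, no reflection, no final XOR.
       CRC8_POLY 0x07 (x^8 + x^2 + x + 1) is a placeholder choice; substitute the
       polynomial selected from Koopman's tables for your packet lengths. */
    #define CRC8_POLY 0x07u

    uint8_t crc8(const uint8_t *data, size_t len)
    {
        uint8_t crc = 0x00;
        for (size_t i = 0; i < len; ++i) {
            crc ^= data[i];
            for (int bit = 0; bit < 8; ++bit)
                crc = (crc & 0x80u) ? (uint8_t)((crc << 1) ^ CRC8_POLY)
                                    : (uint8_t)(crc << 1);
        }
        return crc;
    }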
