Difference between a Byte, a Word, a Long, and a Long Word?

I'm aware that a Byte is 8 bits, but what do the others represent? I'm taking an assembly course which uses the Motorola 68k architecture, and I'm confused by the vocabulary used.

As mentioned on the first page of the operator's manual for the 68k architecture, in your case a word is 16 bits and a long word is 32 bits ("long" is just common shorthand for a long word).
In an assembly language, a word is the CPU's natural working size. Each instruction, as well as each address in memory, tends to be one word in length. Whereas a byte is always 8 bits, the size of a word depends on the architecture you're working with.
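If it helps to see the sizes side by side, here is a minimal C++ sketch (not 68k assembly, just an illustration) mapping the 68k size names onto fixed-width types; the .B/.W/.L comments refer to the operand-size suffixes used in 68k assembly:

    // Illustrative only: the 68k size names as C++ fixed-width types.
    #include <cstdint>
    #include <cstdio>

    int main() {
        std::uint8_t  b = 0;  // byte:      8 bits  (.B suffix in 68k assembly)
        std::uint16_t w = 0;  // word:      16 bits (.W suffix)
        std::uint32_t l = 0;  // long word: 32 bits (.L suffix)
        std::printf("byte=%zu word=%zu long word=%zu (bytes)\n",
                    sizeof b, sizeof w, sizeof l);
    }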

Related

Why is accessing non-naturally aligned memory inefficient?

Let's assume we have a 64-bit CPU that always reads 8 bytes of memory at a time, and I want to store a 4-byte int. According to the definition of natural alignment, a 4-byte object is aligned to an address that's a multiple of 4 (e.g. 0x0000, 0x0004). But here is the problem: why can't I store it at address 0x0001, for example? To my understanding, since the CPU always reads 8 bytes of data, reading from address 0x0000 can still get the int stored at 0x0001 in one go. So why is natural alignment needed in this case?
Modern CPUs (Intel, Arm) will quite happily read from unaligned addresses. They are typically architected to read much more than 8 bytes per cycle, perhaps 16 or 32 bytes, and their deep pipelines manage quite nicely to extract the wanted 8 bytes from arbitrary addresses without any visible penalty.
Often, but not always, algorithms can be written without much concern about the alignment of arrays (or of the start of each row of a 2-dimensional array).
A pipelined architecture may read aligned 16-byte blocks at a time, meaning that when 8 bytes are read from address 0x0009, the CPU actually needs to read two 16-byte blocks, combine them, and extract the middle 8 bytes. Things become even more complicated when the memory is not available in the first-level cache and a full 64-byte cache line needs to be fetched from the next-level cache or from main memory.
In my experience (writing and optimising image processing algorithms for SIMD), many Arm64 implementations hide the cost of loading from unaligned addresses almost perfectly for algorithms with simple, linear memory access. Things become worse if the algorithm needs to read heavily from many unaligned addresses, such as when filtering with a 3x3 or larger kernel, or when computing high-radix FFTs; the CPU's capacity for transferring and combining memory is soon exhausted.
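To make the block-splitting concrete, here is a small C++ sketch. It assumes the hypothetical 16-byte aligned read blocks described above and simply counts how many blocks an 8-byte load touches at an aligned versus an unaligned address:

    // Illustration: how many aligned 16-byte blocks an 8-byte load spans.
    // An aligned load stays inside one block; the load at 0x0009 straddles
    // two, forcing the hardware to read both and combine them.
    #include <cstdint>
    #include <cstdio>

    int main() {
        const std::uint64_t addrs[] = {0x0008, 0x0009};
        for (std::uint64_t a : addrs) {
            std::uint64_t first = a / 16;            // block holding the first byte
            std::uint64_t last  = (a + 8 - 1) / 16;  // block holding the last byte
            std::printf("8-byte load at 0x%04llx touches %llu block(s)\n",
                        (unsigned long long)a,
                        (unsigned long long)(last - first + 1));
        }
    }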

One-hot encoding for a large number of values

How do we use one-hot encoding if the number of values a categorical variable can take is large?
In my case it is 56 values. So, per the usual method, I would have to add 56 columns (56 binary features) to the training dataset, which will immensely increase the complexity and hence the training time.
So how do we deal with such cases?
Use a compact encoding. This trades time for space, although in practice the time penalty is often very small.
The most accessible idea is a vector of 56 booleans, if your data format supports that. The most direct mapping is to use a 64-bit integer, with each bit being one boolean; this is how we implement one-hot vectors in hardware design. Most 4G languages (and mature 3G languages) include fast routines for bit manipulation. You will need operations to get, set, clear, and find bits; a sketch follows.
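As a rough illustration, here is a minimal C++20 sketch of those four operations on a 64-bit integer (the names set_bit, get_bit, and so on are mine, not from any particular library):

    // One-hot encoding of up to 64 categories packed into a 64-bit integer.
    // Category indices are 0-based and must be < 64.
    #include <bit>      // std::countr_zero (C++20)
    #include <cassert>
    #include <cstdint>

    using OneHot = std::uint64_t;

    inline OneHot set_bit(OneHot v, unsigned i)   { return v |  (OneHot{1} << i); }
    inline OneHot clear_bit(OneHot v, unsigned i) { return v & ~(OneHot{1} << i); }
    inline bool   get_bit(OneHot v, unsigned i)   { return (v >> i) & 1; }
    // Index of the lowest set bit; returns 64 if no bit is set.
    inline int    find_bit(OneHot v)              { return std::countr_zero(v); }

    int main() {
        OneHot v = set_bit(0, 42);   // category 42 out of 56
        assert(get_bit(v, 42));
        assert(find_bit(v) == 42);
        v = clear_bit(v, 42);
        assert(v == 0);
    }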
Does that get you moving?

How are numbers represented in a computer, and what is the role of floating point and two's complement?

I have a very general question about how computers work with numbers.
In general, computer systems only know binary: 0 and 1. So in memory, any number is a sequence of bits. It does not matter if the number represented is an int or a float.
But when do things like floating-point numbers based on the IEEE 754 standard and two's complement enter the game? Is this only a thing of the compilers (C/C++, ...) and VMs (.NET/Java)?
Is it true that all integer numbers are represented using two's complement?
I have read about CPUs that use co-processors for performing floating-point arithmetic. To tell the CPU to use them, special assembler instructions exist, like add.s (single precision) and add.d (double precision). When I have some C++ code where a float is used, will such assembler instructions be in the output?
I am totally confused at the moment. It would be great if you could help me with that.
Thank you!
Stefan
In general, computer systems only know binary: 0 and 1. So in memory, any number is a sequence of bits. It does not matter if the number represented is an int or a float.
This is correct for the representation in memory. But computers execute instructions, and they keep the data currently being worked on in registers. On a typical computer, some instructions and registers are specialized for signed integers in two's complement representation, and others for IEEE 754 binary32 and binary64 arithmetic.
So to answer your first question:
But when do things like floating-point numbers based on the IEEE 754 standard and two's complement enter the game? Is this only a thing of the compilers (C/C++, ...) and VMs (.NET/Java)?
Two's complement and IEEE 754 binary floating point are very much choices made by the ISA (instruction set architecture), which provides specialized instructions and registers to deal with these formats in particular.
Is it true that all integer numbers are represented using two's complement?
You can represent integers however you want. But if you represent your signed integers using two's complement, the typical ISA will provide instructions to operate efficiently on them. If you make another choice, you will be on your own.
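As a small illustration (not tied to any particular ISA), here is a C++20 sketch that uses std::bit_cast to look at the bit patterns these two formats produce. And to the last question: yes, on a typical x86-64 compiler, adding two floats compiles to a scalar floating-point instruction such as addss, the analogue of the add.s mentioned above.

    #include <bit>      // std::bit_cast (C++20)
    #include <cstdint>
    #include <cstdio>

    int main() {
        std::int32_t i = -1;
        // Two's complement: -1 is "all ones" -> prints 0xFFFFFFFF.
        std::printf("-1 as bits:   0x%08X\n",
                    (unsigned)std::bit_cast<std::uint32_t>(i));

        float f = 1.5f;
        // IEEE 754 binary32: sign 0, exponent 01111111, fraction 100...0
        // -> prints 0x3FC00000.
        std::printf("1.5f as bits: 0x%08X\n",
                    (unsigned)std::bit_cast<std::uint32_t>(f));
    }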

If each piece of data takes 128 bits or more, is there any advantage to grouping them in memory?

I've read in the CUDA Programming Guide that the global memory in a CUDA device is accessed by transactions of 32, 64, or 128 bytes. Knowing that, is there any advantage to, say, having a set of float4 (128-bit) values close together in memory? As I understand it, whether the float4 values are scattered in memory or laid out in a sequence, the number of transactions will be the same. Or will all accesses be coalesced into one gigantic transaction?
Coalescing refers to combining memory requests from individual threads in a warp into a single memory transaction.
A single memory transaction is typically a 128-byte cache line, so it would consist of eight 128-bit (e.g. float4) quantities.
So, yes, there is a benefit to having multiple threads request adjacent 128-bit quantities, because those requests can be coalesced into a single (128-byte) cache-line request to memory.
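As a sketch of the access pattern that coalesces well, here is a minimal CUDA C++ kernel (the kernel name and the scaling operation are just for illustration). Consecutive threads read consecutive float4 elements, so a warp's requests fall on adjacent 16-byte quantities that the hardware can combine into cache-line-sized transactions:

    // Consecutive threads load consecutive float4 elements, so the
    // per-thread 128-bit loads and stores coalesce into full cache lines.
    __global__ void scale(const float4* in, float4* out, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float4 v = in[i];        // coalesced 128-bit load per thread
            v.x *= s; v.y *= s; v.z *= s; v.w *= s;
            out[i] = v;              // coalesced 128-bit store per thread
        }
    }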

Best 8-bit supplemental checksum for CRC8-protected packet

I'm looking at designing a low-level radio communications protocol, and am trying to decide what sort of checksum/CRC to use. The hardware provides a CRC-8; each packet has 6 bytes of overhead in addition to the data payload. One of the design goals is to minimize transmission overhead. For some types of data the CRC-8 should be adequate, but for other types it would be necessary to supplement it to avoid accepting erroneous data.
If I go with a single-byte supplement, what would be the pros and cons of using a CRC8 with a different polynomial from the hardware CRC-8, versus an arithmetic checksum, versus something else? What about for a two-byte supplement? Would a CRC-16 be a good choice, or given the existence of a CRC-8, would something else be better?
In 2004, Philip Koopman of CMU published a paper on choosing the most appropriate CRC: http://www.ece.cmu.edu/~koopman/crc/index.html
This paper describes a polynomial selection process for embedded network applications and proposes a set of good general-purpose polynomials. A set of 35 new polynomials in addition to 13 previously published polynomials provides good performance for 3- to 16-bit CRCs for data word lengths up to 2048 bits.
That paper should help you analyze how effective that 8-bit CRC actually is, and how much more protection you'll get from another 8 bits. A while back it helped me decide on a 4-bit CRC and a 4-bit packet header in a custom protocol between FPGAs.
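If you end up rolling your own supplement, a bit-at-a-time CRC-8 is only a few lines. Here is a C++ sketch; the polynomial 0x31 (x^8 + x^5 + x^4 + 1) and the zero initial value are just examples, so pick parameters from Koopman's tables for your data word length:

    // Straightforward MSB-first, bit-at-a-time CRC-8.
    // poly and init are example parameters, not a recommendation.
    #include <cstddef>
    #include <cstdint>

    std::uint8_t crc8(const std::uint8_t* data, std::size_t len,
                      std::uint8_t poly = 0x31, std::uint8_t init = 0x00) {
        std::uint8_t crc = init;
        for (std::size_t i = 0; i < len; ++i) {
            crc ^= data[i];                      // fold in the next byte
            for (int bit = 0; bit < 8; ++bit)    // process it bit by bit
                crc = (crc & 0x80) ? (std::uint8_t)((crc << 1) ^ poly)
                                   : (std::uint8_t)(crc << 1);
        }
        return crc;
    }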
