Dealing with 25 16bit integer using avx2 - avx2

Avx2 has 256 bit vector registers ymm0-ymm15, these registers can deal with 4 integer of 64 bit, or 8 integer of 32bit, or 16 integer of 16 bit. So the parallelism of one avx2 instruction is 4, 8 or 16. My question is: if I want to operate integers which are not multiple times of the parallelism, for example: There are 25 integers, each of them is 16 bit, if I load 16 of them into ymm0, how to deal with the rest 9 integer, since they can't fit exactly into one 256 bit vector register.

Related

Diff between bit and byte, and exact meaning of byte

This is just basic theoretical question. so I read that a bit consist of 0 or 1. and a byte consists of 8 bits. and in 8 bit we can store 2^8 nos.
similarly in 10 bits we store 2^10 (1024). but then why do we say that 1024 is 1 kilo bytes, its actually 10 bits which just 1.25 byte to be exact.
please share some knowledge on it
just a concrete explanation.
Bit means like there are 8 bits in 1 byte, bit is the smallest unit of any storage or you can say the system and 8 bits sums up to 1 byte.
A bit, short for binary digit, is the smallest unit of measurement used in computers for information storage. A bit is represented by a 1 or a 0 with the value true or false, also known as on or off. A single byte of information, also known as an octet, is made up of eight bits. The size, or amount of information stored, distinguishes a bit from a byte.
A kilobit is 1,000 bits, but it is designated as 1024 bits in the binary system due to the amount of space required to store a kilobit using common operating systems and storage schemes. Most people, however, think of kilo as referring to 1,000 in order to remember what a kilobit is. A kilobyte then, would be 1,000 bytes.

HOW does a 8 bit processor interpret the 2 bytes of a 16 bit number to be a single piece of info?

Assume the 16 bit no. to be 256.
So,
byte 1 = Some binary no.
byte 2 = Some binary no.
But byte 1 also represents a 8 bit no.(Which could be an independent decimal number) and so does byte 2..
So how does the processor know that bytes 1,2 represent a single no. 256 and not two separate numbers
The processor would need to have another long type for that. I guess you could implement a software equivalent, but for the processor, these two bytes would still have individual values.
The processor could also have a special integer representation and machine instructions that handle these numbers. For example, most modern machines nowadays use twos-complement integers to represent negative numbers. In twos-complement, the most significant bit is used to differentiate negative numbers. So a twos-complement 8-bit integer can have a range of -128 (1000 0000) to 127 (0111 111).
You could easily have the most significant bit mean something else, so for example, when MSB is 0 we have integers from 0 (0000 0000) to 127 (0111 1111); when MSB is 1 we have integers from 256 (1000 0000) to 256 + 127 (1111 1111). Whether this is efficient or good architecture is another history.

Do the Airmont cores on Knight's Landing Xeon Phi's support SIMD instructions?

According to the source of the Wikipedia page on the Knight's Landing chip, it has Airmont cores. According to this page, those cores support SSE4.2 instructions, that is, SIMD instructions on SIMD registers. Is that really the case? If so, what's the actual maximum width of, say, arithmetic instructions on these Airmont cores? (In terms of total width of the register, or width of a lane or element within the register x number of lanes).
Each core has two vector units which, as well as 512 bit AVX-512, also support all SSE variants (at 128 bits of course), and likewise AVX/AVX2 (at 256 bits).
The 512 bit ZMM registers can be used as 256 bit AVX (YMM) registers or 128 bit SSE (XMM) registers. If you want to do anything with 8 or 16 bit vector elements though you are limited to SSE/AVX2, since AVX-512BW support is lacking.

Understanding an SDRAM Datasheet

I am currently working on a reverse engineering project. I am checking out an SDRAM data sheet and it says that the memory chip is organized as 2,097,152 Words × 4 banks × 16 bits
When these values are multiplied the result is: 134217728.
Does the result mean that the SDRAM can store 134217728 bytes?
Another way I thought of it is that, the Chip contains 2,097,152 Bytes x 4 banks, which are 16 Bit Aligned.
Does the result mean that the SDRAM can store 134217728 bytes?
No, that's bits, not bytes. And your brain will find it much easier to work in 'Kilo', or 'Mega'.
134217728 bits = 128 Megabits (Mb) = 16 Megabytes (MB)

Efficient way to create a bit mask from multiple numbers possibly using SSE/SSE2/SSE3/SSE4 instructions

Suppose I have 16 ascii characters (hence 16 8 bit numbers) in a 128 bit variable/register. I want to create a bit mask in which those bits will be high whose bit positions (indexes) are represented by those 16 characters.
For example, if the string formed from those 16 characters is "CAD...", in the bit mask 67th bit, 65th bit, 68th bit and so on should be 1. The rest of the bits should be 0. What is the efficient way to do it specially using SIMD instructions?
I know that one of the technique is addition like this: 2^(67-1)+2^(65-1)+2^(68-1)+...
But this will require a large number of operations. I want to do it in one/two operations/instructions if possible.
Please let me know a solution.
SSE4.2 contains one instruction, that performs almost what you want: PCMPISTRM with immediate operand 0. One of its operands should contain your ASCII characters, other - a constant vector with values like 32, 33, ... 47. You get the result in 16 least significant bits of XMM0. Since you need 128 bits, this instruction should be executed 8 times with different constant vectors (6 times if you need only printable ASCII characters). After each PCMPISTRM, use bitwise OR to accumulate the result in some XMM register.
There are 2 disadvantages of this method: (1) you need to read the Intel's architectures software developer's manual to understand PCMPISTRM's details because that's probably the most complicated SSE instruction ever, and (2) this instruction is pretty slow (throughput of 1/2 on Nehalem, 1/3 on Sandy Bridge, 1/4 on Bulldozer), so you'll hardly get any significant speed improvement over 'brute force' method.

Resources