Do the Airmont cores on Knights Landing Xeon Phis support SIMD instructions? - sse

According to the source of the Wikipedia page on the Knights Landing chip, it has Airmont cores. According to this page, those cores support SSE4.2 instructions, that is, SIMD instructions on SIMD registers. Is that really the case? If so, what is the actual maximum width of, say, arithmetic instructions on these Airmont cores? (In terms of the total width of the register, or the width of a lane or element within the register x number of lanes.)

Each core has two vector units which, in addition to 512-bit AVX-512, also support all the SSE variants (at 128 bits, of course) and likewise AVX/AVX2 (at 256 bits).
The 512-bit ZMM registers can be used as 256-bit AVX (YMM) registers or 128-bit SSE (XMM) registers. If you want to do anything with 8- or 16-bit vector elements, though, you are limited to SSE/AVX2, since AVX-512BW support is lacking.
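To make the element-width point concrete, here is a minimal, illustrative C intrinsics sketch (the function names are made up, and the compiler flags are only examples): 32-bit adds can use the full 512-bit registers with plain AVX-512F, which KNL has, while 8-bit adds have to drop down to 256-bit AVX2 (or 128-bit SSE) because KNL lacks AVX-512BW.

    /* Sketch only; compile with e.g. gcc -O2 -march=knl (or icc -xMIC-AVX512). */
    #include <immintrin.h>

    /* 16 x 32-bit adds per instruction: AVX-512F is enough, so this runs
     * at the full 512-bit width on KNL. */
    void add_i32(const int *a, const int *b, int *c)
    {
        __m512i va = _mm512_loadu_si512(a);
        __m512i vb = _mm512_loadu_si512(b);
        _mm512_storeu_si512(c, _mm512_add_epi32(va, vb));
    }

    /* 8-bit elements: _mm512_add_epi8 would need AVX-512BW, which KNL does
     * not have, so the widest native option is 32 x 8-bit adds with AVX2. */
    void add_i8(const unsigned char *a, const unsigned char *b, unsigned char *c)
    {
        __m256i va = _mm256_loadu_si256((const __m256i *)a);
        __m256i vb = _mm256_loadu_si256((const __m256i *)b);
        _mm256_storeu_si256((__m256i *)c, _mm256_add_epi8(va, vb));
    }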

Related

Diff between bit and byte, and exact meaning of byte

This is just a basic theoretical question. I read that a bit is a 0 or a 1, and a byte consists of 8 bits, so with 8 bits we can store 2^8 numbers.
Similarly, with 10 bits we can store 2^10 (1024) numbers. But then why do we say that 1024 is 1 kilobyte? That's actually 10 bits, which is just 1.25 bytes to be exact.
Please share some knowledge on this, with a concrete explanation.
A bit is the smallest unit of storage in a system, and 8 bits make up 1 byte.
A bit, short for binary digit, is the smallest unit of measurement used in computers for information storage. A bit is represented by a 1 or a 0 with the value true or false, also known as on or off. A single byte of information, also known as an octet, is made up of eight bits. The size, or amount of information stored, distinguishes a bit from a byte.
A kilobit is nominally 1,000 bits, but it is often taken to be 1,024 bits because common operating systems and storage schemes organise storage in powers of two. Most people, however, think of kilo as referring to 1,000. A kilobyte, then, would be 1,000 bytes (or 1,024 bytes in the binary convention). Note also that 2^10 = 1,024 is the number of distinct values 10 bits can represent; a kilobyte is 1,024 (or 1,000) bytes of storage, which is a different quantity.
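As a small, purely illustrative C sketch of the arithmetic being discussed: n bits can encode 2^n distinct values, which is a different thing from n (or 2^n) bytes of storage.

    #include <stdio.h>

    int main(void)
    {
        /* n bits encode 2^n distinct values. */
        printf("8 bits  -> %u values\n", 1u << 8);    /* 256  */
        printf("10 bits -> %u values\n", 1u << 10);   /* 1024 */

        /* 1024 is the number of values 10 bits can take; as a storage size,
         * 1024 bytes is one kibibyte (often loosely called 1 KB). */
        printf("1 KiB = %u bytes, 1 kB = %u bytes\n", 1u << 10, 1000u);
        return 0;
    }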

Gigabyte or Gibibyte (1000 or 1024)?

This may be a duplicate, and I apologise if so, but I really want a definitive answer, as it seems to change depending on where I look.
Is it acceptable to say that a gigabyte is 1024 megabytes, or should it be said that it is 1000 megabytes? I am taking computer science at GCSE, and a typical exam question could be how many bytes are in a kilobyte; I believe the exam board, AQA, gives the answer to such a question as 1024, not 1000. How is this? Are both correct? Which one should I go with?
Thanks in advance, this has got me rather bamboozled!
The sad fact is that it depends on who you ask. But computer terminology is slowly being aligned with normal terminology, in which kilo is 10^3 (1,000), mega is 10^6 (1,000,000), and giga is 10^9 (1,000,000,000).
This is reflected in the International System of Quantities and the International Electrotechnical Commission, which define gigabyte as 10^9 bytes and use gibibyte for the computer-specific 1024 x 1024 x 1024 value.
The reason it "depends who you ask" is that for many years, specifically in relation to "bytes" of storage, the prefixes kilo, mega, and giga meant 1024, 1024^2, and 1024^3. But that flies in the face of normal convention with regard to these prefixes. So again, computer terminology is being aligned with non-computer terminology.
The term gigabyte is commonly used to mean either 1000^3 bytes or 1024^3 bytes depending on the context. Disk manufacturers prefer the decimal term while memory manufacturers use the binary.
Decimal definition
1 GB = 1,000,000,000 bytes (= 1000^3 B = 10^9 B)
Based on powers of 10, this definition uses the prefix as defined in the International System of Units (SI). This is the recommended definition by the International Electrotechnical Commission (IEC). This definition is used in networking contexts and most storage media, particularly hard drives, flash-based storage, and DVDs, and is also consistent with the other uses of the SI prefix in computing, such as CPU clock speeds or measures of performance.
Binary definition
1 GiB = 1,073,741,824 bytes (= 1024^3 B = 2^30 B).
The binary definition uses powers of the base 2, as is the architectural principle of binary computers. This usage is widely promulgated by some operating systems, such as Microsoft Windows in reference to computer memory (e.g., RAM). This definition is synonymous with the unambiguous unit gibibyte.
The difference between units based on decimal and binary prefixes increases as a semi-logarithmic (linear-log) function—for example, the decimal kilobyte value is nearly 98% of the kibibyte, a megabyte is under 96% of a mebibyte, and a gigabyte is just over 93% of a gibibyte value. This means that a 300 GB (279 GiB) hard disk might be indicated variously as 300 GB, 279 GB or 279 GiB, depending on the operating system.
The Wikipedia article https://en.wikipedia.org/wiki/Gigabyte has a good write-up of the confusion surrounding the usage of the term.
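As a rough, illustrative C sketch of the decimal-vs-binary gap described above (the 300 GB disk figure is the one used in the answer):

    #include <stdio.h>

    int main(void)
    {
        const double GB  = 1e9;                       /* 10^9 bytes */
        const double GiB = 1024.0 * 1024.0 * 1024.0;  /* 2^30 bytes */

        double disk_bytes = 300.0 * GB;               /* "300 GB" on the box */
        printf("300 GB = %.1f GiB\n", disk_bytes / GiB);        /* ~279.4 */

        /* Ratio of the decimal prefix to the binary one at each scale. */
        printf("kB/KiB = %.4f  MB/MiB = %.4f  GB/GiB = %.4f\n",
               1e3 / 1024.0, 1e6 / (1024.0 * 1024.0), 1e9 / GiB);
        return 0;
    }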

Maximum memory size a system can support

Suppose that I have a computer with an address register of size 16 bits (MAR, for example). The smallest addressable unit in this computer is a word and each word is of size 2 bytes. What is the maximum memory size (in bytes) this system can support?
I thought it would be 2^16 = 65536 bytes, but the part about the smallest addressable unit implies that this is not the way to solve it.
Thanks in advance
There is no direct correlation between the maximum amount of memory a system can support and the size of its address registers.
16-bit computers 30 years ago could very well support more than 64 kilobytes. On the other hand, modern 64-bit processors typically only have address lines for 52 bits (or fewer), but even so a typical computer cannot come close to supporting 2^52 bytes of memory.
Typical 64-bit computers today could in theory address 16 exbibytes, but present-day CPUs only support 4 petabytes of physical and 256 terabytes of per-process virtual memory. Typical desktop mainboards support 128 GiB at most, if you buy extra-expensive DIMMs. With affordable DIMMs, you're limited to about half as much (there are only so many slots).
Operating systems typically allow for main memory sizes in the hundreds of gigabytes only (e.g. 512 GiB for Windows 8 Enterprise/Professional and 128 GiB otherwise, or as little as 16 GiB for Windows 7 Home Premium).
Generally the smallest addressable unit is one byte; if that were the case here, the answer would be 2^16 * 1 = 65536 bytes, as you calculated. However, because on this system each address refers to a two-byte word, it is actually 2^16 * 2 = 131072 bytes.
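A short, illustrative C sketch of that calculation (the 16-bit address register and 2-byte word come from the question):

    #include <stdio.h>

    int main(void)
    {
        unsigned address_bits   = 16;  /* width of the MAR in the question   */
        unsigned bytes_per_word = 2;   /* smallest addressable unit = 1 word */

        unsigned long addresses = 1ul << address_bits;         /* 65536  */
        unsigned long max_bytes = addresses * bytes_per_word;  /* 131072 */

        printf("%lu addresses x %u bytes = %lu bytes (%lu KiB)\n",
               addresses, bytes_per_word, max_bytes, max_bytes / 1024);
        return 0;
    }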

Efficient way to create a bit mask from multiple numbers possibly using SSE/SSE2/SSE3/SSE4 instructions

Suppose I have 16 ASCII characters (hence sixteen 8-bit numbers) in a 128-bit variable/register. I want to create a bit mask in which the bits whose positions (indexes) are given by those 16 characters are set high.
For example, if the string formed from those 16 characters is "CAD...", then in the bit mask the 67th bit, 65th bit, 68th bit and so on should be 1, and the rest of the bits should be 0. What is an efficient way to do this, especially using SIMD instructions?
I know that one technique is addition, like this: 2^(67-1) + 2^(65-1) + 2^(68-1) + ...
But this will require a large number of operations. I want to do it in one/two operations/instructions if possible.
Please let me know a solution.
SSE4.2 contains one instruction that does almost what you want: PCMPISTRM with immediate operand 0. One of its operands should contain your ASCII characters, the other a constant vector with values like 32, 33, ..., 47. You get the result in the 16 least significant bits of XMM0. Since you need 128 bits, the instruction has to be executed 8 times with different constant vectors (6 times if you need only printable ASCII characters). After each PCMPISTRM, move the 16-bit result into its proper position and use bitwise OR to accumulate the full mask in some XMM register.
There are two disadvantages to this method: (1) you need to read Intel's Architectures Software Developer's Manual to understand PCMPISTRM's details, because it is probably the most complicated SSE instruction ever, and (2) the instruction is pretty slow (a throughput of one instruction per 2 cycles on Nehalem, per 3 on Sandy Bridge, per 4 on Bulldozer), so you'll hardly get any significant speed improvement over the 'brute force' method.
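Below is a hedged, untested C intrinsics sketch of this approach (function and variable names are invented for illustration, and it is not tuned). It handles only the printable ASCII range, i.e. six PCMPISTRM invocations, and it places each 16-bit result into its own 16-bit lane of the output mask rather than ORing everything into the same bits.

    /* Requires SSE4.2; compile with e.g. gcc -O2 -msse4.2. */
    #include <nmmintrin.h>   /* SSE4.2 string intrinsics */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* chars: 16 ASCII characters with no interior NUL bytes (PCMPISTRM uses
     * implicit, NUL-terminated lengths). out_mask: 16 bytes receiving the
     * 128-bit mask; bit c is set if byte value c occurs in chars. */
    static void ascii_to_bitmask(const char chars[16], uint8_t out_mask[16])
    {
        __m128i str = _mm_loadu_si128((const __m128i *)chars);
        uint16_t pieces[8] = {0};

        /* Only k = 2..7 (values 32..127, printable ASCII): candidate vectors
         * for lower values would contain a 0 byte, which the implicit-length
         * instruction treats as a terminator. */
        for (int k = 2; k < 8; k++) {
            uint8_t cand[16];
            for (int j = 0; j < 16; j++)
                cand[j] = (uint8_t)(16 * k + j);
            __m128i candidates = _mm_loadu_si128((const __m128i *)cand);

            /* imm8 = 0: unsigned bytes, "equal any" aggregation, bit-mask
             * result in the low 16 bits of the returned register. Bit j is
             * set if candidates[j] appears anywhere in str. */
            __m128i res = _mm_cmpistrm(str, candidates,
                                       _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
                                       _SIDD_BIT_MASK);
            pieces[k] = (uint16_t)_mm_cvtsi128_si32(res);
        }
        memcpy(out_mask, pieces, 16);  /* piece k covers bits 16k .. 16k+15 */
    }

    int main(void)
    {
        uint8_t mask[16];
        ascii_to_bitmask("CAD5CAD5CAD5CAD5", mask);
        /* 'A' = 65, 'C' = 67, 'D' = 68 all land in byte 8 of the mask,
         * so this should print 0x1a if the sketch is right. */
        printf("byte 8 = 0x%02x\n", mask[8]);
        return 0;
    }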

Block dimensions in CUDA

I have an NVIDIA GTX 570 (compute capability 2.0) running CUDA 4.0.
The deviceQuery executable in the CUDA SDK gives me information on my CUDA device and its various properties. Two of the lines in the output are
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Why is the third dimension of the block restricted to at most 64 threads, whereas the X and Y dimensions can go up to 1024 threads?
EDIT2: Also, please take this with a grain of salt; this is a purely hypothetical answer, or a guess. There may indeed be a clear hardware-based reason why 64 is the maximum. Frankly I don't know, and my answer is based on the assumption that there is no such hardware limit, per se.
It's probably a combination of three things: first, there is a limit to the number of threads which can be resident inside a block; second, block dimensions are typically in multiples of 32, and even more often in powers of 2 greater than 32; third, coordinate systems used in the solution of multi-dimensional problems are most often oriented so that you're looking at the scene directly (i.e., with the important bits more distributed in X and Y than in Z).
CUDA naturally has to support 1D access, as this is an immensely common and efficient access pattern when it is applicable. To support this, the X dimension must be allowed to vary over the entire range of 1024 threads.
To support 2D access, which is less common, CUDA should minimally support up to 512 in the X dimension (using the convention that the X dimension should be oriented in the coordinate system so that it measures the biggest spread) and 32 in the Y dimension. It must support up to 1024 in the X dimension, and I suppose they relax the requirement that the X dimension be no smaller than the Y dimension and allow the full 1024 range of Y values. However, in my understanding, 32 would have been plenty big for the Y dimension maximum.
To support 3D access, maintaining X, Y >= Z and trying to reach 1024 threads, it seems that in the best case X = Y = Z = 10; so there's no real argument for allowing Z to be greater than 10, given my assumptions.
In summary, I don't see why they couldn't have made the maximums (1024, 32, 10). My question is why make them (1024, 1024, 64)? The only answer I keep coming back to is to allow some flexibility to programmers to violate the X>=Y>=Z coordinate system convention.
Edit: given my summary and hypothetical answer, the real answer to your question is this: it's an arbitrary decision.
My wild guess is that it's because threadIdx.x, threadIdx.y and threadIdx.z are kept in a single special 32-bit register, possibly along with some other data. Maybe the warp id? Or maybe a multiprocessor-local block id to identify which block a given thread belongs to, if a given multiprocessor runs more than one?
This is purely speculative, I have no data to support it, but I would imagine that they want to have as few special registers as possible.
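For what it's worth, the limits that deviceQuery prints can also be read programmatically. A minimal host-side sketch using the CUDA runtime API (cudaGetDeviceProperties), compiled with nvcc, might look like this:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        struct cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
            fprintf(stderr, "no CUDA device found\n");
            return 1;
        }
        /* The same values deviceQuery reports as "Maximum number of threads
         * per block" and "Maximum sizes of each dimension of a block". */
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("Max block dimensions : %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1],
               prop.maxThreadsDim[2]);
        return 0;
    }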
