Parallel programming using Haswell architecture [closed] - sse

I want to learn about parallel programming on Intel's Haswell CPU microarchitecture, in particular using SIMD (SSE4.2, AVX2) from asm/C/C++ or any other language.
Can you recommend books, tutorials, internet resources, or courses?
Thanks!

It sounds to me like you need to learn about parallel programming in general on the CPU. I started looking into this about 10 months ago before I ever used SSE, OpenMP, or intrinsics so let me give a brief summary of some important concepts I have learned and some useful resources.
There are several parallel computing technologies that can be employed: MIMD, SIMD, instruction level parallelism, multi-level caches, and FMA. With Haswell there is also computing on the IGP.
I recommend picking a topic like matrix multiplication or the Mandelbrot set. They can both benefit from all these technologies.
MIMD
By MIMD I am referring to computing using multiple physical cores. I recommend OpenMP for this. Go through this tutorial
http://bisqwit.iki.fi/story/howto/openmp/#Abstract
and then use this as a reference: https://computing.llnl.gov/tutorials/openMP/. Two of the most common problems with MIMD are race conditions and false sharing. Follow the OpenMP tag on SO regularly.
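To make that concrete, here is a minimal OpenMP sketch of my own (not from the linked tutorials), assuming GCC and compiling with something like gcc -O3 -fopenmp. The reduction clause is what avoids the race condition on the shared sum.

#include <stdio.h>

int main(void) {
    enum { N = 1000000 };
    static double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    double sum = 0.0;
    // reduction(+:sum) gives each thread a private partial sum,
    // avoiding a race condition on the shared variable
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i] * y[i];

    printf("sum = %f\n", sum);
    return 0;
}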
SIMD
Many compilers can do auto-vectorization so I would look into that. MSVC's auto-vectorization is quite primitive but GCC's is really good.
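As a rough illustration (my own sketch, not taken from any compiler documentation), a simple unit-stride loop with restrict-qualified pointers is the kind of code GCC's auto-vectorizer handles well; building with something like gcc -O3 -march=native and asking for a vectorization report (e.g. -fopt-info-vec on recent GCC) shows whether it succeeded.

// a loop the auto-vectorizer handles well: independent iterations,
// unit-stride accesses, and restrict to rule out aliasing
void saxpy(float a, const float *restrict x, float *restrict y, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}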
Learn intrinsics. The best resource to know what an intrinsic does is http://software.intel.com/sites/landingpage/IntrinsicsGuide/
Another great resource is Agner Fog's vectorclass. 95% of the questions on SO on SSE/AVX can be answered by looking at the source code of the vectorclass. On top of that you could use the vectorclass for most SIMD and still get the full speed and skip intrinsics.
A lot of people use SIMD inefficiently. Read about Array of Structs (AoS), Struct of Arrays (SoA), and Array of Structs of Arrays (AoSoA). Also look into Intel strip mining and the question Calculating matrix product is much slower with SSE than with straight-forward-algorithm.
See Ingo Wald's PhD thesis for an interesting way to implement SIMD in ray tracing. I used this same idea for the Mandelbrot set to calculate 4 (8) pixels at once using SSE (AVX).
Also read this paper "Extending a C-like Language for Portable SIMD Programming" by Wald http://www.cdl.uni-saarland.de/papers/leissa_vecimp_tr.pdf to get a better idea of how to use SIMD.
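As a small sketch of what intrinsics code looks like (names and layout are mine, not from the linked resources), here is elementwise addition of two float arrays, eight elements at a time with AVX, plus a scalar tail loop:

#include <immintrin.h>

// add two float arrays with AVX intrinsics; n need not be a multiple of 8
void add_arrays(const float *x, const float *y, float *z, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);   // load 8 floats
        __m256 vy = _mm256_loadu_ps(y + i);
        _mm256_storeu_ps(z + i, _mm256_add_ps(vx, vy));
    }
    for (; i < n; i++)                        // scalar tail
        z[i] = x[i] + y[i];
}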
FMA
FMA3 is new as of Haswell. It's so new that there is not much discussion of it on SO yet. But this answer (to my question) is good:
How to use Fused Multiply-Add (FMA) instructions with SSE/AVX. FMA3 doubles the peak FLOPS so potentially matrix multiplication is twice as fast on Haswell compared to Ivy Bridge.
According to this answer, the most important aspect of FMA is not that it's one instruction instead of two for a multiplication and an addition; it's the "(virtually) infinite precision of the intermediate result." For example, implementing double-double multiplication without FMA takes six multiplications and several additions, whereas with FMA it's only two operations.
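A minimal sketch of my own (assuming AVX2+FMA3 and compiling with something like -mfma) of a dot product built around _mm256_fmadd_ps; for brevity it assumes n is a multiple of 8 and does a naive horizontal sum at the end.

#include <immintrin.h>

// dot product using FMA: one fused multiply-add per 8-float vector
// (a real kernel would use several accumulators to hide FMA latency,
//  as discussed under instruction level parallelism below)
float dot_fma(const float *x, const float *y, int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8)
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(x + i),
                              _mm256_loadu_ps(y + i), acc);
    float tmp[8];
    _mm256_storeu_ps(tmp, acc);               // spill the 8 partial sums
    float sum = 0.0f;
    for (int i = 0; i < 8; i++) sum += tmp[i]; // horizontal sum
    return sum;
}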
Instruction level parallelism
Haswell has 8 ports to which it can send μ-ops (though not every port can take the same micro-op; see this AnandTech review). This means Haswell can do, for example, two 256-bit loads, one 256-bit store, two 256-bit FMA operations, one scalar addition, and a conditional jump at the same time (six μ-ops per clock cycle).
For the most part you don't have to worry about this since it's done by the CPU. However, there are cases where your code can limit the potential instruction level parallelism. The most common is a loop-carried dependency. The following code has a loop-carried dependency:
for(int i=0; i<n; i++) {
    sum += x[i]*y[i];
}
The way to fix this is to unroll the loop and accumulate partial sums:
// two independent accumulators break the dependency chain (assumes n is even)
for(int i=0; i<n; i+=2) {
    sum1 += x[i]*y[i];
    sum2 += x[i+1]*y[i+1];
}
sum = sum1 + sum2;
Multi-level caches
Haswell has up to four levels of caches. Writing your code to optimally take advantage of the cache is by far the most difficult challenge in my opinion. It's the topic I still struggle the most with and feel the most ignorant about but in many cases improving cache usage gives better performance than any of the other technologies. I don't have many recommendations for this.
You need to learn about sets and cache lines (and the critical stride) and, on NUMA systems, about pages. To learn a little about sets and the critical stride see Agner Fog's http://www.agner.org/optimize/optimizing_cpp.pdf and this question: Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
Another very useful topic for the cache is loop blocking or tiling. See my answer (the one with the highest votes) at What is the fastest way to transpose a matrix in C++? for an example.
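As a hedged sketch of loop blocking (the tile size of 32 is just an illustrative value you would tune for your cache), here is a cache-blocked out-of-place transpose in plain C:

// cache-blocked transpose: work on BLOCK x BLOCK tiles so that both the
// rows of A and the columns of B stay resident in cache while they are used
enum { BLOCK = 32 };   // illustrative tile size; tune for your cache

void transpose_blocked(const float *A, float *B, int n) {
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int jj = 0; jj < n; jj += BLOCK)
            for (int i = ii; i < ii + BLOCK && i < n; i++)
                for (int j = jj; j < jj + BLOCK && j < n; j++)
                    B[j*n + i] = A[i*n + j];
}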
Computing on the IGP (with Iris Pro).
All Haswell consumer processors (Haswell-E is not out yet) have an IGP. The IGP uses anywhere from 30% of the silicon to over 50%. That's enough for at least 2 more x86 cores. This is wasted computing potential for most programmers. The only way to program the IGP is with OpenCL. Intel does not have OpenCL Iris Pro drivers for Linux, so you can only do this on Windows (I'm not sure how good Apple's implementation is). See Programming Intel IGP (e.g. Iris Pro 5200) hardware without OpenCL.
One advantage of the Iris Pro compared to Nvidia and AMD is that its double-precision floating point is only one quarter the speed of single precision (however, fp64 is only enabled in Direct Compute and not with OpenCL). NVIDIA and AMD have recently crippled double-precision floating point so much that GPGPU double-precision computing is not very effective on their consumer cards.

Related

Are there any problems for which SIMD outperforms Cray-style vectors?

CPUs intended to provide high-performance number crunching end up with some kind of vector instruction set. There are basically two kinds:
SIMD. This is conceptually straightforward, e.g. instead of just having a set of 64-bit registers and operations thereon, you have a second set of 128-bit registers and you can operate on a short vector of two 64-bit values at the same time. It becomes complicated in the implementation because you also want to have the option of operating on four 32-bit values, and then a new CPU generation provides 256-bit vectors which requires a whole new set of instructions etc.
The older Cray-style vector instructions, where the vectors start off large e.g. 4096 bits, but the number of elements operated on simultaneously is transparent, and the number of elements you want to use in a given operation is an instruction parameter. The idea is that you bite off a little more complexity upfront, in order to avoid creeping complexity later.
It has been argued that option 2 is better, and the arguments seem to make sense, e.g. https://www.sigarch.org/simd-instructions-considered-harmful/
At least at first glance, it looks like option 2 can do everything option 1 can, more easily and generally better.
Are there any workloads where the reverse is true? Where SIMD instructions can do things Cray-style vectors cannot, or can do something faster or with less code?
The "traditional" vector approaches (Cray, CDC/ETA, NEC, etc) arose in an era (~1976 to ~1992) with limited transistor budgets and commercially available low-latency SRAM main memories. In this technology regime, processors did not have the transistor budget to implement the full scoreboarding and interlocking for out-of-order operations that is currently available to allow pipelining of multi-cycle floating-point operations. Instead, a vector instruction set was created. Vector arithmetic instructions guaranteed that successive operations within the vector were independent and could be pipelined. It was relatively easy to extend the hardware to allow multiple vector operations in parallel, since the dependency checking only needed to be done "per vector" instead of "per element".
The Cray ISA was RISC-like in that data was loaded from memory into vector registers, arithmetic was performed register-to-register, then results were stored from vector registers back to memory. The maximum vector length was initially 64 elements, later 128 elements.
The CDC/ETA systems used a "memory-to-memory" architecture, with arithmetic instructions specifying memory locations for all inputs and outputs, along with a vector length of 1 to 65535 elements.
None of the "traditional" vector machines used data caches for vector operations, so performance was limited by the rate at which data could be loaded from memory. The SRAM main memories were a major fraction of the cost of the systems. In the early 1990's SRAM cost/bit was only about 2x that of DRAM, but DRAM prices dropped so rapidly that by 2002 SRAM price/MiB was 75x that of DRAM -- no longer even remotely acceptable.
The SRAM memories of the traditional machines were word-addressable (64-bit words) and were very heavily banked to allow nearly full speed for linear, strided (as long as powers of two were avoided), and random accesses. This led to a programming style that made extensive use of non-unit-stride memory access patterns. These access patterns cause performance problems on cached machines, and over time developers using cached systems quit using them -- so codes were less able to exploit this capability of the vector systems.
As codes were being re-written to use cached systems, it slowly became clear that caches work quite well for the majority of the applications that had been running on the vector machines. Re-use of cached data decreased the amount of memory bandwidth required, so applications ran much better on the microprocessor-based systems than expected from the main memory bandwidth ratios.
By the late 1990's, the market for traditional vector machines was nearly gone, with workloads transitioned primarily to shared-memory machines using RISC processors and multi-level cache hierarchies. A few government-subsidized vector systems were developed (especially in Japan), but these had little impact on high performance computing, and none on computing in general.
The story is not over -- after many not-very-successful tries (by several vendors) at getting vectors and caches to work well together, NEC has developed a very interesting system (NEC SX-Aurora Tsubasa) that combines a multicore vector register processor design with DRAM (HBM) main memory, and an effective shared cache. I especially like the ability to generate over 300 GB/s of memory bandwidth using a single thread of execution -- this is 10x-25x the bandwidth available with a single thread with AMD or Intel processors.
So the answer is that the low cost of microprocessors with cached memory drove vector machines out of the marketplace even before SIMD was included. SIMD had clear advantages for certain specialized operations, and has become more general over time -- albeit with diminishing benefits as the SIMD width is increased. The vector approach is not dead in an architectural sense (e.g., the NEC Vector Engine), but its advantages are generally considered to be overwhelmed by the disadvantages of software incompatibility with the dominant architectural model.
Cray-style vectors are great for pure-vertical problems, the kind of problem that some people think SIMD is limited to. They make your code forward compatible with future CPUs with wider vectors.
I've never worked with Cray-style vectors, so I don't know how much scope there might be for getting them to do horizontal shuffles.
If you don't limit things to Cray specifically, modern instruction-sets like ARM SVE and RISC-V extension V also give you forward-compatible code with variable vector width, and are clearly designed to avoid that problem of short-fixed-vector SIMD ISAs like AVX2 and AVX-512, and ARM NEON.
I think they have some shuffling capability. Definitely masking, but I'm not familiar enough with them to know if they can do stuff like left-pack (AVX2 what is the most efficient way to pack left based on a mask?) or prefix-sum (parallel prefix (cumulative) sum with SSE).
And then there are problems where you're working with a small fixed amount of data at a time, but more than fits in an integer register. For example How to convert a binary integer number to a hex string? although that's still basically doing the same stuff to every element after some initial broadcasting.
But other stuff like Most insanely fastest way to convert 9 char digits into an int or unsigned int where a one-off custom shuffle and horizontal pairwise multiply can get just the right work done with a few single-uop instructions is something that requires tight integration between SIMD and integer parts of the core (as on x86 CPUs) for maximum performance. Using the SIMD part for what it's good at, then getting the low two 32-bit elements of a vector into an integer register for the rest of the work. Part of the Cray model is (I think) a looser coupling to the CPU pipeline; that would defeat use-cases like that. Although some 32-bit ARM CPUs with NEON have the same loose coupling where mov from vector to integer is slow.
Parsing text in general, and atoi, is one use-case where short vectors with shuffle capabilities are effective. e.g. https://www.phoronix.com/scan.php?page=article&item=simdjson-avx-512&num=1 - 25% to 40% speedup from AVX-512 with simdjson 2.0 for parsing JSON, over the already-fast performance of AVX2 SIMD. (See How to implement atoi using SIMD? for a Q&A about using SIMD for JSON back in 2016).
Many of those tricks depend on x86-specific pmovmskb eax, xmm0 for getting an integer bitmap of a vector compare result. You can test whether it's all-zero or all-ones (cmp eax, 0xffff) to stay in the main loop of a memcmp or memchr, for example. And if not, then bsf eax,eax finds the position of the first difference, possibly after a not.
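To make that concrete, here is a hedged C sketch of my own (SSE2 intrinsics, with the GCC/Clang __builtin_ctz as a stand-in for bsf/tzcnt) of a memchr-style search that relies on the 16-bit movemask fitting in an integer register; it assumes the length is a multiple of 16 to keep it short.

#include <emmintrin.h>
#include <stddef.h>

// compare 16 bytes at a time, use the movemask bitmap to stay in the fast
// loop, then bit-scan to find the first matching byte on a hit
const char *find_byte(const char *s, size_t len, char c) {
    __m128i needle = _mm_set1_epi8(c);
    for (size_t i = 0; i < len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask != 0)                           // at least one match in this chunk
            return s + i + __builtin_ctz(mask);  // index of first match (like bsf)
    }
    return NULL;
}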
Having vector width limited to a number of elements that can fit in an integer register is key to this, although you could imagine an instruction-set with compare-into-mask with scalable width mask registers. (Perhaps ARM SVE is already like that? I'm not sure.)

Evaluating high level algorithm fitness to an embedded platform [closed]

What process would you use to evaluate whether a high-level algorithm (mainly computer vision algorithms, written in Matlab, Python, etc.) can run in real time on an embedded CPU?
The idea is to have a reliable assessment/calculation at an early stage, when you cannot yet implement or profile it on the target hardware.
To put things in focus, let's assume that your input is a grayscale QVGA frame, 8 bpp @ 30 fps, and you have to perform full Canny edge detection on each and every input frame. How can we find or estimate the minimum processing power needed to do this successfully?
A generic assessment isn't quite possible and what you request is tedious manual work.
There are, however, a few generic steps you could follow to arrive at a rough idea:
Estimate the run-time complexity of your algorithm in terms of basic math operations like additions and multiplications (best/average/worst case? your choice). Do you need floating point support? Also track higher-level math operations like saturating add/subtract (why? see point 3).
Devour the ISA of the target processor and focus especially on the math and branching instructions. How many cycles does a multiplication take? Or does your processor dispatch several per cycle?
See if your processor supports features like:
Saturating math. ARM Cortex-M4 does. PIC18 micro-controller does not, incurring additional execution overhead.
Hardware floating point operations.
Branch prediction.
SIMD. This will provide a significant speed boost if your algorithm can be tailored to it.
Since you explicitly asked for a CPU, see if yours has a GPU attached. Image processing algorithms generally benefit from the presence of one.
Map your operations (from step 1) to what the target processor supports (in step 3) to arrive at an estimate.
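As a back-of-the-envelope illustration for the QVGA Canny example in the question (my own sketch; the 100 operations/pixel figure is purely an assumed placeholder that you would replace with your own count from step 1):

#include <stdio.h>

// very rough order-of-magnitude estimate, ignoring SIMD, cache and branch effects
int main(void) {
    const double pixels_per_frame = 320.0 * 240.0;  // QVGA
    const double fps              = 30.0;
    const double ops_per_pixel    = 100.0;  // assumed: blur + gradients + NMS + hysteresis
    double ops_per_second = pixels_per_frame * fps * ops_per_pixel;
    printf("~%.0f million ops/s needed\n", ops_per_second / 1e6); // ~230 for these assumptions
    return 0;
}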
Other factors (out of a zillion others) that you need to take into account:
Do you plan to run an OS on the target, or is it bare metal?
Is your algorithm bound by IO bottlenecks?
If your processor has a cache, how efficient is your algorithm at utilizing it?

Too slow or out of memory problems in Machine Learning/Data Mining [closed]

EDIT: An attempt to rephrase myself:
Tools like R and Weka are feature-rich but slow and limited in the size of data they can work with. Tools like Mahout and Vowpal Wabbit (VW) with its extension AllReduce are targeted at 1K-node clusters, and they are limited in their capabilities. VW, for example, can only "learn" by minimizing some loss function.
What I haven't seen in any popular software is the use of parallel programming (good ol' pthreads, MPI, etc.) for speeding things up. I suppose it is useful for the kinds of problems where a cluster may be overkill, but waiting for the program to finish while the other processor cores sit idle is just too bad. [As an example, one can get a 26-core machine or an 88-core cluster at AWS, and a good parallel algorithm could give a speedup of, say, 20x or 60x without resorting to heavyweight Hadoop-like systems.]
What I meant to learn from the community: (subjectively) your real-life problems/algorithms that are not TOO large to be called big data, but still big enough that you feel faster would have been better; (objectively) your experiences along the lines of "algorithm X on data with characteristics C and size D took T time, and I had M processor cores that I could have thrown at it, if a parallel version of algorithm X had been available".
And the reason I ask, of course, is to learn the need for parallel programming in this field, and perhaps have a community driven effort to address it. My experiences with few problems are detailed below.
What are some of the problems in machine learning, data mining and related fields that you have difficulties with because they are too slow or need excessively large memory?
As a hobby research project we built an out-of-core programming model to handle data larger than system memory and it natively supports parallel/distributed execution. It showed good performance on some problems (see below) and we wish to expand this technology (hopefully community-driven) for the real life problems.
Some benchmarks (against Weka, Mahout and R):
a) Apriori Algorithm for frequent itemset mining [CPU-bound but average memory]
Webdocs dataset with 1.7M transactions over 5.2M unique items (1.4GB). The algorithm finds sets of items that appear frequently in transactions. For 10% support, Weka3 could not complete in 3 days. Our version completed in 4hr 24 min (although to be fair, we used Tries instead of hashtables as in Weka). More importantly though, on one 8-core machine it took 39min, on 8 machines -> 6min 30sec (=40x)
b) SlopeOne recommendation engine [High memory usage]
MovieLens dataset with 10M ratings from 70K users for 10K movies. SlopeOne recommends new movies based on collaborative filtering. Apache Mahout's "Taste" non-distributed recommender would fail with less than 6GB of memory. To benchmark the out-of-core performance, we restricted our version to 1/10th of this limit (600MB) and it completed with 11% overhead (due to out-of-core operation) in execution time.
c) Dimensionality Reduction with Principal Component Analysis (PCA) [Both CPU and Memory bound]
Mutants "p53" protein dataset of 32K samples with 5400 attributes each (1.1GB). PCA is used to reduce the dimension of dataset by dropping variables with very small variances. Although our version could process data larger than system virtual memory, we benchmarked this dataset since the R software can process it. R completed the job in 86 min. Our out-of-core version had no additional overhead; in fact, it completed in 67min on single-core and 14 min on 8-core machine.
The excellent software available today either works for data in the megabytes range by loading it into memory (R, Weka, NumPy) or in the tera/petabyte range for data centers (Mahout, SPSS, SAS). There seems to be a gap in the gigabytes range -- larger than virtual memory but less than "big data" -- although projects like NumPy's Blaze, R's bigmemory, ScaLAPACK, etc. are addressing this need.
From your experiences, can you relate examples where such a faster and out-of-core tool can benefit the data mining/machine learning community?
This is really an open ended question. From what I can tell you are asking two things:
Using machine learning on big, big datasets.
Would a faster tool benefit the community?
For the first question, one of the best tools that is used in many production environments with big, big data is Vowpal Wabbit (VW). Head over to hunch.net to take a look.
For your second question, if you can beat VW then that would absolutely benefit the community. However, VW is pretty good :)

Use Digital Signal Processors to accelerate calculations in the same fashion as GPUs

I read that several DSP cards that process audio can calculate Fourier transforms and some other functions involved in sound processing very quickly. There are some scientific problems (not many), like quantum mechanics, that involve Fourier transform calculations. I wonder if DSPs could be used to accelerate calculations in this fashion, like GPUs do in some other cases, and if you know of successful examples.
Thanks
Any linear operations are easier and faster to do on DSP chips. Their architecture allows you to perform a linear operation (take two numbers, multiply each of them by a constant and add the results) in a single clock cycle. This is one of the reasons FFT can be calculated so quickly on a DSP chip. This is also a reason many other linear operations can be accelerated with their use. I guess I have three main points to make concerning performance and code optimization for such processors.
1) Perhaps less relevant, but I'd like to mention it nonetheless. In order to take full advantage of a DSP processor's architecture, you have to code in assembly. I'm pretty sure that regular C code will not be fully optimized by the compiler to do what you want. You literally have to specify each register, etc. It does pay off, however. In the same way, you are able to make use of circular buffers and other DSP-specific features. Circular buffers are also very useful for calculating the FFT and FFT-based (circular) convolution.
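As a hedged illustration of the point about multiply-accumulates and circular buffers (plain C of my own, not optimized DSP assembly), here is one step of a FIR filter: each tap is a multiply-accumulate that a DSP would issue as a single MAC instruction, and the modulo indexing models the circular addressing a DSP does in hardware.

#define TAPS 16

// delay[] is the circular delay line, coeff[] the filter taps,
// *pos the current write position; returns one filtered output sample
float fir_step(float delay[TAPS], const float coeff[TAPS], int *pos, float in) {
    delay[*pos] = in;                     // insert newest sample
    float acc = 0.0f;
    int idx = *pos;
    for (int k = 0; k < TAPS; k++) {
        acc += coeff[k] * delay[idx];     // multiply-accumulate (one MAC per tap)
        idx = (idx + 1) % TAPS;           // circular wrap, newest to oldest
    }
    *pos = (*pos + TAPS - 1) % TAPS;      // move write position backwards
    return acc;
}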
2) FFT can be found in solutions to many problems, such as heat flow (Fourier himself actually came up with the solution back in the 1800s), analysis of mechanical oscillations (or any linear oscillators for that matter, including oscillators in quantum physics), analysis of brain waves (EEG), seismic activity, planetary motion and many other things. Any mathematical problem that involves convolution can be easily solved via the Fourier transform, analog or discrete.
3) For some of the applications listed above, including audio processing, transforms other than the FFT are constantly being invented, discovered, and applied, such as the Mel-cepstrum (e.g. MPEG codecs), the wavelet transform (e.g. JPEG 2000 codecs), the discrete cosine transform (e.g. JPEG codecs) and many others. In quantum physics, however, the Fourier transform is inherent in the equation of angular momentum. It arises naturally, not just for the purposes of analysis or ease of calculation. For this reason, I would not necessarily put the reasons to use the Fourier transform in audio processing and in quantum mechanics into the same category. For signal processing, it's a tool; for quantum physics, it's in the nature of the phenomenon.
Before GPUs and SIMD instruction sets in mainstream CPUs, this was the only way to get performance for some applications. In the late 20th century I worked for a company that made PCI cards to place extra processors in a PCI slot. Some of these were DSP cards using a TI C64x DSP; others were PowerPC cards to provide AltiVec. The processor on the cards would typically have no operating system, to give more predictable real-time scheduling than the host. Customers would buy an industrial PC with a large PCI backplane and attach multiple cards. We also made cards in form factors such as PMC, CompactPCI, and VME for more rugged environments.
People would develop code to run on these cards, and host applications which communicated with the add-in card over the PCI bus. These weren't easy platforms to develop for, and the modern libraries for GPU computing are much easier.
Nowadays this is much less common. The price/performance ratio is so much better for general purpose CPUs and GPUs, and DSPs for scientific computing are vanishing. Current DSP manufacturers tend to target low-power embedded applications or cost-sensitive high-volume devices like digital cameras. Compare GPUFFTW with these Analog Devices benchmarks: the DSP peaks at 3.2 GFLOPS, while the Nvidia 8800 reaches 29 GFLOPS.

How do qubits work and what are their pros and cons? What impact will they have on programming languages? [closed]

What can we do more with qubits than normal bits, and how do they work? I read about them some time ago, and it appears that qubits can store not just 0 or 1, but also 0 and 1 at the same time. I don't really understand how they work. Can someone please explain this to me?
What are their pros and cons, and what impact will they have on programming languages like C after quantum computers are actually invented?
How would we manage memory when a bit (which is also a quantum object) can take multiple values at once? How can we determine if something is true or false when there is more than just 1 and 0?
Any "classical" (as it will be called once the technology is in wider use) problem which is solved by "classical" code can be solved using some sort of quantum processor by transforming the problem. For example, to do a database search, instead of using an index-based search/binary search, or a linear search for an unsorted database, you can use Grover's algorithm. Also, to take a step back from the previous poster's mention of BQP problems, problems with a classical "solution" that runs in NP-time can be sped up considerably by Grover's algorithm (a speedup in the time to search through every possible solution). RSA cryptography is also made much more insecure by the advent of Shor's algorithm, since it makes factorising large numbers into their prime factors (the hinge upon which RSA sits) solvable in logarithmic time.
EDIT: Shor's algorithm actually runs in O((log N)^3), which is polynomial-over-logarithmic time.
The conclusion of this sort of thing is that pre-existing programming languages like C will not be able to be used on a quantum computer due to the nature of quantum algorithms (applying certain functions to quantum states), unless someone invents a way to map quantum gates and so forth to logical gates (EDIT: this has apparently been mostly addressed here), in which case about all we get is a very, very fast logical processor when using languages like C.
PS: I'm sure there'll be OpenGL bindings for quantum computing eventually :P
If we can make a working quantum computer (still an open question) then it can efficiently solve certain algorithmic problems that (we think) a classical computer cannot efficiently solve. These are the problems in the complexity class BQP but not in P. One big one is integer factorization. As Will A mentioned, if you can factor enormous integers quickly, you can break a lot of modern ciphers.
The catch is that nobody knows for sure if BQP is actually "bigger" than P — it might be that anything a quantum computer can do quickly, so can a classical computer.
We also don't know if BQP is as big as NP — for instance, nobody has found an efficient way to solve the Traveling Salesman Problem on a quantum computer. This is a common misconception about quantum computers. They might be able to do NP-complete problems quickly, and then again they might not. Nobody knows.
http://scottaaronson.com/blog/?p=208 be good readin' on this topic (as is the rest of the blog).
Regarding what can be solved with quantum computers: a quantum computer would break current asymmetric encryption schemes. It is a common misconception that quantum computers can solve most optimization problems. They cannot. See
this article for more details on what can and cannot be solved using quantum computers.
A qubit doesn't simply store 0 and 1 at the same time; its state is a superposition of 0 and 1.
A normal bit represents either 0 or 1 at any given moment, so three normal bits store exactly one of
000, 001, 010, ..., 111. Three qubits, however, can be in a superposition of all of those states at once, so the state of n qubits is described by 2^n amplitudes.
Think of a qubit as the spin of an electron, which behaves like a tiny dipole: the continuous amplitudes of that spin state are what quantum information processing manipulates, and that is where the future of the field lies.

Resources