Most common page replacement algorithm used in CPUs

What is the most common page replacement algorithm used in CPUs today? Is there a de facto algorithm that major CPU designers like Intel and ARM use? A lot of the text I'm finding describes the various algorithms but not how popular they are.


If a computer can be Turing complete with one instruction what is the purpose of having many instructions?

I understand the concept of a computer being Turing complete (having a MOV or SUBNEG command and therefore being able to "synthesize" other instructions). If that is true, what is the purpose of having hundreds of instructions like x86 has, for example? Is it to increase efficiency?
Yes.
Equally, any logical circuit can be made using just NANDs. But that doesn't make other components redundant. Crafting a CPU from NAND gates would be monumentally inefficient, even if that CPU performed only one instruction.
An OS or application has a similar level of complexity to a CPU.
You COULD compile it so it just used a single instruction. But you would just end up with the world's most bloated OS.
So, when designing a CPU's instruction set, the choice is a tradeoff between reducing the CPU's size and expense, which allows more instructions per second because each instruction is simpler, and a smaller die is easier to cool (RISC); and increasing the capabilities of the CPU, including instructions that take multiple clock cycles to complete, but making it larger and harder to cool (CISC).
This tradeoff is why math co-processors were a thing back in the 486 days. Floating point math could be emulated without the instructions. But it was much, much faster if it had a co-processor designed to do the heavy lifting on those floating point things.
Remember that a Turing Machine is generally understood to be an abstract concept, not a physical thing. It's the theoretical minimal form a computer can take that can still compute anything. Theoretically. Heavy emphasis on theoretically.
An actual Turing machine that did something so simple as decode an MP3 would be outrageously complicated. Programming it would be an utter nightmare as the machine is so insanely limited that even adding two 64-bit numbers together and recording the result in a third location would require an enormous amount of "tape" and a whole heap of "instructions".
When we say something is "Turing Complete" we mean that it can perform generic computation. It's a pretty low bar in all honesty, crazy things like the Game of Life and even CSS have been shown to be Turing Complete. That doesn't mean it's a good idea to program for them, or take them seriously as a computational platform.
In the early days of computing people would have to type in machine codes by hand. Adding two numbers together and storing the result is often one or two operations at most. Doing it in a Turing machine would require thousands. The complexity makes it utterly impractical on the most basic level.
As a challenge try and write a simple 4-bit adder. Then if you've successfully tackled that, write a 4-bit multiplier. The complexity ramps up exponentially once you move to things like 32 or 64-bit values, and when you try and tackle division or floating point values you're quickly going to drown in the outrageousness of it all.
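To give a taste of that challenge, here is a minimal sketch (my own illustration, not from any particular textbook) of a 4-bit ripple-carry adder in C++, built only from the bitwise operations a single-bit full adder needs; each XOR/AND/OR could itself be reduced further to NAND gates:

```cpp
#include <cstdint>

// One-bit full adder: combines two input bits and a carry-in,
// producing a sum bit and a carry-out.
inline uint8_t full_adder(uint8_t a, uint8_t b, uint8_t cin, uint8_t& cout) {
    uint8_t sum = a ^ b ^ cin;
    cout = (a & b) | (cin & (a ^ b));
    return sum;
}

// 4-bit ripple-carry adder: chain four full adders, feeding each
// carry-out into the next stage's carry-in.
uint8_t add4(uint8_t x, uint8_t y) {
    uint8_t carry = 0, result = 0;
    for (int i = 0; i < 4; ++i) {
        uint8_t bit = full_adder((x >> i) & 1, (y >> i) & 1, carry, carry);
        result |= bit << i;
    }
    return result & 0xF; // 4-bit result; the final carry (overflow) is discarded
}
```

Even this tiny circuit already needs five gates per bit; a multiplier built the same way grows far faster, which is exactly the point being made above.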
You don't tell the CPU which transistors to flip when you're typing in machine code, the instructions act as macros to do that for you, but when you're writing Turing Machine code it's up to you to command it how to flip each and every single bit.
If you want to learn more about CPU history and design there's a wealth of information out there, and you can even implement your own using transistor logic or an FPGA kit where you can write it out using a higher level design language like Verilog.
The Intel 4004 chip was intended for a calculator so the operation codes were largely geared towards that. The subsequent 8008 built on that, and by the time the 8086 rolled around the instruction set had taken on that familiar x86 flavor, albeit a 16-bit version of same.
There's an abstraction spectrum here between defining the behaviour of individual bits (Turing Machine) and some kind of hypothetical CPU with an instruction for every occasion. RISC and CISC designs from the 1980s and 1990s differed in their philosophy here, where RISC generally had fewer instructions, CISC having more, but those differences have largely been erased as RISC gained more features and CISC became more RISC-like for the sake of simplicity.
The Turing Machine is the "absolute zero" in terms of CPU design. If you can come up with something simpler or more reductive you'd probably win a prize.

Parallel programming using Haswell architecture [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 9 years ago.
I want to learn about parallel programming using Intel's Haswell CPU microarchitecture.
Specifically, about using SIMD (SSE4.2, AVX2) in asm, C, C++, or any other languages.
Can you recommend books, tutorials, internet resources, courses?
Thanks!
It sounds to me like you need to learn about parallel programming in general on the CPU. I started looking into this about 10 months ago before I ever used SSE, OpenMP, or intrinsics so let me give a brief summary of some important concepts I have learned and some useful resources.
There are several parallel computing technologies that can be employed: MIMD, SIMD, instruction-level parallelism, multi-level caches, and FMA. With Haswell there is also computing on the IGP.
I recommend picking a topic like matrix multiplication or the Mandelbrot set. They can both benefit from all these technologies.
MIMD
By MIMD I am referring to computing using multiple physical cores. I recommend OpenMP for this. Go through this tutorial
http://bisqwit.iki.fi/story/howto/openmp/#Abstract
and then use this as a reference https://computing.llnl.gov/tutorials/openMP/. Two of the most common problems with MIMD are race conditions and false sharing. Follow OpenMP on SO regularly.
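As a minimal illustration of the race-condition point, here is a hedged sketch of a dot product using OpenMP's reduction clause (compile with -fopenmp; without the flag the pragma is simply ignored and the loop runs serially). Without the reduction clause, every thread would update `sum` concurrently, which is exactly the race condition mentioned above:

```cpp
#include <vector>

// Parallel dot product: the reduction(+:sum) clause gives each thread a
// private partial sum and combines them at the end, avoiding a data race
// on `sum`.
double dot(const std::vector<double>& x, const std::vector<double>& y) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < (int)x.size(); ++i)
        sum += x[i] * y[i];
    return sum;
}
```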
SIMD
Many compilers can do auto-vectorization so I would look into that. MSVC's auto-vectorization is quite primitive but GCC's is really good.
Learn intrinsics. The best resource for finding out what an intrinsic does is http://software.intel.com/sites/landingpage/IntrinsicsGuide/
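For a flavor of what intrinsics look like, here is a minimal SSE sketch (x86 only) that adds four floats at once. `_mm_loadu_ps`, `_mm_add_ps`, and `_mm_storeu_ps` are real SSE intrinsics from `<immintrin.h>`; the unaligned load/store variants are used so the arrays need no special alignment:

```cpp
#include <immintrin.h>

// Add four packed single-precision floats in one SIMD instruction.
void add4f(const float* a, const float* b, float* out) {
    __m128 va = _mm_loadu_ps(a);            // load 4 floats from a
    __m128 vb = _mm_loadu_ps(b);            // load 4 floats from b
    _mm_storeu_ps(out, _mm_add_ps(va, vb)); // out[i] = a[i] + b[i], i = 0..3
}
```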
Another great resource is Agner Fog's vectorclass. 95% of the questions on SO on SSE/AVX can be answered by looking at the source code of the vectorclass. On top of that you could use the vectorclass for most SIMD and still get the full speed and skip intrinsics.
A lot of people use SIMD inefficiently. Read about Array of Structs (AoS), Struct of Arrays (SoA), and Array of Structs of Arrays (AoSoA). Also look into Intel strip mining: Calculating matrix product is much slower with SSE than with straight-forward-algorithm
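To make the AoS/SoA distinction concrete, here is a small sketch (my own illustration). With AoS the x components sit 12 bytes apart, so a vector load gathers unwanted fields; with SoA they are contiguous and map directly onto SIMD registers:

```cpp
#include <vector>
#include <numeric>

struct PointAoS { float x, y, z; };               // array of structs: PointAoS pts[N]
struct PointsSoA { std::vector<float> x, y, z; }; // struct of arrays

// Summing the x components over the SoA layout reads one contiguous float
// array, which auto-vectorizers handle easily; the same loop over AoS data
// would be strided and much harder to vectorize.
float sum_x(const PointsSoA& p) {
    return std::accumulate(p.x.begin(), p.x.end(), 0.0f);
}
```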
See Ingo Wald's PhD thesis for an interesting way to implement SIMD in ray tracing. I used this same idea for the Mandelbrot set to calculate 4(8) pixels at once using SSE(AVX).
Also read this paper "Extending a C-like Language for Portable SIMD Programming" by Wald http://www.cdl.uni-saarland.de/papers/leissa_vecimp_tr.pdf to get a better idea how to use SIMD.
FMA
FMA3 is new since Haswell. It's so new that there is not much discussion on it on SO yet. But this answer (to my question) is good
How to use Fused Multiply-Add (FMA) instructions with SSE/AVX. FMA3 doubles the peak FLOPS so potentially matrix multiplication is twice as fast on Haswell compared to Ivy Bridge.
According to this answer, the most important aspect of FMA is not the fact that it's one instruction instead of two to do a multiplication and an addition, it's the "(virtually) infinite precision of the intermediate result." For example, implementing double-double multiplication without FMA takes 6 multiplications and several additions, whereas with FMA it's only two operations.
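A minimal sketch of that point using the standard `std::fma`, which rounds only once: the residual `fma(a, b, -hi)` recovers the exact rounding error of the product, the classic two_prod building block of double-double arithmetic. Without FMA, this error term requires Dekker's algorithm with its 6 multiplications and several additions:

```cpp
#include <cmath>

// two_prod: split the exact product a*b into hi + lo, where hi is the
// rounded double product and lo is the exact rounding error.
void two_prod(double a, double b, double& hi, double& lo) {
    hi = a * b;
    lo = std::fma(a, b, -hi); // exact: a*b - hi, computed with one rounding
}
```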
Instruction level parallelism
Haswell has 8 ports to which it can send μ-ops (though not every port can take the same micro-op; see this AnandTech review). This means Haswell can do, for example, two 256-bit loads, one 256-bit store, two 256-bit FMA operations, one scalar addition, and a conditional jump at the same time (six μ-ops per clock cycle).
For the most part you don't have to worry about this since it's done by the CPU. However, there are cases where your code can limit the potential instruction level parallelism. The most common is a loop carried dependency. The following code has a loop carried dependency
for (int i = 0; i < n; i++) {
    sum += x[i]*y[i];
}
The way to fix this is to unroll the loop and do partial sums
for (int i = 0; i < n; i += 2) {
    sum1 += x[i]*y[i];
    sum2 += x[i+1]*y[i+1];
}
sum = sum1 + sum2;
Multi-level Caches:
Haswell has up to four levels of caches. Writing your code to optimally take advantage of the cache is by far the most difficult challenge, in my opinion. It's the topic I still struggle with the most and feel most ignorant about, but in many cases improving cache usage gives better performance than any of the other technologies. I don't have many recommendations for this.
You need to learn about sets and cache lines (and the critical stride) and, on NUMA systems, about pages. To learn a little about sets and the critical stride, see Agner Fog's http://www.agner.org/optimize/optimizing_cpp.pdf and this question: Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
Another very useful topic for the cache is loop blocking or tiling. See my answer (the one with the highest votes) at What is the fastest way to transpose a matrix in C++? for an example.
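A hedged sketch of loop blocking applied to a transpose (the tile size B = 16 is an arbitrary assumption, to be tuned per cache size). The idea is to work on B×B tiles so both the rows being read and the columns being written stay within recently touched cache lines:

```cpp
#include <algorithm>

constexpr int B = 16; // tile size: an assumption, tune for your cache

// Blocked transpose of an n x n matrix stored in row-major order.
void transpose_blocked(const float* src, float* dst, int n) {
    for (int ii = 0; ii < n; ii += B)
        for (int jj = 0; jj < n; jj += B)
            // Transpose one B x B tile (clipped at the matrix edge).
            for (int i = ii; i < std::min(ii + B, n); ++i)
                for (int j = jj; j < std::min(jj + B, n); ++j)
                    dst[j * n + i] = src[i * n + j];
}
```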
Computing on the IGP (with Iris Pro).
All Haswell consumer processors (Haswell-E is not out yet) have an IGP. The IGP uses from at least 30% of the silicon to over 50%. That's enough for at least 2 more x86 cores. This is wasted computing potential for most programmers. The only way to program the IGP is with OpenCL. Intel does not have OpenCL Iris Pro drivers for Linux, so you can only do this on Windows (I'm not sure how good Apple's implementation of this is). See: Programming Intel IGP (e.g. Iris Pro 5200) hardware without OpenCL.
One advantage of the Iris Pro compared to Nvidia and AMD is that double-precision floating point is only one quarter the speed of single precision on the Iris Pro (however, fp64 is only enabled in Direct Compute and not with OpenCL). Nvidia and AMD (recently) cripple double-precision floating point so much that it makes GPGPU double-precision computing not very effective on their consumer cards.

SIFT hardware accelerator for smartphones

I'm a fresh electronics engineering graduate and I have experience in computer vision. I want to ask if it's feasible to make a hardware accelerator for the SIFT algorithm - or any other OpenCV algorithms - to be used on smartphones instead of the current software implementation.
What are the advantages (much lower computation time, lower power, more complex applications becoming possible, ...) and the disadvantages (not better than the current software implementation, ...)?
Do you have an insight of that?
Thanks
You might be interested to check NEON optimizations - a SIMD instruction set for ARM cores, supported by architectures such as the Nvidia Tegra 3. Some OpenCV functions are NEON optimized.
Start by reading this nice article Realtime Computer Vision with OpenCV, it has performance comparisons about using NEON, etc.
I also recommend you to start here and here, you will find great insights.
OpenCV supports both CUDA and (experimentally) OpenCL.
There are specific optimizations for Nvidia's Tegra chipset used in a lot of phones/tablets. I don't know if any phones use OpenCL.

Use Digital Signal Processors to accelerate calculations in the same fashion as GPUs

I read that several DSP cards that process audio can calculate Fourier transforms and some other functions involved in sound processing very quickly. There are some scientific problems (not many), like quantum mechanics, that involve Fourier transform calculations. I wonder if DSPs could be used to accelerate calculations in this fashion, like GPUs do in some other cases, and whether you know of successful examples.
Thanks
Any linear operations are easier and faster to do on DSP chips. Their architecture allows you to perform a linear operation (take two numbers, multiply each of them by a constant and add the results) in a single clock cycle. This is one of the reasons FFT can be calculated so quickly on a DSP chip. This is also a reason many other linear operations can be accelerated with their use. I guess I have three main points to make concerning performance and code optimization for such processors.
1) Perhaps less relevant, but I'd like to mention it nonetheless. In order to take full advantage of a DSP's architecture, you have to code in assembly. I'm pretty sure that regular C code will not be fully optimized by the compiler to do what you want; you literally have to specify each register, etc. It does pay off, however. In the same way, you are able to make use of circular buffers and other DSP-specific features. Circular buffers are also very useful for calculating the FFT and FFT-based (circular) convolution.
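To make the multiply-accumulate point concrete, here is a sketch in plain C++ of the pattern a DSP executes in one cycle per tap: an FIR filter over a circular delay line. On a real DSP the modulo addressing below would be done by hardware circular-buffer support; this is only an illustration of the access pattern:

```cpp
#include <vector>
#include <cstddef>

// Compute one FIR output sample: one multiply-accumulate (MAC) per tap.
// `delay` is a circular delay line; `head` indexes the newest sample.
float fir_sample(const std::vector<float>& coeff,
                 const std::vector<float>& delay,
                 std::size_t head) {
    float acc = 0.0f;
    const std::size_t n = delay.size();
    for (std::size_t k = 0; k < coeff.size(); ++k)
        acc += coeff[k] * delay[(head + n - k) % n]; // one MAC per tap
    return acc;
}
```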
2) FFT can be found in solutions to many problems, such as heat flow (Fourier himself actually came up with the solution back in the 1800s), analysis of mechanical oscillations (or any linear oscillators for that matter, including oscillators in quantum physics), analysis of brain waves (EEG), seismic activity, planetary motion and many other things. Any mathematical problem that involves convolution can be easily solved via the Fourier transform, analog or discrete.
3) For some of the applications listed above, including audio processing, transforms other than the FFT are constantly being invented, discovered, and applied to processing, such as the Mel-cepstrum (e.g. MPEG codecs), the wavelet transform (e.g. JPEG2000 codecs), the discrete cosine transform (e.g. JPEG codecs) and many others. In quantum physics, however, the Fourier transform is inherent in the equation of angular momentum. It arises naturally, not just for the purposes of analysis or ease of calculation. For this reason, I would not necessarily put the reasons to use the Fourier transform in audio processing and in quantum mechanics into the same category. For signal processing, it's a tool; for quantum physics, it's in the nature of the phenomenon.
Before GPUs and SIMD instruction sets in mainstream CPUs, this was the only way to get performance for some applications. In the late 20th century I worked for a company that made PCI cards to place extra processors in a PCI slot. Some of these were DSP cards using a TI C64x DSP; others were PowerPC cards to provide AltiVec. The processor on the cards would typically have no operating system, to give more predictable real-time scheduling than the host. Customers would buy an industrial PC with a large PCI backplane and attach multiple cards. We would also make cards in form factors such as PMC, CompactPCI, and VME for more rugged environments.
People would develop code to run on these cards, and host applications which communicated with the add-in card over the PCI bus. These weren't easy platforms to develop for, and the modern libraries for GPU computing are much easier.
Nowadays this is much less common. The price/performance ratio is so much better for general-purpose CPUs and GPUs, and DSPs for scientific computing are vanishing. Current DSP manufacturers tend to target lower-power embedded applications or cost-sensitive high-volume devices like digital cameras. Compare GPUFFTW with these Analog Devices benchmarks. The DSP peaks at 3.2GFlops, and the Nvidia 8800 reaches 29GFlops.

OpenCL compliant DSP

On the Khronos website, OpenCL is said to be open to DSPs. But when I look at the websites of DSP-making companies, like Texas Instruments, Freescale, NXP or Analog Devices, I can't find any mention of OpenCL.
So does anyone know if an OpenCL-compliant DSP exists?
Edit: As this question seems surprising, I add the reason why I asked it. From the khronos.org page:
"OpenCL 1.0 at a glance
OpenCL (Open Computing Language) is the first open, royalty-free standard for general-purpose parallel programming of heterogeneous systems. OpenCL provides a uniform programming environment for software developers to write efficient, portable code for high-performance compute servers, desktop computer systems and handheld devices using a diverse mix of multi-core CPUs, GPUs, Cell-type architectures and other parallel processors such as DSPs"
So I think it would be interesting to know if it's true, if DSPs, which are particularly suited for some complex calculations, can really be programmed using OpenCL.
The OpenCL spec seems to support using a chip that has one or more programmable GPU shader cores as an expensive DSP. It does not appear that the spec makes many allowances for DSP chips that were not designed to support being used as a programmable GPU shader in a graphics pipeline.
I finally found one: the SNU-Samsung OpenCL Framework is able to use Texas Instruments C64x DSPs. More info here:
http://aces.snu.ac.kr/Center_for_Manycore_Programming/SNU-SAMSUNG_OpenCL_Framework.html