How to understand the speedup in optimization report from icc compiler? - vectorization

environment is:
icc version 19.0.0.117 (gcc version 5.4.0 compatibility)
Intel parallel studio XE cluster edition 2019
Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Ubuntu 16.04
compiler flags are:
-std=gnu11 -Wall -xHost -xCORE-AVX2 -O2 -fma -qopenmp -qopenmp-simd -qopt-report=5 -qopt-report-phase=all
I use OpenMP SIMD or Intel pragmas to vectorize my loops to gain speedup. In the optimization report generated by icc, I usually see results like the following:
LOOP BEGIN at get_forces.c(3668,3)
remark #15389: vectorization support: reference mon->fricforce[n1][d] has unaligned access [ get_forces.c(3669,4) ]
remark #15389: vectorization support: reference mon->vel[n1][d] has unaligned access [ get_forces.c(3669,36) ]
remark #15389: vectorization support: reference vel[n1][d] has unaligned access [ get_forces.c(3669,51) ]
remark #15389: vectorization support: reference mon->drag[n1][d] has unaligned access [ get_forces.c(3671,4) ]
remark #15389: vectorization support: reference mon->vel[n1][d] has unaligned access [ get_forces.c(3671,40) ]
remark #15389: vectorization support: reference vel[n1][d] has unaligned access [ get_forces.c(3671,57) ]
remark #15381: vectorization support: unaligned access used inside loop body
remark #15305: vectorization support: vector length 2
remark #15309: vectorization support: normalized vectorization overhead 0.773
remark #15300: LOOP WAS VECTORIZED
remark #15450: unmasked unaligned unit stride loads: 3
remark #15451: unmasked unaligned unit stride stores: 2
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 21
remark #15477: vector cost: 11.000
remark #15478: estimated potential speedup: 1.050
remark #15488: --- end vector cost summary ---
remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
remark #25015: Estimate of max trip count of loop=1
LOOP END
My question is:
I do not understand how the speedup is calculated from
normalized vectorization overhead 0.773
scalar cost: 21
vector cost: 11.000
Another, more extreme and puzzling case is:
LOOP BEGIN at get_forces.c(2690,8)
<Distributed chunk3>
remark #15388: vectorization support: reference q12[j] has aligned access [ get_forces.c(2694,19) ]
remark #15388: vectorization support: reference q12[j] has aligned access [ get_forces.c(2694,26) ]
remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
remark #15305: vectorization support: vector length 2
remark #15309: vectorization support: normalized vectorization overhead 1.857
remark #15448: unmasked aligned unit stride loads: 1
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 7
remark #15477: vector cost: 3.500
remark #15478: estimated potential speedup: 0.770
remark #15488: --- end vector cost summary ---
remark #25436: completely unrolled by 3
LOOP END
Now, 3.5 + 1.857 = 5.357 < 7.
So could I still apply SIMD to this loop and gain a speedup, or should I trust the estimated speedup of 0.770 in the report and not vectorize it?

The “scalar cost” is the cost of one iteration of the scalar loop.
The “vector cost” is the cost of one iteration of the vectorized loop divided by vector_length * unroll_factor, i.e. the cost of the vectorized equivalent of one scalar iteration.
The “vectorization overhead” is the cost of the vector initializations/finalizations before/after the loop, normalized by the cost of a vector iteration.
The “estimated potential speedup” is calculated for the whole loop execution. It shows the potential gain of vectorized execution, normalized by the scalar iteration cost, including the peel loop, the remainder loop, and the main loop for the estimated loop trip count. It cannot be derived explicitly from the scalar and vector costs shown above.
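If you decide the estimate is too pessimistic for the second loop (remark #15335), the report's own suggestion can be applied per loop instead of globally with -vec-threshold0. A minimal, hypothetical sketch; the real loop body at get_forces.c(2690,8) is not shown in the report, so the function, the loop body, and the length argument below are placeholders:
/* Hypothetical stand-in for the loop at get_forces.c(2690,8). */
void scale_q12(double *restrict q12, int n, double scale)
{
    /* icc-specific: override the "vectorization possible but seems
       inefficient" heuristic (remark #15335) for this loop only. */
    #pragma vector always
    for (int j = 0; j < n; j++)
        q12[j] *= scale;
}
Since -qopenmp-simd is already on the command line, #pragma omp simd in front of the loop has a similar effect of forcing vectorization regardless of the cost model. Whether the forced vector version actually beats the scalar one is best settled by measuring, since the static cost model does not know the real trip count.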

Related

Neural net fails to generalize a simple bitwise AND

After taking a bunch of online courses and reading many papers, I started playing with neural nets, but to my surprise they fail to generalize a simple bitwise AND operation.
Inputs:
Inp#1 - randomly generated number between 0-15, scaled down to (0,1)
Inp#2 - 16 bit randomly generated unsigned int scaled down to (0,1)
# Code snippet
int in1 = (int)rand()%16;
int in2 = (int)rand()%(0x0010000);
in[0] = (fann_type)(in1/100.0); // not to worry about float roundup
in[1] = (fann_type)(in2/100000.0); // not to worry about float roundup
Outputs:
Out#1 = -1 if the bit of inp#2 at the index given by inp#1 is 0, otherwise 1
# Code snippet
int out1 = (in2 & (1<<in1)) ? 1 : -1;
out[0] = (fann_type)out1;
Network: tried many different variations, below is example
A. 1 hidden layer with 30 neurons,
Activation Function (hidden): sigmoid,
Activation Function (output): sigmoid_symmetric (tanh),
Training method: RPROP
Learning rate: 0.7 (default)
Momentum: 0.0 (default)
RPROP Increase factor: 1.2 (default)
RPROP Decrease factor: 0.5 (default)
RPROP Minimum Step-size: 0 (default)
RPROP Maximum Step-size: 50 (default)
B. 3 hidden layers each having 30 neurons, with the same params as in A
C. Tried the same networks with inputs scaled to (-1,1) and tanh also for the hidden layers.
Data Sets: 5000 samples for training, 5000 for testing and 5000 for validation. Tried even bigger datasets, no success
# examples from training set
0.040000 0.321600
-1
0.140000 0.625890
1
0.140000 0.039210
-1
0.010000 0.432830
1
0.100000 0.102220
1
Process: the network is trained on the training set while the MSE on the test data is monitored in parallel to avoid overfitting.
Libraries: used multiple, but mostly tried with fann and used fanntool for gui.
Any ideas? Can upload the datasets if any particular interest.
If I understand your setup, you are trying to do something like:
have a network of architecture 2-X-X-X-1 (where X is the number of hidden units), i.e. 2 inputs and one output
model a bitwise function over the inputs
If this is true, it is an extremely peculiar problem and a very bad choice of representation. Neural networks are not magic hats; they are a very big family of models. What you are trying to model has none of the characteristics expected of a function to be approximated by an NN: it is completely non-smooth in the input, it has lots of discontinuities, and it is essentially a bunch of if-else clauses.
What should you do? Express your inputs as bits: with 16 binary inputs per number (32 inputs in total), the network will learn your function without any problems. You encoded the inputs in a very specific manner (by taking their decimal representation) and expect your network to model the decomposition into binary and then the operation on top of it. An NN can learn this, but you might need quite a complex network to achieve such an operation; the whole reason is that you provided the network with a suboptimal representation and built a very simple network of a kind originally designed to approximate smooth functions.
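A minimal sketch of that encoding, reusing the variables from the question (fann_type comes from the FANN headers; 16 bits per number gives the 32 inputs mentioned above, even though in1 itself only needs 4 bits):
#include <stdlib.h>
#include <fann.h>   /* for fann_type */

/* Encode both numbers as raw bits instead of scaled decimal values. */
void encode_sample(fann_type in[32], fann_type out[1])
{
    int in1 = rand() % 16;        /* bit index, 0..15       */
    int in2 = rand() % 0x10000;   /* 16-bit unsigned value  */

    for (int b = 0; b < 16; b++) {
        in[b]      = (fann_type)((in1 >> b) & 1);   /* bits of in1 (bits 4..15 stay 0) */
        in[16 + b] = (fann_type)((in2 >> b) & 1);   /* bits of in2                     */
    }
    out[0] = (fann_type)((in2 & (1 << in1)) ? 1 : -1);   /* same target as before */
}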

about CUFFT input sizes

It's written that the cuFFT library provides algorithms that are highly optimized for input sizes that can be written in the following form: 2^a * 3^b * 5^c * 7^d.
How did they manage to do that?
As far as I know, the FFT gives its best performance only for 2^a input sizes.
This would mean that input sizes with prime factors larger than 7 run slower.
The Cooley-Tukey algorithm can operate on a variety of DFT lengths that can be expressed as N = N_1 * N_2. The algorithm recursively re-expresses a DFT of length N as N_1 smaller DFTs of length N_2.
As you note, the fastest case is generally the radix-2 factorization, which recursively breaks a DFT of length N into 2 smaller DFTs of length N/2, running in O(N log N).
However, the actual performance depends on the hardware and the implementation. For example, if we consider cuFFT with a thread warp size of 32, then DFT lengths that are a multiple of 32 would be convenient (note: just an example; I'm not aware of the actual optimizations that exist under the hood of cuFFT).
Short answer: the underlying code is optimized for any input size whose prime factors are all 7 or smaller, using the mixed-radix Cooley-Tukey algorithm.
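As a plain-C illustration (not part of the cuFFT API): an input size falls into that optimized 2^a * 3^b * 5^c * 7^d category exactly when dividing out the factors 2, 3, 5, and 7 leaves 1.
#include <stdbool.h>

/* True if n has no prime factor larger than 7,
   i.e. n = 2^a * 3^b * 5^c * 7^d for some a, b, c, d >= 0. */
bool uses_small_radices(unsigned long long n)
{
    const unsigned radices[] = {2, 3, 5, 7};
    if (n == 0)
        return false;
    for (int i = 0; i < 4; i++)
        while (n % radices[i] == 0)
            n /= radices[i];
    return n == 1;
}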
http://mathworld.wolfram.com/FastFourierTransform.html
https://en.wikipedia.org/wiki/Cooley-Tukey_FFT_algorithm

Difference between the AVX instructions vxorpd and vpxor

According to the Intel Intrinsics Guide,
vxorpd ymm, ymm, ymm: Compute the bitwise XOR of packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.
vpxor ymm, ymm, ymm: Compute the bitwise XOR of 256 bits (representing integer data) in a and b, and store the result in dst.
What is the difference between the two? It appears to me that both instructions would do a bitwise XOR on all 256 bits of the ymm registers. Is there any performance penalty if I use vxorpd for integer data (and vice versa)?
Combining some comments into an answer:
Other than performance, they have identical behaviour (I think even with a memory argument: same lack of alignment requirements for all AVX instructions).
On Nehalem to Broadwell, (V)PXOR can run on any of the 3 ALU execution ports, p0/p1/p5. (V)XORPS/D can only run on p5.
Some CPUs have a "bypass delay" between integer and FP "domains". Agner Fog's microarch docs say that on SnB / IvB, the bypass delay is sometimes zero. e.g. when using the "wrong" type of shuffle or boolean operation. On Haswell, his examples show that orps has no extra latency when used on the result of an integer instruction, but that por has an extra 1 clock of latency when used on the result of addps.
On Skylake, FP booleans can run on any port, but bypass delay depends on which port they happened to run on. (See Intel's optimization manual for a table). Port5 has no bypass delay between FP math ops, but port 0 or port 1 do. Since the FMA units are on port 0 and 1, the uop issue stage will usually assign booleans to port5 in FP heavy code, because it can see that lots of uops are queued up for p0/p1 but p5 is less busy. (How are x86 uops scheduled, exactly?).
I'd recommend not worrying about this. If you tune for Haswell, Skylake will do fine as well. Or just always use VPXOR on integer data and VXORPS on FP data; then Skylake will do fine (but Haswell might not).
On AMD Bulldozer / Piledriver / Steamroller there is no "FP" version of the boolean ops. (see pg. 182 of Agner Fog's microarch manual.) There's a delay for forwarding data between execution units (of 1 cycle for ivec->fp or fp->ivec, 10 cycles for int->ivec (eax -> xmm0), 8 cycles for ivec->int. (8,10 on bulldozer. 4, 5 on steamroller for movd/pinsrw/pextrw)) So anyway, you can't avoid the bypass delay on AMD by using the appropriate boolean insn. XORPS does take one less byte to encode than PXOR or XORPD (non-VEX version. VEX versions all take 4 bytes.)
In any case, bypass delays are just extra latency, not reduced throughput. If these ops aren't part of the longest dep chain in your inner loop, or if you can interleave two iterations in parallel (so you have multiple dependency chains going at once for out-of-order-execution), then PXOR may be the way to go.
On Intel CPUs before Skylake, packed-integer boolean instructions can always run on more ports than their floating-point counterparts, so prefer the integer versions.
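For reference, the corresponding intrinsics from immintrin.h (the wrapper names below are mine; each one-liner compiles to the instruction named in its comment):
#include <immintrin.h>

/* Bitwise XOR of 256 bits of integer data: VPXOR (AVX2 for ymm). */
__m256i xor_int(__m256i a, __m256i b)
{
    return _mm256_xor_si256(a, b);
}

/* Bitwise XOR of packed doubles, e.g. to flip sign bits: VXORPD (AVX). */
__m256d xor_fp(__m256d a, __m256d b)
{
    return _mm256_xor_pd(a, b);
}
Either way it is a single instruction; the point of the answer above is only which port/domain that instruction lands in.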

LS-SVM training : Out of memory

I am trying to train an LS-SVM classifier on a dataset of the following size:
Training dataset: TS = 48000x12 (double)
Groups: G = 48000x1 (double)
Matlab training code is:
class = svmtrain(TS,G,'method','LS',...
'kernel_function','rbf','boxconstraint',C,'rbf_sigma',sigma);
Then, I got this error message:
Error using svmtrain (line 516)
Error evaluating kernel function 'rbf_kernel'.
Caused by:
Error using repmat
Out of memory. Type HELP MEMORY for your options.
Note that the machine has 4 GB of physical memory, and training works when I decrease the training dataset size. Is there any solution that keeps the same data size and, of course, does not require adding physical memory?
It seems that the implementation requires computing the whole Gram matrix, which is N x N (where N is the number of samples); in your case that is 48,000^2 = 2,304,000,000 entries. Even if each entry were a 32-bit float (4 bytes), that would require 9,216,000,000 bytes, roughly 9 GB, just for the Gram (kernel) matrix; with MATLAB's default double precision it is about 18 GB.
There are two options:
Find an implementation that, for the RBF kernel, does not precompute the kernel (Gram) matrix but instead calls a routine to compute each kernel value on demand.
Try some kind of LS-SVM approximation, like the Fast Sparse Approximation of Least Squares Support Vector Machines: http://homes.cs.washington.edu/~lfb/software/FSALS-SVM.htm
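A quick back-of-the-envelope check of the memory estimate above (plain C, arithmetic only):
#include <stdio.h>

int main(void)
{
    const unsigned long long n = 48000ULL;              /* number of samples       */
    unsigned long long entries = n * n;                 /* Gram matrix entries     */
    double gb_single = entries * 4.0 / 1e9;             /* 32-bit float per entry  */
    double gb_double = entries * 8.0 / 1e9;             /* MATLAB double per entry */

    printf("entries: %llu\n", entries);                 /* 2304000000              */
    printf("single precision: %.1f GB\n", gb_single);   /* ~9.2 GB                 */
    printf("double precision: %.1f GB\n", gb_double);   /* ~18.4 GB                */
    return 0;
}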

Is there a machine learning algorithm which successfully learns the parity function?

The parity function takes a vector of n bits and outputs 1 if the sum of the bits is odd and 0 otherwise. This can be viewed as a classification task, where the n bits are the input features.
Is there any machine learning algorithm which would be able to learn this function? Clearly random decision forests would not succeed, since any strict subset of features has no predictive power. Also, I believe no neural network of a fixed depth would succeed, since computing the parity function is not in the complexity class AC0.
Polynomial SVMs can do this.
Encode zeros as 1 and ones as -1.
For n variables (bits), you need a polynomial kernel of degree n.
When the kernel is computed, it also implicitly computes the value x1 * x2 * ... * xn (where xi is the i-th input variable).
If the result is -1, you have an odd number of ones, otherwise you have an even number of ones.
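A plain-C illustration of why the ±1 encoding works (not an SVM implementation): the degree-n monomial x1 * x2 * ... * xn is -1 exactly when an odd number of the original bits are 1.
#include <stdio.h>

int main(void)
{
    /* bits 1,0,1,1 -> encoded as -1,+1,-1,-1 */
    int bits[] = {1, 0, 1, 1};
    int n = 4, prod = 1, sum = 0;

    for (int i = 0; i < n; i++) {
        int x = bits[i] ? -1 : 1;   /* encode 0 -> +1, 1 -> -1 */
        prod *= x;
        sum  += bits[i];
    }
    /* prod == -1 iff an odd number of the original bits are 1 */
    printf("product = %d, parity(sum=%d) = %s\n",
           prod, sum, (sum % 2) ? "odd" : "even");
    return 0;
}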
If I'm not mistaken, Neural Networks should also be able to compute it. As far as I remember, Neural Networks with 2 hidden layers and sigmoid units are able to learn any arbitrary function.
What about Gaussian process classification? You can train your model on the n-dimensional input vectors with the 1-dimensional parity bit as output. Then for any test input you can ask for a prediction. You can check this online book:
http://www.gaussianprocess.org/gpml/chapters/
Chapter 3 addresses the classification problem.
Neural Networks can represent and learn the parity function with a single hidden layer with the same number of neurons as inputs. The fact that the parity function is not in AC0 is a fact about circuits of boolean gates, but multi-layer perceptrons (as commonly used) can have real-valued weights, which makes all the difference.
An example of an explicit solution would be the following, assuming n inputs, n hidden units, and a sign activation function (described, for example, in [1]):
Set all weights in the first layer to 1. This means that the pre-activation before the addition of the bias is the same for all hidden units and equal to the number of 1s in the input.
Set the bias of the first hidden unit to -0.5, of the second hidden unit to -1.5, of the third to -2.5, and so on. This means that if there are no 1s in the input, the pre-activation after the addition of the bias is negative for all hidden units, and the activation function returns 0 for all of them. If there is a single 1 in the input, only the first hidden unit has a positive pre-activation, so exactly one hidden unit sends a 1 to the output. In general, if there are k 1s in the input, the first k hidden units send a 1 to the output and the rest send a 0.
Set the weights that connect the hidden units to the output to +1, -1, +1, -1, and so on. This means that if there are no 1s in the input, the output is 0. If there is a single 1 in the input, the output is +1. If there are two 1s in the input, the output is again +1-1=0, and so on.
That solves the parity problem.
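A minimal sketch of that construction in C, treating the sign activation as the 0/1 step used in the description above:
#include <stdio.h>

/* Explicit parity network: n inputs, n hidden units, 1 output.
   First-layer weights are all 1, hidden biases are -0.5, -1.5, -2.5, ...,
   and output weights alternate +1, -1, +1, -1, ... */
int parity_net(const int *x, int n)
{
    int output = 0;
    for (int h = 0; h < n; h++) {
        int pre = 0;                        /* pre-activation: sum of inputs  */
        for (int i = 0; i < n; i++)
            pre += x[i];                    /* all first-layer weights are 1  */
        double biased = pre - (h + 0.5);    /* bias of unit h is -(h + 0.5)   */
        int hidden = biased > 0 ? 1 : 0;    /* step activation                */
        output += (h % 2 == 0 ? 1 : -1) * hidden;   /* +1,-1,+1,-1 weights    */
    }
    return output;   /* 1 if the number of 1s in x is odd, 0 otherwise        */
}

int main(void)
{
    int x[] = {1, 0, 1, 1, 0};              /* three 1s -> parity 1           */
    printf("parity = %d\n", parity_net(x, 5));
    return 0;
}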
You were asking, however, about machine learning algorithms that can learn this function. According to the section "Parity" in [2], the answer is that, at least for small n, back-propagation on a network with a single hidden layer can learn the function and, in fact, it learns a network very similar to the one described above.
[1] Franco, Leonardo, and Sergio A. Cannas. "Generalization properties of modular networks: implementing the parity function." IEEE transactions on neural networks 12.6 (2001): 1306-1313.
[2] Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. Learning internal representations by error propagation. No. ICS-8506. California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
