Finite difference computation blowup from intel compiler 14, but not 12 - vectorization

I have a finite difference code for wave propagation. Because there are a lot of temporary mixed-derivative terms, I allocate one temporary memory buffer and split it into chunks, one per derivative term, for memory efficiency. The code looks like this:
Wrk = malloc(2*(4*nxe*(2*ne+1) + 15*nxe)*sizeof(float));   /* one scratch buffer for all terms */
In the computing function:
float *dudz = Wrk + NE;     /* chunk for du/dz */
float *dqdz = dudz + nxe;   /* chunk for dq/dz */
for (int i = ix0_1; i < ixh_1; i++)
    dudz[i] = hdzi*(u[i+nxe] - u[i-nxe]);
The problem is that the code runs fine with Intel compiler 12, but it blows up when compiled with Intel compiler 13 or 14.
Intel compilers 12, 13 and 14 all optimize the code above by vectorizing the loops. If I suppress that optimization under Intel compiler 13 and 14 by declaring
volatile float *dudz = Wrk + NE;
the code also runs fine, although slower.
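Two less drastic alternatives are sketched below, under the unverified assumption that the difference comes from the newer vectorizer assuming the chunks carved out of Wrk never overlap the arrays they are computed from; the wrapper names d_dz_novec and d_dz_restrict are made up for illustration.
// Option 1: keep the pointers as they are, but ask the Intel compiler not
// to vectorize this particular loop (same effect as the volatile workaround,
// but scoped to one loop).
void d_dz_novec(float *dudz, const float *u, float hdzi,
                int nxe, int ix0_1, int ixh_1)
{
#pragma novector
    for (int i = ix0_1; i < ixh_1; i++)
        dudz[i] = hdzi*(u[i+nxe] - u[i-nxe]);
}

// Option 2: if dudz and u are guaranteed never to overlap, state that
// explicitly and keep vectorization (__restrict is a compiler extension
// accepted by icc and gcc).
void d_dz_restrict(float *__restrict dudz, const float *__restrict u,
                   float hdzi, int nxe, int ix0_1, int ixh_1)
{
    for (int i = ix0_1; i < ixh_1; i++)
        dudz[i] = hdzi*(u[i+nxe] - u[i-nxe]);
}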
I would greatly appreciate it if any of you could give me some advice.
Thank you so much,
gqchen

Related

MS Edge: SCRIPT5022: Failed to link vertex and fragment shaders

I have run my simple tensorflow.js app on Chrome (Windows 10), Android, and iOS, and it is working. But when I try to run it on MS Edge (Windows 10) I get this error:
Failed to create D3D shaders.
index.ts (67,1)
SCRIPT5022: Failed to link vertex and fragment shaders.
The error occurs when I am trying to make a prediction (so the GPU is used):
function predict() {
    var cData = ctx.getImageData(0, 0, canvas.width, canvas.height);
    var cdata = cData.data;
    for (var i = 0; i < cdata.length; i += 4) { // to grayscale
        cdata[i] = (cdata[i] + cdata[i + 1] + cdata[i + 2]) / 3;
    }
    var x = tf.browser.fromPixels(cData, 1).asType('float32'); // keep only one channel
    x = tf.image.resizeNearestNeighbor(x, [28, 28]); // resize
    x = x.expandDims();
    x = x.div(255);
    var prediction;
    tf.tidy(() => {
        const output = model.predict(x);
        const axis = 1;
        prediction = Array.from(output.argMax(axis).dataSync());
        preds = output.arraySync();
    });
}
The printout on the console:
C:\fakepath(114,28-43): warning X3556: integer divides may be much slower, try using uints if possible.
C:\fakepath(115,29-36): warning X3556: integer divides may be much slower, try using uints if possible.
C:\fakepath(106,7-48): error X3531: can't unroll loops marked with loop attribute
C:\fakepath(114,28-43): warning X3556: integer divides may be much slower, try using uints if possible.
C:\fakepath(115,29-36): warning X3556: integer divides may be much slower, try using uints if possible.
C:\fakepath(126,2-29): warning X3550: array reference cannot be used as an l-value; not natively addressable, forcing loop to unroll
C:\fakepath(126,2-29): error X3500: array reference cannot be used as an l-value; not natively addressable
C:\fakepath(106,7-48): error X3511: forced to unroll loop, but unrolling failed.
C:\fakepath(104,7-48): error X3511: forced to unroll loop, but unrolling failed.
Warning: D3D shader compilation failed with default flags. (ps_5_0)
Retrying with skip validation
C:\fakepath(114,28-43): warning X3556: integer divides may be much slower, try using uints if possible.
C:\fakepath(115,29-36): warning X3556: integer divides may be much slower, try using uints if possible.
C:\fakepath(126,2-29): warning X3550: array reference cannot be used as an l-value; not natively addressable, forcing loop to unroll
C:\fakepath(126,2-29): error X3500: array reference cannot be used as an l-value; not natively addressable
C:\fakepath(106,7-48): error X3511: forced to unroll loop, but unrolling failed.
C:\fakepath(104,7-48): error X3511: forced to unroll loop, but unrolling failed.
Warning: D3D shader compilation failed with skip validation flags. (ps_5_0)
Retrying with skip optimization
C:\fakepath(114,28-43): warning X3556: integer divides may be much slower, try using uints if possible.
C:\fakepath(115,29-36): warning X3556: integer divides may be much slower, try using uints if possible.
C:\fakepath(126,2-29): warning X3550: array reference cannot be used as an l-value; not natively addressable, forcing loop to unroll
C:\fakepath(126,2-29): error X3500: array reference cannot be used as an l-value; not natively addressable
C:\fakepath(106,7-48): error X3511: forced to unroll loop, but unrolling failed.
C:\fakepath(104,7-48): error X3511: forced to unroll loop, but unrolling failed.
Warning: D3D shader compilation failed with skip optimization flags. (ps_5_0)
Failed to create D3D shaders.
webgl_util.ts (155,5)
SCRIPT5022: Failed to link vertex and fragment shaders.
Is it a problem with some browser setting? Does tensorflow.js support Edge? I would assume it does. tfjs 1.0 is used.
I upgraded face-api.min.js to version 0.22.2 and the error is gone.
Here is the source of the latest version:
https://github.com/justadudewhohacks/face-api.js/

OpenCL OutOfResources

I have an OpenCL Kernel that throws an OutOfResources exception when run.
Note: I am using Cloo for C#
I created a minimum working example of my problem and the kernel now looks like this:
__kernel void MinBug
(
    __global float * img,
    __global float * background,
    __global int * tau
)
{
    int neighbourhoodSize = tau[0];
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    for (int i = -neighbourhoodSize; i <= neighbourhoodSize; i++)
    {
        for (int j = -neighbourhoodSize; j <= neighbourhoodSize; j++)
        {
            //...
        }
    }
}
For my original program, this runs fine when tau is small (i.e. 2, 10, 15), but once tau gets to be around 27, it sometimes throws an exception. The minimum working example I created does not have this problem until tau gets near 300.
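(Back-of-the-envelope arithmetic, not from the original post: the nested loop body runs (2*tau+1)^2 times per work item, i.e. 55^2 = 3025 iterations at tau = 27 and 601^2 = 361201 at tau = 300, multiplied by the total number of work items, so the work per kernel launch grows very quickly with tau.)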
The specific error that I get in my C# program is
Cloo.OutOfResourcesComputeException: 'OpenCL error code detected:
OutOfResources.'
This always happens on the very next line after calling the Kernel.Execute() method.
What concept am I missing?
Thanks to Huseyin for his advice on installing the correct runtime.
I also needed to select the correct platform in the code.
On my computer I currently have three platforms: two of them seem to be associated with the CPU (Intel i7), and one with the GPU (NVIDIA GTX 660 Ti).
I tried explicitly running on the GPU, and it ran out of resources, as you can see from the error message above.
When I specified the CPU
CLCalc.InitCL(Cloo.ComputeDeviceTypes.Cpu, 1);
It ran much better. Who would have thought it: my CPU seems to have more grunt than the GPU here. Maybe that's a simplistic metric. It's also worth noting that my CPU supports a later version of OpenCL than the GPU.
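For reference, a minimal sketch of checking which OpenCL platforms and devices are present, written against the plain OpenCL C API rather than Cloo (array bounds and output format are illustrative only):
#include <CL/cl.h>
#include <cstdio>

// List every platform and device so you can see which index corresponds
// to the CPU and which to the GPU before initializing your framework.
int main() {
    cl_uint nplat = 0;
    clGetPlatformIDs(0, nullptr, &nplat);
    if (nplat > 8) nplat = 8;
    cl_platform_id plats[8];
    clGetPlatformIDs(nplat, plats, nullptr);

    for (cl_uint p = 0; p < nplat; ++p) {
        char pname[256] = {0};
        clGetPlatformInfo(plats[p], CL_PLATFORM_NAME, sizeof(pname), pname, nullptr);

        cl_uint ndev = 0;
        clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_ALL, 0, nullptr, &ndev);
        if (ndev > 8) ndev = 8;
        cl_device_id devs[8];
        clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_ALL, ndev, devs, nullptr);

        for (cl_uint d = 0; d < ndev; ++d) {
            char dname[256] = {0};
            clGetDeviceInfo(devs[d], CL_DEVICE_NAME, sizeof(dname), dname, nullptr);
            std::printf("platform %u (%s), device %u: %s\n", p, pname, d, dname);
        }
    }
    return 0;
}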

Copy complex numbers from host to device using ArrayFire

I am trying to copy complex number array from host to device using ArrayFire framework:
std::complex<float> hostArray[131072];
array deviceArray (131072, hostArray);
But it gives compilation error due to data type incompatibilities. What am I doing wrong?
I can copy the real and imaginary parts separately to the device in order to create the complex numbers in GPU memory, but that is costly, and I also don't know how to construct a complex number from two numbers in the ArrayFire framework.
I would be grateful if someone can help me with ArrayFire framework in this matter.
ArrayFire uses cuComplex (cfloat in ArrayFire 2.0RC) to store complex numbers. cuComplex is internally defined as a float2, which is a struct with two elements.
std::complex should have the same structure, so you might be able to perform a reinterpret_cast to change the type of the variable without moving the data to a different data structure. On my machine (Linux Mint with g++ 4.7.1) I was able to create an ArrayFire array from std::complex data using the following code:
int count = 10;
std::complex<float> host_complex[count];
for (int i = 0; i < count; i++) {
    host_complex[i] = std::complex<float>(i, i*2);  // real = i, imag = 2*i
}
array af_complex(count, reinterpret_cast<cuComplex*>(host_complex));
print(af_complex);
Output:
af_complex =
0.0000 + 0.0000i
1.0000 + 2.0000i
2.0000 + 4.0000i
3.0000 + 6.0000i
4.0000 + 8.0000i
5.0000 + 10.0000i
6.0000 + 12.0000i
7.0000 + 14.0000i
8.0000 + 16.0000i
9.0000 + 18.0000i
Caveat
As far as I can tell, the C++ standard does not specify the size or data layout of the std::complex type, so this approach might not be portable. If you want a portable solution, I would suggest storing your complex data in a struct like float2/cfloat to avoid compiler-related issues.
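A minimal sketch of that portable variant, assuming the current ArrayFire names af::cfloat and af_print (in 2.0RC the corresponding names were cfloat and print):
#include <arrayfire.h>
#include <complex>
#include <vector>

// Copy the std::complex data into ArrayFire's own complex type explicitly
// instead of relying on reinterpret_cast and matching memory layouts.
int main() {
    const int count = 10;
    std::vector<std::complex<float> > host_complex(count);
    for (int i = 0; i < count; ++i)
        host_complex[i] = std::complex<float>(float(i), float(2*i));

    std::vector<af::cfloat> buf(count);
    for (int i = 0; i < count; ++i)
        buf[i] = af::cfloat(host_complex[i].real(), host_complex[i].imag());

    af::array af_complex(count, buf.data());
    af_print(af_complex);
    return 0;
}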
Umar

Bitwise operations, wrong result in Dart2Js

I'm doing ZigZag encoding on 32bit integers with Dart. This is the source code that I'm using:
int _encodeZigZag(int instance) => (instance << 1) ^ (instance >> 31);
int _decodeZigZag(int instance) => (instance >> 1) ^ (-(instance & 1));
The code works as expected in the DartVM.
But in dart2js the _decodeZigZag function returns invalid results if I input negative numbers, for example -10: it is encoded to 19 and should be decoded back to -10, but it is decoded to 4294967286. If I run (instance >> 1) ^ (-(instance & 1)) in the JavaScript console of Chrome, I get the expected result of -10. That tells me that JavaScript should be able to perform this operation properly with its number model.
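For concreteness, working through the numbers (my own arithmetic, not from the original post): encoding gives (-10 << 1) ^ (-10 >> 31) = -20 ^ -1 = 19, and decoding gives (19 >> 1) ^ -(19 & 1) = 9 ^ -1 = -10. The wrong result 4294967286 is exactly 2^32 - 10, i.e. the same 32-bit pattern as -10 read as an unsigned value.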
But Dart2Js generates the following JavaScript, which looks different from the code I tested in the console:
return ($.JSNumber_methods.$shr(instance, 1) ^ -(instance & 1)) >>> 0;
Why does Dart2Js add an unsigned right shift by 0 to the function? Without the shift, the result would be as expected.
Now I'm wondering: is it a bug in the Dart2Js compiler or the expected result? Is there a way to force Dart2Js to output the right JavaScript code?
Or is my Dart code wrong?
PS: I also tested splitting up the XOR into other operations, but Dart2Js still adds the right shift:
final a = -(instance & 1);
final b = (instance >> 1);
return (a & -b) | (-a & b);
Results in:
a = -(instance & 1);
b = $.JSNumber_methods.$shr(instance, 1);
return (a & -b | -a & b) >>> 0;
For efficiency reasons dart2js compiles Dart numbers to JS numbers. JS, however, only provides one number type: doubles. Furthermore bit-operations in JS are always truncated to 32 bits.
In many cases (like cryptography) it is easier to deal with unsigned 32 bits, so dart2js compiles bit-operations so that their result is an unsigned 32 bit number.
Neither choice (signed or unsigned) is perfect. Initially dart2js compiled to signed 32 bits, and was only changed when we tripped over it too frequently. As your code demonstrates, this doesn't remove the problem, just shifts it to different (hopefully less frequent) use-cases.
Non-compliant number semantics have been a long-standing bug in dart2js, but fixing it will take time and potentially slow down the resulting code. In the short-term future Dart developers (compiling to JS) need to know about this restriction and work around it.
Looks like I found equivalent code that outputs the right result. The unit tests pass for both the Dart VM and dart2js, so I will use it for now.
int _decodeZigZag(int instance) => ((instance & 1) == 1 ? -(instance >> 1) - 1 : (instance >> 1));
Dart2Js is not adding a shift this time. I would still be interested in the reason for this behavior.

Calculating factorial on FORTRAN with integer variables. Memory overflow

I'm writing a program in FORTRAN that is a bit special. I can only use integer variables, and as you know, with these you get an overflow when you try to calculate a factorial larger than 12 or 13. So I made this program to avoid the problem:
http://lendricheolfiles.webs.com/codigo.txt
But something very strange is happening: the program calculates the factorial correctly 4 or 5 times and then gives a memory overflow message. I'm using Windows 8 and I fear it might be the cause of the failure, or maybe I've just done something wrong.
Thanks.
Try compiling with run-time subscript checking. In Fortran, segmentation faults are generally caused either by subscript errors or by mismatches between actual and dummy arguments (i.e., between the arguments in the call to a procedure and the arguments as declared in the procedure). I'll make a wild guess from glancing at your code that you have a subscript error; let the compiler find it for you by turning on run-time subscript checking. Most Fortran compilers have this as a compilation option.
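For example (the option names below are the usual ones for gfortran and Intel Fortran; the source file name is just a placeholder):
gfortran -fcheck=bounds codigo.f90 -o codigo
ifort -check bounds codigo.f90 -o codigo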
P.S. You can also do calculations like this by using already written packages, e.g., the arbitrary precision arithmetic software of David Bailey, et al., available in Fortran 90 at http://crd-legacy.lbl.gov/~dhbailey/mpdist/
M.S.B.'s answer has the gist of your problem: your array indices go out of bounds in a couple of places.
In three loops, cifra - 1 == 0 is out of bounds:
do cifra=ncifras,1,-1
   factor(1,cifra-1) = factor(1,cifra)/10   ! factor is (1:2, 1:ncifras)
   factor(1,cifra) = mod(factor(1,cifra),10)
enddo
! :
! Same here:
do cifra=ncifras,1,-1
   factor(2,cifra-1) = factor(2,cifra)/10
   factor(2,cifra) = mod(factor(2,cifra),10)
enddo
!:
do cifra=ncifras,1,-1
   sumaprovisional(cifra-1) = sumaprovisional(cifra-1)+(sumaprovisional(cifra)/10)
   sumaprovisional(cifra) = mod(sumaprovisional(cifra),10)
enddo
In the next case, the value of cifra - (fila - 1) goes out of bounds:
do fila=1,nfilas
   do cifra=1,ncifras
      ! Out of bounds for all cifra < fila:
      sumando(fila,cifra-(fila-1)) = factor(1,cifra)*factor(2,ncifras-(fila-1))
   enddo
   sumaprovisional = sumaprovisional+sumando(fila,:)
enddo
You should be fine if you rewrite the first three loops as do cifra = ncifras, 2, -1 and the inner loop of the other case as do cifra = fila, ncifras. Also, in the example program you posted, you first have to allocate resultado properly before passing it to the subroutine.
