OpenMP task final clause

I have a little problem regarding the final clause of the task concept.
The following code works fine with the if-else statement, which stops creating tasks once the length drops below 100 elements in a recursive quicksort implementation.
Now I want to implement this with the final clause, but it doesn't work as expected: it is much slower than with the if-else statement.
//if ( length > 100 ){
#pragma omp task untied final(length < 100) mergeable
do_something(a,c);
#pragma omp task untied final(length < 100) mergeable
do_something(b,c);
//}else{
// do_something(a,c);
// do_something(b,c);
//}
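For comparison, here is a minimal sketch of one way to combine the final clause with an explicit serial path. It assumes the slowdown comes from the runtime still processing every task construct below the cutoff (final only makes the descendants included tasks, it does not remove the constructs), which may or may not be what happens with your compiler. omp_in_final() is a standard OpenMP routine; the names split, quicksort_serial, quicksort_tasks, sort_with_tasks and kCutoff are made up for this example.

```cpp
#include <algorithm>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Assumption: without OpenMP everything runs serially, so behave as if final.
static int omp_in_final() { return 1; }
#endif

const int kCutoff = 100;  // threshold taken from the question

// Hoare-style split; afterwards [lo, j] and [i, hi] can be sorted independently.
static void split(int* a, int lo, int hi, int& i, int& j) {
    int pivot = a[lo + (hi - lo) / 2];
    i = lo;
    j = hi;
    while (i <= j) {
        while (a[i] < pivot) ++i;
        while (a[j] > pivot) --j;
        if (i <= j) std::swap(a[i++], a[j--]);
    }
}

// Serial path: no task constructs are encountered at all.
static void quicksort_serial(int* a, int lo, int hi) {
    if (lo >= hi) return;
    int i, j;
    split(a, lo, hi, i, j);
    quicksort_serial(a, lo, j);
    quicksort_serial(a, i, hi);
}

static void quicksort_tasks(int* a, int lo, int hi) {
    if (lo >= hi) return;
    if (omp_in_final()) {  // inside a final task: stop creating tasks
        quicksort_serial(a, lo, hi);
        return;
    }
    int i, j;
    split(a, lo, hi, i, j);
    #pragma omp task untied final(j - lo < kCutoff) mergeable
    quicksort_tasks(a, lo, j);
    #pragma omp task untied final(hi - i < kCutoff) mergeable
    quicksort_tasks(a, i, hi);
    #pragma omp taskwait
}

// Hypothetical entry point: one thread seeds the recursion, the team runs the tasks.
void sort_with_tasks(std::vector<int>& v) {
    #pragma omp parallel
    #pragma omp single
    quicksort_tasks(v.data(), 0, static_cast<int>(v.size()) - 1);
}
```

Calling sort_with_tasks(v) on a std::vector<int> sorts it in place. Built without OpenMP support the pragmas are ignored and the code falls back to the serial path.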

How to ask Clang++ not to cache function result during -O3 optimization?

This is my code:
int foo(int x) {
return x + 1; // I have more complex code here
}
int main() {
int s = 0;
for (int i = 0; i < 1000000; ++i) {
s += foo(42);
}
}
Without -O3 this code works for a few minutes. With -O3 it returns the same result in no time. Clang++, I believe, caches the value of foo(42) (it's a pure function) and doesn't call it a million times. How can I instruct it NOT to apply this particular optimization for this particular function call?
Out of curiosity, can you share why you would want to disable that optimization?
Anyway, about your question:
In your example code, s is never read after the loop, so the compiler would throw the whole loop away. So let's assume that s is used after the loop.
I'm not aware of any pragmas or compiler options to disable a particular optimization in a particular section of code.
Is changing the code an option?
To prevent that optimization in a portable manner, you can look for a creative way to compute the function call argument in a way such that the compiler is no longer able to treat the argument as constant. Of course the challenge here is to actually use a trick that does not rely on undefined behavior and that cannot be "outsmarted" by a newer compiler version.
See the commented example below.
pro: the trick uses only standard language features, and you can apply it selectively
con: you get an additional memory access in every loop iteration; however, the access will be satisfied by your CPU cache most of the time
I verified the generated assembly for your particular example with clang++ -O3 -S. The compiler now generates your loop and no longer caches the result. However, the function gets inlined. If you want to prevent that as well, you can declare foo with __attribute__((noinline)), for example.
int foo(int x) {
return x + 1; // I have more complex code here
}
volatile int dummy = 0; // initialized to 0 and never changed
int main() {
int s = 0;
for (int i = 0; i < 1000000; ++i) {
// Because of the volatile variable, the compiler is forced to assume
// that the function call argument is different for each loop
// iteration and it is no longer able to use a cached result.
s += foo(42 + dummy);
}
}
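If you also want to keep the call out of line, the noinline variant mentioned above could look roughly like this. It is only a sketch: __attribute__((noinline)) is a GCC/Clang extension, and sum_calls is a hypothetical wrapper added so that the result is actually used.

```cpp
// GCC/Clang extension: ask the compiler not to inline foo into its callers.
__attribute__((noinline)) int foo(int x) {
    return x + 1;  // placeholder for the more complex code
}

volatile int dummy = 0;  // initialized to 0 and never changed

// Hypothetical wrapper that returns the sum, so the loop is not dead code.
int sum_calls(int iterations) {
    int s = 0;
    for (int i = 0; i < iterations; ++i) {
        // The volatile read makes the argument opaque to the optimizer,
        // so the call has to be evaluated in every iteration.
        s += foo(42 + dummy);
    }
    return s;
}
```

Returning s from sum_calls plays the same role as using s after the loop in main: without a use, the compiler may still drop the accumulation.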

Search for sequence in Uint8List

Is there a fast (native) method to search for a sequence in a Uint8List?
///
/// Return index of first occurrence of seq in list
///
int indexOfSeq(Uint8List list, Uint8List seq) {
...
}
EDIT: Changed List<int> into Uint8List
No. There is no built-in way to search for a sequence of elements in a list.
I am also not aware of any dart:ffi based implementations.
The simplest approach would be:
extension IndexOfElements<T> on List<T> {
int indexOfElements(List<T> elements, [int start = 0]) {
if (elements.isEmpty) return start;
var end = length - elements.length;
if (start > end) return -1;
var first = elements.first;
var pos = start;
outer:
while (true) {
pos = indexOf(first, pos);
if (pos < 0 || pos > end) return -1;
for (var i = 1; i < elements.length; i++) {
if (this[pos + i] != elements[i]) {
// Mismatch: retry the outer search from the next candidate position.
pos++;
continue outer;
}
}
return pos;
}
}
}
This has worst-case time complexity O(length*elements.length). There are several more algorithms with better worst-case complexity, but they also have larger constant factors and more expensive pre-computations (KMP, BMH). Unless you search for the same long list several times, or do so in a very, very long list, they're unlikely to be faster in practice (and they'd probably have an API where you compile the pattern first, then search with it.)
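To illustrate what such a "compile the pattern first, then search with it" API could look like, here is a small KMP sketch. It is written in C++ rather than Dart, and the class name BytePattern and its method indexIn are invented for this example; it is not an existing library API.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical "compile once, search many times" wrapper around KMP.
class BytePattern {
public:
    explicit BytePattern(std::vector<std::uint8_t> pattern)
        : pattern_(std::move(pattern)), fail_(pattern_.size(), 0) {
        // fail_[i] is the length of the longest proper prefix of
        // pattern_[0..i] that is also a suffix of it.
        for (std::size_t i = 1, k = 0; i < pattern_.size(); ++i) {
            while (k > 0 && pattern_[i] != pattern_[k]) k = fail_[k - 1];
            if (pattern_[i] == pattern_[k]) ++k;
            fail_[i] = k;
        }
    }

    // Returns the index of the first match at or after start, or -1.
    long indexIn(const std::vector<std::uint8_t>& data,
                 std::size_t start = 0) const {
        if (pattern_.empty()) return static_cast<long>(start);
        std::size_t k = 0;  // number of pattern bytes matched so far
        for (std::size_t i = start; i < data.size(); ++i) {
            while (k > 0 && data[i] != pattern_[k]) k = fail_[k - 1];
            if (data[i] == pattern_[k]) ++k;
            if (k == pattern_.size()) return static_cast<long>(i + 1 - k);
        }
        return -1;
    }

private:
    std::vector<std::uint8_t> pattern_;
    std::vector<std::size_t> fail_;
};
```

The constructor does the pre-computation once; each indexIn call then runs in time linear in the length of data, at the cost of the extra table and bookkeeping mentioned above.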
You could use dart:ffi to bind to memmem from string.h as you suggested.
We do the same with binding to malloc from stdlib.h in package:ffi (source).
final DynamicLibrary stdlib = Platform.isWindows
? DynamicLibrary.open('kernel32.dll')
: DynamicLibrary.process();
final PosixMalloc posixMalloc =
stdlib.lookupFunction<Pointer Function(IntPtr), Pointer Function(int)>('malloc');
Edit: as lrn pointed out, we cannot expose the inner data pointer of a Uint8List at the moment, because the GC might relocate it.
One could use dart_api.h and use the FFI to pass TypedData through the FFI trampoline as Dart_Handle and use Dart_TypedDataAcquireData from the dart_api.h to access the inner data pointer.
(If you want to use this in Flutter, we would need to expose Dart_TypedDataAcquireData and Dart_TypedDataReleaseData in dart_api_dl.h, see https://github.com/dart-lang/sdk/issues/40607. I've filed https://github.com/dart-lang/sdk/issues/44442 to track this.)
Alternatively, we could address https://github.com/dart-lang/sdk/issues/36707 so that we could just expose the inner data pointer of a Uint8List directly in the FFI trampoline.

Can clang static analysis be challenged with various flows

I wonder whether the clang static analyzer currently supports checking across different execution flows. For example, there is a checker for division by zero, but can clang find a flow in which a divisor can evaluate to zero?
Simple example: in the code below there is a flow (i=0) in which b evaluates to 0. Do I get a warning here?
for(int i = 10; i>=0; i--){
int a = div(i);
...
}
int div(int b){
return 100000 / b;
}
If not, is there a plan to support this?

Console Print Speed

I’ve been looking at a few example programs in order to find better ways to code with Dart.
Not that this example (below) is of any particular importance, however it is taken from rosettacode dot org with alterations by me to (hopefully) bring it up-to-date.
The point of this posting is with regard to Benchmarks and what may be detrimental to results in Dart in some Benchmarks in terms of the speed of printing to the console compared to other languages. I don’t know what the comparison is (to other languages), however in Dart, the Console output (at least in Windows) appears to be quite slow even using StringBuffer.
As an aside, in my test, if n1 is allowed to grow to 11, the total recursion count = >238 million, and it takes (on my laptop) c. 2.9 seconds to run Example 1.
In addition, of possible interest, if the String assignment is altered to int, without printing, no time is recorded as elapsed (Example 2).
Typical times on my low-spec laptop (run from the Console - Windows).
Elapsed Microseconds (Print) = 26002
Elapsed Microseconds (StringBuffer) = 9000
Elapsed Microseconds (no Printing) = 3000
Obviously in this case, console print times are a significant factor relative to computation etc. times.
So, can anyone advise how this compares to eg. Java times for console output? That would at least be an indication as to whether Dart is particularly slow in this area, which may be relevant to some Benchmarks. Incidentally, times when running in the Dart Editor incur a negligible penalty for printing.
// Example 1. The base code for the test (Ackermann).
main() {
for (int m1 = 0; m1 <= 3; ++m1) {
for (int n1 = 0; n1 <= 4; ++n1) {
print ("Acker(${m1}, ${n1}) = ${fAcker(m1, n1)}");
}
}
}
int fAcker(int m2, int n2) => m2==0 ? n2+1 : n2==0 ?
fAcker(m2-1, 1) : fAcker(m2-1, fAcker(m2, n2-1));
The altered code for the test.
// Example 2 //
main() {
fRunAcker(1); // print
fRunAcker(2); // StringBuffer
fRunAcker(3); // no printing
}
void fRunAcker(int iType) {
String sResult;
StringBuffer sb1;
Stopwatch oStopwatch = new Stopwatch();
oStopwatch.start();
List lType = ["Print", "StringBuffer", "no Printing"];
if (iType == 2) // Use StringBuffer
sb1 = new StringBuffer();
for (int m1 = 0; m1 <= 3; ++m1) {
for (int n1 = 0; n1 <= 4; ++n1) {
if (iType == 1) // print
print ("Acker(${m1}, ${n1}) = ${fAcker(m1, n1)}");
if (iType == 2) // StringBuffer
sb1.write ("Acker(${m1}, ${n1}) = ${fAcker(m1, n1)}\n");
if (iType == 3) // no printing
sResult = "Acker(${m1}, ${n1}) = ${fAcker(m1, n1)}\n";
}
}
if (iType == 2)
print (sb1.toString());
oStopwatch.stop();
print ("Elapsed Microseconds (${lType[iType-1]}) = "+
"${oStopwatch.elapsedMicroseconds}");
}
int fAcker(int m2, int n2) => m2==0 ? n2+1 : n2==0 ?
fAcker(m2-1, 1) : fAcker(m2-1, fAcker(m2, n2-1));
//Typical times on my low-spec laptop (run from the console).
// Elapsed Microseconds (Print) = 26002
// Elapsed Microseconds (StringBuffer) = 9000
// Elapsed Microseconds (no Printing) = 3000
I tested using Java, which was an interesting exercise.
The results from this small test indicate that Dart takes about 60% longer for the console output than Java, using the results from the fastest for each. I really need to do a larger test with more terminal output, which I will do.
In terms of "computational" speed with no output, using this test and m = 3, and n = 10, the comparison is consistently around 530 milliseconds for Java compared to 580 milliseconds for Dart. That is 59.5 million calls. Java bombs with n = 11 (238 million calls), which I presume is stack overflow. I'm not saying that is a definitive benchmark of much, but it is an indication of something. Dart appears to be very close in the computational time which is pleasing to see. I altered the Dart code from using the "question mark operator" to use "if" statements the same as Java, and that appears to be a bit faster c. 10% or more, and that appeared to be consistently the case.
I ran a further test for console printing as shown below (example 1 – Dart), (Example 2 – Java).
The best times for each are as follows (100,000 iterations) :
Dart 47 seconds.
Java 22 seconds.
Dart Editor 2.3 seconds.
While it is not earth-shattering, it does appear to illustrate that for some reason (a) Dart is slow with console output, and (b) Dart-Editor is extremely fast with console output. (c) This needs to be taken into account when evaluating any performance that involves console output, which is what initially drew my attention to it.
Perhaps when they have time :) the Dart team could look at this if it is considered worthwhile.
Example 1 - Dart
// Dart - Test 100,000 iterations of console output //
Stopwatch oTimer = new Stopwatch();
main() {
// "warm-up"
for (int i1=0; i1 < 20000; i1++) {
print ("The quick brown fox chased ...");
}
oTimer.reset();
oTimer.start();
for (int i2=0; i2 < 100000; i2++) {
print ("The quick brown fox chased ....");
}
oTimer.stop();
print ("Elapsed time = ${oTimer.elapsedMicroseconds/1000} milliseconds");
}
Example 2 - Java
public class console001
{
// Java - Test 100,000 iterations of console output
public static void main (String [] args)
{
// warm-up
for (int i1=0; i1<20000; i1++)
{
System.out.println("The quick brown fox jumped ....");
}
long tmStart = System.nanoTime();
for (int i2=0; i2<100000; i2++)
{
System.out.println("The quick brown fox jumped ....");
}
long tmEnd = System.nanoTime() - tmStart;
System.out.println("Time elapsed in microseconds = "+(tmEnd/1000));
}
}

Can the STREAM and GUPS (single CPU) benchmark use non-local memory in NUMA machine

I want to run some tests from HPCC, STREAM and GUPS.
They will test memory bandwidth, latency, and throughput (in terms of random accesses).
Can I start the Single CPU STREAM test or Single CPU GUPS on a NUMA node with memory interleaving enabled? (Is it allowed by the rules of HPCC, the High Performance Computing Challenge?)
Using non-local memory can increase GUPS results, because it increases the number of memory banks available for random accesses 2- or 4-fold. (GUPS is typically limited by a non-ideal memory subsystem and by slow memory bank opening/closing. With more banks it can update one bank while the other banks are opening/closing.)
Thanks.
UPDATE:
(you may not reorder the memory accesses that the program makes).
But can the compiler reorder the loop nesting? E.g. hpcc/RandomAccess.c
/* Perform updates to main table. The scalar equivalent is:
*
* u64Int ran;
* ran = 1;
* for (i=0; i<NUPDATE; i++) {
* ran = (ran << 1) ^ (((s64Int) ran < 0) ? POLY : 0);
* table[ran & (TableSize-1)] ^= stable[ran >> (64-LSTSIZE)];
* }
*/
for (j=0; j<128; j++)
ran[j] = starts ((NUPDATE/128) * j);
for (i=0; i<NUPDATE/128; i++) {
/* #pragma ivdep */
for (j=0; j<128; j++) {
ran[j] = (ran[j] << 1) ^ ((s64Int) ran[j] < 0 ? POLY : 0);
Table[ran[j] & (TableSize-1)] ^= stable[ran[j] >> (64-LSTSIZE)];
}
}
The main loop here is for (i=0; i<NUPDATE/128; i++) { and the nested loop is for (j=0; j<128; j++) {. Using the 'loop interchange' optimization, the compiler can convert this code to
for (j=0; j<128; j++) {
for (i=0; i<NUPDATE/128; i++) {
ran[j] = (ran[j] << 1) ^ ((s64Int) ran[j] < 0 ? POLY : 0);
Table[ran[j] & (TableSize-1)] ^= stable[ran[j] >> (64-LSTSIZE)];
}
}
It can be done because this loop nest is a perfect loop nest. Is such an optimization prohibited by the rules of HPCC?
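Independent of what the rules say, a small stand-alone check suggests the interchange leaves the final table unchanged: each ran[j] stream depends only on its own previous value, and the XOR updates commute. Only the order of the memory accesses differs. The sketch below is not the HPCC code; the seeds, the contents of stable, the table size and the constant values are assumptions chosen for illustration.

```cpp
#include <cstdint>
#include <vector>

typedef std::uint64_t u64;

// Assumed constants for illustration; check them against the HPCC sources.
const u64 POLY = 0x7ULL;
const int LSTSIZE = 9;     // stable has 2^LSTSIZE entries
const int STREAMS = 128;   // number of independent ran[j] streams

// One step of the generator from the quoted loop body.
static u64 next_ran(u64 r) {
    return (r << 1) ^ (static_cast<std::int64_t>(r) < 0 ? POLY : 0);
}

// Hypothetical seeds; the real benchmark calls starts() instead.
std::vector<u64> make_seeds() {
    std::vector<u64> ran(STREAMS);
    for (int j = 0; j < STREAMS; ++j) ran[j] = 0x9E3779B97F4A7C15ULL * (j + 1);
    return ran;
}

// Hypothetical substitution table with arbitrary contents.
std::vector<u64> make_stable() {
    std::vector<u64> stable(1u << LSTSIZE);
    for (u64 k = 0; k < stable.size(); ++k) stable[k] = k * 0x0123456789ABCDEFULL;
    return stable;
}

// Original order: outer loop over i, inner loop over the streams.
// tableSize must be a power of two.
std::vector<u64> update_original(std::vector<u64> ran, const std::vector<u64>& stable,
                                 u64 tableSize, u64 updatesPerStream) {
    std::vector<u64> table(tableSize);
    for (u64 k = 0; k < tableSize; ++k) table[k] = k;
    for (u64 i = 0; i < updatesPerStream; ++i) {
        for (int j = 0; j < STREAMS; ++j) {
            ran[j] = next_ran(ran[j]);
            table[ran[j] & (tableSize - 1)] ^= stable[ran[j] >> (64 - LSTSIZE)];
        }
    }
    return table;
}

// Interchanged order: outer loop over the streams, inner loop over i.
std::vector<u64> update_interchanged(std::vector<u64> ran, const std::vector<u64>& stable,
                                     u64 tableSize, u64 updatesPerStream) {
    std::vector<u64> table(tableSize);
    for (u64 k = 0; k < tableSize; ++k) table[k] = k;
    for (int j = 0; j < STREAMS; ++j) {
        for (u64 i = 0; i < updatesPerStream; ++i) {
            ran[j] = next_ran(ran[j]);
            table[ran[j] & (tableSize - 1)] ^= stable[ran[j] >> (64 - LSTSIZE)];
        }
    }
    return table;
}
```

Comparing update_original(make_seeds(), make_stable(), 1024, 50) with update_interchanged on the same arguments gives equal tables. The access pattern is very different, though: the interchanged version finishes one stream before touching the next, so whether it still counts as the same benchmark is exactly the question for the HPCC rules.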
As far as I can tell it is allowed given that the memory interleaving
is a system setting rather than a code modification (you may not reorder
the memory accesses that the program makes).
Whether GUPS actually gets better performance with non-local memory on a
NUMA machine seems doubtful to me. Will bank conflict-induced latency
really be greater than the off-node memory access latency?
STREAM should not be limited by bank conflicts but will probably
benefit from off-node accesses if the CPU has an on-chip memory
controller (like the Opterons) since the bandwidth is then shared
between the local memory controller and the NUMA interconnect.
