warning X3557: loop only executes for 0 iteration(s), forcing loop to unroll - directx

The compiler produces "warning X3557: loop only executes for 0 iteration(s), forcing loop to unroll" and I don't understand why.
Here is the source code. It is a reworked itoa() function for HLSL that writes the resulting ASCII codes into an array of uint.
#define ITOA_BUFFER_SIZE 16
// Convert uint to ascii and return number of characters
uint UIntToAscii(
in uint Num, // Number to convert
out uint Buf[ITOA_BUFFER_SIZE], // Where to put resulting ascii codes
in uint Base) // Numeration base for conversion
{
uint I, J, K;
I = 0;
while (I < (ITOA_BUFFER_SIZE - 1)) { // <==== Warning X3557
uint Digit = Num % Base;
if (Digit < 10)
Buf[I++] = '0' + Digit;
else
Buf[I++] = 'A' + Digit - 10;
if ((Num /= Base) == 0)
break;
}
// Reverse buffer
for (K = 0, J = I - 1; K < J; K++, J--) { // <==== Warning X3557
uint T = Buf[K];
Buf[K] = Buf[J];
Buf[J] = T;
}
// Fill the rest of the buffer with zeros to make the compiler happy
K = I;
while (K < ITOA_BUFFER_SIZE)
Buf[K++] = 0;
return I;
}
I tried rewriting the while loop, but that doesn't change anything. I also tried the [fastopt] attribute, without success. As far as I can see, the function produces the correct result.
Any help appreciated.

The warning you are getting is
WAR_TOO_SIMPLE_LOOP 3557 The loop only executes for a limited number
of iterations or doesn't seem to do anything so consider removing it
or forcing it to unroll.
The warning is fairly self-explanatory once you remember that loops are considered inefficient in GPGPU code, so the compiler tries to unroll them whenever possible. The compiler is telling you that some of your loops would run more efficiently if unrolled, or can be removed because they never run. A loop is unrollable when the number of times it will run can be predicted at compile time. At first glance, your loops should not meet this criterion.
I = 0;
while (I < (ITOA_BUFFER_SIZE - 1)) { // <==== Warning X3557
uint Digit = Num % Base;
if (Digit < 10)
Buf[I++] = '0' + Digit;
else
Buf[I++] = 'A' + Digit - 10;
if ((Num /= Base) == 0)
break;
}
This while loop runs at most 15 times, because of the bound I < (ITOA_BUFFER_SIZE - 1), and it exits earlier as soon as (Num /= Base) == 0. The final value of I is therefore between 1 and 15, depending on how if ((Num /= Base) == 0) evaluates on each iteration. Nonetheless, the loop is still unrollable: the upper bound is known at compile time, so the compiler can emit the iterations one after another and insert a conditional jump that skips the remaining ones.
// Reverse buffer
for (K = 0, J = I - 1; K < J; K++, J--) { // <==== Warning X3557
uint T = Buf[K];
Buf[K] = Buf[J];
Buf[J] = T;
}
The second loop, by contrast, should not be unrollable, because the value of I should not be known to the compiler.
The warning you reported
warning X3557: loop only executes for 0 iteration(s), forcing loop to unroll
might refer to the first loop if the test (Num /= Base) == 0 always evaluates to true on the first iteration. In that case, I would be equal to 1, and J would be equal to 0 in the second loop. That second loop would then never run, because K < J would evaluate to false before the first iteration.
If you let the compiler [unroll], what you probably get in the end is a single iteration of the while loop and the complete removal of the subsequent for loop. I strongly suspect this is not your intended behaviour, and although unrolling might suppress the warning, you should check the code to see whether something does not run the way it should.
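To make the single-digit case concrete, here is a minimal CPU-side sketch in C++ (not HLSL) of the same routine. The names uint_to_ascii, ItoaResult and kItoaBufferSize are made up for this illustration, and the swaps counter exists only to show how often the reversal loop body runs. It assumes the base is between 2 and 36.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <utility>

constexpr std::uint32_t kItoaBufferSize = 16;  // mirrors ITOA_BUFFER_SIZE

// Hypothetical helper type: length written plus reversal-loop iterations.
struct ItoaResult {
    std::uint32_t length;  // number of characters written
    std::uint32_t swaps;   // how many times the reversal loop body ran
};

ItoaResult uint_to_ascii(std::uint32_t num,
                         std::array<std::uint32_t, kItoaBufferSize>& buf,
                         std::uint32_t base) {
    assert(base >= 2 && base <= 36);
    buf.fill(0);  // zero the whole buffer up front instead of at the end

    // Digit loop: at most 15 iterations, exits once the quotient reaches 0.
    std::uint32_t i = 0;
    while (i < kItoaBufferSize - 1) {
        const std::uint32_t digit = num % base;
        buf[i++] = digit < 10 ? '0' + digit : 'A' + digit - 10;
        num /= base;
        if (num == 0) break;
    }

    // Reversal loop: runs floor(i / 2) times, so zero times when i == 1.
    std::uint32_t swaps = 0;
    for (std::uint32_t k = 0, j = i - 1; k < j; ++k, --j) {
        std::swap(buf[k], buf[j]);
        ++swaps;
    }
    return {i, swaps};
}
```

Calling uint_to_ascii(7, buf, 10) writes one character and performs no swaps, which is the situation the answer describes: one pass through the while loop and a for loop that never executes. Whether the HLSL compiler actually sees such a constant at your call site is an assumption you would need to check.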

Related

Explicit memory prefetching for Intel Compilers

I have two functions: one calculates the difference between successive elements of a row, and the second calculates the difference between successive values in a column. So the first computes M[i][j+1] - M[i][j] and the second computes M[i+1][j] - M[i][j], M being the matrix. I implement them as follows:
inline void firstFunction(uchar* input, uchar* output, size_t M, size_t N){
for(int i=0; i < M; i++){
for(int j=0; j <=N - 33; j+=32){
auto pos = i*N + j;
_mm256_storeu_epi8(output + pos, _mm256_sub_epi8(_mm256_loadu_epi8(input + pos + 1), _mm256_loadu_epi8(input + pos)));
}
}
}
void secondFunction(uchar* input, uchar* output, size_t M, size_t N){
for(int i = 0; i < M-1; i++){
//#pragma prefetch input : (i+1)*N : (i+1)*N + N
for(int j = 0; j <N-33; j+=32){
auto idx = i * N + j;
auto idx_1 = (i+1)*N + j;
_mm256_storeu_epi8(output + idx, _mm256_sub_epi8(_mm256_loadu_epi8(input + idx_1), _mm256_loadu_epi8(input + idx)));
}
}
}
However, when I benchmark them, the average runtimes of the first and second functions are as follows:
firstFunction = 21.1432ms
secondFunction = 166.851ms
Where the size of matrix is M = 9024 and N = 12032
This is a huge increase in the runtime for a similar operation. I suspect this has something to do with memory accesses and caching, where way more cycles are spent in getting the memory from another row in the second case.
So my question is two-part.
1. Is my reasoning for the difference in runtimes correct?
2. How do I alleviate it? My first idea is to prefetch the second row into memory and go ahead, but I am not able to prefetch a dynamically calculated position. Would _mm_prefetch help if the issue is indeed the memory access times?
I am using the dpcpp compiler, with the compile options -g -O3 -fsycl -fsycl-targets=spir64 -mavx512f -mavx512vl -mavx512bw -qopenmp -liomp5 -lpthread. This compiler has a pragma prefetch, but it does not allow prefetch addresses calculated at runtime. However, I would really appreciate something that is not specific to this compiler; it could also be specific to GCC.
Edit1 - I just tried _mm_prefetch, but that fails too, with "error: argument to '__builtin_prefetch' must be a constant integer" for the call _mm_prefetch(input + (i+1) * N, N);. So an additional question: is there any way to prefetch memory locations calculated at runtime?
TIA
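A note on the error in Edit1: the second parameter of _mm_prefetch is a locality hint such as _MM_HINT_T0, not a length, and it must be a compile-time constant. The address in the first parameter may be computed at runtime. The sketch below uses the GCC/Clang builtin __builtin_prefetch in a scalar version of the column difference. The function name column_diff_prefetch and the 64-byte cache-line size are assumptions made for this illustration, and whether a software prefetch helps at all here is something only a benchmark can tell.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar sketch of output[i][j] = input[i+1][j] - input[i][j] with a
// software prefetch of the row after next. GCC/Clang only.
void column_diff_prefetch(const std::uint8_t* input, std::uint8_t* output,
                          std::size_t rows, std::size_t cols) {
    assert(input != nullptr && output != nullptr);
    constexpr std::size_t kCacheLine = 64;  // assumed cache-line size in bytes

    for (std::size_t i = 0; i + 1 < rows; ++i) {
        const std::uint8_t* cur = input + i * cols;
        const std::uint8_t* next = cur + cols;
        for (std::size_t j = 0; j < cols; ++j) {
            if (j % kCacheLine == 0 && i + 2 < rows) {
                // The address is a runtime value. Only the trailing arguments
                // (0 = read access, 3 = high temporal locality) must be
                // compile-time constants.
                __builtin_prefetch(next + cols + j, 0, 3);
            }
            output[i * cols + j] = static_cast<std::uint8_t>(next[j] - cur[j]);
        }
    }
}
```

With the intrinsic, the equivalent call would be _mm_prefetch(reinterpret_cast<const char*>(ptr), _MM_HINT_T0) from <xmmintrin.h>, again with a runtime pointer and a constant hint.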

How to convert Uint8List to decimal number in Dart?

I have an Uint8List data list, for example:
Uint8List uintList = Uint8List.fromList([10, 1]);
How can I convert these numbers to a decimal number?
int decimalValue = ??? // in this case 266
Mees' answer is the correct general method, and it's good to understand how to do bitwise operations manually.
However, Dart does have a ByteData class that has various functions to help parse byte data for you (e.g. getInt16, getUint16). In your case, you can do:
Uint8List uintList = Uint8List.fromList([10, 1]);
int decimalValue = ByteData.view(uintList.buffer).getInt16(0, Endian.little);
print(decimalValue); // Prints: 266.
From what I understand of your question, you want decimalValue to be an integer where the least significant byte is (decimal)10, and the byte after that to be 1. This would result in the value 1 * 256 + 10 = 266. If you meant the bytes the other way around, it would be 10 * 256 + 1 = 2560 + 1 = 2561.
I don't actually have any experience with dart, but I assume code similar to this would work:
int decimalValue = 0;
for (int i = 0; i < uintList.length; i++) {
decimalValue = decimalValue << 8; // shift everything one byte to the left
decimalValue = decimalValue | uintList[i]; // bitwise or operation
}
If it doesn't produce the number you want it to, you might have to iterate through the loop backwards instead, which requires changing one line of code:
for (int i = uintList.length-1; i >= 0; i--) {
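For comparison, the same shift-and-or idea can be sketched in C++ rather than Dart. The function names from_big_endian and from_little_endian are made up for this illustration, and the sketch assumes at most 8 input bytes so the result fits in a 64-bit integer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// First byte is the most significant one (forward iteration).
std::uint64_t from_big_endian(const std::vector<std::uint8_t>& bytes) {
    assert(bytes.size() <= 8);
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        value = (value << 8) | bytes[i];  // shift one byte left, then append
    }
    return value;
}

// First byte is the least significant one (backward iteration).
std::uint64_t from_little_endian(const std::vector<std::uint8_t>& bytes) {
    assert(bytes.size() <= 8);
    std::uint64_t value = 0;
    for (std::size_t i = bytes.size(); i-- > 0;) {
        value = (value << 8) | bytes[i];
    }
    return value;
}
```

For the bytes {10, 1}, from_little_endian gives 1 * 256 + 10 = 266 and from_big_endian gives 10 * 256 + 1 = 2561, matching the two readings discussed above.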

How to vectorize Mersenne Twister loops over arrays

Currently I'm working with a custom implementation of the Mersenne Twister, and I'd like to improve my understanding of vector operations.
I have the following code:
#define N 624
#define M 397
for( k = N -1; k; k-- )
{
array[i] = (array[i] ^ ((array[i-1] ^ (array[i-1] >> 30)) * 1566083941UL)) - i;
array[i] &= 0xffffffffUL;
++i;
if ( i >= N )
{
array[0] = array[N-1];
i = 1;
}
}
Here I'm working with 32-bit integers only, so as I understand it, I could perform 8 times as many operations at the same time using AVX2 instructions? How can I do that in practice?
I know how to deal with the addition of 2 vectors, but this case seems more complicated. I don't know where to begin.
For a simple scalar loop, I would go from the first form below to the second, but I'm not sure how to carry this over to my case.
for (i = 0; i < 1024; i++)
{
C[i] = A[i]*B[i];
}
for (i = 0; i < 1024; i+=4)
{
C[i:i+3] = A[i:i+3]*B[i:i+3];
}
Unfortunately, my university offers no lessons on intrinsics, but I'm quite curious about the improvement they could bring.
I'm also wondering how to create the array using vectors. Maybe as a matrix? (Maybe with _mm256_setr_epi32.)
I hope to get some advice on this topic!
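One thing worth noticing before reaching for intrinsics: the two loops in the question differ in their data dependencies. The sketch below, in plain C++, puts them side by side. The function names multiply_elementwise and mix_state and the starting index i = 1 are assumptions made for this illustration; the original snippet does not show where i starts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Independent iterations: c[i] depends only on a[i] and b[i], so several
// elements can be computed at once (by the compiler or by hand-written SIMD).
void multiply_elementwise(const std::vector<std::uint32_t>& a,
                          const std::vector<std::uint32_t>& b,
                          std::vector<std::uint32_t>& c) {
    assert(a.size() == c.size() && b.size() == c.size());
    for (std::size_t i = 0; i < c.size(); ++i) {
        c[i] = a[i] * b[i];
    }
}

// Loop-carried dependency: state[i] needs state[i - 1], which was written
// by the previous iteration, so the iterations cannot simply run side by side.
void mix_state(std::vector<std::uint32_t>& state) {
    const std::size_t n = state.size();
    assert(n >= 2);
    std::size_t i = 1;  // assumed starting index
    for (std::size_t k = n - 1; k; --k) {
        const std::uint32_t prev = state[i - 1];
        // uint32_t arithmetic already wraps modulo 2^32, so no mask is needed.
        state[i] = (state[i] ^ ((prev ^ (prev >> 30)) * 1566083941u))
                   - static_cast<std::uint32_t>(i);
        ++i;
        if (i >= n) {
            state[0] = state[n - 1];
            i = 1;
        }
    }
}
```

The first loop is the kind that maps directly onto 8-lane AVX2 operations. The second one is a recurrence, so a straightforward lane-by-lane translation would read values that have not been computed yet.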

method will not return correct result

public static double infiniteSeries(int terms) {
int i = 0;
double result = 0;
int n = 2;
do{
result += 1/n;
i++;
n *= 2;
} while (i < terms);
return result;
}
Here is the code that I have written so far. The goal of the method is to return the partial sum of the infinite series 1/2 + 1/4 + 1/8 + ... + 1/n. So if the user calls the method with an input of 1, the method should return .5; if the user inputs 2, it should return .75; and so on. Any time I call this method, it returns 0 no matter what value of terms I enter. I tried changing result += 1/n; to result += n; and that returns the sum of the n values the way I would expect. So it doesn't like the 1/n for some reason, but I have no idea why or how to fix it. Any help is appreciated!
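The symptom is consistent with integer division: when both operands of / are int, the result is truncated, so 1/n is 0 for every n greater than 1. Java and C++ behave the same way here. The following is a minimal C++ transliteration with a floating-point denominator; the names infinite_series and denominator are chosen for this sketch.

```cpp
#include <cassert>

// Partial sum 1/2 + 1/4 + ... with `terms` terms (at least one term is
// always added, as in the original do/while loop).
double infinite_series(int terms) {
    assert(1 / 2 == 0);  // int / int truncates: this is the original bug

    int i = 0;
    double result = 0.0;
    double denominator = 2.0;  // double also avoids int overflow for large `terms`
    do {
        result += 1.0 / denominator;  // floating-point division
        ++i;
        denominator *= 2.0;
    } while (i < terms);
    return result;
}
```

With this change, infinite_series(1) returns 0.5 and infinite_series(2) returns 0.75. In the Java original, writing 1.0 / n instead of 1/n has the same effect.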

Accessing global memory in CUDA is slow

I have a CUDA kernel doing some computation on a local variable (in register), and after it gets computed, its value gets written into a global array p:
__global__ void dd( float* p, int dimX, int dimY, int dimZ )
{
int
i = blockIdx.x*blockDim.x + threadIdx.x,
j = blockIdx.y*blockDim.y + threadIdx.y,
k = blockIdx.z*blockDim.z + threadIdx.z,
idx = k*dimX*dimY + j*dimX + i;
if (i >= dimX || j >= dimY || k >= dimZ)
{
return;
}
float val = 0;
val = SomeComputationOnVal();
p[idx ]= val;
__syncthreads();
}
Unfortunately, this function executes very slowly.
However, it runs very fast if I do this:
__global__ void dd( float* p, int dimX, int dimY, int dimZ )
{
int
i = blockIdx.x*blockDim.x + threadIdx.x,
j = blockIdx.y*blockDim.y + threadIdx.y,
k = blockIdx.z*blockDim.z + threadIdx.z,
idx = k*dimX*dimY + j*dimX + i;
if (i >= dimX || j >= dimY || k >= dimZ)
{
return;
}
float val = 0;
//val = SomeComputationOnVal();
p[idx ]= val;
__syncthreads();
}
It also runs very fast if I do this:
__global__ void dd( float* p, int dimX, int dimY, int dimZ )
{
int
i = blockIdx.x*blockDim.x + threadIdx.x,
j = blockIdx.y*blockDim.y + threadIdx.y,
k = blockIdx.z*blockDim.z + threadIdx.z,
idx = k*dimX*dimY + j*dimX + i;
if (i >= dimX || j >= dimY || k >= dimZ)
{
return;
}
float val = 0;
val = SomeComputationOnVal();
// p[idx ]= val;
__syncthreads();
}
So I am confused and have no idea how to solve this problem. I have stepped through the kernel with NSight and did not find any access violations.
Here is how I launch the kernel (dimX: 924; dimY: 16; dimZ: 1120):
dim3
blockSize(8,16,2),
gridSize(dimX/blockSize.x+1,dimY/blockSize.y, dimZ/blockSize.z);
float* dev_p; cudaMalloc((void**)&dev_p, dimX*dimY*dimZ*sizeof(float));
dd<<<gridSize, blockSize>>>( dev_p,dimX,dimY,dimZ);
Could anyone please give me some pointers? It does not make much sense to me. All the computation of val is fast, and the final step is to move val into p. p never gets involved in the computation, and it only shows up once. So why is it so slow?
The computations are basically a loop over a 512 x 512 matrix. It is a fairly large amount of computation, I'd say.
The computations you perform in SomeComputationOnVal are extremely expensive. Each thread reads at least 1 MB of data that is not in cache (or that sits in L2 at best for a small part of it, should k vary over a small range), which totals about 16 TB of data for your run. Even on a high-end GPU, that would take about 2 minutes at the minimum, not to mention everything else that could slow it down.
SomeComputationOnVal does not write any data to global memory and has no side effects, so the compiler may decide to optimize the call away if you do not use its output.
Hence cases two and three, which end up doing no calculation, are very fast. Writing 64 MB to GPU memory with coalesced threads is very fast (in the millisecond range).
You can inspect the generated PTX to see whether the code gets optimized out. Use the --keep option of nvcc and look for the .ptx files.
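The figures in this answer can be reproduced with a few lines of host-side arithmetic. The sketch below is plain C++, not CUDA. It assumes that each thread reads every element of the 512 x 512 matrix once at 4 bytes per element, and the bandwidth passed to seconds_at is a hypothetical value you would replace with your own GPU's figure.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kDimX = 924, kDimY = 16, kDimZ = 1120;  // from the launch
constexpr std::uint64_t kBytesPerFloat = 4;

// One thread per output element.
constexpr std::uint64_t kThreads = kDimX * kDimY * kDimZ;

// Size of the array p written by the kernel (roughly 64 MB).
constexpr std::uint64_t kOutputBytes = kThreads * kBytesPerFloat;

// Assumption: each thread reads the full 512 x 512 matrix once (1 MiB).
constexpr std::uint64_t kBytesReadPerThread = 512ull * 512ull * kBytesPerFloat;

// Total traffic if nothing is served from cache (on the order of 16 TB).
constexpr std::uint64_t kTotalBytesRead = kThreads * kBytesReadPerThread;

// Lower bound on runtime for a given sustained memory bandwidth.
double seconds_at(double bytes_per_second) {
    assert(bytes_per_second > 0.0);
    return static_cast<double>(kTotalBytesRead) / bytes_per_second;
}
```

For example, seconds_at(150e9), a made-up 150 GB/s, comes out at a little under two minutes, which is in line with the estimate above.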

Resources