How to programmatically set pragma in RenderScript

I am working on a RenderScript project. In RenderScript, I can relax the floating point precision by #pragma rs_fp_imprecise. However, I do not want low precision in all cases. Is there any way to set the pragma programmatically?

No. The pragma is used (and can only be used) for the entire file. Please also note that you should really not be using rs_fp_imprecise at all. Please use rs_fp_relaxed if you don't need full IEEE-754 conforming behavior. rs_fp_imprecise came into existence to theoretically support low-end GPGPU drivers, but those have not really materialized. All of the existing RS GPGPU drivers will accelerate rs_fp_relaxed code, so that is most likely the appropriate value to use these days (assuming you can tolerate some loss of precision).

Related

x87 FPU and integer arithmetic?

I'm trying to understand using the FPU for 64-bit integer arithmetic. I write this (ATT syntax):
fildq A
fildq B
faddp
fistpq C
The result in C is A + B + 1. If I start with an "finit" instruction, it gives me the correct value A + B. I thought that the unwanted +1 was maybe because it was adding in a carry bit, but using gdb I see no difference at all in the FPU control registers when I use finit from when I don't -- in both cases the control register starts off as 0x27F, the tag register is 0xFFFF (= stack empty), and all the others (including the status register, where all the condition bits are located) are zero.
Using finit seems a bit of a blunt instrument here, and I'm also wondering where the extra +1 is coming from if I don't use it, given that all the FPU registers seem to have the same values in both cases. Can anyone shed any light on this for me?
[…] I see no difference at all in the FPU control registers when I use finit from when I don't -- in both cases the control register starts off as 0x27F […]
Are you sure?
finit is supposed to load 0x37F, one additional bit set in comparison to 0x27F.
The difference is in the precision control field.
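The default value selects extended precision (a 64-bit significand in the 80-bit format), whilst your observed value selects double precision (a 53-bit significand in the 64-bit format).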
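To make the difference between the two control words concrete, here is a minimal C sketch that decodes the precision-control field (bits 8 and 9 of the control word). The helper names are made up for illustration and are not part of any library.

```c
#include <stdint.h>

/* Precision-control (PC) field: bits 8 and 9 of the x87 control word. */
static unsigned x87_pc_field(uint16_t cw)
{
    return (cw >> 8) & 0x3u;
}

/* Significand width in bits selected by the PC field.
   Returns 0 for the reserved encoding (PC == 1). */
static unsigned x87_significand_bits(uint16_t cw)
{
    switch (x87_pc_field(cw)) {
    case 0:  return 24;  /* single precision */
    case 2:  return 53;  /* double precision */
    case 3:  return 64;  /* double extended precision */
    default: return 0;   /* reserved */
    }
}
```

Under these definitions, x87_significand_bits(0x037F) gives 64 and x87_significand_bits(0x027F) gives 53.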
The result in C is A + B + 1. […]
Using finit seems a bit of a blunt instrument here, and I'm also wondering where the extra +1 is coming from if I don't use it, […]
With sufficiently large A and B you’re likely seeing a loss in precision from fadd.
Unmasking the precision exception will confirm this.
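The rounding effect can be reproduced without touching the FPU control word. The sketch below is only a model: it assumes the default round-to-nearest-even mode and uses a C double (53-bit significand) to stand in for an x87 register restricted to double precision. The function name is hypothetical, and the inputs must themselves be exactly representable as doubles for the model to match what fild followed by faddp would do.

```c
#include <stdint.h>

/* Add two 64-bit integers through a 53-bit significand.
   The volatile qualifiers force every intermediate value to be
   stored as a double, so the sum is rounded to 53 bits. */
static int64_t add_via_double(int64_t a, int64_t b)
{
    volatile double x = (double)a;
    volatile double y = (double)b;
    volatile double sum = x + y;
    return (int64_t)sum;
}
```

With A = 2^53 + 2 and B = 1, the exact sum 2^53 + 3 lies halfway between two representable values and rounds to 2^53 + 4, which is A + B + 1.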
I think you were using the inline assembly capabilities of your favorite compiler.
This is certainly convenient if you don’t wanna bother about menial tasks, yet apparently your compiler’s run-time system loads 0x27F at startup for compatibility considerations.
Study its manual (and possibly source code) for details.

How to use LogiCORE DSP48 Macro?

I want to learn how to use LogiCORE DSP48 Macro. I'm reading the Xilinx documentation but I cannot understand well how to start my first design with DSP48 Macro. Can anyone help me to make a simple design to get a better understanding of this IP core please?
Thanks in advance!
In many cases you would use DSP48 by writing Verilog/VHDL expressions containing add, subtract, and multiply.
x = a * b + c
A problem with the above expression is that the multiplication and addition take place in a single cycle. You can run the expression at a higher frequency if the operation is pipelined. Vivado can sometimes retime these expressions across registers in order to make use of the DSP48 pipeline registers.
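As a rough software model of what pipelining means here, the following C sketch splits the multiply-add into two registered stages. It is not a description of the DSP48 itself: the struct, its field names, and the function name are invented for illustration, and plain 64-bit integers stand in for the hardware bit widths.

```c
#include <stdint.h>

/* Registers of a two-stage multiply-add pipeline. */
typedef struct {
    int64_t m_reg;  /* stage 1: registered product a * b */
    int64_t c_reg;  /* stage 1: c, delayed to stay aligned with the product */
    int64_t p_reg;  /* stage 2: registered sum */
} mac_pipe;

/* Model one clock edge: every register is loaded from values
   that were computed before this edge. Returns the new output. */
static int64_t mac_pipe_clock(mac_pipe *s, int32_t a, int32_t b, int64_t c)
{
    int64_t next_p = s->m_reg + s->c_reg;  /* uses last cycle's stage 1 */
    s->m_reg = (int64_t)a * b;
    s->c_reg = c;
    s->p_reg = next_p;
    return s->p_reg;
}
```

Each stage now contains only a multiply or only an add, which is why the clock can run faster. The cost is latency: the result for a given input is returned by the following call rather than the same one.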
However, I understand wanting to use the DSP48 directly. You instantiate DSP48's just like other RTL modules. The ports, parameters, and behaviors are described in the DSP Slice User Guide for the FPGA logic that you are using.
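wire [47:0] c;
wire [24:0] a;   // the multiplier uses the low 25 bits of the A port
wire [17:0] b;
wire [47:0] x;   // result taken from the P output
// Port names are upper case in the Xilinx primitive, and Verilog is case-sensitive.
// Clock, clock-enable, and reset connections are omitted here for brevity.
DSP48E1 #() dsp (
.A(a),
.B(b),
.C(c),
.P(x),
// OPMODE 7'b0000101: the X and Y muxes select the multiplier output and Z is zero,
// so C is not added with this setting (7'b0110101 would select C on the Z mux).
.OPMODE(7'd5),
.ALUMODE(4'd0)   // the ALU adds its inputs
);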
This instance is copied from one of my inner-product implementations. It is fully pipelined because I was aiming for 500 MHz operation. I only achieved 400 MHz because of other combinational paths.
For Xilinx 7 Series:
DSP48E1 Slice User Guide
For Xilinx Ultrascale:
DSP48E2 Slice User Guide

Fast way to swap endianness using opencl

I'm reading and writing lots of FITS and DNG images which may contain data of an endianness different from my platform and/or opencl device.
Currently I swap the byte order in the host's memory if necessary which is very slow and requires an extra step.
Is there a fast way to pass a buffer of int/float/short values with the wrong endianness to an OpenCL kernel?
Using an extra kernel run just for fixing the endianness would be OK; using some overhead-free auto-fixing read/write operation would be perfect.
I know about the variable attributes __attribute__((endian(host))) and __attribute__((endian(device))), but they don't help with a big-endian FITS file on a little-endian platform using a little-endian device.
I thought about a solution like this one (neither implemented nor tested, yet):
// shuffle requires the mask's element size to match the data's element size
uchar4 mask = (uchar4)(3, 2, 1, 0);
uchar4 swappedEndianness = shuffle(originalEndianness, mask);
// to be applied on a float/int-buffer somehow
Hoping there's a better solution out there.
Thanks in advance,
runtimeterror
Sure. Since you have a uchar4 - you can simply swizzle the components and write them back.
output[tid] = input[tid].wzyx;
Swizzling is also very cheap on SIMD architectures, so you should be able to combine it with other operations in your kernel.
Hope this helps!
Most processor architectures perform best when an operation uses instructions that match the register width, for example 32 or 64 bits. When a CPU or GPU performs byte-wise operations, such as component subscripts like .wzyx on a uchar4, it needs to use a mask to extract each byte from the integer, shift the byte, and then combine it into the result with an integer add or OR. For an endianness swap, the processor needs to perform this AND, shift, and add/OR sequence four times because there are four bytes.
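Written out in plain C, the byte-wise sequence described above looks roughly like this. It is only an illustration of the four mask/shift/OR steps, with a made-up function name, and not a claim about what any particular compiler emits for a swizzle.

```c
#include <stdint.h>

/* Byte-wise swap: isolate each byte with a mask, shift it to its
   mirrored position, then OR the four pieces together. */
static uint32_t swap_bytewise(uint32_t n)
{
    uint32_t b0 = (n & 0x000000FFu) << 24;  /* lowest byte  -> highest */
    uint32_t b1 = (n & 0x0000FF00u) << 8;
    uint32_t b2 = (n & 0x00FF0000u) >> 8;
    uint32_t b3 = (n & 0xFF000000u) >> 24;  /* highest byte -> lowest  */
    return b0 | b1 | b2 | b3;
}
```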
The most efficient way is as follows
#define EndianSwap(n) (rotate((n) & 0x00FF00FF, 24U) | (rotate((n), 8U) & 0x00FF00FF))
n can be any gentype with 32-bit integer elements, for example a uint4 variable. Because OpenCL C does not allow C++-style function overloading, a macro is the best choice.
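The same rotate-and-mask idea can be checked on the host in plain C. This sketch assumes 32-bit elements; rotl32 is a hypothetical stand-in for OpenCL's built-in rotate, and the other names are likewise invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate left by s bits (stand-in for OpenCL's rotate on a 32-bit value). */
static uint32_t rotl32(uint32_t v, unsigned s)
{
    s &= 31u;
    return s ? (v << s) | (v >> (32u - s)) : v;
}

/* Two rotates and two masks move all four bytes:
   bytes 0 and 2 are handled by the first term, bytes 1 and 3 by the second. */
static uint32_t endian_swap32(uint32_t n)
{
    return rotl32(n & 0x00FF00FFu, 24u) | (rotl32(n, 8u) & 0x00FF00FFu);
}

/* Swap every element of a buffer in place. */
static void endian_swap_buffer(uint32_t *buf, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        buf[i] = endian_swap32(buf[i]);
}
```

For example, endian_swap32(0xAABBCCDD) yields 0xDDCCBBAA, and applying the swap twice restores the original value.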

False autovectorization in Intel C compiler (icc)

I need to vectorize some huge loops in a program with SSE. In order to save time I decided to let ICC deal with it. For that purpose, I prepare the data properly, taking alignment into account, and I make use of the compiler directives #pragma simd, #pragma aligned, #pragma ivdep. When compiling with the various -vec-report options, the compiler tells me that the loops were vectorized. A quick look at the assembly generated by the compiler seems to confirm that, since it contains plenty of vector instructions that work with packed single-precision operands (all operations in the serial code handle float operands).
The problem is that when I read hardware counters with PAPI, the number of FP operations I get (PAPI_FP_INS and PAPI_FP_OPS) is pretty much the same in the auto-vectorized code and the original one, when one would expect it to be significantly lower in the auto-vectorized code. What's more, I vectorized by hand a simplified version of the problem in question, and in that case I do get roughly 3 times fewer FP operations.
Has anyone experienced something similar with this?
Spills may destroy the advantage of vectorization, thus 64-bit mode may gain significantly over 32-bit mode. Also, icc may version a loop and you may be hitting a scalar version even though there is a vector version present. icc versions issued in the last year or 2 have fixed some problems in this area.
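To illustrate what "versioning a loop" means, here is a hand-written C sketch of the idea: one entry point, a runtime check, and two loop bodies. It is an assumption-laden analogy rather than what icc actually generates; the function names, the overlap test, and the returned tag are all invented so that a caller can see which version ran.

```c
#include <stddef.h>
#include <stdint.h>

enum loop_version { SCALAR_VERSION = 0, VECTOR_VERSION = 1 };

/* Runtime check: do the two n-element float ranges overlap in memory? */
static int ranges_overlap(const float *a, const float *b, size_t n)
{
    uintptr_t pa = (uintptr_t)a, pb = (uintptr_t)b;
    uintptr_t bytes = (uintptr_t)(n * sizeof(float));
    return pa < pb + bytes && pb < pa + bytes;
}

/* dst[i] += k * src[i], with two versions of the loop selected at run time. */
static enum loop_version scale_add(float *dst, const float *src, float k, size_t n)
{
    if (n >= 4 && !ranges_overlap(dst, src, n)) {
        size_t i = 0;
        /* "Vector" version: four independent elements per iteration. */
        for (; i + 4 <= n; i += 4) {
            dst[i]     += k * src[i];
            dst[i + 1] += k * src[i + 1];
            dst[i + 2] += k * src[i + 2];
            dst[i + 3] += k * src[i + 3];
        }
        for (; i < n; ++i)  /* remainder elements */
            dst[i] += k * src[i];
        return VECTOR_VERSION;
    }
    /* Scalar fallback: taken when the check fails, even though
       the vector version exists in the binary. */
    for (size_t i = 0; i < n; ++i)
        dst[i] += k * src[i];
    return SCALAR_VERSION;
}
```

The point for the counter measurements is that the presence of vector instructions in the assembly does not prove they execute: if the runtime check fails for the real data, only the scalar fallback runs.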

What's the deal with 17- and 40-bit math in TI DSPs?

The TMS320C55x has a 17-bit MAC unit and a 40-bit accumulator. Why the non-power-of-2-width units?
The 40-bit accumulator is common in a few TI DSPs. The idea is basically that you can accumulate up to 256 arbitrary 32-bit products without overflow. (vs. in C where if you take a 32-bit product, you can overflow fairly quickly unless you resort to using 64-bit integers.)
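The figure of 256 follows from the 8 guard bits above a 32-bit value (2^8 = 256). A small C sketch can check the bound, using int64_t as a stand-in for the hardware accumulator; the macro and function names are made up for illustration.

```c
#include <stdint.h>

/* Range of a signed 40-bit accumulator. */
#define ACC40_MAX INT64_C(0x7FFFFFFFFF)   /* 2^39 - 1 */
#define ACC40_MIN (-ACC40_MAX - 1)        /* -2^39    */

/* Does v fit in a signed 40-bit accumulator? */
static int fits_in_acc40(int64_t v)
{
    return v >= ACC40_MIN && v <= ACC40_MAX;
}

/* Add the same 32-bit term count times, as a worst-case accumulation. */
static int64_t accumulate(int32_t term, unsigned count)
{
    int64_t acc = 0;
    for (unsigned i = 0; i < count; ++i)
        acc += term;
    return acc;
}
```

Accumulating 256 copies of the largest or smallest 32-bit value still fits in 40 bits, while the 257th copy of the largest value does not. By contrast, a 32-bit accumulator would already overflow on the second such term.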
The only way you access these features is by assembly code or special compiler intrinsics. If you use regular C/C++ code, the accumulator is invisible. You can't get a pointer to it.
So there's not any real need to adhere to a power-of-2 scheme. DSP cores have been fairly optimized for power/performance tradeoffs.
I may be talking through my hat here, but I'd expect to see the 17-bit stuff used to avoid the need for a separate carry bit when adding/subtracting 16-bit samples.
