Why does returning an element of a copied Matrix3d result in incorrect output when using Clang 3.9? - clang

Compiling the following example with -O2 on Clang 3.9 results in the reproFunction returning garbage (1.9038e+185) when called in main:
#include <iostream>
#include <Eigen/Dense>

double reproFunction(const Eigen::Matrix3d& R_in)
{
  const Eigen::Matrix3d R = R_in;
  Eigen::Matrix3d Q = R.cwiseAbs();
  if(R(1,2) < 2) {
    Eigen::Vector3d n{0, 1, R(1, 2)};
    double s2 = R(1,2);
    s2 /= n.norm();
  }
  return R(1, 2);
}

int main() {
  Eigen::Matrix3d R;
  R = Eigen::Matrix3d::Zero(3,3);
  // This fails - reproFunction(R) returns 0
  R(1, 2) = 0.7;
  double R12 = reproFunction(R);
  bool are_they_equal = (R12 == R(1,2));
  std::cout << "R12 == R(1,2): " << are_they_equal << std::endl;
  std::cout << "R12: " << R12 << std::endl;
  std::cout << "R(1, 2): " << R(1, 2) << std::endl;
}
R12 == R(1,2): 0
R12: 1.9036e+185
R(1, 2): 0.7
reproFunction initializes R (which is const) by copying R_in, and it returns R(1, 2). Between the initialization and the return, reproFunction uses R in several operations, but none of them should be able to change R. Removing any of those operations results in reproFunction returning the correct value.
This behavior does not appear in any of the following cases:
The program is compiled with Clang 3.5, Clang 4.0, or g++-5.4.
The optimization level is -O1 or lower.
Eigen 3.2.10 is used instead of Eigen 3.3.3.
Now the question: Is this behavior due to a bug I've missed in the code above, a bug in Eigen 3.3.3, or a bug in Clang 3.9?
A self-contained reproduction example can be found at https://github.com/avalenzu/eigen-clang-weirdness.

I could reproduce this with clang 3.9, but not with clang 3.8. I bisected the issue on Eigen's side to this commit from 2016-05-24 21:54:
Bug 256: enable vectorization with unaligned loads/stores. This concerns all architectures and all sizes. This new behavior can be disabled by defining EIGEN_UNALIGNED_VECTORIZE=0
That commit enables vectorized operations on unaligned data.
I still think this is a bug in clang, but you can work around it by compiling with -DEIGEN_UNALIGNED_VECTORIZE=0.
Also, Eigen could be 'fixed' by automatically disabling this feature if clang 3.9 is detected as the compiler.
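As a rough sketch of what such an automatic guard could look like in user code, the following sets EIGEN_UNALIGNED_VECTORIZE to 0 before Eigen would be included, but only when the compiler reports itself as upstream Clang 3.9. The names BUILDING_WITH_CLANG_3_9, kIsClang39 and unaligned_vectorize_setting are hypothetical helpers made up for this illustration; they are not part of Eigen. Apple's clang uses its own version numbering, so it is excluded here on the assumption that its 3.9 does not correspond to upstream 3.9.

```cpp
// Sketch only: turn off Eigen's unaligned vectorization for upstream Clang 3.9.
// BUILDING_WITH_CLANG_3_9 is a hypothetical helper macro, not an Eigen macro.
#if defined(__clang__) && !defined(__apple_build_version__) && \
    (__clang_major__ == 3) && (__clang_minor__ == 9)
#  define BUILDING_WITH_CLANG_3_9 1
#  ifndef EIGEN_UNALIGNED_VECTORIZE
#    define EIGEN_UNALIGNED_VECTORIZE 0   // same effect as -DEIGEN_UNALIGNED_VECTORIZE=0
#  endif
#else
#  define BUILDING_WITH_CLANG_3_9 0
#endif

// #include <Eigen/Dense>   // Eigen headers would have to come after the guard

constexpr bool kIsClang39 = (BUILDING_WITH_CLANG_3_9 == 1);

// Reports what the guard decided: the macro value, or -1 if it was left undefined.
constexpr int unaligned_vectorize_setting() {
#ifdef EIGEN_UNALIGNED_VECTORIZE
    return EIGEN_UNALIGNED_VECTORIZE;
#else
    return -1;
#endif
}
```

The guard has to appear before the first Eigen include in every translation unit, which is why passing the define on the command line is usually the simpler route.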


triSYCL throws non_cl_error when trisycl::device::~device is called

I'm trying to run a parallel for loop with triSYCL. This is my code:
// standard libraries
#include <iostream>
#include <functional>
#include "CL/sycl.hpp"

struct Color
{
    float r, g, b, a;
    friend std::ostream& operator<<(std::ostream& os, const Color& c)
    {
        os << "(" << c.r << ", " << c.g << ", " << c.b << ", " << c.a << ")";
        return os;
    }
};

struct Vertex
{
    float x, y;
    Color color;
    friend std::ostream& operator<<(std::ostream& os, const Vertex& v)
    {
        os << "x: " << v.x << ", y: " << v.y << ", color: " << v.color;
        return os;
    }
};

template<typename T>
T mapNumber(T x, T a, T b, T c, T d)
{
    return (x - a) / (b - a) * (d - c) + c;
}

int windowWidth = 640;
int windowHeight = 720;

int main()
{
    auto exception_handler = [](cl::sycl::exception_list exceptions) {
        for (std::exception_ptr const& e : exceptions)
        {
            try
            {
                std::rethrow_exception(e);
            } catch (cl::sycl::exception const& e)
            {
                std::cout << "Caught asynchronous SYCL exception: " << e.what() << std::endl;
            }
        }
    };
    cl::sycl::default_selector defaultSelector;
    cl::sycl::context context(defaultSelector, exception_handler);
    cl::sycl::queue queue(context, defaultSelector, exception_handler);
    auto* pixelColors = new Color[windowWidth * windowHeight];
    {
        cl::sycl::buffer<Color, 2> color_buffer(pixelColors, cl::sycl::range<2>{(unsigned long) windowWidth,
                                                                                  (unsigned long) windowHeight});
        cl::sycl::buffer<int, 1> b_windowWidth(&windowWidth, cl::sycl::range<1>{1});
        cl::sycl::buffer<int, 1> b_windowHeight(&windowHeight, cl::sycl::range<1>{1});
        queue.submit([&](cl::sycl::handler& cgh) {
            auto color_buffer_acc = color_buffer.get_access<cl::sycl::access::mode::write>(cgh);
            auto width_buffer_acc = b_windowWidth.get_access<cl::sycl::access::mode::read>(cgh);
            auto height_buffer_acc = b_windowHeight.get_access<cl::sycl::access::mode::read>(cgh);
            cgh.parallel_for<class init_pixelColors>(
                cl::sycl::range<2>((unsigned long) width_buffer_acc[0], (unsigned long) height_buffer_acc[0]),
                [=](cl::sycl::id<2> index) {
                    color_buffer_acc[index[0]][index[1]] = {
                        mapNumber<float>(index[0], 0.f, width_buffer_acc[0], 0.f, 1.f),
                        mapNumber<float>(index[1], 0.f, height_buffer_acc[0], 0.f, 1.f),
                    };
                });
        });
        std::cout << "cl::sycl::queue check - selected device: "
                  << queue.get_device().get_info<cl::sycl::info::device::name>() << std::endl;
    }//here the error appears
    delete[] pixelColors;
    return 0;
}
I'm building it with this CMakeLists.txt file:
cmake_minimum_required(VERSION 3.16.2)
find_package(OpenCL REQUIRED)
set(Boost_INCLUDE_DIR path/to/boost)
add_executable(${PROJECT_NAME} ${SRC_FILES})
set_target_properties(${PROJECT_NAME} PROPERTIES DEBUG_POSTFIX _d)
target_link_libraries(${PROJECT_NAME} ${LIBS})
When I try to run it, I get this message: libc++abi.dylib: terminating with uncaught exception of type trisycl::non_cl_error from path/to/SYCL/include/triSYCL/command_group/detail/task.hpp line: 278 function: trisycl::detail::task::get_kernel, the message was: "Cannot use an OpenCL kernel in this context".
I've tried to create a lambda of mapNumber in the kernel but that didn't make any difference. I've also tried to use this before the end of the scope to catch errors:
} catch (cl::sycl::exception const& e)
std::cout << "Caught synchronous SYCL exception: " << e.what() << std::endl;
but nothing was printed to the console except the error from before. And I've also tried to make an event of the queue.submit call and then call event.wait() before the end of the scope but again the exact same output.
Does anybody have an idea what else I could try?
The problem is that triSYCL is a research project that explores some aspects of SYCL in depth; it does not provide complete, general-purpose SYCL support for end users. I have just clarified this in the README of the project. :-(
Probably the problem here is that the OpenCL SPIR kernel has not been generated.
So you first need to compile the specific (old) Clang & LLVM from triSYCL https://github.com/triSYCL/triSYCL/blob/master/doc/architecture.rst#trisycl-architecture-for-accelerator. But unfortunately there is no simple Clang driver that uses all the specific Clang & LLVM pieces to generate the kernels from the SYCL source. Right now it is done with some awful ad-hoc Makefiles (look around https://github.com/triSYCL/triSYCL/blob/master/tests/Makefile#L360) and, even if you survive this, you might encounter some bugs...
The good news is that there are now several other implementations of SYCL which are much easier to use, more complete, and less buggy! :-) Look at ComputeCpp, DPC++ and hipSYCL for example.

How would one create a bitwise rotation function in dart?

I'm in the process of creating a cryptography package for Dart (https://pub.dev/packages/steel_crypt). Right now, most of what I've done is either exposed from PointyCastle or simple-ish algorithms where bitwise rotations are unnecessary or replaceable by >> and <<.
However, as I move toward complicated cryptography solutions, which I can do mathematically, I'm unsure of how to implement bitwise rotation in Dart with maximum efficiency. Because of the nature of cryptography, the speed part is emphasized and uncompromising, in that I need the absolute fastest implementation.
I've ported a method of bitwise rotation from Java. I'm pretty sure it is correct, but I'm unsure of its efficiency and readability. My tested implementation is below:
int INT_BITS = 64; //Dart ints are 64 bit
static int leftRotate(int n, int d) {
  //In n<<d, last d bits are 0.
  //To put first d bits of n at
  //last, do bitwise-or of n<<d with
  //n >> (INT_BITS - d)
  return (n << d) | (n >> (INT_BITS - d));
}
static int rightRotate(int n, int d) {
  //In n>>d, first d bits are 0.
  //To put last d bits of n at
  //first, we do bitwise-or of n>>d with
  //n << (INT_BITS - d)
  return (n >> d) | (n << (INT_BITS - d));
}
EDIT (for clarity): Dart has no unsigned shift operators, meaning that >> and << are signed shifts, which bears more significance than I might have thought. It poses a challenge that other languages don't in terms of devising an answer. The accepted answer below explains this and also shows the correct method of bitwise rotation.
As pointed out, Dart has no >>> (unsigned right shift) operator, so you have to rely on the signed shift operator.
In that case,
int rotateLeft(int n, int count) {
  const bitCount = 64; // make it 32 for JavaScript compilation.
  assert(count >= 0 && count < bitCount);
  if (count == 0) return n;
  // >> sign-extends negative values, so keep only the low `count` bits of it.
  return (n << count) | ((n >> (bitCount - count)) & ((1 << count) - 1));
}
should work.
This code only works for the native VM. When compiling to JavaScript, numbers are doubles, and bitwise operations are only done on 32-bit numbers.
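To make the sign-extension issue concrete, here is a minimal C++ sketch (C++ rather than Dart, so that it can be checked against a plain unsigned rotation). It assumes a 64-bit two's-complement integer and that >> on a negative signed value is an arithmetic shift, which C++20 guarantees and mainstream compilers have long done. The function names are made up for this illustration. The idea is that the sign-extended right shift is only safe to OR in after everything above the low `count` bits has been masked off.

```cpp
#include <cassert>
#include <cstdint>

// Rotate left using only an arithmetic (sign-extending) right shift,
// mimicking a language without an unsigned shift operator.
std::int64_t rotate_left_arith(std::int64_t n, int count) {
    assert(count >= 0 && count < 64);
    if (count == 0) return n;
    // The left shift is done on the unsigned bit pattern to avoid signed overflow.
    const std::uint64_t high = static_cast<std::uint64_t>(n) << count;
    // Mask with the low `count` bits set.
    const std::uint64_t mask = (std::uint64_t{1} << count) - 1;
    // n >> k copies the sign bit into the top bits; the mask discards those copies.
    const std::uint64_t low = static_cast<std::uint64_t>(n >> (64 - count)) & mask;
    return static_cast<std::int64_t>(high | low);
}

// Reference rotation done entirely on unsigned values (logical shifts).
std::int64_t rotate_left_reference(std::int64_t n, int count) {
    assert(count >= 0 && count < 64);
    if (count == 0) return n;
    const std::uint64_t u = static_cast<std::uint64_t>(n);
    return static_cast<std::int64_t>((u << count) | (u >> (64 - count)));
}
```

Rotating the most negative value left by one bit is a useful check: only the top bit is set, so the result should be 1, and any leftover sign-extension bits would turn it into -1 instead.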

implications of using _mm_shuffle_ps on integer vector

The SSE intrinsics include _mm_shuffle_ps xmm1 xmm2 immx, which allows one to pick 2 elements from xmm1 and concatenate them with 2 elements from xmm2. However, this is for floats (implied by the _ps suffix, packed single). If you cast your packed integers (__m128i), you can use _mm_shuffle_ps on them as well:
#include <iostream>
#include <immintrin.h>
#include <sstream>
using namespace std;

template <typename T>
std::string __m128i_toString(const __m128i var) {
    std::stringstream sstr;
    const T* values = (const T*) &var;
    if (sizeof(T) == 1) {
        for (unsigned int i = 0; i < sizeof(__m128i); i++) {
            sstr << (int) values[i] << " ";
        }
    } else {
        for (unsigned int i = 0; i < sizeof(__m128i) / sizeof(T); i++) {
            sstr << values[i] << " ";
        }
    }
    return sstr.str();
}

int main(){
    cout << "Starting SSE test" << endl;
    cout << "integer shuffle" << endl;
    int A[] = {1, -2147483648, 3, 5};
    int B[] = {4, 6, 7, 8};
    __m128i pC;
    __m128i* pA = (__m128i*) A;
    __m128i* pB = (__m128i*) B;
    *pA = (__m128i)_mm_shuffle_ps((__m128)*pA, (__m128)*pB, _MM_SHUFFLE(3, 2, 1 ,0));
    pC = _mm_add_epi32(*pA,*pB);
    cout << "A[0] = " << A[0] << endl;
    cout << "A[1] = " << A[1] << endl;
    cout << "A[2] = " << A[2] << endl;
    cout << "A[3] = " << A[3] << endl;
    cout << "B[0] = " << B[0] << endl;
    cout << "B[1] = " << B[1] << endl;
    cout << "B[2] = " << B[2] << endl;
    cout << "B[3] = " << B[3] << endl;
    cout << "pA = " << __m128i_toString<int>(*pA) << endl;
    cout << "pC = " << __m128i_toString<int>(pC) << endl;
}
Snippet of relevant corresponding assembly (mac osx, macports gcc 4.8, -march=native on an ivybridge CPU):
vshufps $228, 16(%rsp), %xmm1, %xmm0
vpaddd 16(%rsp), %xmm0, %xmm2
vmovdqa %xmm0, 32(%rsp)
vmovaps %xmm0, (%rsp)
vmovdqa %xmm2, 16(%rsp)
call __ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
Thus it seemingly works fine on integers, which I expected as the registers are agnostic to types, however there must be a reason why the docs say that this instruction is only for floats. Does someone know any downsides, or implications I have missed?
There is no equivalent to _mm_shuffle_ps for integers. To achieve the same effect in this case you can do either of the following:
// SSE2
*pA = _mm_shuffle_epi32(_mm_unpacklo_epi32(*pA, _mm_shuffle_epi32(*pB, 0xe)),0xd8);
// SSE4.1
*pA = _mm_blend_epi16(*pA, *pB, 0xf0);
or change to the floating point domain like this
*pA = _mm_castps_si128(
        _mm_shuffle_ps(_mm_castsi128_ps(*pA),
                       _mm_castsi128_ps(*pB), _MM_SHUFFLE(3, 2, 1 ,0)));
But changing domains may incur bypass latency delays on some CPUs. Keep in mind that according to Agner
The bypass delay is important in long dependency chains where latency is a bottleneck, but
not where it is throughput rather than latency that matters.
You have to test your code and see which method above is more efficient.
Fortunately, on most Intel/AMD CPUs, there is usually no penalty for using shufps between most integer-vector instructions. Agner says:
For example, I found no delay when mixing PADDD and SHUFPS [on Sandybridge].
Nehalem does have 2 cycles of bypass-delay latency to/from SHUFPS, but even then a single SHUFPS is often still faster than multiple other instructions. Extra instructions have latency, too, as well as costing throughput.
The reverse (integer shuffles between FP math instructions) is not as safe:
In Agner Fog's microarchitecture manual, on page 112 in Example 8.3a, he shows that using PSHUFD (_mm_shuffle_epi32) instead of SHUFPS (_mm_shuffle_ps) when in the floating point domain causes a bypass delay of four clock cycles. In Example 8.3b he uses SHUFPS to remove the delay (which works in his example).
On Nehalem there are actually five domains. Nehalem seems to be the most affected (the bypass delays did not exist before Nehalem). On Sandy Bridge the delays are less significant. This is even more true on Haswell. In fact, on Haswell Agner said he found no delays between SHUFPS or PSHUFD (see page 140).
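As a small self-contained check that the integer-only SSE2 route and the cast-based _mm_shuffle_ps route select the same lanes, here is a sketch that uses only SSE2 intrinsics (so no SSE4.1 flag is needed on x86-64). The helper names select_sse2, select_shufps, to_lanes and from_lanes are invented for this example. It says nothing about which route is faster; that still has to be measured on the target CPU.

```cpp
#include <xmmintrin.h>  // _mm_shuffle_ps, _MM_SHUFFLE
#include <emmintrin.h>  // SSE2 integer intrinsics and the cast intrinsics
#include <array>
#include <cstdint>

using lanes = std::array<std::int32_t, 4>;

// Integer-only SSE2 route, producing {a0, a1, b2, b3}.
inline __m128i select_sse2(__m128i a, __m128i b) {
    const __m128i b_hi  = _mm_shuffle_epi32(b, 0xe);     // {b2, b3, b0, b0}
    const __m128i mixed = _mm_unpacklo_epi32(a, b_hi);   // {a0, b2, a1, b3}
    return _mm_shuffle_epi32(mixed, 0xd8);               // {a0, a1, b2, b3}
}

// Float-domain route: reinterpret the bits, shuffle, reinterpret back.
// The casts generate no instructions; the shuffle only moves bits around.
inline __m128i select_shufps(__m128i a, __m128i b) {
    return _mm_castps_si128(
        _mm_shuffle_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b),
                       _MM_SHUFFLE(3, 2, 1, 0)));
}

// Unaligned load/store helpers so plain arrays can be used safely.
inline __m128i from_lanes(const lanes& in) {
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(in.data()));
}

inline lanes to_lanes(__m128i v) {
    lanes out{};
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), v);
    return out;
}
```

With the values from the question, both routes should yield the first two lanes of A followed by the last two lanes of B, including the lane holding the most negative 32-bit integer, whose bit pattern is -0.0f when viewed as a float.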

odeint: How do I log intermediate results while integrating?

I want to know how I can log values other than the states during integration by odeint. I have a simulation of satellite dynamics, which is described by differential equations for the total angular momentum, L, and the momentum of an internal wheel, h. My simulation is running correctly. But I need to log not only the state variables but also some other values such as the external torque, N, and the angular velocity, omega, that is Jinv*L, where Jinv is a 3x3 constant, satellite-inertia matrix. In a sense, the purpose of my simulator is not to calculate L and h, but to generate time-histories of "other" variables.
To show what I'm doing, below is a slightly simplified version of my current code.
class satellite
{
public:
    Eigen::Matrix3d Jinv;

    void operator()( state_type &x , state_type &dxdt , double t )
    {
        L << x[0], x[1], x[2];
        h << x[3], x[4], x[5], x[6];
        N = external_torque(t);
        omega = Jinv * (L-h);
        dLdt = N - omega.cross(L);
        OMEGA = func1(omega(0), omega(1), omega(2));
        dqdt = OMEGA * q * 0.5;
        dxdt[0] = dLdt(0); dxdt[1] = dLdt(1); dxdt[2] = dLdt(2);
        dxdt[3] = dqdt(0); dxdt[4] = dqdt(1); dxdt[5] = dqdt(2); dxdt[6] = dqdt(3);
    }
};

class streaming_observer
{
public:
    std::ostream& os;
    satellite& sat;

    streaming_observer( std::ostream& _os, satellite& _sat ) : os(_os), sat(_sat) { }

    template<class State>
    void operator() (const State& x, double t) const
    {
        L << x[0], x[1], x[2];
        os << t << ' ' << (sat.Jinv*(L)).transpose() << std::endl;
    }
};
You must do the calculation of your intermediate values and the logging in the observer. To avoid redundancy, it might be favourable to put the calculations in a separate function or class method and call this method from the system function (hence the operator() in your example) and from the observer. You can also record values in there and do some later analysis with these values.
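A minimal sketch of that pattern follows. So that it compiles without Boost or Eigen, a hand-rolled explicit Euler loop (integrate_euler) stands in for the odeint integrate call, and a hypothetical damped oscillator stands in for the satellite; all names here are assumptions made for illustration. The point is only the structure: the derived quantity lives in one method, force(), which both the system function and the observer call.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using state_type = std::array<double, 2>;  // [position, velocity]

// Hypothetical system. force() plays the role of the "other" quantity
// (like the torque or omega) that should be logged alongside the state.
struct oscillator {
    double k = 4.0;  // stiffness
    double c = 0.5;  // damping

    // Shared helper: the single place where the derived value is computed.
    double force(const state_type& x, double /*t*/) const {
        return -k * x[0] - c * x[1];
    }

    // System function with the signature odeint expects: sys(x, dxdt, t).
    void operator()(const state_type& x, state_type& dxdt, double t) const {
        dxdt[0] = x[1];
        dxdt[1] = force(x, t);
    }
};

struct sample {
    double t;
    double force;
};

// Observer with the signature odeint expects: obs(x, t).
// It recomputes the derived value through the same helper and records it.
struct logging_observer {
    const oscillator& sys;
    std::vector<sample>& log;

    void operator()(const state_type& x, double t) const {
        log.push_back({t, sys.force(x, t)});
    }
};

// Stand-in for an odeint integrate call: explicit Euler with a fixed step,
// calling the observer on the initial state and after every step.
template <class System, class Observer>
void integrate_euler(System sys, state_type& x, double t0, double dt,
                     int steps, Observer obs) {
    state_type dxdt{};
    double t = t0;
    obs(x, t);
    for (int i = 0; i < steps; ++i) {
        sys(x, dxdt, t);
        for (std::size_t j = 0; j < x.size(); ++j) x[j] += dt * dxdt[j];
        t += dt;
        obs(x, t);
    }
}

// Runs 100 steps from x = (1, 0) and returns the logged time history.
std::vector<sample> run_demo() {
    oscillator sys;
    state_type x{1.0, 0.0};
    std::vector<sample> log;
    integrate_euler(sys, x, 0.0, 0.01, 100, logging_observer{sys, log});
    return log;
}
```

Because the observer only sees the states that the stepper hands to it, recomputing the derived value there keeps the log consistent with the accepted steps rather than with intermediate stages evaluated inside the stepper.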

Warning: Signed shift result (0x1F0000000) requires 34 bits to represent, but 'int' only has 32 bits

After compiling the reMail project with no error, one of the warnings is:
remail-iphone/sqlite3/sqlite3.c:18703:15: Signed shift result
(0x1F0000000) requires 34 bits to represent, but 'int' only has 32
i.e. (0x1f<<28) in the following code:
if (!(a&0x80))
{
  a &= (0x1f<<28)|(0x7f<<14)|(0x7f);
  b &= (0x7f<<14)|(0x7f);
  b = b<<7;
  a |= b;
  s = s>>11;
  *v = ((u64)s)<<32 | a;
  return 7;
}
What's the proper way to kill this warning for iOS (32-bit)?
reMail for iPhone seems to be using an old version of SQLite (3.6.15). If I'm not mistaken, the following commit should fix exactly this problem: http://www.sqlite.org/src/info/587109c81a9cf479?sbs=0
if (!(a&0x80))
{
  /* assert( ((0xFF<<28)|(0x7f<<14)|(0x7f))==0xf01fc07f ); */
  a &= 0xf01fc07f;
  b &= (0x7f<<14)|(0x7f);
  b = b<<7;
  a |= b;
  s = s>>11;
  *v = ((u64)s)<<32 | a;
  return 7;
}
However, there might be other code sections where this problem occurs. The mentioned link shows two instances in util.c, but since sqlite.c is "an amalgamation of many separate C source files from SQLite", you may find additional occurrences.
Maybe reMail would work with a recent version of SQLite, too...
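For reference, the arithmetic behind the replacement constant can be checked with a short sketch. It assumes unsigned int is 32 bits wide, as on iOS; the names kSlotMask and mask_slots are invented here and do not come from SQLite. The plain literal 0x1f has type int, so shifting it left by 28 overflows a signed 32-bit value, which is what the warning reports. With unsigned literals the shift wraps modulo 2^32 and yields the intended mask.

```cpp
#include <cstdint>

// Unsigned literals: 0x1fu << 28 is 0x1F0000000 reduced modulo 2^32, i.e. 0xF0000000,
// so no signed overflow takes place (assuming a 32-bit unsigned int).
constexpr std::uint32_t kSlotMask = (0x1fu << 28) | (0x7fu << 14) | 0x7fu;

// The mask matches the literal used in the fixed SQLite code.
static_assert(kSlotMask == 0xf01fc07fu, "mask should equal 0xf01fc07f");

// Applies the mask the same way the `a &= ...` line does.
constexpr std::uint32_t mask_slots(std::uint32_t a) {
    return a & kSlotMask;
}
```

Either spelling, the precomputed constant or the unsigned shifts, avoids the signed overflow; the SQLite fix shown above uses the precomputed constant.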
