In a related question, How to trap floating-point exceptions on M1 Macs?, someone asked how to make the following code work natively on macOS on an M1 (Apple Silicon) machine:
#include <cmath>       // for std::sqrt()
#include <csignal>     // for signal()
#include <cstdlib>     // for exit()
#include <iostream>
#include <xmmintrin.h> // for _mm_setcsr()

void fpe_signal_handler(int /*signal*/) {
    std::cerr << "Floating point exception!\n";
    std::exit(1);
}

void enable_floating_point_exceptions() {
    _mm_setcsr(_MM_MASK_MASK & ~_MM_MASK_INVALID);
    signal(SIGFPE, fpe_signal_handler);
}

int main() {
    const double x{-1.0};
    std::cout << std::sqrt(x) << "\n";
    enable_floating_point_exceptions();
    std::cout << std::sqrt(x) << "\n";
}
I am looking at this from another angle and want to understand why it doesn't work under Rosetta 2. I compiled it using the following command:
clang++ -g -std=c++17 -arch x86_64 -o fpe fpe.cpp
When I run it, I see the following output:
nan
nan
Mind you, when I do the same thing on an Intel-based Mac, I see the following output:
nan
Floating point exception!
Does anyone know if it is possible to trap floating-point exceptions on Rosetta 2?
Considering the difference in trapping on Intel using:
_mm_setcsr(_MM_MASK_MASK & ~_MM_MASK_INVALID);
and trapping on Apple Silicon using:
fegetenv(&env);
env.__fpcr = env.__fpcr | __fpcr_trap_invalid;
fesetenv(&env);
it seems more likely that this is a bug in the Rosetta 2 implementation.
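For context, here is a minimal sketch of what the native Apple Silicon variant looks like as a complete function, assuming Apple's arm64 <fenv.h> extensions (the __fpcr field and the __fpcr_trap_invalid mask); note that on Apple Silicon the trap reportedly arrives as SIGILL rather than SIGFPE, so the handler is installed accordingly:

#include <cfenv>   // for fenv_t, fegetenv(), fesetenv()
#include <csignal> // for signal()
#include <cstdlib> // for exit()
#include <iostream>

void fpe_signal_handler(int /*signal*/) {
    std::cerr << "Floating point exception!\n";
    std::exit(1);
}

// Sketch for a native arm64 build: unmask the invalid-operation trap
// through the FPCR. __fpcr and __fpcr_trap_invalid are Apple-specific
// extensions in the arm64 <fenv.h>.
void enable_floating_point_exceptions() {
    fenv_t env;
    fegetenv(&env);
    env.__fpcr = env.__fpcr | __fpcr_trap_invalid;
    fesetenv(&env);
    signal(SIGILL, fpe_signal_handler); // the trap surfaces as SIGILL on arm64
}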
I need to create an OpenCL application that instruments the code of the OpenCL kernel it receives as input, for some exotic profiling purposes (I haven't found what I need, so I need/want to do it myself).
I want to compile the kernel to an intermediate representation (LLVM-IR right now), instrument it (using the LLVM C++ bindings), transpile the instrumented code to SPIR-V and then create a kernel in the hostcode with clCreateProgramWithIL().
For now, I am just compiling a simple OpenCL kernel that adds 2 vectors, without instrumentation:
__kernel void vadd(
__global float* a,
__global float* b,
__global float* c,
const unsigned int count)
{
int i = get_global_id(0);
if(i < count) c[i] = a[i] + b[i];
}
For compiling the above to LLVM IR, I use the following command:
clang -c -emit-llvm -O0 -x cl -include libclc/generic/include/clc/clc.h -I libclc/generic/include/ vadd.cl -o vadd.bc
Afterwards, I transpile vadd.bc to vadd.spv with the llvm-spirv tool (here).
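For reference, that translation step is roughly the following invocation (assuming llvm-spirv is on the PATH and was built against the same LLVM version as clang):

llvm-spirv vadd.bc -o vadd.spv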
Finally, I try building a kernel from the C hostcode like this:
...
cl_program program = clCreateProgramWithIL(context, binary_data->data, binary_data->size, &err);
err = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);
...
After running the hostcode, I receive the following error from the clBuildProgram command:
CL_BUILD_PROGRAM_FAILURE
error: undefined reference to `get_global_id()'
error: backend compiler failed build.
It seems that the vadd.spv file is not linked against the OpenCL kernel library. Any idea how to achieve this?
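For anyone debugging the same failure: the full compiler output behind a CL_BUILD_PROGRAM_FAILURE can be retrieved with clGetProgramBuildInfo. A minimal sketch (print_build_log is a hypothetical helper name; the header path may be <OpenCL/opencl.h> on macOS):

#include <CL/cl.h>
#include <iostream>
#include <vector>

// Sketch: dump the device build log after clBuildProgram() fails.
void print_build_log(cl_program program, cl_device_id device_id) {
    size_t log_size = 0;
    clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG,
                          0, NULL, &log_size);
    std::vector<char> log(log_size);
    clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG,
                          log_size, log.data(), NULL);
    std::cerr << log.data() << "\n";
}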
If someone reading this question has a minute or two, could you please test building the following code:
#include <cstdint>
#include <x86intrin.h>
// some compiler feature tests used by makefile
typedef uint8_t vector_8_16 __attribute__ ((vector_size(16)));
static const vector_8_16 key16 = { 7, 6, 5, 4, 3, 2, 1, 0,
15, 14, 13, 12, 11, 10, 9, 8};
int main() {
    vector_8_16 a = key16;
    vector_8_16 b, c;
    b = reinterpret_cast<vector_8_16>(_mm_shuffle_pd(a, a, 1));
    c = _mm_xor_si128(b, a);
    c = _mm_cmpeq_epi8(b, a);
    c = _mm_andnot_si128(c, a);
    return c[2] & 0;
}
with the following invocation:
gcc -std=c++11 -march=corei7-avx -flax-vector-conversions test.cc
At the moment, I have tried GCC 5 from this site: http://hpc.sourceforge.net/, but it just doesn't work:
/var/folders/d8/m9xrbkrs2tj3x6xw_h0nmkn40000gn/T//ccXbpcH7.s:10:no such instruction: `vmovdqa LC0(%rip), %xmm0'
/var/folders/d8/m9xrbkrs2tj3x6xw_h0nmkn40000gn/T//ccXbpcH7.s:11:no such instruction: `vmovaps %xmm0, -16(%rbp)'
/var/folders/d8/m9xrbkrs2tj3x6xw_h0nmkn40000gn/T//ccXbpcH7.s:12:no such instruction: `vmovapd -16(%rbp), %xmm1'
/var/folders/d8/m9xrbkrs2tj3x6xw_h0nmkn40000gn/T//ccXbpcH7.s:13:no such instruction: `vmovapd -16(%rbp), %xmm0'
/var/folders/d8/m9xrbkrs2tj3x6xw_h0nmkn40000gn/T//ccXbpcH7.s:14:no such instruction: `vshufpd $1, %xmm1,%xmm0,%xmm0'
A few years ago, I managed to get GCC 4.7 working after building it from source and replacing the assembler in /usr/bin/as with the one from GCC. But that endeavour took some days, and I'm not sure if it works with the current OS X and Xcode tool versions. I suspect this one has similar problems: either it tries to use the same assembler that came with Xcode and has a misunderstanding with it, or it tries to use its own assembler, which doesn't know about AVX. I'm not sure yet what exactly the problem is; my next hope (before spending a few days hacking it to use a useful assembler) is to try the Homebrew GCC package.
Or, if anyone knows an easy way to bring GCC with AVX to life on Mac OS X, I'm happy to hear about it.
Note: I can already use clang; this question is specifically about GCC.
==EDIT==
After a lot of searching, I found the same issue answered here, with a solution that works for me:
How to use AVX/pclmulqdq on Mac OS X
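For reference, the fix that worked there (if I recall the linked answer correctly) is to make GCC hand the generated assembly to Clang's integrated assembler by passing -Wa,-q, which forwards the -q flag to the Xcode as:

gcc -std=c++11 -march=corei7-avx -flax-vector-conversions -Wa,-q test.cc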
Sorry for another duplicate question.
Nice talking to myself.
Out
Hello, I have to parse some LLVM IR code for a compiler course. I am very new to LLVM.
I have clang and LLVM on my computer, and when I compile a simple C program:
#include <stdio.h>

int main(int argc, char *argv[])
{
    for (int i = 0; i < 10; i++) {
        printf("Stuff!\n");
    }
    return 0;
}
using the command: clang -cc1 test.c -emit-llvm
I get LLVM IR with what I believe are called implicit (unnamed) blocks:
; <label>:4 ; preds = %9, %0
However, my parser also needs to handle LLVM IR with textual labels:
for.cond: ; preds = %for.inc, %entry
My problem is that I do not know how to generate such IR, and I was hoping someone could show me how.
I tried Google and such, but I couldn't find appropriate information. Thanks in advance.
The accepted answer is no longer valid, nor is it a good way to achieve the stated goal.
In case someone stumbles upon this question, like I did, I'm providing the answer.
clang-8 -S -fno-discard-value-names -emit-llvm test.c
Alternatively, use this site with "Show detailed bytecode analysis" checked:
http://ellcc.org/demo/index.cgi
I am using MATLAB in my office and Octave when I am at home. Although they are very similar, I was trying to do something I would have expected to be very easy and obvious, but found it really annoying: I can't find out how to import TIFF images in Octave. I know the MATLAB geotiffread function is not present, but I thought there would be another method.
I could also skip importing them, as I can work with the imread function in some cases, but then the second problem is that I can't find a way to write a georeferenced TIFF file (in MATLAB I normally call geotiffwrite with geotiffinfo inputs). My TIFF files are usually 8-bit unsigned integer or 32-bit signed integer. I hope someone can suggest a way to solve this problem. I also saw this thread but did not understand whether it is possible to use the code proposed by Ashish in Octave.
You may want to look at the mapping package in Octave.
You can also use its raster functions to work with GeoTIFFs.
Example:
pkg load mapping
filename = "C:\\sl\\SDK\\DTED\\n45_w122_1arc_v2.tif";
rasterinfo (filename)
rasterdraw (filename)
The short answer is you can't do it in Octave out of the box. But this is not because it is impossible to do it. It is simply because no one has yet bothered to implement it. As a piece of free software, Octave has the features that its users are willing to spend time or money implementing.
About writing signed 32-bit images
As of version 3.8.1, Octave uses either GraphicsMagick or ImageMagick to handle the reading and writing of images. This introduces some problems. The first is that your precision is limited by how you built GraphicsMagick (its quantum-depth option). In addition, you can only write unsigned integers. Hopefully this will change in the future, but since not many users require it, it has stayed this way until now.
Dealing with GeoTIFF
Provided you know C++, you can write these functions yourself. This shouldn't be too hard since there is already libgeotiff, a C library for it. You would only need to write a wrapper as an Octave oct function (of course, if you don't know C or C++, then this "only" becomes a lot of work).
Here is example oct-file code, which needs to be compiled; I used https://gerasimosmichalitsianos.wordpress.com/2018/01/08/178/ as a reference.
#include <octave/oct.h>
#include <iostream>
#include <string>
#include <cstdio>
#include <cstdlib>
#include "gdal_priv.h"
#include "cpl_conv.h"

using namespace std;

DEFUN_DLD (test1, args, , "write geotiff")
{
    // First argument: the data matrix to write.
    NDArray maindata = args(0).array_value ();
    const dim_vector dims = maindata.dims ();
    int i, j, nrows, ncols;
    nrows = dims(0);
    ncols = dims(1);

    // Second argument: the six-element affine geotransform.
    NDArray transform1 = args(1).array_value ();
    double* transform = (double*) CPLMalloc(sizeof(double) * 6);
    float* rowBuff = (float*) CPLMalloc(sizeof(float) * ncols);

    std::string tiffname = "nameoftiff2.tif";

    cout << "The transformation matrix is ";
    for (i = 0; i < 6; i++)
    {
        transform[i] = transform1(i);
        cout << transform[i] << " ";
    }

    GDALAllRegister();
    CPLPushErrorHandler(CPLQuietErrorHandler);

    // Set up a WGS84 spatial reference and export it as WKT.
    GDALDataset *geotiffDataset;
    GDALDriver *driverGeotiff;
    OGRSpatialReference oSRS;
    char *pszWKT = NULL;
    oSRS.SetWellKnownGeogCS( "WGS84" );
    oSRS.exportToWkt( &pszWKT );

    driverGeotiff = GetGDALDriverManager()->GetDriverByName("GTiff");
    geotiffDataset = (GDALDataset *) driverGeotiff->Create(tiffname.c_str(), ncols, nrows, 1, GDT_Float32, NULL);
    geotiffDataset->SetGeoTransform(transform);
    geotiffDataset->SetProjection(pszWKT);

    cout << "\nNumber of rows and columns in array are:\n";
    cout << nrows << " " << ncols << "\n";

    // Write the data one row at a time.
    for (i = 0; i < nrows; i++)
    {
        for (j = 0; j < ncols; j++)
            rowBuff[j] = maindata(i, j);
        geotiffDataset->GetRasterBand(1)->RasterIO(GF_Write, 0, i, ncols, 1, rowBuff, ncols, 1, GDT_Float32, 0, 0);
    }

    GDALClose(geotiffDataset);
    CPLFree(transform);
    CPLFree(rowBuff);
    CPLFree(pszWKT);
    GDALDestroyDriverManager();

    return octave_value_list();
}
It can be compiled and run as follows:
mkoctfile -lgdal test1.cc
aa=rand(50,53);
b=[60,1,0,40,0,-1];
test1(aa,b);
I am trying to use tbb::parallel_for on a machine with 160 parallel threads (8 Intel E7-8870) and 0.5 TBytes of memory. It is a current Ubuntu system with kernel 3.2.0-35-generic #55-Ubuntu SMP. TBB is from the package libtbb2 Version 4.0+r233-1
Even with a very simple task, I tend to run out of resources, getting either "bad_alloc" or "thread_monitor Resource temporarily unavailable". I boiled it down to this very simple test:
#include <vector>
#include <cstdlib>
#include <cmath>
#include <iostream>

#include "tbb/tbb.h"
#include "tbb/task_scheduler_init.h"

using namespace tbb;

class Worker
{
    std::vector<double>& dst;
public:
    Worker(std::vector<double>& dst)
        : dst(dst)
    {}
    void operator()(const blocked_range<size_t>& r) const
    {
        for (size_t i = r.begin(); i != r.end(); ++i)
            dst[i] = std::sin(i);
    }
};

int main(int argc, char** argv)
{
    unsigned int n = 10000000;
    unsigned int p = task_scheduler_init::default_num_threads();
    std::cout << "Vector length: " << n << std::endl
              << "Processes    : " << p << std::endl;
    const size_t grain_size = n/p;
    std::vector<double> src(n);

    std::cerr << "Starting loop" << std::endl;
    parallel_for(blocked_range<size_t>(0, n, grain_size), Worker(src));
    std::cerr << "Loop finished" << std::endl;
}
Typical output is
Vector length: 10000000
Processes : 160
Starting loop
thread_monitor Resource temporarily unavailable
thread_monitor Resource temporarily unavailable
thread_monitor Resource temporarily unavailable
The errors appear randomly, and more frequently with greater n. The value of 10 million here is a point where they happen quite regularly. Nevertheless, given the machine's characteristics, this should by far not exhaust the memory (I am using the machine alone for these tests).
The grain size was introduced after TBB created too many instances of the Worker, which made it fail for even smaller n.
Can anybody advise on how to set up tbb to handle large numbers of threads?
Summarizing the discussion in the comments as an answer:
The message "thread_monitor Resource temporarily unavailable in pthread_create" basically says that TBB cannot create enough threads; "Resource temporarily unavailable" is what strerror() reports for the error code returned by pthread_create(). One possible reason for this error is insufficient memory to allocate the stack for a new thread. By default, TBB requests 4 MB of stack for a worker thread; this value can be adjusted with a parameter to the tbb::task_scheduler_init constructor if necessary.
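For illustration, a minimal sketch of passing a smaller worker stack size through that constructor parameter (the 1 MB value here is an example, not a recommendation):

#include "tbb/task_scheduler_init.h"

int main()
{
    // Keep the default number of workers, but request a 1 MB stack per
    // worker instead of TBB's 4 MB default; the second constructor
    // argument is the worker thread stack size in bytes.
    tbb::task_scheduler_init init(tbb::task_scheduler_init::automatic,
                                  1024 * 1024);
    // ... run parallel_for() etc. as before ...
    return 0;
}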
In this particular case, as Guido Kanschat reported, the problem was caused by an accidentally set ulimit which limited the memory available to the process.