I am using the icpc compiler to see what speed-up I can get over my code as usually compiled with g++.
The processor on which I compile belongs to Intel's Sandy Bridge architecture, so I want to use AVX vectorization.
Someone told me that the "-xhost" flag with icpc could let me benefit from AVX vectorization automatically: is this the case?
If not, could you tell me which flag(s) to pass to icpc to activate AVX?
Last question: could I benefit from AVX2 too? And if so, how?
Thanks
To benefit from AVX2 you need a 4th-generation Intel Core processor, built on the Haswell architecture.
Your CPU supports only AVX. You can instruct the compiler to use it, as you mentioned, with the "-xHost" compilation flag, which tells the compiler to use the highest SIMD instruction set available on the host machine. You can also use the "-mavx" flag.
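For example, assuming a source file called main.cpp (a hypothetical name), the invocation would look something like:
icpc -O3 -xHost main.cpp -o main
or, to request AVX explicitly regardless of the host:
icpc -O3 -mavx main.cpp -o main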
Be aware, though, that if you generate code using AVX, you will only be able to run it on machines that support AVX (Sandy Bridge or later).
To check whether the compiler has generated AVX code, dump the assembly and look for the YMM registers; those are AVX-specific. For more info look here.
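As a rough sketch (file name hypothetical), here is a trivially vectorizable loop together with the commands to inspect its assembly:
/* scale.c - a loop the compiler can typically auto-vectorize.
 * Generate assembly:            icpc -O3 -xHost -S scale.c
 * Look for 256-bit registers:   grep ymm scale.s
 * Any ymm references indicate that AVX code was generated. */
void scale(float *a, const float *b, int n)
{
    for (int i = 0; i < n; ++i)
        a[i] = 2.0f * b[i];
}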
Cheers!
What are the (architectural) differences in implementing a programming language on the GraalVM architecture – in particular between Graal, Truffle, and LLVM using Sulong?
I plan to reimplement an existing statically typed programming language on the GraalVM architecture, so that I can use it from Java without much of a hassle.
There are at the moment three options:
Emit JVM bytecode
Write a Truffle interpreter
Emit LLVM bitcode, use Sulong to run it on GraalVM
Emitting JVM bytecode is the traditional option. You will have to work at the bytecode level, and you'll have to optimise your code yourself before emitting bytecode, as the options the JVM has for optimising it after it has been emitted are limited. To get good performance you may have to use invokedynamic.
Using Truffle is, I'd say, the easy option. You only have to write an AST interpreter, and the code generation is then all done for you. It's also the high-performance option: in all languages where there is both a Truffle version and a bytecode version, the Truffle version comfortably outperforms the bytecode version, as well as being simpler because there is no bytecode-generation stage.
Emitting LLVM bitcode and running it on Sulong is an option, but it's not one I would recommend unless you have other constraints that lead you towards it. Again, you have to do the bitcode generation yourself, and you'll have to optimise before emitting the bitcode, since the optimisations that can be applied after the bitcode has been emitted are limited.
Ruby is good for comparing these options - because there is a version that emits JVM bytecode (JRuby), a version using Truffle (TruffleRuby) and a version that emits LLVM bitcode (Rubinius, but it doesn't then run that bitcode on Sulong). I'd say that TruffleRuby is both faster and simpler in implementation than Rubinius or JRuby. (I work on TruffleRuby.)
I wouldn't worry about the fact that your language is statically typed. Truffle can work with static types, and it can use profiling specialisation to detect types at runtime that are even more fine-grained than those expressed statically.
I have seen the Caffe installation instructions for Mac, but I have a question: if my Mac does not have a GPU, do I have no chance of using a GPU and have to run CPU-only?
Or do I have a chance of using a (virtual!) GPU via the NVIDIA web driver?
Moreover, can I run DIGITS on my Mac? When I try to download it, there is no option for a Mac download; it is only offered for Ubuntu!
I am very confused about these questions! Can you please clear them up for me?
The difference in architecture between CPUs and GPUs does not allow a simple transformation of code written for one into code for the other. GPU drivers are written specifically for the GPU architecture and cannot easily be virtualized. On the other hand, some software supports both; this includes OpenGL and Caffe (http://caffe.berkeleyvision.org/). NVIDIA DIGITS is based on Caffe and can therefore work without a dedicated GPU (here is the thread on how to install it on Macs: https://github.com/NVIDIA/DIGITS/issues/88).
According to https://www.github.com/NVIDIA/DIGITS/issues/251, CUDA cannot be run on computers that do not have a dedicated NVIDIA GPU; but according to "How to run my CUDA application on ATI or Intel card in software mode?", there is a program, gpuocelot, that takes CUDA instructions and can run them on NVIDIA GPUs, AMD GPUs and x86.
In scientific volunteer computing, separate programs are written for different devices; for example, Einstein@Home has four separate programs to search for gravitational waves: CPU, NVIDIA GPU (CUDA), AMD GPU and ARM.
To make DIGITS work you need to
build Caffe with CPU_ONLY and tell DIGITS not to use any GPUs by
running digits-devserver with the --config flag
(https://github.com/NVIDIA/caffe/blob/v0.13.2/Makefile.config.example#L9-L10, https://github.com/NVIDIA/DIGITS/issues/251).
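That is, something along the lines of (the exact path depends on your installation):
./digits-devserver --config
and select no GPUs when prompted.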
Another possibility:
you can still use the --config flag with the web installer. Try this:
./runme.sh --config. Choose "N" to select none.
Also a possibility:
I am trying to answer how you can choose between CPU and GPUs. Within the caffe folder there is a Makefile.config.example file. Copy the contents of this file into a new file and rename it "Makefile.config". If you want to use the CPU, then
1. comment out "USE_CUDNN := 1" in the "Makefile.config" file,
2. uncomment "CPU_ONLY := 1",
3. issue the make all command again within the caffe folder.
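Concretely, after copying the example file, the two relevant switches in "Makefile.config" should end up like this (the names are taken from the stock Makefile.config.example linked above):
# USE_CUDNN := 1
CPU_ONLY := 1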
And if nothing helps, you can run through the procedure twice; that worked for someone at the end of the thread.
I would like to translate x86_64, x86 and ARM executables into LLVM IR (disassembly).
What solution do you suggest?
mcsema is a production-quality binary lifter. It takes x86 and x86-64 and statically "lifts" it to LLVM IR. It's actively maintained, BSD licensed, and has extensive tests and documentation.
https://github.com/trailofbits/mcsema
Consider using the RevGen tool developed within the S2E project. It allows converting x86 binaries to LLVM IR. The source code can be checked out from the RevGen branch of the Git repository available at https://dslabgit.epfl.ch/git/s2e/s2e.git.
Regarding the RevGen tool mentioned by #bsa2000, the recent paper "A compiler level intermediate representation based binary analysis and rewriting system" has pointed out some limitations in S2E and RevNIC.
I quote them here.
shortcoming of dynamic translation:
S2E [16] and Revnic [14] present a method for dynamically translating
x86 to LLVM using QEMU. Unlike our approach, these methods convert
blocks of code to LLVM on the fly which limits the application of LLVM
analyses to only one block at a time.
IR incomplete:
Revnic [14] and RevGen [15] recover an IR by merging the translated
blocks, but the recovered IR is incomplete and is only valid for
current execution; consequently, various whole program analyses will
provide incomplete information.
no abstract stack or promotion of memory locations to symbols:
Further, the translated code retains all the assumptions of the
original binary about the stack layout. They do not provide any
methods for obtaining an abstract stack or promoting memory locations
to symbols, which are essential for the application of several
source-level analyses.
I doubt there will be a universal solution (think about indirect branches, etc.); LLVM IR is much "higher level" than any assembler. It is possible, though, to translate on a per-basic-block basis. You might want to check the llvm-qemu and libcpu projects, among others.
Just posting some references on translating ARM binaries to LLVM IR:
disarm - arm binary to llvm ir disassembler
https://code.google.com/p/disarm/
However, I have not tried it, so I am not sure about its quality and stability. Perhaps someone else can post additional information about this project?
There is a new project, still in its early phases, libbeauty:
https://github.com/jcdutton/libbeauty
Article about the project: Libbeauty: Another Reverse-Engineering Tool, 24 December 2013, Michael Larabel - http://www.phoronix.com/scan.php?page=news_item&px=MTU1MTU
It only supports a subset of x86_64 as input for now. One of the project goals is to be able to compile the generated LLVM IR back to assembly, producing a binary with the same functionality.
I'm looking at some code compiled for iOS in Xcode (so compiled for ARM with gcc) and, as far as I can see, the compiler never uses ARM's feature of allowing arbitrary instructions to have a condition attached to them, but instead always branches on a condition, as would be the case on Intel and other architectures.
Is this simply a restriction of GCC (I can understand that it might be: that "condition = branch" is embedded at too high a level in the compiler architecture to allow otherwise), or is there a particular optimisation flag that needs to be turned on to allow compilation of conditional instructions?
(Obviously I appreciate I'm making big assumptions about where use of conditional instructions "ought" to be used and would actually be an optimisation, but I have experience of programming earlier ARM chips and using and analysing the output of Acorn's original ARM C compiler, so I have a rough idea.)
Update: Having investigated this more thanks to the information below, it turns out that:
Xcode compiles in Thumb-2 mode, in which conditional execution of arbitrary instructions is not available;
Under some circumstances, it does however use the ITE (if-then-else) instruction to effectively produce instructions with conditional execution.
Seeing some actual assembly would make things clear, but I suspect that the default settings for iOS compilation prefer generation of Thumb code instead of ARM for better code density. While there are pseudo-conditional instructions in Thumb32 aka Thumb-2 (supported in ARMv7 architecture via the IT instruction), the original Thumb16 only has conditional branches. Also, even in ARM mode there are some instructions that cannot be conditional (e.g. many NEON instructions use the extended opcode space with condition field set to NV).
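To illustrate the kind of code where this matters, here is a small compare-and-select function (a hypothetical example; the exact output depends on the compiler version and flags):
/* max.c - compile for ARM mode, e.g. with -mno-thumb (or -marm on
 * newer gcc), and inspect the assembly with -S. In ARM mode the
 * compiler may emit a conditional move such as
 *     CMP   r0, r1
 *     MOVLE r0, r1
 * instead of a conditional branch; in Thumb-2 the same move would be
 * wrapped in an IT (if-then) block. */
int max(int a, int b)
{
    return (a > b) ? a : b;
}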
Yes, gcc does not really produce the most optimal code with respect to conditional instructions. It works well in the simplest cases, but real code suffers from pointless slowdowns that can be avoided in hand-coded ARM asm. To give you a rough idea, I was able to get a 2x speedup for a very low-level graphics blit method by writing the read/write and copy logic in ARM asm instead of using the code gcc emitted for the C version. But keep in mind that this optimization is only worth it for the most heavily used parts of your code. It takes a lot of work to write well-optimized ARM asm, so don't even attempt it unless there is a real benefit.
The first thing to keep in mind is that Xcode uses Thumb mode by default, so in order to generate ARM asm you will need to add the -mno-thumb option to the per-file compiler flags for the specific .c file that will contain the ARM asm. Once ARM asm is being emitted, you will want to conditionally compile the asm statements as indicated in the answer to the following question:
ARM asm conditional compilation question
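One common way to do that conditional compilation is to guard the ARM-only code with the compiler's predefined macros, roughly like this (a sketch; check which macros your toolchain actually defines):
#if defined(__arm__) && !defined(__thumb__)
    /* ARM-mode-only inline asm goes here */
#else
    /* portable C fallback */
#endif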
I am developing small command-line utilities using Vala on Win32. Programs compiled using Vala depend on the following DLLs:
libgobject-2.0-0.dll
libgthread-2.0-0.dll
libglib-2.0-0.dll
They are taking up 1500 kbytes of space. Is there a way to reduce the size of these dependencies (besides compressing them with UPX and the like)? I can't imagine a simple hello-world-like app using all the features provided by GLib.
Thanks!
If your Vala source is fairly simple, you may be able to compile it with the posix profile:
valac --profile posix hello.vala
Then your binary will not have any dependencies outside the standard C library. However, the posix profile may still be experimental.