How to profile code in hexagon dsp simulator - clang

I have been trying to compile my code with -pg to enable profiling in the simulator, but as soon as I add that flag I get linker errors.
Compilation command
hexagon-clang++ main.cpp -o hello -mv62 -pg
Error
hexagon-clang++ main.cpp -o hello -mv62 -pg
Error: /tmp/main-924ac3.o(.text+0x30): undefined reference to `mcount'
Error: /tmp/main-924ac3.o(.text+0x130): undefined reference to `mcount'
Fatal: Linking had errors.
This is my first time writing code for a DSP chip, specifically the Hexagon 682. Are there any tutorials or references other than the programmer's reference manual? It hasn't been very useful in helping me understand how things work. In particular, I don't understand how SIMD programming works, and I am not sure what the size of the SIMD registers is. It also seems that using floating point on DSP chips is not a great idea, so would it be better to convert my code to fixed point?

You can use hexagon-sim to generate the profiling data without rebuilding instrumented binaries.
hexagon-sim --profile ./hello will generate the gmon input file(s) necessary for hexagon-gprof to consume.
e.g. (taken from SDK 3.3.3 Examples/)
hexagon-clang -O2 -g -mv5 -c -o mandelbrot.o mandelbrot.c
hexagon-clang -O2 -g -mv5 mandelbrot.o -o mandelbrot -lhexagon
hexagon-sim -mv5 --timing --profile mandelbrot
hexagon-gprof mandelbrot gmon.t*
Note also that the SDK comes with hexagon-profiler, a richer tool that allows you to see in depth performance counters -- information beyond just which code was executed and how often.
See "Hexagon Profiler User Guide" (doc number 80-N2040-10 A) for details.
"Are there any tutorials or references other than the programmer reference manual because they haven't been very useful in helping me understand how things work. Specially I don't understand how SIMD programming works. I am not sure what's the size of SIMD registers."
Hexagon's vector programming extension is called "HVX" (Hexagon Vector eXtensions). There's an HVX-specific PRM available at https://developer.qualcomm.com/software/hexagon-dsp-sdk/tools. It describes the two vector modes, in which each vector register is either 512 bits (64 bytes) or 1024 bits (128 bytes) wide.
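On the fixed-point part of the question, which the answer above does not cover: as a rough illustration of what converting to fixed point involves, here is a minimal Q15 sketch in plain C. The names (q15_t, q15_from_double, q15_to_double, q15_mul) are hypothetical helpers made up for this example, not part of the Hexagon SDK. Whether this beats floating point on a given Hexagon version is something to measure with the profiler rather than assume.

```c
#include <assert.h>
#include <stdint.h>

/* Q15: a signed 16-bit integer q stands for the real value q / 32768,
   so the representable range is [-1.0, 1.0). (Hypothetical helper type.) */
typedef int16_t q15_t;

/* Clamp a 32-bit intermediate into the int16_t range. */
q15_t q15_saturate(int32_t x) {
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (q15_t)x;
}

/* Convert a real number to Q15, rounding to nearest and saturating. */
q15_t q15_from_double(double v) {
    assert(v == v); /* reject NaN: it has no fixed-point equivalent */
    double scaled = v * 32768.0;
    scaled += (scaled >= 0.0) ? 0.5 : -0.5;
    /* Clamp before the cast so the double -> int32_t conversion cannot overflow. */
    if (scaled > 32767.0) scaled = 32767.0;
    if (scaled < -32768.0) scaled = -32768.0;
    return (q15_t)(int32_t)scaled;
}

/* Convert Q15 back to a real number (exact, since 32768 is a power of two). */
double q15_to_double(q15_t q) {
    return (double)q / 32768.0;
}

/* Multiply two Q15 values. The 16x16 product is a Q30 number held in 32 bits;
   add half an output LSB to round, then shift back down to Q15.
   Assumes >> on a negative int32_t is an arithmetic shift, which mainstream
   compilers provide but the C standard leaves implementation-defined. */
q15_t q15_mul(q15_t a, q15_t b) {
    int32_t prod = (int32_t)a * (int32_t)b;
    prod += 1 << 14;
    return q15_saturate(prod >> 15);
}
```

For example, q15_mul(q15_from_double(0.5), q15_from_double(0.5)) yields the Q15 encoding of 0.25. The only case that needs saturation in the multiply is -1.0 times -1.0, whose true result (+1.0) is just outside the Q15 range.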

Related

warnings when trying to statically link cross compiled Fortran 90 code to run on Aarch64-linux - one being "relocation truncated to fit"

I am able to cross compile some Fortran 90 code (large block written by someone else so do not want to convert it) using x86_64 GNU/Linux as the build system and aarch64-linux as the host system and using dynamic linking. However, I want to generate a statically linked binary so added -static to the mpif90 call. When I do this, I get this warning:
/home/me/CROSS-REPOS/glibc-2.35/math/../sysdeps/ieee754/dbl-64/e_log.c:106: warning: too many GOT entries for -fpic, please recompile with -fPIC
When I add this flag as in "mpif90 -static -fPIC" the same warning appears. I also tried the -mcmodel=large option as in "mpif90 -static -mcmodel=large" to no avail.
Then I checked the options for "/home/me/CROSS-JUL2022/lib/gcc/aarch64-linux/12.1.0/../../../../aarch64-linux/bin/ld" and saw this one: --long-plt (to generate long PLT entries and to handle large .plt/.got displacements). But trying "mpif90 -static -Wl,--long-plt" says --long-plt is not an option. How do I invoke this --long-plt option then?
One other thing: I know static linking will make the binaries a fair amount bigger, but I do not want to carry libs over to the Android device. Furthermore, some reading indicates that dynamic linking on the Android device could lead to security issues. Thanks for any suggestions.

Is it possible to create LLVM Pass for OpenCL Kernel?

I would like to create an LLVM Pass to optimize OpenCL kernel for NVIDIA Cards. I wonder if it is possible.
I have tried the following:
clang -Xclang -load -Xclang lib/simplePass.so main.c
That did not work; the pass could not alter the kernel code.
Compiling separately, then linking.
That did not work either; it gave me an error that get_global_id is undefined.
Using an offline compiler, then clCreateProgramWithBinary.
I followed Apple's example. It works with an Intel GPU, but I was not able to use an LLVM pass; when I tried, I got this error:
LLVM ERROR: Sized aggregate specification in datalayout string
When I tried to adapt it to Xubuntu, it did not work.
Is there any other method I can try? I know I could use SPIR-V IR, but Nvidia does not currently support OpenCL 2.2.
Thank you for your time.

LLVM IR of OpenCL kernel to PTX to binary

I am using clang to generate LLVM IR for Nvidia OpenCL and CUDA kernels, which I want to subsequently instrument, doing something like this for OpenCL:
clang -c -x cl -S -emit-llvm -cl-std=CL2.0 kernel.cl -o kernel.ll
and what's described here for CUDA.
What I am looking for is a way to go from the instrumented IR to an actual binary. For CUDA, I know I can use the NVPTX backend to generate PTX and JIT compile as described here (or perhaps use ptxas?). I was wondering if something similar is also possible for OpenCL, and if so, I would appreciate a minimal example. Thanks in advance.
You can in principle extract binaries for loaded and compiled OpenCL kernels by using clGetProgramInfo() with CL_PROGRAM_BINARY_SIZES and CL_PROGRAM_BINARIES.
As far as I'm aware, this produces binaries in an entirely implementation-defined format, so if you're unlucky you just get IR back anyway. With any luck, though, it will contain PTX on your platform.

Faster "release" build from Xcode?

I am relatively new to Xcode. We are testing an app that displays incoming data and it needs to be as fast as possible. With other platforms I need to change from "debug" to "release" in order for optimizations to kick in and debug code to be removed, which can have a profound effect on speed. What are the equivalent things I need to do in Xcode to build in fast/release mode?
(I am googling this and see lots of hits that seem to be in the general vicinity but I might be a little thrown off by the terminology, I might need it dumbed down a bit :))
Thanks for the help.
The first step is to set the Optimization Level for release as described above. There are lots of options here. From the clang LLVM compiler man page (man cc) -- (note that -Os is the default for Release):
Code Generation Options
    -O0 -O1 -O2 -O3 -Ofast -Os -Oz -O -O4
        Specify which optimization level to use:
        -O0     Means "no optimization": this level compiles the fastest and generates the most debuggable code.
        -O1     Somewhere between -O0 and -O2.
        -O2     Moderate level of optimization which enables most optimizations.
        -O3     Like -O2, except that it enables optimizations that take longer to perform or that may generate larger code (in an attempt to make the program run faster).
        -Ofast  Enables all the optimizations from -O3 along with other aggressive optimizations that may violate strict compliance with language standards.
        -Os     Like -O2 with extra optimizations to reduce code size.
        -Oz     Like -Os (and thus -O2), but reduces code size further.
        -O      Equivalent to -O2.
        -O4 and higher
                Currently equivalent to -O3
You will notice the 'Ofast' option -- very fast, somewhat risky.
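To give a feel for what "somewhat risky" can mean, here is a small sketch of code that depends on strict floating-point semantics: compensated (Kahan) summation. This is a commonly cited example rather than anything from the man page. The function names are made up for illustration, and whether a particular compiler version actually breaks it under -Ofast is an assumption you would need to check on your own build.

```c
#include <assert.h>
#include <stddef.h>

/* Plain summation: small terms can be lost entirely when added to a large total. */
double naive_sum(const double *data, size_t n) {
    assert(data != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}

/* Kahan summation: c tracks the rounding error of each addition and feeds
   it back into the next one. Algebraically c is always zero, so an optimizer
   that is allowed to reassociate floating-point math (as the fast-math mode
   implied by -Ofast is) may legally simplify the compensation away. */
double kahan_sum(const double *data, size_t n) {
    assert(data != NULL || n == 0);
    double sum = 0.0;
    double c = 0.0;                 /* running compensation term */
    for (size_t i = 0; i < n; ++i) {
        double y = data[i] - c;     /* apply the previous correction */
        double t = sum + y;         /* low-order bits of y may be lost here */
        c = (t - sum) - y;          /* recover what was lost */
        sum = t;
    }
    return sum;
}
```

With strict IEEE semantics, summing 1.0 followed by several values of 1e-16 leaves naive_sum at exactly 1.0, while kahan_sum ends up slightly above 1.0. If your code relies on this kind of behavior, compare results between -O3 and -Ofast before shipping.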
A second step is to consider whether to enable "Unroll Loops". I've read that this can in some code lead to a 15% speed increase (at the expense of debugging, but not an issue for Release builds).
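To make the "Unroll Loops" setting concrete, here is a hand-written sketch of roughly what unrolling by four looks like. The function names are hypothetical, and the compiler's actual output will differ. The sketch does not demonstrate the 15% figure; measure on your own code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Straightforward loop: one add and one loop-condition check per element. */
int64_t sum_rolled(const int32_t *data, size_t n) {
    assert(data != NULL || n == 0);
    int64_t total = 0;
    for (size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}

/* The same computation unrolled by four: one loop-condition check per four
   elements. Using four independent accumulators goes a step beyond plain
   unrolling, since it also removes the dependency between consecutive adds. */
int64_t sum_unrolled4(const int32_t *data, size_t n) {
    assert(data != NULL || n == 0);
    int64_t t0 = 0, t1 = 0, t2 = 0, t3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        t0 += data[i];
        t1 += data[i + 1];
        t2 += data[i + 2];
        t3 += data[i + 3];
    }
    int64_t total = t0 + t1 + t2 + t3;
    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

Both functions return the same result for any input. The unrolled form trades larger code for fewer branches, which is why the setting helps some loops and not others.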
Next, consider whether you want to Build and use an Optimization Profile. See Apple for details, but the gist is that:
Profile Guided Optimization (PGO) is a means to improve compiler
optimization of an app. PGO utilizes a specially instrumented build of
the app to generate profile information about the most commonly used
code paths and methods. The compiler then uses this profile
information to focus optimization efforts on the most frequently used
code, taking advantage of the extra information about how the program
typically behaves to do a better job of optimization.
You define the profile and whether you use it under Build Settings -> Apple LLVM 6.0 - Code Generation -> Use Optimization Profile.
First have a look at this part in Xcode (screenshot of Xcode 5 but same on Xcode 6)
You should also prefer PNG to JPEG (JPEG requires more computation, although the files are generally smaller, which is better for network transfer).
Finally, use multi-threading.
Those are, in my humble opinion, the first steps to look at.
Edit the scheme (Product > Scheme > Edit Scheme...) and set the Build Configuration of the Run action to Release.

LLVM jit and native

I don't understand how the LLVM JIT relates to normal non-JIT compilation, and the documentation isn't good.
For example, suppose I use the clang front end:
Case 1: I compile a C file to native code with clang/LLVM. As I understand it, this flow is like the gcc flow: I get my x86 executable and that runs.
Case 2: I compile into some kind of LLVM IR that runs on the LLVM JIT. In this case, does the executable contain the LLVM runtime to execute the IR on the JIT, or how does it work?
What is the difference between these two, and are they correct? Does the LLVM flow include support for both JIT and non-JIT? When do I want to use JIT, and does it make sense at all for a language like C?
You have to understand that LLVM is a library that helps you build compilers. Clang is merely a frontend for this library.
Clang translates C/C++ code into LLVM IR and hands it over to LLVM, which compiles it into native code.
LLVM is also able to generate native code directly in memory, which can then be called like a normal function. So cases 1 and 2 share LLVM's optimization and code generation.
So how does one use LLVM as a JIT compiler? You build an application which generates some LLVM IR (in memory), then use the LLVM library to generate native code (still in memory). LLVM hands you back a pointer which you can call afterwards. No clang involved.
You can, however, use clang to translate some C code into LLVM IR and load this into your JIT context to use the functions.
Real World examples:
Unladen Swallow Python VM
Rubinius Ruby VM
There is also the Kaleidoscope tutorial which shows how to implement a simple language with JIT compiler.
First, you get LLVM bitcode (the binary form of LLVM IR):
clang -emit-llvm -c -o test.bc test.c
Second, you use the LLVM JIT:
lli test.bc
That runs the program.
Then, if you wish to get native code, you use the LLVM backend, which writes assembly to test.s:
llc test.bc
Finally, assemble that output:
as test.s
The following steps to compile and run JIT'ed code are taken from a message on the LLVM mailing list:
[LLVMdev] MCJIT and Kaleidoscope Tutorial
Header file:
// foo.h
extern void foo(void);
and the function for a simple foo() function:
// foo.c
#include <stdio.h>
void foo(void) {
    puts("Hello, I'm a shared library");
}
And the main function:
// main.c
#include <stdio.h>
#include "foo.h"
int main(void) {
    puts("This is a shared library test...");
    foo();
    return 0;
}
Build the shared library using foo.c:
gcc foo.c -shared -o libfoo.so -fPIC
Generate the LLVM bitcode for the main.c file:
clang -Wall -c -emit-llvm -O3 main.c -o main.bc
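And run the LLVM bitcode through the JIT (and MCJIT) to get the desired output. Note that -use-mcjit is a flag of older lli versions; later releases dropped the old JIT and select the engine with -jit-kind instead: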
lli -load=./libfoo.so main.bc
lli -use-mcjit -load=./libfoo.so main.bc
You can also pipe the clang output into lli:
clang -Wall -c -emit-llvm -O3 main.c -o - | lli -load=./libfoo.so
Output
This is a shared library test...
Hello, I'm a shared library
Source obtained from
Shared libraries with GCC on Linux
Most compilers have a front end, some middle code/structure of some sort, and a back end. When you take your C program and compile it with clang so that you end up with a non-JIT x86 program that you can just run, you have still gone from front end to middle to back end. The same goes for gcc: it goes from a front end to a middle thing to a back end. GCC's middle thing is not wide open and usable as-is like LLVM's.
Now one thing that is fun/interesting about LLVM, that you cannot do with others, or at least gcc, is that you can take all of your source code modules, compile them to LLVM's bitcode, merge them into one big bitcode file, and then optimize the whole thing. Instead of the per-file or per-function optimization you get with other compilers, with LLVM you can get any level of partial to complete program optimization you like. Then you can take that bitcode and use llc to export it to the target's assembly. I normally do embedded work, so I have my own startup code that I wrap around that, but in theory you should be able to take that assembly file, compile and link it with gcc, and run it: gcc myfile.s -o myfile. I imagine there is a way to get the LLVM tools to do this without binutils or gcc, but I have not taken the time to find out.
I like LLVM because it is always a cross compiler; unlike gcc, you don't have to build a new one for each target and deal with the nuances of each target. I don't know that I have any use for the JIT thing; what I am saying is that I use it as a cross compiler and as a native compiler.
So your first case is the front, middle, and end, with the process hidden from you: you start with source and get a binary, done. The second case, if I understand it right, is the front and the middle, stopping with some file that represents the middle. Then the middle-to-end step (for the specific target processor) can happen just in time at runtime. The difference is the back end: the runtime execution of the middle language in case two is likely different from the back end of case one.

Resources