690 MB memory overhead for OpenMP program compiled with ifort

I was running some tests using openmp and fortran and came to realize that a binary compiled with ifort 15 (15.0.0 20140723) has 690MB of virtual memory overhead.
My sample program is:
program sharedmemtest
  use omp_lib
  implicit none
  integer :: nroot1
  integer, parameter :: dp = selected_real_kind(14,200)
  real(dp), allocatable :: matrix_elementsy(:,:,:,:)
  !$OMP PARALLEL NUM_THREADS(10) SHARED(matrix_elementsy)
  nroot1 = 2
  if (OMP_GET_THREAD_NUM() == 0) then
     allocate(matrix_elementsy(nroot1,nroot1,nroot1,nroot1))
     print *, "after allocation"
     read(*,*)
  end if
  !$OMP BARRIER
  !$OMP END PARALLEL
end program
running
ifort -openmp test_openmp_minimal.f90 && ./a.out
shows a memory usage of
50694 user 20 0 694m 8516 1340 S 0.0 0.0 0:03.58 a.out
in top. Running
gfortran -fopenmp test_openmp_minimal.f90 && ./a.out
shows a memory usage of
50802 user 20 0 36616 956 740 S 0.0 0.0 0:00.98 a.out
Where is the 690MB of overhead coming from when compiling with ifort? Am I doing something wrong? Or is this a bug in ifort?
For completeness: This is a minimal example taken from a much larger program. I am using gfortran 4.4 (4.4.7 20120313).
I appreciate all comments and ideas.

I don't believe top is reliable here. I do not see any evidence that the binary created from your test allocates anywhere near that much memory.
Below I have shown the result of generating the binary normally, with the Intel libraries linked statically and with everything linked statically. The static binary is in the ballpark of 2-3 megabytes.
It is possible that OpenMP thread stacks, which I believe are allocated from the heap, could be the source of the additional virtual memory here. Can you try this test with OMP_STACKSIZE=4K? I think the default is a few megabytes.
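For instance, a quick check along those lines might look like this (a sketch; the stack-size values are only illustrative):
# run with a tiny per-thread OpenMP stack and compare the VIRT column in top while the program waits at read(*,*)
OMP_STACKSIZE=4K ./a.out
# then force a large per-thread stack and see whether VIRT grows accordingly
OMP_STACKSIZE=64M ./a.out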
Dynamic Executable
jhammond#cori11:/tmp> ifort -O3 -qopenmp smt.f90 -o smt
jhammond#cori11:/tmp> size smt
text data bss dec hex filename
748065 13984 296024 1058073 102519 smt
jhammond#cori11:/tmp> ldd smt
linux-vdso.so.1 => (0x00002aaaaaaab000)
libm.so.6 => /lib64/libm.so.6 (0x00002aaaaab0c000)
libiomp5.so => /opt/intel/parallel_studio_xe_2016.0.047/compilers_and_libraries_2016.0.109/linux/compiler/lib/intel64/libiomp5.so (0x00002aaaaad86000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaab0c7000)
libc.so.6 => /lib64/libc.so.6 (0x00002aaaab2e4000)
libgcc_s.so.1 => /opt/gcc/5.1.0/snos/lib64/libgcc_s.so.1 (0x00002aaaab661000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaab878000)
/lib64/ld-linux-x86-64.so.2 (0x0000555555554000)
Dynamic Executable with Static Intel
jhammond#cori11:/tmp> ifort -O3 -qopenmp smt.f90 -static-intel -o smt
jhammond#cori11:/tmp> size smt
text data bss dec hex filename
1608953 41420 457016 2107389 2027fd smt
jhammond#cori11:/tmp> ls -l smt
-rwxr-x--- 1 jhammond jhammond 1872489 Jan 12 05:51 smt
Static Executable
jhammond#cori11:/tmp> ifort -O3 -qopenmp smt.f90 -static -o smt
jhammond#cori11:/tmp> size smt
text data bss dec hex filename
2262019 43120 487320 2792459 2a9c0b smt
jhammond#cori11:/tmp> ldd smt
not a dynamic executable
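If you want to see where the virtual memory actually goes, one further check (a sketch; <pid> is whatever top reports for a.out) is to inspect the process's mappings directly instead of trusting top's summary:
# break the VIRT figure down by mapping; large anonymous regions are typically thread stacks or heap arenas
pmap -x <pid>
# compare the kernel's own accounting of virtual vs. resident memory
grep -E 'VmSize|VmRSS' /proc/<pid>/status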

Related

The memory footprint of a program for a RISC processor

How can I test the memory footprints of programs written for a RISC and a CISC processor?
Which one would require more memory and why?
So, the way I would do this is via experimentation. I would compile binaries for both types of architectures and then use the GNU binutils tools to see what the memory footprints are. For the following examples, I will compare x86_64 and RISC-V architectures. The first method uses the size tool, which breaks down the various sections of an ELF file and reports their sizes.
# riscv64-unknown-elf-size Test.elf
Which will output something like this
text data bss dec hex filename
XXXXXX XXX XXXXXXX XXXXXXX XXXXXX Test.elf
Then compare that to the x86 version:
# size Test.exe
Which will output something like this
text data bss dec hex filename
XXXXXX XXX XXXXXXX XXXXXXX XXXXXX Test.exe
The other method is to convert your ELF to a flat binary that will be bit-for-bit what is put into your memory (this may not be true for more complex memory architectures, but we'll assume a simple case where everything is stored in and executed from RAM). The tool for that is objcopy.
# riscv64-unknown-elf-objcopy -O binary Test.elf Test.elf.bin
# objcopy -O binary Test.exe Test.exe.bin
Then check the sizes of the two resulting bin files.
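Finally, as a rough comparison (a sketch; the .bin names come from the objcopy commands above), the raw image sizes can be listed directly:
ls -l Test.elf.bin Test.exe.bin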

How is -fPIE passed along to llc?

I'm working on an LLVM backend for a new architecture and we need to have position independent executables. I can pass '-fPIE' on the clang command line, but I don't see any indication of this show up in the resulting LLVM IR. For example, if I run:
clang -v -emit-llvm -fPIC -O0 -S global_dat.c -o global_dat_x86_pic.ll
And then take a look at the resulting global_dat_x86_pic.ll file, I see the following near the bottom:
!0 = !{i32 1, !"PIC Level", i32 2}
Ok, makes sense.
However if I run:
clang -v -emit-llvm -fPIE -O0 -S global_dat.c -o global_dat_x86_pie.ll
I see that the two .ll files are identical. Near the bottom of global_dat_x86_pie.ll I see:
!0 = !{i32 1, !"PIC Level", i32 2}
Which is identical to the case where I ran with -fPIC. There's no indication of "PIE Level" in the .ll file. If this .ll file were passed on to llc, how would llc know that -fPIE had been set on the clang command line?
I have run it in gdb and can see that, in the second case with -fPIE on the clang command line, there is an Opts.PIELevel (in $LLVM_HOME/tools/clang/lib/Frontend/CompilerInvocation.cpp) that gets set to 2. (In fact, both Opts.PIELevel and Opts.PICLevel are set to 2 in that case, whereas when -fPIC is passed to clang only Opts.PICLevel is set to 2.)
This depends on your default target triple, which I can't tell from your question. You can see what happens if you take the default architecture out of the picture (or run a native clang on an arch that does support PIE).
For example, bare X86-64 shows this,
$ clang -c hello.c -target x86_64 -fPIE -emit-llvm -###
If you run that command, you'll find "-pie-level" "2" in the output, which is how llc (well, its internal equivalent) knows about it.
The key here is that you'll have to arrange for your backend to do something with this flag. Certain platforms (like Darwin) just ignore it. If you happen to be experimenting on an OS X host, you won't see -pie-level in the -### output.
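As a quick check (a sketch; hello.c is any trivial C file), you can filter the driver's -### output for the PIC/PIE level flags it passes to cc1:
# the driver prints the cc1 invocation on stderr; look for the pic/pie level arguments
clang -c hello.c -target x86_64 -fPIE -### 2>&1 | grep -o 'pi[ce]-level" "[0-9]'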

Clang/LLVM OpenMP program not spawning threads

According to http://blog.llvm.org/2015/05/openmp-support_22.html, OpenMP support in Clang is complete. However, I'm having difficulties with a simple program.
I've installed Clang/LLVM as explained in http://clang.llvm.org/get_started.html and the OpenMP runtime as explained in http://openmp.llvm.org/.
The test program is:
#include "omp.h"
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char** argv)
{
#pragma omp parallel
{
printf("thread %d\n", omp_get_thread_num());
}
return 0;
}
The compilation line is:
clang -lomp -I/.../openmp/runtime/exports/common/include -L/.../openmp/runtime/exports/lin_32e/lib ./test-openmp.c -o ./test-openmp
Using which, I check I'm using the correct clang binary.
With ldd, I check I'm linking to the correct OpenMP library:
$ ldd ./test-openmp
linux-vdso.so.1 => (0x00007ffdaf6d7000)
libomp.so => /.../openmp/runtime/exports/lin_32e/lib/libomp.so (0x00007f7d47552000)
libc.so.6 => /lib64/libc.so.6 (0x00007f7d47191000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f7d46f8d000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f7d46d71000)
/lib64/ld-linux-x86-64.so.2 (0x00007f7d477fb000)
But when running, it executes only one thread:
$ OMP_NUM_THREADS=4 ./test-openmp
thread 0
The reason I'm linking with -lomp is that if I link with -fopenmp, the code is wrongly linked against GCC's OpenMP library (libgomp). However, in that case the result is the same:
$ clang -fopenmp -I/.../openmp/runtime/exports/common/include -L/.../openmp/runtime/exports/lin_32e/lib ./test-openmp.c -o ./test-openmp
$ ldd ./test-openmp
linux-vdso.so.1 => (0x00007ffdf351f000)
libgomp.so.1 => /lib64/libgomp.so.1 (0x00007fbc1c3e1000)
librt.so.1 => /lib64/librt.so.1 (0x00007fbc1c1d9000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fbc1bfbd000)
libc.so.6 => /lib64/libc.so.6 (0x00007fbc1bbfc000)
/lib64/ld-linux-x86-64.so.2 (0x00007fbc1c5f8000)
$ OMP_NUM_THREADS=4 ./test-openmp
thread 0
When using gcc, it works as expected:
$ gcc -fopenmp ./test-openmp.c -o ./test-openmp
$ ldd ./test-openmp
linux-vdso.so.1 => (0x00007ffc444e0000)
libgomp.so.1 => /lib64/libgomp.so.1 (0x00007f7d425ce000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f7d423b2000)
libc.so.6 => /lib64/libc.so.6 (0x00007f7d41ff1000)
/lib64/ld-linux-x86-64.so.2 (0x00007f7d427e5000)
$ OMP_NUM_THREADS=4 ./test-openmp
thread 0
thread 2
thread 3
thread 1
In the past I've used the implementation described in http://clang-omp.github.io/ and I know that one works (a different repo is presented there for Clang and LLVM, but the OpenMP repo is the same). However, that page was (apparently) last updated in 2014, while the blog post at http://blog.llvm.org/2015/05/openmp-support_22.html is from May 2015, which makes you think you can use the latest Clang/LLVM for OpenMP.
So my question is: am I missing something, or is the blog post from May 2015 actually referring to the Clang/LLVM implementation from http://clang-omp.github.io and not to the latest one?
Adding -fopenmp=libomp should do the trick.
This is a temporary situation; hopefully, soon clang will be changed to do what is described in the blog post.
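A minimal sketch of the suggested invocation (reusing the elided paths from the question):
# use the LLVM OpenMP runtime (libomp) instead of GCC's libgomp
clang -fopenmp=libomp -I/.../openmp/runtime/exports/common/include -L/.../openmp/runtime/exports/lin_32e/lib ./test-openmp.c -o ./test-openmp
OMP_NUM_THREADS=4 ./test-openmp
# should now print four "thread N" lines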

Dynamic library size bigger than static library and sum of linked objects size, how comes?

[See edit, it seems the extra size comes from debugging symbols added at linking time, but the reason why this happens is still unclear!]
I am cross-compiling OpenCV 2.4.11 from Ubuntu x86 64-bit to armeabi.
I am using the toolchain available here https://developer.android.com/tools/sdk/ndk/index.html, choosing the 4.9 compiler.
When I compile the dynamic libraries, they come out considerably bigger than the static libraries. Examples:
3793082 Mar 12 17:21 libopencv_core.a
6131716 Mar 12 17:29 libopencv_core.so
446060 Mar 12 17:22 libopencv_highgui.a
5510352 Mar 12 17:30 libopencv_highgui.so
3477794 Mar 12 17:21 libopencv_imgproc.a
5325504 Mar 12 17:29 libopencv_imgproc.so
38004 Mar 12 17:19 libopencv_info.so
844990 Mar 12 17:21 libopencv_ml.a
3827136 Mar 12 17:29 libopencv_ml.so
747744 Mar 12 17:22 libopencv_objdetect.a
2370188 Mar 12 17:30 libopencv_objdetect.so
405920 Mar 12 17:22 libopencv_video.a
2196268 Mar 12 17:30 libopencv_video.so
For the static libraries the size corresponds more or less to the total size of the object files. Example for core and highgui:
du -chs `find -iname \*.o|grep opencv_core.dir`
[...]
3,5M total
du -chs `find -iname \*.o|grep opencv_highgui.dir`
[...]
352K total
The same happens if I build with make or ninja.
There is just a small difference in the compiler flags at build time, but if I check the object files generated for the static and the dynamic build, they have exactly the same size. That's the command I use to generate such a list:
ls -s `find -iname \*.o`|grep core
So, I thought, it must something in the linking phase. I took a look at the build.ninja file differences, and these are lines present only for the shared version:
LINK_FLAGS = -llog -Wl,--fix-cortex-a8 -Wl,--no-undefined -Wl,--gc-sections -Wl,-z,noexecstack -Wl,-z,relro -Wl,-z,now
LINK_LIBRARIES = lib/libopencv_features2d.so -ldl -lm -llog -ltbb lib/libopencv_flann.so lib/libopencv_highgui.so lib/libopencv_imgproc.so lib/libopencv_core.so
I do not think the additional libraries linked (dl, m, log, tbb) influence the final size, as they are all much smaller than the difference I found. Furthermore, I started to verify this: for log only the .so is available, and for tbb (~100 KB) I have both the shared and the static version. BTW, I also tried to build without tbb.
To be 100% sure, I took the actual command line that was linking the object files, removed the --no-undefined option, and then removed all other options and linked libraries. The file size did not change, except when removing -Wl,--gc-sections, which caused the file size to increase (it's a section garbage-collection option).
So, the only option that is left is a... linker bug?!? Does anybody have any idea what is happening?
Some additional information:
Compiler details:
./arm-linux-androideabi-gcc -v
Using built-in specs.
COLLECT_GCC=./arm-linux-androideabi-gcc
COLLECT_LTO_WRAPPER=/opt/toolchain-arm17/bin/../libexec/gcc/arm-linux-androideabi/4.9/lto-wrapper
Target: arm-linux-androideabi
Configured with: /s/ndk-toolchain/src/build/../gcc/gcc-4.9/configure --prefix=/tmp/ndk-andrewhsieh/build/toolchain/prefix --target=arm-linux-androideabi --host=x86_64-linux-gnu --build=x86_64-linux-gnu --with-gnu-as --with-gnu-ld --enable-languages=c,c++ --with-gmp=/tmp/ndk-andrewhsieh/build/toolchain/temp-install --with-mpfr=/tmp/ndk-andrewhsieh/build/toolchain/temp-install --with-mpc=/tmp/ndk-andrewhsieh/build/toolchain/temp-install --with-cloog=/tmp/ndk-andrewhsieh/build/toolchain/temp-install --with-isl=/tmp/ndk-andrewhsieh/build/toolchain/temp-install --with-ppl=/tmp/ndk-andrewhsieh/build/toolchain/temp-install --disable-ppl-version-check --disable-cloog-version-check --disable-isl-version-check --enable-cloog-backend=isl --with-host-libstdcxx='-static-libgcc -Wl,-Bstatic,-lstdc++,-Bdynamic -lm' --disable-libssp --enable-threads --disable-nls --disable-libmudflap --disable-libgomp --disable-libstdc__-v3 --disable-sjlj-exceptions --disable-shared --disable-tls --disable-libitm --with-float=soft --with-fpu=vfp --with-arch=armv5te --enable-target-optspace --enable-initfini-array --disable-nls --prefix=/tmp/ndk-andrewhsieh/build/toolchain/prefix --with-sysroot=/tmp/ndk-andrewhsieh/build/toolchain/prefix/sysroot --with-binutils-version=2.24 --with-mpfr-version=3.1.1 --with-mpc-version=1.0.1 --with-gmp-version=5.0.5 --with-gcc-version=4.9 --with-gdb-version=7.6 --with-python=/usr/local/google/home/andrewhsieh/mydroid/ndk/prebuilt/linux-x86_64/bin/python-config.sh --with-gxx-include-dir=/tmp/ndk-andrewhsieh/build/toolchain/prefix/include/c++/4.9 --with-bugurl=http://source.android.com/source/report-bugs.html --enable-languages=c,c++ --disable-bootstrap --enable-plugins --enable-libgomp --disable-libsanitizer --enable-gold --enable-graphite=yes --with-cloog-version=0.18.0 --with-isl-version=0.11.1 --enable-eh-frame-hdr-for-static --with-arch=armv5te --program-transform-name='s&^&arm-linux-androideabi-&' --enable-gold=default
Thread model: posix
gcc version 4.9 20140827 (prerelease) (GCC)
I also tried to see what would happen with another version of the linker (from a different toolchain build), but there were no changes in size:
arm-linux-androideabi-g++.exe -v
Using built-in specs.
COLLECT_GCC=arm-linux-androideabi-g++.exe
COLLECT_LTO_WRAPPER=lto-wrapper.exe
Target: arm-linux-androideabi
Configured with: /s/ndk-toolchain/src/build/../gcc/gcc-4.8/configure --prefix=/tmp/ndk-andrewhsieh/build/toolchain/prefix --target=arm-linux-androideabi --host=x86_64-pc-mingw32msvc --build=x86_64-linux-gnu --with-gnu-as --with-gnu-ld --enable-languages=c,c++ --with-gmp=/tmp/ndk-andrewhsieh/build/toolchain/temp-install --with-mpfr=/tmp/ndk-andrewhsieh/build/toolchain/temp-install --with-mpc=/tmp/ndk-andrewhsieh/build/toolchain/temp-install --with-cloog=/tmp/ndk-andrewhsieh/build/toolchain/temp-install --with-isl=/tmp/ndk-andrewhsieh/build/toolchain/temp-install --with-ppl=/tmp/ndk-andrewhsieh/build/toolchain/temp-install --disable-ppl-version-check --disable-cloog-version-check --disable-isl-version-check --enable-cloog-backend=isl --with-host-libstdcxx='-static-libgcc -Wl,-Bstatic,-lstdc++,-Bdynamic -lm' --disable-libssp --enable-threads --disable-nls --disable-libmudflap --disable-libgomp --disable-libstdc__-v3 --disable-sjlj-exceptions --disable-shared --disable-tls --disable-libitm --with-float=soft --with-fpu=vfp --with-arch=armv5te --enable-target-optspace --enable-initfini-array --disable-nls --prefix=/tmp/ndk-andrewhsieh/build/toolchain/prefix --with-sysroot=/tmp/ndk-andrewhsieh/build/toolchain/prefix/sysroot --with-binutils-version=2.24 --with-mpfr-version=3.1.1 --with-mpc-version=1.0.1 --with-gmp-version=5.0.5 --with-gcc-version=4.8 --with-gdb-version=7.6 --with-python=/usr/local/google/home/andrewhsieh/mydroid/ndk/prebuilt/windows-x86_64/bin/python-config.sh --with-gxx-include-dir=/tmp/ndk-andrewhsieh/build/toolchain/prefix/include/c++/4.8 --with-bugurl=http://source.android.com/source/report-bugs.html --enable-languages=c,c++ --disable-bootstrap --enable-plugins --enable-libgomp --disable-libsanitizer --enable-gold --enable-graphite=yes --with-cloog-version=0.18.0 --with-isl-version=0.11.1 --enable-eh-frame-hdr-for-static --with-arch=armv5te --program-transform-name='s&^&arm-linux-androideabi-&' --enable-gold=default
Thread model: posix
gcc version 4.8 (GCC)
EDIT:
As suggested by MarcB, I tried to strip the library. The result is surprising (to me :) )
$ arm-linux-androideabi/bin/strip -g libopencv_core.so -o libopencv_core_stripped.so
$ ls -la *core*
6293308 Mar 13 10:40 libopencv_core.so
3224840 Mar 13 12:18 libopencv_core_stripped.so
Where did all those debug symbols come from, if the object files were compiled without -g (or even with -g0)?
Note: the library stripped like that seems to be fully functional. The nm -D output of the stripped and unstripped libraries is the same, and the nm output is just a few lines smaller (about 50 lines fewer out of 12000).
Just to be sure, I also tried to strip the objects before linking, but their file size does not change (it even increases a bit), and linking the "stripped" object files produces a library of the same large size as before.
Those are not debug symbols. They are regular linker symbols.
A shared library may have two sets of symbols: one is for linking, the other for dynamic loading. strip removes the first kind. You cannot link against a stripped shared library, but you can load it at run time normally (e.g. if you use dlopen, or if you link with the library and then strip it).
See nm yourlib.so and nm -D yourlib.so both before and after running strip.
CORRECTION: it is possible to link with a stripped library. A good explanation of the two kinds of symbol tables is here.
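A minimal sketch of that comparison, using the cross binutils from the same toolchain and the library names from the question (the stripped file name is just illustrative):
# a full strip removes the regular symbol table (.symtab) used for static linking...
arm-linux-androideabi-strip libopencv_core.so -o libopencv_core_full_stripped.so
arm-linux-androideabi-nm libopencv_core_full_stripped.so        # reports "no symbols"
# ...but the dynamic symbol table (.dynsym) used by the runtime loader is untouched
arm-linux-androideabi-nm -D libopencv_core.so | wc -l
arm-linux-androideabi-nm -D libopencv_core_full_stripped.so | wc -l   # same count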

Debug error in Fortran for an array with a negative index

I have a test program here:
program test
  implicit none
  integer(4) :: indp
  integer(4) :: t1(80)
  indp = -3
  t1(indp) = 1
  write(*,*) t1(indp)
end program test
The assignment t1(indp) = 1 is wrong, because indp is a negative number, but when I compile it with ifort or gfortran, neither of them finds this error.
Even using valgrind to debug this program does not find the error.
Do you have any idea how to find this kind of problem?
Fortran compilers aren't required to give you warnings about things like this; and in general, t1(-3) = 1 could be a perfectly reasonable statement if you set the lower bound of your Fortran array to something equal to or less than -3, e.g.
integer(kind=4), dimension(-5:74) :: t1
would certainly allow setting and reading t1(-3).
If you want to make sure these sorts of errors are caught at runtime, you can compile with -fcheck=bounds (the older spelling is -fbounds-check) in gfortran:
$ gfortran -o foo foo.f90 -fcheck=bounds
$ ./foo
At line 8 of file foo.f90
Fortran runtime error: Array reference out of bounds for array 't1', lower bound of dimension 1 exceeded (-3 < 1)
or -check bounds in ifort:
$ ifort -o foo foo.f90 -check bounds
$ ./foo
forrtl: severe (408): fort: (3): Subscript #1 of the array T1 has value -3 which is less than the lower bound of 1
Image PC Routine Line Source
foo 000000000046A8DA Unknown Unknown Unknown
The reason valgrind doesn't catch this is a little subtle, but note that it would if the array were allocated:
program test
  implicit none
  integer(kind=4) :: indp
  integer(kind=4), allocatable :: t1(:)
  indp = -3
  allocate(t1(80))
  t1(indp) = 1
  write(*,*) t1(indp)
  deallocate(t1)
end program test
$ gfortran -o foo foo.f90 -g
$ valgrind ./foo
==18904== Memcheck, a memory error detector
==18904== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==18904== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
==18904== Command: ./foo
==18904==
==18904== Invalid write of size 4
==18904== at 0x400931: MAIN__ (foo.f90:9)
==18904== by 0x400A52: main (foo.f90:13)
==18904== Address 0x5bb3420 is 16 bytes before a block of size 320 alloc'd
==18904== at 0x4C264B2: malloc (vg_replace_malloc.c:236)
==18904== by 0x400904: MAIN__ (foo.f90:8)
==18904== by 0x400A52: main (foo.f90:13)
==18904==
==18904== Invalid read of size 4
==18904== at 0x4F07368: extract_int (write.c:450)
==18904== by 0x4F08171: write_integer (write.c:1260)
==18904== by 0x4F0BBAE: _gfortrani_list_formatted_write (write.c:1553)
==18904== by 0x40099F: MAIN__ (foo.f90:10)
==18904== by 0x400A52: main (foo.f90:13)
==18904== Address 0x5bb3420 is 16 bytes before a block of size 320 alloc'd
==18904== at 0x4C264B2: malloc (vg_replace_malloc.c:236)
==18904== by 0x400904: MAIN__ (foo.f90:8)
==18904== by 0x400A52: main (foo.f90:13)
There is no error. You declared indp as an integer of a certain range and precision (of a certain KIND; look up that term in the documentation), which can be either positive or negative.
After that you assigned the value 1 to t1(indp) and wrote it out.
