I am hitting a segmentation fault only when both using AVX and linking to other code that does vectorization

I am using Eigen to set up a sparse linear system as follows (slightly pseudocode):
Eigen::SparseQR<Eigen::SparseMatrix<real_t>, Eigen::COLAMDOrdering<int>> solver;
Eigen::SparseMatrix<real_t> P(rows, cols);
P.setFromTriplets(triplet_list.begin(), triplet_list.end());
P.makeCompressed();
solver.compute(P);
This code is within a small library. I am compiling with -mavx -mfma -O2. If I build a simple executable using this library, everything runs fine. If I instead link it into another library (in which the C++ sources are built with the same compiler flags, but which also includes CUDA), I get a segmentation fault in Eigen::SparseQR<Eigen::SparseMatrix<real_t>, Eigen::COLAMDOrdering<int>>::factorize. If I compile with -O0, the segmentation fault disappears.
I have not been able to isolate this into a minimal working example; I would appreciate suggestions on how I could describe the problem better, or ideas as to what might be going wrong. While vectorization is not critical for this solve, I do need it elsewhere in the library, so simply removing the AVX flags is not a good option.
EDIT: adding some context as requested.
If I compile with -g and run in gdb, the exact crash line is line 98 in Core/util/Memory.h
 95 /** \internal Frees memory allocated with handmade_aligned_malloc */
 96 inline void handmade_aligned_free(void *ptr)
 97 {
>98   if (ptr) std::free(*(reinterpret_cast<void**>(ptr) - 1));
 99 }
with stack trace
#0 0x00007ffff12e94dc in free () from /lib64/libc.so.6
#1 0x00007fffe3dadb1f in Eigen::internal::handmade_aligned_free (ptr=<optimized out>) at include/eigen3/Eigen/src/Core/util/Memory.h:98
#2 Eigen::internal::aligned_free (ptr=<optimized out>) at include/eigen3/Eigen/src/Core/util/Memory.h:179
#3 Eigen::aligned_allocator<float>::deallocate (this=<optimized out>, p=<optimized out>) at include/eigen3/Eigen/src/Core/util/Memory.h:763
#4 std::allocator_traits<Eigen::aligned_allocator<float> >::deallocate (__a=..., __n=<optimized out>, __p=<optimized out>) at include/c++/7.3.0/bits/alloc_traits.h:328
#5 std::_Vector_base<float, Eigen::aligned_allocator<float> >::_M_deallocate (this=<optimized out>, __n=<optimized out>, __p=<optimized out>) at include/c++/7.3.0/bits/stl_vector.h:180
#6 std::vector<float, Eigen::aligned_allocator<float> >::_M_default_append (this=0x7fffe3fefc20 <lse_helper_t::singleton()::helper>, __n=<optimized out>) at include/c++/7.3.0/bits/vector.tcc:592
#7 0x00007fffe3dae688 in std::vector<float, Eigen::aligned_allocator<float> >::resize (__new_size=10, this=0x7fffe3fefc20 <lse_helper_t::singleton()::helper>) at include/c++/7.3.0/bits/stl_vector.h:692
If I run with valgrind, I see errors of the form below. However, the program no longer crashes (the same code run outside of valgrind does still segfault).
==16218== Invalid read of size 8
==16218== at 0x19049B16: handmade_aligned_free (Memory.h:98)
==16218== by 0x19049B16: aligned_free (Memory.h:179)
==16218== by 0x19049B16: deallocate (Memory.h:763)
==16218== by 0x19049B16: deallocate (alloc_traits.h:328)
==16218== by 0x19049B16: _M_deallocate (stl_vector.h:180)
==16218== by 0x19049B16: std::vector<float, Eigen::aligned_allocator<float> >::_M_default_append(unsigned long) (vector.tcc:592)
==16218== by 0x1904A687: resize (stl_vector.h:692)
==16218== Address 0x3e195558 is 8 bytes before a block of size 8 alloc'd
==16218== at 0x4C29BE3: malloc (vg_replace_malloc.c:299)
==16218== by 0x123B7326: Eigen::internal::aligned_malloc(unsigned long) (in /gdn/centos7/0001/x3/prefixes/desmond-dependencies/2.14c7__dc4688ce01c7/lib/libminimax.so)
==16218== by 0x19049B73: allocate (Memory.h:758)
==16218== by 0x19049B73: allocate (alloc_traits.h:301)
==16218== by 0x19049B73: _M_allocate (stl_vector.h:172)
==16218== by 0x19049B73: std::vector<float, Eigen::aligned_allocator<float> >::_M_default_append(unsigned long) (vector.tcc:571)
==16218== by 0x1904A687: resize (stl_vector.h:692)
==16218== Invalid free() / delete / delete[] / realloc()
==16218== at 0x4C2ACDD: free (vg_replace_malloc.c:530)
==16218== by 0x19049B1E: handmade_aligned_free (Memory.h:98)
==16218== by 0x19049B1E: aligned_free (Memory.h:179)
==16218== by 0x19049B1E: deallocate (Memory.h:763)
==16218== by 0x19049B1E: deallocate (alloc_traits.h:328)
==16218== by 0x19049B1E: _M_deallocate (stl_vector.h:180)
==16218== by 0x19049B1E: std::vector<float, Eigen::aligned_allocator<float> >::_M_default_append(unsigned long) (vector.tcc:592)
==16218== by 0x1904A687: resize (stl_vector.h:692)
==16218== Invalid read of size 8
==16218== at 0x1905327B: handmade_aligned_free (Memory.h:98)
==16218== by 0x1905327B: aligned_free (Memory.h:179)
==16218== by 0x1905327B: conditional_aligned_free<true> (Memory.h:230)
==16218== by 0x1905327B: conditional_aligned_delete_auto<double, true> (Memory.h:416)
==16218== by 0x1905327B: ~DenseStorage (DenseStorage.h:542)
==16218== by 0x1905327B: ~PlainObjectBase (PlainObjectBase.h:98)
==16218== by 0x1905327B: ~Matrix (Matrix.h:178)
==16218== by 0x1905327B: Eigen::SparseQR<Eigen::SparseMatrix<double, 0, int>, Eigen::COLAMDOrdering<int> >::factorize(Eigen::SparseMatrix<double, 0, int> const&) (SparseQR.h:360)
==16218== by 0x19047A28: compute (SparseQR.h:118)
I am attempting to turn this into a minimal reproducible example.

The described problem usually occurs if compilation units with different memory-alignment options are linked together. By default Eigen aligns memory to 16 bytes, unless AVX is enabled, in which case memory is aligned to 32 bytes (or 64 bytes for AVX512 -- I think).
Ideally, you should compile all compilation units with the same target architecture -- if you only plan to run on your local machine, it is best to use -march=native (this also enables tuning for the local architecture).
If you need to have some parts compiled with AVX enabled and others without, you can manually override the memory-alignment of Eigen using -DEIGEN_MAX_ALIGN_BYTES=16 or -DEIGEN_MAX_ALIGN_BYTES=32 (for consistency, either one should be added to all compilation units, even though some would be redundant).
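To track down which side disagrees, a minimal diagnostic sketch can help (report_eigen_config is a hypothetical helper, not part of Eigen): compile one copy of it into each library, call both at startup, and compare what they print.
// Sketch: reports the alignment and AVX settings Eigen was compiled with in
// this translation unit; differing values across libraries indicate a mismatch.
#include <Eigen/Core>
#include <cstdio>

void report_eigen_config()
{
    std::printf("EIGEN_MAX_ALIGN_BYTES = %d\n", int(EIGEN_MAX_ALIGN_BYTES));
#ifdef EIGEN_VECTORIZE_AVX
    std::printf("AVX vectorization: enabled\n");
#else
    std::printf("AVX vectorization: disabled\n");
#endif
}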

Related

CMA Allocation Fault on Petalinux 2020.2 (Zynq-7000)

I want to use 1920x1080 (or more) on my custom Zynq-7000 board. Mode 1024x768 works fine.
There is a CMA allocation error when I try to use Full HD.
I added some output to the source code (the output below is for 2560x1600; it is the same for 1920x1080, except for the buffer size):
[12:09:34:466] xlnx-pl-disp amba_pl:xlnx_pl_disp: surface width(2560), height(1600) and bpp(24)
[12:09:34:474] xlnx-pl-disp amba_pl:xlnx_pl_disp: bytes per line after alignment: 12288000
[12:09:34:480] xlnx-pl-disp amba_pl:xlnx_pl_disp: allocating 12288000 bytes with kzalloc()...
[12:09:34:488] xlnx-pl-disp amba_pl:xlnx_pl_disp: OK
[12:09:34:491] xlnx-pl-disp amba_pl:xlnx_pl_disp: init gem object...
[12:09:34:497] xlnx-pl-disp amba_pl:xlnx_pl_disp: OK
[12:09:34:500] xlnx-pl-disp amba_pl:xlnx_pl_disp: creating mmap offset...
[12:09:34:505] xlnx-pl-disp amba_pl:xlnx_pl_disp: OK
[12:09:34:508] xlnx-pl-disp amba_pl:xlnx_pl_disp: gem cma created with size 12288000
[12:09:34:514] xlnx-pl-disp amba_pl:xlnx_pl_disp: failed to allocate buffer with size 12288000
[12:09:34:522] xlnx-pl-disp amba_pl:xlnx_pl_disp: Failed to create cma gem object (12288000 bytes)
[12:09:34:527] xlnx-pl-disp amba_pl:xlnx_pl_disp: drm_fb_helper_single_fb_probe() returns -12
[12:09:34:536] xlnx-pl-disp amba_pl:xlnx_pl_disp: Failed to set initial hw configuration.
[12:09:34:541] xlnx-pl-disp amba_pl:xlnx_pl_disp: failed to initialize drm fb
As far as I can see, the issue comes from this line (drm_gem_cma_helper.c):
cma_obj->vaddr = dma_alloc_wc(drm->dev, size, &cma_obj->paddr, GFP_KERNEL | __GFP_NOWARN);
I tried changing some settings:
increase the CMA size in the kernel config (it was 128 MB, now 256 MB)
increase the number of CMA areas in the kernel config (from 7 to 20)
add reserved memory to the device tree
add the coherent_pool option to bootargs
I get the same fault anyway.
Please help me find the reason and solve this issue.
Many thanks!
With regards,
Maksim
The only solution I've found is to change the kernel base address from 0x18000000 to 0x11000000 (as shown in the screenshot).
Unfortunately, I don't have a complete explanation of why this helps.
With regards,
Maksim

Do the glibc implementations of pthread_spin_lock() and pthread_spin_unlock() have memory fence instructions?

Do the glibc implementations of pthread_spin_lock() and pthread_spin_unlock() have memory fence instructions? (I could not find any fence instructions.)
A similar question has answers here:
Does pthread_mutex_lock contains memory fence instruction?
Do the glibc implementations of pthread_spin_lock() and pthread_spin_unlock() have memory fence instructions?
There is no single implementation -- there is a separate implementation for each supported processor.
The x86_64 implementation does not use memory fence instructions; it uses the lock prefix instead:
gdb -q /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) disas pthread_spin_lock
Dump of assembler code for function pthread_spin_lock:
0x00000000000108c0 <+0>: lock decl (%rdi)
0x00000000000108c3 <+3>: jne 0x108d0 <pthread_spin_lock+16>
0x00000000000108c5 <+5>: xor %eax,%eax
0x00000000000108c7 <+7>: retq
0x00000000000108c8 <+8>: nopl 0x0(%rax,%rax,1)
0x00000000000108d0 <+16>: pause
0x00000000000108d2 <+18>: cmpl $0x0,(%rdi)
0x00000000000108d5 <+21>: jg 0x108c0 <pthread_spin_lock>
0x00000000000108d7 <+23>: jmp 0x108d0 <pthread_spin_lock+16>
Since lock-prefixed instructions are already a memory barrier on x86_64 (and i386), no additional memory barriers are necessary.
But the powerpc implementation uses the lwarx and stwcx instructions, which are closer to a "memory fence", and the sparc64 implementation uses a full membar (memory barrier) instruction.
You can see the various implementations in sysdeps/.../pthread_spin_lock.* files in GLIBC sources.
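For completeness, a minimal usage sketch (plain POSIX API; compile with -pthread): the lock/unlock pair already provides the required ordering on every architecture, so user code does not need any explicit fences around the protected data.
#include <pthread.h>
#include <stdio.h>

static pthread_spinlock_t lock;
static long counter;                    /* shared data protected by the spinlock */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; ++i) {
        pthread_spin_lock(&lock);       /* acquire: protected accesses cannot move before this */
        ++counter;
        pthread_spin_unlock(&lock);     /* release: the increment is visible to the next owner */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* always 200000 */
    pthread_spin_destroy(&lock);
    return 0;
}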

Malloc failed to allocate bytes even though there is a sufficient amount of bytes available

I tried to allocate 21,128 bytes using a wrapper function, which calls malloc internally.
malloc_stats() output:
system bytes = 14618624
in use bytes = 13759424
Arena 1:
Arena 0:
system bytes = 14626816
in use bytes = 13759600
Arena 1:
system bytes = 135168
in use bytes = 3280
Arena 2:
system bytes = 135168
in use bytes = 13088
But still, I see that malloc has failed. What could possibly be the reason?
*** Error in `./wr_acc': malloc(): memory corruption: 0x00007ff4747a2ff0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x82c86)[0x7ff48c9d5c86]
/lib64/libc.so.6(__libc_malloc+0x4c)[0x7ff48c9d884c]
./wr_acc[0xdf4c28]
Please help. I am a beginner.
The error message is clear:
*** Error in `./wr_acc': malloc(): memory corruption: 0x00007ff4747a2ff0 ***
malloc() detected an invalid state in its ancillary structures so it gave up trying to allocate memory and aborted the program to avoid potentially damaging side effects.
The private data malloc() uses to keep track of allocated and free blocks may have been overwritten by your program, for example by writing beyond the end of an allocated block or before its beginning. You could post the code and see if anyone can spot such problems.
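As an illustration only (a deliberately broken sketch, not your code), a single write past the end of an allocation can clobber the allocator's bookkeeping, and the abort then typically fires on a later, seemingly unrelated malloc() call, just like in your backtrace:
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *buf = malloc(16);
    /* BUG: copies 16 characters plus the terminating '\0' (17 bytes) into a
     * 16-byte block, overwriting heap metadata that lives just past it. */
    strcpy(buf, "0123456789abcdef");

    /* The corruption may go unnoticed until a later allocation or free walks
     * the damaged structures, at which point glibc can abort with
     * "malloc(): memory corruption". */
    char *other = malloc(32);
    free(other);
    free(buf);
    return 0;
}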

The Keil RTX RTOS thread stack size

In the Keil RTX RTOS configuration file, the user can configure the default user thread stack size.
In general, the stack holds auto/local variables, while the "ZI data" section holds uninitialized global variables.
So I expected that if I change the user thread stack size in the RTX configuration file, the stack size would increase while the "ZI data" section size would stay the same.
I tested it, and the result shows that if I increase the user thread stack size, the "ZI data" section size increases by exactly the same amount.
In my test program there are 6 threads, each with a 600-byte stack. I use Keil to build the program and it shows me:
Code (inc. data) RO Data RW Data ZI Data Debug
36810 4052 1226 380 6484 518461 Grand Totals
36810 4052 1226 132 6484 518461 ELF Image Totals (compressed)
36810 4052 1226 132 0 0 ROM Totals
==============================================================================
Total RO Size (Code + RO Data) 38036 ( 37.14kB)
Total RW Size (RW Data + ZI Data) 6864 ( 6.70kB)
Total ROM Size (Code + RO Data + RW Data) 38168 ( 37.27kB)
But if I change each thread's stack size to 800 bytes, Keil shows me the following:
==============================================================================
Code (inc. data) RO Data RW Data ZI Data Debug
36810 4052 1226 380 7684 518461 Grand Totals
36810 4052 1226 132 7684 518461 ELF Image Totals (compressed)
36810 4052 1226 132 0 0 ROM Totals
==============================================================================
Total RO Size (Code + RO Data) 38036 ( 37.14kB)
Total RW Size (RW Data + ZI Data) 8064 ( 7.88kB)
Total ROM Size (Code + RO Data + RW Data) 38168 ( 37.27kB)
==============================================================================
The "ZI data" section size increase from 6484 to 7684 bytes. 7684 - 6484 = 1200 = 6 * 200. And 800 - 600 = 200.
So I see the thread stack is put in "ZI Data" section.
My question is:
Does it mean auto/local variables in the thread will be put in "ZI Data" section, when thread stack is put in "ZI data" section in RAM ?
If it's true, it means there is no stack section at all. There are only "RO/RW/ZI Data" and heap sections at all.
This article gives me the different answer. And I am a little confused about it now.
https://developer.mbed.org/handbook/RTOS-Memory-Model
The linker determines what memory sections exist. The linker creates some memory sections by default. In your case, three of those default sections are apparently named "RO Data", "RW Data", and "ZI Data". If you don't explicitly specify which section a variable should be located in then the linker will assign it to one of these default sections based on whether the variable is declared const, initialized, or uninitialized.
The linker is not automatically aware that you are using an RTOS. And it has no special knowledge of which variables you intend to use as thread stacks. So the linker does not automatically create independent memory sections for your thread stacks. Rather, the linker will treat the stack variables like any other variable and include them in one of the default memory sections. In your case the thread stacks are apparently being put into the ZI Data section by the linker.
If you want the linker to create special independent memory sections for your thread stacks then you have to explicitly tell the linker to do so via the linker command file. And then you also have to specify that the stack variables should be located in your custom sections. Consult the linker manual for details on how to do this.
Task stacks have to come from somewhere -- in RTX they are, by default, allocated statically and are of fixed size.
os_tsk_create_user() allows the caller to supply a stack that could be allocated in any manner (statically, or from the heap; allocating from the caller's stack is possible but unusual, probably pointless, and certainly dangerous) so long as it has 8-byte alignment. I find RTX's automatic stack allocation almost useless and seldom appropriate in all but the most trivial applications.
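A minimal sketch of that approach, assuming the classic RL-RTX API (the task name, priority, and stack size here are made up; check the exact os_tsk_create_user() prototype against your RTX version):
#include <rtl.h>

/* Caller-supplied stack: a U64 array guarantees the required 8-byte alignment.
 * Being static and uninitialized it still lands in ZI data, but its size and
 * placement are now entirely under your control. */
static U64 worker_stack[800 / 8];

__task void worker_task (void)
{
    for (;;) {
        /* ... thread work ... */
        os_dly_wait (10);
    }
}

void start_worker (void)
{
    /* priority 1, stack supplied by the caller instead of the RTX pool */
    os_tsk_create_user (worker_task, 1U, worker_stack, sizeof (worker_stack));
}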

cuda max number in Blocks and allocation manage

I'm writing a CUDA kernel and I have to execute it on this device:
name: GeForce GTX 480
CUDA capability: 2.0
Total global mem: 1610285056
Total constant Mem: 65536
Shared mem per mp: 49152
Registers per mp: 32768
Threads in warp: 32
Max threads per block: 1024
Max thread dimensions: (1024, 1024, 64)
Max grid dimensions: (65535, 65535, 65535)
The kernel, in minimal form, is:
__global__ void CUDAvegas( ... )
{
    devParameters p;
    extern __shared__ double shared[];
    int width = Ndim * Nbins;
    int ltid = p.lId;
    while (ltid < 2 * Ndim) {
        shared[ltid + 2 * width] = ltid;
        ltid += p.lOffset; // offset inside a block
    }
    __syncthreads();
    din2Vec<double> lxp(Ndim, Nbins);
    __syncthreads();
    for (int i = 0; i < Ndim; i++) {
        for (int j = 0; j < Nbins; j++) {
            lxp.v[i][j] = shared[i*Nbins + j];
        }
    }
} // end kernel
where Ndim=2, Nbins=128, devParameters is a class whose p.lId gives the local thread's id (within a block), and din2Vec is a class that creates a vector of dimension Ndim*Nbins with the new command (in its destructor I've implemented the corresponding delete[]).
The nvcc output is:
nvcc -arch=sm_20 --ptxas-options=-v file.cu -o file.x
ptxas info : Compiling entry function '_Z9CUDAvegas4LockidiiPdS0_S0_P7sumAccuP17curandStateXORWOWS0_i' for 'sm_20'
ptxas info : Function properties for _Z9CUDAvegas4LockidiiPdS0_S0_P7sumAccuP17curandStateXORWOWS0_i
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 22 registers, 116 bytes cmem[0], 51200 bytes cmem[2]
The number of threads is compatible with the multiprocessor limits: max shared memory, max registers per thread and per MP, and warps per MP.
If I launch 64 threads x 30 blocks (shared memory per block is 4128 bytes), everything is fine, but if I use more than 30 blocks I get the error:
cudaCheckError() failed at file.cu:508 : unspecified launch failure
========= Invalid __global__ read of size 8
========= at 0x000015d0 in CUDAvegas
========= by thread (0,0,0) in block (1,0,0)
========= Address 0x200ffb428 is out of bounds
I think it's a problem with allocating each thread's memory, but I don't understand what my limit is per MP and for the total number of blocks...
Can someone help me, or point me to the right topic?
PS: I know the kernel presented does nothing, but it's just to understand my limit problem.
I think the error you receive is self-explanatory. It points out that there is an out-of-bounds global read of a data type of size 8. The thread responsible for the out-of-bounds read is thread (0,0,0) in block (1,0,0). I suspect the responsible instruction is lxp.v[i][j] = shared[i*Nbins+j]; in the last nested for loop. Probably you allocate an amount of global memory that is not related to the number of blocks you launch, so that when you launch too many blocks you receive such an error.
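One hedged sketch of that idea, adapted to this kernel (not a verified diagnosis): din2Vec apparently calls new inside the kernel, and in-kernel new/malloc draws from a device heap whose size is fixed (8 MB by default) regardless of how many blocks you launch, so the total demand grows with the grid while the heap does not. Resizing the heap from the host in proportion to the launch is one thing worth trying:
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int Ndim = 2, Nbins = 128;
    const int threadsPerBlock = 64;
    const int numBlocks = 60;                      // more than the 30 that worked

    // Rough upper bound on what the whole grid allocates with in-kernel new:
    // each thread holds Ndim*Nbins doubles plus row pointers, padded generously
    // to cover allocator overhead.
    const size_t perThread = Ndim * Nbins * sizeof(double) + 256;
    const size_t heapBytes = size_t(numBlocks) * threadsPerBlock * perThread;

    // Must be called before the first kernel launch that uses in-kernel new/malloc.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapBytes);
    if (err != cudaSuccess) {
        std::printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Dynamic shared memory per block: 2*Ndim*Nbins + 2*Ndim doubles
    // (this reproduces the 4128 bytes mentioned in the question).
    const size_t shmemBytes = (2 * Ndim * Nbins + 2 * Ndim) * sizeof(double);
    // CUDAvegas<<<numBlocks, threadsPerBlock, shmemBytes>>>(/* ... */);
    (void)shmemBytes;
    return 0;
}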
