Large VIRT memory in Fortran

I have a large Fortran/MPI code that uses a very large amount of VIRT memory (~20 GB) when running, although the actual memory used (500 MB) is quite modest.
How can I profile the code to understand which part produces this huge VIRT memory?
At this stage, I'm even happy to use a brute-force approach.
What I have tried is to put sleep statements in the code and record the memory usage through "top", to try to pinpoint by bisection where the big allocations are.
However, this does not work, as the sleep call puts the memory usage to 0. Is there a way to freeze the code while keeping the current memory usage?
PS: I have tried Valgrind, but the code being so large, Valgrind never finished. Is there an alternative to Valgrind that is "easy" to use?
Thank you,
Sam

A solution for this is the following subroutine, modified (to read the VIRT figure rather than RSS) from Track memory usage in Fortran 90:
subroutine system_mem_usage(valueVIRT)
  use ifport          ! if on the Intel compiler (provides getpid)
  implicit none
  integer, intent(out) :: valueVIRT
  character(len=200) :: filename = ' '
  character(len=80)  :: line
  character(len=8)   :: pid_char = ' '
  integer :: pid
  logical :: ifxst

  valueVIRT = -1      ! return a negative number if not found

  !--- get process ID
  pid = getpid()
  write(pid_char,'(I8)') pid
  filename = '/proc/'//trim(adjustl(pid_char))//'/status'

  !--- read system file
  inquire (file=filename, exist=ifxst)
  if (.not. ifxst) then
    write (*,*) 'system file does not exist'
    return
  endif

  open (unit=100, file=filename, action='read')
  do
    read (100,'(a)',end=120) line
    if (line(1:7).eq.'VmSize:') then   ! VmSize is the VIRT figure shown by top, in kB
      read (line(8:),*) valueVIRT
      exit
    endif
  enddo
120 continue
  close (100)
  return
end subroutine system_mem_usage
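A minimal usage sketch (my own addition, not part of the original answer): call the subroutine before and after a suspect section and print the difference, so you can bisect the code without relying on sleep and top. Note that /proc reports VmSize in kB.

program bisect_virt
  implicit none
  integer :: virt_before, virt_after

  call system_mem_usage(virt_before)
  ! ... suspect allocation or library call goes here ...
  call system_mem_usage(virt_after)

  write (*,*) 'VIRT grew by ', virt_after - virt_before, ' kB'
end program bisect_virt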

Fortran deallocate array but not released in OS

I have the following question: how do I deallocate array memory inside a derived type? Like a%b%c,
how do I deallocate c? The specific problem is (the compilers I tried are gfortran from gcc 4.4.7 and ifort 18.0.1; OS: Linux):
module grist_domain_types
  implicit none
  public :: aaa

  type bbb
    real (8), allocatable :: c(:)
  end type bbb

  type aaa
    type(bbb), allocatable :: b(:)
  end type aaa
end module grist_domain_types

program main
  use grist_domain_types
  type(aaa) :: a
  integer(4) :: time, i

  time = 20
  allocate(a%b(1:100000000))
  call sleep(time)!--------------1
  do i = 1, 100000000
    allocate(a%b(i)%c(1:1))
  enddo
  call sleep(time)!--------------2
  do i = 1, 100000000
    deallocate(a%b(i)%c)
  enddo
  call sleep(time)!--------------3
  deallocate(a%b)
  call sleep(time)!--------------4
end program
First,"gfortran main.F90 -o main" to compile the program, and run this program. Then I use top -p processID to see memory. When the program is executed to 1, the memory is 4.5G. When the program is executed to 2, the memory is 7.5G. When the program is executed to 3, the memory is also 7.5G(but I think is 4.5G). When the program is executed to 4, the memory is 3G(I think is 0G or close to 0G). So deallocate(a%b(i)%c) does not seem to work. However, I use valgrind to see memory. the memory of this program is all deallocate...I used ifort and gfortran. This problem happens no matter which compiler I use. How to explain this question? I allocate many c array in this way,the program will finally crash due to insufficient memory. And how to solve it?
Take a look at this post from the Intel forum. There are two important pieces of information in there:
(From Doctor Fortran):
When you do a DEALLOCATE, the memory that was allocated returns to the pool used by the memory allocator (on Linux and OS X this is the same as C's malloc/free). The memory is not released back to the OS - it is very rare that this would even be possible. What often happens is that the pattern of allocations and deallocations causes virtual memory to be fragmented, so that while the total available space may be high, there may not be sufficient contiguous space to allocate a large item. Unlike with disks, there is no way to "defrag" memory.
(From Jim Dempsey)
See if you can deallocate the memory in the reverse order in which it was allocated. This can reduce memory fragmentation.
You may also refer to this other Intel post:
During the program run, the Fortran runtime library will manage your heap. Yes, if data is DEALLOCATED, the runtime may choose to wait to release that memory. It's an optimization - if you do another ALLOCATE with the same size it will just reuse those pages. If the heap starts to run low, it will do some collection but not until it's absolutely necessary.
Also, let me add something: check whether there are other objects dynamically created in scope, like automatic arrays or temporary array copies. Those could be demanding memory that is freed only when they go out of scope.
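For instance, a minimal sketch of my own (not from the question): work is an automatic array sized by the dummy argument n; it is created on entry to the procedure and its memory is only returned when the procedure goes out of scope.

subroutine smooth(n, x)
  implicit none
  integer, intent(in) :: n
  real(8), intent(inout) :: x(n)
  real(8) :: work(n)   ! automatic array: allocated on entry, freed on return

  work = x
  x = 0.5d0*(work + x)
end subroutine smooth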
Summing up: even if top says the memory is still in use, you should start to worry only if your program starts to crash or if Valgrind shows something weird.
I modified your program (to see where it was up to) and ran it on Windows 7 / gFortran 7.2.0. It does not demonstrate the memory retention you report, as memory reverts to 13 MB. Contrary to my comment, memory demand did not change during initialisation of c.
module grist_domain_types
  implicit none
  public :: aaa

  type bbb
    real (8), allocatable :: c(:)
  end type bbb

  type aaa
    type(bbb), allocatable :: b(:)
  end type aaa
end module grist_domain_types

program main
  use grist_domain_types
  type(aaa) :: a
  integer(4), parameter :: million = 1000000
  integer(4) :: n = 100*million
  integer(4) :: time = 5, i, pass

  do pass = 1, 5
    write (*,*) ' go #', pass
    allocate(a%b(1:n))
    write (*,*) 'allocate b'
    call sleep(time)!--------------1
    write (*,*) ' go'
    do i = 1, n
      allocate(a%b(i)%c(1:1))
    enddo
    write (*,*) 'allocate c'
    call sleep(time)!--------------2
    write (*,*) ' go'
    do i = 1, n
      a%b(i)%c = real(i)
    enddo
    write (*,*) 'use c'
    call sleep(time)!--------------2a
    write (*,*) ' go'
    do i = 1, n
      deallocate(a%b(i)%c)
    enddo
    write (*,*) 'deallocate c'
    call sleep(time)!--------------3
    write (*,*) ' go'
    deallocate(a%b)
    write (*,*) 'deallocate b'
    call sleep(time)!--------------4
  end do
  write (*,*) ' done : exit ?'
  read (*,*) i
end program
edit: I have wrapped the test in a do pass ... loop to repeat the memory demands. This shows no memory leakage for this Fortran program. I used Task Manager to identify memory usage, both for this program and for the O/S. Your particular O/S and Fortran compiler may behave differently.

MPI Fortran code: how to share data on node via OpenMP?

I am working on a Fortran code that already uses MPI.
Now I am facing a situation where a set of data grows very large but is the same for every process, so I would prefer to store it in memory only once per node, with all processes on a node accessing the same data.
Storing it once per process would go beyond the available RAM.
Is it somehow possible to achieve something like that with OpenMP?
Data sharing per node is the only thing I need; no other per-node parallelisation is required, because that is already done via MPI.
You don't need to implement a hybrid MPI+OpenMP code if it is only for sharing a chunk of data. What you actually have to do is:
1) Split the world communicator into groups that span the same host/node. That is really easy if your MPI library implements MPI-3.0 - all you need to do is call MPI_COMM_SPLIT_TYPE with split_type set to MPI_COMM_TYPE_SHARED:
USE mpi_f08

TYPE(MPI_Comm) :: hostcomm

CALL MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                         MPI_INFO_NULL, hostcomm)
MPI-2.2 or earlier does not provide the MPI_COMM_SPLIT_TYPE operation, and one has to get somewhat creative. You could for example use my simple split-by-host implementation, which can be found on Github here.
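For illustration only, here is a minimal hedged sketch of one such approach (my own rough example, not the linked Github implementation): every rank gathers all processor names and uses the index of the lowest rank on the same host as its color for MPI_COMM_SPLIT.

subroutine split_by_host(comm, hostcomm, ierr)
  use mpi
  implicit none
  integer, intent(in)  :: comm
  integer, intent(out) :: hostcomm, ierr
  character(len=MPI_MAX_PROCESSOR_NAME) :: myname
  character(len=MPI_MAX_PROCESSOR_NAME), allocatable :: names(:)
  integer :: rank, nprocs, namelen, color, i

  call MPI_Comm_rank(comm, rank, ierr)
  call MPI_Comm_size(comm, nprocs, ierr)
  call MPI_Get_processor_name(myname, namelen, ierr)
  myname(namelen+1:) = ' '   ! blank-pad so the string comparison is well defined

  allocate(names(nprocs))
  ! Every rank learns every other rank's host name
  call MPI_Allgather(myname, MPI_MAX_PROCESSOR_NAME, MPI_CHARACTER, &
                     names,  MPI_MAX_PROCESSOR_NAME, MPI_CHARACTER, &
                     comm, ierr)

  ! Color = index of the first rank that lives on the same host
  color = 0
  do i = 1, nprocs
    if (names(i) == myname) then
      color = i
      exit
    end if
  end do

  call MPI_Comm_split(comm, color, rank, hostcomm, ierr)
  deallocate(names)
end subroutine split_by_host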
2) Now that the processes residing on the same node are part of the same communicator hostcomm, they can create a block of shared memory and use it to exchange data. Again, MPI-3.0 provides a (relatively) easy and portable way to do that:
USE mpi_f08
USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR, C_F_POINTER

INTEGER :: hostrank
INTEGER(KIND=MPI_ADDRESS_KIND) :: size
INTEGER :: disp_unit
TYPE(C_PTR) :: baseptr
TYPE(MPI_Win) :: win
TYPE(MY_DATA_TYPE), POINTER :: shared_data

! We only want one process per host to allocate memory
! Set size to 0 in all processes but one
CALL MPI_Comm_rank(hostcomm, hostrank)
if (hostrank == 0) then
   size = 10000000 ! Put the actual data size here
else
   size = 0
end if
disp_unit = 1

CALL MPI_Win_allocate_shared(size, disp_unit, MPI_INFO_NULL, &
                             hostcomm, baseptr, win)

! Obtain the location of the memory segment
if (hostrank /= 0) then
   CALL MPI_Win_shared_query(win, 0, size, disp_unit, baseptr)
end if

! baseptr can now be associated with a Fortran pointer
! and thus used to access the shared data
CALL C_F_POINTER(baseptr, shared_data)

! Use shared_data as if it was ALLOCATE'd
! ...

! Destroy the shared memory window
CALL MPI_Win_free(win)
The way that code works is that it uses the MPI-3.0 functionality for allocating shared memory windows. MPI_WIN_ALLOCATE_SHARED allocates a chunk of shared memory in each process. Since you want to share one block of data, it only makes sense to allocate it in a single process and not have it spread across the processes, therefore size is set to 0 for all but one rank while making the call. MPI_WIN_SHARED_QUERY is used to find out the address at which that shared memory block is mapped in the virtual address space of the calling process. Once the address is known, the C pointer can be associated with a Fortran pointer using the C_F_POINTER() subroutine, and the latter can be used to access the shared memory. Once done, the shared memory has to be freed by destroying the shared memory window with MPI_WIN_FREE.
MPI-2.2 or earlier does not provide shared memory windows. In that case one has to use OS-dependent APIs to create shared memory blocks, e.g. the standard POSIX sequence shm_open() / ftruncate() / mmap(). A utility C function callable from Fortran has to be written in order to perform those operations. See that code for some inspiration. The void * returned by mmap() can be passed directly to the Fortran code in a C_PTR variable, which can then be associated with a Fortran pointer.
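As a hedged sketch of the Fortran side only (allocate_shm is a hypothetical helper name, not taken from the linked code), the binding could look like this; the C side would wrap shm_open()/ftruncate()/mmap() and return the mapped address:

! Hypothetical binding to a C helper that wraps shm_open/ftruncate/mmap
interface
  type(c_ptr) function allocate_shm(name, nbytes) bind(C, name="allocate_shm")
    use, intrinsic :: iso_c_binding, only : c_ptr, c_char, c_size_t
    character(kind=c_char), dimension(*), intent(in) :: name  ! NUL-terminated
    integer(kind=c_size_t), value :: nbytes
  end function allocate_shm
end interface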
With this answer I want to add a complete running code example (for ifort 15 and mvapich 2.1). The MPI shared memory concept is still pretty new and in particular for Fortran there aren't many code examples out there. It is based on the answer from Hristo and a very useful email on the mvapich mailing list (http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2014-June/005003.html).
The code example is based on the problems I ran into and adds to Hristo's answer in the following ways:
Uses mpi instead of mpi_f08 (some libraries do not provide a full Fortran 2008 interface yet)
Has ierr added to the respective MPI calls
Explicitly calculates the window size as elements*elementsize
Shows how to use C_F_POINTER to map the shared memory onto a multi-dimensional array
Reminds you to use MPI_WIN_FENCE after modifying the shared memory
Notes that Intel MPI (5.0.1.035) needs an additional MPI_BARRIER after the MPI_WIN_FENCE, since it only guarantees that "between two MPI_Win_fence calls, all RMA operations are completed" (https://software.intel.com/en-us/blogs/2014/08/06/one-sided-communication)
Kudos go to Hristo and Michael Rachner.
program sharedmemtest
  USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR, C_F_POINTER
  use mpi
  implicit none
  integer, parameter :: dp = selected_real_kind(14,200)
  integer :: win, win2, hostcomm, hostrank
  INTEGER(KIND=MPI_ADDRESS_KIND) :: windowsize
  INTEGER :: disp_unit, my_rank, ierr, total
  TYPE(C_PTR) :: baseptr, baseptr2
  real(dp), POINTER :: matrix_elementsy(:,:,:,:)
  integer, allocatable :: arrayshape(:)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr) ! get the rank of this process
  call MPI_COMM_SIZE(MPI_COMM_WORLD, total, ierr)   ! get the total number of processes
  CALL MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, hostcomm, ierr)
  CALL MPI_Comm_rank(hostcomm, hostrank, ierr)

  ! Gratefully based on: http://stackoverflow.com/questions/24797298/mpi-fortran-code-how-to-share-data-on-node-via-openmp
  ! and https://gcc.gnu.org/onlinedocs/gfortran/C_005fF_005fPOINTER.html
  ! We only want one process per host to allocate memory
  ! Set size to 0 in all processes but one
  allocate(arrayshape(4))
  arrayshape = (/ 10,10,10,10 /)
  if (hostrank == 0) then
    windowsize = int(10**4,MPI_ADDRESS_KIND)*8_MPI_ADDRESS_KIND ! *8 for double; put the actual data size here
  else
    windowsize = 0_MPI_ADDRESS_KIND
  end if
  disp_unit = 1
  CALL MPI_Win_allocate_shared(windowsize, disp_unit, MPI_INFO_NULL, hostcomm, baseptr, win, ierr)

  ! Obtain the location of the memory segment
  if (hostrank /= 0) then
    CALL MPI_Win_shared_query(win, 0, windowsize, disp_unit, baseptr, ierr)
  end if

  ! baseptr can now be associated with a Fortran pointer
  ! and thus used to access the shared data
  CALL C_F_POINTER(baseptr, matrix_elementsy, arrayshape)

  !!! your code here!
  !!! sample below
  if (hostrank == 0) then
    matrix_elementsy = 0.0_dp
    matrix_elementsy(1,2,3,4) = 1.0_dp
  end if
  CALL MPI_WIN_FENCE(0, win, ierr)
  print *, "my_rank=", my_rank, matrix_elementsy(1,2,3,4), matrix_elementsy(1,2,3,5)
  !!! end sample code

  call MPI_WIN_FENCE(0, win, ierr)
  call MPI_BARRIER(MPI_COMM_WORLD, ierr)
  call MPI_Win_free(win, ierr)
  call MPI_FINALIZE(ierr)
end program
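For reference, a hedged sketch of how one might build and launch the example (the compiler wrapper and launcher names depend on your MPI installation):

mpif90 sharedmemtest.f90 -o sharedmemtest
mpirun -np 4 ./sharedmemtest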
In the spirit of adding Fortran shared-memory MPI examples, I'd like to extend ftiaronsem's code to incorporate a loop, so that the behavior of MPI_Win_fence and MPI_Barrier is clearer (at least it is for me now, anyway).
Specifically, try running the code with either or both of the MPI_Win_fence and MPI_Barrier calls in the loop commented out to see the effect. Alternatively, reverse their order.
Removing the MPI_Win_Fence allows the write statement to display memory that has not been updated yet.
Removing the MPI_Barrier allows other processes to run the next iteration and change memory before a process has the chance to write.
The previous answers really helped me implement the shared memory paradigm in my MPI code. Thanks.
program sharedmemtest
  USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR, C_F_POINTER
  use mpi
  implicit none
  integer, parameter :: dp = selected_real_kind(14,200)
  integer :: win, win2, hostcomm, hostrank
  INTEGER(KIND=MPI_ADDRESS_KIND) :: windowsize
  INTEGER :: disp_unit, my_rank, ierr, total, i
  TYPE(C_PTR) :: baseptr, baseptr2
  real(dp), POINTER :: matrix_elementsy(:,:,:,:)
  integer, allocatable :: arrayshape(:)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr) ! get the rank of this process
  call MPI_COMM_SIZE(MPI_COMM_WORLD, total, ierr)   ! get the total number of processes
  CALL MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, hostcomm, ierr)
  CALL MPI_Comm_rank(hostcomm, hostrank, ierr)

  ! Gratefully based on: http://stackoverflow.com/questions/24797298/mpi-fortran-code-how-to-share-data-on-node-via-openmp
  ! and https://gcc.gnu.org/onlinedocs/gfortran/C_005fF_005fPOINTER.html
  ! We only want one process per host to allocate memory
  ! Set size to 0 in all processes but one
  allocate(arrayshape(4))
  arrayshape = (/ 10,10,10,10 /)
  if (hostrank == 0) then
    windowsize = int(10**4,MPI_ADDRESS_KIND)*8_MPI_ADDRESS_KIND ! *8 for double; put the actual data size here
  else
    windowsize = 0_MPI_ADDRESS_KIND
  end if
  disp_unit = 1
  CALL MPI_Win_allocate_shared(windowsize, disp_unit, MPI_INFO_NULL, hostcomm, baseptr, win, ierr)

  ! Obtain the location of the memory segment
  if (hostrank /= 0) then
    CALL MPI_Win_shared_query(win, 0, windowsize, disp_unit, baseptr, ierr)
  end if

  ! baseptr can now be associated with a Fortran pointer
  ! and thus used to access the shared data
  CALL C_F_POINTER(baseptr, matrix_elementsy, arrayshape)

  !!! your code here!
  !!! sample below
  if (hostrank == 0) then
    matrix_elementsy = 0.0_dp
  endif
  call MPI_WIN_FENCE(0, win, ierr)
  do i = 1, 15
    if (hostrank == 0) then
      matrix_elementsy(1,2,3,4) = i * 1.0_dp
      matrix_elementsy(1,2,2,4) = i * 2.0_dp
    elseif ((hostrank > 5) .and. (hostrank < 11)) then ! code for non-root ranks to do something different
      matrix_elementsy(1,2,hostrank,4) = hostrank * 1.0 * i
    endif
    call MPI_WIN_FENCE(0, win, ierr)
    write(*,'(A, I4, I4, 10F7.1)') "my_rank=", my_rank, i, matrix_elementsy(1,2,:,4)
    call MPI_BARRIER(MPI_COMM_WORLD, ierr)
  enddo
  !!! end sample code

  call MPI_WIN_FENCE(0, win, ierr)
  call MPI_BARRIER(MPI_COMM_WORLD, ierr)
  call MPI_Win_free(win, ierr)
  call MPI_FINALIZE(ierr)
end program

Can anyone explain this Erlang Crash dump?

I got this error report when running my Erlang application.
Crash dump was written to: erl_crash.dump
eheap_alloc: Cannot allocate 18446744071692551144 bytes of memory (of type "heap").
It's a simple program run on a simple PC. How is it possible to get such numbers? By the way, it is trying to allocate roughly 10^10 GB. The program basically only runs tail recursion and a fairly small number of processes.
If you get this error while running your application, it means that one of your functions is recursing and trying to allocate more memory than the OS can provide to the VM, so the VM crashes with that memory allocation error. (The requested size is also suspiciously close to 2^64 bytes, which suggests a negative size being interpreted as an unsigned integer.)
Previously, when I ran into similar dumps, the cause was a huge mailbox in a process; it had piled up millions of messages.
You can check for that with this snippet of code:
top() ->
Procs = lists:foldl(fun(Pid, Acc) ->
case erlang:process_info(Pid, message_queue_len) of
{_K, V} -> [{Pid, V} | Acc];
_ -> Acc
end
end, [], erlang:processes()),
lists:keysort(2, Procs).

How to get the available memory on the device

I'm trying to find out how much free memory I have on the device. To do this I call the CUDA function cuMemGetInfo from Fortran code, but it returns negative values for the free amount of memory, so there's clearly something wrong.
Does anyone know how I can do that?
Thanks
EDIT:
Sorry, my question was in fact not very clear. I'm using OpenACC in Fortran and I call the C++ CUDA function cudaMemGetInfo. Finally I could fix the code; the problem was indeed the kind of the variables I was using. Switching to size_t fixed everything. This is the Fortran interface I'm using:
interface
  subroutine get_dev_mem(total, free) bind(C, name="get_dev_mem")
    use iso_c_binding
    integer(kind=c_size_t) :: total, free
  end subroutine get_dev_mem
end interface
and this is the CUDA code:
#include <cuda.h>
#include <cuda_runtime.h>
extern "C" {
void get_dev_mem(size_t& total, size_t& free)
{
cuMemGetInfo(&free, &total);
}
}
There's one last question: I pushed an array onto the GPU and checked its size using cuMemGetInfo, then I computed its size by counting the bytes, but I don't get the same answer. Why? In the first case it is 3052 MB large, in the latter 3051 MB. Could this difference of 1 MB be the size of the array descriptor? Here is the code that I used:
integer, parameter :: long = selected_int_kind(12)
integer(kind=c_size_t) :: total, free1, free2
real(8), dimension(:), allocatable :: a
integer(kind=long) :: N, eight, four

allocate(a(four*N))
! some OpenACC stuff in order to init the gpu
call get_dev_mem(total, free1)
!$acc data copy(a)
call get_dev_mem(total, free2)
print *, "size a in the gpu = ", (free1-free2)/1024/1024, " mb"
print *, "size a in theory  = ", (eight*four*N)/1024/1024, " mb"
!$acc end data
deallocate(a)
Right, so, as the commenters have suggested, we're not sure exactly what you're running, but filling in the missing details by guessing, here's a shot:
Most CUDA API calls return a status code (or error code if you will); this is true both in C/C++ and in Fortran, as we can see in the Portland Group's CUDA Fortran Manual:
Most of the runtime API routines are integer functions that return an error code; they return a value of zero if the call was successful, and a nonzero value if there was an error. To interpret the error codes, refer to “Error Handling,” on page 48.
This is the case for cudaMemGetInfo() specifically:
integer function cudaMemGetInfo( free, total )
  integer(kind=cuda_count_kind) :: free, total
The two integers free and total are of kind cuda_count_kind which, if I am not mistaken, is effectively unsigned... anyway, I would guess that what you're getting is an error code. Have a look at the Error Handling section on page 48 of the manual.
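For what it's worth, a minimal hedged sketch in CUDA Fortran (assuming the PGI cudafor module, which provides cudaMemGetInfo, cudaSuccess and cudaGetErrorString): always check the function result, so an error code is never mistaken for a byte count.

program devmem
  use cudafor   ! PGI/NVIDIA CUDA Fortran runtime module (assumption)
  implicit none
  integer(kind=cuda_count_kind) :: free, total
  integer :: istat

  istat = cudaMemGetInfo(free, total)
  if (istat /= cudaSuccess) then
    write(*,*) 'cudaMemGetInfo failed: ', cudaGetErrorString(istat)
  else
    write(*,*) 'free (MB): ', free/1024/1024, '  total (MB): ', total/1024/1024
  end if
end program devmem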

Freeing memory allocated with newCString

As the library docs say, a CString created with newCString must be freed with the free function. I expected that when the CString is created it would take some memory, and that when it is released with free the memory usage would go down, but it didn't! Here is example code:
module Main where

import Foreign
import Foreign.C.String
import System.IO

wait = do
  putStr "Press enter" >> hFlush stdout
  _ <- getLine
  return ()

main = do
  let s = concat $ replicate 1000000 ['0'..'9']
  cs <- newCString s
  cs `seq` wait -- (1)
  free cs
  wait -- (2)
When the program stopped at (1), htop showed that memory usage was somewhere around 410M; this is OK. I pressed enter and the program stopped at line (2), but memory usage was still 410M despite cs having been freed!
How is this possible? A similar program written in C behaves as it should. What am I missing here?
The issue is that free just indicates to the garbage collector that it can now collect the string. That doesn't actually force the garbage collector to run though -- it just indicates that the CString is now garbage. It is still up to the GC to decide when to run, based on heap pressure heuristics.
You can force a major collection by calling performGC straight after the call to free, which immediately reduces the memory to 5M or so.
E.g. this program:
import Foreign
import Foreign.C.String
import System.IO
import System.Mem

wait = do
  putStr "Press enter" >> hFlush stdout
  _ <- getLine
  return ()

main = do
  let s = concat $ replicate 1000000 ['0'..'9']
  cs <- newCString s
  cs `seq` wait -- (1)
  free cs
  performGC
  wait -- (2)
It behaves as expected, with the memory profile (shown as an image in the original post) confirming it: the first red dot is the call to performGC, immediately deallocating the string. The program then hovers around 5M until terminated.
