fortran deallocate array but not released in OS

fortran deallocate array but not released in OS - memory

I have the following question:how do I deallocate array memory in type? Like a%b%c,
how do I deallocate c? the specific problem is(The compiler environment I tried are gfortran version gcc4.4.7 and ifort version 18.0.1.OS:linux):
module grist_domain_types
implicit none
public :: aaa
type bbb
real (8), allocatable :: c(:)
end type bbb
type aaa
type(bbb), allocatable :: b(:)
end type aaa
end module grist_domain_types
program main
use grist_domain_types
type(aaa) :: a
integer(4) :: time,i
time=20
allocate(a%b(1:100000000))
call sleep(time)!--------------1
do i=1,100000000
allocate(a%b(i)%c(1:1))
enddo
call sleep(time)!--------------2
do i=1,100000000
deallocate(a%b(i)%c)
enddo
call sleep(time)!--------------3
deallocate(a%b)
call sleep(time)!--------------4
end program
First,"gfortran main.F90 -o main" to compile the program, and run this program. Then I use top -p processID to see memory. When the program is executed to 1, the memory is 4.5G. When the program is executed to 2, the memory is 7.5G. When the program is executed to 3, the memory is also 7.5G(but I think is 4.5G). When the program is executed to 4, the memory is 3G(I think is 0G or close to 0G). So deallocate(a%b(i)%c) does not seem to work. However, I use valgrind to see memory. the memory of this program is all deallocate...I used ifort and gfortran. This problem happens no matter which compiler I use. How to explain this question? I allocate many c array in this way,the program will finally crash due to insufficient memory. And how to solve it?

Take a look at this post from the Intel forum. There are 2 important information in there:
(From Doctor Forran):
When you do a DEALLOCATE, the memory that was allocated returns to the pool used by the memory allocator (on Linux and OS X this is the same as C's malloc/free). The memory is not released back to the OS - it is very rare that this would even be possible. What often happens is that the pattern of allocations and deallocations causes virtual memory to be fragmented, so that while the total available space may be high, there may not be sufficient contiguous space to allocate a large item. Unlike with disks, there is no way to "defrag" memory.
(From Jim Dempsey)
See if you can deallocate the memory in the reverse order in which it was allocated. This can reduce memory fragmentation.
You may also refer to this other Intel post:
During the program run, the Fortran runtime library will manage your heap. Yes, if data is DEALLOCATED, the runtime may choose to wait to release that memory. It's an optimization - if you do another ALLOCATE with the same size it will just reuse those pages. If the heap starts to run low, it will do some collection but not until it's absolutely necessary.
Also, let me add something: Check if there aren't other objects dynamically created in scope, like automatic arrays or temporal array copies. That could be demanding memory that may be freed only when they get out of scope.
Summing up, even if 'top' says the memory is still in use, you should start to worry only if your program starts to crash or if Valgrind shows something wreid.

I modified your program ( to see where it was up to ) and ran on Windows 7 / gFortran 7.2.0. It does not demonstrate the memory retention as you report, as memory reverts to 13 mb. Contrary to my comment, memory demand did not change during initialisation of c.
module grist_domain_types
implicit none
public :: aaa
type bbb
real (8), allocatable :: c(:)
end type bbb
type aaa
type(bbb), allocatable :: b(:)
end type aaa
end module grist_domain_types
program main
use grist_domain_types
type(aaa) :: a
integer(4),parameter :: million = 1000000
integer(4) :: n = 100*million
integer(4) :: time = 5, i, pass
do pass = 1,5
write (*,*) ' go #', pass
allocate(a%b(1:n))
write (*,*) 'allocate b'
call sleep(time)!--------------1
write (*,*) ' go'
do i=1,n
allocate(a%b(i)%c(1:1))
enddo
write (*,*) 'allocate c'
call sleep(time)!--------------2
write (*,*) ' go'
do i=1,n
a%b(i)%c = real(i)
enddo
write (*,*) 'use c'
call sleep(time)!--------------2a
write (*,*) ' go'
do i=1,n
deallocate(a%b(i)%c)
enddo
write (*,*) 'deallocate c'
call sleep(time)!--------------3
write (*,*) ' go'
deallocate(a%b)
write (*,*) 'deallocate b'
call sleep(time)!--------------4
end do
write (*,*) ' done : exit ?'
read (*,*) i
end program
edit: I have given the test a repeat with do pass... to repeat the memory demands. This shows no memory leakage for this Fortran program. I use Task manager to identify memory usage, both for this program and the O/S. Your particular O/S and Fortran compiler may be different.

Related

Memory allocation when a pointer is assigned

I am trying to avoid memory allocation and local copy in my code. Below is a small example :
module test
implicit none
public
integer, parameter :: nb = 1000
type :: info
integer n(nb)
double precision d(nb)
end type info
type(info), save :: abc
type(info), target, save :: def
contains
subroutine test_copy(inf)
implicit none
type(info), optional :: inf
type(info) :: local
if (present(inf)) then
local = inf
else
local = abc
endif
local%n = 1
local%d = 1.d0
end subroutine test_copy
subroutine test_assoc(inf)
implicit none
type(info), target, optional :: inf
type(info), pointer :: local
if (present(inf)) then
local => inf
else
local => def
endif
local%n = 1
local%d = 1.d0
end subroutine test_assoc
end module test
program run
use test
use caliper_mod
implicit none
type(ConfigManager), save :: mgr
abc%n = 0
abc%d = 0.d0
def%n = 0
def%d = 0.d0
! Init caliper profiling
mgr = ConfigManager_new()
call mgr%add("runtime-report(mem.highwatermark,output=stdout)")
call mgr%start
! Call subroutine with copy
call cali_begin_region("test_copy")
call test_copy()
call cali_end_region("test_copy")
! Call subroutine with pointer
call cali_begin_region("test_assoc")
call test_assoc()
call cali_end_region("test_assoc")
! End caliper profiling
call mgr%flush()
call mgr%stop()
call mgr%delete()
end program run
As far as i understand, the subroutine test_copy should produce a local copy while the subroutine test_assoc should only assign a pointer to some existing object. However, memory profiling with caliper leads to :
$ ./a.out
Path Min time/rank Max time/rank Avg time/rank Time % Allocated MB
test_assoc 0.000026 0.000026 0.000026 0.493827 0.000021
test_copy 0.000120 0.000120 0.000120 2.279202 0.000019
What looks odd is that Caliper shows the exact same amount of memory allocated whatever the value of the parameter nb. Am I using the right tool the right way to track memory allocation and local copy ?
Test performed with gfortran 11.2.0 and Caliper 2.8.0.

The local object is just a simple small scalar, albeit containing an array component, and will be very likely placed on the stack.
A stack is a pre-allocated part of memory of fixed size. Stack "allocation" is actually just a change of value of the stack pointer which is just one integer value. No actual memory allocation from the operating system takes place during stack allocation. There will be no change in the memory the process occupies.

DTrace build in built-in variable stackdepth always return 0

I am recently using DTrace to analyze my iOS app。
Everything goes well except when I try to use the built-in variable stackDepth。
I read the document here where shows the introduction of built-in variable stackDepth.
So I write some D code
pid$target:::entry
{
self->entry_times[probefunc] = timestamp;
}
pid$target:::return
{
printf ("-----------------------------------\n");
this->delta_time = timestamp - self->entry_times[probefunc];
printf ("%s\n", probefunc);
printf ("stackDepth %d\n", stackdepth);
printf ("%d---%d\n", this->delta_time, epid);
ustack();
printf ("-----------------------------------\n");
}
And run it with sudo dtrace -s temp.d -c ./simple.out。 unstack() function goes very well, but stackDepth always appears to 0。
I tried both on my iOS app and a simple C program.
So anybody knows what's going on?
And how to get stack depth when the probe fires?

You want to use ustackdepth -- the user-land stack depth.
The stackdepth variable refers to the kernel thread stack depth; the ustackdepth variable refers to the user-land thread stack depth. When the traced program is executing in user-land, stackdepth will (should!) always be 0.
ustackdepth is calculated using the same logic as is used to walk the user-land stack as with ustack() (just as stackdepth and stack() use similar logic for the kernel stack).

This seems like a bug in the Mac / iOS implementation of DTrace to me.
However, since you're already probing every function entry and return, you could just keep a new variable self->depth and do ++ in the :::entry probe and -- in the :::return probe. This doesn't work quite right if you run it against optimized code, because any tail-call-optimized functions may look like they enter but never return. To solve that, you can turn off optimizations.
Also, because what you're doing looks a lot like this, I thought maybe you would be interested in the -F option:
Coalesce trace output by identifying function entry and return.
Function entry probe reports are indented and their output is prefixed
with ->. Function return probe reports are unindented and their output
is prefixed with <-.
The normal script to use with -F is something like:
pid$target::some_function:entry { self->trace = 1 }
pid$target:::entry /self->trace/ {}
pid$target:::return /self->trace/ {}
pid$target::some_function:return { self->trace = 0 }
Where some_function is the function whose execution you want to be printed. The output shows a textual call graph for that execution:
-> some_function
-> another_function
-> malloc
<- malloc
<- another_function
-> yet_another_function
-> strcmp
<- strcmp
-> malloc
<- malloc
<- yet_another_function
<- some_function

Large VIRT memory in Fortran

I have a large Fortran/MPI code that when running uses a very large amount of VIRT memory (~20G) although the actual memory used (500 mb) is quite modest.
How can I profile the code to understand which part produces this huge VIRT memory?
At this stage, I'm even happy to use a brute force approach.
What I have tried is to put sleep statements in the code and recorded the memory usage through "top" to try to pin-point by bisection where the big allocation are.
However, this does not work as the sleep call puts the memory usage to 0. Is there a way to freeze the code while keeping current memory usage?
PS: I have tried VALGRIND but the code being so large, VALGRIND never finished. Is there an alternative to VALGRIND that is "easy" to used?
Thank you,
Sam

A solution for this is this modified (to get VIRT memory) subroutine from Track memory usage in Fortran 90
subroutine system_mem_usage(valueRSS)
use ifport !if on intel compiler
implicit none
integer, intent(out) :: valueRSS
character(len=200):: filename=' '
character(len=80) :: line
character(len=8) :: pid_char=' '
integer :: pid
logical :: ifxst
valueRSS=-1 ! return negative number if not found
!--- get process ID
pid=getpid()
write(pid_char,'(I8)') pid
filename='/proc/'//trim(adjustl(pid_char))//'/status'
!--- read system file
inquire (file=filename,exist=ifxst)
if (.not.ifxst) then
write (*,*) 'system file does not exist'
return
endif
open(unit=100, file=filename, action='read')
do
read (100,'(a)',end=120) line
if (line(1:7).eq.'VmSize:') then
read (line(8:),*) valueRSS
exit
endif
enddo
120 continue
close(100)
return
end subroutine system_mem_usage

MPI Fortran code: how to share data on node via openMP?

I am working on an Fortan code that already uses MPI.
Now, I am facing a situation, where a set of data grows very large but is same for every process, so I would prefer to store it in memory only once per node and all processes on one node access the same data.
Storing it once for every process would go beyond the available RAM.
Is it somehow possible to achieve something like that with openMP?
Data sharing per node is the only thing I would like to have, no other per node paralellisation required, because this is already done via MPI.

You don't need to implement a hybrid MPI+OpenMP code if it is only for sharing a chunk of data. What you actually have to do is:
1) Split the world communicator into groups that span the same host/node. That is really easy if your MPI library implements MPI-3.0 - all you need to do is call MPI_COMM_SPLIT_TYPE with split_type set to MPI_COMM_TYPE_SHARED:
USE mpi_f08
TYPE(MPI_Comm) :: hostcomm
CALL MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
MPI_INFO_NULL, hostcomm)
MPI-2.2 or earlier does not provide the MPI_COMM_SPLIT_TYPE operation and one has to get somewhat creative. You could for example use my simple split-by-host implementation that can be found on Github here.
2) Now that processes that reside on the same node are part of the same communicator hostcomm, they can create a block of shared memory and use it to exchange data. Again, MPI-3.0 provides an (relatively) easy and portable way to do that:
USE mpi_f08
USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR, C_F_POINTER
INTEGER :: hostrank
INTEGER(KIND=MPI_ADDRESS_KIND) :: size
INTEGER :: disp_unit
TYPE(C_PTR) :: baseptr
TYPE(MPI_Win) :: win
TYPE(MY_DATA_TYPE), POINTER :: shared_data
! We only want one process per host to allocate memory
! Set size to 0 in all processes but one
CALL MPI_Comm_rank(hostcomm, hostrank)
if (hostrank == 0) then
size = 10000000 ! Put the actual data size here
else
size = 0
end if
disp_unit = 1
CALL MPI_Win_allocate_shared(size, disp_unit, MPI_INFO_NULL, &
hostcomm, baseptr, win)
! Obtain the location of the memory segment
if (hostrank /= 0) then
CALL MPI_Win_shared_query(win, 0, size, disp_unit, baseptr)
end if
! baseptr can now be associated with a Fortran pointer
! and thus used to access the shared data
CALL C_F_POINTER(baseptr, shared_data)
! Use shared_data as if it was ALLOCATE'd
! ...
! Destroy the shared memory window
CALL MPI_Win_free(win)
The way that code works is that it uses the MPI-3.0 functionality for allocating shared memory windows. MPI_WIN_ALLOCATE_SHARED allocates a chunk of shared memory in each process. Since you want to share one block of data, it only makes sense to allocate it in a single process and not have it spread across the processes, therefore size is set to 0 for all but one ranks while making the call. MPI_WIN_SHARED_QUERY is used to find out the address at which that shared memory block is mapped in the virtual address space of the calling process. One the address is known, the C pointer can be associated with a Fortran pointer using the C_F_POINTER() subroutine and the latter can be used to access the shared memory. Once done, the shared memory has to be freed by destroying the shared memory window with MPI_WIN_FREE.
MPI-2.2 or earlier does not provide shared memory windows. In that case one has to use the OS-dependent APIs for creation of shared memory blocks, e.g. the standard POSIX sequence shm_open() / ftruncate() / mmap(). A utility C function callable from Fortran has to be written in order to perform those operations. See that code for some inspiration. The void * returned by mmap() can be passed directly to the Fortran code in a C_PTR type variable that can be then associated with a Fortran pointer.

With this answer I want to add a complete running code example (for ifort 15 and mvapich 2.1). The MPI shared memory concept is still pretty new and in particular for Fortran there aren't many code examples out there. It is based on the answer from Hristo and a very useful email on the mvapich mailing list (http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2014-June/005003.html).
The code example is based on the problems I ran into and adds to Hristo's answer in the following ways:
uses mpi instead of mpi_f08 (some libraries do not provide a full fortran 2008 interface yet)
Has ierr added to the respective MPI calls
Explicit calculation of the windowsize elements*elementsize
How to use C_F_POINTER to map the shared memory to a multi dimensional array
Reminds to use MPI_WIN_FENCE after modifying the shared memory
Intel mpi (5.0.1.035) needs an additional MPI_BARRIER after the MPI_FENCE since it only guarantees that between "between two MPI_Win_fence calls, all RMA operations are completed." (https://software.intel.com/en-us/blogs/2014/08/06/one-sided-communication)
Kudos go to Hristo and Michael Rachner.
program sharedmemtest
USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR, C_F_POINTER
use mpi
implicit none
integer, parameter :: dp = selected_real_kind(14,200)
integer :: win,win2,hostcomm,hostrank
INTEGER(KIND=MPI_ADDRESS_KIND) :: windowsize
INTEGER :: disp_unit,my_rank,ierr,total
TYPE(C_PTR) :: baseptr,baseptr2
real(dp), POINTER :: matrix_elementsy(:,:,:,:)
integer,allocatable :: arrayshape(:)
call MPI_INIT( ierr )
call MPI_COMM_RANK(MPI_COMM_WORLD,MY_RANK,IERR) !GET THE RANK OF ONE PROCESS
call MPI_COMM_SIZE(MPI_COMM_WORLD,Total,IERR) !GET THE TOTAL PROCESSES OF THE COMM
CALL MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, hostcomm,ierr)
CALL MPI_Comm_rank(hostcomm, hostrank,ierr)
! Gratefully based on: http://stackoverflow.com/questions/24797298/mpi-fortran-code-how-to-share-data-on-node-via-openmp
! and https://gcc.gnu.org/onlinedocs/gfortran/C_005fF_005fPOINTER.html
! We only want one process per host to allocate memory
! Set size to 0 in all processes but one
allocate(arrayshape(4))
arrayshape=(/ 10,10,10,10 /)
if (hostrank == 0) then
windowsize = int(10**4,MPI_ADDRESS_KIND)*8_MPI_ADDRESS_KIND !*8 for double ! Put the actual data size here
else
windowsize = 0_MPI_ADDRESS_KIND
end if
disp_unit = 1
CALL MPI_Win_allocate_shared(windowsize, disp_unit, MPI_INFO_NULL, hostcomm, baseptr, win, ierr)
! Obtain the location of the memory segment
if (hostrank /= 0) then
CALL MPI_Win_shared_query(win, 0, windowsize, disp_unit, baseptr, ierr)
end if
! baseptr can now be associated with a Fortran pointer
! and thus used to access the shared data
CALL C_F_POINTER(baseptr, matrix_elementsy,arrayshape)
!!! your code here!
!!! sample below
if (hostrank == 0) then
matrix_elementsy=0.0_dp
matrix_elementsy(1,2,3,4)=1.0_dp
end if
CALL MPI_WIN_FENCE(0, win, ierr)
print *,"my_rank=",my_rank,matrix_elementsy(1,2,3,4),matrix_elementsy(1,2,3,5)
!!! end sample code
call MPI_WIN_FENCE(0, win, ierr)
call MPI_BARRIER(MPI_COMM_WORLD,ierr)
call MPI_Win_free(win,ierr)
call MPI_FINALIZE(IERR)
end program

In the spirit of adding Fortran shared memory MPI examples, I'd like to extend ftiaronsem's code to incorporate a loop so that the behavior of MPI_Win_fence and MPI_Barrier is clearer (at least it is for me now, anyway).
Specifically, try running the code with either or both of the MPI_Win_Fence or MPI_Barrier commands in the loop commented out to see the effect. Alternatively, reverse their order.
Removing the MPI_Win_Fence allows the write statement to display memory that has not been updated yet.
Removing the MPI_Barrier allows other processes to run the next iteration and change memory before a process has the chance to write.
The previous answers really helped me implement the shared memory paradigm in my MPI code. Thanks.
program sharedmemtest
USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR, C_F_POINTER
use mpi
implicit none
integer, parameter :: dp = selected_real_kind(14,200)
integer :: win,win2,hostcomm,hostrank
INTEGER(KIND=MPI_ADDRESS_KIND) :: windowsize
INTEGER :: disp_unit,my_rank,ierr,total, i
TYPE(C_PTR) :: baseptr,baseptr2
real(dp), POINTER :: matrix_elementsy(:,:,:,:)
integer,allocatable :: arrayshape(:)
call MPI_INIT( ierr )
call MPI_COMM_RANK(MPI_COMM_WORLD,my_rank, ierr) !GET THE RANK OF ONE PROCESS
call MPI_COMM_SIZE(MPI_COMM_WORLD,total,ierr) !GET THE TOTAL PROCESSES OF THE COMM
CALL MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, hostcomm,ierr)
CALL MPI_Comm_rank(hostcomm, hostrank,ierr)
! Gratefully based on: http://stackoverflow.com/questions/24797298/mpi-fortran-code-how-to-share-data-on-node-via-openmp
! and https://gcc.gnu.org/onlinedocs/gfortran/C_005fF_005fPOINTER.html
! We only want one process per host to allocate memory
! Set size to 0 in all processes but one
allocate(arrayshape(4))
arrayshape=(/ 10,10,10,10 /)
if (hostrank == 0) then
windowsize = int(10**4,MPI_ADDRESS_KIND)*8_MPI_ADDRESS_KIND !*8 for double ! Put the actual data size here
else
windowsize = 0_MPI_ADDRESS_KIND
end if
disp_unit = 1
CALL MPI_Win_allocate_shared(windowsize, disp_unit, MPI_INFO_NULL, hostcomm, baseptr, win, ierr)
! Obtain the location of the memory segment
if (hostrank /= 0) then
CALL MPI_Win_shared_query(win, 0, windowsize, disp_unit, baseptr, ierr)
end if
! baseptr can now be associated with a Fortran pointer
! and thus used to access the shared data
CALL C_F_POINTER(baseptr, matrix_elementsy,arrayshape)
!!! your code here!
!!! sample below
if (hostrank == 0) then
matrix_elementsy=0.0_dp
endif
call MPI_WIN_FENCE(0, win, ierr)
do i=1, 15
if (hostrank == 0) then
matrix_elementsy(1,2,3,4)=i * 1.0_dp
matrix_elementsy(1,2,2,4)=i * 2.0_dp
elseif ((hostrank > 5) .and. (hostrank < 11)) then ! code for non-root nodes to do something different
matrix_elementsy(1,2,hostrank, 4) = hostrank * 1.0 * i
endif
call MPI_WIN_FENCE(0, win, ierr)
write(*,'(A, I4, I4, 10F7.1)') "my_rank=",my_rank, i, matrix_elementsy(1,2,:,4)
call MPI_BARRIER(MPI_COMM_WORLD, ierr)
enddo
!!! end sample code
call MPI_WIN_FENCE(0, win, ierr)
call MPI_BARRIER(MPI_COMM_WORLD,ierr)
call MPI_Win_free(win,ierr)
call MPI_FINALIZE(IERR)
end program

Freeing memory allocated with newCString

As library docs say CString created with newCString must be freed with free function. I have been expecting that when CString is created it would take some memory and when it is released with free memory usage would go down, but it didn't! Here is example code:
module Main where
import Foreign
import Foreign.C.String
import System.IO
wait = do
putStr "Press enter" >> hFlush stdout
_ <- getLine
return ()
main = do
let s = concat $ replicate 1000000 ['0'..'9']
cs <- newCString s
cs `seq` wait -- (1)
free cs
wait -- (2)
When program stopped at (1), htop program showed that memory usage is somewhere around 410M - this is OK. I press enter and the program stops at line (2), but memory usage is still 410M despite cs has been freed!
How is this possible? Similar program written in C behaves as it should. What am I missing here?

The issue is that free just indicates to the garbage collector that it can now collect the string. That doesn't actually force the garbage collector to run though -- it just indicates that the CString is now garbage. It is still up to the GC to decide when to run, based on heap pressure heuristics.
You can force a major collection by calling performGC straight after the call to free, which immediately reduces the memory to 5M or so.
E.g. this program:
import Foreign
import Foreign.C.String
import System.IO
import System.Mem
wait = do
putStr "Press enter" >> hFlush stdout
_ <- getLine
return ()
main = do
let s = concat $ replicate 1000000 ['0'..'9']
cs <- newCString s
cs `seq` wait -- (1)
free cs
performGC
wait -- (2)
Behaves as expected, with the following memory profile - the first red dot is the call to performGC, immediately deallocating the string. The program then hovers around 5M until terminated.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart