I am trying to avoid memory allocation and local copy in my code. Below is a small example :
module test
implicit none
public
integer, parameter :: nb = 1000
type :: info
integer n(nb)
double precision d(nb)
end type info
type(info), save :: abc
type(info), target, save :: def
contains
subroutine test_copy(inf)
implicit none
type(info), optional :: inf
type(info) :: local
if (present(inf)) then
local = inf
else
local = abc
endif
local%n = 1
local%d = 1.d0
end subroutine test_copy
subroutine test_assoc(inf)
implicit none
type(info), target, optional :: inf
type(info), pointer :: local
if (present(inf)) then
local => inf
else
local => def
endif
local%n = 1
local%d = 1.d0
end subroutine test_assoc
end module test
program run
use test
use caliper_mod
implicit none
type(ConfigManager), save :: mgr
abc%n = 0
abc%d = 0.d0
def%n = 0
def%d = 0.d0
! Init caliper profiling
mgr = ConfigManager_new()
call mgr%add("runtime-report(mem.highwatermark,output=stdout)")
call mgr%start
! Call subroutine with copy
call cali_begin_region("test_copy")
call test_copy()
call cali_end_region("test_copy")
! Call subroutine with pointer
call cali_begin_region("test_assoc")
call test_assoc()
call cali_end_region("test_assoc")
! End caliper profiling
call mgr%flush()
call mgr%stop()
call mgr%delete()
end program run
As far as i understand, the subroutine test_copy should produce a local copy while the subroutine test_assoc should only assign a pointer to some existing object. However, memory profiling with caliper leads to :
$ ./a.out
Path Min time/rank Max time/rank Avg time/rank Time % Allocated MB
test_assoc 0.000026 0.000026 0.000026 0.493827 0.000021
test_copy 0.000120 0.000120 0.000120 2.279202 0.000019
What looks odd is that Caliper shows the exact same amount of memory allocated whatever the value of the parameter nb. Am I using the right tool the right way to track memory allocation and local copy ?
Test performed with gfortran 11.2.0 and Caliper 2.8.0.
The local object is just a simple small scalar, albeit containing an array component, and will be very likely placed on the stack.
A stack is a pre-allocated part of memory of fixed size. Stack "allocation" is actually just a change of value of the stack pointer which is just one integer value. No actual memory allocation from the operating system takes place during stack allocation. There will be no change in the memory the process occupies.
I am recently using DTrace to analyze my iOS app。
Everything goes well except when I try to use the built-in variable stackDepth。
I read the document here where shows the introduction of built-in variable stackDepth.
So I write some D code
pid$target:::entry
{
self->entry_times[probefunc] = timestamp;
}
pid$target:::return
{
printf ("-----------------------------------\n");
this->delta_time = timestamp - self->entry_times[probefunc];
printf ("%s\n", probefunc);
printf ("stackDepth %d\n", stackdepth);
printf ("%d---%d\n", this->delta_time, epid);
ustack();
printf ("-----------------------------------\n");
}
And run it with sudo dtrace -s temp.d -c ./simple.out。 unstack() function goes very well, but stackDepth always appears to 0。
I tried both on my iOS app and a simple C program.
So anybody knows what's going on?
And how to get stack depth when the probe fires?
You want to use ustackdepth -- the user-land stack depth.
The stackdepth variable refers to the kernel thread stack depth; the ustackdepth variable refers to the user-land thread stack depth. When the traced program is executing in user-land, stackdepth will (should!) always be 0.
ustackdepth is calculated using the same logic as is used to walk the user-land stack as with ustack() (just as stackdepth and stack() use similar logic for the kernel stack).
This seems like a bug in the Mac / iOS implementation of DTrace to me.
However, since you're already probing every function entry and return, you could just keep a new variable self->depth and do ++ in the :::entry probe and -- in the :::return probe. This doesn't work quite right if you run it against optimized code, because any tail-call-optimized functions may look like they enter but never return. To solve that, you can turn off optimizations.
Also, because what you're doing looks a lot like this, I thought maybe you would be interested in the -F option:
Coalesce trace output by identifying function entry and return.
Function entry probe reports are indented and their output is prefixed
with ->. Function return probe reports are unindented and their output
is prefixed with <-.
The normal script to use with -F is something like:
pid$target::some_function:entry { self->trace = 1 }
pid$target:::entry /self->trace/ {}
pid$target:::return /self->trace/ {}
pid$target::some_function:return { self->trace = 0 }
Where some_function is the function whose execution you want to be printed. The output shows a textual call graph for that execution:
-> some_function
-> another_function
-> malloc
<- malloc
<- another_function
-> yet_another_function
-> strcmp
<- strcmp
-> malloc
<- malloc
<- yet_another_function
<- some_function
I have a large Fortran/MPI code that when running uses a very large amount of VIRT memory (~20G) although the actual memory used (500 mb) is quite modest.
How can I profile the code to understand which part produces this huge VIRT memory?
At this stage, I'm even happy to use a brute force approach.
What I have tried is to put sleep statements in the code and recorded the memory usage through "top" to try to pin-point by bisection where the big allocation are.
However, this does not work as the sleep call puts the memory usage to 0. Is there a way to freeze the code while keeping current memory usage?
PS: I have tried VALGRIND but the code being so large, VALGRIND never finished. Is there an alternative to VALGRIND that is "easy" to used?
Thank you,
Sam
A solution for this is this modified (to get VIRT memory) subroutine from Track memory usage in Fortran 90
subroutine system_mem_usage(valueRSS)
use ifport !if on intel compiler
implicit none
integer, intent(out) :: valueRSS
character(len=200):: filename=' '
character(len=80) :: line
character(len=8) :: pid_char=' '
integer :: pid
logical :: ifxst
valueRSS=-1 ! return negative number if not found
!--- get process ID
pid=getpid()
write(pid_char,'(I8)') pid
filename='/proc/'//trim(adjustl(pid_char))//'/status'
!--- read system file
inquire (file=filename,exist=ifxst)
if (.not.ifxst) then
write (*,*) 'system file does not exist'
return
endif
open(unit=100, file=filename, action='read')
do
read (100,'(a)',end=120) line
if (line(1:7).eq.'VmSize:') then
read (line(8:),*) valueRSS
exit
endif
enddo
120 continue
close(100)
return
end subroutine system_mem_usage
I am working on an Fortan code that already uses MPI.
Now, I am facing a situation, where a set of data grows very large but is same for every process, so I would prefer to store it in memory only once per node and all processes on one node access the same data.
Storing it once for every process would go beyond the available RAM.
Is it somehow possible to achieve something like that with openMP?
Data sharing per node is the only thing I would like to have, no other per node paralellisation required, because this is already done via MPI.
You don't need to implement a hybrid MPI+OpenMP code if it is only for sharing a chunk of data. What you actually have to do is:
1) Split the world communicator into groups that span the same host/node. That is really easy if your MPI library implements MPI-3.0 - all you need to do is call MPI_COMM_SPLIT_TYPE with split_type set to MPI_COMM_TYPE_SHARED:
USE mpi_f08
TYPE(MPI_Comm) :: hostcomm
CALL MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
MPI_INFO_NULL, hostcomm)
MPI-2.2 or earlier does not provide the MPI_COMM_SPLIT_TYPE operation and one has to get somewhat creative. You could for example use my simple split-by-host implementation that can be found on Github here.
2) Now that processes that reside on the same node are part of the same communicator hostcomm, they can create a block of shared memory and use it to exchange data. Again, MPI-3.0 provides an (relatively) easy and portable way to do that:
USE mpi_f08
USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR, C_F_POINTER
INTEGER :: hostrank
INTEGER(KIND=MPI_ADDRESS_KIND) :: size
INTEGER :: disp_unit
TYPE(C_PTR) :: baseptr
TYPE(MPI_Win) :: win
TYPE(MY_DATA_TYPE), POINTER :: shared_data
! We only want one process per host to allocate memory
! Set size to 0 in all processes but one
CALL MPI_Comm_rank(hostcomm, hostrank)
if (hostrank == 0) then
size = 10000000 ! Put the actual data size here
else
size = 0
end if
disp_unit = 1
CALL MPI_Win_allocate_shared(size, disp_unit, MPI_INFO_NULL, &
hostcomm, baseptr, win)
! Obtain the location of the memory segment
if (hostrank /= 0) then
CALL MPI_Win_shared_query(win, 0, size, disp_unit, baseptr)
end if
! baseptr can now be associated with a Fortran pointer
! and thus used to access the shared data
CALL C_F_POINTER(baseptr, shared_data)
! Use shared_data as if it was ALLOCATE'd
! ...
! Destroy the shared memory window
CALL MPI_Win_free(win)
The way that code works is that it uses the MPI-3.0 functionality for allocating shared memory windows. MPI_WIN_ALLOCATE_SHARED allocates a chunk of shared memory in each process. Since you want to share one block of data, it only makes sense to allocate it in a single process and not have it spread across the processes, therefore size is set to 0 for all but one ranks while making the call. MPI_WIN_SHARED_QUERY is used to find out the address at which that shared memory block is mapped in the virtual address space of the calling process. One the address is known, the C pointer can be associated with a Fortran pointer using the C_F_POINTER() subroutine and the latter can be used to access the shared memory. Once done, the shared memory has to be freed by destroying the shared memory window with MPI_WIN_FREE.
MPI-2.2 or earlier does not provide shared memory windows. In that case one has to use the OS-dependent APIs for creation of shared memory blocks, e.g. the standard POSIX sequence shm_open() / ftruncate() / mmap(). A utility C function callable from Fortran has to be written in order to perform those operations. See that code for some inspiration. The void * returned by mmap() can be passed directly to the Fortran code in a C_PTR type variable that can be then associated with a Fortran pointer.
With this answer I want to add a complete running code example (for ifort 15 and mvapich 2.1). The MPI shared memory concept is still pretty new and in particular for Fortran there aren't many code examples out there. It is based on the answer from Hristo and a very useful email on the mvapich mailing list (http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2014-June/005003.html).
The code example is based on the problems I ran into and adds to Hristo's answer in the following ways:
uses mpi instead of mpi_f08 (some libraries do not provide a full fortran 2008 interface yet)
Has ierr added to the respective MPI calls
Explicit calculation of the windowsize elements*elementsize
How to use C_F_POINTER to map the shared memory to a multi dimensional array
Reminds to use MPI_WIN_FENCE after modifying the shared memory
Intel mpi (5.0.1.035) needs an additional MPI_BARRIER after the MPI_FENCE since it only guarantees that between "between two MPI_Win_fence calls, all RMA operations are completed." (https://software.intel.com/en-us/blogs/2014/08/06/one-sided-communication)
Kudos go to Hristo and Michael Rachner.
program sharedmemtest
USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR, C_F_POINTER
use mpi
implicit none
integer, parameter :: dp = selected_real_kind(14,200)
integer :: win,win2,hostcomm,hostrank
INTEGER(KIND=MPI_ADDRESS_KIND) :: windowsize
INTEGER :: disp_unit,my_rank,ierr,total
TYPE(C_PTR) :: baseptr,baseptr2
real(dp), POINTER :: matrix_elementsy(:,:,:,:)
integer,allocatable :: arrayshape(:)
call MPI_INIT( ierr )
call MPI_COMM_RANK(MPI_COMM_WORLD,MY_RANK,IERR) !GET THE RANK OF ONE PROCESS
call MPI_COMM_SIZE(MPI_COMM_WORLD,Total,IERR) !GET THE TOTAL PROCESSES OF THE COMM
CALL MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, hostcomm,ierr)
CALL MPI_Comm_rank(hostcomm, hostrank,ierr)
! Gratefully based on: http://stackoverflow.com/questions/24797298/mpi-fortran-code-how-to-share-data-on-node-via-openmp
! and https://gcc.gnu.org/onlinedocs/gfortran/C_005fF_005fPOINTER.html
! We only want one process per host to allocate memory
! Set size to 0 in all processes but one
allocate(arrayshape(4))
arrayshape=(/ 10,10,10,10 /)
if (hostrank == 0) then
windowsize = int(10**4,MPI_ADDRESS_KIND)*8_MPI_ADDRESS_KIND !*8 for double ! Put the actual data size here
else
windowsize = 0_MPI_ADDRESS_KIND
end if
disp_unit = 1
CALL MPI_Win_allocate_shared(windowsize, disp_unit, MPI_INFO_NULL, hostcomm, baseptr, win, ierr)
! Obtain the location of the memory segment
if (hostrank /= 0) then
CALL MPI_Win_shared_query(win, 0, windowsize, disp_unit, baseptr, ierr)
end if
! baseptr can now be associated with a Fortran pointer
! and thus used to access the shared data
CALL C_F_POINTER(baseptr, matrix_elementsy,arrayshape)
!!! your code here!
!!! sample below
if (hostrank == 0) then
matrix_elementsy=0.0_dp
matrix_elementsy(1,2,3,4)=1.0_dp
end if
CALL MPI_WIN_FENCE(0, win, ierr)
print *,"my_rank=",my_rank,matrix_elementsy(1,2,3,4),matrix_elementsy(1,2,3,5)
!!! end sample code
call MPI_WIN_FENCE(0, win, ierr)
call MPI_BARRIER(MPI_COMM_WORLD,ierr)
call MPI_Win_free(win,ierr)
call MPI_FINALIZE(IERR)
end program
In the spirit of adding Fortran shared memory MPI examples, I'd like to extend ftiaronsem's code to incorporate a loop so that the behavior of MPI_Win_fence and MPI_Barrier is clearer (at least it is for me now, anyway).
Specifically, try running the code with either or both of the MPI_Win_Fence or MPI_Barrier commands in the loop commented out to see the effect. Alternatively, reverse their order.
Removing the MPI_Win_Fence allows the write statement to display memory that has not been updated yet.
Removing the MPI_Barrier allows other processes to run the next iteration and change memory before a process has the chance to write.
The previous answers really helped me implement the shared memory paradigm in my MPI code. Thanks.
program sharedmemtest
USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR, C_F_POINTER
use mpi
implicit none
integer, parameter :: dp = selected_real_kind(14,200)
integer :: win,win2,hostcomm,hostrank
INTEGER(KIND=MPI_ADDRESS_KIND) :: windowsize
INTEGER :: disp_unit,my_rank,ierr,total, i
TYPE(C_PTR) :: baseptr,baseptr2
real(dp), POINTER :: matrix_elementsy(:,:,:,:)
integer,allocatable :: arrayshape(:)
call MPI_INIT( ierr )
call MPI_COMM_RANK(MPI_COMM_WORLD,my_rank, ierr) !GET THE RANK OF ONE PROCESS
call MPI_COMM_SIZE(MPI_COMM_WORLD,total,ierr) !GET THE TOTAL PROCESSES OF THE COMM
CALL MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, hostcomm,ierr)
CALL MPI_Comm_rank(hostcomm, hostrank,ierr)
! Gratefully based on: http://stackoverflow.com/questions/24797298/mpi-fortran-code-how-to-share-data-on-node-via-openmp
! and https://gcc.gnu.org/onlinedocs/gfortran/C_005fF_005fPOINTER.html
! We only want one process per host to allocate memory
! Set size to 0 in all processes but one
allocate(arrayshape(4))
arrayshape=(/ 10,10,10,10 /)
if (hostrank == 0) then
windowsize = int(10**4,MPI_ADDRESS_KIND)*8_MPI_ADDRESS_KIND !*8 for double ! Put the actual data size here
else
windowsize = 0_MPI_ADDRESS_KIND
end if
disp_unit = 1
CALL MPI_Win_allocate_shared(windowsize, disp_unit, MPI_INFO_NULL, hostcomm, baseptr, win, ierr)
! Obtain the location of the memory segment
if (hostrank /= 0) then
CALL MPI_Win_shared_query(win, 0, windowsize, disp_unit, baseptr, ierr)
end if
! baseptr can now be associated with a Fortran pointer
! and thus used to access the shared data
CALL C_F_POINTER(baseptr, matrix_elementsy,arrayshape)
!!! your code here!
!!! sample below
if (hostrank == 0) then
matrix_elementsy=0.0_dp
endif
call MPI_WIN_FENCE(0, win, ierr)
do i=1, 15
if (hostrank == 0) then
matrix_elementsy(1,2,3,4)=i * 1.0_dp
matrix_elementsy(1,2,2,4)=i * 2.0_dp
elseif ((hostrank > 5) .and. (hostrank < 11)) then ! code for non-root nodes to do something different
matrix_elementsy(1,2,hostrank, 4) = hostrank * 1.0 * i
endif
call MPI_WIN_FENCE(0, win, ierr)
write(*,'(A, I4, I4, 10F7.1)') "my_rank=",my_rank, i, matrix_elementsy(1,2,:,4)
call MPI_BARRIER(MPI_COMM_WORLD, ierr)
enddo
!!! end sample code
call MPI_WIN_FENCE(0, win, ierr)
call MPI_BARRIER(MPI_COMM_WORLD,ierr)
call MPI_Win_free(win,ierr)
call MPI_FINALIZE(IERR)
end program
As library docs say CString created with newCString must be freed with free function. I have been expecting that when CString is created it would take some memory and when it is released with free memory usage would go down, but it didn't! Here is example code:
module Main where
import Foreign
import Foreign.C.String
import System.IO
wait = do
putStr "Press enter" >> hFlush stdout
_ <- getLine
return ()
main = do
let s = concat $ replicate 1000000 ['0'..'9']
cs <- newCString s
cs `seq` wait -- (1)
free cs
wait -- (2)
When program stopped at (1), htop program showed that memory usage is somewhere around 410M - this is OK. I press enter and the program stops at line (2), but memory usage is still 410M despite cs has been freed!
How is this possible? Similar program written in C behaves as it should. What am I missing here?
The issue is that free just indicates to the garbage collector that it can now collect the string. That doesn't actually force the garbage collector to run though -- it just indicates that the CString is now garbage. It is still up to the GC to decide when to run, based on heap pressure heuristics.
You can force a major collection by calling performGC straight after the call to free, which immediately reduces the memory to 5M or so.
E.g. this program:
import Foreign
import Foreign.C.String
import System.IO
import System.Mem
wait = do
putStr "Press enter" >> hFlush stdout
_ <- getLine
return ()
main = do
let s = concat $ replicate 1000000 ['0'..'9']
cs <- newCString s
cs `seq` wait -- (1)
free cs
performGC
wait -- (2)
Behaves as expected, with the following memory profile - the first red dot is the call to performGC, immediately deallocating the string. The program then hovers around 5M until terminated.