MPI Fortran code: how to share data on node via openMP?

MPI Fortran code: how to share data on node via openMP? - memory

I am working on an Fortan code that already uses MPI.
Now, I am facing a situation, where a set of data grows very large but is same for every process, so I would prefer to store it in memory only once per node and all processes on one node access the same data.
Storing it once for every process would go beyond the available RAM.
Is it somehow possible to achieve something like that with openMP?
Data sharing per node is the only thing I would like to have, no other per node paralellisation required, because this is already done via MPI.

You don't need to implement a hybrid MPI+OpenMP code if it is only for sharing a chunk of data. What you actually have to do is:
1) Split the world communicator into groups that span the same host/node. That is really easy if your MPI library implements MPI-3.0 - all you need to do is call MPI_COMM_SPLIT_TYPE with split_type set to MPI_COMM_TYPE_SHARED:
USE mpi_f08
TYPE(MPI_Comm) :: hostcomm
CALL MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
MPI_INFO_NULL, hostcomm)
MPI-2.2 or earlier does not provide the MPI_COMM_SPLIT_TYPE operation and one has to get somewhat creative. You could for example use my simple split-by-host implementation that can be found on Github here.
2) Now that processes that reside on the same node are part of the same communicator hostcomm, they can create a block of shared memory and use it to exchange data. Again, MPI-3.0 provides an (relatively) easy and portable way to do that:
USE mpi_f08
USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR, C_F_POINTER
INTEGER :: hostrank
INTEGER(KIND=MPI_ADDRESS_KIND) :: size
INTEGER :: disp_unit
TYPE(C_PTR) :: baseptr
TYPE(MPI_Win) :: win
TYPE(MY_DATA_TYPE), POINTER :: shared_data
! We only want one process per host to allocate memory
! Set size to 0 in all processes but one
CALL MPI_Comm_rank(hostcomm, hostrank)
if (hostrank == 0) then
size = 10000000 ! Put the actual data size here
else
size = 0
end if
disp_unit = 1
CALL MPI_Win_allocate_shared(size, disp_unit, MPI_INFO_NULL, &
hostcomm, baseptr, win)
! Obtain the location of the memory segment
if (hostrank /= 0) then
CALL MPI_Win_shared_query(win, 0, size, disp_unit, baseptr)
end if
! baseptr can now be associated with a Fortran pointer
! and thus used to access the shared data
CALL C_F_POINTER(baseptr, shared_data)
! Use shared_data as if it was ALLOCATE'd
! ...
! Destroy the shared memory window
CALL MPI_Win_free(win)
The way that code works is that it uses the MPI-3.0 functionality for allocating shared memory windows. MPI_WIN_ALLOCATE_SHARED allocates a chunk of shared memory in each process. Since you want to share one block of data, it only makes sense to allocate it in a single process and not have it spread across the processes, therefore size is set to 0 for all but one ranks while making the call. MPI_WIN_SHARED_QUERY is used to find out the address at which that shared memory block is mapped in the virtual address space of the calling process. One the address is known, the C pointer can be associated with a Fortran pointer using the C_F_POINTER() subroutine and the latter can be used to access the shared memory. Once done, the shared memory has to be freed by destroying the shared memory window with MPI_WIN_FREE.
MPI-2.2 or earlier does not provide shared memory windows. In that case one has to use the OS-dependent APIs for creation of shared memory blocks, e.g. the standard POSIX sequence shm_open() / ftruncate() / mmap(). A utility C function callable from Fortran has to be written in order to perform those operations. See that code for some inspiration. The void * returned by mmap() can be passed directly to the Fortran code in a C_PTR type variable that can be then associated with a Fortran pointer.

With this answer I want to add a complete running code example (for ifort 15 and mvapich 2.1). The MPI shared memory concept is still pretty new and in particular for Fortran there aren't many code examples out there. It is based on the answer from Hristo and a very useful email on the mvapich mailing list (http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2014-June/005003.html).
The code example is based on the problems I ran into and adds to Hristo's answer in the following ways:
uses mpi instead of mpi_f08 (some libraries do not provide a full fortran 2008 interface yet)
Has ierr added to the respective MPI calls
Explicit calculation of the windowsize elements*elementsize
How to use C_F_POINTER to map the shared memory to a multi dimensional array
Reminds to use MPI_WIN_FENCE after modifying the shared memory
Intel mpi (5.0.1.035) needs an additional MPI_BARRIER after the MPI_FENCE since it only guarantees that between "between two MPI_Win_fence calls, all RMA operations are completed." (https://software.intel.com/en-us/blogs/2014/08/06/one-sided-communication)
Kudos go to Hristo and Michael Rachner.
program sharedmemtest
USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR, C_F_POINTER
use mpi
implicit none
integer, parameter :: dp = selected_real_kind(14,200)
integer :: win,win2,hostcomm,hostrank
INTEGER(KIND=MPI_ADDRESS_KIND) :: windowsize
INTEGER :: disp_unit,my_rank,ierr,total
TYPE(C_PTR) :: baseptr,baseptr2
real(dp), POINTER :: matrix_elementsy(:,:,:,:)
integer,allocatable :: arrayshape(:)
call MPI_INIT( ierr )
call MPI_COMM_RANK(MPI_COMM_WORLD,MY_RANK,IERR) !GET THE RANK OF ONE PROCESS
call MPI_COMM_SIZE(MPI_COMM_WORLD,Total,IERR) !GET THE TOTAL PROCESSES OF THE COMM
CALL MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, hostcomm,ierr)
CALL MPI_Comm_rank(hostcomm, hostrank,ierr)
! Gratefully based on: http://stackoverflow.com/questions/24797298/mpi-fortran-code-how-to-share-data-on-node-via-openmp
! and https://gcc.gnu.org/onlinedocs/gfortran/C_005fF_005fPOINTER.html
! We only want one process per host to allocate memory
! Set size to 0 in all processes but one
allocate(arrayshape(4))
arrayshape=(/ 10,10,10,10 /)
if (hostrank == 0) then
windowsize = int(10**4,MPI_ADDRESS_KIND)*8_MPI_ADDRESS_KIND !*8 for double ! Put the actual data size here
else
windowsize = 0_MPI_ADDRESS_KIND
end if
disp_unit = 1
CALL MPI_Win_allocate_shared(windowsize, disp_unit, MPI_INFO_NULL, hostcomm, baseptr, win, ierr)
! Obtain the location of the memory segment
if (hostrank /= 0) then
CALL MPI_Win_shared_query(win, 0, windowsize, disp_unit, baseptr, ierr)
end if
! baseptr can now be associated with a Fortran pointer
! and thus used to access the shared data
CALL C_F_POINTER(baseptr, matrix_elementsy,arrayshape)
!!! your code here!
!!! sample below
if (hostrank == 0) then
matrix_elementsy=0.0_dp
matrix_elementsy(1,2,3,4)=1.0_dp
end if
CALL MPI_WIN_FENCE(0, win, ierr)
print *,"my_rank=",my_rank,matrix_elementsy(1,2,3,4),matrix_elementsy(1,2,3,5)
!!! end sample code
call MPI_WIN_FENCE(0, win, ierr)
call MPI_BARRIER(MPI_COMM_WORLD,ierr)
call MPI_Win_free(win,ierr)
call MPI_FINALIZE(IERR)
end program

In the spirit of adding Fortran shared memory MPI examples, I'd like to extend ftiaronsem's code to incorporate a loop so that the behavior of MPI_Win_fence and MPI_Barrier is clearer (at least it is for me now, anyway).
Specifically, try running the code with either or both of the MPI_Win_Fence or MPI_Barrier commands in the loop commented out to see the effect. Alternatively, reverse their order.
Removing the MPI_Win_Fence allows the write statement to display memory that has not been updated yet.
Removing the MPI_Barrier allows other processes to run the next iteration and change memory before a process has the chance to write.
The previous answers really helped me implement the shared memory paradigm in my MPI code. Thanks.
program sharedmemtest
USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR, C_F_POINTER
use mpi
implicit none
integer, parameter :: dp = selected_real_kind(14,200)
integer :: win,win2,hostcomm,hostrank
INTEGER(KIND=MPI_ADDRESS_KIND) :: windowsize
INTEGER :: disp_unit,my_rank,ierr,total, i
TYPE(C_PTR) :: baseptr,baseptr2
real(dp), POINTER :: matrix_elementsy(:,:,:,:)
integer,allocatable :: arrayshape(:)
call MPI_INIT( ierr )
call MPI_COMM_RANK(MPI_COMM_WORLD,my_rank, ierr) !GET THE RANK OF ONE PROCESS
call MPI_COMM_SIZE(MPI_COMM_WORLD,total,ierr) !GET THE TOTAL PROCESSES OF THE COMM
CALL MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, hostcomm,ierr)
CALL MPI_Comm_rank(hostcomm, hostrank,ierr)
! Gratefully based on: http://stackoverflow.com/questions/24797298/mpi-fortran-code-how-to-share-data-on-node-via-openmp
! and https://gcc.gnu.org/onlinedocs/gfortran/C_005fF_005fPOINTER.html
! We only want one process per host to allocate memory
! Set size to 0 in all processes but one
allocate(arrayshape(4))
arrayshape=(/ 10,10,10,10 /)
if (hostrank == 0) then
windowsize = int(10**4,MPI_ADDRESS_KIND)*8_MPI_ADDRESS_KIND !*8 for double ! Put the actual data size here
else
windowsize = 0_MPI_ADDRESS_KIND
end if
disp_unit = 1
CALL MPI_Win_allocate_shared(windowsize, disp_unit, MPI_INFO_NULL, hostcomm, baseptr, win, ierr)
! Obtain the location of the memory segment
if (hostrank /= 0) then
CALL MPI_Win_shared_query(win, 0, windowsize, disp_unit, baseptr, ierr)
end if
! baseptr can now be associated with a Fortran pointer
! and thus used to access the shared data
CALL C_F_POINTER(baseptr, matrix_elementsy,arrayshape)
!!! your code here!
!!! sample below
if (hostrank == 0) then
matrix_elementsy=0.0_dp
endif
call MPI_WIN_FENCE(0, win, ierr)
do i=1, 15
if (hostrank == 0) then
matrix_elementsy(1,2,3,4)=i * 1.0_dp
matrix_elementsy(1,2,2,4)=i * 2.0_dp
elseif ((hostrank > 5) .and. (hostrank < 11)) then ! code for non-root nodes to do something different
matrix_elementsy(1,2,hostrank, 4) = hostrank * 1.0 * i
endif
call MPI_WIN_FENCE(0, win, ierr)
write(*,'(A, I4, I4, 10F7.1)') "my_rank=",my_rank, i, matrix_elementsy(1,2,:,4)
call MPI_BARRIER(MPI_COMM_WORLD, ierr)
enddo
!!! end sample code
call MPI_WIN_FENCE(0, win, ierr)
call MPI_BARRIER(MPI_COMM_WORLD,ierr)
call MPI_Win_free(win,ierr)
call MPI_FINALIZE(IERR)
end program

Related

Memory allocation when a pointer is assigned

I am trying to avoid memory allocation and local copy in my code. Below is a small example :
module test
implicit none
public
integer, parameter :: nb = 1000
type :: info
integer n(nb)
double precision d(nb)
end type info
type(info), save :: abc
type(info), target, save :: def
contains
subroutine test_copy(inf)
implicit none
type(info), optional :: inf
type(info) :: local
if (present(inf)) then
local = inf
else
local = abc
endif
local%n = 1
local%d = 1.d0
end subroutine test_copy
subroutine test_assoc(inf)
implicit none
type(info), target, optional :: inf
type(info), pointer :: local
if (present(inf)) then
local => inf
else
local => def
endif
local%n = 1
local%d = 1.d0
end subroutine test_assoc
end module test
program run
use test
use caliper_mod
implicit none
type(ConfigManager), save :: mgr
abc%n = 0
abc%d = 0.d0
def%n = 0
def%d = 0.d0
! Init caliper profiling
mgr = ConfigManager_new()
call mgr%add("runtime-report(mem.highwatermark,output=stdout)")
call mgr%start
! Call subroutine with copy
call cali_begin_region("test_copy")
call test_copy()
call cali_end_region("test_copy")
! Call subroutine with pointer
call cali_begin_region("test_assoc")
call test_assoc()
call cali_end_region("test_assoc")
! End caliper profiling
call mgr%flush()
call mgr%stop()
call mgr%delete()
end program run
As far as i understand, the subroutine test_copy should produce a local copy while the subroutine test_assoc should only assign a pointer to some existing object. However, memory profiling with caliper leads to :
$ ./a.out
Path Min time/rank Max time/rank Avg time/rank Time % Allocated MB
test_assoc 0.000026 0.000026 0.000026 0.493827 0.000021
test_copy 0.000120 0.000120 0.000120 2.279202 0.000019
What looks odd is that Caliper shows the exact same amount of memory allocated whatever the value of the parameter nb. Am I using the right tool the right way to track memory allocation and local copy ?
Test performed with gfortran 11.2.0 and Caliper 2.8.0.

The local object is just a simple small scalar, albeit containing an array component, and will be very likely placed on the stack.
A stack is a pre-allocated part of memory of fixed size. Stack "allocation" is actually just a change of value of the stack pointer which is just one integer value. No actual memory allocation from the operating system takes place during stack allocation. There will be no change in the memory the process occupies.

fortran deallocate array but not released in OS

I have the following question:how do I deallocate array memory in type? Like a%b%c,
how do I deallocate c? the specific problem is(The compiler environment I tried are gfortran version gcc4.4.7 and ifort version 18.0.1.OS:linux):
module grist_domain_types
implicit none
public :: aaa
type bbb
real (8), allocatable :: c(:)
end type bbb
type aaa
type(bbb), allocatable :: b(:)
end type aaa
end module grist_domain_types
program main
use grist_domain_types
type(aaa) :: a
integer(4) :: time,i
time=20
allocate(a%b(1:100000000))
call sleep(time)!--------------1
do i=1,100000000
allocate(a%b(i)%c(1:1))
enddo
call sleep(time)!--------------2
do i=1,100000000
deallocate(a%b(i)%c)
enddo
call sleep(time)!--------------3
deallocate(a%b)
call sleep(time)!--------------4
end program
First,"gfortran main.F90 -o main" to compile the program, and run this program. Then I use top -p processID to see memory. When the program is executed to 1, the memory is 4.5G. When the program is executed to 2, the memory is 7.5G. When the program is executed to 3, the memory is also 7.5G(but I think is 4.5G). When the program is executed to 4, the memory is 3G(I think is 0G or close to 0G). So deallocate(a%b(i)%c) does not seem to work. However, I use valgrind to see memory. the memory of this program is all deallocate...I used ifort and gfortran. This problem happens no matter which compiler I use. How to explain this question? I allocate many c array in this way,the program will finally crash due to insufficient memory. And how to solve it?

Take a look at this post from the Intel forum. There are 2 important information in there:
(From Doctor Forran):
When you do a DEALLOCATE, the memory that was allocated returns to the pool used by the memory allocator (on Linux and OS X this is the same as C's malloc/free). The memory is not released back to the OS - it is very rare that this would even be possible. What often happens is that the pattern of allocations and deallocations causes virtual memory to be fragmented, so that while the total available space may be high, there may not be sufficient contiguous space to allocate a large item. Unlike with disks, there is no way to "defrag" memory.
(From Jim Dempsey)
See if you can deallocate the memory in the reverse order in which it was allocated. This can reduce memory fragmentation.
You may also refer to this other Intel post:
During the program run, the Fortran runtime library will manage your heap. Yes, if data is DEALLOCATED, the runtime may choose to wait to release that memory. It's an optimization - if you do another ALLOCATE with the same size it will just reuse those pages. If the heap starts to run low, it will do some collection but not until it's absolutely necessary.
Also, let me add something: Check if there aren't other objects dynamically created in scope, like automatic arrays or temporal array copies. That could be demanding memory that may be freed only when they get out of scope.
Summing up, even if 'top' says the memory is still in use, you should start to worry only if your program starts to crash or if Valgrind shows something wreid.

I modified your program ( to see where it was up to ) and ran on Windows 7 / gFortran 7.2.0. It does not demonstrate the memory retention as you report, as memory reverts to 13 mb. Contrary to my comment, memory demand did not change during initialisation of c.
module grist_domain_types
implicit none
public :: aaa
type bbb
real (8), allocatable :: c(:)
end type bbb
type aaa
type(bbb), allocatable :: b(:)
end type aaa
end module grist_domain_types
program main
use grist_domain_types
type(aaa) :: a
integer(4),parameter :: million = 1000000
integer(4) :: n = 100*million
integer(4) :: time = 5, i, pass
do pass = 1,5
write (*,*) ' go #', pass
allocate(a%b(1:n))
write (*,*) 'allocate b'
call sleep(time)!--------------1
write (*,*) ' go'
do i=1,n
allocate(a%b(i)%c(1:1))
enddo
write (*,*) 'allocate c'
call sleep(time)!--------------2
write (*,*) ' go'
do i=1,n
a%b(i)%c = real(i)
enddo
write (*,*) 'use c'
call sleep(time)!--------------2a
write (*,*) ' go'
do i=1,n
deallocate(a%b(i)%c)
enddo
write (*,*) 'deallocate c'
call sleep(time)!--------------3
write (*,*) ' go'
deallocate(a%b)
write (*,*) 'deallocate b'
call sleep(time)!--------------4
end do
write (*,*) ' done : exit ?'
read (*,*) i
end program
edit: I have given the test a repeat with do pass... to repeat the memory demands. This shows no memory leakage for this Fortran program. I use Task manager to identify memory usage, both for this program and the O/S. Your particular O/S and Fortran compiler may be different.

Julia - nested loops are consuming a lot of memory

I have a nested loop iteration scheme that's taking up a lot of memory. I do understand that I should not have globals but even after I included everything in a function, the memory situation didn't improve. It just accumulates after each iteration as if there is no garbage collection.
Here is a workable example that's similar to my code.
I have two files. First, functions.jl:
##functions.jl
module functions
function getMatrix(A)
L = rand(A,A);
return L;
end
function loopOne(A, B)
res = 0;
for i = 1:B
res = inv(getMatrix(A));
end
return(res);
end
end
Second file main.jl:
##main.jl
include("functions.jl")
function main(C)
res = 0;
A = 50;
B = 30;
for i =1:C
mat = functions.loopOne(A,B);
res = mat .+ 1;
end
return res;
end
main(100)
When I execute julia main.jl, the memory increases as I extend C in main(C) (sometimes to more than millions of allocations and 10GiBs when I increase C to 1000000).
I know that the example looks useless but it resembles the structure that I have. Can someone please help? Thank you.
UPDATE:
Michael K. Borregaard gave an answer which is very helpful:
module Functions #1
function loopOne!(res, mymatrix, B) #2
for i = 1:B
res .= inv(rand!(mymatrix)) #3
end
return res #4
end
end
function some_descriptive_name(C) #5
A, B = 50, 30 #6
res, mymat = zeros(A,A), zeros(A,A)
for i =1:C
res .= Functions.loopOne!(res, mymat, B) .+ 1
end
return res
However, when I time it, allocations and memory still increase as I dial up C.
#time some_descriptive_name(30)
0.057177 seconds (11.77 k allocations: 58.278 MiB, 9.58% gc time)
#time some_descriptive_name(60)
0.113808 seconds (23.53 k allocations: 116.518 MiB, 9.63% gc time)
I believe that the problem comes from the inv function. If I change the code to:
function some_descriptive_name(C) #5
A, B = 50, 30 #6
res, mymat = zeros(A,A), zeros(A,A)
for i =1:C
res .= res .+ 1
end
return res
end
The memory and allocations will then stay constant:
#time some_descriptive_name(3)
0.000007 seconds (8 allocations: 39.438 KiB)
#time some_descriptive_name(60)
0.000037 seconds (8 allocations: 39.438 KiB)
Is there a way to "clear" the memory after using inv? Since I'm not creating anything new or storing anything new, the memory usage should stay constant.

A few pointers at least:
The getMatrix function allocates a new AxA matrix every time. That will certainly consume memory. It is better to avoid the allocations if you can, e.g. by using rand! to fill an existing array with random values.
The res = 0 line defines res as an Int but you subsequently assign a Matrix{Float} to it (the result of inv(getMatrix)). Changing the type of a variable in the code makes it hard for the compiler to figure out what the type is, which makes for slow code.
It seems you have a module called functions but you don't write it.
The res = inv code line constantly overwrites the value, so the loop does nothing!
The structure and code looks like C++. Try looking at the style guide.
Here's how the code would look in a more ideomatic way that avoids allocations:
module Functions #1
function loopOne!(res, mymatrix, B) #2
for i = 1:B
res .= inv(rand!(mymatrix)) #3
end
return res #4
end
end
function some_descriptive_name(C) #5
A, B = 50, 30 #6
res, mymat = zeros(A,A), zeros(A,A)
for i =1:C
res .= Functions.loopOne!(res, mymat, B) .+ 1
end
return res
end
Comments:
Use a module if you like - it's up to you whether to put things in different files. Module name in caps.
If you can, it's an advantage to use functions that overwrite values of an existing container. Such functions end with ! to signal that they will modify the arguments (like passing a variably by reference without making it const in C++).
Use the .= operator to indicate that you're not creating a new container, you're overwriting the elements of the existing one. The rand! function overwrites mymatrix.
The return keyword is not strictly needed, but DNF suggested it is better style in a comment
The main convention isn't used in Julia, as most code gets called by the user, not by execution of a program.
Compact assignment format for multiple variables.
Note that in this case, none of these optimisation matter much, as 99% of the calculation time is spent in the expensive inv function.
RESPONSE TO THE UPDATE:
There's nothing wrong with the inv function, it just is a costly operation to do. But again I think you may be misunderstanding what the memory counting does. It is not that the memory use is increasing, like you would be looking for in C++ if you had a pointer to an object that was never released (a memory leak). The memory use is constant, but the total sum of allocations increase, because the inv function has to make some internal allocations.
Consider this example
for i in 1:n
b = [1, 2, 3, 4] # Here a length-4 Array{Int64} is initialized in memory, cost is 32 bytes
end # Here, that memory is released.
For each run through the for loop, 32 bytes is allocated, and 32 bytes is released. When the loop ends, regardless of n, 0 bytes will be allocated from this operation. But Julia's memory tracking only adds the allocations - so after running the code you will see allocation of 32*n bytes.
The reason julia does this is that allocating space in the RAM is one of the costliest operations in computing - so reducing allocations is a good way to speed up code. But you cannot avoid it.
There is thus nothing wrong with your code (in the new format) - the memory allocation and time taken you see is just a result of doing a big (expensive) operation.

Large VIRT memory in Fortran

I have a large Fortran/MPI code that when running uses a very large amount of VIRT memory (~20G) although the actual memory used (500 mb) is quite modest.
How can I profile the code to understand which part produces this huge VIRT memory?
At this stage, I'm even happy to use a brute force approach.
What I have tried is to put sleep statements in the code and recorded the memory usage through "top" to try to pin-point by bisection where the big allocation are.
However, this does not work as the sleep call puts the memory usage to 0. Is there a way to freeze the code while keeping current memory usage?
PS: I have tried VALGRIND but the code being so large, VALGRIND never finished. Is there an alternative to VALGRIND that is "easy" to used?
Thank you,
Sam

A solution for this is this modified (to get VIRT memory) subroutine from Track memory usage in Fortran 90
subroutine system_mem_usage(valueRSS)
use ifport !if on intel compiler
implicit none
integer, intent(out) :: valueRSS
character(len=200):: filename=' '
character(len=80) :: line
character(len=8) :: pid_char=' '
integer :: pid
logical :: ifxst
valueRSS=-1 ! return negative number if not found
!--- get process ID
pid=getpid()
write(pid_char,'(I8)') pid
filename='/proc/'//trim(adjustl(pid_char))//'/status'
!--- read system file
inquire (file=filename,exist=ifxst)
if (.not.ifxst) then
write (*,*) 'system file does not exist'
return
endif
open(unit=100, file=filename, action='read')
do
read (100,'(a)',end=120) line
if (line(1:7).eq.'VmSize:') then
read (line(8:),*) valueRSS
exit
endif
enddo
120 continue
close(100)
return
end subroutine system_mem_usage

Calculating factorial on FORTRAN with integer variables. Memory overflow

I'm doing a program with FORTRAN that is a bit special. I can only use integer variables, and as you know with these you've got a memory overflow when you try to calculate a factorial superior to 12 or 13. So I made this program to avoid this problem:
http://lendricheolfiles.webs.com/codigo.txt
But something very strange is happening. The program calculates the factorial well 4 or 5 times and then gives a memory overflow message. I'm using Windows 8 and I fear it might be the cause of the failure, or if it's just that I've done something wrong.
Thanks.

Try compiling with run-time subscript checking. In Fortran segmentation faults are generally caused either by subscript errors or by mismatches between actual and dummy arguments (i.e., between arguments in the call to a procedure and the arguments as declared in the procedure). I'll make a wild guess from glancing at your code that you have have a subscript error -- let the compiler find it for you by turning on run-time subscript checking. Most Fortran compilers have this as an compilation option.
P.S. You can also do calculations like this by using already written packages, e.g., the arbitrary precision arithmetic software of David Bailey, et al., available in Fortran 90 at http://crd-legacy.lbl.gov/~dhbailey/mpdist/

M.S.B.'s answer has the gist of your problem: your array indices go out of bounds at a couple of places.
In three loops, cifra - 1 == 0 is out of bounds:
do cifra=ncifras,1,-1
factor(1,cifra-1) = factor(1,cifra)/10 ! factor is (1:2, 1:ncifras)
factor(1,cifra) = mod(factor(1,cifra),10)
enddo
! :
! Same here:
do cifra=ncifras,1,-1
factor(2,cifra-1) = factor(2,cifra)/10
factor(2,cifra) = mod(factor(2,cifra),10)
enddo
!:
do cifra=ncifras,1,-1
sumaprovisional(cifra-1) = sumaprovisional(cifra-1)+(sumaprovisional(cifra)/10)
sumaprovisional(cifra) = mod(sumaprovisional(cifra),10)
enddo
In the next case, the value of cifra - (fila - 1) goes out of bounds:
do fila=1,nfilas
do cifra=1,ncifras
! Out of bounds for all cifra < fila:
sumando(fila,cifra-(fila-1)) = factor(1,cifra)*factor(2,ncifras-(fila-1))
enddo
sumaprovisional = sumaprovisional+sumando(fila,:)
enddo
You should be fine if you rewrite the first three loops as do cifra = ncifras, 2, -1 and the inner loop of the other case as do cifra = fila, ncifras. Also, in the example program you posted, you first have to allocate resultado properly before passing it to the subroutine.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart