open System
open System.Diagnostics
open System.Threading.Tasks
open System.Collections
let computation input =
    input |> Array.map (fun x -> sqrt (float x)) |> ignore

let computation1 (input: int[]) =
    input |> Array.map (fun x -> Convert.ToString(x, 2)) |> ignore

let abc p =
    for i in 1 .. 50000 do
        [| for i in 1 .. 100000 -> i |]
        |> computation

Parallel.For(0, 100, fun x -> abc 0) |> ignore
Hi,
In the above program, if I run the application with computation() in abc(), I see 70-80% CPU utilization.
But with computation1() in abc() instead, I see only about 20% CPU utilization, and many cores (about half of them) sit idle.
Does anyone have an idea why this happens?
My CPU specifications:
Intel Xeon CPU @ 3.40GHz with Hyper-Threading enabled
Physical cores: 2 CPUs with 8 cores each = 16
Logical cores with HT: 16 * 2 = 32
Should be pretty easy to find out with a profiler, but in this case the answer is rather obvious: Convert.ToString will throttle your performance considerably, making the workload memory-bound and (possibly even more importantly) heap/GC-bound.
computation is almost pure CPU work. computation1 hits the heap for every single element: that means frequent memory access (in fact, it can't even use the CPU caches very well, though GC compaction offsets this a lot), and it means frequently allocating and collecting strings on the heap. The heap and the GC perform well, but millions or billions of strings per second really push them to their limit. Most of the time, your code is suspended because the GC is running.
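If you want to confirm this without a full profiler, a rough sketch (reusing the computation and computation1 definitions from the question and the standard GC.CollectionCount API) is to count how many gen-0 collections each workload triggers:

open System

// Rough GC-pressure check: count the gen-0 collections caused by each workload.
let measureCollections name work =
    let before = GC.CollectionCount(0)
    work ()
    printfn "%s triggered %d gen-0 collections" name (GC.CollectionCount(0) - before)

let data = [| for i in 1 .. 100000 -> i |]
measureCollections "computation (sqrt)" (fun () -> for _ in 1 .. 100 do computation data)
measureCollections "computation1 (Convert.ToString)" (fun () -> for _ in 1 .. 100 do computation1 data)

One would expect computation1 to report vastly more collections, which matches the GC-bound explanation above.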
Related
-module(demo).
-export([factorial/1]).

factorial(0) -> 1;
factorial(N) ->
    N * factorial(N-1).
The factorial function is not tail recursive, so why does it not overflow the stack? I am able to compute the factorial of 100,000 without a stack overflow, although it takes some time.
An Erlang process's "stack" is not stored in the stack the operating system gives the process (which is usually a few megabytes) but on the heap. As far as I know, it will grow unbounded until the system refuses to give the VM more memory.
The size includes 233 words for the heap area (which includes the stack). The garbage collector increases the heap as needed.
The main (outer) loop for a process must be tail-recursive. Otherwise, the stack grows until the process terminates.
Source
If you monitor the Erlang VM in a process monitor like Activity Monitor on OS X or top on other UNIX-like systems, you'll see that memory usage keeps increasing until the calculation completes. At that point the part of the memory where the "stack" is stored is released (for me this happens gradually over a few seconds after the function returns).
I have written a scientific code and, as usual, this boils down to calculating the coefficients of an algebraic eigenvalue equation. Calculating these coefficients requires integrating over multi-dimensional arrays, which quickly inflates memory usage drastically. Once the matrix coefficients are calculated, the original pre-integration multi-dimensional arrays can be deallocated and intelligent solvers take over, so memory usage ceases to be the big issue. As you can see, there is a memory bottleneck, and on my 64-bit, 4-core, 8-thread, 8GB-RAM laptop the program crashes due to insufficient memory.
I am therefore implementing a system that keeps memory usage in check by limiting the size of the tasks that the MPI processes can take on when calculating some of the eigenvalue matrix elements. When finished, they look for remaining jobs, so the matrix still gets filled, but in a more sequential and less parallel way.
I was therefore checking how much memory I can allocate, and here is where the confusion starts: I allocate doubles of size 8 bytes (checked using sizeof(1._dp)) and look at the allocation status.
Even though I have only 8 GB of RAM available, running a test with just 1 process I can allocate an array of up to size (40000,40000), which corresponds to about 13GB of memory! My first question is thus: how is this possible? Is there so much virtual memory?
Secondly, I realized that I can also do the same thing with multiple processes: up to 16 processes can simultaneously allocate these massive arrays!
This cannot be right?
Does somebody understand why this happens, and whether I am doing something wrong?
Edit:
Here is code that reproduces the aforementioned miracle, at least on my machine. However, when I set the elements of the arrays to some value, it indeed behaves as it should and crashes, or at least starts behaving very slowly, which I guess is because slow virtual memory (swap) is being used?
program test_miracle
    use ISO_FORTRAN_ENV
    use MPI

    implicit none

    ! global variables
    integer, parameter :: dp = REAL64           ! double precision
    integer, parameter :: max_str_ln = 120      ! maximum length of filenames
    integer :: ierr                             ! error variable
    integer :: n_procs                          ! MPI nr. of procs

    ! start MPI
    call MPI_init(ierr)                         ! initialize MPI
    call MPI_Comm_size(MPI_Comm_world,n_procs,ierr) ! nr. MPI processes

    write(*,*) 'RUNNING MPI WITH', n_procs, 'processes'

    ! call asking for 6 GB
    call test_max_memory(6000._dp)
    call MPI_Barrier(MPI_Comm_world,ierr)

    ! call asking for 13 GB
    call test_max_memory(13000._dp)
    call MPI_Barrier(MPI_Comm_world,ierr)

    ! call asking for 14 GB
    call test_max_memory(14000._dp)
    call MPI_Barrier(MPI_Comm_world,ierr)

    ! stop MPI
    call MPI_finalize(ierr)

contains
    ! test whether maximum memory is feasible
    subroutine test_max_memory(max_mem_per_proc)
        ! input/output
        real(dp), intent(in) :: max_mem_per_proc    ! maximum memory per process

        ! local variables
        character(len=max_str_ln) :: err_msg        ! error message
        integer :: n_max                            ! maximum size of array
        real(dp), allocatable :: max_mem_arr(:,:)   ! array with maximum size
        integer :: ierr                             ! error variable

        write(*,*) ' > Testing whether maximum memory per process of ',&
            &max_mem_per_proc/1000, 'GB is possible'

        n_max = ceiling(sqrt(max_mem_per_proc/(sizeof(1._dp)*1.E-6)))

        write(*,*) ' * Allocating doubles array of size', n_max
        allocate(max_mem_arr(n_max,n_max),STAT=ierr)

        err_msg = ' * cannot allocate this much memory. Try setting &
            &"max_mem_per_proc" lower'
        if (ierr.ne.0) then
            write(*,*) err_msg
            stop
        end if

        !max_mem_arr = 0._dp                        ! UNCOMMENT TO MAKE THE MIRACLE DISAPPEAR

        deallocate(max_mem_arr)

        write(*,*) ' * Maximum memory allocatable'
    end subroutine test_max_memory
end program test_miracle
To be saved in test.f90 and subsequently compiled and run with
mpif90 test.f90 -o test && mpirun -np 2 ./test
When you execute an allocate statement, you reserve a region of the virtual memory space. The virtual space is the sum of the physical memory plus the swap, plus possibly some extra space due to overcommitting, where the system assumes you will not actually use the whole reservation.
But the memory is not physically reserved until you write something into it. When you write to memory, the system physically allocates the corresponding page for you.
If you don't initialize your array, and if your array is very sparse, it is possible that many pages are never written, so the memory is never physically fully used.
When you see the system slowing down, it may be that the system is swapping pages to disk because the physical memory is full. If you have 8GB of RAM and 8GB of swap on disk, your calculation can run, though very slowly.
This mechanism works quite well in NUMA environments, since this "first touch" policy allocates the memory close to the CPU that first writes into it.
In this way, you can initialize an array in an OpenMP loop to physically place the memory close to the CPUs that will use it.
I've been doing some computationally intensive work in F#. Functions like Array.Parallel.map, which use the .NET Task Parallel Library, have sped up my code dramatically for really quite minimal effort.
However, due to memory concerns, I remade a section of my code so that it can be lazily evaluated inside a sequence expression (this means I have to store and pass less information). When it came time to evaluate I used:
// processor and memory intensive task, results are not stored
let calculations : seq<Calculation> = seq { ...yield one thing at a time... }

// extract results from calculations for summary data
PSeq.iter someFuncToExtractResults calculations
Instead of:
// processor and memory intensive task, storing these results is an unnecessary task
let calculations : Calculation[] = ...do all the things...
// extract results from calculations for summary data
Array.Parallel.map someFuncToExtractResults calculations
When using any of the Array.Parallel functions I can clearly see all the cores on my computer kick into gear (~100% CPU usage). However, the extra memory required means the program never finishes.
With the PSeq.iter version, when I run the program there's only about 8% CPU usage (and minimal RAM usage).
So: is there some reason why the PSeq version runs so much slower? Is it because of the lazy evaluation? Is there some magic "be parallel" stuff I am missing?
Thanks,
Other resources, source code implementations of both (they seem to use different Parallel libraries in .NET):
https://github.com/fsharp/fsharp/blob/master/src/fsharp/FSharp.Core/array.fs
https://github.com/fsharp/powerpack/blob/master/src/FSharp.PowerPack.Parallel.Seq/pseq.fs
EDIT: Added more detail to code examples and details
Code:
Seq
// processor and memory intensive task, results are not stored
let calculations : seq<Calculation> =
    seq {
        for index in 0 .. data.Length - 1 do
            yield calculationFunc data.[index]
    }

// extract results from calculations for summary data (different module)
PSeq.iter someFuncToExtractResults calculations
Array
// processor and memory intensive task, storing these results is an unnecessary task
let calculations : Calculation[] =
    Array.Parallel.map calculationFunc data
// extract results from calculations for summary data (different module)
Array.Parallel.map someFuncToExtractResults calculations
Details:
The version that stores the intermediate array runs quickly (as far as it gets before crashing), in under 10 minutes, but uses ~70GB of RAM before it crashes (64GB physical, the rest paged).
The seq version takes over 34 minutes but uses only a fraction of the RAM (around 30GB).
There are roughly a billion values I'm calculating, hence a billion doubles (at 64 bits each) = 7.4505806GB. There are more complex forms of data... and a few unnecessary copies I'm cleaning up, hence the current massive RAM usage.
Yes, the architecture isn't great; the lazy evaluation is the first part of my attempt to optimize the program and/or batch the data into smaller chunks.
With a smaller dataset, both chunks of code output the same results.
#pad, I tried what you suggested: PSeq.iter seemed to work properly (all cores active) when fed the Calculation[], but there is still the matter of RAM (it eventually crashed).
Both the summary part of the code and the calculation part are CPU-intensive (mainly because of the large data sets).
With the Seq version I just aim to parallelize once.
Based on your updated information, I'm shortening my answer to just the relevant part. You just need this instead of what you currently have:
let result = data |> PSeq.map (calculationFunc >> someFuncToExtractResults)
And this will work the same whether you use PSeq.map or Array.Parallel.map.
However, your real problem is not going to be solved. It can be stated as: by the time you reach the degree of parallelism needed for 100% CPU usage, there is not enough memory to support all the work in flight.
Can you see why this cannot simply be solved? You can either process things sequentially (less CPU-efficient, but memory-efficient) or you can process things in parallel (more CPU-efficient, but it runs out of memory).
The options then are:
Change the degree of parallelism to be used by these functions to something that won't blow your memory:
let result =
    data
    |> PSeq.withDegreeOfParallelism 2
    |> PSeq.map (calculationFunc >> someFuncToExtractResults)
Change the underlying logic of calculationFunc >> someFuncToExtractResults so that it is a single function that is more efficient and streams data through to results. Without knowing more detail, it's not simple to see how this could be done, but internally some lazy loading may well be possible; one chunked variant is sketched below.
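As an illustration only (a sketch with a hypothetical chunkSize you would tune to your memory budget, and using Seq.chunkBySize, which needs F# 4.0+ or a hand-rolled equivalent), you can stream the input in fixed-size chunks and parallelize within each chunk, so that only one chunk's worth of intermediate Calculation values is alive at a time:

// Hypothetical chunked pipeline: bounds memory to roughly one chunk of
// intermediate results while still using all cores within each chunk.
let chunkSize = 100000   // assumption: tune this to your memory budget

let result =
    data
    |> Seq.chunkBySize chunkSize
    |> Seq.collect (fun chunk ->
        chunk |> Array.Parallel.map (calculationFunc >> someFuncToExtractResults))
    |> Seq.toArray   // assumes the final summary results are small enough to hold

This keeps the parallelism of Array.Parallel.map inside each chunk while never materializing the full billion-element intermediate array.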
Array.Parallel.map uses Parallel.For under the hood, while PSeq is a thin wrapper around PLINQ. The reason they behave so differently here is that there is not enough work for PSeq.iter when seq<Calculation> is generated sequentially and is too slow in yielding new results.
I do not see the point of using an intermediate seq or array. Supposing data is the input array, moving all calculations into one place is the way to go:
// Should use PSeq.map to match with Array.Parallel.map
PSeq.map (calculationFunc >> someFuncToExtractResults) data
and
Array.Parallel.map (calculationFunc >> someFuncToExtractResults) data
You avoid consuming too much memory and keep the intensive computation in one place, which leads to better efficiency in parallel execution.
I had a problem similar to yours and solved it by adding the following to the solution's App.config file:
<runtime>
<gcServer enabled="true" />
<gcConcurrent enabled="true"/>
</runtime>
A calculation that was taking 5'49'' and showing roughly 22% CPU utilization on Process Lasso took 1'36'' showing roughly 80% CPU utilization.
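If you want to verify at runtime that the setting actually took effect, a small check (using the standard System.Runtime.GCSettings API; not part of the original solution) is:

open System.Runtime

// Prints true when the server GC requested in App.config is actually in use.
printfn "Server GC enabled: %b" GCSettings.IsServerGC
printfn "GC latency mode:   %A" GCSettings.LatencyMode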
Another factor that may influence the speed of parallelized code is whether Hyper-Threading (Intel) or SMT (AMD) is enabled in the BIOS. I have seen cases where disabling it leads to faster execution.
I am using R on some relatively big data and am hitting some memory issues. This is on Linux. I have significantly less data than the available memory on the system so it's an issue of managing transient allocation.
When I run gc(), I get the following listing
             used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    2147186  114.7    3215540  171.8   2945794  157.4
Vcells  251427223 1918.3  592488509 4520.4 592482377 4520.3
yet R appears to have about 4GB allocated in resident memory and 2GB in swap. I'm assuming this is OS-allocated memory that R's memory management system will allocate and GC as needed. However, let's say that I don't want to let R OS-allocate more than 4GB, to prevent swap thrashing. I could always use ulimit, but then it would just crash instead of working within the reduced space and GCing more often. Is there a way to specify an arbitrary maximum for the gc trigger and make sure that R never OS-allocates more? Or is there something else I could do to manage memory usage?
In short: no. I found that you simply cannot micromanage memory management and gc().
On the other hand, you could try to keep your data in memory but 'outside' of R. The bigmemory package makes that fairly easy. Of course, using a 64-bit version of R and ample RAM may make the problem go away too.
I have just started learning Erlang and am trying out some Project Euler problems to get started. However, I seem to be unable to do any operations on large sequences without crashing the Erlang shell.
I.e., even this:
lists:seq(1,64000000).
crashes Erlang with the error:
eheap_alloc: Cannot allocate 467078560 bytes of memory (of type "heap").
The actual number of bytes varies, of course.
Now half a gig is a lot of memory, but a system with 4 gigs of RAM and plenty of space for virtual memory should be able to handle it.
Is there a way to let erlang use more memory?
Your OS may have a default limit on the size of a user process. On Linux you can change this with ulimit.
You probably want to iterate over these 64000000 numbers without needing them all in memory at once. Lazy lists let you write code similar in style to the list-all-at-once code:
-module(lazy).
-export([seq/2]).

seq(M, N) when M =< N ->
    fun() -> [M | seq(M+1, N)] end;
seq(_, _) ->
    fun() -> [] end.
1> Ns = lazy:seq(1, 64000000).
#Fun<lazy.0.26378159>
2> hd(Ns()).
1
3> Ns2 = tl(Ns()).
#Fun<lazy.0.26378159>
4> hd(Ns2()).
2
Possibly a noob answer (I'm a Java dev), but the JVM artificially limits the amount of memory to help detect memory leaks more easily. Perhaps erlang has similar restrictions in place?
This is a feature. We do not want one process to consume all the memory. It's like the fuse box in your house: it's there for the safety of us all.
You have to know Erlang's recovery model to understand why they let the process just die.
Also, both Windows and Linux have limits on the maximum amount of memory an image can occupy.
As I recall, on Linux it is half a gigabyte.
The real question is why these operations aren't being done lazily ;)