Memory failure when iterating many times with foreach %dopar%

library(doParallel)  # also loads foreach
registerDoParallel(128)
b <- matrix(10.0003, ncol = 10, nrow = 12251352)
res1 <- foreach(i = 1:nrow(b), .combine = "rbind") %dopar% (1:10)
When I run the code above, I get a memory error:
Error in mcfork() :
unable to fork, possible reason: Cannot allocate memory
How can I make full use of the 128 cores without running out of memory?
Would any of the following be an effective approach?
add for as outer loop
add for as inner loop
add foreach %do% as outer loop
add foreach %do% as inner loop
add foreach %:% as outer loop
add foreach %:% as inner loop
Another question: will foreach copy its output variable across all cores?

Related

F# seq behavior

I'm a little baffled about the inner work of the sequence expression in F#.
Normally if we make a sequential file reader with seq with no intentional caching of data
seq {
    let mutable current = file.Read()
    while current <> -1 do
        yield current
        // advance to the next byte; without this the loop never ends
        current <- file.Read()
}
We will end up with some weird behavior if we try to re-iterate or backtrack. My idea was that, since Read() is a function that advances some mutable state, we can't expect the output to be correct if we re-iterate. But then why does this behave nicely, even across buffer boundaries?
let Read path =
    seq {
        use fp = System.IO.File.OpenRead path
        let buf = [| for _ in 0 .. 1024 -> 0uy |]
        let mutable pos = 1
        let mutable current = 0
        while pos <> 0 do
            if current = 0 then
                pos <- fp.Read(buf, 0, 1024)
            if pos > 0 && current < pos then
                yield buf.[current]
            current <- (current + 1) % 1024
    }
let content = Read "some path"
We clearly reuse the same buffer to improve performance. But suppose we read the 1025th byte, which triggers a refill of the buffer: if we then try to read any byte at a position < 1025, we still get the correct output. How can that be, and what is the difference?
Your question is a bit unclear, so I'll try to guess.
When you create a seq { }, you're essentially creating a state machine which will run only as far as it needs to. When you request the very first element from it, it'll start at the top and run until your first yield instruction. Then, when you request another value, it'll run from that point until the next yield, and so on.
Keep in mind that a seq { } produces an IEnumerable<'T>, which is like a "plan of execution". Each time you start to iterate the sequence (for example by calling Seq.head), a call to GetEnumerator is made behind the scenes, which causes a new IEnumerator<'T> to be created. It is the IEnumerator that actually provides the values. You can think of it in more classical terms as having an array over which you can iterate (an iterable or enumerable) and many pointers into that array, each of which is at a different position (many iterators or enumerators).
In your first code, file is most likely external to the seq block. This means that the file you are reading from is baked into the plan of execution; no matter how many times you start to iterate the sequence, you'll always be reading from the same file. This is obviously going to cause unpredictable behaviour.
However, in your second code, the file is opened as part of the seq block's definition. This means that you'll get a new file handle each time you iterate the sequence or, essentially, a new file handle per enumerator. The reason this code works is that you can't reverse an enumerator or iterate over it multiple times, not with a single thread at least.
(Now, if you were to manually get an enumerator and advance it over multiple threads, you'd probably run into problems very quickly. But that is a different topic.)
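The enumerable/enumerator distinction described above can be sketched in Python, where an iterable's `__iter__` plays the role of GetEnumerator. This is only an analogy, not F#: the class names `FreshReader` and `SharedReader` are made up for the illustration, and an in-memory `io.BytesIO` stands in for the file.

```python
import io

class FreshReader:
    """Opens its own stream per iteration, like `use fp` inside the seq block."""
    def __init__(self, data: bytes):
        self._data = data

    def __iter__(self):
        stream = io.BytesIO(self._data)   # new "file handle" per iterator
        while (chunk := stream.read(1)):
            yield chunk[0]

class SharedReader:
    """All iterators pull from one external stream, like `file` in the first snippet."""
    def __init__(self, stream):
        self._stream = stream

    def __iter__(self):
        while (chunk := self._stream.read(1)):
            yield chunk[0]

fresh = FreshReader(b"abc")
fresh_first = list(fresh)     # reads all three bytes
fresh_second = list(fresh)    # starts over with a new stream

shared = SharedReader(io.BytesIO(b"abc"))
shared_first = list(shared)   # consumes the shared stream
shared_second = list(shared)  # nothing left to read
```

Re-iterating `fresh` yields the same bytes each time, while the second pass over `shared` comes back empty because the underlying stream position is shared between iterations.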

erlang has no shared memory. So what happens with the sum function for example?

Erlang has no shared memory. Look at the sum function,
sum([H|T]) -> H + sum(T);
sum([]) -> 0.
So
sum([1,2,3])=1+2+3+0
Now what happens? Does Erlang create an array with [1,1+2,1+2+3,1+2+3+0]?
This is what happens:
sum([1,2,3]) = 1 + sum([2,3])
=> sum([2, 3]) = 2 + sum([3])
=> sum([3]) = 3 + sum([])
=> sum([]) = 0
Now sum([3]) can be evaluated:
sum([3]) = 3 + sum([]) = 3 + 0 = 3
which means that sum([2, 3]) can be evaluated:
sum([2, 3]) = 2 + sum([3]) = 2 + 3 = 5
which means that sum([1, 2, 3]) can be evaluated:
sum([1,2,3]) = 1 + sum([2,3]) = 1 + 5 = 6
Response to comment:
Okay, I figured what you were really asking about was immutable variables. Suppose you have the following C code:
int x = 0;
x += 1;
Does that code somehow demonstrate shared memory? If not, then C does not use shared memory for int variables...and neither does erlang.
In C you introduce a variable, sum, give it an initial value, 0, and
after that you add values to it. Erlang does not do this. What does
Erlang do?
Erlang allocates a new frame on the stack for each recursive function call. Each frame stores the local variables and their values, e.g. the parameter variables, for that particular function call. There can be multiple frames on the stack each storing a variable named X, but they are separate variables, so none of the X variables is ever mutated--instead a new X variable is created for each new frame, and the new X is given a new value.
Now, if the stack really worked like that in Erlang, then a recursive function that executed millions of times would add millions of frames to the stack and in the process would probably use up its allocated memory and crash your program. To avoid using excessive amounts of memory, Erlang employs tail call optimization, which allows the amount of memory that a function uses to remain constant. Tail call optimization allows Erlang to replace the current frame on the stack with the frame of the next call, which keeps the memory usage constant. When a function is not defined in a tail recursive format, like your sum() function, the stack still grows with the length of the list, although such body-recursive functions are often not much slower than their tail-recursive counterparts (see the Seven Myths of Erlang Performance).
In your sum() function, no variables are mutated and no memory is shared. In effect, though, function parameter variables do act like mutable variables.
My first diagram above is a representation of the stack adding a new frame for each recursive function call. If you redefine sum() to be tail recursive, like this:
sum(List)->
    sum(List, 0).
sum([H|T], Total) ->
    sum(T, Total+H);
sum([], Total)->
    Total.
then below is a diagram of a recursive function executing that represents frames being replaced on the stack to keep the memory usage constant:
sum([1, 2, 3]) => sum([1, 2, 3], 0)  [H=1, T=[2,3], Total=0]
               => sum([2,3], 1)      [H=2, T=[3], Total=1]
               => sum([3], 3)        [H=3, T=[], Total=3]
               => sum([], 6)         [Total=6]
               => 6
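For readers more familiar with imperative languages, here is a rough Python sketch of the two shapes of recursion discussed above. Note the assumption: Python does not perform tail call optimization, so `sum_acc` still grows the stack here; `sum_loop` shows what the optimized tail call effectively amounts to. The function names are made up for this illustration.

```python
def sum_body(items):
    """Body recursion: the addition happens after the recursive call returns,
    so every frame must stay alive until the base case is reached."""
    if not items:
        return 0
    head, *tail = items
    return head + sum_body(tail)

def sum_acc(items, total=0):
    """Accumulator form: the recursive call is the last thing the function does,
    so nothing in the current frame is needed afterwards."""
    if not items:
        return total
    head, *tail = items
    return sum_acc(tail, total + head)

def sum_loop(items):
    """What tail call optimization effectively produces: one frame, reused."""
    total = 0
    for head in items:
        total = total + head   # 'total' is rebound each step, like a new frame's Total
    return total

body_result = sum_body([1, 2, 3])
acc_result = sum_acc([1, 2, 3])
loop_result = sum_loop([1, 2, 3])
```

All three compute 6; they differ only in how many frames are alive at once.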
You're making recursive calls. The scope of each function body does not end until it returns something, so the immutable variable H of every call is kept until the base case is reached.
The function could be made tail recursive with the help of an accumulator argument, which is lighter on memory: it calculates the H part first and then makes the recursive call, passing the calculated value along as the accumulator.
Either way, nothing outside of your function scopes is used.

F# todo list using immutable objects

I'm trying to figure out how to do a to do list in F# using immutable objects. The to do list (not necessarily an F# list) might be pulled from a database or collected from user input or read from XML or JSON, etc. That part is not so important.
Pseudo code:
do for some length of time:
for each item in the to do list:
if item is ready to do:
do item
if it worked:
remove from the todo list
wait a bit before trying again
report on items that weren't ready or that failed.
The to do list will be some collection of F# records which will have at least an instruction ("Send Email", "Start a process", "Copy a File", "Ask for a raise") along with parameters as a sub-collection.
Can such a thing be done with immutable objects alone? Or must I use a .NET List or some other mutable object?
I don't need fully-fleshed out working code, just some ideas about how I'd put such a thing together.
UPDATE: First attempt at (half-)coding this thing:
let processtodo list waittime deadline =
    let rec inner list newlist =
        match list with
        | [] when not List.isEmpty newlist ->
            inner newlist []
        | head :: tail when head.isReady->
            let res = head.action
            inner tail ( if res = true then tail else list)
        | head :: tail when not head.isReady ->
            inner tail list
        | _ when deadline not passed ->
            // wait for the waittime
            inner list
        | _ -> report on unfinished list
    inner list []
I tried to write this in the typical fashion seen in many examples. I assumed that the items support the "isReady" and "action" methods. The thing I don't like is that it's not tail-call recursive, so it will consume stack space for each recursion.
Recursion and/or continuations are the typical strategies to transform code with mutable structures in loops to immutable structures. If you know how to write a recursive "List.filter" then you'll probably have some ideas to be on the right track.
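To make the filter idea concrete, here is a minimal sketch in Python rather than F#, using only immutable tuples. Everything in it is an assumption for illustration: the `TodoItem` fields, the `run_pass` and `process_todo` names, and the use of a monotonic-clock deadline. Each pass builds a new tuple of leftover items instead of removing anything from an existing collection.

```python
import time
from typing import Callable, NamedTuple, Tuple

class TodoItem(NamedTuple):
    """Immutable record; the field names are assumptions for this sketch."""
    name: str
    is_ready: Callable[[], bool]
    action: Callable[[], bool]    # returns True when the work succeeded

def run_pass(items: Tuple[TodoItem, ...]) -> Tuple[TodoItem, ...]:
    """One pass over the list: keep items that were not ready or that failed."""
    return tuple(i for i in items if not (i.is_ready() and i.action()))

def process_todo(items, wait_seconds: float, deadline: float):
    """Repeat passes until nothing is left or the deadline passes."""
    remaining = tuple(items)
    while remaining and time.monotonic() < deadline:
        remaining = run_pass(remaining)   # rebinding a name, not mutating a list
        if remaining:
            time.sleep(wait_seconds)
    return remaining                      # items to report as unfinished

todo = (
    TodoItem("send email", lambda: True, lambda: True),      # ready, succeeds
    TodoItem("copy file", lambda: False, lambda: True),      # never ready
    TodoItem("ask for raise", lambda: True, lambda: False),  # ready, always fails
)
leftover = process_todo(todo, wait_seconds=0.01, deadline=time.monotonic() + 0.05)
```

The original `todo` tuple is never modified; `leftover` holds the items that were not ready or that failed, which is what the report step would print.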

Lua table memory leak?

I have a memory leak issue about the usage of lua table, the code is below:
function workerProc()
    -- a table holds some objects (userdata, the __gc is implemented correctly)
    local objs = {createObj(), createObj(), ...}
    while isWorking() do
        -- ...
        local query = {unpack(objs)}
        repeat
            -- ...
            table.remove(query, queryIndex)
        until #query == 0
        sleep(1000)
    end
end
The table objs is initialized with some userdata objects, and these objects remain reachable throughout the while loop, so they are never garbage collected. Inside the while loop, the query table is initialized with all the elements of objs (using the unpack function). While running the script I found that memory usage keeps increasing, but when I comment out local query = {unpack(objs)} the growth disappears.
I don't think this piece of code should leak memory, because the query variable is local and should become unreachable after each iteration of the while loop, yet it does. Does anybody know why the memory is swallowed by that table?
Judging from your code example, the likely explanation for what you are seeing is that the gc doesn't get a chance to perform a full collection cycle while inside the loop.
You can force a collection right after the inner loop using collectgarbage() and see if that resolves the memory issue:
while isWorking() do
    -- ..
    local query = {unpack(objs)}
    repeat
        -- ..
        table.remove(query, queryIndex)
    until #query == 0
    collectgarbage()
    sleep(1000)
end
Another possibility is to move local query outside the loop and create the table once, instead of creating a new table on every iteration of the outer loop.

Is there a difference between foreach and map?

Ok this is more of a computer science question, than a question based on a particular language, but is there a difference between a map operation and a foreach operation? Or are they simply different names for the same thing?
Different.
foreach iterates over a list and performs some operation with side effects to each list member (such as saving each one to the database for example)
map iterates over a list, transforms each member of that list, and returns another list of the same size with the transformed members (such as converting a list of strings to uppercase)
The important difference between them is that map accumulates all of the results into a collection, whereas foreach returns nothing. map is usually used when you want to transform a collection of elements with a function, whereas foreach simply executes an action for each element.
In short, foreach is for applying an operation on each element of a collection of elements, whereas map is for transforming one collection into another.
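A minimal Python sketch of that distinction. The `foreach` helper is hypothetical (Python has no built-in of that name, a plain `for` loop is the usual spelling); `map` is the built-in, wrapped in `list` to force evaluation. The `saved` list merely stands in for a database.

```python
def foreach(items, action):
    """Hypothetical helper: run `action` on every item for its side effect."""
    for item in items:
        action(item)
    # No return statement: the caller gets None.

saved = []                       # stands in for a database in this sketch
names = ["ada", "grace", "alan"]

foreach_result = foreach(names, saved.append)   # side effect only
upper_names = list(map(str.upper, names))       # new list, same length
```

After this runs, `foreach_result` is `None`, `saved` holds the three names, and `upper_names` is a separate list of the same size while `names` is unchanged.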
There are two significant differences between foreach and map.
foreach has no conceptual restrictions on the operation it applies, other than perhaps accept an element as argument. That is, the operation may do nothing, may have a side-effect, may return a value or may not return a value. All foreach cares about is to iterate over a collection of elements, and apply the operation on each element.
map, on the other hand, does have a restriction on the operation: it expects the operation to return an element, and probably also accept an element as argument. The map operation iterates over a collection of elements, applying the operation on each element, and finally storing the result of each invocation of the operation into another collection. In other words, the map transforms one collection into another.
foreach works with a single collection of elements. This is the input collection.
map works with two collections of elements: the input collection and the output collection.
It is not a mistake to relate the two algorithms: in fact, you may view the two hierarchically, where map is a specialization of foreach. That is, you could use foreach and have the operation transform its argument and insert it into another collection. So, the foreach algorithm is an abstraction, a generalization, of the map algorithm. In fact, because foreach has no restriction on its operation we can safely say that foreach is the simplest looping mechanism out there, and it can do anything a loop can do. map, as well as other more specialized algorithms, is there for expressiveness: if you wish to map (or transform) one collection into another, your intention is clearer if you use map than if you use foreach.
We can extend this discussion further, and consider the copy algorithm: a loop which clones a collection. This algorithm too is a specialization of the foreach algorithm. You could define an operation that, given an element, will insert that same element into another collection. If you use foreach with that operation you in effect performed the copy algorithm, albeit with reduced clarity, expressiveness or explicitness. Let's take it even further: We can say that map is a specialization of copy, itself a specialization of foreach. map may change any of the elements it iterates over. If map doesn't change any of the elements then it merely copied the elements, and using copy would express the intent more clearly.
The foreach algorithm itself may or may not have a return value, depending on the language. In C++, for example, foreach returns the operation it originally received. The idea is that the operation might have a state, and you may want that operation back to inspect how it evolved over the elements. map, too, may or may not return a value. In C++ transform (the equivalent for map here) happens to return an iterator to the end of the output container (collection). In Ruby, the return value of map is the output sequence (collection). So, the return value of the algorithms is really an implementation detail; their effect may or may not be what they return.
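The hierarchy described in this answer (map and copy as specializations of foreach) can be sketched in a few lines of Python. All three function names are made up for the illustration; the point is only that map and copy fall out of foreach once the operation appends to an output collection.

```python
def foreach(items, operation):
    """Hypothetical foreach: apply `operation` to every item, ignore results."""
    for item in items:
        operation(item)

def map_via_foreach(items, transform):
    """map expressed as foreach plus an operation that fills an output list."""
    output = []
    foreach(items, lambda item: output.append(transform(item)))
    return output

def copy_via_foreach(items):
    """copy is the same thing with an identity transform."""
    output = []
    foreach(items, output.append)
    return output

squares = map_via_foreach([1, 2, 3], lambda n: n * n)
clone = copy_via_foreach([1, 2, 3])
```

Both work, but as the answer argues, calling a dedicated map or copy states the intent more clearly than routing everything through foreach.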
Array.prototype.map and Array.prototype.forEach are quite similar.
Run the following code: http://labs.codecademy.com/bw1/6#:workspace
var arr = [1, 2, 3, 4, 5];
arr.map(function(val, ind, arr){
    console.log("arr[" + ind + "]: " + Math.pow(val,2));
});
console.log();
arr.forEach(function(val, ind, arr){
    console.log("arr[" + ind + "]: " + Math.pow(val,2));
});
They produce exactly the same output:
arr[0]: 1
arr[1]: 4
arr[2]: 9
arr[3]: 16
arr[4]: 25
arr[0]: 1
arr[1]: 4
arr[2]: 9
arr[3]: 16
arr[4]: 25
But the twist comes when you run the following code, where I've simply captured the return values of the map and forEach methods:
var arr = [1, 2, 3, 4, 5];
var ar1 = arr.map(function(val, ind, arr){
    console.log("arr[" + ind + "]: " + Math.pow(val,2));
    return val;
});
console.log();
console.log(ar1);
console.log();
var ar2 = arr.forEach(function(val, ind, arr){
    console.log("arr[" + ind + "]: " + Math.pow(val,2));
    return val;
});
console.log();
console.log(ar2);
console.log();
Now the result is something tricky!
arr[0]: 1
arr[1]: 4
arr[2]: 9
arr[3]: 16
arr[4]: 25
[ 1, 2, 3, 4, 5 ]
arr[0]: 1
arr[1]: 4
arr[2]: 9
arr[3]: 16
arr[4]: 25
undefined
Conclusion
Array.prototype.map returns an array but Array.prototype.forEach doesn't. So with map you can compute a new value for each element inside the callback and return it; those returned values make up the resulting array.
Array.prototype.forEach only walks through the given array so you can do your stuff while walking the array.
The most 'visible' difference is that map accumulates the results in a new collection, while foreach is done only for the execution itself.
But there are a couple of extra assumptions: since the 'purpose' of map is the new list of values, the order of execution doesn't really matter. In fact, some execution environments generate parallel code, or even introduce memoization to avoid repeated calls for repeated values, or laziness to avoid calling the function for some elements at all.
foreach, on the other hand, is called specifically for the side effects; therefore the order is important, and it usually can't be parallelised.
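Python happens to be one environment where the laziness mentioned above is observable: its built-in `map` returns a lazy iterator, so the function is not called until a value is requested. The sketch below records calls in a list named `calls` (a name chosen for this example) to show when the function actually runs.

```python
calls = []

def square(n):
    calls.append(n)      # record that the function actually ran
    return n * n

lazy = map(square, [1, 2, 3, 4, 5])   # nothing has been computed yet
calls_before = list(calls)

first = next(lazy)                    # computes only the first element
calls_after_one = list(calls)

for n in [10, 20]:                    # a plain loop runs every call, in order
    square(n)
```

Here `calls_before` is empty and `calls_after_one` contains only `1`; the remaining four elements of the mapped sequence are never computed unless something consumes them. The loop, by contrast, runs every call immediately, which is what you want when the side effect is the point.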
Short answer: map and forEach are different. Also, informally speaking, map is a strict superset of forEach.
Long answer: First, let's come up with one line descriptions of forEach and map:
forEach iterates over all elements, calling the supplied function on each.
map iterates over all elements, calling the supplied function on each, and produces a transformed array by remembering the result of each function call.
In many languages, forEach is often called just each. The following discussion uses JavaScript only for reference. It could really be any other language.
Now, let's use each of these functions.
Using forEach:
Task 1: Write a function printSquares, which accepts an array of numbers arr, and prints the square of each element in it.
Solution 1:
var printSquares = function (arr) {
    arr.forEach(function (n) {
        console.log(n * n);
    });
};
Using map:
Task 2: Write a function selfDot, which accepts an array of numbers arr, and returns an array wherein each element is the square of the corresponding element in arr.
Aside: Here, in slang terms, we are trying to square the input array. Strictly speaking, this is the element-wise product of the array with itself rather than a dot product, which would yield a single number.
Solution 2:
var selfDot = function (arr) {
    return arr.map(function (n) {
        return n * n;
    });
};
How is map a superset of forEach?
You can use map to solve both tasks, Task 1 and Task 2. However, you cannot use forEach to solve the Task 2.
In Solution 1, if you simply replace forEach by map, the solution will still be valid. In Solution 2 however, replacing map by forEach will break your previously working solution.
Implementing forEach in terms of map:
Another way of realizing map's superiority is to implement forEach in terms of map. As we are good programmers, we won't indulge in namespace pollution. We'll call our forEach just each.
Array.prototype.each = function (func) {
    this.map(func);
};
Now, if you don't like the prototype nonsense, here you go:
var each = function (arr, func) {
    arr.map(func); // Or map(arr, func);
};
So, umm.. why does forEach even exist?
The answer is efficiency. If you are not interested in transforming an array into another array, why should you compute the transformed array? Only to dump it? Of course not! If you don't want a transformation, you shouldn't do a transformation.
So while map can be used to solve Task 1, it probably shouldn't be. forEach is the right candidate for that.
Original answer:
While I largely agree with @madlep's answer, I'd like to point out that map() is a strict super-set of forEach().
Yes, map() is usually used to create a new array. However, it may also be used to change the current array.
Here's an example:
var a = [0, 1, 2, 3, 4], b = null;
b = a.map(function (x) { a[x] = 'What!!'; return x*x; });
console.log(b); // logs [0, 1, 4, 9, 16]
console.log(a); // logs ["What!!", "What!!", "What!!", "What!!", "What!!"]
In the above example, a was conveniently set such that a[i] === i for i < a.length. Even so, it demonstrates the power of map().
Here's the official description of map(). Note that map() may even change the array on which it is called! Hail map().
Hope this helped.
Edited 10-Nov-2015: Added elaboration.
Here is an example in Scala using lists: map returns list, foreach returns nothing.
def map(f: Int ⇒ Int): List[Int]
def foreach(f: Int ⇒ Unit): Unit
So map returns the list resulting from applying the function f to each list element:
scala> val list = List(1, 2, 3)
list: List[Int] = List(1, 2, 3)
scala> list map (x => x * 2)
res0: List[Int] = List(2, 4, 6)
Foreach just applies f to each element:
scala> var sum = 0
sum: Int = 0
scala> list foreach (sum += _)
scala> sum
res2: Int = 6 // res1 is empty
If you're talking about Javascript in particular, the difference is that map is a loop function while forEach is an iterator.
Use map when you want to apply an operation to each member of the list and get the results back as a new list, without affecting the original list.
Use forEach when you want to do something on the basis of each element of the list. You might be adding things to the page, for example. Essentially, it's great for when you want "side effects".
Other differences: forEach returns nothing (undefined), since it is really a control flow function, whereas map returns the new list. In JavaScript, the passed-in function receives the current element, its index, and the whole list in both cases.
foreach applies a function (such as writing to a database) to each element of the RDD without returning anything.
map(), on the other hand, applies a function to the elements of the RDD and returns a new RDD. So when you run the code below, it won't fail at line 3, but it will fail when collecting the result of foreach, with an error like this:
File "<stdin>", line 5, in <module>
AttributeError: 'NoneType' object has no attribute 'collect'
nums = sc.parallelize([1,2,3,4,5,6,7,8,9,10])
num2 = nums.map(lambda x: x+2)
print ("num2",num2.collect())
num3 = nums.foreach(lambda x : x*x)
print ("num3",num3.collect())

Resources