Is it better to `compute` for control flow or build a fully-`delayed` task graph? - dask

I have an existing Pandas codebase and have just started trying to convert it to Dask. I am still trying to wrap my head around Dask dataframe, delayed, and distributed. From reading over the dask.delayed docs, it seems like the ideal case would be to build up a task/computation graph for the entire set of operations I want to do, including delayed functions for user messages, and then running all computations in one large chunk at the end. That way, the calling process wouldn't need to keep running while the Dask cluster performs the actual work.
The problem that I've been facing, though, is that there seem to be situations where this is not feasible, particularly when it comes to Python control flow. For example:
df = dd.read_csv(...)
if df.isnull().any():
    # early exit
    raise ValueError()
df = some(df)
df = more(df)
df = calculations(df)
# and potentially more complex control flow
I don't really see how something like that can be done without calling df.isnull().any().compute().
I also don't know right now whether there's anything 'bad' (counter to best practices) about calling compute() or persist() in a script. A lot of the examples online seem to be based on an experimental/Jupyter environment, where load -> preparation -> persist() -> experimentation seems to be the standard approach. Since I have a relatively linear set of operations (load -> op1 -> op2 -> ... -> opn -> save), I thought I should simply schedule tasks as early as possible without doing any computation and avoid compute/persist, which I now feel has led me into a bit of a dead end.
So to summarise I guess I have two questions I would like answered, the first being 'is it bad to use compute?', and the second being 'if yes, how can I avoid compute but still have good & readable control flow?'.

It is totally ok to call compute whenever you need a concrete value. Control flow is an excellent example of this.
You might want to call .persist() first on the main trunk of your computation and then call .compute() for the control flow bits, just to make sure that you don't repeat the load -> op1 -> op2 -> ... parts of your computation.
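For example, here is a minimal sketch of that pattern; the file names, column names, and output path are made up for illustration, and persist() assumes a scheduler that can hold the trunk in memory (such as the distributed scheduler):
import dask.dataframe as dd

# Hypothetical input files and columns, purely for illustration.
df = dd.read_csv("data-*.csv")   # lazy; nothing is read yet
df = df.persist()                # compute the loaded trunk once and keep it around

# Control flow needs a concrete value, so compute only this small check.
# isnull().any() gives a per-column Series; the second .any() reduces it to a single boolean.
if df.isnull().any().any().compute():
    raise ValueError("input contains missing values")

# The remaining operations stay lazy and reuse the persisted trunk.
df = df[df["value"] > 0]
df["doubled"] = df["value"] * 2
df.to_csv("cleaned-*.csv", index=False)   # the final compute happens here

The only eager step is the small boolean check; everything downstream of the persisted trunk is still scheduled as a single graph and runs when the save is requested.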

Related

Merging a huge list of dataframes using dask delayed

I have a function which returns a dataframe to me. I am trying to use this function in parallel by using dask.
I append the delayed dataframe objects to a list. However, the run-time of my code is the same with and without dask.delayed.
I use the reduce function from functools along with pd.merge to merge my dataframes.
Any suggestions on how to improve the run-time?
The visualized graph and code are as below.
from functools import reduce
from dask import delayed

d = []
for lot in lots:
    lot_data = data[data["LOTID"] == lot]
    trmat = delayed(LOT)(lot, lot_data).transition_matrix(lot)
    d.append(trmat)

df = delayed(reduce)(lambda x, y: x.merge(y, how='outer', on=['from', 'to']), d)
[Visualized graph of the operations]
General rule: if your data comfortably fits into memory (including the base size times a small number for possible intermediates), then there is a good chance that Pandas is fast and efficient for your use case.
Specifically for your case, there is a good chance that the tasks you are trying to parallelise do not release Python's internal lock, the GIL, in which case, although you have independent threads, only one can run at a time. The solution would be to use the "distributed" scheduler instead, which can run any mix of multiple threads and processes; however, using processes comes at a cost for moving data between the client and the processes, and you may find that the extra cost dominates any time saving. You would certainly want to ensure that you load the data within the workers rather than passing it from the client.
The short story: do some experimentation, measure well, and read the dataframe and distributed-scheduler documentation carefully.
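As a rough sketch of that suggestion, here is one way it could look with the distributed scheduler; the per-lot CSV files and the transition_matrix_for_lot function are hypothetical stand-ins for however you actually load and process each lot:
import pandas as pd
from functools import reduce
from dask import delayed
from dask.distributed import Client

def transition_matrix_for_lot(lot):
    # Hypothetical: load this lot's rows inside the worker instead of shipping
    # a large frame from the client, then build its from/to count table.
    lot_data = pd.read_csv(f"lots/{lot}.csv")
    return lot_data.groupby(["from", "to"]).size().reset_index(name=str(lot))

if __name__ == "__main__":
    client = Client()  # local distributed scheduler; worker processes sidestep the GIL

    lots = ["lot_a", "lot_b", "lot_c"]
    parts = [delayed(transition_matrix_for_lot)(lot) for lot in lots]
    merged = delayed(reduce)(
        lambda x, y: x.merge(y, how="outer", on=["from", "to"]), parts
    )
    print(merged.compute())

Whether this beats plain Pandas still depends on how much work each lot represents and how much data has to move between processes, so measure both before committing to it.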

How to maintain state in Erlang?

I have seen people use dict, orddict, and record for maintaining state in many blogs that I have read. I find it a very vital concept.
Generally I understand the meaning of maintaining state and recursion, but when it comes to Erlang I am a little vague about how it is handled.
Any help?
State is the present arrangement of data. It is sometimes hard to remember this for two reasons:
State means both the data in the program and the program's current point of execution and "mode".
We build this up to be some magical thing unnecessarily.
Consider this:
"What is the process's state?" is asking about the present value of variables.
"What state is the process in?" usually refers to the mode, options, flags or present location of execution.
If you are a Turing machine then these are the same question; we have separated the ideas to give us handy abstractions to build on (like everything else in programming).
Let's think about state variables for a moment...
In many older languages you can alter state variables from whatever context you like, whether the modification of state is appropriate or not, because you manage this directly. In more modern languages this is a bit more restricted by imposing type declarations, scoping rules and public/private context to variables. This is really a rules arms-race, each language finding more ways to limit when assignment is permitted. If scheduling is the Prince of Frustration in concurrent programming, assignment is the Devil Himself. Hence the various cages built to manage him.
Erlang restricts the situations in which assignment is permitted in a different way: assignment happens only once per entry to a function, functions are themselves the sole definition of procedural scope, and all state is purely encapsulated by the executing process. (Think about the statement on scope to understand why many people feel that Erlang macros are a bad thing.)
These rules on assignment (use of state variables) encourage you to think of state as discrete slices of time. Every entry to a function starts with a clean slate, whether the function is recursive or not. This is a fundamentally different situation from the ongoing chaos of in-place modifications made from anywhere to anywhere in most other languages. In Erlang you never ask "what is the value of X right now?" because it can only ever be what it was initially assigned to be in the context of the current run of the current function. This significantly limits the chaos of state changes within functions and processes.
The details of those state variables and how they are assigned is incidental to Erlang. You already know about lists, tuples, ETS, DETS, mnesia, db connections, etc. Whatever. The core idea to understand about Erlang's style is how assignment is managed, not the incidental details of this or that particular data type.
What about "modes" and execution state?
If we write something like:
has_cheeseburger(BurgerName) ->
    receive
        {From, ask, burger_name} ->
            From ! {ok, BurgerName},
            has_cheeseburger(BurgerName);
        {From, new_burger, _SomeBurger} ->
            From ! {error, already_have_a_burger},
            has_cheeseburger(BurgerName);
        {From, eat_burger} ->
            From ! {ok, {ate, BurgerName}},
            lacks_cheeseburger()
    end.

lacks_cheeseburger() ->
    receive
        {From, ask, burger_name} ->
            From ! {error, no_burger},
            lacks_cheeseburger();
        {From, new_burger, BurgerName} ->
            From ! {ok, thanks},
            has_cheeseburger(BurgerName);
        {From, eat_burger} ->
            From ! {error, no_burger},
            lacks_cheeseburger()
    end.
What are we looking at? A loop. Conceptually it's just one loop. Quite often a programmer would choose to write just one loop in code, add an argument like IsHoldingBurger to the loop, and check it after each message in the receive clause to determine what action to take.
Above, though, the idea of two operating modes is both more explicit (it's baked into the structure, not arbitrary procedural tests) and less verbose. We have separated the context of execution by writing basically the same loop twice, once for each condition we might be in, either having a burger or lacking one. This is at the heart of how Erlang deals with a concept called "finite state machines" and it's really useful. OTP includes a tool built around this idea in the gen_fsm module. You can write your own FSMs by hand as I did above or use gen_fsm -- either way, when you identify that you have a situation like this, writing code in this style makes reasoning much easier. (For anything but the most trivial FSM you will really appreciate gen_fsm.)
Conclusion
That's it for state handling in Erlang. The chaos of untamed assignment is rendered impotent by the basic rules of single-assignment and absolute data encapsulation within each process (this implies that you shouldn't write gigantic processes, by the way). The supremely useful concept of a limited set of operating modes is abstracted by the OTP module gen_fsm or can be rather easily written by hand.
Since Erlang does such a good job limiting the chaos of state within a single process and makes the nightmare of concurrent scheduling among processes entirely invisible, that only leaves one complexity monster: the chaos of interactions among loosely coupled actors. In the mind of an Erlanger this is where the complexity belongs. The hard stuff should generally wind up manifesting there, in the no-man's-land of messages, not within functions or processes themselves. Your functions should be tiny, your needs for procedural checking relatively rare (compared to C or Python), and your need for mode flags and switches almost nonexistent.
Edit
To reiterate Pascal's answer, in a super limited way:
loop(State) ->
    receive
        {async, Message} ->
            NewState = do_something_with(Message),
            loop(NewState);
        {sync, From, Message} ->
            NewState = do_something_with(Message),
            Response = process_some_response_on(NewState),
            From ! {ok, Response},
            loop(NewState);
        shutdown ->
            exit(shutdown);
        Any ->
            io:format("~p: Received: ~tp~n", [self(), Any]),
            loop(State)
    end.
Re-read tkowal's response for the most minimal version of this. Re-read Pascal's for an expansion of the same idea to include servicing messages. Re-read the above for a slightly different style of the same pattern of state handling, with the addition of outputting unexpected messages. Finally, re-read the two-state loop I wrote above and you'll see it's actually just another expansion on this same idea.
Remember, you can't re-assign a variable within the same iteration of a function but the next call can have different state. That is the extent of state handling in Erlang.
These are all variations on the same thing. I think you're expecting there to be something more, a more expansive mechanism or something. There is not. Restricting assignment eliminates all the stuff you're probably used to seeing in other languages. In Python you do somelist.append(NewElement) and the list you had now has changed. In Erlang you do NewList = lists:append(SomeList, [NewElement]) and SomeList is still exactly the same as it used to be, and a new list has been returned that includes the new element. Whether this actually involves copying in the background is not your problem. You don't handle those details, so don't think about them. This is how Erlang is designed, and that leaves single assignment and making fresh function calls to enter a fresh slice of time where the slate has been wiped clean again.
The easiest way to maintain state is using gen_server behaviour. You can read more on Learn you some Erlang and in the docs.
gen_server is a process that can be:
initialised with a given state,
given synchronous and asynchronous callbacks (synchronous for querying the data in "request-response" style and asynchronous for changing the state in "fire and forget" style).
It also has a couple of nice OTP mechanisms:
it can be supervised
it gives you basic logging
its code can be upgraded while the server is running without losing the state
and so on...
Conceptually gen_server is an endless loop, that looks like this:
loop(State) ->
    NewState = handle_requests(State),
    loop(NewState).
where handle_requests receives messages. This way all requests are serialised, so there are no race conditions. Of course it is a little bit more complicated in order to give you all the goodies that I described.
You can choose what data structure you want to use for State. It is common to use records, because they have named fields, but since Erlang 17 maps can also come in handy. This depends on what you want to store.
Variables are not mutable, so when you want the state to evolve, you create a new variable and later call the same function again with this new state as a parameter.
This structure is meant for server-like processes: there is no base case as in the usual factorial example; generally there is a specific message to stop the server smoothly.
loop(State) ->
    receive
        {add, Item} ->
            NewState = [Item | State],   % create a new variable
            loop(NewState);              % recall loop with the new variable
        {remove, Item} ->
            NewState = lists:filter(fun(X) -> X /= Item end, State),   % create a new variable
            loop(NewState);              % recall loop with the new variable
        {items, Pid} ->
            Pid ! {items, State},
            loop(State);
        stop ->
            stopped;                     % this will be the stop condition
        _ ->
            loop(State)                  % ignoring other messages may be interesting in a never ending loop
    end.

Erlang: When to use functions vs processes?

My task is to process files inside a zip file. So I write a bunch of independent functions and compose them to get the desired result. That's one way of doing things. Now, instead of having it all written as functions, I write some of them as processes with selective receives and all, and everything is cool. But then, pondering on this a bit further, I'm thinking: do we need functions at all? Couldn't I replace or convert all those functions into processes that communicate with themselves and with other processes? So there lies my doubt. When to use functions and when to use processes? Is there any advantage from a performance standpoint in using functions (like caching)? Don't code blocks in the processes get cached similarly?
So in our example what's the standard idiom to proceed with? Current pseudo code below.
start() ->
    FL = extract("..path"),
    FPids = lists:map(fun open_file/1, FL),   % get file Pids
    lists:foreach(fun(FPid) ->
                      CPid = spawn_compute_process(),
                      rpc(CPid, {compute, FPid})
                  end, FPids).

compute() ->
    receive
        {Pid, {..}} ->
            Line = read_line(..),
            TL = tidy_line(Line),   % an independent function. But couldn't it be a guard within this process?
            ..
    end.

extract(FilePath) -> FilesList.

read_line(FPid) -> line.
So how do you actually write code? Like, write smaller independent functions first and then wrap them up inside processes?
Thanks.
The short answer is that you use processes to exploit concurrency. Replacing functions with processes where you sequentially run one process, send its value to another process which then does its work and sends its result on to the next process, and so on, each process terminating after it has done its bit, is the wrong use of processes. Here you are just evaluating something sequentially by sending data from one process to another instead of calling functions.
If, however, you intend this chain of processes to be able to process multiple sequences of "calls" concurrently then it is a different matter. Then you are using the processes for concurrency. The more general way of doing this in erlang is to create a separate process for each sequence and exploit the concurrency in that manner.
Another use of processes is to manage state.

How do I create an atom dynamically in Erlang?

I am trying to register a couple of processes with atom names created dynamically, like so:
keep_alive(Name, Fun) ->
    register(Name, Pid = spawn(Fun)),
    on_exit(Pid, fun(_Why) -> keep_alive(Name, Fun) end).

monitor_some_processes(N) ->
    %% create N processes that restart automatically when killed
    for(1, N, fun(I) ->
                  Mesg = io_lib:format("I'm process ~p~n", [I]),
                  Name = list_to_atom(io_lib:format("zombie~p", [I])),
                  keep_alive(Name, fun() -> zombie(Mesg) end)
              end).

for(N, N, Fun) -> [Fun(N)];
for(I, N, Fun) -> [Fun(I) | for(I+1, N, Fun)].

zombie(Mesg) ->
    io:format(Mesg),
    timer:sleep(3000),
    zombie(Mesg).
That list_to_atom/1 call though is resulting in an error:
43> list_to_atom(io_lib:format("zombie~p", [1])).
** exception error: bad argument
in function list_to_atom/1
called as list_to_atom([122,111,109,98,105,101,"1"])
What am I doing wrong?
Also, is there a better way of doing this?
TL;DR
You should not dynamically generate atoms. From what your code snippet indicates you are probably trying to find some way to flexibly name processes, but atoms are not it. Use a K/V store of some type instead of register/2.
Discussion
Atoms are restrictive for a reason. They should represent something about the eternal structure of your program, not the current state of it. Atoms are so restrictive that I imagine what you really want to be able to do is register a process using any arbitrary Erlang value, not just atoms, and reference them more freely.
If that is the case, pick from one of the following four approaches:
Keep Key/Value pairs somewhere to act as your own registry. This could be a separate process or a list/tree/dict/map handler to store key/value pairs of #{Name => Pid}.
Use the global module (which, like gproc below, has features that work across a cluster).
Use a registry solution like Ulf Wiger's nice little project gproc. It is awesome for the times when you actually need it (which are, honestly, not as often as I see it used). Here is a decent blog post about its use and why it works the way it does: http://blog.rusty.io/2009/09/16/g-proc-erlang-global-process-registry/. An added advantage of gproc is that nearly every Erlanger you'll meet is at least passingly familiar with it.
A variant on the first option: structure your program as a tree of service managers and workers (as in the "Service -> Worker Pattern"). A side effect of this pattern is that very often the service manager winds up needing to monitor its processes for one reason or another if you're doing anything non-trivial, and that makes it an ideal candidate for a place to keep a Key/Value registry of Pids. It is quite common for this sort of pattern to wind up emerging naturally as a program matures, especially if that program has high robustness requirements. Structuring it as a set of semi-independent services with an abstract management interface at the top of each from the outset is often a handy evolutionary shortcut.
io_lib:format returns a potentially "deep list" (i.e. it may contain other lists), while list_to_atom requires a "flat list". You can wrap the io_lib:format call in a call to lists:flatten:
list_to_atom(lists:flatten(io_lib:format("zombie~p", [1]))).

What are the advantages of the "apply" functions? When are they better to use than "for" loops, and when are they not? [duplicate]

Possible Duplicate:
Is R's apply family more than syntactic sugar
Just what the title says. Stupid question, perhaps, but my understanding has been that when using an "apply" function, the iteration is performed in compiled code rather than in the R parser. This would seem to imply that lapply, for instance, is only faster than a "for" loop if there are a great many iterations and each operation is relatively simple. For instance, if a single call to a function wrapped up in lapply takes 10 seconds, and there are only, say, 12 iterations of it, I would imagine that there's virtually no difference at all between using "for" and "lapply".
Now that I think of it, if the function inside the "lapply" has to be parsed anyway, why should there be ANY performance benefit from using "lapply" instead of "for" unless you're doing something that there are compiled functions for (like summing or multiplying, etc)?
Thanks in advance!
Josh
There are several reasons why one might prefer an apply family function over a for loop, or vice-versa.
Firstly, for(), apply(), and sapply() will generally be just as quick as each other if executed correctly. lapply() does more of its operating in compiled code within the R internals than the others, so it can be faster than those functions. It appears the speed advantage is greatest when the act of "looping" over the data is a significant part of the compute time; in many general day-to-day uses you are unlikely to gain much from the inherently quicker lapply(). In the end, these all will be calling R functions, so they need to be interpreted and then run.
for() loops can often be easier to implement, especially if you come from a programming background where loops are prevalent. Working in a loop may be more natural than forcing the iterative computation into one of the apply family functions. However, to use for() loops properly, you need to do some extra work to set up storage and manage plugging the output of the loop back together again. The apply functions do this for you automagically. E.g.:
IN <- runif(10)
OUT <- logical(length = length(IN))
for(i in seq_along(IN)) {
    OUT[i] <- IN[i] > 0.5
}
that is a silly example as > is a vectorised operator but I wanted something to make a point, namely that you have to manage the output. The main thing is that with for() loops, you always allocate sufficient storage to hold the outputs before you start the loop. If you don't know how much storage you will need, then allocate a reasonable chunk of storage, and then in the loop check if you have exhausted that storage, and bolt on another big chunk of storage.
The main reason, in my mind, for using one of the apply family of functions is for more elegant, readable code. Rather than managing the output storage and setting up the loop (as shown above) we can let R handle that and succinctly ask R to run a function on subsets of our data. Speed usually does not enter into the decision, for me at least. I use the function that suits the situation best and will result in simple, easy to understand code, because I'm far more likely to waste more time than I save by always choosing the fastest function if I can't remember what the code is doing a day or a week or more later!
The apply family lend themselves to scalar or vector operations. A for() loop will often lend itself to doing multiple iterated operations using the same index i. For example, I have written code that uses for() loops to do k-fold or bootstrap cross-validation on objects. I probably would never entertain doing that with one of the apply family as each CV iteration needs multiple operations, access to lots of objects in the current frame, and fills in several output objects that hold the output of the iterations.
As to the last point, about why lapply() can possibly be faster than for() or apply(), you need to realise that the "loop" can be performed in interpreted R code or in compiled code. Yes, both will still be calling R functions that need to be interpreted, but if you are doing the looping and calling directly from compiled C code (as lapply() does) then that is where the performance gain can come from over, say, apply(), which boils down to a for() loop in actual R code. See the source for apply() to see that it is a wrapper around a for() loop, and then look at the code for lapply(), which is:
> lapply
function (X, FUN, ...)
{
    FUN <- match.fun(FUN)
    if (!is.vector(X) || is.object(X))
        X <- as.list(X)
    .Internal(lapply(X, FUN))
}
<environment: namespace:base>
and you should see why there can be a difference in speed between lapply() and for() and the other apply family functions. The .Internal() is one of R's ways of calling compiled C code used by R itself. Apart from some argument manipulation and a sanity check on FUN, the entire computation is done in C, calling the R function FUN. Compare that with the source for apply().
From Burns' R Inferno (pdf), p25:
Use an explicit for loop when each iteration is a non-trivial task. But a simple loop can be more clearly and compactly expressed using an apply function. There is at least one exception to this rule ... if the result will be a list and some of the components can be NULL, then a for loop is trouble (big trouble) and lapply gives the expected answer.
