The setup is similar to this.
One agent (dataSource) is generating data, and a single agent (dataProcessor) is processing it. There is far more data being generated than dataProcessor can process, and I am not interested in processing all messages, just the latest piece of data.
One possible solution, proposed there by Jon Harrop, "is to greedily eat all messages in the inbox when one arrives and discard all but the most recent".
Another approach is not to listen for all messages, but rather for dataProcessor to use PostAndReply to ask dataSource for the latest piece of data.
What are the pros and cons of these approaches?
This is an intriguing question and there are quite likely several possible perspectives. I think the most notable aspect is that the choice will affect how you design the API at the interface between the two components:
In "Consume all" approach, the producer has a very simple API where it triggers some event whenever a value is produced and your consumer will subscribe to it. This means that you could have other subscribers listening to updates from the producer and doing something else than your consumer from this question.
In "Call to get latest" approach, the producer will presumably need to be written so that it keeps the current state and discards old values. It will then provide blocking async API to get the latest value. It could still expose an event for other consumers though. The consumer will need to actively poll for changes (in a busy loop of some sorts).
You could also have a producer with an event as in "Consume all", but then create another component that listens to any given event, keeps the latest value and makes it available via a blocking async call to any other client.
Here are some advantages/disadvantages I can think of:
In (1), the producer is very simple; the consumer is harder to write.
In (2), the producer needs to do a bit more work, but the consumer is simple.
In (3), you are adding another layer, but in a fairly reusable way.
I would probably go with either (2) (if I only need this for one data source) or with (3) after checking that it does not affect the performance.
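For completeness, approach (1) can be sketched by draining the inbox whenever a message arrives and keeping only the newest value. This assumes that TryReceive with a zero timeout returns None as soon as the inbox is empty; processValue is a placeholder for the consumer's actual work:

let consumer (processValue : int -> Async<unit>) =
  MailboxProcessor.Start(fun (inbox : MailboxProcessor<int>) ->
    // Greedily eat everything already queued, remembering the last value
    let rec drain latest = async {
      let! next = inbox.TryReceive(0)
      match next with
      | Some v -> return! drain v
      | None -> return latest }
    let rec loop () = async {
      let! first = inbox.Receive()
      let! latest = drain first
      do! processValue latest
      return! loop () }
    loop ())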
As for (3), the sketch of what I was thinking would look something like this:
type KeepLastMessage<'T> =
  | Update of 'T
  | Get of AsyncReplyChannel<'T>

type KeepLast<'T>(initial:'T, event:IObservable<'T>) =
  let agent = MailboxProcessor.Start(fun inbox ->
    // Keep the most recent value; reply with it when asked
    let rec loop last = async {
      let! msg = inbox.Receive()
      match msg with
      | Update last -> return! loop last
      | Get ch -> ch.Reply(last); return! loop last }
    loop initial)
  // Subscribe so that every value produced by the event updates the agent
  do event |> Observable.add (Update >> agent.Post)
  member x.AsyncGet() = agent.PostAndAsyncReply(Get)
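A hypothetical usage sketch (the producer below is an ordinary Event standing in for a real data source):

let producer = Event<int>()
let keepLast = KeepLast(0, producer.Publish)

// The producer may fire faster than anyone reads; AsyncGet always
// returns the most recent value, because the agent discarded the rest.
producer.Trigger 1
producer.Trigger 2
let latest = keepLast.AsyncGet() |> Async.RunSynchronously
printfn "Latest value: %d" latest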
I have a requirement where I need to cache the results of an event into an asyncSeq and iterate through this asyncSeq once a long-running function has returned. The idea here is to iterate through profiles and apply runFunction to both the cached results and the future results (hence the use of ofObservableBuffered).
I would like to know the best way to do this. I'm using AsyncSeq.cache as shown below. AsyncSeq.cache uses a combination of ResizeArray and MailboxProcessor to accomplish the caching. However, I'm not sure whether this will lead to a memory leak.
let profiles =
  client.ProfileReceived
  |> AsyncSeq.ofObservableBuffered
  |> AsyncSeq.cache

do! longRunningFunction()

do! function2 (fun _ ->
  async {
    do! profiles
        |> AsyncSeq.iterAsync (fun profile -> async {
             do! runFunction profile
             return () })
  })
Your question does not give all details about your scenario, but I think the answer is that you do not need AsyncSeq.cache and using just AsyncSeq.ofObservableBuffered should be enough.
Asynchronous sequences generate values on demand ("pull based") which means that elements are only generated when needed. Observables are "push based" which means that they generate data whenever the data source decides.
To map from "push based" to "pull based", you either need to drop data (when the listener is not ready to accept the next item) or cache data. If you cache, then you may potentially run out of memory if the producer is faster than the consumer - but this is inevitable by design problem. The AsyncSeq.ofObservableBuffered function does the latter.
AsyncSeq.cache is useful if you have one data source, but want to consume it from multiple different places. Without this, the data source will generate data repeatedly for each consumer, so cache enables generating the data just once. However, if you are using ofObservableBuffered is already doing the same thing - i.e. caching all the generated values.
I believe the only reason why you might want to keep cache is if you subscribe to the observable at a later time - in which case, it would keep elements generated before you started consuming values (and use two caches, one of which would keep not-yet-consumed elements (possibly safe if the consumer is fast enough) and one which would keep all the already generated elements (certainly potential to run out of memory over a longer period of time)).
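To illustrate when cache matters, here is a minimal sketch (source stands for some IObservable; the names are made up): with two consumers, cache ensures the observable is subscribed to only once and both consumers see the same elements.

let events =
  source
  |> AsyncSeq.ofObservableBuffered
  |> AsyncSeq.cache

// Both consumers iterate the same cached sequence; without cache, each
// iteration would create its own subscription and buffer.
let consumerA = events |> AsyncSeq.iter (printfn "A: %A")
let consumerB = events |> AsyncSeq.iter (printfn "B: %A")

Async.Parallel [ consumerA; consumerB ] |> Async.Ignore |> Async.Start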
I would like to design a process hierarchy where there is a parent process P which acts as a gatekeeper and delegates the work (messages/events from its client processes) to its children processes C1, C2, ..., Cn, which collaborate with each other and may send the result back to P. The children processes cannot talk to any process outside, only P.
The challenge is that though P may have multiple messages from its clients, it should accept only one message, delegate the work to C1..Cn, and ONLY accept the next message from its clients
when all its children processes are done (or idle) and there are no more messages circulating between C1 and Cn.
P finishes accepting messages from C1..Cn so that it can return the result to its clients.
Constraints:
Idle for me is when they are waiting with a receive (blocking) or have simply exited.
C1 to Cn are finite state machines. Some or all of them may send messages back to P. Or there may be no messages to be sent back to P. Even if no messages are sent back to P, P has to figure out that all of them are done with no messages between them.
If any of C1 to Cn have been pre-empted, then it is considered busy (this may be obvious, but I thought I'd put it here for completeness) and P will not receive the next message.
Is there an OTP pattern or library which will do this for me (before I hack something)? I know that process_info can tell me whether the mailbox of a process is empty, and I could keep checking the children's mailboxes from P, but that would be unnecessary polling from P.
EDIT GENERAL: I am trying to implement a reactive variant of Flow Based Programming on the Erlang platform. This has the notion of 'hierarchical processes' or composites, which themselves may contain composite processes until we reach some boxes of actual code... I am going to research this (looking at monitor, process_info, process_flag), but I wanted to respond to your excellent answers.
EDIT RECURSIVE PARENTS: Each of C1 and Cn can themselves be parent/composite processes. If I just spawn processes and let them exit immediately, I'll have to create the chain of composites every time, as C1..Cn may themselves be composites (which spawn composites... and so on). Finally, when we reach a leaf box (which is not a composite node), they are supposed to be finite state machines, so I'm not sure about spawning them and making them exit quickly if they are FSMs.
EDIT TKOWAL: Since I am trying to create a generic parent/composite process, it does not know 'when' the task ends. All it does is relay the messages it receives from its children to its siblings, with the 'constraint' that it will not accept the next message from its client/siblings until its children are 'done'. The children C1..Cn may send not just one but many messages. I understand from your proposal that wait_for_task_finish will stop blocking the moment it gets the first message. But more messages may be emitted by P's children, and P should wait for all of them. Also, having a task_end symbol will not work, for the same reason (i.e. multiple messages are possible from the children).
Given how inexpensive it is to start up Erlang processes, your gatekeeper could start new children for each incoming task, and then wait for them all to exit normally once they complete their work.
But in general, it sounds like you're looking for a process pool. There are a few of these already available, such as poolboy and sidejob. Pools can be harder to get right than you think, so I advise using an existing proven pool implementation before attempting to write your own.
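For illustration, a hypothetical poolboy call might look like this (the pool name and worker protocol are invented; see poolboy's documentation for the actual pool setup):

%% poolboy:transaction/2 checks a worker out of the pool, runs the fun
%% on it, and checks the worker back in afterwards.
run_task(Task) ->
    poolboy:transaction(my_pool, fun(Worker) ->
        gen_server:call(Worker, {work, Task})
    end).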
After the edits, this became an entirely different question, so I am posting a new answer.
If you are trying to write Flow Based Programming, then you are probably solving the wrong problem. FBP is effective because almost everything is asynchronous and you start processing the next request immediately after you have finished with the previous one.
So, the answer is - don't wait for children to finish:
In FBP, there is no time dependencies between the components. So if I
have a chunk of data, it should be able to flow from one end of the
diagram to the other regardless of how any other pieces of data are
being handled. In order to program an FBP system, you have to minimize
your dependencies.
source
When creating the parent and children, you know all the connections between the blocks, so just configure the children to send processed data directly to the next block. For example: P1 has children C1 and C2. You send a message to P1, it delegates it to C1, the packet flows a couple of times between C1 and C2, and after that C1 or C2 sends it directly to P2.
Blocks should be stateless. Their output should not depend on previous requests, so even if C1 and C2 are processing data from two different requests to P1, that is OK. There could be situations where P1 gets data packet D1 and then D2, but outputs the answers in a different order, R2 and then R1. That is also OK. You can use an Erlang reference to tag the messages and then check which response belongs to which request.
I don't think there is a ready-made library for that, but it is really easy to hack together, unless I missed something. Your P process should look like this:
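For example, a minimal sketch of tagging a request with a reference (the block and message shapes are made up for illustration):

%% Tag the request with a unique reference so the response can be
%% matched to it even if responses arrive out of order.
call_block(Block, Data) ->
    Ref = make_ref(),
    Block ! {task, Ref, Data},
    receive
        {result, Ref, Result} -> Result
    end.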
ready_for_next_task() ->
    receive
        {task, Task, CallerPid} ->
            send_task_to_workers(Task)
    end,
    wait_for_task_finish(CallerPid).

wait_for_task_finish(CallerPid) ->
    receive
        {task_end, Response} ->
            CallerPid ! Response
    end,
    ready_for_next_task().
In wait_for_task_finish/1 there is only one clause in the receive, so it will not accept the next task until the current one is finished. If you are waiting for multiple responses from the workers, you can simply add a second clause to the receive with some partial response and call wait_for_task_finish/1 recursively.
It is always better to have some indicator that the processing has ended, because you have no guarantees on message delivery time. Without one, you could check that all processes are currently waiting for a message and conclude that they have finished processing, when actually one of them has not started yet, or one of them has sent a message to another and you caught them before the second one had it in its mailbox.
If the processes C1..Cn each have only part of the actual work and don't know about the overall progress, then the gatekeeper P should know how many parts there are, receive all of them one by one, and only then call ready_for_next_task/0.
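A sketch of that counting variant, assuming the gatekeeper knows up front how many parts (N) to expect:

%% Collect N partial responses, then reply to the caller and go back
%% to accepting the next task.
wait_for_task_finish(CallerPid, Parts, 0) ->
    CallerPid ! {done, lists:reverse(Parts)},
    ready_for_next_task();
wait_for_task_finish(CallerPid, Parts, N) ->
    receive
        {part_done, Part} ->
            wait_for_task_finish(CallerPid, [Part | Parts], N - 1)
    end.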
Elixir streams provide iterables, but I couldn't find any information on observables (Google was no help here). I'd greatly appreciate it if someone could point me to resources for the same.
You can combine Stream and Enum to write observable-style code. Here's an example of an echo server written in observable fashion:
IO.stream(:stdio, :line)
|> Stream.map(&String.upcase/1)
|> Enum.each(&IO.write(&1))
Basically, for each line you send to standard input, it will be converted to uppercase and then printed back to standard output. This is a simple example, but the point is that all you need to compose an observable is already available via Stream and Enum.
Streams in Elixir are abstractions over function composition. In the end, all you get is a function which, when called, will loop over the input stream and transform it.
In order to build stateful streams like the example in Twitter4J (buffering new twitter statuses during one second and dispatching them all in one list), you'll need building blocks that can have state. In Elixir, it is common to encapsulate state in processes.
The example might look like this:
tweetsPerSecond =
  twitterStream
  |> SS.buffer({1, :second})
  |> SS.map(&length(&1))

SS.subscribe(tweetsPerSecond, fn n -> IO.puts "Got #{n} tweets in the last second" end)
SS.subscribe(tweetsPerSecond, fn n -> IO.puts "Second subscriber" end)
SS is a new module we need to write to implement the observable functionality. The core idea (as far as I get it) is being able to subscribe to a stream without modifying it.
In order for this to work, the twitterStream itself should be a process emitting events for others to consume. You can't use Stream in this case because it has "blocking pull" semantics, i.e. you won't be able to interrupt waiting on the next element in a stream after some fixed amount of time has elapsed.
To achieve the equivalent functionality in Elixir, take a look at the GenEvent module. It provides the ability to emit and subscribe to events. There is no stream-like interface for it, though, as far as I'm aware.
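A minimal GenEvent sketch (the handler module and the event shape are made up for illustration):

defmodule TweetCounter do
  use GenEvent

  # Each {:tweet, _} event increments the counter kept as handler state.
  def handle_event({:tweet, _text}, count) do
    {:ok, count + 1}
  end

  # Let clients ask for the current count.
  def handle_call(:count, count) do
    {:ok, count, count}
  end
end

{:ok, manager} = GenEvent.start_link()
:ok = GenEvent.add_handler(manager, TweetCounter, 0)
GenEvent.notify(manager, {:tweet, "hello"})
GenEvent.notify(manager, {:tweet, "world"})
GenEvent.call(manager, TweetCounter, :count)  # => 2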
I have built a PoC of a Pub-Sub system where I followed a kind of "observable pattern": http://mendrugory.weebly.com/blog/pub-sub-system-in-elixir.
In order to keep the state (which processes have to be informed), I used an Agent.
HOW I WISH I HAD PHRASED MY QUESTION TO BEGIN WITH
Take a table with 26 keys, a-z and let them have integer values.
Create a process, Ouch, that does two things over and over again:
In one transaction, write random values for a, b, and c such that those values always sum to 10
In another transaction, read the values for a, b and c and complain if their values do not sum to 10
If you spin up even a few of these processes, you will see that very quickly a, b and c are in a state where their values do not sum to 10. I believe there is no way to ask mnesia to "lock these 3 records before starting the writes (or reads)"; one can only have mnesia lock the records as it gets to them (so to speak), which allows the set of records' values to violate my "must sum to 10" constraint.
If I am right, solutions to this problem include:
lock the entire table before writing (or reading) the set of 3 records -- I hate to lock the whole table for 3 records;
create a process that keeps track of who is reading or writing which keys and protects bulk operations from anyone else writing or reading until the operation is completed. Of course I would have to make sure that all processes made use of this... crap, I guess this means writing my own AccessMod as the fourth parameter to activity/4, which seems like a non-trivial exercise;
Some other thing that I am not smart enough to figure out.
thoughts?
Ok, I'm an ambitious Erlang newbie, so sorry if this is a dumb question, but
I am building an application-specific, in-memory distributed cache, and I need to be able to write sets of Key, Value pairs in one transaction and also retrieve sets of values in one transaction. In other words, I need to:
1) Write 40 key,value pairs into the cache and ensure that no one else can read or write any of these 40 keys during this multi-key write operation; and,
2) Read 40 keys in one operation and get back 40 values, knowing that all 40 values have been unchanged from the moment this read operation started until it ended.
The only way I can think of doing this is to lock the entire table at the beginning of fetch_keylist([ListOfKeys]) or at the beginning of write_keylist([KeyValuePairs]), but I don't want to do this because I have many processes simultaneously doing their own multi-key reads and writes and I don't want to lock the entire table any time a process needs to read/write a relatively small subset of records.
Help?
Trying to be more clear: I do not think this is just about using vanilla transactions.
I think I am asking a more subtle question than this. Imagine that I have a process that, within a transaction, iterates through 10 records, locking them as it goes. Now imagine this process starts, but before it iterates to the 3rd record ANOTHER process updates the 3rd record. This will be just fine as far as transactions go, because the first process hadn't locked the 3rd record (yet) and the OTHER process modified it and released it before the first process got to it. What I want is a guarantee that once my first process starts, no other process can touch the 10 records until the first process is done with them.
PROBLEM SOLVED - I'M AN IDIOT... I guess...
Thank you all for your patience, especially Hynek -Pichi- Vychodil!
I prepared my test code to show the problem, and I could in fact reproduce the problem. I then simplified the code for readability, and the problem went away. I was not able to reproduce the problem again. This is both embarrassing and mysterious to me, since I had this problem for days. Also, mnesia never complained that I was executing operations outside of a transaction, and I have no dirty operations anywhere in my code; I have no idea how I was able to introduce this bug!
I have pounded the notion of Isolation into my head and will not doubt that it exists again.
Thanks for the education.
Actually, turns out the problem was using try/catch around mnesia operations within a transaction. See here for more.
A mnesia transaction will do exactly this for you; that is what transactions are for, unless you use dirty operations. Just place your write and read operations in one transaction and mnesia will do the rest. All operations in one transaction are performed as a single atomic operation. Mnesia's transaction isolation level is what is sometimes known as "serializable", i.e. the strongest isolation level.
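For example, a minimal sketch of the multi-key operations from the question (the table name and record shape are assumptions):

%% Write all pairs atomically; no other transaction can observe a
%% partially applied write.
write_keylist(Pairs) ->
    F = fun() ->
            [ok = mnesia:write({cache, K, V}) || {K, V} <- Pairs],
            ok
        end,
    {atomic, ok} = mnesia:transaction(F).

%% Read all keys in one isolated snapshot.
fetch_keylist(Keys) ->
    F = fun() ->
            [V || K <- Keys, {cache, _, V} <- mnesia:read(cache, K)]
        end,
    {atomic, Values} = mnesia:transaction(F),
    Values.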
Edit:
It seems you missed one important point about concurrent processes in Erlang. (To be fair, it is not only true in Erlang but in any truly concurrent environment; when someone argues otherwise, it is not a truly concurrent environment.) You can't distinguish which action happened first and which happened second unless you do some synchronization, and the only way you can do this synchronization is with message passing. You have only one guarantee about messages in Erlang: the ordering of messages sent from one process to another. That means when you send two messages M1 and M2 from process A to process B, they arrive in the same order. But if you send message M1 from A to B and message M2 from C to B, they can arrive in any order - because how could you even tell which message was sent first? It is even worse if you send M1 from A to B, then M2 from A to C, and when M2 arrives at C it sends M3 from C to B: you have no guarantee that M1 arrives at B before M3. (It will happen that way in one VM with the current implementation, but you can't rely on it, because it is not guaranteed and could change even in the next version of the VM, simply due to the message passing implementation between different schedulers.)
This illustrates the problem of event ordering in concurrent processes. Now back to the mnesia transaction. A mnesia transaction has to be a side-effect-free fun. This means there may not be any message sending out of the transaction, so you can't tell which transaction starts first and when it starts. The only thing you can tell is whether a transaction succeeded, and its ordering you can determine only by its effects. When you consider this, your subtle clarification makes no sense. One transaction will read all keys as an atomic operation, even if the implementation reads one key at a time, and your write operation will also be performed as an atomic operation. You can't tell whether the write to the 4th key in the second transaction happened after the read of the 1st key in the first transaction, because it is not observable from the outside. Both transactions are performed, in some particular order, as separate atomic operations. From the outside point of view, all keys are read at the same point in time, and it is mnesia's job to enforce that. If you send a message from inside a transaction you violate the mnesia transaction property, and you can't be surprised if it behaves strangely. To be concrete, such a message can be sent many times, because a transaction can be retried.
Edit2:
If you spin-up even a few of these processes you will see that very
quickly a, b and c are in a state where their values do not sum to 10.
I'm curious why you think this would happen - have you tested it? Show me your test case and I will show you mine:
-module(transactions).

-export([start/2, sum/0, write/0]).

start(W, R) ->
    mnesia:start(),
    {atomic, ok} = mnesia:create_table(test, [{ram_copies, [node()]}]),
    F = fun() ->
            ok = mnesia:write({test, a, 10}),
            [ok = mnesia:write({test, X, 0}) ||
                X <- [b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z]],
            ok
        end,
    {atomic, ok} = mnesia:transaction(F),
    F2 = fun() ->
            S = self(),
            erlang:send_after(1000, S, show),
            [spawn_link(fun() -> writer(S) end) || _ <- lists:seq(1, W)],
            [spawn_link(fun() -> reader(S) end) || _ <- lists:seq(1, R)],
            collect(0, 0)
        end,
    spawn(F2).

collect(R, W) ->
    receive
        read -> collect(R + 1, W);
        write -> collect(R, W + 1);
        show ->
            erlang:send_after(1000, self(), show),
            io:format("R: ~p, W: ~p~n", [R, W]),
            collect(R, W)
    end.

keys() ->
    element(random:uniform(6),
            {[a,b,c], [a,c,b], [b,a,c], [b,c,a], [c,a,b], [c,b,a]}).

sum() ->
    F = fun() ->
            lists:sum([X || K <- keys(), {test, _, X} <- mnesia:read(test, K)])
        end,
    {atomic, S} = mnesia:transaction(F),
    S.

write() ->
    F = fun() ->
            [A, B] = L = [random:uniform(10) || _ <- [1, 2]],
            [ok = mnesia:write({test, K, V}) ||
                {K, V} <- lists:zip(keys(), [10 - A - B | L])],
            ok
        end,
    {atomic, ok} = mnesia:transaction(F),
    ok.

reader(P) ->
    case sum() of
        10 ->
            P ! read,
            reader(P);
        _ ->
            io:format("ERROR!!!~n", []),
            exit(error)
    end.

writer(P) ->
    ok = write(),
    P ! write,
    writer(P).
If this did not work, it would be a really serious problem. There are serious applications, including payment systems, which rely on it. If you have a test case which shows it is broken, please report a bug at erlang-bugs@erlang.org
Have you tried mnesia events? You can have the reader subscribe to mnesia's table events, especially write events, so as not to interrupt the process doing the writing. In this way, mnesia just keeps sending a copy of what has been written in real time to the other process, which checks what the values are at any one time. Take a look at this:
subscriber() ->
    mnesia:subscribe({table, YOUR_TABLE_NAME, simple}),
    %% OR mnesia:subscribe({table, YOUR_TABLE_NAME, detailed}),
    wait_events().

wait_events() ->
    receive
        %% For simple events
        {mnesia_table_event, {write, NewRecord, ActivityId}} ->
            %% Analyse the written record as you wish
            wait_events();
        %% For detailed events
        {mnesia_table_event, {write, YOUR_TABLE, NewRecord, [OldRecords], ActivityId}} ->
            %% Analyse the written record as you wish
            wait_events();
        _Any -> wait_events()
    end.
Now you spawn your analyser as a process like this:
spawn(?MODULE,subscriber,[]).
This lets the whole thing run without any process being blocked; mnesia need not lock any table or record, because now what you have is a writer process and an analyser process. The whole thing will run in real time. Remember that there are many other events that you can make use of, if you wish, by pattern matching on them in the subscriber's wait_events() receive body.
It's possible to build a heavy-duty gen_server or a complete application intended for the reception and analysis of all your mnesia events. It's usually better to have one capable subscriber than many failing event subscribers. If I have understood your question well, this non-blocking solution fits your requirements.
mnesia:read/3 with write locks seems to be sufficient.
Mnesia's transactions are implemented with read-write locks, and the locks are well-formed (held until the end of the transaction), so the isolation level is serializable.
The granularity of locking is per record, as long as you access records by primary key.
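A minimal sketch, assuming a table named test holding {test, Key, Value} records: taking write locks up front on every key in the set keeps any other transaction from touching those records until this one commits.

write_consistent(Pairs) ->
    F = fun() ->
            %% mnesia:read(Tab, Key, write) acquires a write lock on the
            %% record immediately, before anything is written.
            [mnesia:read(test, K, write) || {K, _} <- Pairs],
            [ok = mnesia:write({test, K, V}) || {K, V} <- Pairs],
            ok
        end,
    {atomic, ok} = mnesia:transaction(F).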
Are messages processed on a first-come-first-served basis, or are they sorted by timestamp or something like that?
The order of messages is preserved between one process and another. Reading from the FAQ:
10.9 Is the order of message reception guaranteed?
Yes, but only within one process.
If there is a live process and you send it message A and then message
B, it's guaranteed that if message B arrived, message A arrived before
it.
On the other hand, imagine processes P, Q and R. P sends message A to
Q, and then message B to R. There is no guarantee that A arrives
before B. (Distributed Erlang would have a pretty tough time if this
was required!)
#knutin is right regarding how you can consume messages within a process. As an addition, note that you might use two subsequent receive statements to ensure that a certain message is consumed after another one:
receive
    first ->
        do_first()
end,
receive
    second ->
        do_second()
end
The receive statement is blocking, so this will ensure that you never do_second() before you do_first(). The difference from #knutin's second solution is that there, if an unimportant message arrives just before an important one, the important one is still handled first.
The mailbox is always kept in the order the messages arrived.
However, the order the messages are consumed is determined by your code.
If you have a plain process with a generic receive clause that receives anything, the order in which you get messages is the same as the order in which they arrived.
loop() ->
    receive
        Any ->
            do_something(Any),
            loop()
    end.
However, if you have a selective receive with match clauses, it will search the mailbox for messages of the specific type and consume the first matching message, effectively skipping non-matching messages. In the following example, if there are messages tagged as important in the queue, they will be processed before any other message. Note: matching like this searches all messages in the queue, which is a problem when there are many messages. There have been some developments in this area, but I'm not up to speed.
loop() ->
    receive
        {important, Stuff} ->
            do_something_important(Stuff),
            loop();
        Any ->
            do_something(Any),
            loop()
    end.
To further refine the answer, I would like to point out that, as stated above, messages that don't pattern-match are skipped, but in reality they are simply put aside and then reintroduced, in order (i.e. before any message that arrived after the non-matching ones), for the next receive's pattern matching.
This problem shows at its worst when you have, for example, a gen_server behaviour module, because in that case, since the pattern-matching call scheme is always the same, messages not in scope are going to flood the message queue unless you define an (ugly and error-prone, IMHO) match-all clause like:
receive
    ... -> ...;
    ... -> ...;
    MatchAllPatterns -> ok
end.