When using a list and pipes, does it create multiple intermediate lists? If so, is this not very bad for garbage collection?
let mylist = [...]
let filterByPipes ls =
    somefilter ls       // is a list created here
    |> otherFilter      // and then again here
    |> filtersForDays   // and finally returned here
Yes. F# uses applicative (eager) evaluation order, which means that every value has to be fully evaluated before it can be passed as a parameter to a function; so yes, at every step a new list is created.
As to whether it's bad for garbage collection... "First measure then optimize" - the first rule of optimization.
In most circumstances, the amount of data is so minuscule that it doesn't really make any difference, and definitely not enough to outweigh the massive gain in code readability.
But if your measurements do reveal a bottleneck in this particular place, and it turns out to be significant for the end goal, then yes, you'll have to manually fuse the processing chain, or perhaps employ a more efficient data structure. It all depends on what the optimization goal is.
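For illustration, manually fusing the chain could look something like the sketch below. The predicate names are made up, and it assumes each stage of the original pipeline is just a filter over the list:

// Hypothetical sketch: one traversal instead of three, so only the final list is allocated.
let filterByPipesFused ls =
    ls |> List.filter (fun x ->
        someFilterPred x && otherFilterPred x && filtersForDaysPred x)

// Alternatively, keep the pipeline shape but go through Seq, which is lazy,
// so no intermediate lists are materialised along the way.
let filterByPipesSeq ls =
    ls
    |> Seq.filter someFilterPred
    |> Seq.filter otherFilterPred
    |> Seq.filter filtersForDaysPred
    |> List.ofSeq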
The setup is similar to this.
One agent (dataSource) is generating data, and a single agent (dataProcessor) is processing it. There is a lot more data being generated than dataProcessor can process, and I am not interested in processing all messages, just the latest piece of data.
One possible solution, proposed there by Jon Harrop, "is to greedily eat all messages in the inbox when one arrives and discard all but the most recent".
Another approach is not to listen for all messages, but rather for dataProcessor to PostAndReply dataSource to get the latest piece of data.
What are the pros and cons of these approaches?
This is an intriguing question and there are quite likely several possible perspectives. I think the most notable aspect is that the choice will affect how you design the API at the interface between the two components:
In "Consume all" approach, the producer has a very simple API where it triggers some event whenever a value is produced and your consumer will subscribe to it. This means that you could have other subscribers listening to updates from the producer and doing something else than your consumer from this question.
In "Call to get latest" approach, the producer will presumably need to be written so that it keeps the current state and discards old values. It will then provide blocking async API to get the latest value. It could still expose an event for other consumers though. The consumer will need to actively poll for changes (in a busy loop of some sorts).
You could also have a producer with an event as in "Consume all", but then create another component that listens to any given event, keeps the latest value and makes it available via a blocking async call to any other client.
Here are some advantages/disadvantages I can think of:
In (1) the producer is very simple; the consumer is harder to write
In (2) the producer needs to do a bit more work, but the consumer is simple
In (3), you are adding another layer, but in a fairly reusable way.
I would probably go with either (2) (if I only need this for one data source) or with (3) after checking that it does not affect the performance.
As for (3), the sketch of what I was thinking would look something like this:
open System

type KeepLastMessage<'T> =
  | Update of 'T
  | Get of AsyncReplyChannel<'T>

type KeepLast<'T>(initial:'T, event:IObservable<'T>) =
  let agent = MailboxProcessor.Start(fun inbox ->
    let rec loop last = async {
      let! msg = inbox.Receive()
      match msg with
      | Update last -> return! loop last
      | Get ch -> ch.Reply(last); return! loop last }
    loop initial)
  // Forward every value produced by the event to the agent, so it always holds the latest one
  do event |> Observable.add (Update >> agent.Post)
  member x.AsyncGet() = agent.PostAndAsyncReply(Get)
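A consumer could then poll this wrapper for the most recent value in a loop, along the lines of the sketch below; initialValue, dataSourceEvent and processData are stand-ins for the producer's event and the slow processing step from the question, not part of the type above.

// Hypothetical usage sketch of the KeepLast wrapper.
let keepLast = KeepLast(initialValue, dataSourceEvent)

let rec processLoop () = async {
    let! latest = keepLast.AsyncGet()   // always returns whatever the producer sent last
    do! processData latest              // slow work; values produced meanwhile are simply overwritten
    return! processLoop () }

processLoop () |> Async.Start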
I want to de-dupe a stream of data based on an ID in a windowed fashion. Each element in the stream we receive carries a set of IDs (a user-id, a session-id and a sequence-id), and we want to remove data with matching IDs within N-hour time windows. A straightforward approach is to use an external key store (BigTable or something similar) where we look up keys and write if required, but our qps is extremely large, making maintaining such a service pretty hard. The alternative approach I came up with was to groupBy within a time window so that all data for a user within a time window falls within the same group, and then, in each group, use a separate key-store service where we look up duplicates by the key. So, I have a few questions about this approach:
[1] If I run a groupBy transform, is there any guarantee that each group will be processed in the same slave? If guaranteed, we can group by the userid and then within each group compare the sessionid for each user
[2] If it is feasible, my next question is whether we can run other such services on each of the slave machines that run the job - in the example above, I would like to have a local Redis running which can then be used by each group to look up or write IDs.
The idea seems outside of what Dataflow is supposed to do, but I believe such use cases should be common - so if there is a better model to approach this problem, I am looking forward to that too. We essentially want to avoid external lookups as much as possible given the amount of data we have.
1) In the Dataflow model, there is no guarantee that the same machine will see all the groups across windows for the key. Imagine that a VM dies or new VMs are added and work is split across them for scaling.
2) You're welcome to run other services on the Dataflow VMs since they are general purpose, but note that you will have to contend with the resource requirements of the other applications on the host, potentially causing out of memory issues.
Note that you may want to take a look at RemoveDuplicates and use that if it fits your use case.
It also seems like you might want to be using session windows to dedupe elements. You would call:
PCollection<T> pc = ...;
PCollection<T> windowed_pc = pc.apply(
    Window.<T>into(Sessions.withGapDuration(Duration.standardHours(N))));
Each new element will keep extending the length of the window, so it won't close until the gap closes. If you also apply an AfterCount speculative trigger of 1 together with an AfterWatermark trigger on a downstream GroupByKey, the trigger will fire as soon as it can: once it has seen at least one element, and then once more when the session closes. After the GroupByKey you would have a DoFn that filters out any element which isn't an early firing, based upon the pane information ([3], [4]).
DoFn(T -> KV<session key, T>)
|
\|/
Window.into(Session window)
|
\|/
Group by key
|
\|/
DoFn(Filter based upon pane information)
It is sort of unclear from your description; can you provide more details?
Sorry for not being clear. I gave the setup you mentioned a try, except for the early and late firings part, and it is working on smaller samples. I have a couple of follow up questions, related to scaling this up. Also, I was hoping I could give you more information on what the exact scenario is.
So, we have an incoming data stream, each item of which can be uniquely identified by its fields. We also know that duplicates occur pretty far apart and, for now, we care about those within a 6 hour window. And regarding the volume of data, we have at least 100K events every second, which span across a million different users - so within this 6 hour window, we could get a few billion events into the pipeline.
Given this background, my questions are
[1] For the sessioning to happen by key, I should run it on something like
PCollection<KV<key, T>> windowed_pc = pc.apply(
    Window.<KV<key, T>>into(Sessions.withGapDuration(Duration.standardHours(6))));
where key is a combination of the 3 ids I had mentioned earlier. Based on the definition of Sessions, only if I run it on this KV would I be able to manage sessions per key. This would mean that Dataflow would have too many open sessions at any given time waiting for them to close, and I was worried whether it would scale or whether I would run into any bottlenecks.
[2] Once I perform sessioning as above, I have already removed the duplicates based on the firings, since I only care about the first firing in each session, which already removes the duplicates. I no longer need the RemoveDuplicates transform, which I found is a combination of (WithKeys, Combine.PerKey, Values) transforms in order, essentially performing the same operation. Is this the right assumption to make?
[3] If the solution in [1] is going to be a problem, the alternative is to reduce the key for sessioning to just user-id and session-id, ignoring the sequence-id, and then run a RemoveDuplicates on top of each resulting window by sequence-id. This might reduce the number of open sessions, but still would leave a lot of open sessions (#users * #sessions per user), which can easily run into millions. FWIW, I don't think we can session only by user-id, since then the session might never close, as different sessions for the same user could keep coming in, and determining the session gap in this scenario also becomes infeasible.
Hope my problem is a little clearer this time. Please let me know if any of my approaches make the best use of Dataflow or if I am missing something.
Thanks
I tried out this solution at a larger scale and as long as I provide sufficient workers and disks, the pipeline scales well although I am seeing a different problem now.
After this sessionization, I run a Combine.perKey on the key and then perform a ParDo which looks into c.pane().getTiming() and only rejects anything other than an EARLY firing. I tried counting both EARLY and ONTIME firings in this ParDo and it looks like the ontime-panes are actually deduped more precisely than the early ones. I mean, the #early-firings still has some duplicates whereas the #ontime-firings is less than that and has more duplicates removed. Is there any reason this could happen? Also, is my approach towards deduping using a Combine+ParDo the right one or could I do something better?
events.apply(
    WithKeys.<String, EventInfo>of(new SerializableFunction<EventInfo, String>() {
      @Override
      public java.lang.String apply(EventInfo input) {
        return input.getUniqueKey();
      }
    })
)
.apply(
    Window.named("sessioner").<KV<String, EventInfo>>into(
        Sessions.withGapDuration(mSessionGap)
    )
    .triggering(
        AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterPane.elementCountAtLeast(1))
    )
    .withAllowedLateness(Duration.ZERO)
    .accumulatingFiredPanes()
);
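For reference, the downstream step described earlier (a GroupByKey followed by a DoFn that keeps only the early pane) might look roughly like the sketch below. This is Dataflow SDK 1.x style; windowedEvents stands for the result of the snippet above, the anonymous DoFn is illustrative rather than an SDK class, and a Combine.perKey-based variant would filter on the pane timing the same way.

// Group the sessioned elements per key, then keep only the EARLY firing of each pane;
// the ON_TIME firing of the same session is dropped as a duplicate.
windowedEvents
    .apply(GroupByKey.<String, EventInfo>create())
    .apply(ParDo.of(new DoFn<KV<String, Iterable<EventInfo>>, EventInfo>() {
      @Override
      public void processElement(ProcessContext c) {
        if (c.pane().getTiming() == PaneInfo.Timing.EARLY) {
          // Emit only the elements seen by the first (speculative) firing.
          for (EventInfo event : c.element().getValue()) {
            c.output(event);
          }
        }
      }
    }));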
I have a general objective-c pattern/practice question relative to a problem I'm trying to solve with my app. I could not find a similar objective-c focused question/answer here, yet.
My app holds a mutable array of objects which I call "Records". The app gathers records and puts them into that array in one of two ways:
It reads data from a SQLite database available locally within the App's sand box. The read is usually very fast.
It requests data asynchronously from a web service, waits for it to finish then parses the data. The read can be fast, but often it is not.
Sometimes the app reads from the database (1) and requests data from the web service (2) at essentially the same time. It is often the case that (1) will finish before (2) finishes and adding Records to the mutable array does not cause a conflict.
I am worried that at some point my SQLite read process will take a bit longer than expected and it will try to add objects to the mutable array at the exact same time the async request finishes and does the same; or vice-versa. These are edge cases that seem difficult to test for but that surely would make my app crash or at the very least cause issues with my array of records.
I should also point out that the Records are to be merged into the mutable array. For example: if (1) runs first and returns 10 records, then shortly after (2) finishes and returns 5 records, my mutable array will contain all 15 records. I'm combining the data rather than overwriting it.
What I want to know is:
Is it safe for me to add objects to the same mutable array instance when the processes, either (1) or (2) finish?
Is there a good pattern/practice to implement for this sort of processing in objective-c?
Does this involve locking access to the mutable array so when (1) is adding objects to it (2) can't add any objects until (1) is done with it?
I appreciate any info you could share.
[EDIT #1]
For posterity, I found this URL to be a great help in understanding how to use NSOperations and an NSOperationQueue. It is a bit out of date, but works nonetheless:
http://www.raywenderlich.com/19788/how-to-use-nsoperations-and-nsoperationqueues
Also, it doesn't talk specifically about the problem I'm trying to solve, but the example it uses is practical and easy to understand.
[EDIT #2]
I've decided to go with the approach suggested by danh, where I'll read locally and, as needed, hit my web service after the local read finishes (which should be fast anyway). That said, I'm going to try and avoid synchronization issues altogether. Why? Because Apple says so, here:
http://developer.apple.com/library/IOS/#documentation/Cocoa/Conceptual/Multithreading/ThreadSafety/ThreadSafety.html#//apple_ref/doc/uid/10000057i-CH8-SW8
Avoid Synchronization Altogether
For any new projects you work on, and even for existing projects, designing your code and data structures to avoid the need for synchronization is the best possible solution. Although locks and other synchronization tools are useful, they do impact the performance of any application. And if the overall design causes high contention among specific resources, your threads could be waiting even longer.
The best way to implement concurrency is to reduce the interactions and inter-dependencies between your concurrent tasks. If each task operates on its own private data set, it does not need to protect that data using locks. Even in situations where two tasks do share a common data set, you can look at ways of partitioning that set or providing each task with its own copy. Of course, copying data sets has its costs too, so you have to weigh those costs against the costs of synchronization before making your decision.
Is it safe for me to add objects to the same mutable array instance when the processes, either (1) or (2) finish?
Absolutely not. NSArray, along with the rest of the collection classes, is not synchronized. You can use them in conjunction with some kind of lock when you add and remove objects, but that's definitely a lot slower than just making two arrays (one for each operation) and merging them when they both finish.
Is there a good pattern/practice to implement for this sort of processing in objective-c?
Unfortunately, no. The most you can come up with is tripping a Boolean, or incrementing an integer to a certain number in a common callback. To see what I mean, here's a little pseudo-code:
- (void)someAsyncOpDidFinish:(NSSomeOperation*)op {
    finishedOperations++;
    if (finishedOperations == 2) {
        finishedOperations = 0;
        // Both are finished, process
    }
}
Does this involve locking access to the mutable array so when (1) is adding objects to it (2) can't add any objects until (1) is done with it?
Yes, see above.
You should either lock around your array modifications, or schedule your modifications in the main thread. The SQL fetch is probably running in the main thread, so in your remote fetch code you could do something like:
dispatch_async(dispatch_get_main_queue(), ^{
    [myArray addObject:newThing];
});
If you are adding a bunch of objects this will be slow since it is putting a new task on the scheduler for each record. You can bunch the records in a separate array in the thread and add the temp array using addObjectsFromArray: if that is the case.
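A minimal sketch of that batching idea, assuming the remote fetch runs on a background thread and myArray is only ever touched on the main thread (parsedRecords is a stand-in for whatever the web response parsed into):

// Build the batch on the background thread, then hand it to the main queue in one go.
NSMutableArray *batch = [NSMutableArray array];
for (id record in parsedRecords) {
    [batch addObject:record];
}
dispatch_async(dispatch_get_main_queue(), ^{
    [myArray addObjectsFromArray:batch];
});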
Personally, I'd be inclined to have a concurrent NSOperationQueue and add the two retrieval operations to it, one for the database operation and one for the network operation. I would then have a dedicated serial queue for adding the records to the NSMutableArray, which each of the two concurrent retrieval operations would use to add records to the mutable array. That way you have one queue for adding records, fed from the two retrieval operations running on the other, concurrent queue. If you need to know when the two concurrent retrieval operations are done, I'd add a third operation to that concurrent queue and set its dependencies to be the two retrieval operations; it would then fire automatically when the two retrieval operations are done.
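A rough sketch of that arrangement, with all names illustrative and the actual retrieval work stubbed out:

NSOperationQueue *retrievalQueue = [[NSOperationQueue alloc] init];   // concurrent by default
NSOperationQueue *recordQueue = [[NSOperationQueue alloc] init];
recordQueue.maxConcurrentOperationCount = 1;                          // serial: the only place the array is touched

NSMutableArray *records = [NSMutableArray array];

NSBlockOperation *databaseOp = [NSBlockOperation blockOperationWithBlock:^{
    NSArray *dbRecords = @[]; // stand-in for the SQLite read
    // Wait for the merge, so this operation isn't "finished" until its records are in the array.
    [recordQueue addOperations:@[[NSBlockOperation blockOperationWithBlock:^{
        [records addObjectsFromArray:dbRecords];
    }]] waitUntilFinished:YES];
}];

NSBlockOperation *networkOp = [NSBlockOperation blockOperationWithBlock:^{
    NSArray *webRecords = @[]; // stand-in for the web-service fetch and parse
    [recordQueue addOperations:@[[NSBlockOperation blockOperationWithBlock:^{
        [records addObjectsFromArray:webRecords];
    }]] waitUntilFinished:YES];
}];

NSBlockOperation *completionOp = [NSBlockOperation blockOperationWithBlock:^{
    // Both retrievals (and their merges) are done; records now holds everything.
}];
[completionOp addDependency:databaseOp];
[completionOp addDependency:networkOp];

[retrievalQueue addOperations:@[databaseOp, networkOp, completionOp] waitUntilFinished:NO];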
In addition to the good suggestions above, consider not launching the GET and the sql concurrently.
[self doTheLocalLookupThen:^{
    // update the array and ui
    [self doTheServerGetThen:^{
        // update the array and ui
    }];
}];

- (void)doTheLocalLookupThen:(void (^)(void))completion {
    if ([self skipTheLocalLookup]) return completion();
    // do the local lookup, invoke completion
}

- (void)doTheServerGetThen:(void (^)(void))completion {
    if ([self skipTheServerGet]) return completion();
    // do the server get, invoke completion
}
HOW I WISH I HAD PHRASED MY QUESTION TO BEGIN WITH
Take a table with 26 keys, a-z and let them have integer values.
Create a process, Ouch, that does two things over and over again
In one transaction, write random values for a, b, and c such that those values always sum to 10
In another transaction, read the values for a, b and c and complain if their values do not sum to 10
If you spin up even a few of these processes, you will see that very quickly a, b and c are in a state where their values do not sum to 10. I believe there is no way to ask mnesia to "lock these 3 records before starting the writes (or reads)"; one can only have mnesia lock the records as it gets to them (so to speak), which allows the set of records' values to violate my "must sum to 10" constraint.
If I am right, solutions to this problem include
lock the entire table before writing (or reading) the set of 3 records -- I hate to lock the whole table for 3 records,
Create a process that keeps track of who is reading or writing which keys and protects bulk operations from anyone else writing or reading until the operation is completed. Of course I would have to make sure that all processes made use of this... crap, I guess this means writing my own AccessMod as the fourth parameter to activity/4 which seems like a non-trivial exercise
Some other thing that I am not smart enough to figure out.
thoughts?
Ok, I'm an ambitious Erlang newbie, so sorry if this is a dumb question, but
I am building an application-specific, in-memory distributed cache and I need to be able to write sets of Key, Value pairs in one transaction and also retrieve sets of values in one transaction. In other words I need to
1) Write 40 key,value pairs into the cache and ensure that no one else can read or write any of these 40 keys during this multi-key write operation; and,
2) Read 40 keys in one operation and get back 40 values knowing that all 40 values have been unchanged from the moment that this read operation started until it ended.
The only way I can think of doing this is to lock the entire table at the beginning of the fetch_keylist([ListOfKeys]) or at the beginning of the write_keylist([KeyValuePairs]), but I don't want to do this because I have many processes simultaneously doing their own multi-key reads and writes and I don't want to lock the entire table any time any process needs to read/write a relatively small subset of records.
Help?
Trying to be more clear: I do not think this is just about using vanilla transactions
I think I am asking a more subtle question than this. Imagine that I have a process that, within a transaction, iterates through 10 records, locking them as it goes. Now imagine this process starts but before it iterates to the 3rd record ANOTHER process updates the 3rd record. This will be just fine as far as transactions go because the first process hadn't locked the 3rd record (yet) and the OTHER process modified it and released it before the first process got to it. What I want is to be guaranteed that once my first process starts that no other process can touch the 10 records until the first process is done with them.
PROBLEM SOLVED - I'M AN IDIOT... I guess...
Thank you all for your patience, especially Hynek -Pichi- Vychodil!
I prepared my test code to show the problem, and I could in fact reproduce the problem. I then simplified the code for readability and the problem went away. I was not able to reproduce the problem again. This is both embarrassing and mysterious to me since I had this problem for days. Also, mnesia never complained that I was executing operations outside of a transaction, and I have no dirty operations anywhere in my code; I have no idea how I was able to introduce this bug into my code!
I have pounded the notion of Isolation into my head and will not doubt that it exists again.
Thanks for the education.
Actually, turns out the problem was using try/catch around mnesia operations within a transaction. See here for more.
A mnesia transaction will do exactly this for you. That is what transactions are for, unless you do dirty operations. So just place your write and read operations in one transaction and mnesia will do the rest. All operations in one transaction are performed as one atomic operation. Mnesia's transaction isolation level is what is sometimes known as "serializable", i.e. the strongest isolation level.
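As a minimal sketch of what that looks like for the multi-key operations from the question, using the write_keylist/fetch_keylist names from above (the table name cache and the {cache, Key, Value} record shape are assumptions):

%% Write all pairs in one atomic transaction.
write_keylist(KeyValuePairs) ->
    F = fun() ->
            [ ok = mnesia:write({cache, K, V}) || {K, V} <- KeyValuePairs ],
            ok
        end,
    {atomic, ok} = mnesia:transaction(F),
    ok.

%% Read all keys in one atomic transaction; the values are consistent as of one point in time.
fetch_keylist(Keys) ->
    F = fun() ->
            [ V || K <- Keys, {cache, _, V} <- mnesia:read(cache, K) ]
        end,
    {atomic, Values} = mnesia:transaction(F),
    Values.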
Edit:
It seems you missed one important point about concurrent processes in Erlang. (To be fair, it is not only true in Erlang but in any truly concurrent environment; when someone argues otherwise, it is not really a concurrent environment.) You can't distinguish which action happens first and which happens second unless you do some synchronization. The only way you can do this synchronization is by using message passing. You have only one guarantee about messages in Erlang: the ordering of messages sent from one process to another process. It means that when you send two messages M1 and M2 from process A to process B, they arrive in the same order. But if you send message M1 from A to B and message M2 from C to B, they can arrive in any order. Simply because, how could you even tell which message was sent first? It is even worse if you send message M1 from A to B, then M2 from A to C, and when M2 arrives at C, you send M3 from C to B: you have no guarantee that M1 arrives at B before M3. Even though it will happen that way in one VM in the current implementation, you can't rely on it, because it is not guaranteed and can change even in the next version of the VM, just due to the message passing implementation between different schedulers.
This illustrates the problems of event ordering in concurrent processes. Now back to the mnesia transaction. A mnesia transaction has to be a side-effect-free fun. It means there may not be any message sending from inside the transaction. So you can't tell which transaction starts first and when it starts. The only thing you can tell is whether a transaction succeeded, and their order you can only determine by their effects. When you consider this, your subtle clarification makes no sense. One transaction will read all keys as an atomic operation, even if it is implemented as reading one key at a time inside the transaction implementation, and your write operation will also be performed as an atomic operation. You can't tell if the write to the 4th key in the second transaction happened after you read the 1st key in the first transaction, because it is not observable from outside. Both transactions will be performed in a particular order as separate atomic operations. From the outside point of view, all keys are read at the same point in time, and it is the work of mnesia to enforce this. If you send a message from inside a transaction, you violate the mnesia transaction property and you can't be surprised that it behaves strangely. To be concrete, such a message can be sent many times.
Edit2:
If you spin-up even a few of these processes you will see that very quickly a, b and c are in a state where their values do not sum to 10.
I'm curious why you think it would happen - or have you tested it? Show me your test case and I will show you mine:
-module(transactions).

-export([start/2, sum/0, write/0]).

start(W, R) ->
    mnesia:start(),
    {atomic, ok} = mnesia:create_table(test, [{ram_copies, [node()]}]),
    F = fun() ->
            ok = mnesia:write({test, a, 10}),
            [ ok = mnesia:write({test, X, 0}) ||
                X <- [b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z] ],
            ok
        end,
    {atomic, ok} = mnesia:transaction(F),
    F2 = fun() ->
            S = self(),
            erlang:send_after(1000, S, show),
            [ spawn_link(fun() -> writer(S) end) || _ <- lists:seq(1, W) ],
            [ spawn_link(fun() -> reader(S) end) || _ <- lists:seq(1, R) ],
            collect(0, 0)
        end,
    spawn(F2).

collect(R, W) ->
    receive
        read -> collect(R + 1, W);
        write -> collect(R, W + 1);
        show ->
            erlang:send_after(1000, self(), show),
            io:format("R: ~p, W: ~p~n", [R, W]),
            collect(R, W)
    end.

keys() ->
    element(random:uniform(6),
            {[a,b,c], [a,c,b], [b,a,c], [b,c,a], [c,a,b], [c,b,a]}).

sum() ->
    F = fun() ->
            lists:sum([X || K <- keys(), {test, _, X} <- mnesia:read(test, K)])
        end,
    {atomic, S} = mnesia:transaction(F),
    S.

write() ->
    F = fun() ->
            [A, B] = L = [ random:uniform(10) || _ <- [1, 2] ],
            [ ok = mnesia:write({test, K, V}) ||
                {K, V} <- lists:zip(keys(), [10 - A - B | L]) ],
            ok
        end,
    {atomic, ok} = mnesia:transaction(F),
    ok.

reader(P) ->
    case sum() of
        10 ->
            P ! read,
            reader(P);
        _ ->
            io:format("ERROR!!!~n", []),
            exit(error)
    end.

writer(P) ->
    ok = write(),
    P ! write,
    writer(P).
If it did not work, it would be a really serious problem. There are serious applications, including payment systems, which rely on it. If you have a test case which shows it is broken, please report a bug at erlang-bugs@erlang.org
Have you tried mnesia Events? You can have the reader subscribe to mnesia's Table Events, especially write events, so as not to interrupt the process doing the writing. In this way, mnesia just keeps sending a copy of what has been written in real time to the other process, which checks what the values are at any one time. Take a look at this:
subscriber() ->
    mnesia:subscribe({table, YOUR_TABLE_NAME, simple}),
    %% OR mnesia:subscribe({table, YOUR_TABLE_NAME, detailed}),
    wait_events().

wait_events() ->
    receive
        %% For simple events
        {mnesia_table_event, {write, NewRecord, ActivityId}} ->
            %% Analyse the written record as you wish
            wait_events();
        %% For detailed events
        {mnesia_table_event, {write, YOUR_TABLE, NewRecord, [OldRecords], ActivityId}} ->
            %% Analyse the written record as you wish
            wait_events();
        _Any -> wait_events()
    end.
Now you spawn your analyser as a process like this:
spawn(?MODULE,subscriber,[]).
This makes the whole thing run without any process being blocked; mnesia need not lock any table or record, because now what you have is a writer process and an analyser process. The whole thing will run in real time. Remember that there are many other events that you can make use of if you wish, by pattern matching them in the subscriber's wait_events() receive body.
It's possible to build a heavy-duty gen_server or a complete application intended for the reception and analysis of all your mnesia events. It's usually better to have one capable subscriber than many failing event subscribers. If I have understood your question well, this unblocking solution fits your requirements.
mnesia:read/3 with write locks seems to be sufficient.
Mnesia's transactions are implemented with read-write locks, and the locks are well-formed (held until the end of the transaction), so the isolation level is serializable.
The granularity of locking is per record, as long as you access records by primary key.
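As a minimal sketch, reusing the {test, Key, Value} table from the test module above: taking write locks on all three keys up front keeps any other transaction from touching them until this one commits.

sum_abc_locked() ->
    F = fun() ->
            %% mnesia:read(Tab, Key, write) acquires a write lock on each record.
            [{test, a, A}] = mnesia:read(test, a, write),
            [{test, b, B}] = mnesia:read(test, b, write),
            [{test, c, C}] = mnesia:read(test, c, write),
            A + B + C
        end,
    {atomic, Sum} = mnesia:transaction(F),
    Sum.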