Is multiple-producer, single-consumer possible in a lockfree setting?

I have a bunch of threads that are doing lots of communication with each other.
I would prefer this be lock free.
For each thread, I want to have a mailbox where other threads can send it messages (but only the owner can remove messages). This is a multiple-producer, single-consumer situation. Is it possible for me to do this in a lock-free / high-performance manner? (This is in the inner loop of a gigantic simulation.)

A lock-free multiple-producer single-consumer (MPSC) queue is one of the easiest lock-free algorithms to implement.
The most basic implementation requires a simple lock-free singly-linked list (SList) with only push() and flush(). These functions are available in the Windows API as InterlockedPushEntrySList() and InterlockedFlushSList(), but they are very easy to roll on your own.
Multiple producers push() items onto the SList using a CAS (interlocked compare-and-swap).
The single consumer does a flush(), which swaps the head of the SList with NULL using an XCHG (interlocked exchange). The consumer then has a list of items in reverse order.
To process the items in order, simply reverse the list returned from flush() before processing it. If you do not care about order, you can walk the list immediately.
Two notes if you roll your own functions:
1) If you are on a system with weak memory ordering (e.g., PowerPC), you need to put a "release" memory barrier at the beginning of the push() function and an "acquire" memory barrier at the end of the flush() function.
2) You can simplify and optimize the functions considerably, because the ABA problem with SLists occurs during pop(). You cannot have ABA issues with an SList if you use only push() and flush(). This means you can implement the list as a single head pointer, very similar to the non-lock-free code, with no need for an ABA-prevention sequence counter.
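To make this concrete, here is a minimal sketch of the push()/flush() scheme described above, written in Rust with standard atomics. The type and function names are mine, not part of any API, and a production version would also need a Drop impl to free unflushed nodes plus Send/Sync assertions for cross-thread use:

use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node<T> {
    value: T,
    next: *mut Node<T>,
}

pub struct MpscList<T> {
    head: AtomicPtr<Node<T>>,
}

impl<T> MpscList<T> {
    pub fn new() -> Self {
        MpscList { head: AtomicPtr::new(ptr::null_mut()) }
    }

    // Any producer: CAS the new node in as the list head.
    pub fn push(&self, value: T) {
        let node = Box::into_raw(Box::new(Node { value, next: ptr::null_mut() }));
        loop {
            let head = self.head.load(Ordering::Relaxed);
            unsafe { (*node).next = head }; // node is not yet visible to other threads
            // Release publishes the node's contents to the consumer.
            if self.head
                .compare_exchange_weak(head, node, Ordering::Release, Ordering::Relaxed)
                .is_ok()
            {
                return;
            }
        }
    }

    // The single consumer: swap the head with null (the XCHG), then
    // reverse the detached list to recover FIFO order.
    pub fn flush(&self) -> Vec<T> {
        // Acquire pairs with the Release in push().
        let mut cur = self.head.swap(ptr::null_mut(), Ordering::Acquire);
        let mut items = Vec::new();
        while !cur.is_null() {
            let node = unsafe { Box::from_raw(cur) };
            cur = node.next;
            items.push(node.value);
        }
        items.reverse(); // the detached list comes back newest-first
        items
    }
}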

Sure, if you have an atomic CompareAndSwap instruction:
/* Producer: claim a free slot. CompareAndSwap(addr, newValue, expected)
   atomically stores newValue if *addr == expected and returns the previous
   value, so a return value of false means we claimed the slot. */
for (i = 0; ; i = (i + 1) % MAILBOX_SIZE)
{
    if ((mailbox[i].owned == false) &&
        (CompareAndSwap(&mailbox[i].owned, true, false) == false))
        break;
}
mailbox[i].message = message;
mailbox[i].ready = true;   /* publish only after the message is written */
After reading a message, the consuming thread just sets mailbox[i].ready = false; mailbox[i].owned = false; (in that order).
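For completeness, here is a sketch of this slot-claiming mailbox with explicit atomics (Rust; the Slot layout, MAILBOX_SIZE, and the String payload are my placeholders), which makes the required memory orderings visible:

use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};

const MAILBOX_SIZE: usize = 64;

struct Slot {
    owned: AtomicBool,                   // slot claimed by a producer
    ready: AtomicBool,                   // message fully written
    message: UnsafeCell<Option<String>>, // payload (placeholder type)
}

// Safe only because all threads follow the owned/ready protocol below.
unsafe impl Sync for Slot {}

// Producer: claim a free slot, write the message, then publish it.
fn send(mailbox: &[Slot; MAILBOX_SIZE], message: String) {
    let mut i = 0;
    loop {
        let slot = &mailbox[i];
        if !slot.owned.load(Ordering::Relaxed)
            && slot.owned
                .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
                .is_ok()
        {
            unsafe { *slot.message.get() = Some(message) };
            // Release: a consumer that sees ready == true also sees the message.
            slot.ready.store(true, Ordering::Release);
            return;
        }
        i = (i + 1) % MAILBOX_SIZE;
    }
}

// Consumer: take a ready message, then free the slot (ready before owned).
fn receive(mailbox: &[Slot; MAILBOX_SIZE]) -> Option<String> {
    for slot in mailbox {
        if slot.ready.load(Ordering::Acquire) {
            let msg = unsafe { (*slot.message.get()).take() };
            slot.ready.store(false, Ordering::Relaxed);
            slot.owned.store(false, Ordering::Release); // slot reusable last
            return msg;
        }
    }
    None
}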

Here's a paper from the University of Rochester illustrating a non-blocking concurrent queue. The algorithm described in the paper shows one technique for making a lockless queue.

You may want to look at Intel Threading Building Blocks; I recall a lecture by an Intel developer that mentioned something along those lines.

Related

How to monitor Flux.onBackpressureBuffer() queue size

In my reactive application I have a hot Publisher with a slow Subscriber. To handle the lack of demand I am using onBackpressureBuffer, but possible overflow errors are kinda scary.
How can I monitor the number of elements present in the queue created by Flux.onBackpressureBuffer(maxSize)? Preferably with the built-in Reactor metrics() method. I am using Spring Boot + Micrometer if it makes any difference.
Although we didn't find an easy way to do this in Reactor, we found a slightly "hacky" one. Here it is: https://github.com/allegro/envoy-control/blob/master/envoy-control-core/src/main/kotlin/pl/allegro/tech/servicemesh/envoycontrol/utils/ReactorUtils.kt#L34
This function measures the buffer size of various Flux operators. It is not guaranteed to work on every operator, but it was tested on onBackpressureBuffer with positive results.
It is written in Kotlin, but it should be very easy to port it to Java.
The essence of this code, in the case of onBackpressureBuffer, is to cast the Subscription to a Scannable and then use the BUFFERED attribute:
flux
    .onBackpressureBuffer(maxSize)
    .doOnSubscribe { subscription ->
        // ...
        val queueSize = Scannable.from(subscription).scan(Scannable.Attr.BUFFERED)
        // ...
    }

Splitting a futures::Stream into multiple streams based on a property of the stream item

I have a Stream of items (u32, Bytes) where the integer is an index in the range 0..n. I would like to split this stream into n streams, essentially filtering by the integer.
I considered several possibilities, including
creating n streams each of which peeks at the underlying stream to determine if the next item is for it
pushing the items to one of n sinks when they arrive, and then using the other side of each sink as a stream again (this seems to be related to Forwarding from a futures::Stream to a futures::Sink).
I feel that neither of these possibilities is convincing. The first seems to create unnecessary overhead, and the second is just not elegant (if it even works; I am not sure).
What's a good way of splitting the stream?
At one point I had a similar requirement and wrote a group_by operator for Stream.
I haven't yet published this to crates.io as I didn't really feel it was ready for consumption but feel free to take a look at the code at https://github.com/Lukazoid/lz_stream_tools or attempt to use it for yourself.
Add the following to your Cargo.toml:
[dependencies]
lz_stream_tools = { git = "https://github.com/Lukazoid/lz_stream_tools" }
And extern crate lz_stream_tools; to your bin.rs/lib.rs.
Then from your code you may use it like so:
use lz_stream_tools::StreamTools;
let groups = some_stream.group_by(|x| x.0);
groups will now be a Stream of (u32, Stream<Item = Bytes>).
You could use channels to represent the index-specific streams. You'd have to spawn one task that pulls from the original stream and has a map of Senders, as in the sketch below.
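For instance, a sketch of that approach using the futures crate's unbounded channels (the demux name, the Vec<u8> payload standing in for Bytes, and the thread-based driver are all my own assumptions, not an established API):

use futures::channel::mpsc::{self, UnboundedReceiver};
use futures::{executor, StreamExt};

fn demux<S>(source: S, n: usize) -> Vec<UnboundedReceiver<Vec<u8>>>
where
    S: futures::Stream<Item = (u32, Vec<u8>)> + Unpin + Send + 'static,
{
    let (senders, receivers): (Vec<_>, Vec<_>) =
        (0..n).map(|_| mpsc::unbounded()).unzip();

    // One driver task pulls from the source and forwards each item to the
    // channel for its index. Here it runs on its own thread; in a real
    // application you would spawn this future on your executor instead.
    std::thread::spawn(move || {
        executor::block_on(async move {
            let mut source = source;
            while let Some((i, bytes)) = source.next().await {
                // A send error just means that receiver was dropped.
                let _ = senders[i as usize].unbounded_send(bytes);
            }
        });
    });

    receivers
}

Each UnboundedReceiver is itself a Stream, so consumers can use the returned values directly; a bounded channel would add backpressure at the cost of the driver awaiting each send.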

Mnesia: How to lock multiple rows simultaneously so that I can write/read a "consistent" set of records

HOW I WISH I HAD PHRASED MY QUESTION TO BEGIN WITH
Take a table with 26 keys, a-z and let them have integer values.
Create a process, Ouch, that does two things over and over again
In one transaction, write random values for a, b, and c such that those values always sum to 10
In another transaction, read the values for a, b and c and complain if their values do not sum to 10
If you spin up even a few of these processes, you will see that very quickly a, b, and c reach a state where their values do not sum to 10. I believe there is no way to ask mnesia to "lock these 3 records before starting the writes (or reads)"; one can only have mnesia lock the records as it gets to them (so to speak), which allows a set of records' values to violate my "must sum to 10" constraint.
If I am right, solutions to this problem include
lock the entire table before writing (or reading) the set of 3 records -- I hate to lock the whole table for 3 records,
Create a process that keeps track of who is reading or writing which keys and protects bulk operations from anyone else writing or reading until the operation is completed. Of course I would have to make sure that all processes made use of this... crap, I guess this means writing my own AccessMod as the fourth parameter to activity/4 which seems like a non-trivial exercise
Some other thing that I am not smart enough to figure out.
thoughts?
Ok, I'm an ambitious Erlang newbie, so sorry if this is a dumb question, but
I am building an application-specific, in-memory distributed cache and I need to be able to write sets of Key, Value pairs in one transaction and also retrieve sets of values in one transaction. In other words I need to
1) Write 40 key,value pairs into the cache and ensure that no one else can read or write any of these 40 keys during this multi-key write operation; and,
2) Read 40 keys in one operation and get back 40 values knowing that all 40 values have been unchanged from the moment that this read operation started until it ended.
The only way I can think of doing this is to lock the entire table at the beginning of fetch_keylist([ListOfKeys]) or at the beginning of write_keylist([KeyValuePairs]), but I don't want to do this because I have many processes simultaneously doing their own multi-key reads and writes, and I don't want to lock the entire table any time any process needs to read/write a relatively small subset of records.
Help?
Trying to be more clear: I do not think this is just about using vanilla transactions
I think I am asking a more subtle question than this. Imagine that I have a process that, within a transaction, iterates through 10 records, locking them as it goes. Now imagine this process starts but before it iterates to the 3rd record ANOTHER process updates the 3rd record. This will be just fine as far as transactions go because the first process hadn't locked the 3rd record (yet) and the OTHER process modified it and released it before the first process got to it. What I want is to be guaranteed that once my first process starts that no other process can touch the 10 records until the first process is done with them.
PROBLEM SOLVED - I'M AN IDIOT... I guess...
Thank you all for your patience, especially Hynek -Pichi- Vychodil!
I prepared my test code to show the problem, and I could in fact reproduce it. I then simplified the code for readability and the problem went away; I was not able to reproduce it again. This is both embarrassing and mysterious to me, since I had this problem for days. Also, mnesia never complained that I was executing operations outside of a transaction, and I have no dirty operations anywhere in my code. I have no idea how I introduced this bug!
I have pounded the notion of Isolation into my head and will not doubt that it exists again.
Thanks for the education.
Actually, it turns out the problem was using try/catch around mnesia operations within a transaction. See here for more.
A mnesia transaction will do exactly this for you; that is what transactions are for, unless you do dirty operations. So just place your write and read operations in one transaction and mnesia will do the rest. All operations in one transaction are performed as one atomic operation. Mnesia's transaction isolation level is what is sometimes known as "serializable", i.e. the strongest isolation level.
Edit:
It seems you missed one important point about concurrent processes in Erlang. (To be fair, this is not only true in Erlang but in any truly concurrent environment; if someone argues otherwise, it is not a really concurrent environment.) You can't tell which action happened first and which happened second unless you do some synchronization, and the only way to do this synchronization is message passing. Erlang guarantees only one thing about messages: the ordering of messages sent from one process to another. That is, when you send two messages M1 and M2 from process A to process B, they arrive in that order. But if you send message M1 from A to B and message M2 from C to B, they can arrive in either order; how could you even tell which one was sent first? It is even worse if you send M1 from A to B, then M2 from A to C, and when M2 arrives at C, C sends M3 to B: you have no guarantee that M1 arrives at B before M3. It may even hold on a single VM in the current implementation, but you can't rely on it, because it is not guaranteed and could change in the next version of the VM simply due to the message-passing implementation between different schedulers.
This illustrates the problem of event ordering in concurrent processes. Now back to the mnesia transaction. A mnesia transaction has to be a side-effect-free fun, which means it may not send any messages out of the transaction. So you can't tell which transaction starts first; you can only tell whether a transaction succeeded, and you can only determine the order of transactions by their effects. Once you consider this, your subtle clarification makes no sense. One transaction will read all keys as an atomic operation, even if the transaction implementation reads the keys one by one, and your write operation will also be performed as an atomic operation. You can't tell whether the write to the 4th key in the second transaction happened after the read of the 1st key in the first transaction, because it is not observable from outside. Both transactions are performed in some particular order as separate atomic operations. From the outside, all keys are read at the same point in time, and it is mnesia's job to enforce this. If you send a message from inside a transaction, you violate the mnesia transaction property, and you shouldn't be surprised that it behaves strangely; concretely, that message can be sent many times.
Edit2:
If you spin-up even a few of these processes you will see that very
quickly a, b and c are in a state where their values do not sum to 10.
I'm curious why you think this would happen. Have you tested it? Show me your test case and I will show you mine:
-module(transactions).
-export([start/2, sum/0, write/0]).

start(W, R) ->
    mnesia:start(),
    {atomic, ok} = mnesia:create_table(test, [{ram_copies,[node()]}]),
    F = fun() ->
            ok = mnesia:write({test, a, 10}),
            [ ok = mnesia:write({test, X, 0}) || X <-
                [b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z]],
            ok
        end,
    {atomic, ok} = mnesia:transaction(F),
    F2 = fun() ->
            S = self(),
            erlang:send_after(1000, S, show),
            [ spawn_link(fun() -> writer(S) end) || _ <- lists:seq(1,W) ],
            [ spawn_link(fun() -> reader(S) end) || _ <- lists:seq(1,R) ],
            collect(0,0)
        end,
    spawn(F2).

collect(R, W) ->
    receive
        read -> collect(R+1, W);
        write -> collect(R, W+1);
        show ->
            erlang:send_after(1000, self(), show),
            io:format("R: ~p, W: ~p~n", [R,W]),
            collect(R, W)
    end.

keys() ->
    element(random:uniform(6),
        {[a,b,c],[a,c,b],[b,a,c],[b,c,a],[c,a,b],[c,b,a]}).

sum() ->
    F = fun() ->
            lists:sum([X || K<-keys(), {test, _, X} <- mnesia:read(test, K)])
        end,
    {atomic, S} = mnesia:transaction(F),
    S.

write() ->
    F = fun() ->
            [A, B] = L = [ random:uniform(10) || _ <- [1,2] ],
            [ok = mnesia:write({test, K, V}) || {K, V} <- lists:zip(keys(),
                [10-A-B|L])],
            ok
        end,
    {atomic, ok} = mnesia:transaction(F),
    ok.

reader(P) ->
    case sum() of
        10 ->
            P ! read,
            reader(P);
        _ ->
            io:format("ERROR!!!~n",[]),
            exit(error)
    end.

writer(P) ->
    ok = write(),
    P ! write,
    writer(P).
If this did not work, it would be a really serious problem; there are serious applications, including payment systems, which rely on it. If you have a test case which shows it is broken, please report a bug at erlang-bugs@erlang.org
Have you tried mnesia events? You can have the reader subscribe to mnesia's table events, especially write events, so as not to interrupt the process doing the writing. In this way, mnesia just keeps sending a copy of what has been written in real time to the other process, which checks what the values are at any one time. Take a look at this:
subscriber() ->
    mnesia:subscribe({table, YOUR_TABLE_NAME, simple}),
    %% OR mnesia:subscribe({table, YOUR_TABLE_NAME, detailed}),
    wait_events().

wait_events() ->
    receive
        %% For simple events
        {mnesia_table_event, {write, NewRecord, ActivityId}} ->
            %% Analyse the written record as you wish
            wait_events();
        %% For detailed events
        {mnesia_table_event, {write, YOUR_TABLE, NewRecord, [OldRecords], ActivityId}} ->
            %% Analyse the written record as you wish
            wait_events();
        _Any -> wait_events()
    end.
Now you spawn your analyser as a process like this:
spawn(?MODULE,subscriber,[]).
This makes the whole thing run without any process being blocked; mnesia need not lock any table or record because now what you have is a writer process and an analyser process. The whole thing will run in real time. Remember that there are many other events you can make use of, if you wish, by pattern matching on them in the subscriber's wait_events() receive body.
It's possible to build a heavy-duty gen_server or complete application intended for reception and analysis of all your mnesia events. It's usually better to have one capable subscriber than many failing event subscribers. If I have understood your question correctly, this non-blocking solution fits your requirements.
mnesia:read/3 with write locks seems to be sufficient.
Mnesia's transactions are implemented with read-write locks, and the locks are well-formed (held until the end of the transaction), so the isolation level is serializable.
The locking granularity is per record, as long as you access records by primary key.

What are the advantages of the "apply" functions? When are they better to use than "for" loops, and when are they not? [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Is R's apply family more than syntactic sugar
Just what the title says. Stupid question, perhaps, but my understanding has been that when using an "apply" function, the iteration is performed in compiled code rather than in the R parser. This would seem to imply that lapply, for instance, is only faster than a "for" loop if there are a great many iterations and each operation is relatively simple. For instance, if a single call to a function wrapped up in lapply takes 10 seconds, and there are only, say, 12 iterations of it, I would imagine that there's virtually no difference at all between using "for" and "lapply".
Now that I think of it, if the function inside the "lapply" has to be parsed anyway, why should there be ANY performance benefit from using "lapply" instead of "for", unless you're doing something for which there are compiled functions (like summing or multiplying, etc.)?
Thanks in advance!
Josh
There are several reasons why one might prefer an apply family function over a for loop, or vice-versa.
First, for(), apply(), and sapply() will generally be just as quick as each other if executed correctly. lapply() does more of its operating in compiled code within the R internals than the others, so it can be faster than those functions. The speed advantage appears greatest when the act of "looping" over the data is a significant part of the compute time; in many general day-to-day uses you are unlikely to gain much from the inherently quicker lapply(). In the end, all of these will be calling R functions, so they need to be interpreted and then run.
for() loops can often be easier to implement, especially if you come from a programming background where loops are prevalent. Working in a loop may be more natural than forcing the iterative computation into one of the apply family of functions. However, to use for() loops properly, you need to do some extra work to set up storage and manage plugging the output of the loop back together again. The apply functions do this for you automagically. E.g.:
IN <- runif(10)
OUT <- logical(length = length(IN))
for(i in seq_along(IN)) {
    OUT[i] <- IN[i] > 0.5
}
That is a silly example, as > is a vectorised operator, but I wanted something to make a point: you have to manage the output. The main thing is that with for() loops you should always allocate sufficient storage to hold the outputs before you start the loop. If you don't know how much storage you will need, allocate a reasonable chunk, then check inside the loop whether you have exhausted it, and bolt on another big chunk if so.
The main reason, in my mind, for using one of the apply family of functions is for more elegant, readable code. Rather than managing the output storage and setting up the loop (as shown above) we can let R handle that and succinctly ask R to run a function on subsets of our data. Speed usually does not enter into the decision, for me at least. I use the function that suits the situation best and will result in simple, easy to understand code, because I'm far more likely to waste more time than I save by always choosing the fastest function if I can't remember what the code is doing a day or a week or more later!
The apply family lends itself to scalar or vector operations. A for() loop will often lend itself to doing multiple iterated operations using the same index i. For example, I have written code that uses for() loops to do k-fold or bootstrap cross-validation on objects. I would probably never entertain doing that with one of the apply family, as each CV iteration needs multiple operations, access to lots of objects in the current frame, and fills in several output objects that hold the output of the iterations.
As to the last point, about why lapply() can possibly be faster than for() or apply(), you need to realise that the "loop" can be performed in interpreted R code or in compiled code. Yes, both will still be calling R functions that need to be interpreted, but if you are doing the looping and calling directly from compiled C code (as lapply() does), then that is where the performance gain can come from, compared with apply(), which boils down to a for() loop in actual R code. See the source for apply() to see that it is a wrapper around a for() loop, and then look at the code for lapply(), which is:
> lapply
function (X, FUN, ...)
{
    FUN <- match.fun(FUN)
    if (!is.vector(X) || is.object(X))
        X <- as.list(X)
    .Internal(lapply(X, FUN))
}
<environment: namespace:base>
and you should see why there can be a difference in speed between lapply() and for() and the other apply family functions. The .Internal() is one of R's ways of calling compiled C code used by R itself. Apart from a manipulation, and a sanity check on FUN, the entire computation is done in C, calling the R function FUN. Compare that with the source for apply().
From Burns' R Inferno (pdf), p25:
Use an explicit for loop when each iteration is a non-trivial task. But a simple loop can be more clearly and compactly expressed using an apply function. There is at least one exception to this rule ... if the result will be a list and some of the components can be NULL, then a for loop is trouble (big trouble) and lapply gives the expected answer.

MailboxProcessor usage guidelines?

I've just discovered the MailboxProcessor in F# and its use as a "state machine" ... but I can't find much on the recommended usage of them.
For example... say I'm making a simple game with 100 on-screen enemies. Should I use a MailboxProcessor to change enemy position and health, giving me 200 active MailboxProcessors?
Is there any clever thread management going on under the hood? Should I try to limit the number of active MailboxProcessors I have, or can I keep banging them out willy-nilly?
Thanks in advance,
JD.
A MailboxProcessor for enemy simulation might look like this:
MailboxProcessor.Start(fun inbox ->
    async {
        while true do
            let! message = inbox.Receive()
            processMessage(message)
    })
It does not consume a thread while it waits for a message to arrive (the let! message = line). However, once a message arrives it will consume a thread (on the thread pool). If you have 100 mailbox processors that all receive a message simultaneously, they will all attempt to wake up and consume a thread. Since message processing here is CPU-bound, hundreds of mailbox processors will all wake up and start spawning (thread pool) threads. This is not great for performance.
One situation mailbox processors excel in is where many concurrent clients all send messages to one processor (imagine several parallel web crawlers all downloading pages and sinking results to a queue). The on-screen enemies case looks different: it is many entities responding to a single source of messages (player movement/time ticks).
Another example where thousands of MailboxProcessors are a great solution is the I/O-bound MailboxProcessor:
MailboxProcessor.Start(fun inbox ->
    async {
        while true do
            let! message = inbox.Receive()
            match message with
            | _ ->  // match on your concrete message type here
                do! AsyncWrite("something")      // placeholder async I/O
                let! response = AsyncResponse()  // placeholder async I/O
                ()  // ... process the response
    })
Here, after receiving a message, the agent very quickly yields its thread but still needs to maintain state across asynchronous operations. This will scale very well in practice: you can run thousands and thousands of such agents. This is a great way to write a web server.
As per
http://blogs.msdn.com/b/dsyme/archive/2010/02/15/async-and-parallel-design-patterns-in-f-part-3-agents.aspx
you can bang them out willy-nilly. Try it! They use the ThreadPool. I have not tried this for a real-time GUI game app, but I would not be surprised if this is 'good enough'.
say I'm making a simple game with 100 on-screen enemies should I use a MailboxProcessor to change enemy position and health; giving me 200 active MailboxProcessor?
I don't see any reason to try to use MailboxProcessor for that. A serial loop is likely to be much simpler and faster.
Is there any clever thread management going on under the hood?
Yes, lots. But it is designed for asynchronous concurrent programming (particularly scalable I/O), and your program isn't really doing that.
should I try and limit the amount of active MailboxProcessor I have or can I keep banging them out willy-nilly?
You can bang them out willy-nilly, but they are far from optimized, and performance is much worse than serial code.
Maybe this or this can help?
