How to handle split-brain? (Orleans)

I have read in the Orleans FAQ when split-brain could happen, but I don't understand what bad things can happen or how to handle it properly.
The FAQ says something vague like:
You just need to consider the rare possibility of having two instances of an actor while writing your application.
But how exactly should I take this into account, and what can happen if I don't?
Orleans Paper (http://research.microsoft.com/pubs/210931/Orleans-MSR-TR-2014-41.pdf) says this:
application can rely on external persistent storage to provide stronger data consistency
But I don't understand what this means.
Suppose a split brain happened. Now I have two instances of one grain. When I send a few messages, they could be received by these two (or could there be even more?) different instances. Suppose each instance had the same state before receiving these messages. After processing them, they have different states.
How should they persist their states? There could be a conflict.
When the other instances are destroyed and only one remains, what happens to the states of the destroyed instances? Will it be as if the messages they processed had never been processed? Then the client state and server state could become desynchronized, IIUC.
I see this (split-brain) as a big problem and I don't understand why it gets so little attention.

Orleans leverages the consistency guarantees of the storage provider. When you call this.WriteStateAsync() from a grain, the storage provider ensures that the grain has seen all previous writes. If it has not, an exception is thrown. You can catch that exception and either call DeactivateOnIdle() and rethrow it, or call ReadStateAsync() and retry. So if you have two grains during a split-brain scenario, whichever one calls WriteStateAsync() first prevents the other from writing state without first reading the most up-to-date state.
Update: Starting in Orleans v1.5.0, a grain which allows an InconsistentStateException to be thrown back to the caller will automatically be deactivated when the currently executing calls complete. A grain can catch and handle the exception to avoid automatic deactivation.
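For illustration, here is a minimal sketch of that pattern; the AccountGrain, its state class, the interface, and the provider name are all made up for the example and are not from the question:

```csharp
using System.Threading.Tasks;
using Orleans;
using Orleans.Providers;
using Orleans.Storage;

// Hypothetical grain interface and state; only the try/catch pattern matters.
public interface IAccountGrain : IGrainWithIntegerKey
{
    Task Deposit(decimal amount);
}

public class AccountState
{
    public decimal Balance { get; set; }
}

[StorageProvider(ProviderName = "AccountStore")]
public class AccountGrain : Grain<AccountState>, IAccountGrain
{
    public async Task Deposit(decimal amount)
    {
        State.Balance += amount;
        try
        {
            // Throws InconsistentStateException if another activation
            // (e.g. a duplicate created during a membership split) wrote first.
            await WriteStateAsync();
        }
        catch (InconsistentStateException)
        {
            // Option 1: give up this stale activation and let the caller retry.
            DeactivateOnIdle();
            throw;

            // Option 2 (instead of the above): reload the latest state,
            // reapply the change, and write again.
            // await ReadStateAsync();
            // State.Balance += amount;
            // await WriteStateAsync();
        }
    }
}
```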

Related

Is there any latency in SQS when creating a queue via the AWS API and sending messages immediately after creating it?

I want to create an SQS queue in code whenever it is required to send messages, and delete it after all messages are consumed.
I just want to know if some delay is needed between creating an SQS queue using Java code and then sending messages to it.
Thanks.
You'll have to try it and make observations. SQS is a distributed system, so there is a possibility that a queue might not be immediately usable, though I did not find a direct documentation reference for this.
Note the following:
If you delete a queue, you must wait at least 60 seconds before creating a queue with the same name.
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/API_CreateQueue.html
This means your names will always need to be different, but it also implies something about the internals of SQS -- deleting a queue is not an instantaneous process. The same might be true of creation, though that is not necessarily the case.
Also, there is no way to know with absolute certainty that a queue is truly empty. A long poll that returns no messages is a strong indication that there are no messages remaining, as long as there are also no messages in flight (consumed but not deleted; these return to visibility when their visibility timeout expires, unless the consumer deletes them first or explicitly extends the timeout).
However, GetQueueAttributes does not provide a fail-safe way of ensuring a queue is truly empty, because many of the counter attributes are only the approximate number of messages (visible, in flight, etc.). Again, this is related to the distributed architecture of SQS. Certain rare, internal failures could potentially cause messages to be stranded internally, only to appear later. The significance of this depends on the importance of the messages and the life cycle of the queue, and the risks of any such issue seem -- to me -- increased when a queue does not have an indefinite lifetime (i.e. when the plan is to delete the queue once it is "empty"). This is not to imply that SQS is unreliable, only to make the point that any and all systems eventually behave unexpectedly, however rare or unlikely.
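To make the create-then-send flow and the long-poll "is it empty?" check concrete, here is a rough sketch using the AWS SDK for .NET (the question mentions Java, but the flow is the same); the queue name and timings are invented:

```csharp
using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

public static class SqsSketch
{
    public static async Task RunAsync()
    {
        var sqs = new AmazonSQSClient();

        // Create the queue. Creation is not guaranteed to be instantly
        // consistent, so a freshly created queue may need a brief moment
        // before it is fully usable.
        var created = await sqs.CreateQueueAsync(
            new CreateQueueRequest { QueueName = "my-temporary-queue" });
        string queueUrl = created.QueueUrl;

        await sqs.SendMessageAsync(queueUrl, "hello");

        // "Empty" check via long polling: a 20-second long poll that returns
        // nothing is a strong (but not absolute) indication the queue is drained.
        var received = await sqs.ReceiveMessageAsync(new ReceiveMessageRequest
        {
            QueueUrl = queueUrl,
            WaitTimeSeconds = 20,
            MaxNumberOfMessages = 10
        });

        if ((received.Messages?.Count ?? 0) == 0)
        {
            // Remember: a deleted queue's name cannot be reused for 60 seconds.
            await sqs.DeleteQueueAsync(queueUrl);
        }
    }
}
```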

How should I use a storage provider with Orleans

I'm a newbie in Orleans. I'd like to know how I can use the grain storage feature in Orleans. Should I use it like a message queue? Does it store my state temporarily and keep the data available even if an exception is thrown or the server crashes?
Thanks!
Grains that extend the Grain<T> class and are annotated with a [StorageProvider] attribute will write their current state to the specified provider when you call base.WriteStateAsync().
If the grain is deactivated for any reason (including a server crash), then upon reactivation the grain will be initialized with the state that was last saved.
I like to think of it as a cache, rather than a queue. Hope that helps, and, like the previous poster said, read the documentation, it's useful.
I wrote a couple of articles to guide you step by step through getting used to the Storage Provider API and setting up your persistence store:
Introduction to Grain Persistence with Microsoft Orleans
Orleans Grain Persistence with the ADO .NET Storage Provider
Basically, Orleans gives you a very simple API (the first article above illustrates it with a diagram):
Your grain inherits from Grain<T>, where T is your own class containing the state you want to persist. The State property on Grain<T> lets you read and modify that state. The remaining async methods let you save changes to the persistence store, read them back, or clear the state. You typically don't need to read the state explicitly; that happens automatically when the grain is activated.
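Paraphrasing that API as code (from memory, not copied from the Orleans source), the relevant members of Grain<T> look roughly like this:

```csharp
using System.Threading.Tasks;

// Approximate shape of the persistence API described above
// (a paraphrase for illustration, not the actual Orleans declaration).
public abstract class Grain<TGrainState>
{
    // In-memory copy of your state; loaded automatically when the grain activates.
    protected TGrainState State { get; set; }

    // Re-read State from the configured storage provider.
    protected abstract Task ReadStateAsync();

    // Persist the current State to the configured storage provider.
    protected abstract Task WriteStateAsync();

    // Delete the persisted state.
    protected abstract Task ClearStateAsync();
}
```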
There are no message queues involved. When you call one of these three methods, they will use the underlying storage provider to talk to whatever database you are using. This may fail due to store-specific errors (e.g. deadlocks), or due to an InconsistentStateException that is the result of a failed optimistic concurrency control check.
Whatever storage provider you decide to use (e.g. SQL Server, Azure Table Storage, in-memory, etc.) must be configured via either XML config or code, and given a name. This name is then used in a [StorageProvider] attribute over the grain class; in this way, the grain knows which storage provider to use when doing its persistence work (you could have several in your system).
The details of how all this is done are a bit lengthy to include here (which is why I wrote articles on the subject). You can find more information either in my articles linked above or in the Grain Persistence documentation.
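For example, code-based registration in the Orleans 1.x era looked roughly like the sketch below; the provider name is made up, and the exact helper methods vary by version (Orleans 2.0 and later use SiloHostBuilder / AddMemoryGrainStorage instead):

```csharp
using Orleans.Runtime.Configuration;

// Assumption: Orleans 1.x-era configuration helpers.
var config = ClusterConfiguration.LocalhostPrimarySilo();

// Register an in-memory provider under the name "MyStore"; a grain opts in
// by putting [StorageProvider(ProviderName = "MyStore")] over its class.
config.AddMemoryStorageProvider("MyStore");
```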

How to implement status in Erlang?

I am thinking about an Erlang program that has many workers (receive loops). These workers manipulate their state almost all the time, i.e. massively concurrently, and the number of workers is so big that keeping their state in Mnesia would cause performance problems. So I am thinking of passing the state as arguments in each loop iteration and writing it to Mnesia some time later. Is this good practice? Is there a better way to do this? (Roughly speaking, I'm looking for something like an instance with attributes in an object-oriented language.)
Thanks.
With Erlang, it is a good habit to see processes as actors, each with a dedicated and limited role. With this in mind you will see that you can split your problem into categories such as:
Maintaining the state of a connection with a user over the Internet,
Keeping information such as login, user profile, friends, shopping cart...
Logging events
...
For each role you will have to decide whether the state information must survive the process.
In many cases it is not necessary (case 1), and the solution is simply to keep the state in the arguments of the process's loop function. I encourage you to look at the OTP behaviours; gen_server and gen_fsm are made for this.
Case 2 obviously involves permanent data that must survive a process crash or even a hardware crash. This data will be stored using DETS, Mnesia, or any database suited to your problem (Redis, CouchDB, ...).
It is important to limit the information stored in an external database; otherwise you lose a very powerful feature of Erlang, the lack of side effects. In other words, it is a very bad idea to have process behaviour that depends on external information.
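The Erlang-specific answer is gen_server, but as a cross-language illustration of case 1 (state carried in the loop, persisted only occasionally), here is a rough C# sketch; the worker, its message type, and the flush threshold are all invented for the example:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical message the worker reacts to.
public record Credit(string User, int Amount);

public class Worker
{
    private readonly Channel<Credit> _inbox = Channel.CreateUnbounded<Credit>();

    public ValueTask SendAsync(Credit msg) => _inbox.Writer.WriteAsync(msg);

    // The "receive loop": state lives in local variables, like the loop
    // arguments of an Erlang process, and is flushed to storage only
    // every 100 messages instead of on every update.
    public async Task RunAsync(Func<Dictionary<string, int>, Task> flushToDb)
    {
        var balances = new Dictionary<string, int>();
        int sinceFlush = 0;

        await foreach (var msg in _inbox.Reader.ReadAllAsync())
        {
            balances.TryGetValue(msg.User, out var current);
            balances[msg.User] = current + msg.Amount;

            if (++sinceFlush >= 100)
            {
                await flushToDb(balances);   // the Mnesia write in the question
                sinceFlush = 0;
            }
        }
    }
}
```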

handle saving of transient gen_servers states when using a key-to-pid mechanism

I would like to know how to handle saving the state of transient gen_servers when they are associated with a key.
To associate keys with processes, I use a process called pidstore. Pidstore starts processes when needed.
I give a key and an {M, F, A} to pidstore; it looks for the key in global, then either returns the pid if found, or applies the MFA (which must return {ok, Pid}), registers the pid under the key in global, and returns the pid.
I may have many inactive gen_servers with possibly huge state. So I've set up the handle_info callback to save the state to my database and then stop the process. The gen_servers are marked transient in their supervisor, so they won't be restarted until something needs them again.
Here start the problems: if I look up a process by its key, say {car, 23}, while the process representing {car, 23} is in the saving step of handle_info, I'll get the pid back as intended, because the process is saving and not yet finished. So I'll call the process with gen_server:call, but I'll never get a response (and will hit the default 5-second timeout), because the process is stopping. (PROBLEM A)
To solve this, the process could unregister itself from global, then save its state, then stop. But if I need it after it has unregistered and before the save is finished, I will start a new process, and that process could load stale values from the database. (PROBLEM B)
To solve this in turn, I could ensure that loads and saves in the DB are queued and cannot run concurrently. This could be a bottleneck. (PROBLEM C)
I've been thinking about another solution: before saving, my processes could tell the pidstore that they are busy. The pidstore would keep a list of busy processes and respond 'busy' to any request for those keys.
When the save is done, the process would tell the pidstore no_more_busy, and the pidstore could start a new process when asked for that key. (Even if the old process has not finished, it is done saving, so it can take its time to die on its own.)
This seems a bit messy to me, but it feels simpler to make several attempts to get the pid from a key than to wrap every gen_server call to handle possible timeouts (when the process is finishing but still registered in global).
I'm a bit confused by all these half-problems and half-solutions. What design do you use in this situation, or how can I avoid the situation entirely?
I hope my message is legible; please tell me about English errors too.
Thank You
Maybe you want to do the save to DB part in a gen_server:call. That would prevent other calls from coming in while you are writing to DB.
Generally it sounds like you have created a process registry. You might want to look into gproc (https://github.com/uwiger/gproc), which does a very good job at that if you want to register locally. With gproc you can do exactly what you described above: use a key to register a process. Maybe it would be good enough to register with gproc in your init function and unregister when writing to the DB. You could also write to the DB in your terminate function.
For now I've decided to stick with Erlang's « let it crash » philosophy. If a process receives messages as it is shutting down, those messages will not be answered and will trigger a gen_server:call/* timeout.
Handling this timeout in the right place will be tedious, and I have not decided where yet, but that is specific to my application, so it is beside the point here.

Database resiliency

I'm designing an application that relies heavily on a database. I need the application to be resilient to short losses of connectivity to the database (for example, the network going down for a few seconds). What are the usual patterns people use for this kind of problem? Is there something I can do in the database access layer to gracefully handle a small glitch in the network connection to the DB? (I'm using Hibernate + Oracle JDBC + DBCP pool.)
I'll assume you have hidden every database access behind a DAO or something similar.
Now create wrappers around these DAOs that try to call them and, in case of an exception, wait a second and retry. Of course this will cause the application to 'hang' during a DB outage, but it will come back to life when the database becomes available.
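A rough sketch of such a wrapper; the DAO interface, entity, and retry limits below are invented for the example, and the retry logic itself is generic rather than Hibernate-specific:

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical DAO interface; stands in for whatever your real DAOs look like.
public interface IOrderDao
{
    Task SaveOrder(Order order);
}

public class Order { public int Id { get; set; } }

// Wrapper that retries transient failures with a short pause, so brief
// connectivity losses just make the call "hang" until the DB is back.
public class RetryingOrderDao : IOrderDao
{
    private readonly IOrderDao _inner;
    private readonly int _maxAttempts;

    public RetryingOrderDao(IOrderDao inner, int maxAttempts = 10)
    {
        _inner = inner;
        _maxAttempts = maxAttempts;
    }

    public async Task SaveOrder(Order order)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                await _inner.SaveOrder(order);
                return;
            }
            catch (Exception) when (attempt < _maxAttempts)
            {
                // In real code, only catch exceptions you know are transient
                // (connection refused, timeouts), not constraint violations.
                await Task.Delay(TimeSpan.FromSeconds(1));
            }
        }
    }
}
```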
If this is not acceptable, you'll have to move the cut-off point closer to the UI layer. Consider the following approach (sketched in code below):
The user makes a request.
Wrap all the request information in a message and put it on a queue.
Return to the user, telling them that the request will be processed shortly.
A worker registered on the queue processes the request, retrying when database problems occur.
Note that you are now deep in concurrency land, so you must handle things like requests referencing an entity that has already been deleted.
Read up on 'eventual consistency'.
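Here is a minimal sketch of that flow; the message type, queue, and worker are invented, and a real system would use a durable queue (e.g. a message broker) rather than an in-process channel:

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

// The request, captured as a message instead of being executed inline.
public record CreateOrderRequest(Guid RequestId, string Customer);

public class OrderRequestQueue
{
    private readonly Channel<CreateOrderRequest> _queue =
        Channel.CreateUnbounded<CreateOrderRequest>();

    // Called from the UI layer: enqueue and return immediately.
    public async Task<Guid> SubmitAsync(string customer)
    {
        var request = new CreateOrderRequest(Guid.NewGuid(), customer);
        await _queue.Writer.WriteAsync(request);
        return request.RequestId;   // "your request will be processed shortly"
    }

    // Background worker: retries each message until the database cooperates.
    public async Task RunWorkerAsync(Func<CreateOrderRequest, Task> writeToDatabase)
    {
        await foreach (var request in _queue.Reader.ReadAllAsync())
        {
            while (true)
            {
                try { await writeToDatabase(request); break; }
                catch (Exception)
                {
                    // DB temporarily unavailable: wait and retry. Real code would
                    // treat permanent failures (e.g. the referenced entity was
                    // already deleted) differently instead of retrying forever.
                    await Task.Delay(TimeSpan.FromSeconds(1));
                }
            }
        }
    }
}
```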
Since you are using Hibernate, you'll have to deal with lazy loading. An interruption in connectivity will kill your session, so it might be best not to use lazy loading at all and to work with detached objects instead.
