Is this the right way of building an Erlang network server for multi-client apps? - erlang

I'm building a small network server for a multi-player board game using Erlang.
This network server uses a local instance of Mnesia DB to store a session for each connected client app. Inside each client's record (session) stored in this local Mnesia, I store the client's PID and NODE (the node where a client is logged in).
I plan to deploy this network server on at least 2 connected servers (Node A & B).
So in order to allow a Client A who is logged in on Node A to search (query to Mnesia) for a Client B who is logged in on Node B, I replicate the Mnesia session table from Node A to Node B or vise-versa.
After Client A queries the PID and NODE of the Client B, then Client A and B can communicate with each other directly.
Is this the right way of establishing connection between two client apps that are logged-in on two different Erlang nodes?

Creating a system where two or more nodes are perfectly in sync is by definition impossible. In practice however, you might get close enough that it works for your particular problem.
You don't say the exact reason behind running on two nodes, so I'm going to assume it is for scalability. With many nodes, your system will also be more available and fault-tolerant if you get it right. However, the problem could be simplified if you know you only ever will run in a single node, and need the other node as a hot-slave to take over if the master is unavailable.
To establish a connection between two processes on two different nodes, you need some global addressing(user id 123 is pid<123,456,0>). If you also care about only one process running for User A running at a time, you also need a lock or allow only unique registrations of the addressing. If you also want to grow, you need a way to add more nodes, either while your system is running or when it is stopped.
Now, there are already some solutions out there that helps solving your problem, with different trade-offs:
gproc in global mode, allows registering a process under a given key(which gives you addressing and locking). This is distributed to the entire cluster, with no single point of failure, however the leader election (at least when I last looked at it) works only for nodes that was available when the system started. Adding new nodes requires an experimental version of gen_leader or stopping the system. Within your own code, if you know two players are only going to ever talk to each other, you could start them on the same node.
riak_core, allows you to build on top of the well-tested and proved architecture used in riak KV and riak search. It maps the keys into buckets in a fashion that allows you to add new nodes and have the keys redistributed. You can plug into this mechanism and move your processes. This approach does not let you decide where to start your processes, so if you have much communication between them, this will go across the network.
Using mnesia with distributed transactions, allows you to guarantee that every node has the data before the transaction is commited, this would give you distribution of the addressing and locking, but you would have to do everything else on top of this(like releasing the lock). Note: I have never used distributed transactions in production, so I cannot tell you how reliable they are. Also, due to being distributed, expect latency. Note2: You should check exactly how you would add more nodes and have the tables replicated, for example if it is possible without stopping mnesia.
Zookeper/doozer/roll your own, provides a centralized highly-available database which you may use to store the addressing. In this case you would need to handle unregistering yourself. Adding nodes while the system is running is easy from the addressing point of view, but you need some way to have your application learn about the new nodes and start spawning processes there.
Also, it is not necessary to store the node, as the pid contains enough information to send the messages directly to the correct node.
As a cool trick which you may already be aware of, pids may be serialized (as may all data within the VM) to a binary. Use term_to_binary/1 and binary_to_term/1 to convert between the actual pid inside the VM and a binary which you may store in whatever accepts binary data without mangling it in some stupid way.

Related

Can I have some keyspaces replicated to some nodes?

I am trying to build multiple API for which I want to store the data with Cassandra. I am designing it as if I would have multiple hosts but, the hosts I envisioned would be of two types: trusted and non-trusted.
Because of that I have certain data which I don't want to end up replicated on a group of the hosts but the rest of the data to be replicated everywhere.
I considered simply making a node for public data and one for protected data but that would require the trusted hosts to run two nodes and it would also complicate the way the API interacts with the data.
I am building it in a docker container also, I expect that there will be frequent node creation/destruction both trusted and not trusted.
I want to know if it is possible to use keyspaces in order to achieve my required replication strategy.
You could have two Datacenters one having your public data and the other the private data. You can configure keyspace replication to only replicate that data to one (or both) DCs. See the docs on replication for NetworkTopologyStrategy
However there are security concerns here since all the nodes need to be able to reach one another via the gossip protocol and also your client applications might need to contact both DCs for different reads and writes.
I would suggest you look into configuring security perhaps SSL for starters and then perhaps internal authentication. Note Kerberos is also supported but this might be too complex for what you need at least now.
You may also consider taking a look at the firewall docs to see what ports are used between nodes and from clients so you know which ones to lock down.
Finally as the above poster mentions, the destruction / creation of nodes too often is not good practice. Cassandra is designed to be able to grow / shrink your cluster while running, but it can be a costly operation as it involves not only streaming data from / to the node being removed / added but also other nodes shuffling around token ranges to rebalance.
You can run nodes in docker containers, however note you need to take care not to do things like several containers all accessing the same physical resources. Cassandra is quite sensitive to io latency for example, several containers sharing the same physical disk might render performance problems.
In short: no you can't.
All nodes in a cassandra cluster from a complete ring where your data will be distributed with your selected partitioner.
You can have multiple keyspaces and authentication and authorziation within cassandra and split your trusted and untrusted data into different keyspaces. Or you an go with two clusters for splitting your data.
From my experience you also should not try to create and destroy cassandra nodes as your usual daily business. Adding and removing nodes is costly and needs to be monitored as your cluster needs to maintain repliaction and so on. So it might be good to split cassandra clusters from your api nodes.

Ejabberd Clustering understanding

Let assume I have two ejabberd server consider X and Y which has the same source and i did ejabberd clustering for those server by using this. Now consider A and B are user and those are connected in X server. Both A and B are in ONLINE state and those are connected via X server. If suppose X server is get shutdown or crashed by some issue. In this sceneraio whether the A and B are get OFFLINE state or A and B are in ONLINE state which is handle by Y server. I don't know whether my thought is right or not. If any one give me the suggestion about it.
If you have nodes in different physical locations, you should set them up as separate clusters (even if it's a cluster of 1 node) and federate them. Clustering should only be done at datacenter level since there are mnesia transactional locks between all nodes in a cluster (e.g. creating a MUC room).
"Load balancing" is not what you are describing in your question.
In load balancing, a incoming connections are distributed in a balanced fashion over multiple nodes. This is so that no one server has too high a load (hence the name "load balancing"). It also provides fail-over capability if your load balancer is smart enough to detect and remove dead nodes.
A smart load balancer can make it so that new connections always succeed as long as there is at least one working node in your cluster. However, in your question, you talk about clients "maintaining the connection". That's something quite different.
To do that, you'd either need the connection to be stateless or you'd need each client to connect to all nodes. That's not how XMPP works: it's a stateful connection to a single server. You must rely on your clients to reconnect if they get disconnected.

Erlang clusters

I'm trying to implement a cluster using Erlang as the glue that holds it all together. I like the idea that it creates a fully connected graph of nodes, but upon reading different articles online, it seems as though this doesn't scale well (having a max of 50 - 100 nodes). Did the developers of OTP impose this limitation on purpose? I do know that you can setup nodes to have explicit connections only as well as have hidden nodes, etc. But, it seems as though the default out-of-the-box setup isn't very scalable.
So to the questions:
If you had 5 nodes (A, B, C, D, E) that all had explicit connections such that A-B-C-D-E. Does Erlang/OTP allow A to talk directly to E or does A have to pass messages from B through D to get to E, and thus that's the reason for the fully connected graph? Again, it makes sense but it doesn't scale well from what I've seen.
If one was to try and go for a scalable and fault-tolerant system, what are your options? It seems as though, if you can't create a fully connected graph because you have too many nodes, the next best thing would be to create a tree of some kind. But, this doesn't seem very fault-tolerant because if the root or any parent of children nodes dies, you would lose a significant portion of your cluster.
In looking into supervisors and workers, all of the examples I've seen apply this to processes on a single node. Could it be applied to a cluster of nodes to help implement fault-tolerance?
Can nodes be part of several clusters?
Thanks for your help, if there is a semi-recent website or blogpost (roughly 1-year old) that I've missed, I'd be happy to look at those. But, I've scoured the internet pretty well.
Yes, you can send messages to a process on any remote node in a cluster, for example, by using its process identifier (pid). This is called location transparency. And yes, it scales well (see Riak, CouchDB, RabbitMQ, etc).
Note that one node can run hundred thousands of processes. Erlang has proven to be very scalable and was built for fault tolerance. There are other approaches to build bigger, e.g. SOA approach of CloudI (see comments). You also could build clusters that use hidden nodes if you really really need to.
At the node level you would take a different approach, for example, build identical nodes that are easy to replace if they fail and the work is taken over by the remaining nodes. Check out how Riak handles this (look into riak_core and check the blog post Introducing Riak Core).
Nodes can leave and enter a cluster but cannot be part of multiple clusters at the same time. Connected nodes share one cluster cookie which is used to identify connected nodes. You can set the cookie while the VM is running (see Distributed Erlang).
Read http://learnyousomeerlang.com/ for greater good.
The distribution protocol is about providing robustness, not scalability. What you want to do is to group your cluster into smaller areas and then use connections, which are not distribution in Erlang but in, say, TCP sessions. You could run 5 groups of 10 machines each. This means the 10 machines have seamless Pid distribution: you can call a pid on another machine. But distributing to another group means you can't seamlessly address the group like that.
You generally want some kind of "route reflection" as in BGP.
1) I think you need a direct connection between nodes to communicate between processes. This does, however, mean that you don't need persistent connections between all the nodes if two will never communicate (say if they're only workers, not coordinators).
2) You can create a not-fully-connected graph of erlang nodes. The documentation is hard to find, and comes with problems - you disable the global system which handles global names in the cluster, so you have to do everything by locally registered names, or locally registered names on remote nodes. Or just use Pids, as they work too. To start an erlang node like this, use erl ... -connect_all false .... I hope you know what you're up to, as I couldn't trust myself to do that.
It also turns out that a not-fully-connected graph of erlang nodes is a current research topic. The RELEASE Project is currently working on exactly that, and have come up with a concept of S-groups, which are essentially fully-connected groups. However, nodes can be members of more than one S-group and nodes in separate s-groups don't have to be fully connected but can establish the connections they need on demand to do direct node-to-node communication. It's worth finding presentations of theirs because the research is really interesting.
Another thing worth pointing out is that several people have found that you can get up to 150-200 nodes in a fully-connected cluster. Do you really have a use-case for more nodes than that? Surely 150-200 incredibly beefy computers would do most things you could throw at them, unless you have a ridiculous project to do.
3) While you can't start processes on a different node using gen_server:start_link/3,4, you can certainly call servers on a foreign node very easily. It seems that they've overlooked being able to start servers on foreign nodes, but there's probably good reason for it - such as a ridiculous number of error cases.
4) Try looking at hidden nodes, and at having a not-fully-connected cluster. They should allow you to group nodes as you see fit.
TL;DR: Scaling is hard, let's go shopping.
There are some good answers already, so I'm trying to be simple.
1) No, if A and E are not connected directly, A cannot talk to E. The distribution protocol runs on direct TCP connection - no routing included.
2) I think a tree structure is good enough - trade-offs always exist.
3) There's no 'supervisor for nodes', but erlang:monitor_node is your friend.
4) Yes. A node can talk to nodes from different 'clusters'. In the local node, use erlang:set_cookie(OtherNode, OtherCookie) to access a remote node with a different cookie.
1)
yes. they talk to each other
2) 3) and 4)
Generally speaking, when building a scalable and fault tolerant system, you would want, or more over, need to divide the work load to different "regions" or "clusters". Supervisor/Worker model has this envisioned thus the topology. What you need is a few processes coordinating work between clusters and all workers within one single cluster will talk to each other to balance out within group.
As you can see, with this topology, the "limitation" is not really a limitation as long as you divide your tasks carefully and in a balanced fashion. Personally, I believe a tree like structure for supervisor processes is not avoidable in large scale systems, and this is the practice I'm following. Reasons are vary but boils down to scalability, fault tolerance as fall back policy implementation, maintenance need and portability of the clusters.
So in conclusion,
2) use a tree-like topology for your supervisors. let workers explicitly connect to each other and talk within their own domain with the supervisors.
3) while this is the native designed environment, as I presume, I'm pretty sure a supervisor can talk to a worker on a different machine. I would not suggest this as fault tolerance can be hell in remote worker scenario.
4) you should never let a node be part of two different cluster at the same moment. You can switch it from one cluster to another though.

What's the best way to run a gen_server on all nodes in an Erlang cluster?

I'm building a monitoring tool in Erlang. When run on a cluster, it should run a set of data collection functions on all nodes and record that data using RRD on a single "recorder" node.
The current version has a supervisor running on the master node (rolf_node_sup) which attempts to run a 2nd supervisor on each node in the cluster (rolf_service_sup). Each of the on-node supervisors should then start and monitor a bunch of processes which send messages back to a gen_server on the master node (rolf_recorder).
This only works locally. No supervisor is started on any remote node. I use the following code to attempt to load the on-node supervisor from the recorder node:
rpc:call(Node, supervisor, start_child, [{global, rolf_node_sup}, [Services]])
I've found a couple of people suggesting that supervisors are really only designed for local processes. E.g.
Starting processes at remote nodes
how: distributed supervision tree
What is the most OTP way to implement my requirement to have supervised code running on all nodes in a cluster?
A distributed application is suggested as one alternative to a distributed supervisor tree. These don't fit my use case. They provide for failover between nodes, but keeping code running on a set of nodes.
The pool module is interesting. However, it provides for running a job on the node which is currently the least loaded, rather than on all nodes.
Alternatively, I could create a set of supervised "proxy" processes (one per node) on the master which use proc_lib:spawn_link to start a supervisor on each node. If something goes wrong on a node, the proxy process should die and then be restarted by it's supervisor, which in turn should restart the remote processes. The slave module could be very useful here.
Or maybe I'm overcomplicating. Is directly supervising nodes a bad idea, instead perhaps I should architect the application to gather data in a more loosely coupled way. Build a cluster by running the app on multiple nodes, tell one to be master, leave it at that!
Some requirements:
The architecture should be able to cope with nodes joining and leaving the pool without manual intervention.
I'd like to build a single-master solution, at least initially, for the sake of simplicity.
I would prefer to use existing OTP facilities over hand-rolled code in my implementation.
Interesting challenges, to which there are multiple solutions. The following are just my suggestions, which hopefully makes you able to better make the choice on how to write your program.
As I understand your program, you want to have one master node where you start your application. This will start the Erlang VM on the nodes in the cluster. The pool module uses the slave module to do this, which require key-based ssh communication in both directions. It also requires that you have proper dns working.
A drawback of slave is that if the master dies, so does the slaves. This is by design as it probably fit the original use case perfectly, however in your case it might be stupid (you may want to still collect data, even if the master is down, for example)
As for the OTP applications, every node may run the same application. In your code you can determine the nodes role in the cluster using configuration or discovery.
I would suggest starting the Erlang VM using some OS facility or daemontools or similar. Every VM would start the same application, where one would be started as the master and the rest as slaves. This has the drawback of marking it harder to "automatically" run the software on machines coming up in the cluster like you could do with slave, however it is also much more robust.
In every application you could have a suitable supervision tree based on the role of the node. Removing inter-node supervision and spawning makes the system much simpler.
I would also suggest having all the nodes push to the master. This way the master does not really need to care about what's going on in the slave, it might even ignore the fact that the node is down. This also allows new nodes to be added without any change to the master. The cookie could be used as authentication. Multiple masters or "recorders" would also be relatively easy.
The "slave" nodes however will need to watch out for the master going down and coming up and take appropriate action, like storing the monitoring data so it can send it later when the master is back up.
I would look into riak_core. It provides a layer of infrastructure for managing distributed applications on top of the raw capabilities of erlang and otp itself. Under riak_core, no node needs to be designated as master. No node is central in an otp sense, and any node can take over other failing nodes. This is the very essence of fault tolerance. Moreover, riak_core provides for elegant handling of nodes joining and leaving the cluster without needing to resort to the master/slave policy.
While this sort of "topological" decentralization is handy, distributed applications usually do need logically special nodes. For this reason, riak_core nodes can advertise that they are providing specific cluster services, e.g., as embodied by your use case, a results collector node.
Another interesting feature/architecture consequence is that riak_core provides a mechanism to maintain global state visible to cluster members through a "gossip" protocol.
Basically, riak_core includes a bunch of useful code to develop high performance, reliable, and flexible distributed systems. Your application sounds complex enough that having a robust foundation will pay dividends sooner than later.
otoh, there's almost no documentation yet. :(
Here's a guy who talks about an internal AOL app he wrote with riak_core:
http://www.progski.net/blog/2011/aol_meet_riak.html
Here's a note about a rebar template:
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2011-March/003632.html
...and here's a post about a fork of that rebar template:
https://github.com/rzezeski/try-try-try/blob/7980784b2864df9208e7cd0cd30a8b7c0349f977/2011/riak-core-first-multinode/README.md
...talk on riak_core:
http://www.infoq.com/presentations/Riak-Core
...riak_core announcement:
http://blog.basho.com/2010/07/30/introducing-riak-core/

Erlang remote procedure call module internals

I have Several Erlang applications on Node A and they are making rpc calls to Node B onto which i have Mnesia Stored procedures (Database querying functions) and my Mnesia DB as well. Now, occasionally, the number of simultaneous processes making rpc calls to Node B for data can rise to 150. Now, i have several questions:
Question 1: For each rpc call to a remote Node, does Node A make a completely new (say TCP/IP or UDP connection or whatever they use at the transport) CONNECTION? or there is only one connection and all rpc calls share this one (since Node A and Node B are connected [got to do with that epmd process])?
Question 2: If i have data centric applications on one node and i have a centrally managed Mnesia Database on another and these Applications' tables share the same schema which may be replicated, fragmented, indexed e.t.c, which is a better option: to use rpc calls in order to fetch data from Data Nodes to Application nodes or to develope a whole new framework using say TCP/IP (the way Scalaris guys did it for their Failure detector) to cater for network latency problems?
Question 3: Has anyone out there ever tested or bench marked the rpc call efficiency in a way that can answer the following?
(a) What is the maximum number of simultaneous rpc calls an Erlang Node can push onto another without breaking down?
(b) Is there a way of increasing this number, either by a system configuration or operating system setting? (refer to Open Solaris for x86 in your answer)
(c) Is there any other way of applications to request data from Mnesia running on remote Erlang Nodes other than rpc? (say CORBA, REST [requires HTTP end-to-end], Megaco, SOAP e.t.c)
Mnesia runs over erlang distribution, and in Erlang distribution there is only one tcp/ip connection between any pair of nodes (usually in a fully mesh arrangement, so one connection for every pair of nodes). All rpc/internode communication will happen over this distribution connection.
Additionally, it's guaranteed that message ordering is preserved between any pair of communicating processes over distribution. Ordering between more than two processes is not defined.
Mnesia gives you a lot of options for data placement. If you want your persistent storage on node B, but processing done on node A, you could have disc_only_copies of your tables on B and ram_copies on node A. That way applications on node A can get quick access to data, and you'll still get durable copies on node B.
I'm assuming that the network between A and B is a reliable LAN that is seldom going to partition (otherwise you're going to spend a bunch of time getting mnesia back online after a partition).
If both A and B are running mnesia, then I would let mnesia do all the RPC for me - this is what mnesia is built for and it has a number of optimizations. I wouldn't roll my own RPC or distribution mechanism without a very good reason.
As for benchmarks, this is entirely dependent on your hardware, mnesia schema and network between nodes (as well as your application's data access patterns). No one can give you these benchmarks, you have to run them yourself.
As for other RPC mechanisms for accessing mnesia, I don't think there are any out of the box, but there are many RPC libraries you could use to present the mnesia API to the network with a small amount of effort on your part.

Resources