More specifically, at the core of erl, what algorithm is used to understand presence and availability of other nodes? How does it handle network partitioning? Are all the nodes just constantly pinging each other?
For example, if there are two nodes, and the network cable is pulled, how does it decide what to do? Presumably one node should go idle as it's orphaned, while the other carries on, otherwise you get a split-brain behavior..
In reading up on paxos and raft, it seems like it must be doing leader election internally, but I can't seem to find any comprehensible explanation -- I left my PhD in my other pants.. Can anyone explain this voodoo in english?
I've got the same question here and it was answered with this very good article where you can find detailed explanation to your question. The main idea is: Erlang nodes in distributed mode are connected through mesh network. They monitor each other through pings that are done in period of time decided by net_tick_time constant. Pings are used to detect network splits or halted nodes unable to communicate. Other failures such as VM crashes, cable unplugs are detected immediately (within few ms) through underlying network connection.
Related
I'm trying to implementing tree topology with Cooja/contiky. Finding through examples i've not been able to find an a good example to find what i need.
In short :
I'd need to implementing a topology of this type(picture here under) with cooja end contiky, is there someone that could give me some advice?
Thanks in advance
I don't really use Contiki Operating System, I have only ever used TinyOS but a network topology such as the one you have should be easily achievable.
For TinyOS, the mote-to-mote radio tutorial HERE will show you how to two different sensor nodes can communicate with each other (a gateway is basically just a sensor node connected to a PC) and the mote-to-PC communication tutorial HERE will show you how a gateway node can forward information from itself to the PC it is connected to. When the network is running you basically have a Java application listening to USB port and receiving packets from gateway node. Once the packet has been received on the Java application then you can send it to an external network server.
It may sound difficult if you have never developed on TinyOS but what you want to do is very common and so there will be complete programs in the tutorial section of a typical TinyOS distribution showing you how to achieve most of the things you need you need to achieve. There should also be similar examples in Contiki.
Let assume I have two ejabberd server consider X and Y which has the same source and i did ejabberd clustering for those server by using this. Now consider A and B are user and those are connected in X server. Both A and B are in ONLINE state and those are connected via X server. If suppose X server is get shutdown or crashed by some issue. In this sceneraio whether the A and B are get OFFLINE state or A and B are in ONLINE state which is handle by Y server. I don't know whether my thought is right or not. If any one give me the suggestion about it.
If you have nodes in different physical locations, you should set them up as separate clusters (even if it's a cluster of 1 node) and federate them. Clustering should only be done at datacenter level since there are mnesia transactional locks between all nodes in a cluster (e.g. creating a MUC room).
"Load balancing" is not what you are describing in your question.
In load balancing, a incoming connections are distributed in a balanced fashion over multiple nodes. This is so that no one server has too high a load (hence the name "load balancing"). It also provides fail-over capability if your load balancer is smart enough to detect and remove dead nodes.
A smart load balancer can make it so that new connections always succeed as long as there is at least one working node in your cluster. However, in your question, you talk about clients "maintaining the connection". That's something quite different.
To do that, you'd either need the connection to be stateless or you'd need each client to connect to all nodes. That's not how XMPP works: it's a stateful connection to a single server. You must rely on your clients to reconnect if they get disconnected.
I'm trying to implement a cluster using Erlang as the glue that holds it all together. I like the idea that it creates a fully connected graph of nodes, but upon reading different articles online, it seems as though this doesn't scale well (having a max of 50 - 100 nodes). Did the developers of OTP impose this limitation on purpose? I do know that you can setup nodes to have explicit connections only as well as have hidden nodes, etc. But, it seems as though the default out-of-the-box setup isn't very scalable.
So to the questions:
If you had 5 nodes (A, B, C, D, E) that all had explicit connections such that A-B-C-D-E. Does Erlang/OTP allow A to talk directly to E or does A have to pass messages from B through D to get to E, and thus that's the reason for the fully connected graph? Again, it makes sense but it doesn't scale well from what I've seen.
If one was to try and go for a scalable and fault-tolerant system, what are your options? It seems as though, if you can't create a fully connected graph because you have too many nodes, the next best thing would be to create a tree of some kind. But, this doesn't seem very fault-tolerant because if the root or any parent of children nodes dies, you would lose a significant portion of your cluster.
In looking into supervisors and workers, all of the examples I've seen apply this to processes on a single node. Could it be applied to a cluster of nodes to help implement fault-tolerance?
Can nodes be part of several clusters?
Thanks for your help, if there is a semi-recent website or blogpost (roughly 1-year old) that I've missed, I'd be happy to look at those. But, I've scoured the internet pretty well.
Yes, you can send messages to a process on any remote node in a cluster, for example, by using its process identifier (pid). This is called location transparency. And yes, it scales well (see Riak, CouchDB, RabbitMQ, etc).
Note that one node can run hundred thousands of processes. Erlang has proven to be very scalable and was built for fault tolerance. There are other approaches to build bigger, e.g. SOA approach of CloudI (see comments). You also could build clusters that use hidden nodes if you really really need to.
At the node level you would take a different approach, for example, build identical nodes that are easy to replace if they fail and the work is taken over by the remaining nodes. Check out how Riak handles this (look into riak_core and check the blog post Introducing Riak Core).
Nodes can leave and enter a cluster but cannot be part of multiple clusters at the same time. Connected nodes share one cluster cookie which is used to identify connected nodes. You can set the cookie while the VM is running (see Distributed Erlang).
Read http://learnyousomeerlang.com/ for greater good.
The distribution protocol is about providing robustness, not scalability. What you want to do is to group your cluster into smaller areas and then use connections, which are not distribution in Erlang but in, say, TCP sessions. You could run 5 groups of 10 machines each. This means the 10 machines have seamless Pid distribution: you can call a pid on another machine. But distributing to another group means you can't seamlessly address the group like that.
You generally want some kind of "route reflection" as in BGP.
1) I think you need a direct connection between nodes to communicate between processes. This does, however, mean that you don't need persistent connections between all the nodes if two will never communicate (say if they're only workers, not coordinators).
2) You can create a not-fully-connected graph of erlang nodes. The documentation is hard to find, and comes with problems - you disable the global system which handles global names in the cluster, so you have to do everything by locally registered names, or locally registered names on remote nodes. Or just use Pids, as they work too. To start an erlang node like this, use erl ... -connect_all false .... I hope you know what you're up to, as I couldn't trust myself to do that.
It also turns out that a not-fully-connected graph of erlang nodes is a current research topic. The RELEASE Project is currently working on exactly that, and have come up with a concept of S-groups, which are essentially fully-connected groups. However, nodes can be members of more than one S-group and nodes in separate s-groups don't have to be fully connected but can establish the connections they need on demand to do direct node-to-node communication. It's worth finding presentations of theirs because the research is really interesting.
Another thing worth pointing out is that several people have found that you can get up to 150-200 nodes in a fully-connected cluster. Do you really have a use-case for more nodes than that? Surely 150-200 incredibly beefy computers would do most things you could throw at them, unless you have a ridiculous project to do.
3) While you can't start processes on a different node using gen_server:start_link/3,4, you can certainly call servers on a foreign node very easily. It seems that they've overlooked being able to start servers on foreign nodes, but there's probably good reason for it - such as a ridiculous number of error cases.
4) Try looking at hidden nodes, and at having a not-fully-connected cluster. They should allow you to group nodes as you see fit.
TL;DR: Scaling is hard, let's go shopping.
There are some good answers already, so I'm trying to be simple.
1) No, if A and E are not connected directly, A cannot talk to E. The distribution protocol runs on direct TCP connection - no routing included.
2) I think a tree structure is good enough - trade-offs always exist.
3) There's no 'supervisor for nodes', but erlang:monitor_node is your friend.
4) Yes. A node can talk to nodes from different 'clusters'. In the local node, use erlang:set_cookie(OtherNode, OtherCookie) to access a remote node with a different cookie.
1)
yes. they talk to each other
2) 3) and 4)
Generally speaking, when building a scalable and fault tolerant system, you would want, or more over, need to divide the work load to different "regions" or "clusters". Supervisor/Worker model has this envisioned thus the topology. What you need is a few processes coordinating work between clusters and all workers within one single cluster will talk to each other to balance out within group.
As you can see, with this topology, the "limitation" is not really a limitation as long as you divide your tasks carefully and in a balanced fashion. Personally, I believe a tree like structure for supervisor processes is not avoidable in large scale systems, and this is the practice I'm following. Reasons are vary but boils down to scalability, fault tolerance as fall back policy implementation, maintenance need and portability of the clusters.
So in conclusion,
2) use a tree-like topology for your supervisors. let workers explicitly connect to each other and talk within their own domain with the supervisors.
3) while this is the native designed environment, as I presume, I'm pretty sure a supervisor can talk to a worker on a different machine. I would not suggest this as fault tolerance can be hell in remote worker scenario.
4) you should never let a node be part of two different cluster at the same moment. You can switch it from one cluster to another though.
I am working on an MPI application, which uses threaded-MPI calls between processes. Threads are added and removed as per the load requirements. Now, I have a question, which I could not find answer in the open-mpi forum.
If a set of MPI processes ("ranks") already has a connection, ie, they are already making send-receive calls, and then a new thread comes in (either processes) which also makes the send-receive calls between the same MPI peers, would MPI open up new set of sockets?
I know that the details are implementation dependent, so there may not be a general answer. But, is there a way to find out?
There are questions on the scalability of this technique, which was chosen for other reasons. It would be great to get some stats, on the number of new sockets per connection.
Anyone knows how to do this? For instance, query which socket is a particular instance of MPI_Send writing to?
I already tried adding --mca btl self,sm,tcp --mca btl_base_verbose 30 -v -report-pid -display-map -report-bindings -leave-session-attached
Thanks a lot.
To answer my own question, here is what I learnt from brilliant folks at Open-MPI:
On Jan 24, 2012, at 5:34 PM, devendra rai wrote:
I am trying to find out how many separate connections are opened by MPI as messages are sent. Basically, I have threaded-MPI calls to a bunch of different MPI processes (who, in turn have threaded MPI calls).
The point is, with every thread added, are new ports opened (even if the sender-receiver pairs already have a connection between them)?
In Open MPI: no. The underlying connections are independent of how many threads you have.
Is there any way to find out? I went through MPI APIs, and the closest thing I found was related to cartographic information. This is not sufficient, since this only tells me the logical connections (or does it)?
MPI does not have a user-level concept of a connection. You send a message, a miracle occurs, and the message is received on the other side. MPI doesn't say anything about how it got there (e.g., it may have even been routed through some other process).
Reading Open MPI FAQ, I thought adding "--mca btl self,sm,tcp --mca btl_base_verbose 30 -display-map" to mpirun would help. But I am not getting what I need. Basically, I want to know how many ports each process is accessing (reading as well as writing).
For Open MPI's TCP implementation, it's basically one TCP socket per peer (plus a few other utility fd's). But TCP sockets are only opened lazily, meaning that we won't open the socket until you actually send to a peer.
--
Jeff Squyres
Credits to Jeff Squyres.
is it easy to write a script to test whether the network is ever down for the next 24 or 48 hours? I can use ssh to connect to a shell and come back 48 hours later to see if it is still connected to see if the network has ever been down, but can i do it programmatically easily?
The Internet (and your ethernet) is a packet-switched network, which makes the definition of 'down' difficult.
Some things are obvious; for example, if your ethernet card doesn't report a link, then it's down (unless you have redundant connections). But having a link doesn't mean its up.
The basic, 100 mile view of how the Internet works is that your computer splits the data it wants to send into ~1500-byte segments called packets. It then, pretty much blindly, sends them on their way, however your routing table says to. Then that machine repeats the process. Eventually, through many repetitions, it reaches the remote host.
So, you may be tempted to define up as the packet reached its destination. But, what happens if the packet gets corrupted, e.g., due to faulty hardware or interference? The next router will just discard it. Ok, that's fine, you may well want to consider that down. What if a router on the path is too busy, or the link it needs to be sent on is full? The packet will be dropped. You probably don't want to count that as down.
Packets routinely get lost; higher-level protocols (e.g., TCP) deal with it and retransmit the packet. In fact, TCP uses the packet drop on link full behavior to discover the link bandwidth.
You can monitor packet loss with a tool like ping, as the other answer states.
If your need is more adminstrative in nature, and using existing software is an option, you could try something like monit:
http://mmonit.com/monit/
Wikipedia has a list of similar software:
http://en.wikipedia.org/wiki/Comparison_of_network_monitoring_systems
You should consider also whether very short outages need to be detected. Obviously, a periodic reachability test cannot guarantee detecting outages shorter than the testing interval.
If you only care about whether there was an outage, not how many there were or how long they lasted, you should be able to automate your existing ssh technique using expect pretty easily.
http://expect.nist.gov/
Most platforms support a ping command that can be used to find out if a network path exists to an IP address somewhere "else". Where, exactly, to check depends on what you are really trying to answer.
If the goal is to discover the reliability of your first hop to your ISP's network, then pinging a router or their DNS regularly might be sufficient.
If your concern is really the connection to a single node (your mention of leaving an ssh session open implies this) then just pinging that node is probably the best idea. The ping command will usually fail in a way that a script can test if the connection times out.
Regardless, it is probably a good idea to check at a rate no faster than once a minute, and slower than that is probably sufficient.