question about inter-process messaging - erlang

Hi
I am implementing an IM-server-like application in Erlang. I use one agent process for each client connecting to the server, and the agent process is responsible for sending messages to a message gateway, which in turn delivers each message to another agent process. It seems that Erlang inter-process messaging is implemented as TCP connections, so there would be one connection from each agent process to the gateway. Does this mean that the number of agents on a single machine can never exceed 65,535 because of the restriction on port numbers?
Thanks in advance!

Some limits for Erlang: Efficiency Guide
User's Guide / 10 Advanced:
Distributed nodes
Known nodes
A remote node Y has to be known to node X if there exist any pids, ports, references, or funs (Erlang data types) from Y on X, or if X and Y are connected. The maximum number of remote nodes simultaneously/ever known to a node is limited by the maximum number of atoms available for node names. All data concerning remote nodes, except for the node name atom, are garbage-collected.
Connected nodes
The maximum number of simultaneously connected nodes is limited by either the maximum number of simultaneously known remote nodes, the maximum number of (Erlang) ports available, or the maximum number of sockets available.

Erlang connects Erlang nodes over the network, not Erlang processes. (And each Erlang node is an Operating System process).
So you will run out of TCP connections when you have tens of thousands of Erlang nodes on a single machine (which is unreasonable), not when you have tens of thousands of Erlang processes on a single Erlang node.
(I am not giving any absolute numbers, as each Erlang node also needs to communicate with epmd, which adds another network connection. But you will probably never have 30,000+ Erlang nodes, i.e. OS processes, on a single machine anyway.)
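For what it's worth, here is a minimal sketch (the module name and message shapes are made up for illustration) of two agent processes on the same node exchanging messages with ! and receive. Within a node this goes through the VM's in-memory mailboxes; no sockets or port numbers are consumed per process.

-module(agent_demo).
-export([start/0, agent/0]).

%% Spawn two agents on the same node and relay one message between them.
%% No TCP connection is involved; a send goes straight into the receiver's mailbox.
start() ->
    A = spawn(?MODULE, agent, []),
    B = spawn(?MODULE, agent, []),
    A ! {forward, B, <<"hello">>},
    ok.

agent() ->
    receive
        {forward, To, Msg} ->
            To ! {msg, self(), Msg},   %% pass the message on to the other agent
            agent();
        {msg, From, Msg} ->
            io:format("~p received ~p from ~p~n", [self(), Msg, From]),
            agent()
    end.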

Related

How to build a very large cluster with Erlang?

I'm a newbie to Erlang. In Erlang, a node is represented by an atom, like 'name@host'. I want to ask how a node can communicate with the other nodes without increasing its own number of atoms.
I want to build a very distributed storage system which may contain thousands of nodes.
For a specified node A, it can send/receive messages to/from any other nodes in the cluster, for example:
rpc:call(Node, Module, Method, [])
But with nodes joining and leaving the cluster, node A may end up having communicated with thousands of nodes;
this way, the number of atoms on node A keeps increasing and will eventually reach the limit.
How can I prevent this from happening?
If I use a Pid instead of a node name to communicate, for example:
Pid ! Message
Will this also increase the number of atoms on node A? It is said that a Pid contains information about the remote node.
The default maximum number of atoms is 1,048,576, and you can raise it with the +t flag; see the Erlang docs.
There's no chance you will hit the limit through Erlang clustering alone. If you were to scale to the tens of thousands or millions of nodes, you would very likely run several separate clusters anyway.
Distributed Erlang keeps the cluster alive with TCP heartbeats between nodes. You probably don't want a single cluster of more than a few thousand nodes.
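As a quick illustration (the limit value below is arbitrary), the atom table size is set at VM start-up and can be inspected at run time on recent OTP releases:

%% start the node with a larger atom table (example value)
erl +t 5000000

%% then, from the Erlang shell (OTP 20 and later):
erlang:system_info(atom_limit).   %% configured maximum number of atoms
erlang:system_info(atom_count).   %% atoms currently in the table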

Ejabberd Clustering understanding

Let's assume I have two ejabberd servers, X and Y, built from the same source, and I have set up ejabberd clustering between them. Now consider two users, A and B, who are connected to server X. Both A and B are in the ONLINE state via server X. Suppose server X shuts down or crashes because of some issue. In this scenario, do A and B go into the OFFLINE state, or do they stay ONLINE, handled by server Y? I don't know whether my understanding is right or not. I'd appreciate any suggestions.
If you have nodes in different physical locations, you should set them up as separate clusters (even if it's a cluster of 1 node) and federate them. Clustering should only be done at datacenter level since there are mnesia transactional locks between all nodes in a cluster (e.g. creating a MUC room).
"Load balancing" is not what you are describing in your question.
In load balancing, incoming connections are distributed in a balanced fashion over multiple nodes, so that no one server has too high a load (hence the name "load balancing"). It also provides fail-over capability if your load balancer is smart enough to detect and remove dead nodes.
A smart load balancer can make it so that new connections always succeed as long as there is at least one working node in your cluster. However, in your question, you talk about clients "maintaining the connection". That's something quite different.
To do that, you'd either need the connection to be stateless or you'd need each client to connect to all nodes. That's not how XMPP works: it's a stateful connection to a single server. You must rely on your clients to reconnect if they get disconnected.
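As a rough sketch of that last point (connect_to_server/1 is a hypothetical connect function and the backoff numbers are arbitrary; this is not ejabberd-specific code), a client-side reconnect loop might look like:

%% Try to (re)connect, doubling the delay after each failure up to a cap.
reconnect(Server) ->
    reconnect(Server, 1000).

reconnect(Server, Delay) ->
    case connect_to_server(Server) of      %% hypothetical: open the XMPP session
        {ok, Session} ->
            {ok, Session};
        {error, _Reason} ->
            timer:sleep(Delay),
            reconnect(Server, min(Delay * 2, 60000))
    end.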

how to listen to a broadcast on -any- port within a linux distributed system

I am working on a distributed application in which a set of logical nodes communicate with each other.
In the initial discovery phase, each logical node starts up and sends out a UDP broadcast packet to the network to inform the rest of the nodes of its existence.
With different physical hosts, this can easily be handled by agreeing on a port number and keeping track of UDP broadcasts received from other hosts.
My problem is: I need to be able to handle the case of multiple logical nodes on the same machine as well.
In that case, it seems I cannot bind to the same port twice. How do I handle node discovery if there are two logical nodes on the same box? Thanks a lot in advance!
Your choices are:
Create a raw socket and listen to all packets on a particular NIC; by looking at the content of each packet, the process can identify whether the packet is destined for it. The problem with this is the sheer number of packets you would have to process. This is why operating system kernels bind sockets to processes, so that traffic gets distributed efficiently.
Create a specialized service, i.e. a daemon that handles announcements from new processes that are available to execute the work. When launched, each process has to announce its port number to the service. This is usually how it is done (a minimal sketch follows after this list).
Use virtual IP addresses for each process you want to run; each process binds to a different IP address. If you are running on a local network, this is the simplest way.
Define a range of ports and scan that range on all the IP addresses you have defined.
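Here is a minimal Erlang sketch of the second option (the module name, the port 45454, and the "ANNOUNCE" message format are all arbitrary choices): a daemon bound to a well-known loopback port collects port announcements from the logical nodes running on the same machine.

-module(local_discovery).
-export([start_daemon/0, announce/1]).

-define(DAEMON_PORT, 45454).   %% well-known local port, arbitrary choice

%% The daemon binds the well-known port once; every logical node on the
%% machine announces its own listening port to it.
start_daemon() ->
    spawn(fun() ->
        {ok, Sock} = gen_udp:open(?DAEMON_PORT,
                                  [binary, {active, true}, {ip, {127,0,0,1}}]),
        loop(Sock, [])
    end).

loop(Sock, Known) ->
    receive
        {udp, Sock, _Ip, _FromPort, <<"ANNOUNCE ", PortBin/binary>>} ->
            Port = binary_to_integer(PortBin),
            loop(Sock, lists:usort([Port | Known]))
    end.

%% Called by each logical node after it has bound its own (ephemeral) port.
announce(MyPort) ->
    {ok, Sock} = gen_udp:open(0, [binary]),
    Payload = <<"ANNOUNCE ", (integer_to_binary(MyPort))/binary>>,
    ok = gen_udp:send(Sock, {127,0,0,1}, ?DAEMON_PORT, Payload),
    gen_udp:close(Sock).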

How scalable is distributed Erlang?

Part A:
Erlang has a lot of success stories about running concurrent agents e.g. the millions of simultaneous Facebook chats. That's millions of agents, but of course it's not millions of CPUs across a network. I'm having trouble finding metrics on how well Erlang scales when scaling is "horizontal" across a LAN/WAN.
Let's assume that I have many (tens of thousands) physical nodes (running Erlang on Linux) that need to communicate and synchronize small infrequent amounts of data across the LAN/WAN. At what point will I have communications bottlenecks, not between agents, but between physical nodes? (Or will this just work, assuming a stable network?)
Part B:
I understand (as an Erlang newbie, meaning I could be totally wrong) that Erlang nodes attempt to all connect to and be aware of each other, resulting in an N^2 connection point-to-point network. Assuming that part A won't just work with N = 10K's, can Erlang be configured easily (using out-of-the-box config or trivial boilerplate, not writing a full implementation of grouping/routing algorithms myself) to cluster nodes into manageable groups and route system-wide messages through the cluster/group hierarchy?
We should specify that we are talking about horizontal scalability across physical machines -- that's the only real problem. The CPUs on one machine will be handled by one VM, no matter how many of them there are.
node = machine.
To begin with, 30-60 nodes is what you get out of the box (a vanilla OTP installation) with any custom application written on top of it (in Erlang). Proof: ejabberd.
~100-150 is possible with an optimized custom application. I mean, it has to be good code, written with knowledge of the GC, the characteristics of the data types, message passing, etc.
Over 150 is still all right, but once we talk about numbers like 300 or 500 it will require optimization and customization of the TCP layer. The application also has to be aware of the cost of, for example, synchronous calls across the cluster.
The other thing is the DB layer. Mnesia (built-in), due to its design, will not be effective beyond roughly 20 nodes (in my experience -- I may be wrong). Solution: just use something else, e.g. a Dynamo-style DB, a separate cluster of MySQL servers, HBase, etc.
The most common technique for balancing the cost of building a high-quality application against scalability is a federation of clusters of roughly 20-50 nodes each. So internally it is an efficient mesh of ~50 Erlang nodes, connected via any suitable protocol to N other 50-node clusters. To sum up, such a system is a federation of N Erlang clusters.
Distributed Erlang is designed to run in one data center. If you need more, geographically distant, nodes, then use federation.
There are also config options, e.g. ones that stop every node from connecting to every other node. They may be helpful; however, in a cluster of ~50 nodes the Erlang overhead is not significant. You can also create a graph of Erlang nodes using 'hidden' connections, which do not join the full mesh, but then a node also cannot benefit from being connected to all the others.
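For example (the node names here are placeholders), a node can be kept out of the full mesh either with the -hidden flag or by turning off transitive connections, and then connected explicitly:

%% start a node that does not announce itself to, or connect to, the whole mesh
erl -name gateway@host1.example -hidden

%% or keep automatic transitive connections off for a normal node
erl -name n1@host1.example -connect_all false

%% hidden connections have to be established explicitly, e.g. from the shell:
net_kernel:connect_node('gateway@host1.example').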
The biggest problem I see in this kind of system is designing it as a masterless system. If you do not need that, everything should be OK.

Erlang remote procedure call module internals

I have several Erlang applications on node A, and they make rpc calls to node B, on which I have Mnesia stored procedures (database querying functions) as well as my Mnesia DB. Occasionally, the number of simultaneous processes making rpc calls to node B for data can rise to 150. I have several questions:
Question 1: For each rpc call to a remote node, does node A make a completely new connection (say TCP/IP or UDP, or whatever is used at the transport layer), or is there only one connection that all rpc calls share (since node A and node B are connected [something to do with that epmd process])?
Question 2: If I have data-centric applications on one node and a centrally managed Mnesia database on another, and these applications' tables share the same schema, which may be replicated, fragmented, indexed, etc., which is the better option: to use rpc calls to fetch data from the data nodes to the application nodes, or to develop a whole new framework using, say, TCP/IP (the way the Scalaris guys did for their failure detector) to cater for network latency problems?
Question 3: Has anyone out there ever tested or benchmarked rpc call efficiency in a way that can answer the following?
(a) What is the maximum number of simultaneous rpc calls an Erlang node can push onto another without breaking down?
(b) Is there a way of increasing this number, either by a system configuration or an operating system setting? (refer to OpenSolaris for x86 in your answer)
(c) Is there any other way for applications to request data from Mnesia running on remote Erlang nodes, other than rpc? (say CORBA, REST [requires HTTP end to end], Megaco, SOAP, etc.)
Mnesia runs over Erlang distribution, and in Erlang distribution there is only one TCP/IP connection between any pair of nodes (usually in a full-mesh arrangement, so one connection for every pair of nodes). All rpc/inter-node communication happens over this single distribution connection.
Additionally, it's guaranteed that message ordering is preserved between any pair of communicating processes over distribution. Ordering between more than two processes is not defined.
Mnesia gives you a lot of options for data placement. If you want your persistent storage on node B, but processing done on node A, you could have disc_only_copies of your tables on B and ram_copies on node A. That way applications on node A can get quick access to data, and you'll still get durable copies on node B.
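A minimal sketch of that placement (the node names and the record are made up; it assumes Mnesia is already started on both nodes with a shared schema):

-record(item, {key, value}).

create_item_table() ->
    mnesia:create_table(item,
        [{attributes, record_info(fields, item)},
         {ram_copies,       ['a@apphost']},    %% fast, in-memory copy on node A
         {disc_only_copies, ['b@dbhost']}]).   %% durable, disk-only copy on node B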
I'm assuming that the network between A and B is a reliable LAN that is seldom going to partition (otherwise you're going to spend a bunch of time getting mnesia back online after a partition).
If both A and B are running mnesia, then I would let mnesia do all the RPC for me - this is what mnesia is built for and it has a number of optimizations. I wouldn't roll my own RPC or distribution mechanism without a very good reason.
As for benchmarks, this is entirely dependent on your hardware, mnesia schema and network between nodes (as well as your application's data access patterns). No one can give you these benchmarks, you have to run them yourself.
As for other RPC mechanisms for accessing mnesia, I don't think there are any out of the box, but there are many RPC libraries you could use to present the mnesia API to the network with a small amount of effort on your part.
