Can I use module pool to decide load balance? - erlang

I'm looking for an automatic way to do load balancing, and this module caught my attention.
As the manual says,
pool can be used to run a set of Erlang nodes as a pool of computational processors.
It is organized as a master and a set of slave nodes and includes the following features:
The slave nodes send regular reports to the master about their current load.
Queries can be sent to the master to determine which node will have the least load.
The BIF statistics(run_queue) is used for estimating future loads.
It returns the length of the queue of ready to run processes in the Erlang runtime system.
What's the frequency and load for the slave nodes to send regular reports?
Is it a proper way to do load balancing?

Reports are sent every 2 seconds and use information gathered from statistics(run_queue) to determine the node with the least load. run_queue returns the queue size of the current node's scheduler.
When you call pool:get_node/0 you are getting the node with the lowest number of tasks waiting to be executed on its scheduler. Keep in mind that nodes are kept in sorted order, so calls to pool:get_node/0 do not directly query nodes, but rather rely on information that could be up to 2 seconds old.
If you need a load balanced pool of nodes, pool works great.
Here is some more info from the pool.erl source:
%% Supplies a computational pool of processors.
%% The chief user interface function here is get_node()
%% Which returns the name of the nodes in the pool
%% with the least load !!!!
%% This function is callable from any node including the master
%% That is part of the pool
%% nodes are scheduled on a per usgae basis and per load basis,
%% Whenever we use a node, we put at the end of the queue, and whenever
%% a node report a change in load, we insert it accordingly
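A minimal sketch of how the pool API might be used from the master. The module name pool_demo and the node name worker are made up for illustration, and start/0 assumes a distributed node with a .hosts.erlang file listing the hosts where slave nodes may be started:

```erlang
-module(pool_demo).
-export([start/0, run/3, local_load/0]).

%% Start the pool master and the slave nodes (hosts come from .hosts.erlang).
%% The node name 'worker' is an arbitrary choice for this sketch.
start() ->
    pool:start(worker).

%% Spawn M:F(A) on the node the master currently believes is least loaded.
%% The answer comes from the master's sorted list, not from a fresh query.
run(M, F, A) ->
    Node = pool:get_node(),
    spawn(Node, M, F, A).

%% The figure that a node reports to the master as its load.
local_load() ->
    statistics(run_queue).
```

pool:pspawn/3 combines the two steps in run/3 into a single call, so in practice it may be all you need.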

Related

Distributing in-memory linked list

I have a program built around a singly linked list. Several other programs produce data and send it to this linked list module to be added. As long as RAM is available, the program works as intended. Periodically (about once a year) I archive the entire linked list to disk; the requirements say I must archive all of it. So far so good.
What happens if I want to add a new node to the list while RAM is full and I haven't yet archived the list and freed the memory? This might occur when the number of producers goes up or, regardless of the producer count, when more data is created depending on where the program is used. I couldn't find a clear way to scale the in-memory linked list. I have a workaround in mind, but I don't know whether it would work, so I thought it better to ask here.
When RAM starts to get almost full, I would create a new instance of the linked list program: just another machine in the cloud, a new physical computer on premises, whatever.
I have a service discovery module (something like ZooKeeper); it will detect the newly created machine and add it to its list.
When the first instance is almost at its limit, it will check whether another instance is available. If there is one, it will relay the new node to that instance and update its own last node's next pointer to something special. If you want to traverse the list from start to finish across all the machines, then every time you come to such a special node it will tell you which machine holds the next node, and traversal will continue from that machine.
Since this is not a hash map or anything of that nature, I can't just replicate the service and, for example, relay each incoming request to a particular machine based on a given key.
Rather than archiving part of the old data, loading it back into RAM when needed and continuing like that, I thought it would be better to have the last pointer point to a different machine and continue reading from that machine. A network call seemed the better choice because this program will be used on an intranet, but I still couldn't find a solid solution on paper.
Is there an existing example that I can study to find a better solution? Is this solution feasible?
An example:
Machine 1:
1st node : [data:x, *next: 2nd Node address],
2nd node : [data:123, *next: 3rd Node address],
...
// at this point RAM is almost full
// receive next instance's ip
(n-1)th node : [data:987, *next: nth Node address],
nth node : [data:x2t, type: LastNodeInMachine, *next: nullptr]
Machine 2:
1st node == (n+1) node : [data:x, *next: 2nd Node address],
... and so on

CAN bus sending data from two masters with equal balance

I have two master nodes connected to the same CAN bus, both send data to my PC.
first master ID = 0xFFA1
second master ID = 0xFFA2
Since the first master ID is lower than the second it takes control of the bus more than the second master. And this causes some delay in the data.
Is there a way to balance the load between the two nodes so that each node sends an almost equal number of messages?
I tried making the first node send data while switching between the two IDs 0xFFA1 and 0xFFB2, with the second node sending data with ID 0xFFB1, but it didn't help.
There is no such thing as "masters" in CAN, nor in higher layer protocols like CANopen for that matter (a "master" in CANopen is just a supervisor node). Who gets to send what is defined by the CAN identifiers - CAN is primarily focusing on data, not nodes. What matters is what is sent, rather than who is sending/receiving, since every message is broadcasted.
It sounds as if you have 2 nodes that wildly spam the bus with identifier 0xFFA1 and 0xFFA2 messages, as fast as they are able, leading to 100% bus load. Then the node sending 0xFFA2 will "starve". Sending data "as fast as you are able" is never the correct way to use CAN.
Instead you need to define a higher layer protocol that dictates real-time characteristics. In control systems, this is most commonly done by having nodes send data at fixed intervals, such as once per 10ms or 100ms. This alone should fix your starvation problem.
If you want to prevent nodes from sending at the same time, then you could provide a means for them to synchronize. A trick used in CANopen and other protocols, is to have one node send out a "sync" message at given fixed time intervals.
After reading this sync message, all nodes should act within x ms from receiving it.
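The starvation effect and the fixed-interval fix can be illustrated with a toy model, written in Erlang to match the rest of this page. This is only a sketch under simplifying assumptions, not real CAN driver code: each frame occupies exactly one bus slot, the lowest identifier always wins arbitration, and the module name can_sim and the {Id, Period} format are invented for the example.

```erlang
-module(can_sim).
-export([run/2]).

%% Nodes: [{Id, Period}], where a node queues one frame every Period slots.
%% Returns [{Id, FramesSent}] after Slots bus slots.
run(Nodes, Slots) ->
    State = [{Id, Period, 0, 0} || {Id, Period} <- lists:keysort(1, Nodes)],
    Final = lists:foldl(fun step/2, State, lists:seq(0, Slots - 1)),
    [{Id, Sent} || {Id, _, _, Sent} <- Final].

%% One bus slot: queue any frames that are due, then let one node transmit.
step(Slot, State) ->
    Queued = [{Id, P, Pending + due(Slot, P), Sent}
              || {Id, P, Pending, Sent} <- State],
    arbitrate(Queued).

due(Slot, Period) when Slot rem Period =:= 0 -> 1;
due(_Slot, _Period) -> 0.

%% State is sorted by Id, so the first node with a pending frame wins the slot.
arbitrate([{Id, P, Pending, Sent} | Rest]) when Pending > 0 ->
    [{Id, P, Pending - 1, Sent + 1} | Rest];
arbitrate([Node | Rest]) ->
    [Node | arbitrate(Rest)];
arbitrate([]) ->
    [].
```

With both nodes sending in every slot, can_sim:run([{16#FFA1, 1}, {16#FFA2, 1}], 100) gives all 100 slots to 0xFFA1 and none to 0xFFA2. With a period of 10 slots each, both nodes get all 10 of their frames through.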

Apache Storm - use multiple spouts?

So I'm trying to configure my spout(s) to read from an Amazon SQS queue. Now, I want a situation wherein I can share the load across multiple spouts.
I understand it's possible to have multiple threads, but can I have two or more different spout instances/applications which read from the same queue and emit to the same topology? For example, Spout A and Spout B read from the SQS queue and then both emit to bolt C?
Of course you can have multiple spouts, but you have to define them so that the same element is not submitted twice (unless your topology accepts that by design). Processing the same element more than once leads to wrong counters, for instance.
Check Storm concurrency as a start with executors (threads) and tasks (instances) per spout / bolt and choose the number you want.
In your code, you have to make sure that you don't handle the same tuples twice or more. Either you do it before Storm (for instance, a queue that doesn't accept the same element twice and is processed / emptied by many spouts, or multiple queues, one for each spout; beware of transactions), or you do it in Storm (process only messages with param x in one spout and with param y in another, where a message cannot be x and y at the same time).
SQS Queue -----> Spout (N Number of Executors).
This model will work perfectly fine. As soon as any executor instance picks up a message, that message becomes invisible in SQS.
Keep the message visibility timeout much higher than the message processing time within the Storm topology.
You can put the logic that deletes the SQS message inside the ack method.

The impact of a distributed application configuration on node discovery via net_adm:ping/1

I am experiencing different behavior with respect to net_adm:ping/1 when being done in the context of a Distributed Application.
I have an application that pings a well-known node on start-up and in that way discovers all nodes in a mesh of connected nodes.
When I start this application on a single node (non-distributed configuration), the net_adm:ping/1 followed by a nodes/0 reports 4 other nodes (this is correct). The 4 nodes are on 2 different physical machines, so what is returned is the following: n1@machine_1, n2@machine_2, n3@machine_2, n4@machine_1 (IP addresses are actually returned, not machine_x).
When part of a two-node distributed application, on the node where the application starts, the net_adm:ping/1 followed by a nodes/0 reports 2 nodes, one from each machine (n1@machine_1, n2@machine_2). A second call to nodes/0 after about a 750 ms delay results in the correct 5 nodes being found. Two of the three missing nodes are required for my application to work and so, not finding them, the application dies.
I am using R15B02
Is latency regarding the transitive node-discovery process known to be different when some of the nodes in the mesh are participating in distributed application configuration?
The kernel application documentation mentions the way to synchronize nodes in order to stop the boot phase until ready to move forward and everything is in place. Here are the options:
sync_nodes_mandatory = [NodeName]
Specifies which other nodes must be alive in order for this node to start properly. If some node in the list does not start within the specified time, this node will not start either. If this parameter is undefined, it defaults to [].
sync_nodes_optional = [NodeName]
Specifies which other nodes can be alive in order for this node to start properly. If some node in this list does not start within the specified time, this node starts anyway. If this parameter is undefined, it defaults to the empty list.
A file using them could look as follows:
[{kernel,
  [{sync_nodes_mandatory, [b@ferdmbp, c@ferdmbp]},
   {sync_nodes_timeout, 30000}]
 }].
You then start the node a@ferdmbp by calling erl -sname a -config config-file-above. The downside of this approach is that each node needs its own config file.
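If separate config files are not an option, a different technique is to poll nodes/0 after the ping until the required nodes show up. The following is a rough sketch of that alternative, not something the kernel options do for you; the module name node_wait and the 100 ms polling step are arbitrary choices:

```erlang
-module(node_wait).
-export([await_nodes/2]).

%% Poll until every node in Required is connected, or Timeout ms have elapsed.
%% Returns ok, or {error, {missing, Nodes}} naming the nodes still absent.
await_nodes(Required, Timeout) when Timeout >= 0 ->
    case Required -- [node() | nodes()] of
        [] ->
            ok;
        Missing when Timeout =:= 0 ->
            {error, {missing, Missing}};
        _Missing ->
            Step = min(100, Timeout),
            timer:sleep(Step),
            await_nodes(Required, Timeout - Step)
    end.
```

The application could call net_adm:ping/1 on the well-known node, then await_nodes/2 with the nodes it cannot work without, and only continue start-up once that returns ok.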

Parallel depth-first search in Erlang is slower than its sequential counterpart

I am trying to implement a modified parallel depth-first search algorithm in Erlang (let's call it *dfs_mod*).
All I want to get is all the 'dead-end paths', which are basically the paths that are returned when *dfs_mod* visits a vertex without neighbours or a vertex whose neighbours were all already visited. I save each path to ets_table1 if my custom function fun1(Path) returns true and to ets_table2 if fun1(Path) returns false (I need to filter the resulting 'dead-end' paths with a custom filter).
I have implemented a sequential version of this algorithm and for some strange reason it performs better than the parallel one.
The idea behind the parallel implementation is simple:
visit a Vertex from [Vertex|Other_vertices] = Unvisited_neighbours,
add this Vertex to the current path;
send {self(), wait} to the 'collector' process;
run *dfs_mod* for Unvisited_neighbours of the current Vertex in a new process;
continue running *dfs_mod* with the rest of the provided vertices (Other_vertices);
when there are no more vertices to visit - send {self(), done} to the collector process and terminate;
So, basically each time I visit a vertex with unvisited neighbours I spawn a new depth-first search process and then continue with the other vertices.
Right after spawning a first *dfs_mod* process I start to collect all {Pid, wait} and {Pid, done} messages (wait message is to keep the collector waiting for all the done messages). In N milliseconds after waiting the collector function returns ok.
For some reason, this parallel implementation runs from 8 to 160 seconds while the sequential version runs just 4 seconds (the testing was done on a fully-connected digraph with 5 vertices on a machine with Intel i5 processor).
Here are my thoughts on such a poor performance:
I pass the digraph Graph to each new process which runs *dfs_mod*. Maybe doing digraph:out_neighbours(Graph) against one digraph from many processes causes this slowness?
I accumulate the current path in a list and pass it to each new spawned *dfs_mod* process, maybe passing so many lists is the problem?
I use an ETS table to save a path each time I visit a new vertex and add it to the path. The ETS properties are [bag, public, {write_concurrency, true}], but maybe I am doing something wrong?
each time I visit a new vertex and add it to the path, I check a path with a custom function fun1() (it basically checks if the path has vertices labeled with letter "n" occurring before vertices with "m" and returns true/false depending on the result). Maybe this fun1() slows things down?
I have tried to run *dfs_mod* without collecting done and wait messages, but htop shows a lot of Erlang activity for quite a long time after *dfs_mod* returns ok in the shell, so I do not think that the active message passing slows things down.
How can I make my parallel dfs_mod run faster than its sequential counterpart?
Edit: when I run the parallel *dfs_mod*, pman shows no processes at all, although htop shows that all 4 CPU threads are busy.
There is no quick way to know without the code, but here's a quick list of why this might fail:
You might be confusing parallelism and concurrency. Erlang's model is shared-nothing and aims for concurrency first (running distinct units of code independently). Parallelism is only an optimization of this (running some of the units of code at the same time). Usually, parallelism will take form at a higher level, say you want to run your sorting function on 50 different structures -- you then decide to run 50 of the sequential sort functions.
You might have synchronization problems or sequential bottlenecks, effectively changing your parallel solution into a sequential one.
The overhead of copying data, context switching and whatnot dwarfs the gains you have in terms of parallelism. The former is especially true of large data sets that you break into sub data sets, then join back into a large one. The latter is especially true of highly sequential code, as seen in the process ring benchmarks.
If I wanted to optimize this, I would try to reduce message passing and data copying to a minimum.
If I were the one working on this, I would keep the sequential version. It does what it says it should do, and when it is part of a larger system, as soon as you have more processes than cores, parallelism will come from the many calls to the sort function rather than from branches of the sort function. In the long run, if it is part of a server or service, using the sequential version N times should have no more negative impact than a parallel one that ends up creating many, many more processes to do the same task and risks overloading the system more.
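To make the "fewer processes, less copying" idea concrete, here is a hedged sketch that parallelises only the first level of the search and runs a plain sequential walk inside each branch. It is not the asker's dfs_mod: the module name, the adjacency-list format [{Vertex, [Neighbour]}] and the rule that a vertex counts as visited only within the current path are all assumptions made for the example.

```erlang
-module(dfs_sketch).
-export([dead_ends/2, par_dead_ends/2]).

%% Graph is an adjacency list: [{Vertex, [Neighbour]}].
neighbours(V, Graph) ->
    case lists:keyfind(V, 1, Graph) of
        {V, Ns} -> Ns;
        false -> []
    end.

%% Sequential: every dead-end path that starts at Start.
dead_ends(Start, Graph) ->
    walk([Start], Graph).

%% Path is kept in reverse; a path ends when no unvisited neighbour is left.
walk([V | _] = Path, Graph) ->
    case [N || N <- neighbours(V, Graph), not lists:member(N, Path)] of
        [] -> [lists:reverse(Path)];
        Next -> lists:append([walk([N | Path], Graph) || N <- Next])
    end.

%% Coarse-grained parallel: one process per first-level branch only.
%% Each worker sends back a single message holding all of its paths.
par_dead_ends(Start, Graph) ->
    Parent = self(),
    case [N || N <- neighbours(Start, Graph), N =/= Start] of
        [] ->
            [[Start]];
        Next ->
            Refs = [begin
                        Ref = make_ref(),
                        spawn_link(fun() ->
                                       Parent ! {Ref, walk([N, Start], Graph)}
                                   end),
                        Ref
                    end || N <- Next],
            lists:append([receive {Ref, Paths} -> Paths end || Ref <- Refs])
    end.
```

Each worker copies the graph once and replies once, instead of spawning a process and exchanging wait/done messages per vertex. Whether this beats the sequential version still depends on graph size, so it should be measured rather than assumed.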
