Flume automatic scalability and failover - flume

My company is considering using flume for some fairly high volume log processing. We believe that the log processing needs to be distributed, both for volume (scalability) and failover (reliability) reasons, and Flume seems the obvious choice.
However, we think we must be missing something obvious, because we don't see how Flume provides automatic scalability and failover.
I want to define a flow that says for each log line, do thing A, then pass it along and do thing B, then pass it along and do thing C, and so on, which seems to match well with Flume. However, I want to be able to define this flow in purely logical terms, and then basically say, "Hey Flume, here are the servers, here is the flow definition, go to work!". Servers will die, (and ops will restart them), we will add servers to the cluster, and retire others, and flume will just direct the work to whatever nodes have available capacity.
This description is how Hadoop map-reduce implements scalability and failover, and I assumed that Flume would be the same. However, the documentation sees to imply that I need to manually configure which physical servers each logical node runs on, and configure specific failover scenarios for each node.
Am I right, and Flume does not serve our purpose, or did I miss something?
Thanks for your help.

Depending on whether you are using multiple masters, you can code your configuration to follow a failover pattern.
This is fairly detailed in the guide: http://archive.cloudera.com/cdh/3/flume/UserGuide/index.html#_automatic_failover_chains
To answer your question, bluntly, Flume does not yet have an ability to figure out a failover scheme automatically.

Related

Where should I put shared services for multiple kubernetes-clusters?

Our company is developing an application which runs in 3 seperate kubernetes-clusters in different versions (production, staging, testing).
We need to monitor our clusters and the applications over time (metrics and logs). We also need to run a mailserver.
So basically we have 3 different environments with different versions of our application. And we have some shared services that just need to run and we do not care much about them:
Monitoring: We need to install influxdb and grafana. In every cluster there's a pre-installed heapster, that needs to send data to our tools.
Logging: We didn't decide yet.
Mailserver (https://github.com/tomav/docker-mailserver)
independant services: Sentry, Gitlab
I am not sure where to run these external shared services. I found these options:
1. Inside each cluster
We need to install the tools 3 times for the 3 environments.
Con:
We don't have one central point to analyze our systems.
If the whole cluster is down, we cannot look at anything.
Installing the same tools multiple times does not feel right.
2. Create an additional cluster
We install the shared tools in an additional kubernetes-cluster.
Con:
Cost for an additional cluster
It's probably harder to send ongoing data to external cluster (networking, security, firewall etc.).
3) Use an additional root-server
We run docker-containers on an oldschool-root-server.
Con:
Feels contradictory to use root-server instead of cutting-edge-k8s.
Single point of failure.
We need to control the docker-containers manually (or attach the machine to rancher).
I tried to google for the problem but I cannot find anything about the topic. Can anyone give me a hint or some links on this topic?
Or is it just no relevant problem that a cluster might go down?
To me, the second option sound less evil but I cannot estimate yet if it's hard to transfer data from one cluster to another.
The important questions are:
Is it a problem to have monitoring-data in a cluster because one cannot see the monitoring-data if the cluster is offline?
Is it common practice to have an additional cluster for shared services that should not have an impact on other parts of the application?
Is it (easily) possible to send metrics and logs from one kubernetes-cluster to another (we are running kubernetes in OpenTelekomCloud which is basically OpenStack)?
Thanks for your hints,
Marius
That is a very complex and philosophic topic, but I will give you my view on it and some facts to support it.
I think the best way is the second one - Create an additional cluster, and that's why:
You need a point which should be accessible from any of your environments. With a separate cluster, you can set the same firewall rules, routes, etc. in all your environments and it doesn't affect your current workload.
Yes, you need to pay a bit more. However, you need resources to run your shared applications, and overhead for a Kubernetes infrastructure is not high in comparison with applications.
With a separate cluster, you can setup a real HA solution, which you might not need for staging and development clusters, so you will not pay for that multiple times.
Technically, it is also OK. You can use Heapster to collect data from multiple clusters; almost any logging solution can also work with multiple clusters. All other applications can be just run on the separate cluster, and that's all you need to do with them.
Now, about your questions:
Is it a problem to have monitoring-data in a cluster because one cannot see the monitoring-data if the cluster is offline?
No, it is not a problem with a separate cluster.
Is it common practice to have an additional cluster for shared services that should not have an impact on other parts of the application?
I think, yes. At least I did it several times, and I know some other projects with similar architecture.
Is it (easily) possible to send metrics and logs from one kubernetes-cluster to another (we are running kubernetes in OpenTelekomCloud which is basically OpenStack)?
Yes, nothing complex there. Usually, it does not depend on the platform.

Which approach is better for discovering container readiness?

This question is discussed many times but I'd like to hear some best practices and real-world examples of using each of the approaches below:
Designing containers which are able to check the health of dependent services. Simple script whait-for-it can be usefull for this kind of developing containers, but aren't suitable for more complex deployments. For instance, database could accept connections but migrations aren't applyied yet.
Make container able to post own status in Consul/etcd. All dependent services will poll certain endpoint which contains status of needed service. Looks nice but seems redundant, don't it?
Manage startup order of containers by external scheduler.
Which of the approaches above are preferable in context of absence/presence orchestrators like Swarm/Kubernetes/etc in delivery process ?
I can take a stab at the kubernetes perspective on those.
Designing containers which are able to check the health of dependent services. Simple script whait-for-it can be useful for this kind of developing containers, but aren't suitable for more complex deployments. For instance, database could accept connections but migrations aren't applied yet.
This sounds like you want to differentiate between liveness and readiness. Kubernetes allows for both types of probes for these, that you can use to check health and wait before serving any traffic.
Make container able to post own status in Consul/etcd. All dependent services will poll certain endpoint which contains status of needed service. Looks nice but seems redundant, don't it?
I agree. Having to maintain state separately is not preferred. However, in cases where it is absolutely necessary, if you really want to store the state of a resource, it is possible to use a third party resource.
Manage startup order of containers by external scheduler.
This seems tangential to the discussion mostly. However, Pet Sets, soon to be replaced by Stateful Sets in Kubernetes v1.5, give you deterministic order of initialization of pods. For containers on a single pod, there are init-containers which run serially and in order prior to running the main container.

Erlang clusters

I'm trying to implement a cluster using Erlang as the glue that holds it all together. I like the idea that it creates a fully connected graph of nodes, but upon reading different articles online, it seems as though this doesn't scale well (having a max of 50 - 100 nodes). Did the developers of OTP impose this limitation on purpose? I do know that you can setup nodes to have explicit connections only as well as have hidden nodes, etc. But, it seems as though the default out-of-the-box setup isn't very scalable.
So to the questions:
If you had 5 nodes (A, B, C, D, E) that all had explicit connections such that A-B-C-D-E. Does Erlang/OTP allow A to talk directly to E or does A have to pass messages from B through D to get to E, and thus that's the reason for the fully connected graph? Again, it makes sense but it doesn't scale well from what I've seen.
If one was to try and go for a scalable and fault-tolerant system, what are your options? It seems as though, if you can't create a fully connected graph because you have too many nodes, the next best thing would be to create a tree of some kind. But, this doesn't seem very fault-tolerant because if the root or any parent of children nodes dies, you would lose a significant portion of your cluster.
In looking into supervisors and workers, all of the examples I've seen apply this to processes on a single node. Could it be applied to a cluster of nodes to help implement fault-tolerance?
Can nodes be part of several clusters?
Thanks for your help, if there is a semi-recent website or blogpost (roughly 1-year old) that I've missed, I'd be happy to look at those. But, I've scoured the internet pretty well.
Yes, you can send messages to a process on any remote node in a cluster, for example, by using its process identifier (pid). This is called location transparency. And yes, it scales well (see Riak, CouchDB, RabbitMQ, etc).
Note that one node can run hundred thousands of processes. Erlang has proven to be very scalable and was built for fault tolerance. There are other approaches to build bigger, e.g. SOA approach of CloudI (see comments). You also could build clusters that use hidden nodes if you really really need to.
At the node level you would take a different approach, for example, build identical nodes that are easy to replace if they fail and the work is taken over by the remaining nodes. Check out how Riak handles this (look into riak_core and check the blog post Introducing Riak Core).
Nodes can leave and enter a cluster but cannot be part of multiple clusters at the same time. Connected nodes share one cluster cookie which is used to identify connected nodes. You can set the cookie while the VM is running (see Distributed Erlang).
Read http://learnyousomeerlang.com/ for greater good.
The distribution protocol is about providing robustness, not scalability. What you want to do is to group your cluster into smaller areas and then use connections, which are not distribution in Erlang but in, say, TCP sessions. You could run 5 groups of 10 machines each. This means the 10 machines have seamless Pid distribution: you can call a pid on another machine. But distributing to another group means you can't seamlessly address the group like that.
You generally want some kind of "route reflection" as in BGP.
1) I think you need a direct connection between nodes to communicate between processes. This does, however, mean that you don't need persistent connections between all the nodes if two will never communicate (say if they're only workers, not coordinators).
2) You can create a not-fully-connected graph of erlang nodes. The documentation is hard to find, and comes with problems - you disable the global system which handles global names in the cluster, so you have to do everything by locally registered names, or locally registered names on remote nodes. Or just use Pids, as they work too. To start an erlang node like this, use erl ... -connect_all false .... I hope you know what you're up to, as I couldn't trust myself to do that.
It also turns out that a not-fully-connected graph of erlang nodes is a current research topic. The RELEASE Project is currently working on exactly that, and have come up with a concept of S-groups, which are essentially fully-connected groups. However, nodes can be members of more than one S-group and nodes in separate s-groups don't have to be fully connected but can establish the connections they need on demand to do direct node-to-node communication. It's worth finding presentations of theirs because the research is really interesting.
Another thing worth pointing out is that several people have found that you can get up to 150-200 nodes in a fully-connected cluster. Do you really have a use-case for more nodes than that? Surely 150-200 incredibly beefy computers would do most things you could throw at them, unless you have a ridiculous project to do.
3) While you can't start processes on a different node using gen_server:start_link/3,4, you can certainly call servers on a foreign node very easily. It seems that they've overlooked being able to start servers on foreign nodes, but there's probably good reason for it - such as a ridiculous number of error cases.
4) Try looking at hidden nodes, and at having a not-fully-connected cluster. They should allow you to group nodes as you see fit.
TL;DR: Scaling is hard, let's go shopping.
There are some good answers already, so I'm trying to be simple.
1) No, if A and E are not connected directly, A cannot talk to E. The distribution protocol runs on direct TCP connection - no routing included.
2) I think a tree structure is good enough - trade-offs always exist.
3) There's no 'supervisor for nodes', but erlang:monitor_node is your friend.
4) Yes. A node can talk to nodes from different 'clusters'. In the local node, use erlang:set_cookie(OtherNode, OtherCookie) to access a remote node with a different cookie.
1)
yes. they talk to each other
2) 3) and 4)
Generally speaking, when building a scalable and fault tolerant system, you would want, or more over, need to divide the work load to different "regions" or "clusters". Supervisor/Worker model has this envisioned thus the topology. What you need is a few processes coordinating work between clusters and all workers within one single cluster will talk to each other to balance out within group.
As you can see, with this topology, the "limitation" is not really a limitation as long as you divide your tasks carefully and in a balanced fashion. Personally, I believe a tree like structure for supervisor processes is not avoidable in large scale systems, and this is the practice I'm following. Reasons are vary but boils down to scalability, fault tolerance as fall back policy implementation, maintenance need and portability of the clusters.
So in conclusion,
2) use a tree-like topology for your supervisors. let workers explicitly connect to each other and talk within their own domain with the supervisors.
3) while this is the native designed environment, as I presume, I'm pretty sure a supervisor can talk to a worker on a different machine. I would not suggest this as fault tolerance can be hell in remote worker scenario.
4) you should never let a node be part of two different cluster at the same moment. You can switch it from one cluster to another though.

What's the best way to run a gen_server on all nodes in an Erlang cluster?

I'm building a monitoring tool in Erlang. When run on a cluster, it should run a set of data collection functions on all nodes and record that data using RRD on a single "recorder" node.
The current version has a supervisor running on the master node (rolf_node_sup) which attempts to run a 2nd supervisor on each node in the cluster (rolf_service_sup). Each of the on-node supervisors should then start and monitor a bunch of processes which send messages back to a gen_server on the master node (rolf_recorder).
This only works locally. No supervisor is started on any remote node. I use the following code to attempt to load the on-node supervisor from the recorder node:
rpc:call(Node, supervisor, start_child, [{global, rolf_node_sup}, [Services]])
I've found a couple of people suggesting that supervisors are really only designed for local processes. E.g.
Starting processes at remote nodes
how: distributed supervision tree
What is the most OTP way to implement my requirement to have supervised code running on all nodes in a cluster?
A distributed application is suggested as one alternative to a distributed supervisor tree. These don't fit my use case. They provide for failover between nodes, but keeping code running on a set of nodes.
The pool module is interesting. However, it provides for running a job on the node which is currently the least loaded, rather than on all nodes.
Alternatively, I could create a set of supervised "proxy" processes (one per node) on the master which use proc_lib:spawn_link to start a supervisor on each node. If something goes wrong on a node, the proxy process should die and then be restarted by it's supervisor, which in turn should restart the remote processes. The slave module could be very useful here.
Or maybe I'm overcomplicating. Is directly supervising nodes a bad idea, instead perhaps I should architect the application to gather data in a more loosely coupled way. Build a cluster by running the app on multiple nodes, tell one to be master, leave it at that!
Some requirements:
The architecture should be able to cope with nodes joining and leaving the pool without manual intervention.
I'd like to build a single-master solution, at least initially, for the sake of simplicity.
I would prefer to use existing OTP facilities over hand-rolled code in my implementation.
Interesting challenges, to which there are multiple solutions. The following are just my suggestions, which hopefully makes you able to better make the choice on how to write your program.
As I understand your program, you want to have one master node where you start your application. This will start the Erlang VM on the nodes in the cluster. The pool module uses the slave module to do this, which require key-based ssh communication in both directions. It also requires that you have proper dns working.
A drawback of slave is that if the master dies, so does the slaves. This is by design as it probably fit the original use case perfectly, however in your case it might be stupid (you may want to still collect data, even if the master is down, for example)
As for the OTP applications, every node may run the same application. In your code you can determine the nodes role in the cluster using configuration or discovery.
I would suggest starting the Erlang VM using some OS facility or daemontools or similar. Every VM would start the same application, where one would be started as the master and the rest as slaves. This has the drawback of marking it harder to "automatically" run the software on machines coming up in the cluster like you could do with slave, however it is also much more robust.
In every application you could have a suitable supervision tree based on the role of the node. Removing inter-node supervision and spawning makes the system much simpler.
I would also suggest having all the nodes push to the master. This way the master does not really need to care about what's going on in the slave, it might even ignore the fact that the node is down. This also allows new nodes to be added without any change to the master. The cookie could be used as authentication. Multiple masters or "recorders" would also be relatively easy.
The "slave" nodes however will need to watch out for the master going down and coming up and take appropriate action, like storing the monitoring data so it can send it later when the master is back up.
I would look into riak_core. It provides a layer of infrastructure for managing distributed applications on top of the raw capabilities of erlang and otp itself. Under riak_core, no node needs to be designated as master. No node is central in an otp sense, and any node can take over other failing nodes. This is the very essence of fault tolerance. Moreover, riak_core provides for elegant handling of nodes joining and leaving the cluster without needing to resort to the master/slave policy.
While this sort of "topological" decentralization is handy, distributed applications usually do need logically special nodes. For this reason, riak_core nodes can advertise that they are providing specific cluster services, e.g., as embodied by your use case, a results collector node.
Another interesting feature/architecture consequence is that riak_core provides a mechanism to maintain global state visible to cluster members through a "gossip" protocol.
Basically, riak_core includes a bunch of useful code to develop high performance, reliable, and flexible distributed systems. Your application sounds complex enough that having a robust foundation will pay dividends sooner than later.
otoh, there's almost no documentation yet. :(
Here's a guy who talks about an internal AOL app he wrote with riak_core:
http://www.progski.net/blog/2011/aol_meet_riak.html
Here's a note about a rebar template:
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2011-March/003632.html
...and here's a post about a fork of that rebar template:
https://github.com/rzezeski/try-try-try/blob/7980784b2864df9208e7cd0cd30a8b7c0349f977/2011/riak-core-first-multinode/README.md
...talk on riak_core:
http://www.infoq.com/presentations/Riak-Core
...riak_core announcement:
http://blog.basho.com/2010/07/30/introducing-riak-core/

Starting remote Erlang nodes

I want to write a master-slave application in Erlang. I am thinking at the following things I need from the architecture:
the slaves shouldn't die when the master dies, but rather try to reconnect to it while the master is down
the master should automatically start the remote nodes if they don't connect automatically or they are down (probably the supervisor behaviour in OTP)
Is there a OTP oriented behaviour to do this? I know I can start remote nodes with slave:start_link() and I can monitor nodes with erlang:monitor(), but I don't know how this can be incorporated in a gen_server behaviour.
I agree with the comments about using erlang:monitor_node and the use of distributed applications.
You cannot just use the slave module to accomplish that, it clearly states "All slave nodes which are started by a master will terminate automatically when the master terminates".
There is currently no OTP behaviour to do it either. Supervision trees are hierarchical ; it seems like you are looking for something where there is a hierarchy in terms of application logic, but spawning is done an a peer-to-peer basis (or an individual basis, depending upon your point of view).
If you were to use multiple Erlang VMs then you should carefully consider how many you run, as a large number of them may cause performance issues due to the OS swapping OS processes in and out. A rule of thumb for best performance is to aim for having no more than one OS process (i.e. one Erlang VM) per CPU core.
If you're interested in studying other implementations, Basho's riak_core framework has a pretty good take on decentralized distributed applications.
riak_core_node_watcher.erl has most of the interesting node observation code in it.
Search and you'll find there are quite a few talks and presentations about the framework.

Resources