Why not enable virtual nodes on a Hadoop node? - datastax-enterprise

The page http://www.datastax.com/docs/datastax_enterprise3.2/solutions/about_hadoop states:
"Before starting an analytics/Hadoop node on a production cluster or data center, it is important to disable the virtual node configuration."
What will happen if I enable virtual nodes on an analytics/Hadoop node?

If you enable virtual nodes on a Hadoop node, it will lower the performance of small Hadoop jobs by raising the number of mappers to at least the number of virtual nodes. E.g. with the default setting of 256 vnodes per physical node, every Hadoop job will launch 257 mappers. Those mappers might have too little data to process, and the server would spend most of its time managing those tasks instead of doing useful work.
On decent hardware, a job with no data and 256 vnodes may take about 5-10 minutes, whereas the same job takes only about 20-40 seconds when configured without vnodes.
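For reference, disabling vnodes is a cassandra.yaml change on the analytics node; a minimal sketch, assuming a DSE-style cassandra.yaml (the token value below is a placeholder, not a computed token):

    # cassandra.yaml on the analytics/Hadoop node
    # disable vnodes by using a single token per node
    num_tokens: 1
    # assign one token explicitly; this value is a placeholder and must be
    # computed from your cluster's token ranges (e.g. with a token generator)
    initial_token: -9223372036854775808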

Related

Docker swarm regionalization for latency-sensitive topology

We are currently operating a backend stack in central Europe, Japan, and Taiwan, and are preparing our stack to transition to docker swarm.
We are working with real-time data streams from sensor networks to issue fast disaster warnings, which means that latency is critical for some services. Therefore, we currently have brokers (rabbitmq) running on dedicated servers in each region, as well as a backend instance digesting the data that is sent across these brokers.
I'm uncertain how to best achieve a comparable topology using docker swarm. Is it possible to group nodes, say by region, and then deploy latency-critical service stacks to each of these groups? Should I create separate swarms for each region (which feels conceptually contradictory to docker swarm)?
The swarm managers should be in a low-latency zone. Swarm workers can be anywhere. You can use a node label to indicate the location of each node and restrict your workloads to a particular label as needed.
Latency-critical considerations on the container-to-container network across large regional boundaries may be relevant, depending on your required data path. If the only latency-critical data path is to the rabbitmq service that is external to the swarm, then you won't need to worry about container-to-container latency.
It is also a valid pattern to have one swarm per region. If you need to be able to lose any region without impacting services in another region, then you'd want to split it up. If you have multiple low-latency regions, then you can spread the manager nodes across those.
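A rough sketch of the label approach (node names, the region label, and the service are assumptions, not a fixed convention):

    # label each worker node with its region
    docker node update --label-add region=jp worker-jp-1
    docker node update --label-add region=eu worker-eu-1

    # in the stack file, pin the latency-critical service to one region
    services:
      ingest:
        image: my-ingest:latest   # hypothetical image
        deploy:
          placement:
            constraints:
              - node.labels.region == jp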

Scheduling and scaling pods in Kubernetes

I am running a k8s cluster on GKE.
It has 4 node pools with different configurations:
Node pool 1 (single node, cordoned): running Redis & RabbitMQ
Node pool 2 (single node, cordoned): running Monitoring & Prometheus
Node pool 3 (single large node): application pods
Node pool 4 (single node, auto-scaling enabled): application pods
Currently, I am running a single replica of each service on GKE, except for the main service, which mostly manages everything and runs 3 replicas.
When scaling this main service with HPA, I have sometimes seen the node crash, or the kubelet restart frequently, with pods going into an Unknown state.
How do I handle this scenario? If the node crashes, GKE takes time to auto-repair it, which causes service downtime.
Question 2:
Node pools 3-4 are running application pods. Inside the application there are 3-4 memory-intensive microservices; I am thinking of using a node selector to pin them to one node,
while only the small node pool runs the main service, for which HPA and node auto-scaling work.
However, I feel a node selector is not the best way to do this.
It's always best to run more than one replica of each service, but currently we are running only a single replica of each, so please make suggestions with that in mind.
As Patrick W rightly suggested in his comment:
if you have a single node, you leave yourself with a single point of
failure. Also keep in mind that autoscaling takes time to kick in and
is based on resource requests. If your node suffers OOM because of
memory intensive workloads, you need to readjust your memory requests
and limits – Patrick W, Oct 10
You may need to redesign your infrastructure a bit so that you have more than a single node in every node pool, as well as readjust memory requests and limits.
You may want to take a look at the following sections in the official Kubernetes docs and the Google Cloud blog:
Managing Resources for Containers
Assign CPU Resources to Containers and Pods
Configure Default Memory Requests and Limits for a Namespace
Resource Quotas
Kubernetes best practices: Resource requests and limits
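For illustration, a minimal pod spec with explicit requests and limits, along the lines of the docs above (all names and values are placeholders to be tuned for your workload):

    apiVersion: v1
    kind: Pod
    metadata:
      name: memory-intensive-app    # hypothetical name
    spec:
      containers:
      - name: app
        image: my-app:1.0           # hypothetical image
        resources:
          requests:                 # what the scheduler reserves
            memory: "2Gi"
            cpu: "500m"
          limits:                   # hard ceiling; exceeding memory -> OOMKill
            memory: "4Gi"
            cpu: "1"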
How do I handle this scenario? If the node crashes, GKE takes time to
auto-repair it, which causes service downtime.
That's why having more than just one node in a single node pool can be a much better option. It greatly reduces the likelihood that you'll end up in the situation described above. The GKE auto-repair feature needs time (usually a few minutes), and if this is your only node, you cannot do much about it and need to accept possible downtime.
Node pools 3-4 are running application pods. Inside the application
there are 3-4 memory-intensive microservices; I am thinking of using
a node selector to pin them to one node,
while only the small node pool runs the main service, for which HPA
and node auto-scaling work.
However, I feel a node selector is not the best way to do this.
You may also take a look at node affinity and anti-affinity, as well as taints and tolerations.
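A minimal sketch of the taint/toleration plus node-affinity approach, assuming hypothetical node and pool names (on GKE, the cloud.google.com/gke-nodepool label is set automatically):

    # taint the dedicated node so only tolerating pods land there
    kubectl taint nodes big-node-1 dedicated=memory-intensive:NoSchedule

    # in the pod spec of a memory-intensive service:
    tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "memory-intensive"
      effect: "NoSchedule"
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: cloud.google.com/gke-nodepool
              operator: In
              values:
              - pool-3              # hypothetical node pool name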

How to emulate a network of 500-50000 (docker) worker nodes?

So I have a worker docker image. I want to spin up a network of 500-50000 nodes to emulate what happens to a private blockchain such as Ethereum at different scales. What would be a recommendation for an open-source tool/library for such a job:
a) one that would make sure that, even on low-endish hardware (say one 40-core node), all workers are moved forward in time equally (not in real time)
b) one that would allow (a) in a distributed setting (say 10 low-endish nodes on a single LAN)
In other words, I am not looking for real-time network emulation, so I can wait 10 hours to simulate 1 minute and it would be good enough for me. I thought about Kathara, yet a problem still stands: how do I make sure that, say, 10000 containers are given the same number of ticks in a round-robin manner?
So how do I emulate a complex network of docker workers?
I'm assuming that you will run each worker inside a container. To ensure each container runs with similar CPU access, you can configure CPU reservations and limits on each replica. These numbers get computed down to fractional slices of a core, so on an 8-core system you could give each container 0.01 of a core to run upwards of 800 containers. See the compose documentation on how to set resource constraints, and the sketch below. With swarm mode, you could spread these replicas across multiple nodes sharing a network.
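A minimal compose-file sketch of those constraints, assuming a hypothetical worker image (the values are illustrative, not tuned):

    version: "3.8"
    services:
      worker:
        image: my-worker:latest     # hypothetical image
        deploy:
          replicas: 800
          resources:
            limits:
              cpus: "0.01"          # hard cap of 1% of one core
            reservations:
              cpus: "0.01"          # scheduler reserves the same slice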
That said, I think the advice to run shorter simulations on more hardware is good. You will find that a significant portion of the time is spent context switching between processes, possibly invalidating any measurements you want to take.
You will also encounter scalability issues with docker and the orchestration tool you choose. For example, you'll need to adjust the subnet size of any shared network, which defaults to a /24 with around 253 usable IPs. The docker engine itself will likely spend a non-trivial amount of CPU time maintaining the state of all the running containers.
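For example, a larger overlay network could be created up front (the subnet and name are assumptions):

    # default networks are /24 (~253 usable addresses); use a larger
    # subnet so thousands of containers can get an IP
    docker network create --driver overlay --subnet 10.0.0.0/16 sim-net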

Bosun HA and scalability

I have a small Bosun setup collecting metrics from numerous services, and we are planning to scale these services in the cloud.
This will mean more data coming into Bosun, and hence the load/efficiency/scale of Bosun is affected.
I am afraid of losing data due to network overhead and in case of failures.
I am looking for any performance benchmark reports for Bosun, or any input on benchmarking/testing Bosun for scale and HA.
Also, any input on good practices to follow when scaling Bosun would be helpful.
My current thinking is to run numerous Bosun binaries as a cluster, backed by a distributed OpenTSDB setup.
I am also wondering whether it is worthwhile to run some Bosun executors as plain 'collectors' of scollector data (with the bosun -n command) and others just to calculate the alerts.
The problem with this approach is that the same alerts might be triggered from multiple Bosun instances (those running without the -n option). Is there a better way to de-duplicate the alerts?
The current best practices are:
Use https://godoc.org/bosun.org/cmd/tsdbrelay to forward metrics to OpenTSDB. This gets the Bosun binary out of the "critical path". It should also forward the metrics to Bosun for indexing, and it can duplicate the metric stream to multiple data centers for DR/backups.
Make sure your Hadoop/OpenTSDB cluster has at least 5 nodes. You can't do live maintenance on a 3-node cluster, and Hadoop usually runs on a dozen or more nodes. We use Cloudera Manager to manage the Hadoop cluster, and others have recommended Apache Ambari.
Use a load balancer like HAProxy to split the /api/put write traffic across multiple instances of tsdbrelay in an active/passive mode. We run one instance on each node (with tsdbrelay forwarding to the local OpenTSDB instance) and direct all write traffic at a primary write node (with multiple secondary/backup nodes).
Split the /api/query traffic across the remaining nodes, pointed directly at OpenTSDB (no need to go through the relay), in an active/active mode (aka round robin or hash-based routing). This improves query performance by balancing queries across the non-write nodes.
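A minimal HAProxy sketch of that split, with hypothetical addresses (not the actual Stack Overflow configuration; see the link below for that):

    frontend opentsdb_in
        mode http
        bind *:4242
        acl is_put path_beg /api/put
        use_backend relay_write if is_put
        default_backend tsdb_read

    backend relay_write
        mode http
        # active/passive: all writes hit the primary tsdbrelay; the
        # backup server only takes traffic if the primary fails
        server relay1 10.0.0.1:4242 check
        server relay2 10.0.0.2:4242 check backup

    backend tsdb_read
        mode http
        # active/active: round-robin /api/query across the non-write nodes
        balance roundrobin
        server tsdb1 10.0.0.3:4242 check
        server tsdb2 10.0.0.4:4242 check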
We only run a single Bosun instance in each datacenter, with the DR site using the read-only flag (any failover would be manual). Bosun really isn't designed for HA yet, but in the future it may allow two nodes to share a redis instance and support active/active or active/passive HA.
By using tsdbrelay to duplicate the metric streams, you don't have to deal with OpenTSDB/HBase replication and can instead set up multiple isolated monitoring systems in each datacenter, duplicating the metrics to whichever sites are appropriate. We have a primary and a DR site, and choose to duplicate all metrics to both data centers. I actually use the DR site daily for Grafana queries since it is closer to where I live.
You can find more details about production setups at http://bosun.org/resources including copies of all of the haproxy/tsdbrelay/etc configuration files we use at Stack Overflow.

Aerospike data migration for no apparent reason

I have a two-server Aerospike cluster running with a replication factor of 2. Both servers have the same replicated-object count, which means all records are replicated. But the monitoring panel still shows incoming and outgoing migrations going on.
This happened after I restarted one of the servers. Now the I/O rate on both servers is above what it was before the restart.
Why is this happening?
When a node leaves the cluster, the partition version of any partition that node was a member of advances. When the node returns, it shares its partition info with the cluster, and migrations are required for any partition the returning node is a member of. This is done because, while the node was down, the remaining node may have taken on writes.
With a replication factor of 2 and 2 nodes, both nodes are members of all partitions, so all partitions must migrate.
