Necessary data size to evaluate Hadoop performance

I'm running Hadoop with 3 datanodes on a single machine using Docker containers. I ran a KMeans algorithm on a small simulated dataset with 200 data points.
Because of the Hadoop overhead, the job takes a long time, about 2 or 3 minutes, while running kmeans locally in R takes a few seconds.
I wonder how big my dataset has to be for Hadoop to outperform the non-distributed approach, and whether that's even possible given that I'm running all the nodes on a single machine.

The number of cores and the amount of RAM available to process the data matter more than the amount of data itself, so confining Hadoop jobs to Docker containers really means running little JVM containers inside those containers. It's therefore expected that a single process with access to the full machine will get through the same amount of data much more quickly, and I'm sure there's a way to write the same distributed algorithm without Hadoop.
Besides that, if the data isn't splittable, or is smaller than the HDFS block size, it can only be processed by a single MapReduce task anyway. You didn't mention the size, but I suspect 200 data points is only a few MB at most.
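
To make the block-size point concrete, here is a minimal sketch of how many map tasks one input file would roughly get. The helper name is hypothetical, the 128 MB block size is an assumption (check dfs.blocksize on your cluster), and real input formats apply extra rules that this ignores.

```python
import math

# Assumption: 128 MB block size; verify dfs.blocksize on your own cluster.
DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024


def estimated_map_tasks(file_size_bytes, block_size_bytes=DEFAULT_BLOCK_SIZE, splittable=True):
    """Rough number of map tasks for a single input file (hypothetical helper)."""
    if file_size_bytes <= 0:
        return 0
    if not splittable:
        # A non-splittable file is read by one task, whatever its size.
        return 1
    # One task per block-sized split, rounding up for the last partial block.
    return math.ceil(file_size_bytes / block_size_bytes)


if __name__ == "__main__":
    # A few MB of data fits in one block, so only one task can work on it.
    print(estimated_map_tasks(5 * 1024 * 1024))
    # 1 GiB of splittable data gives several tasks that can run in parallel.
    print(estimated_map_tasks(1024 ** 3))
```

Under these assumptions, the dataset needs to span several blocks before more than one task, and therefore more than one container, has any work to do.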

Related

What factors can affect different containers processing on one machine at the same time?

For example, I have a VM with 4 vCPUs and 8GB of memory. First I ran one Nginx container on it and used a stress-test tool to continuously send requests to it, collecting metrics such as QPS and average latency. Then I ran three identical Nginx containers on the VM and sent the same requests to all of them in parallel. I found that each container's QPS decreased and its average latency increased.
So what factors can affect different containers processing on one machine at the same time? I think the CPU and memory are sufficient for these containers. What factors below the Docker layer can affect this? My first guess is the network, but what else? And specifically, why would the network affect QPS and average latency?

Why would one choose many smaller machine types instead of fewer big machine types?

In a cluster computing framework such as Google Cloud Dataflow (or, for that matter, Apache Spark or Kubernetes clusters), I would think that it's far more performant to have a few really BIG machines rather than many small ones, right? As in, it's more performant to have 10 n1-highcpu-96 machines than, say, 120 n1-highcpu-8 machines, because:
the CPUs can use shared memory, which is far faster than network communication
if a single thread needs access to lots of memory for a single-threaded operation (e.g. sort), it has access to more memory on a BIG machine than on a smaller one
And since the price is the same (e.g. 10 n1-highcpu-96 machines cost the same as 120 n1-highcpu-8 machines), why would anyone opt for the smaller machine types?
As well, I have a hunch that with the n1-highcpu-96 machine type we'd occupy the whole host, so we wouldn't need to worry about competing demands on the host from another Google Cloud customer's VM (e.g. contention in the CPU caches or motherboard bandwidth), right?
Finally, although I don't think Google Compute VMs correctly report the "true" CPU topology of the host system, if we do choose the n1-highcpu-96 machine type, the reported topology may be a touch closer to the truth, because presumably the VM is using the whole host. Any program running on that VM that tries to take advantage of the topology (e.g. the "NUMA"-aware option in Java?) would then have a better chance of making the "right decisions".

Whether to choose many instances of a smaller machine type or a few instances of a big machine type depends on many factors.
VM sizes differ not only in number of cores and RAM, but also in network I/O performance.
Instances with small machine types are limited in CPU and I/O power and are inadequate for heavy workloads.
Also, if you are planning to grow and scale, it is better to design and develop your application to run across several instances. Having small VMs gives you a better chance of having them distributed across the physical servers in the datacenter that have the best resource situation at the time the machines are provisioned.
Having many small instances also helps to isolate fault domains. If one of your small nodes crashes, that only affects a small number of processes. If a large node crashes, many processes go down.
It also depends on the application you are running on your cluster and on the workload. I would also recommend going through this link to see the sizing recommendation for an instance.

How to emulate 500-50000 worker (docker) nodes network?

I have a worker Docker image. I want to spin up a network of 500-50000 nodes to emulate what happens to a private blockchain such as Ethereum at different scales. What open-source tool or library would you recommend for such a job:
a) one that would make sure that even on low-end hardware (say one 40-core node) all workers are moved forward in time equally (not in real time)
b) one that would allow (a) in a distributed setting (say 10 low-end nodes on a single LAN)
In other words, I'm not looking for real-time network emulation; waiting 10 hours to simulate 1 minute would be good enough for me. I thought about Kathara, yet one problem still stands: how do I make sure that, say, 10000 containers are given the same number of ticks in a round-robin manner?
So how do I emulate a complex network of Docker workers?

I'm assuming that you will run each worker inside a container. To ensure each container runs with similar CPU access, you can configure CPU reservations and limits on each replica. These numbers get computed down to fractional slices of a core, so on an 8 core system you could give each container 0.01 of a core and run upwards of 800 containers. See the compose documentation on how to set resource constraints. And with swarm mode, you could spread these replicas across multiple nodes, sharing a network.
That said, I think the advice to run shorter simulations on more hardware is good. You will find that a significant portion of the time is spent context switching between processes, possibly invalidating any measurements you want to take.
You will also encounter scalability issues with Docker and the orchestration tool you choose. For example, you'll need to adjust the subnet size for any shared network, which defaults to a /24 with around 253 available IPs. The Docker engine itself will likely spend a non-trivial amount of CPU time maintaining the state of all the running containers.
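
The two sizing calculations above (CPU slices per container and subnet size) can be sketched as follows. Both helper names are hypothetical, and reserving three addresses per subnet (network, broadcast, gateway) is an assumption chosen to match the "around 253" figure for a /24.

```python
def max_containers(cores, millicores_per_container):
    """How many containers fit if each gets a fixed CPU slice.

    Integer millicores (1 core = 1000) avoid floating-point rounding.
    """
    return (cores * 1000) // millicores_per_container


def smallest_prefix(n_containers, reserved=3):
    """Longest IPv4 prefix length whose subnet still holds n_containers.

    reserved=3 is an assumption: network address, broadcast address, gateway.
    """
    needed = n_containers + reserved
    host_bits = (needed - 1).bit_length()  # ceil(log2(needed)) for needed >= 1
    return 32 - host_bits


if __name__ == "__main__":
    # 0.01 of a core is 10 millicores.
    print(max_containers(8, 10))
    for n in (253, 500, 50000):
        print(n, "containers need at least a /%d" % smallest_prefix(n))
```

This only covers the arithmetic; whether the engine and orchestrator cope with that many containers is a separate question.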

TensorFlow scalability

I am using TensorFlow to train a DNN. My network structure is very simple, and each minibatch takes about 50ms with only one parameter server and one worker. To process a huge number of samples I am using distributed ASGD training. However, I found that increasing the worker count does not increase throughput. For example, 40 machines achieve 1.5 million samples per second; after doubling both the parameter server count and the worker count, the cluster still processes only 1.5 million samples per second, or even fewer. The reason is that each step takes much longer when the cluster is large. Does TensorFlow scale well, and is there any advice for speeding up training?

The general approach to solving these problems is to find where the bottlenecks are. You could be hitting a bottleneck in software or in your hardware.
As a general example of doing the math: suppose you have 250M parameters and each backward pass takes 1 second. Assuming 4 bytes per parameter, that is 1GB per step, so each worker will be sending 1GB/sec of data and receiving 1GB/sec of data. If you have 40 machines, that'll be 80GB/sec of transfer between the workers and the parameter servers. Suppose the parameter server machines only have 1GB/sec full-duplex NICs. This means that if you have fewer than 40 parameter server shards, NIC speed will be the bottleneck.
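
The same back-of-the-envelope estimate can be written as a small function. This is only a sketch: the function name is hypothetical, 4 bytes per parameter is an assumption, and it assumes parameters are spread evenly across shards and that the NICs are full duplex, so only one direction needs to be checked.

```python
import math


def min_ps_shards(n_params, bytes_per_param, step_seconds, n_workers, nic_bytes_per_sec):
    """Fewest parameter server shards whose NICs can absorb the workers' updates."""
    # Bytes each worker sends (and receives) per second.
    per_worker = n_params * bytes_per_param / step_seconds
    # Total one-way traffic arriving at the parameter servers.
    inbound_total = per_worker * n_workers
    # Each shard's NIC can take nic_bytes_per_sec in each direction.
    return math.ceil(inbound_total / nic_bytes_per_sec)


if __name__ == "__main__":
    # The numbers from the example: 250M parameters, 1 s steps, 40 workers, 1GB/sec NICs.
    print(min_ps_shards(250_000_000, 4, 1, 40, 1_000_000_000))
```

Plugging in your own parameter count, step time and NIC speed shows quickly whether adding workers without adding shards can help at all.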
After ruling that out, you should consider interconnect speed. You may have N network cards in your cluster, but the cluster most likely can't handle all network cards sending data to all other network cards at once. Can your cluster handle 80GB/sec of data flowing between 80 machines? Google designs its own network hardware to handle its interconnect demands, so this is an important constraint.
Once you have checked that your network hardware can handle the load, I would check the software. For example, with a single worker, how does "time to send" scale with the number of parameter server shards? If the scaling is strongly sublinear, this suggests a bottleneck, perhaps some inefficient scheduling of threads or the like.
As an example of finding and fixing a software bottleneck, see the "grpc RecvTensor is slow" issue. That issue involved the gRPC layer becoming inefficient when sending messages larger than 100MB. It was fixed in an upstream gRPC release but has not been integrated into a TensorFlow release yet, so the current workaround is to break messages into pieces of 100MB or smaller.
The general approach to finding these is to write lots of benchmarks to validate your assumptions about the speed.
Here are some examples:
benchmark sending messages between workers(local)
benchmark sharded PS benchmark (local)

Why does Spark use so much memory when a shuffle occurs?

I see very high memory usage when a shuffle occurs in my Spark job.
The following figure shows the memory metrics when I process 700MB of data with just three rdd.map calls.
(I use Ganglia as the monitoring tool and show just three nodes of my cluster. The x-axis is time, the y-axis is memory usage.)
[figure: memory usage over time on three nodes, three rdd.map calls]
The next figure shows the memory metrics when I process the same data with three rdd.groupBy and three rdd.flatMap calls (order: groupBy1->flatMap1->groupBy2->flatMap2->groupBy3->flatMap3).
[figure: memory usage over time on three nodes, three rdd.groupBy and three rdd.flatMap calls]
As you can see, memory usage on all three nodes increases considerably (by several GB) even though I use just 700MB of data. In fact I have 8 worker nodes, and memory usage increases considerably on all 8 of them.
I think the main cause is the shuffle, since rdd.map involves no shuffle but rdd.groupBy does.
Given this, I have three questions:
Why is memory usage so high? (More than 15GB is used on every worker node for 700MB of data.)
Why does the memory used by old shuffles not seem to be released before the Spark application finishes?
Is there any way to reduce memory usage or to free the memory used by old shuffles?
P.S. - My environment :
cloud platform : MS Azure (8 worker nodes)
Spec. of one worker : 8 cores CPU, 16GB RAM
Language : Java
Spark version : 1.6.2
Java version : 1.7(development), 1.8(execution)
Run mode : Spark standalone (not YARN or Mesos)

In Spark, the operating system decides whether the data can stay in its buffer cache or should be spilled to disk. Each map task creates as many shuffle spill files as there are reducers. Spark doesn't merge and partition the shuffle spill files into one big file, as Apache Hadoop does.
Example: with 6000 reducers (R) and 2000 map tasks (M), there will be M*R = 2000*6000 = 12 million shuffle files, because each map task creates one shuffle spill file per reducer. This causes performance degradation.
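
As a quick check of that arithmetic, here is a minimal sketch of the file count under the one-file-per-reducer behaviour described above. The function name is hypothetical, and the count applies only if your Spark version and shuffle settings actually behave this way.

```python
def shuffle_file_count(map_tasks, reducers):
    """Shuffle files written when every map task writes one file per reducer."""
    return map_tasks * reducers


if __name__ == "__main__":
    # The numbers from the example: M = 2000 map tasks, R = 6000 reducers.
    print(shuffle_file_count(2000, 6000))
```

Because the count grows with the product of both numbers, reducing the number of partitions on either side shrinks it quickly.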
Please refer to this post, which continues the explanation above in more detail.
You can also refer to the paper Optimizing Shuffle Performance in Spark.
~Kedar
