Why does Dataflow use additional disks? - google-cloud-dataflow

When I see the details of my dataflow compute engine instance, I can see two categories of disks being used - (1) Boot disk and local disks, and (2) Additional disks.
I can see that the size that I specify using the diskSizeGb option determines the size of a single disk under the category 'Boot disk and local disks'. My not-so-heavy job is using 8 additional disks of 40GB each.
What are additional disks used for and is it possible to limit their size/number?

Dataflow creates Compute Engine VM instances, also known as workers, for your job.
To process the input data and store temporary data, each worker may require up to 15 additional Persistent Disks.
The default size of each persistent disk is 250 GB in batch mode and 400 GB in streaming mode; 40 GB is very far from the default value.
In this case, the Dataflow service will attach more disks to your worker. If you want to keep a 1:1 ratio between workers and disks, please increase the 'diskSizeGb' field.
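For reference, here is a minimal sketch of raising the per-worker disk size with the Apache Beam Python SDK; the project, region, and bucket names are placeholders, and this is the same setting exposed as the 'diskSizeGb' field / --disk_size_gb flag:
from apache_beam.options.pipeline_options import PipelineOptions, WorkerOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                  # placeholder project id
    region='us-central1',
    temp_location='gs://my-bucket/tmp',    # placeholder bucket
)
# Ask for larger per-worker disks so the service does not need to attach extra ones
options.view_as(WorkerOptions).disk_size_gb = 400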

The existing answer explains how many disks are used and gives some information about them, but it does not answer the main question: why so many disks per worker?
WHY does Dataflow need several disks per worker?
The way in which Dataflow does load balancing for streaming jobs is that a range of keys is allocated to each disk. Persistent state about each key is stored in these disks.
A worker can be overloaded if the ranges that are allocated to its persistent disks have a very high volume. To load-balance, Dataflow can move a range from one worker to another by transferring a persistent disk to a different worker.
So this is why Dataflow uses multiple disks per worker: it allows Dataflow to do load balancing and autoscaling by moving disks from worker to worker.
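To make that concrete, here is a purely illustrative sketch (not Dataflow's actual code or API, just the idea) of how pinning key ranges to disks lets the service rebalance by moving a whole disk to another worker:
# Hypothetical assignment of key ranges to persistent disks
disks = {
    'disk-1': range(0, 100),     # state for keys 0..99 lives on disk-1
    'disk-2': range(100, 200),
    'disk-3': range(200, 300),
}
# Hypothetical assignment of disks to workers
workers = {'worker-a': ['disk-1', 'disk-2'], 'worker-b': ['disk-3']}

def rebalance(disk, src, dst):
    # Move a hot disk (and the entire key range it holds) to another worker;
    # no per-key state has to be copied, only the disk is reattached.
    workers[src].remove(disk)
    workers[dst].append(disk)

rebalance('disk-2', 'worker-a', 'worker-b')   # worker-a was overloaded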

Related

Memory issue in Dask when using a local cluster

I'm trying to use a Dask local cluster to manage system-wide memory usage:
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(scheduler_port=5272, dashboard_address=5273, memory_limit='4GB')
I connect with:
client = Client('tcp://127.0.0.1:5272')
I have 8 cores and 32 GB of RAM. The local cluster allocates 4 GB * 4 workers = 16 GB of memory to the cluster (I have another task that requires about 10 GB of memory). However, there were previously some tasks I could finish fine without calling client = Client('tcp://127.0.0.1:5272'). After I call client = Client('tcp://127.0.0.1:5272'), a memory error is triggered. What can I do in this scenario? Thanks!
I'm thinking it may be because each worker is only allocated 4 GB of memory... but if I set memory_limit='16GB' and it uses all of the resources, it would take 64 GB. I don't have that much memory. What can I do?
It's not clear what you are trying to achieve, but your observation on memory is correct: if a worker is constrained by memory, it won't be able to complete the task. What are the ways out of this? (One possible LocalCluster configuration is sketched after this list.)
Getting access to more resources: if you don't have access to additional hardware, you can check coiled.io or look into the various Dask cloud options.
Optimizing your code: perhaps some calculations could be done in smaller chunks, data could be compressed (e.g. categorical dtype), or there are other opportunities to reduce memory requirements (this really depends on the functions involved, but, say, some internal calculation could be done at lower accuracy with fewer resources).
Using all available resources with non-distributed code (the distributed scheduler itself adds some overhead to the resource requirements).
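One possible configuration, assuming the cluster and client run in the same process; the worker count and per-worker limit are just one way to split 32 GB between this cluster and the other ~10 GB task, and should be tuned to your workload:
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(
    n_workers=2,              # fewer, larger workers instead of many small ones
    threads_per_worker=4,     # still uses all 8 cores
    memory_limit='8GB',       # per-worker limit, i.e. 16 GB for the whole cluster
    scheduler_port=5272,
    dashboard_address=':5273',
)
client = Client(cluster)      # or Client('tcp://127.0.0.1:5272') from another process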

Necessary data size to evaluate Hadoop performance

I'm running Hadoop with 3 datanodes on a single machine using Docker containers. I've run a KMeans algorithm on a small simulated dataset with 200 data points.
Because of the Hadoop overhead, the process takes a long time, about 2 or 3 minutes, while running k-means locally in R takes a few seconds.
I wonder how big my dataset has to be for Hadoop to outperform the non-distributed approach, and whether that's even possible given that I'm running all the nodes on a single machine.
The number of cores and the amount of RAM available to process the data matter more than the amount of data itself. Running Hadoop jobs inside containers means you're really running small, resource-limited JVMs within those containers, so it's expected that giving one full machine access to process the same amount of data will be much quicker, and I'm sure there's a way to write the same distributed algorithm without Hadoop.
Besides that, if the data itself isn't splittable, or is smaller than the HDFS block size, it will only ever be processed by a single MapReduce task anyway. You didn't mention the size, but I suspect 200 data points add up to only a few MB at most.
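As a back-of-the-envelope sketch (the 128 MB block size and 100 bytes per record are assumptions, not values from the question), the split arithmetic looks like this:
import math

block_size = 128 * 1024 ** 2        # assumed default HDFS block size in bytes
record_size = 100                   # assumed bytes per data point
n_points = 200

input_size = n_points * record_size                    # ~20 KB
n_splits = max(1, math.ceil(input_size / block_size))  # 1 split -> 1 map task
print(f'input ~{input_size / 1024:.0f} KB -> {n_splits} input split(s)')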

Is there a way to limit the performance data being recorded by AKS clusters?

I am using Azure Log Analytics to store monitoring data from AKS clusters. 72% of the data stored is performance data. Is there a way to limit how often AKS reports performance data?
At this point we do not provide a mechanism to change the performance metric collection frequency. It is set to 1 minute and cannot be changed.
We were actually thinking about adding an option for more frequent collection, as that was requested by some customers.
Given the number of objects (pods, containers, etc.) running in the cluster, collecting even a few perf metrics may generate a noticeable amount of data... You need that data in order to figure out what is going on in case of a problem.
Curious: you say your perf data is 72% of the total - how much is that in terms of GB/day, do you know? Do you have any active applications running on the cluster generating tracing? What we see is that once you stand up a new cluster, perf data is "the king" of volume, but once you start adding active apps that trace, logs become more and more of a factor in the telemetry data volume...

GlusterFS high CPU usage on read load

I have a GlusterFS setup with two nodes (node1 and node2) configured as a replicated volume.
The volume contains many small files, 8 KB - 200 KB in size. When I subject node1 to heavy read load, the glusterfsd and glusterfs processes together use ~100% CPU on both nodes.
There is no write load on any of the nodes. So why is the CPU load so high on both nodes?
As I understand it, all the data is replicated to both nodes, so it "should" perform like a local filesystem.
This is commonly related to small files, e.g. if you have PHP apps running from a Gluster volume.
This one bit me in the rear once, and it mostly comes down to the fact that many PHP frameworks issue a lot of stat calls to check whether a file exists at a given spot; if not, they stat a level (directory) higher, or a slightly different name. Repeat 1000 times. Per file.
Now here's the catch: that lookup to check whether the file exists does not just happen on that node / the local brick (if you use replication), but on ALL the nodes / bricks involved. The cost can explode fast (especially on some cloud platforms, where IOPS are capped).
This article helped me out significantly. In the end there was still a small penalty, but the benefits outweighed that.
https://www.vanderzee.org/linux/article-170626-141044/article-171031-113239/article-171212-095104
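Purely illustrative arithmetic (the counts are assumptions, not measurements) showing how quickly those lookups multiply on a replicated volume:
files_per_request = 1000     # files a framework might touch per request
probes_per_file = 5          # stat() attempts across include paths per file
bricks = 2                   # replicated bricks that each lookup has to hit

lookups = files_per_request * probes_per_file * bricks
print(f'~{lookups:,} lookups per request')   # ~10,000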

What is the difference between Volume and Partition?

What is the difference between a partition and a volume?
Kindly give an analogy if possible, since I am unable to understand the difference between them.
Partitions -
Storage media (DVDs, USB sticks, HDDs, SSDs) can all be divided into partitions, and these partitions are identified by a partition table.
The partition table is where the partition information is stored; the information kept there is basically where each partition starts and where it finishes on the disk.
Volumes -
A volume is a logical abstraction over physical storage.
Large disks can be partitioned into multiple logical volumes.
Volumes are divided up into fixed-size blocks, or clusters of blocks.
We don't see the partition, as it is handled by the file system, but we do see volumes: they are logical and are presented through a GUI with a hierarchical structure and a human-friendly interface. When we request a file, the request runs through a specific order to read that information from the volume on the partition:
The application creates the file I/O request
The file system turns it into a block I/O request
The block I/O driver accesses the disk
Hope this helps... If any part needs clearing up, let me know and I'll do my best to clarify.
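As a small illustration (assuming the psutil package is installed), you can list the mounted volumes and the partition device each one lives on, which makes the partition-vs-volume distinction visible:
import psutil

for part in psutil.disk_partitions(all=False):
    usage = psutil.disk_usage(part.mountpoint)
    # part.device is the partition (e.g. /dev/sda1); part.mountpoint is the volume you see
    print(f'partition: {part.device:<15} volume: {part.mountpoint:<15} '
          f'fs: {part.fstype:<6} size: {usage.total / 1e9:.1f} GB')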
