GlusterFS high CPU usage on read load

I have a GlusterFS setup with two nodes (node1 and node2) set up as a replicated volume.
The volume contains many small files, 8 kB to 200 kB in size. When I subject node1 to heavy read load, the glusterfsd and glusterfs processes together use ~100% CPU on both nodes.
There is no write load on any of the nodes, so why is the CPU load so high, and on both nodes?
As I understand it, all the data is replicated to both nodes, so it "should" perform like a local filesystem.

This is commonly related to small files, e.g. if you have PHP apps running from a Gluster volume.
This one bit me in the rear once, and it mostly comes down to the fact that many PHP frameworks issue a lot of stat() calls to check whether a file exists at a given path; if it doesn't, they stat one level (directory) higher, or a slightly different name. Repeat 1000 times. Per file.
Now here's the catch: that lookup for whether the file exists does not just happen on that node / the local brick (if you use replication), but on ALL the nodes / bricks involved. The cost can explode fast, especially on some cloud platforms where IOPS are capped.
This article helped me out significantly. In the end there was still a small penalty, but the benefits outweighed it.
https://www.vanderzee.org/linux/article-170626-141044/article-171031-113239/article-171212-095104
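To make the lookup amplification concrete, here is a minimal sketch (plain Python, with hypothetical paths) of the kind of path-search loop a PHP-style autoloader performs; on a two-node replicated volume like the one in the question, every exists()/stat() call below becomes a LOOKUP on both bricks:

    import os

    # Hypothetical include path and candidate names, as a framework autoloader
    # might try them; adjust to taste.
    INCLUDE_DIRS = ["/var/www/app", "/var/www/app/lib", "/var/www/app/vendor"]
    CANDIDATE_NAMES = ["Foo.php", "Foo.class.php", "foo.php"]

    def resolve(counter):
        """Try every directory/name combination until one exists."""
        for directory in INCLUDE_DIRS:
            for name in CANDIDATE_NAMES:
                counter[0] += 1                    # one LOOKUP per exists()/stat()
                path = os.path.join(directory, name)
                if os.path.exists(path):
                    return path
        return None

    lookups = [0]
    resolve(lookups)
    # With 2-way replication, each client-side lookup is checked on both bricks,
    # so the brick-side work is roughly 2 * lookups[0].
    print(f"client lookups issued: {lookups[0]}")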

Related

Best practice to permute static directory structures at deployment of Docker containers (in Kubernetes)?

I have a base image and several completely orthogonal "dimensions" of completely static overlays that map into data directories, each with several options, that I want to permute to produce final container(s) in my deployments. As a degenerate example, the base image (X) will need one of each of (A,B,C), (P,D,Q), and (K,L,M) at deployment time. What I'm doing now is building separate images for each permutation that I end up actually needing: e.g. XADM, XBDK, etc. The problem is that as the number of dimensions of static data overlays expands and the number of choices inside each dimension gets larger, I run into serious combinatorial explosion issues - it might take 10 minutes for our CI/CD system to build each image (some of the overlays are large) and since it is the base image that changes most often, the layers don't cache well.
Thoughts so far:
generate each layer (ABCPDQKLM) as a separate container that populates a volume which then gets mounted by each of my X containers. This is fine, though I NEVER need the layers to be writable and don't especially want to pay for persistent storage associated with volumes that feel like they should be superfluous.
reorder my layers to be slowest-to-fastest changing. I can get some improvement from doing this, but I still hit the combinatorics issue: I probably still have to build all the combinations I need, but at least my CI/CD build time will be improved. I think it results in poorer overall layer caching, but trading off space for time might be reasonable and the result per tenant is still good and doesn't incur any volume storage during deployment.
I'm not happy about either option (or my current solution). Any ideas would be welcome.
Edits/Questions:
"static" means read-only, but as a practical matter, the A/B/C overlays might each be a few 100MB of directory structure to be mounted/present in a specific place in the container's file system. In every case, it is data that is going to be used (even memory-mapped!) by the programs in the base image, so it needs to be at least very effectively cached near each of the CPUs that is going to be using it. I like the performance characteristics of having the data baked into the containers, but perhaps I should be trusting the storage layer more to keep the data properly cached/replicated near the real CPUs. Doing so means trading off registry space charges against PV storage charges, but that may be a minor consideration.
Basically, each "dimension" is a type of trained machine learning model. I need to compose the dimensions by choosing the right set of trained models to fit the domain required for each of many production tenants.
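To put numbers on the combinatorial explosion described above, here is a small sketch (plain Python; the dimension letters and the 10-minute build time come from the question) that counts the images a full permutation build implies:

    from itertools import product

    base = "X"
    dimensions = [("A", "B", "C"), ("P", "D", "Q"), ("K", "L", "M")]

    tags = ["".join((base, *combo)) for combo in product(*dimensions)]
    print(len(tags))    # 27 images for 3 dimensions with 3 choices each
    print(tags[:5])     # ['XAPK', 'XAPL', 'XAPM', 'XADK', 'XADL']

    # At ~10 minutes of CI time per image, a change to the base image means
    # roughly 27 * 10 = 270 minutes of rebuilds; one more 3-way dimension
    # triples that.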

Docker design: exchange data between containers or put multiple processes in one container?

In a current project I have to perform the following tasks (among others):
capture video frames from five IP cameras and stitch a panorama
run machine learning based object detection on the panorama
stream the panorama so it can be displayed in a UI
Currently, the stitching and the streaming run in one Docker container, and the object detection runs in another, reading the panorama stream as input.
Since I need to increase the input resolution for the object detector while maintaining the stream resolution for the UI, I have to look for alternative ways of getting the stitched (full-resolution) panorama (~10 MB per frame) from the stitcher container to the detector container.
My thoughts regarding potential solutions:
A shared volume. Potential downside: one extra write and read per frame might be too slow?
Using a message queue, e.g. Redis. Potential downside: yet another component in the architecture.
Merging the two containers. Potential downside(s): not only does it not feel right, but the two containers have completely different base images and dependencies. Plus I'd have to worry about parallelization.
Since I'm not the sharpest knife in the docker drawer, what I'm asking for are tips, experiences and best practices regarding fast data exchange between docker containers.
Most communication between Docker containers is over network sockets. This is fine when you're talking to something like a relational database or an HTTP server. It sounds like your application is a little more about sharing files, though, and that's something Docker is a little less good at.
If you only want one copy of each component, or are still actively developing the pipeline: I'd probably not use Docker for this. Since each container has an isolated filesystem and its own user ID space, sharing files can be unexpectedly tricky (every container must agree on numeric user IDs). But if you just run everything on the host, as the same user, pointing at the same directory, this isn't a problem.
If you're trying to scale this in production: I'd add some sort of shared filesystem and a message queueing system like RabbitMQ. For local work this could be a Docker named volume or bind-mounted host directory; cloud storage like Amazon S3 will work fine too. The setup is like this:
Each component knows about the shared storage and connects to RabbitMQ, but is unaware of the other components.
Each component reads a message from a RabbitMQ queue that names a file to process.
The component reads the file and does its work.
When it finishes, the component writes the result file back to the shared storage, and writes its location to a RabbitMQ exchange.
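A minimal sketch of one such worker, assuming pika for RabbitMQ; the queue/exchange names, broker hostname, shared-storage path, and the run_detector stand-in are all made up for illustration:

    import json
    import pika

    SHARED_DIR = "/mnt/shared"     # assumed shared volume / filesystem mount
    IN_QUEUE = "frames"            # hypothetical queue and exchange names
    OUT_EXCHANGE = "detections"

    def run_detector(data: bytes) -> bytes:
        return data                # stand-in for the actual ML step

    def handle(channel, method, properties, body):
        msg = json.loads(body)                      # e.g. {"path": "frame-0001.raw"}
        in_path = f"{SHARED_DIR}/{msg['path']}"
        out_path = in_path + ".result"

        with open(in_path, "rb") as f:
            result = run_detector(f.read())
        with open(out_path, "wb") as f:
            f.write(result)

        # Tell downstream consumers where the result lives, then ack the input
        # so RabbitMQ knows this message has been fully processed.
        channel.basic_publish(exchange=OUT_EXCHANGE, routing_key="",
                              body=json.dumps({"path": out_path}))
        channel.basic_ack(delivery_tag=method.delivery_tag)

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
    channel = connection.channel()
    channel.queue_declare(queue=IN_QUEUE, durable=True)
    channel.exchange_declare(exchange=OUT_EXCHANGE, exchange_type="fanout")
    channel.basic_qos(prefetch_count=1)             # one frame at a time per worker
    channel.basic_consume(queue=IN_QUEUE, on_message_callback=handle)
    channel.start_consuming()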
In this setup each component is totally stateless. If you discover that, for example, the machine-learning component of this is slowest, you can run duplicate copies of it. If something breaks, RabbitMQ will remember that a given message hasn't been fully processed (acknowledged); and again because of the isolation you can run that specific component locally to reproduce and fix the issue.
This model also translates well to larger-scale Docker-based cluster-computing systems like Kubernetes.
Running this locally, I would absolutely keep separate concerns in separate containers (especially if individual image-processing and ML tasks are expensive). The setup I propose needs both a message queue (to keep track of the work) and a shared filesystem (because message queues tend to not be optimized for 10+ MB individual messages). You get a choice between Docker named volumes and host bind-mounts as readily available shared storage. Bind mounts are easier to inspect and administer, but on some platforms are legendarily slow. Named volumes I think are reasonably fast, but you can only access them from Docker containers, which means needing to launch more containers to do basic things like backup and pruning.
Alright, let's unpack this:
IMHO a shared volume works just fine, but it gets way too messy over time, especially if you're handling stateful services.
MQ: this seems like the best option in my opinion. Yes, it's another component in your architecture, but it makes sense to have it rather than maintaining messy shared volumes or handling massive container images (if you manage to combine the two container images).
Yes, you could potentially do this, but it's not a good idea. Considering your use case, I'm going to go ahead and make the assumption that you have a massive list of dependencies which could potentially lead to a conflict. Also, lots of dependencies = larger image = larger attack surface, which from a security perspective is not a good thing.
If you really want to run multiple processes in one container, it's possible. There are multiple ways to achieve that; however, I prefer supervisord.
https://docs.docker.com/config/containers/multi-service_container/
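If you do go the multi-process route without supervisord, a small wrapper process works too. A minimal sketch in Python; the two command lines are placeholders for the real stitcher and detector entrypoints:

    import subprocess
    import sys
    import time

    # Placeholder commands; replace with the real service entrypoints.
    COMMANDS = [
        ["python", "stitcher.py"],
        ["python", "detector.py"],
    ]

    procs = [subprocess.Popen(cmd) for cmd in COMMANDS]

    # Poll the children; if any one exits, stop the others and exit with its
    # return code so the container dies visibly and can be restarted.
    while True:
        for p in procs:
            code = p.poll()
            if code is not None:
                for other in procs:
                    if other is not p:
                        other.terminate()
                sys.exit(code)
        time.sleep(1)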

Why would one chose many smaller machine types instead of fewer big machine types?

In a clustering high-performance computing framework such as Google Cloud Dataflow (or for that matter even Apache Spark or Kubernetes clusters etc), I would think that it's far more performant to have fewer really BIG machine types rather than many small machine types, right? As in, it's more performant to have 10 n1-highcpu-96 rather than say 120 n1-highcpu-8 machine types, because
the cpus can use shared memory, which is way way faster than network communications
if a single thread needs access to lots of memory for a single threaded operation (eg sort), it has access to that greater memory in a BIG machine rather than a smaller one
And since the price is the same (eg 10 n1-highcpu-96 costs the same as 120 n1-highcpu-8 machine types), why would anyone opt for the smaller machine types?
As well, I have a hunch that for the n1-highcpu-96 machine type we'd occupy the whole host, so we don't need to worry about competing demands on the host from another VM belonging to another Google Cloud customer (eg contention in the CPU caches or motherboard bandwidth etc.), right?
Finally, although I don't think the Google Compute Engine VMs correctly report the "true" CPU topology of the host system, if we do choose the n1-highcpu-96 machine type, the reported CPU topology is presumably a touch closer to the truth because the VM is using up the whole host, so any programs (eg the NUMA-aware option in Java?) running on that VM that attempt to take advantage of the topology have a better chance of making the right decisions.
Whether to choose many instances of a smaller machine type or a few instances of a big machine type depends on many factors.
The VM sizes differ not only in the number of cores and amount of RAM, but also in network I/O performance.
Instances with small machine types are limited in CPU and I/O power and are inadequate for heavy workloads.
Also, if you are planning to grow and scale, it is better to design and develop your application to run across several instances. Having small VMs gives you a better chance of having them distributed across physical servers in the datacenter that have the best resource situation at the time the machines are provisioned.
Having a larger number of smaller instances also helps to isolate fault domains. If one of your small nodes crashes, that only affects a small number of processes. If a large node crashes, multiple processes go down.
It also depends on the application you are running on your cluster and the workload. I would also recommend going through this link to see the sizing recommendation for an instance.

How to emulate 500-50000 worker (docker) nodes network?

So I have a worker Docker image. I want to spin up a network of 500-50000 nodes to emulate what happens to a private blockchain such as Ethereum at different scales. What would be a recommendation for an open-source tool/library for such a job:
a) one that would make sure that, even on a low-end setup (say, one 40-core node), all workers are moved forward in time equally (not in real time)
b) one that would allow (a) in a distributed setting (say, 10 such low-end nodes on a single LAN)
In other words, I am not looking for real-time network emulation, so I can wait 10 hours to simulate 1 minute and it would be good enough for me. I thought about Kathara, yet a problem still stands: how to make sure that, say, 10000 containers are given the same amount of ticks in a round-robin manner?
So how to emulate a complex network of docker workers?
I'm making the assumption that you will run each worker inside of a container. To ensure each container runs with similar CPU access, you can configure CPU reservations and limits on each replica. These numbers get computed down to fractional slices of a core, so on an 8-core system you could give each container 0.01 of a core to run upwards of 800 containers. See the Compose documentation on how to set resource constraints. And with swarm mode, you could spread these replicas across multiple nodes, sharing a network.
That said, I think the advice to run shorter simulations on more hardware is good. You will find a significant portion of the time is spent in context switching between each process, possibly invalidating any measurements you want to take.
You will also encounter scalability issues with Docker and the orchestration tool you choose. For example, you'll need to adjust the subnet size for any shared network, which defaults to a /24 with around 253 available IPs. The Docker engine itself will likely spend a non-trivial amount of CPU time maintaining the state for all of the running containers.
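As a rough illustration of the per-container CPU capping mentioned above, the same cpu_period/cpu_quota knobs that Compose exposes can also be set through the Docker SDK for Python; the image name, network name, and replica count below are made up:

    import docker

    client = docker.from_env()

    IMAGE = "my-blockchain-worker:latest"   # hypothetical worker image
    REPLICAS = 800

    for i in range(REPLICAS):
        client.containers.run(
            IMAGE,
            name=f"worker-{i}",
            detach=True,
            network="workers-net",          # assumes a pre-created shared network
            cpu_period=100_000,             # 100 ms scheduler period
            cpu_quota=1_000,                # 1 ms per period, i.e. ~0.01 of a core
            mem_limit="64m",                # keep per-container memory bounded too
        )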

How important is a small Docker image when running?

There are a number of really tiny Linux Docker images that weigh in at around 4-5 MB, and the "full" distros that start around 100 MB and climb to twice that.
Setting aside storage space and download time from a repo, are there runtime considerations to small vs. large images? For example, if I have a compiled Go program, one running on Busybox and the other on Ubuntu, and I run, say, 10 of them on a machine, in what ways (if any) does it matter that one image is tiny and the other pretty heavy? Does one consume more runtime resources than the other?
I never saw any real difference in the consumption of resources other than storage and RAM when the image is bigger; however, as Docker containers should be single-process, why carry the big overhead of unused clutter in your containers?
When trimming things down to small containers, there are some advantages you may consider:
Faster transfer when deploying (esp. important if you want to do rolling upgrades)
Costs: most of the time I used big containers, I ran straight into storage issues on small VMs
Distributed file systems: when using file storage like GlusterFS or other attached storage, big containers slow things down when booted and updated heavily
Massive overhead of data: if you have 500 MB of clutter, you'll have it on your dev machine, your CI/CD server, your registry AND every node of your production servers. This CAN matter depending on your use case.
I would say: if you just use a handful of containers internally then the size is less important than when you run hundreds of containers in production.
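If you want to see exactly what you would be shipping, here is a quick size check with the Docker SDK for Python (the tags are only examples):

    import docker

    client = docker.from_env()

    for tag in ("busybox:latest", "ubuntu:22.04"):
        image = client.images.pull(tag)                  # pulls if not already present
        print(tag, round(image.attrs["Size"] / 1e6, 1), "MB")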
