File Not Found Error in Dask program run on cluster - dask

I have 4 machines: M1, M2, M3, and M4. The scheduler, the client, and one worker run on M1; the rest of the machines are workers. I've put a CSV file on M1.
When I run a Dask program that reads the file with read_csv, it fails with a file-not-found error.

When one of your workers tries to load the CSV, it will not be able to find it, because it is not present on that local disc. This should not be a surprise. You can get around this in a number of ways:
- copy the file to every worker; this is obviously wasteful in terms of disc space, but the easiest to achieve
- place the file on a networked filesystem (NFS mount, gluster, HDFS, etc.)
- place the file on an external storage system such as Amazon S3 and refer to that location
- load the data in your local process and distribute it with scatter; in this case presumably the data was small enough to fit in memory, and dask probably would not be doing much for you.
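The last option can be sketched as follows. This is a hedged example: the scheduler address `tcp://M1:8786` in the comment is hypothetical, and an in-process client is used here so the sketch is self-contained.

```python
import os, tempfile
import pandas as pd
from dask.distributed import Client

# A small CSV standing in for the file that exists only on M1.
path = os.path.join(tempfile.mkdtemp(), "data.csv")
pd.DataFrame({"x": [1, 2, 3]}).to_csv(path, index=False)

# On the real cluster this would be Client("tcp://M1:8786") (hypothetical
# address); an in-process client keeps the sketch runnable anywhere.
client = Client(processes=False)

df = pd.read_csv(path)                 # read only on the client machine
future = client.scatter(df)            # ship the DataFrame to a worker
total = client.submit(lambda d: int(d["x"].sum()), future).result()
client.close()
```

Because the data is scattered from the client's memory, no worker ever needs the CSV on its own disk.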

Related

Do memory mapped files in Docker containers in Kubernetes work the same as in regular processes in Linux?

I have process A and process B. Process A opens a file, calls mmap, and writes to it; process B does the same but reads the same mapped region once process A has finished writing.
Using mmap, process B is supposed to read the file from memory instead of disk, assuming process A has not called munmap.
If I deploy process A and process B to different containers in the same pod in Kubernetes, is memory-mapped IO supposed to work the same way as in the initial example? Should container B (process B) read the file from memory, as on my regular Linux desktop?
Let's assume both containers are in the same pod and read/write the file from the same persistent volume. Do I need to consider a specific type of volume to achieve mmap IO?
In case you are curious, I am using Apache Arrow and pyarrow to read and write those files and achieve zero-copy reads.
A Kubernetes pod is a group of containers that are deployed together on the same host. So this question is really about what happens with multiple containers running on the same host.
Containers are isolated on a host using a number of different technologies. Two of them might be relevant here. Neither prevents two processes from different containers sharing the same memory when they mmap a file.
The two things to consider are how the file systems are isolated and how memory is ring fenced (limited).
How the file systems are isolated
The trick used is to create a mount namespace so that any new mount points are not seen by other processes. Then file systems are mounted into a directory tree and finally the process calls chroot to set / as the root of that directory tree.
No part of this affects the way processes mmap files. This is just a clever trick on how file names / file paths work for the two different processes.
Even if, as part of that setup, the same file system was mounted from scratch by the two different processes, the result would be the same as a bind mount. That means the same file system exists under two paths, but it is *the same* file system, not a copy.
Any attempt to mmap files in this situation would be identical to two processes in the same namespace.
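The sharing claim can be demonstrated with a small sketch. For portability this uses two independent file descriptors and mappings within one script, standing in for the two processes (e.g. two containers on the same host); the mechanism, a shared page cache for the same underlying file, is the same.

```python
import mmap, os

PATH = "shared.bin"
with open(PATH, "wb") as f:
    f.write(b"\x00" * mmap.PAGESIZE)   # pre-size the file

# Two independent opens + maps, standing in for process A and process B.
fa = open(PATH, "r+b"); ma = mmap.mmap(fa.fileno(), mmap.PAGESIZE)
fb = open(PATH, "r+b"); mb = mmap.mmap(fb.fileno(), mmap.PAGESIZE)

ma[:5] = b"hello"          # "process A" writes through its mapping
data = bytes(mb[:5])       # "process B" sees it immediately: same pages

ma.close(); mb.close(); fa.close(); fb.close(); os.remove(PATH)
```

Both mappings resolve to the same page cache pages, which is exactly why mount namespaces and chroot do not change the behaviour.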
How are memory limits applied?
This is done through cgroups. cgroups don't really isolate anything, they just put limits on what a single process can do.
But there is a natural question to ask: if two processes have different memory limits through cgroups, can they share the same shared memory? Yes, they can!
Note: file and shmem may be shared among other cgroups. In that case, mapped_file is accounted only when the memory cgroup is owner of page cache.
The reference is a little obscure, but describes how memory limits are applied to such situations.
Conclusion
Two processes both memory mapping the same file from the same file system as different containers on the same host will behave almost exactly the same as if the two processes were in the same container.

Using NFS with Dask workers

I have been experimenting with using an NFS shared drive with my user and Dask workers. Is this something that can work? I noticed that Dask created two files in my home directory, global.lock and purge.lock, and did not clean them up when workers were finished. What do these files do?
It is entirely normal to use NFS to host a user's software environment. The files you're seeing are used by a different system altogether.
When Dask workers run out of memory they spill excess data to disk. An NFS mount can work here, but it's much nicer to use local disk if available. This is usually configurable with the --local-directory keyword to dask-worker, or the temporary-directory configuration value.
You can read more about storage issues with NFS and more guidelines here: https://docs.dask.org/en/latest/setup/hpc.html
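A minimal sketch of steering spill storage away from NFS, assuming `/scratch/dask` is a hypothetical fast local path on each worker machine:

```python
# Point worker spill files at local disk rather than the NFS home
# directory. "/scratch/dask" is a hypothetical path; adjust per machine.
import dask

dask.config.set({"temporary-directory": "/scratch/dask"})

# The command-line equivalent when launching workers would be roughly:
#   dask-worker tcp://scheduler:8786 --local-directory /scratch/dask
```

Setting the value does not require the directory to exist yet; workers create it on first spill.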
Yes, Dask can be used with an NFS mount, and indeed you can share configuration/scheduler state between the various processes. Each worker process will use its own temporary storage area. The lock files are safe to ignore, and their existence will depend on exactly the workload you are doing.

Docker design: exchange data between containers or put multiple processes in one container?

In a current project I have to perform the following tasks (among others):
capture video frames from five IP cameras and stitch a panorama
run machine learning based object detection on the panorama
stream the panorama so it can be displayed in a UI
Currently, the stitching and the streaming runs in one docker container, and the object detection runs in another, reading the panorama stream as input.
Since I need to increase the input resolution for the object detector while maintaining the stream resolution for the UI, I have to look for alternative ways of getting the stitched (full-resolution) panorama (~10 MB per frame) from the stitcher container to the detector container.
My thoughts regarding potential solutions:
- Shared volume. Potential downside: one extra write and read per frame might be too slow?
- Using a message queue, e.g. Redis. Potential downside: yet another component in the architecture.
- Merging the two containers. Potential downside(s): not only does it not feel right, but the two containers have completely different base images and dependencies. Plus I'd have to worry about parallelization.
Since I'm not the sharpest knife in the docker drawer, what I'm asking for are tips, experiences and best practices regarding fast data exchange between docker containers.
Usually most communication between Docker containers is over network sockets. This is fine when you're talking to something like a relational database or an HTTP server. It sounds like your application is a little more about sharing files, though, and that's something Docker is a little less good at.
If you only want one copy of each component, or are still actively developing the pipeline: I'd probably not use Docker for this. Since each container has an isolated filesystem and its own user ID space, sharing files can be unexpectedly tricky (every container must agree on numeric user IDs). But if you just run everything on the host, as the same user, pointing at the same directory, this isn't a problem.
If you're trying to scale this in production: I'd add some sort of shared filesystem and a message queueing system like RabbitMQ. For local work this could be a Docker named volume or bind-mounted host directory; cloud storage like Amazon S3 will work fine too. The setup is like this:
Each component knows about the shared storage and connects to RabbitMQ, but is unaware of the other components.
Each component reads a message from a RabbitMQ queue that names a file to process.
The component reads the file and does its work.
When it finishes, the component writes the result file back to the shared storage, and writes its location to a RabbitMQ exchange.
In this setup each component is totally stateless. If you discover that, for example, the machine-learning component of this is slowest, you can run duplicate copies of it. If something breaks, RabbitMQ will remember that a given message hasn't been fully processed (acknowledged); and again because of the isolation you can run that specific component locally to reproduce and fix the issue.
This model also translates well to larger-scale Docker-based cluster-computing systems like Kubernetes.
Running this locally, I would absolutely keep separate concerns in separate containers (especially if individual image-processing and ML tasks are expensive). The setup I propose needs both a message queue (to keep track of the work) and a shared filesystem (because message queues tend to not be optimized for 10+ MB individual messages). You get a choice between Docker named volumes and host bind-mounts as readily available shared storage. Bind mounts are easier to inspect and administer, but on some platforms are legendarily slow. Named volumes I think are reasonably fast, but you can only access them from Docker containers, which means needing to launch more containers to do basic things like backup and pruning.
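The stateless-component pattern above can be sketched as follows. This is a stand-in, not real infrastructure: an in-process `queue.Queue` plays the role of RabbitMQ and a temp directory plays the shared volume; with real RabbitMQ you would swap the queues for channels (e.g. via the pika library), but the component logic stays the same.

```python
import os, queue, tempfile

shared = tempfile.mkdtemp()          # stand-in for the shared filesystem
work_q = queue.Queue()               # "stitcher -> detector" queue
done_q = queue.Queue()               # "detector -> UI" queue

# Producer: write a frame to shared storage, publish only its name.
frame_path = os.path.join(shared, "frame-0001.bin")
with open(frame_path, "wb") as f:
    f.write(b"\x01" * 1024)          # stand-in for a ~10 MB panorama
work_q.put("frame-0001.bin")

# Consumer: read a message, fetch the file, process it, publish the result.
name = work_q.get()
with open(os.path.join(shared, name), "rb") as f:
    n_bytes = len(f.read())          # stand-in for object detection
result_path = os.path.join(shared, name + ".result")
with open(result_path, "w") as f:
    f.write(str(n_bytes))
done_q.put(result_path)              # downstream learns only the location
work_q.task_done()
```

Note that only file names cross the queue; the bulky frame data stays on the shared storage, which is why the queue never needs to handle 10 MB messages.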
Alright, let's unpack this:
IMHO a shared volume works just fine, but it gets way too messy over time, especially if you're handling stateful services.
MQ: this seems like the best option in my opinion. Yes, it's another component in your architecture, but it makes sense to have it rather than maintaining messy shared volumes or handling massive container images (if you manage to combine the two container images).
Yes, you could potentially do this, but it's not a good idea. Considering your use case, I'm going to go ahead and assume you have a massive list of dependencies that could potentially lead to a conflict. Also, a lot of dependencies = a larger image = a larger attack surface, which from a security perspective is not a good thing.
If you really want to run multiple processes in one container, it's possible. There are multiple ways to achieve that, but I prefer supervisord.
https://docs.docker.com/config/containers/multi-service_container/

Aren't Named Pipes in the Filesystem slow?

Isn't it useless to write stream data for IPC into a file in the filesystem, and thus onto your drive (HDD or SSD)? I mean, isn't it better to create a "buffered" pipe in memory, so that we spare the drive and gain performance? But I'm new to IPC... or isn't it writing onto the disk at all? But how is it possible for the system to write into the filesystem without writing to a disk?
Aren't Named Pipes in the Filesystem slow?
They're no slower than any other sort of pipe.
isn't it better to create a "buffered" pipe in the memory
If you aren't memory constrained, then yes (see the note about older operating systems below).
[...] or isn't it writing onto the disk?
Your guess is correct - on many modern operating systems data going into a named pipe is not being written to the disk; the filesystem is just the namespace that holds something that tells you where the ends of the pipe can be found. From the Linux man page for pipe:
Note: although FIFOs have a pathname in the filesystem, I/O on FIFOs does not involve operations on the underlying device (if there is one).
There are older operating systems that buffer pipe data within a filesystem, but given your question's phrasing (on such systems ALL pipes go through the filesystem, not just named ones), I suspect this is a tangent.
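This is easy to observe: push data through a FIFO and check its size afterwards. A minimal sketch (POSIX-only, since it uses os.mkfifo):

```python
import os, tempfile, threading

fifo = os.path.join(tempfile.mkdtemp(), "myfifo")
os.mkfifo(fifo)                      # creates a name, allocates no data

def writer():
    with open(fifo, "wb") as w:      # blocks until a reader opens the FIFO
        w.write(b"through memory, not disk")

t = threading.Thread(target=writer)
t.start()
with open(fifo, "rb") as r:
    msg = r.read()
t.join()

size_on_disk = os.stat(fifo).st_size   # still 0: nothing was ever stored
os.remove(fifo)
```

The bytes travel through an in-kernel buffer; the path in the filesystem is only the rendezvous point, so the FIFO's size stays zero no matter how much data has flowed through it.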

GlusterFS high CPU usage on read load

I have a GlusterFS setup with two nodes (node1 and node2) set up as a replicated volume.
The volume contains many small files, 8 kB - 200 kB in size. When I subject node1 to heavy read load, the glusterfsd and glusterfs processes together use ~100% CPU on both nodes.
There is no write load on any of the nodes. But why is the CPU load so high, on both nodes?
As I understand it all the data is replicated to both nodes, so it "should" perform like a local filesystem.
This is commonly related to small files, e.g. if you have PHP apps running from a Gluster volume.
This one bit me in the rear once, and it mostly comes down to the fact that many PHP frameworks issue a lot of stat calls to see if a file exists at a given spot; if not, they stat a level (directory) higher, or try a slightly different name. Repeat 1000 times. Per file.
Now here's the catch: that lookup to see if the file exists does not just happen on that node / the local brick (if you use replication), but on ALL the nodes / bricks involved. The cost can explode fast (especially on some cloud platforms, where IOPS are capped).
This article helped me out significantly. In the end there was still a small penalty, but the benefits outweighed that.
https://www.vanderzee.org/linux/article-170626-141044/article-171031-113239/article-171212-095104
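The lookup pattern described above can be sketched to show how quickly the probes multiply. This is an illustration, not real framework code: it searches a list of candidate directories the way a PHP include path resolution does, counting one existence check (a stat under the hood) per candidate.

```python
import os, tempfile

include_path = [tempfile.mkdtemp() for _ in range(4)]  # 4 candidate dirs
# The file only exists in the last directory on the search path.
with open(os.path.join(include_path[-1], "helper.php"), "w") as f:
    f.write("<?php ?>")

stat_calls = 0
def find(name):
    global stat_calls
    for d in include_path:
        stat_calls += 1              # os.path.exists stats the path
        candidate = os.path.join(d, name)
        if os.path.exists(candidate):
            return candidate
    return None

for _ in range(1000):                # "Repeat 1000 times. Per file."
    find("helper.php")

# 4 probes per lookup, x 1000 lookups -- and on a replicated Gluster
# volume every probe must be confirmed on every brick.
```

Four candidate directories already mean 4000 stats for a thousand lookups of one file; multiply by the brick count and the CPU load on every node follows directly.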