So I have a working Dask/SLURM cluster of 4 Raspberry Pis with a common NFS share, on which I can run Python jobs successfully.
However, I want to add some more ARM devices to my cluster that do not support NFS mounts (the kernel module is missing), so I want to move to FUSE-based FTP mounts with CurlFtpFS.
I have set up the mounts successfully with an anonymous username and no password, and the common FTP share can be seen by all the nodes (just as before, when it was an NFS share).
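For reference, the mounts are roughly of this form (the server address and mount point here are placeholders):
# anonymous FTP share mounted via FUSE, visible to all local users
curlftpfs -o allow_other ftp://192.168.1.100/ /mnt/cluster-share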
I can still run SLURM jobs (since they do not use the share), but when I try to run a Dask job, the master node times out, complaining that no worker nodes could be started.
I am not sure what exactly the problem is, since the share is open to anyone for read/write access (e.g. logs and Dask queue intermediate files).
Any ideas how I can troubleshoot this?
I don't believe anyone has a cluster like yours!
At a guess, filesystem access via FUSE, FTP and the Pi is much slower than the OS expects, and you are seeing the effects of low-level timeouts, i.e., from Dask's point of view it appears that file reads are failing. Dask needs access to storage for configuration and sometimes temporary files. You would want to make sure that these locations are on local storage or turned off. However, if this is happening during the import of modules, which you have on the shared drive by design, there may be no fixing it (Python loads many small files during import). Why not use rsync to move the files to the nodes?
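As a concrete starting point, something along these lines might help (the paths, hostnames and project directory below are placeholders, not a tested recipe):
# Point Dask's temporary files and config at local storage on every node
# (DASK_TEMPORARY_DIRECTORY maps to the temporary-directory setting,
# DASK_CONFIG sets where Dask looks for its config files).
export DASK_TEMPORARY_DIRECTORY=/tmp/dask-scratch
export DASK_CONFIG=/home/pi/.config/dask
mkdir -p "$DASK_TEMPORARY_DIRECTORY" "$DASK_CONFIG"

# Instead of importing your code from the slow FTP share, push it to
# local disk on each worker before submitting the job.
for host in pi-worker1 pi-worker2 pi-worker3; do
    rsync -az /mnt/cluster-share/my_project/ "$host":/home/pi/my_project/
done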
I have an image in Artifact Registry that does a unit of work:
it expects input files inside a certain directory (let's call it main_input)
it runs them, performs a sequence of computations, and outputs the results into an output folder in Google Cloud Storage
The run time of each does not exceed 30 minutes, but I have thousands of such runs to perform.
Inside a single VM, I can create multiple containers from this image by mounting each container's main_input directory to the correct directory on the host, and run them.
However, I wonder if Cloud Run is a more scalable solution for this, or should I look at other services/strategies?
Managing thousands of runs is not an easy task; you can use a scheduler like Airflow or Argo Workflows to run the tasks and restart them if needed.
For the container environment, I propose Kubernetes (GKE) over Cloud Run, for several reasons:
you have more permissions than with Cloud Run
you stay agnostic of GCP (your code can work on other platforms like AWS)
better management of application configurations and secrets, supported by all the CI/CD tools
less expensive: you can create scalable preemptible node pools to reduce cost and add a lot of resources when needed (a rough example follows this list)
you can use the same cluster to run other applications for your company
you can use open-source tools for log collection, monitoring, HTTP reverse proxying, ...
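For the preemptible-pool point, something like this is what I mean (cluster name, zone, machine type and size limits are placeholders):
# Create a small cluster, then add a preemptible, autoscaling node pool
# for the batch workload.
gcloud container clusters create batch-cluster \
    --zone us-central1-a \
    --num-nodes 1

gcloud container node-pools create preemptible-pool \
    --cluster batch-cluster \
    --zone us-central1-a \
    --preemptible \
    --machine-type e2-standard-4 \
    --enable-autoscaling --min-nodes 0 --max-nodes 20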
Currently, I am migrating to Docker Swarm and have begun to use docker configs to offload most of the configuration files, but I have one file remaining, several GB in size, that is used by my tileserver. Right now I have 1 master / 4 workers, and I am looking for a way to share that file with all nodes in the swarm to prepare for a time when the tileserver goes down.
Any ideas?
If you want highly available data, then you need a solution that distributes data amongst nodes (or servers).
One approach would be deploying an object-storage solution onto the swarm. Something like MinIO gives you an S3-compatible REST API and, when deployed with a minimum of 4 disks in erasure-coding mode, tolerates 1 disk down for writing and 2 disks down for reading (assuming you have a node per disk).
If re-jigging your app to work with object storage isn't in scope, then investigate something like GlusterFS, which you will want to install on the metal rather than on Docker. GlusterFS will give you a unified filesystem with decent HA on 3 nodes, and you can add disks on the fly.
Obviously, with MinIO it is expected that your app would use the S3 API to access its files. With GlusterFS you would need to mount GlusterFS volumes at host locations, where containers can then mount volumes to gain access to that network storage --
unless you are willing to go wandering through the world of rex-ray and other community-supported Docker volume drivers, which either haven't seen an update in years or are literally maintained by one guy for fun, and which can bring some first-class support for GlusterFS-based Docker volumes to your hopefully non-production Docker Swarm.
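If you go the GlusterFS route, a rough sketch of what that looks like (volume name, hostnames, brick paths and image name are placeholders, and this assumes glusterfs-server is already installed on three of the nodes):
# On one GlusterFS node: create and start a 3-way replicated volume.
gluster volume create tiles replica 3 \
    node1:/data/glusterfs/tiles/brick \
    node2:/data/glusterfs/tiles/brick \
    node3:/data/glusterfs/tiles/brick
gluster volume start tiles

# On every swarm node: mount the volume at a host path.
mkdir -p /mnt/tiles
mount -t glusterfs node1:/tiles /mnt/tiles

# Then bind-mount that host path into the tileserver service.
docker service create --name tileserver \
    --mount type=bind,source=/mnt/tiles,target=/data \
    my-tileserver-image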
We are running a pod in Kubernetes that needs to load a file during runtime. This file has the following properties:
It is known at build time
It should be mounted read-only by multiple pods (the same kind)
It might change (externally to the cluster) and needs to be updated
For various reasons (security being the main concern) the file cannot be inside the docker image
It is potentially quite large, theoretically up to 100 MB, but in practice between 200 kB and 10 MB.
We have considered various options:
Creating a persistent volume, mounting the volume in a temporary pod to write (update) the file, unmounting the volume, and then mounting it in the service with ROX (ReadOnlyMany) claims. This solution means we need downtime during upgrades, and it is hard to automate (due to timing).
Creating multiple Secrets using the secrets management of Kubernetes, and then "assembling" the file in an init container or something similar before loading it.
Both of these solutions feel a little bit hacky - is there a better solution out there that we could use to solve this?
You need to use a shared filesystem that supports read/write access from multiple pods.
Here is a link to the CSI drivers that can be used with Kubernetes and provide that kind of access:
https://kubernetes-csi.github.io/docs/drivers.html
Ideally, you need a solution that is not an appliance and can run anywhere, meaning it can run in the cloud or on-prem.
The platforms that could work for you are Ceph, GlusterFS, and Quobyte (disclaimer: I work for Quobyte).
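Once such a driver is installed, the claim side looks roughly like this sketch (the StorageClass name "shared-rwx" is a placeholder and depends on the driver you pick):
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-config
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: shared-rwx
  resources:
    requests:
      storage: 1Gi
EOF
An updater job can mount this claim read/write to refresh the file, while the serving pods mount the same claim with readOnly: true in their volumeMounts.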
Perhaps a silly question that makes no sense:
In a Kubernetes deployment (or minikube), when a pod container crashes, I would like to analyze the filesystem at that moment. That way, I could see core dumps or any other useful information.
I know that I could mount a volume or PVC to get core dumps from a host-defined core-pattern location, and I could also get logs by means of an rsyslog sidecar or some other way, but I would still like to do "post-mortem" analysis if possible. I assume that Kubernetes should provide some mechanism (but I don't know how, which is the reason for my question) to do these forensic tasks, easing life for all of us, because in a production system we may need to analyze killed/exited containers.
I tried playing directly with docker run without the --rm option, but I can't get anything useful out of inspection to obtain useful information or recreate the filesystem as it was at the last moment the container was alive.
Thank you very much!
When a pod container crashes, I would like to analyze the filesystem at that moment.
Pods (containers) natively use non-persistent storage.
When a container exits/terminates, so does the container's storage.
A pod (container) can, however, be connected to external storage. This allows persistent data to be kept (you can configure a volume mount as the path for core dumps, etc.); since this external storage is not removed when a container is stopped/killed, it gives you much more flexibility to analyze the filesystem afterwards. You can back such volumes with commonly used filesystems such as NFS, etc.
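As a rough sketch of the moving parts (paths, names and the image are placeholders, and this is the plain-Docker version of the idea rather than anything Kubernetes-specific):
# On the node (as root): send core dumps to a fixed directory.
mkdir -p /cores
echo '/cores/core.%e.%p' > /proc/sys/kernel/core_pattern

# Run the container with that directory mounted, so dumps survive the crash.
docker run --name myapp -v /cores:/cores my-image

# After the container has exited, its filesystem is still inspectable:
docker export myapp | tar -tvf -                       # list the final filesystem
docker commit myapp myapp-postmortem                   # or freeze it as an image
docker run -it --rm --entrypoint sh myapp-postmortem   # and poke around (assuming the image has a shell)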
I'm trying to deploy and run a docker image in a GCP VM instance.
I need it to access a certain Cloud Storage Bucket (read and write).
How do I mount a bucket inside the VM? How do I mount a bucket inside the Docker container running in my VM?
I've been reading the Google Cloud documentation for a while, but I'm still confused. All the guides show how to access a bucket from a local machine, not how to mount it to a VM.
https://cloud.google.com/storage/docs/quickstart-gsutil
I found something about FUSE, but it looks overly complicated for just mounting a single bucket into the VM filesystem.
Google Cloud Storage is an object storage API; it is not a filesystem. As a result, it isn't really designed to be "mounted" within a VM. It is designed to be highly durable and to scale to extraordinarily large objects (and large numbers of objects).
Though you can use gcsfuse to mount it as a filesystem, that method has pretty significant drawbacks. For example, operations that are simple on a normal filesystem can be expensive in terms of operation count.
Likewise, there are many surprising behaviors that are a result of the fact that it is an object store. For example, you can't edit objects -- they are immutable. To give the illusion of writing to the middle of an object, the object is, in effect, deleted and recreated whenever a call to close() or fsync() happens.
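If you decide the convenience is worth those drawbacks, the gcsfuse route looks roughly like this (bucket name, mount point and image are placeholders, and it assumes the VM's service account already has access to the bucket):
# On the VM, after installing gcsfuse:
sudo mkdir -p /mnt/my-bucket
sudo gcsfuse -o allow_other my-bucket /mnt/my-bucket

# Bind-mount the (already mounted) host path into the container.
docker run -it -v /mnt/my-bucket:/data my-image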
That said, the best way to use GCS is to design your application to use the API (or the S3-compatible API) directly. That way the semantics are well understood by the application, and you can optimize for them to get better performance and control your costs. Thus, to access it from your Docker container, ensure your container has a way to authenticate to GCS (either through the credentials on the instance, or by deploying a key for a service account with the necessary permissions to access the bucket), then have the application call the API directly.
Finally, if what you need is actually a filesystem, and not specifically GCS, Google Cloud does offer at least 2 other options if you need a large mountable filesystem that is designed for that specific use case:
Persistent Disk, which is the baseline block storage that you get with a VM; you can attach many of these devices to a single VM. However, you can't mount them read/write on multiple VMs at once -- if you need to attach a disk to multiple VMs, it must be read-only for all instances it is attached to.
Cloud Filestore is a managed service that provides an NFS server in front of a persistent disk. Thus, the filesystem can be mounted read/write and shared across many VMs. However, it is significantly more expensive than PD (as of this writing, about $0.20/GB/month vs $0.04/GB/month in us-central1), and there are minimum size requirements (1 TB).
Google Cloud Storage buckets cannot be mounted in Google Compute Engine instances or containers without third-party software such as FUSE. Neither Linux nor Windows has built-in drivers for Cloud Storage.
GCP VMs come with the Google Cloud SDK installed, so without mounting anything you can copy files in and out using its commands.
gsutil ls gs://
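For example (the bucket name and paths are placeholders):
gsutil ls gs://my-bucket                      # list objects in the bucket
gsutil cp gs://my-bucket/input.csv /tmp/      # copy an object down to the VM
gsutil cp /tmp/results.csv gs://my-bucket/    # copy a file back up to the bucket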