We have a standalone Spark cluster. With a cluster, if the RDD memory storage is not enough, it spills the data to disk. Where exactly is the data spilled to when there is no HDFS? Local disk of each slave node?
Thanks!
As far as I know, all data is spilled to the local directory defined by spark.local.dir on each node, independent of HDFS access.
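For illustration, here is a minimal sketch of how the spill directory might be pinned when submitting a job. The helper name build_submit_command, the path /mnt/spark-scratch, the master URL, and the script name are all hypothetical; --master and --conf are standard spark-submit options. Note that in standalone mode, the SPARK_LOCAL_DIRS environment variable on the workers, if set, overrides spark.local.dir.

```python
import shlex


def build_submit_command(app, master, local_dirs):
    """Return a spark-submit argv list that sets spark.local.dir.

    local_dirs is a list of directories on each node's local disk;
    spark.local.dir accepts a comma-separated list of directories.
    """
    if not local_dirs:
        raise ValueError("at least one local directory is required")
    return [
        "spark-submit",
        "--master", master,
        "--conf", "spark.local.dir=" + ",".join(local_dirs),
        app,
    ]


# Hypothetical values, for illustration only.
cmd = build_submit_command(
    app="job.py",
    master="spark://master-host:7077",
    local_dirs=["/mnt/spark-scratch"],
)
print(shlex.join(cmd))
```

The directories should exist and be writable on every worker, since each node spills to its own local disk.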
We are planning to host our Artifactory on ECS (Fargate) and mount the data to EFS. We will use an ALB in front of the containers (8081 and 8082). We still have some open issues:
Can we use multiple containers at the same time or will there be upload/write issues to EFS?
Is EFS a good solution or is S3 better?
What about the metadata. I read Artifactory is hosting this in some Derby database. What if we redeploy a new container? Will the data be gone? Can this data be persisted on EFS or do we need RDS?
Can we use multiple containers at the same time or will there be upload/write issues to EFS?
Ans: Yes, you can use multiple containers to host Artifactory instances on a single host. However, it is generally recommended to use multiple hosts to avoid the 'single point of failure' scenario. I don't anticipate any RW issues with EFS/S3.
Is EFS a good solution or is S3 better?
Ans: In my opinion, both S3 and EFS are known more for scalability than for high performance, so the choice depends entirely on the use case. You can work around the performance limitation by enabling cache-fs in Artifactory, which stores frequently used binaries in a defined place (like a local disk with higher RW speeds). You can read more about cache-fs here: https://jfrog.com/knowledge-base/what-is-cache-fs-video/
What about the metadata. I read Artifactory is hosting this in some Derby database. What if we redeploy a new container? Will the data be gone? Can this data be persisted on EFS or do we need RDS?
Ans: When you are configuring more than one Artifactory node, it is mandatory to have an external database (RDS) to store the configurations/references. On a side note: Artifactory generates the metadata for the packages/artifacts and stores it in the FS only. However, the references will be stored in the DB.
Currently, I am migrating to Docker Swarm and have begun to use docker configs to offload most of the configuration files, but I have one remaining file of several GB that is used by my tileserver. Right now, I have 1 master and 4 workers, and I am looking for a way to share that file with all nodes in the swarm to prepare for a time when the tileserver goes down.
Any ideas?
If you want highly available data, then you need a solution that distributes data amongst nodes (or servers).
One approach would be deploying an object storage solution onto the swarm. Something like MinIO gives you an S3-compatible REST API and, when deployed with a minimum of 4 disks in erasure coding mode, tolerates 1 disk down for writing and 2 disks down for reading (assuming you have a node per disk).
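The tolerance figures above follow from quorum arithmetic. A rough sketch, assuming parity is set to half the disks, so that reads need half the disks online and writes need one more than half. The function name is hypothetical, and the actual behavior depends on the configured parity and the MinIO version:

```python
def erasure_tolerance(total_disks):
    """Return (write_tolerance, read_tolerance) in disks.

    Assumes parity = total_disks / 2, so a read needs half the disks
    online and a write needs one more than half. This is an assumption
    for illustration, not a statement of MinIO's current defaults.
    """
    if total_disks < 4 or total_disks % 2:
        raise ValueError("expected an even number of disks, at least 4")
    half = total_disks // 2
    read_quorum = half
    write_quorum = half + 1
    return total_disks - write_quorum, total_disks - read_quorum


write_ok, read_ok = erasure_tolerance(4)
print(f"4 disks: writes survive {write_ok} down, reads survive {read_ok} down")
```

With one disk per node on a 5-node swarm, this suggests placing disks on 4 of the nodes and checking the tolerance against how many nodes you expect to lose at once.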
If re-jigging your app to work with object storage isn't in scope, then investigate something like glusterfs, which you will want to install on the metal rather than on docker. glusterfs will give you a unified filesystem with decent HA on 3 nodes, and you can add disks on the fly.
Obviously, with MinIO it's expected that your app would use the S3 API to access its files. With glusterfs you would need to mount gfs volumes on host locations, where containers can then mount volumes to gain access to that network storage, unless you are willing to go wandering through the world of rex-ray and other community-supported docker volume drivers that either haven't seen an update in years or are literally maintained by one guy for fun, which can bring some first-class support for glusterfs-based docker volumes to your hopefully non-production docker swarm.
So I have a working DASK/SLURM cluster of 4 Raspberry Pis with a common NFS share, on which I can run Python jobs successfully.
However, I want to add some more ARM devices to my cluster that do not support NFS mounts (kernel module missing), so I wish to move to FUSE-based FTP mounts with CurlFtpFS.
I have set up the mounts successfully with an anonymous username and without any passwords, and the common FTP share can be seen by all the nodes (just as before when it was an NFS share).
I can still run SLURM jobs (since they do not use the share), but when I try to run a DASK job the master node times out, complaining that no worker nodes could be started.
I am not sure what exactly the problem is, since the share is open to anyone for read/write access (e.g. logs and dask queue intermediate files).
Any ideas how I can troubleshoot this?
I don't believe anyone has a cluster like yours!
At a guess, filesystem access via FUSE and FTP on a Pi is much slower than the OS expects, and you are seeing the effects of low-level timeouts; that is, from Dask's point of view, file reads appear to be failing. Dask needs access to storage for configuration and sometimes temporary files. You would want to make sure that these locations are on local storage or turned off. However, if this is happening during the import of modules, which you have on the shared drive by design, there may be no fixing it (Python loads many small files during import). Why not use rsync to move the files to the nodes?
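One way to test the slow-filesystem guess is to time many small reads on local storage and on the shared mount and compare them. The following is only a sketch: the function name is hypothetical, and reads that immediately follow writes may be served from a cache, so treat the numbers as a rough indication.

```python
import os
import tempfile
import time


def time_small_file_reads(directory, count=200, size=1024):
    """Write `count` small files under `directory`, read them back,
    and return the elapsed read time in seconds."""
    payload = b"x" * size
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        paths = []
        for i in range(count):
            path = os.path.join(scratch, f"probe_{i}.bin")
            with open(path, "wb") as fh:
                fh.write(payload)
            paths.append(path)
        start = time.perf_counter()
        for path in paths:
            with open(path, "rb") as fh:
                fh.read()
        return time.perf_counter() - start


# Only the local temp dir is probed here so the example runs anywhere.
# On a node, call the function a second time with your mount point.
local_seconds = time_small_file_reads(tempfile.gettempdir())
print(f"local: {local_seconds:.4f}s for 200 small reads")
```

If the shared mount is orders of magnitude slower than local disk, that supports moving Dask's temporary files and the Python modules onto local storage.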
I'm trying to deploy and run a docker image in a GCP VM instance.
I need it to access a certain Cloud Storage Bucket (read and write).
How do I mount a bucket inside the VM? How do I mount a bucket inside the Docker container running in my VM?
I've been reading google cloud documentation for a while, but I'm still confused. All guides show how to access a bucket from a local machine, and not how to mount it to VM.
https://cloud.google.com/storage/docs/quickstart-gsutil
Found something about Fuse, but it looks overly complicated for just mounting a single bucket to VM filesystem.
Google Cloud Storage is an object storage API, not a filesystem. As a result, it isn't really designed to be "mounted" within a VM. It is designed to be highly durable and scalable to extraordinarily large objects (and large numbers of objects).
Though you can use gcsfuse to mount it as a filesystem, that method has pretty significant drawbacks. For example, operations that are simple on a normal filesystem can be expensive in operation count.
Likewise, there are many surprising behaviors that are a result of the fact that it is an object store. For example, you can't edit objects -- they are immutable. To give the illusion of writing to the middle of an object, the object is, in effect, deleted and recreated whenever a call to close() or fsync() happens.
The best way to use GCS is to design your application to use the API (or the S3-compatible API) directly. That way the semantics are well understood by the application, and you can optimize for them to get better performance and control your costs. Thus, to access it from your docker container, ensure your container has a way to authenticate to GCS (either through the credentials on the instance, or by deploying a key for a service account with the necessary permissions to access the bucket), then have the application call the API directly.
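As a rough sketch of calling the API directly, assuming the google-cloud-storage Python package is installed in the container and credentials are available (for example from the VM's service account). The helper names write_and_read and make_client are hypothetical, as are any bucket and object names you pass in:

```python
def write_and_read(client, bucket_name, object_name, data):
    """Upload bytes to an object, then read them back via the API.

    `client` is expected to behave like google.cloud.storage.Client.
    """
    blob = client.bucket(bucket_name).blob(object_name)
    blob.upload_from_string(data)  # replaces the whole object
    return blob.download_as_bytes()


def make_client():
    """Create a real client; needs the google-cloud-storage package
    and credentials (e.g. the VM's service account)."""
    from google.cloud import storage  # imported lazily on purpose
    return storage.Client()
```

A call would look like write_and_read(make_client(), "my-bucket", "path/to/object", b"hello"), where the bucket and object names are placeholders. Passing the client in as a parameter keeps the read/write logic easy to exercise with a stand-in object during testing.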
Finally, if what you need is actually a filesystem, and not specifically GCS, Google Cloud does offer at least 2 other options if you need a large mountable filesystem that is designed for that specific use case:
Persistent Disk, which is the baseline filesystem that you get with a VM, but you can mount many of these devices on a single VM. However, you can't mount them read/write to multiple VMs at once -- if you need to mount to multiple VMs, the persistent disk must be read only for all instances they are mounted to.
Cloud Filestore is a managed service that provides an NFS server in front of a persistent disk. Thus, the filesystem can be mounted read/write and shared across many VMs. However it is significantly more expensive (as of this writing, about $0.20/GB/month vs $0.04/GB/month in us-central1) than PD, and there are minimum size requirements (1TB).
Google Cloud Storage buckets cannot be mounted in Google Compute instances or containers without third-party software such as FUSE. Neither Linux nor Windows has built-in drivers for Cloud Storage.
A GCP VM comes with the Google Cloud SDK installed, so even without mounting you can list files and copy them in and out using its commands.
gsutil ls gs://
I have mapped an Azure disk to one of the Kafka pods. I want to check the logs stored on the Azure disk. Is there any provision for this?
For your requirement, I think you can ssh into the pods to check the logs stored in the disk. Actually, Azure disk is a VHD file, so you cannot check the files inside it when it is not in the attached state.
This got sorted by SSHing into the specific node in the AKS cluster and checking the disk mounted to that