How to compute a CSV file stored locally on each worker in parallel, without using HDFS? - dask

The concept is the same as data locality on Hadoop, but I don't want to use HDFS.
I have 3 dask-workers.
I want to compute on a big CSV file, for example mydata.csv.
I split mydata.csv into small files (mydata_part_001.csv ... mydata_part_100.csv) and store them in the local folder /data on each worker,
e.g.
worker-01 stores mydata_part_001.csv - mydata_part_030.csv in local folder /data
worker-02 stores mydata_part_031.csv - mydata_part_060.csv in local folder /data
worker-03 stores mydata_part_061.csv - mydata_part_100.csv in local folder /data
How do I use dask to compute on mydata?
Thanks.

It is more common to use some sort of globally accessible file system. HDFS is one example of this, but several other network file systems exist as well. I recommend looking into these instead of managing your data yourself in this way.
However, if you want to do things this way then you are probably looking for Dask's worker resources, which allow you to target specific tasks to specific machines.
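For illustration, here is a minimal sketch (my own, not from the answer above) that pins each read to the worker holding the file locally; it uses the workers= keyword of Client.submit for the pinning, and worker resources could be used in a similar way. The scheduler address and worker names are placeholders you would replace with your own.

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # assumption: your scheduler address

# Map each worker (by name or address) to the partition files in its local /data
files_per_worker = {
    "worker-01": [f"/data/mydata_part_{i:03d}.csv" for i in range(1, 31)],
    "worker-02": [f"/data/mydata_part_{i:03d}.csv" for i in range(31, 61)],
    "worker-03": [f"/data/mydata_part_{i:03d}.csv" for i in range(61, 101)],
}

# Submit one read task per file, pinned to the worker that stores that file
futures = [
    client.submit(pd.read_csv, path, workers=[worker])
    for worker, paths in files_per_worker.items()
    for path in paths
]

# Futures are accepted in place of delayed objects, so they can be assembled
# into a single Dask DataFrame for further computation
ddf = dd.from_delayed(futures)
print(len(ddf))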

Related

kubeflow OutputPath/InputPath question when writing/reading multiple files

I have a data-fetch stage where I get multiple DFs and serialize them. I'm currently treating OutputPath as a directory - creating it if it doesn't exist, etc. - and then serializing all the DFs into that path with a different name for each DF.
In a subsequent pipeline stage (say, predict) I need to retrieve all of those through InputPath.
Now, from the documentation it seems InputPath/OutputPath are treated as files. Does kubeflow have any limitation if I use them as directories?
The ComponentSpec's {inputPath: input_name} and {outputPath: output_name} placeholders and their Python analogs (input_name: InputPath()/output_name: OutputPath()) are designed to support both files/blobs and directories.
They are expected to provide the path for the input/output data, no matter whether the data is a blob/file or a directory.
The only limitation is that the UX might not be able to preview such artifacts, but the pipeline itself will work.
I have experimented with a trivial pipeline and observed no issue when InputPath/OutputPath is used as a directory.
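As an illustration only (my own sketch, not from the answers above), here is a minimal example with the KFP v1 Python SDK where both components treat the path as a directory; the component and file names are hypothetical:

from kfp.components import InputPath, OutputPath, create_component_from_func

def fetch_data(dfs_path: OutputPath()):
    # Treat the output path as a directory and write several DataFrames into it
    import os
    import pandas as pd
    os.makedirs(dfs_path, exist_ok=True)
    for name in ("train", "test"):
        pd.DataFrame({"x": [1, 2, 3]}).to_csv(os.path.join(dfs_path, name + ".csv"), index=False)

def predict(dfs_path: InputPath()):
    # Read every file back from the directory produced by the upstream component
    import os
    import pandas as pd
    for fname in os.listdir(dfs_path):
        df = pd.read_csv(os.path.join(dfs_path, fname))
        print(fname, len(df))

fetch_data_op = create_component_from_func(fetch_data, packages_to_install=["pandas"])
predict_op = create_component_from_func(predict, packages_to_install=["pandas"])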

CatBoostError: catboost/libs/train_lib/dir_helper.cpp:20: Can't create train working dir:

I am running catboost on a Databricks cluster. The Databricks production cluster is very secure and we cannot create new directories on the fly as a user, but we can have pre-created directories. I am passing the below parameter for my CatBoostClassifier.
CatBoostClassifier(train_dir='dbfs/FileStore/files/')
It does not work and throws the below error.
CatBoostError: catboost/libs/train_lib/dir_helper.cpp:20: Can't create train working dir
You're missing a / character at the beginning - it should be '/dbfs/FileStore/files/' instead.
Also, writing to DBFS could be slow, and may fail if catboost uses random writes (see limitations). You may instead point to a local directory on the node, like /tmp/...., and then use dbutils.fs.cp("file:///tmp/....", "/FileStore/files/catboost/...", True) to copy the files from the local directory to DBFS.
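A minimal sketch of that workaround (my own illustration; the directory names are placeholders and the training data is assumed to exist already):

from catboost import CatBoostClassifier

# Train into local node storage instead of DBFS (path is illustrative)
model = CatBoostClassifier(iterations=100, train_dir="/tmp/catboost_train")
model.fit(X_train, y_train)  # assumes X_train / y_train are already defined

# Afterwards, copy the training artifacts from local disk to DBFS
# (dbutils is available in Databricks notebooks without an import)
dbutils.fs.cp("file:///tmp/catboost_train", "/FileStore/files/catboost_train", True)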

Copy folders into Azure data storage (azure data factory)

I am trying to copy folders and their files from FTP into Azure data storage by looping through the folders and, for each folder, copying its content into a container that has the folder's name. For this I used a Get Metadata, a ForEach, and a Copy Data component. For now I am able to copy all the folders into the same container, but what I want is to have multiple containers in the output, named after the folders, each one containing the files from FTP.
PS: I am still new to Azure Data Factory.
Any advice or help is very welcome :)
You need to add a Get Metadata activity before the ForEach. The Get Metadata activity will list the files in the current directory and pass them to the ForEach. Connect it to your Blob storage folder.
Try something like this.
Set up a JSON source:
Create a pipeline and use a Get Metadata activity to list all the folders in the container/storage. Select childItems in the field list.
Feed the Get Metadata output (the list of container contents) into a Filter activity and keep only the folders.
Input the list of folders to a ForEach activity.
Inside the ForEach, set the current item() to a variable and use it as a parameter for a parameterized source dataset, which is a clone of the original source.
This results in listing the files inside each folder of your container.
Feed this to another Filter activity, and this time filter on files, using @equals(item().type,'File').
Now create another pipeline in which the copy activity runs for each file whose name matches that of its parent folder.
Create parameters in the new child pipeline to receive the current folder and file name of the iteration from the parent pipeline, so they can be evaluated for the copy.
Inside the child pipeline, start with a ForEach whose input is the list of filenames inside the folder, received via the parameter: @pipeline().parameters.filesnamesreceived
Use a variable to hold the current item and use an If Condition to check whether the filename and folder name match.
Note: consider dropping the file extension, as per your requirement, since the metadata holds the complete file name along with its extension.
If True -> the names match; copy from source to sink.
Here the hierarchy is preserved, and you can also use "Prefix" to specify the file path, as it copies while preserving the hierarchy. It uses the service-side filter for Blob storage, which gives better performance than a wildcard filter.
The sub-path after the last "/" in the prefix is preserved. For example, if you have source container/folder/subfolder/file.txt and configure the prefix as folder/sub, then the preserved file path is subfolder/file.txt, which fits your scenario.
This copies files like /source/source/source.json to /sink/source/source.json.
AzCopy is a simpler solution for this than Data Factory, and a dry run can be used to check which files/folders will be copied.
az storage blob copy start \
--destination-container destContainer \
--destination-blob myBlob \
--source-account-name mySourceAccount \
--source-account-key mySourceAccountKey \
--source-container myContainer \
--source-blob myBlob
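If a scripted approach is acceptable, the same end result (one container per top-level folder) can also be sketched with the azure-storage-blob Python SDK. This is my own illustration, not part of the answers above; the connection string and container names are placeholders, and the server-side copy assumes the source blobs are readable (for example via a SAS URL).

from azure.storage.blob import BlobServiceClient

svc = BlobServiceClient.from_connection_string("<connection-string>")  # placeholder
source = svc.get_container_client("source")  # container holding folder/file blobs

for blob in source.list_blobs():
    folder, sep, filename = blob.name.partition("/")
    if not sep:  # skip blobs that are not inside a folder
        continue
    dest = svc.get_container_client(folder.lower())  # container names must be lowercase
    if not dest.exists():
        dest.create_container()
    # Server-side copy of the blob into the container named after its folder
    src_url = source.get_blob_client(blob.name).url
    dest.get_blob_client(filename).start_copy_from_url(src_url)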

Are absolute paths safe to use in Bazel?

I am experimenting with Bazel, to be added alongside an old, make/shell based build system. I can easily write shell commands which return an absolute path to some tool or library built by the old build system as an early prerequisite. These commands I can use in a genrule(), which copies the needed files (like headers and libs) into Bazel proper, to be exposed in the form of a cc_library().
I found out that genrule() does not detect a dependency if the command uses a file with an absolute path - it is not caught by the sandbox. In a way I am (ab)using that behavior.
Is it safe? Will some future update of Bazel refuse access to files referenced by absolute path in that way in a genrule command?
Most of Bazel's sandboxes allow access to most paths outside of the source tree by default. Details depend on which sandbox implementation you're using. The docker sandbox, for example, allows access to all those paths inside of a docker image. It's kind of hard to make promises about future Bazel versions, but I think it's unlikely that a sandbox will prevent accessing /bin/bash (for example), which means other absolute paths will probably continue to work too.
--sandbox_block_path can be used to explicitly block a path if you want.
If you always have the files available on every machine you build on, your setup should work. Keep in mind that Bazel will not recognize when the contents of those files change, so you can easily get stale results in various caches. You can avoid that by ensuring the external paths change whenever their contents do.
new_local_repository might be a better fit to avoid those problems, if you know the paths ahead of time.
If you don't know the paths ahead of time, you can write a custom repository rule which runs arbitrary commands via repository_ctx.execute to retrieve the paths and then symlinks them in with repository_ctx.symlink.
Tensorflow's third_party/sycl/sycl_configure.bzl has an example of doing something similar (you would do something other than looking at environment variables like find_computecpp_root does, and you might symlink entire directories instead of all the files in them):
def _symlink_dir(repository_ctx, src_dir, dest_dir):
  """Symlinks all the files in a directory.

  Args:
    repository_ctx: The repository context.
    src_dir: The source directory.
    dest_dir: The destination directory to create the symlinks in.
  """
  files = repository_ctx.path(src_dir).readdir()
  for src_file in files:
    repository_ctx.symlink(src_file, dest_dir + "/" + src_file.basename)

def find_computecpp_root(repository_ctx):
  """Find ComputeCpp compiler."""
  sycl_name = ""
  if _COMPUTECPP_TOOLKIT_PATH in repository_ctx.os.environ:
    sycl_name = repository_ctx.os.environ[_COMPUTECPP_TOOLKIT_PATH].strip()
  if sycl_name.startswith("/"):
    return sycl_name
  fail("Cannot find SYCL compiler, please correct your path")

def _sycl_autoconf_imp(repository_ctx):
  <snip>
  computecpp_root = find_computecpp_root(repository_ctx)
  <snip>
  _symlink_dir(repository_ctx, computecpp_root + "/lib", "sycl/lib")
  _symlink_dir(repository_ctx, computecpp_root + "/include", "sycl/include")
  _symlink_dir(repository_ctx, computecpp_root + "/bin", "sycl/bin")

Is it possible to read a .tiff file from a remote service with dask?

I'm storing .tiff files on Google Cloud Storage. I'd like to manipulate them using a distributed Dask cluster installed with Helm on Kubernetes.
Based on the dask-image repo, the Dask documentation on remote data services, and the use of storage_options, right now it looks like remote reads from .zarr, .tdb, .orc, .txt, .parquet, and .csv formats are supported. Is that correct? If so, is there any recommended workaround for accessing remote .tiff files?
There are many ways to do this. I would probably use a library like skimage.io.imread along with dask.delayed to read the TIFF files in parallel and then arrange them into a Dask Array.
I encourage you to take a look at this blogpost on loading image data with Dask, which does something similar.
I believe that the skimage.io.imread function will happily read data from a URL, although it may not know how to interoperate with GCS. If the data on GCS is also available by a public URL (this is easy to do if you have access to the GCS bucket) then that would be easy. Otherwise you might use the gcsfs library to get the bytes from the file and then feed those bytes into some Python image reader.
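As an illustration of that delayed pattern (my own sketch, not from the answer; the file names, shape, and dtype are placeholders):

import dask
import dask.array as da
from skimage.io import imread

filenames = [f"image_{i:03d}.tif" for i in range(10)]  # placeholder paths or URLs
sample = imread(filenames[0])  # read one image eagerly to learn shape and dtype

lazy_arrays = [
    da.from_delayed(dask.delayed(imread)(fn), shape=sample.shape, dtype=sample.dtype)
    for fn in filenames
]
stack = da.stack(lazy_arrays, axis=0)  # one lazy array covering all images
print(stack)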
Building off @MRocklin's answer, I found two ways to do it with gcsfs. One way with imageio for image parsing:
import gcsfs
import imageio

fs = gcsfs.GCSFileSystem(project="project_name")
img_bytes = fs.cat("bucket/blob_name.tif")
img = imageio.core.asarray(imageio.imread(img_bytes, "TIFF"))
And another with opencv-python for image parsing:
import gcsfs
import cv2
import numpy as np

fs = gcsfs.GCSFileSystem(project="project_name")
fs.get("bucket/blob_name.tif", "local.tif")
img = np.asarray(cv2.imread("local.tif", cv2.IMREAD_UNCHANGED))
