I want to create a 3D dask array from data that I have that is already chunked. My data consists of 216 blocks of 1024x1024x1024 uint8 voxels each, each stored as a compressed HDF5 file with a single key called data. Compressed, my data is only a few megabytes per block, but decompressed, it takes 1GB per block. Furthermore, my data is currently stored in Google Cloud Storage (GCS), although I could potentially mirror it locally inside a container.
I thought the easiest way would be to use zarr, following these instructions (https://pangeo.io/data.html). Would xarray have to decompress my data before saving to zarr format? Would it have to shuffle data and try to communicate across blocks? Is there a lower-level way of assembling a zarr from HDF5 blocks?
There are a few questions there, so I will try to be brief and hope that some edits can flesh out details I may have omitted.
You do not need to do anything to view your data as a single dask array: you can reference the individual chunks as arrays (see here) and then use the stack/concatenate functions to build them up into a single array. That does mean opening every file on the client in order to read the metadata, though.
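For concreteness, here is a minimal sketch of that approach, assuming local copies of the HDF5 files, a 6x6x6 arrangement of the 216 blocks, and hypothetical file names:

```python
import h5py
import dask.array as da

# Hypothetical local file names; 216 blocks arranged as a 6 x 6 x 6 grid,
# each holding a 1024**3 uint8 dataset under the key "data".
nested = [
    [
        [
            da.from_array(
                h5py.File(f"block_{i}_{j}_{k}.h5", "r")["data"],
                chunks=(1024, 1024, 1024),
            )
            for k in range(6)
        ]
        for j in range(6)
    ]
    for i in range(6)
]

# da.block performs the nested concatenation, giving one
# (6144, 6144, 6144) dask array with one chunk per HDF5 block.
volume = da.block(nested)
```

Note that the files must stay open for the lifetime of the dask array, and nothing is decompressed until you actually compute something.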
Similarly, xarray has functions for reading sets of files, provided you can assume consistency of dtype and dimensionality - please see their docs.
As far as zarr is concerned, you could use dask to create the set of files for you (on GCS or not), choosing the same chunking scheme as the input - then there will be no shuffling. Since zarr is very simple to set up and understand, you could even create the zarr dataset yourself and write the chunks one by one, without having to build the dask array up front from the HDF5 files. That would normally be done via the zarr API, and writing a chunk of data does not require any change to the metadata file, so it can be done in parallel. In theory, you could simply copy a block in directly, if you understood the low-level data representation (e.g., uint8 in C-array layout); however, I don't know how likely it is that the exact same compression mechanism will be available in both the original HDF5 and zarr (see here).
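As an example, here is a minimal sketch of creating the zarr array once and then copying blocks in one at a time; the bucket path, block indexing and 6x6x6 layout are assumptions. Each call writes exactly one chunk and never touches the metadata, so the copies can safely run in parallel:

```python
import h5py
import zarr
import gcsfs  # assuming the target lives on GCS

# Create the target once, with the same chunking as the HDF5 blocks.
store = gcsfs.GCSFileSystem().get_mapper("my-bucket/volume.zarr")  # hypothetical path
z = zarr.open(
    store, mode="w",
    shape=(6144, 6144, 6144), chunks=(1024, 1024, 1024), dtype="u1",
)

def copy_block(path, i, j, k):
    """Decompress one HDF5 block and write it into its (i, j, k) slot."""
    with h5py.File(path, "r") as f:
        z[
            i * 1024:(i + 1) * 1024,
            j * 1024:(j + 1) * 1024,
            k * 1024:(k + 1) * 1024,
        ] = f["data"][:]
```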
I want to load a model (an ANNOY model) in Dask. The size of the model is 60GB and the available Dask RAM is only 2GB. Is there a way to load the model in a distributed manner as well?
If by "load" you mean: "store in memory", then obviously there is no way to do this. If you need access to the whole dataset in memory at once, you'll need a machine that can handle this.
However, you very probably meant that you want to do some processing of the data and get a result (a prediction, a statistical score...) which does fit in memory.
Since I don't know what ANNOY is (array? dataframe? something else?), I can only give you general rules. For dask to work, it needs to be able to split a job into tasks. For data IO, this commonly means that the input is in multiple files, or that the files have some natural internal structure such that they can be loaded chunk-wise. For example, zarr (for arrays) stores each chunk of a logical dataset as a separate file, parquet (for dataframes) chunks up data into pages within columns within groups within files, and even CSV can be loaded chunkwise by looking for newline characters.
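For illustration, this is roughly what "loadable chunk-wise" looks like in dask; the paths here are hypothetical:

```python
import dask.dataframe as dd
import dask.array as da

# One task per ~64MB block of text, split cleanly on newlines.
df = dd.read_csv("s3://bucket/data-*.csv", blocksize="64MB")

# One task per zarr chunk, each chunk stored as its own object/file.
arr = da.from_zarr("s3://bucket/data.zarr")
```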
I suspect annoy ( https://github.com/spotify/annoy ?) has a complex internal storage structure, and you may need to raise an issue on their repo asking about dask support.
I am currently trying to load a big multi-dimensional array (>5 GB) into a Python script. Since I use the array as training data for a machine learning model, it is important to load the data efficiently in mini-batches while avoiding loading the whole data set into memory at once.
My idea was to use the xarray library.
I load the data set with X=xarray.open_dataset("Test_file.nc"). To the best of my knowledge, this command does not load the data set in memory - so far, so good. However, I want to convert X to an array with the command X=X.to_array().
My first question is: Does X=X.to_array() load it into memory or not?
If that is done, I wonder how best to load mini-batches into memory. The shape of the array is (variable, datetime, x1_position, x2_position). I want to load mini-batches per datetime, which would lead to:
ind=np.random.randint(low=0,high=n_times,size=(BATCH_SIZE))
mini_batch=X[:,ind]
The other approach would be to transpose the array first with X.transpose("datetime","variable","x1_position","x2_position") and then sample via:
ind=np.random.randint(low=0,high=n_times,size=(BATCH_SIZE))
mini_batch=X[ind,:]
My second question is:
Does transposing an xarray affect the efficiency of indexing? More specifically, does X[ind,:] take as long as X[:,ind]?
My first question is: Does X=X.to_array() load it into memory or not?
xarray makes use of dask to chunk (load) parts of the data into memory. You can compare X by opening it with and without an explicit chunks argument:
X = xarray.open_dataset("Test_file.nc")
# or
X = xarray.open_dataset("Test_file.nc",
chunks={'datetime':1, 'x1_position':x1_count, 'x2_position':x2_count})
and inspect the differences between the loaded datasets with print(X), then specify the chunks accordingly.
The latter means loading only one datetime slice of data into memory at a time. I don't think you need X=X.to_array(), but you can also compare the results after to_array(). My experience is that to_array() does not change the actual chunking (loading), only the view of the data.
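As a sketch (dimension names taken from the question, batch size hypothetical), you can check the laziness yourself:

```python
import numpy as np
import xarray as xr

# Opening with chunks gives a dask-backed dataset; nothing is read yet.
X = xr.open_dataset("Test_file.nc", chunks={"datetime": 1})
arr = X.to_array()          # still lazy; just adds a leading "variable" dimension
print(arr.chunks)           # inspect how the data will be loaded

# Pull one random mini-batch of datetime slices into memory.
ind = np.random.randint(0, arr.sizes["datetime"], size=32)
mini_batch = arr.isel(datetime=ind).load()   # only now are those chunks read
```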
My second question is: Does transposing an xarray affect the efficiency of indexing? More specifically, does X[ind,:] take as long as X[:,ind]?
I think one goal of xarray is to let users forget the details of the underlying implementation (based on numpy). Transposing typically only modifies the view rather than the underlying layout of the data. There certainly are some efficiency differences between the two indexing approaches, depending on which one accesses data along contiguous memory, but the difference should not be a significant overhead. Feel free to use both.
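Continuing the sketch above (reusing arr and ind), both orders can be tried directly; transpose() on a dask-backed DataArray is lazy and only relabels the axes:

```python
arr_t = arr.transpose("datetime", "variable", "x1_position", "x2_position")

batch_a = arr[:, ind]     # positional indexing, datetime as the second axis
batch_b = arr_t[ind, :]   # positional indexing, datetime as the first axis
# Both select the same on-disk chunks; with name-based indexing
# (arr.isel(datetime=ind)) the axis order does not matter at all.
```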
I have some large files in a local binary format, which contains many 3D (or 4D) arrays as a series of 2D chunks. The order of the chunks in the files is random (could have chunk 17 of variable A, followed by chunk 6 of variable B, etc.). I don't have control over the file generation, I'm just using the results. Fortunately the files contain a table of contents, so I know where all the chunks are without having to read the entire file.
I have a simple interface to lazily load this data into dask, and re-construct the chunks as Array objects. This works fine - I can slice and dice the array, do calculations on them, and when I finally compute() the final result the chunks get loaded from file appropriately.
However, the order that the chunks are loaded is not optimal for these files. If I understand correctly, for tasks where there is no difference of cost (in terms of # of dependencies?), the local threaded scheduler will use the task keynames as a tie-breaker. This seems to cause the chunks to be loaded in their logical order within the Array. Unfortunately my files do not follow the logical order, so this results in many seeks through the data (e.g. seek halfway through the file to get chunk (0,0,0) of variable A, then go back near the beginning to get chunk (0,0,1) of variable A, etc.). What I would like to do is somehow control the order that these chunks get read, so they follow the order in the file.
I found a kludge that works for simple cases, by creating a callback function on the start_state. It scans through the tasks in the 'ready' state, looking for any references to these data chunks, then re-orders those tasks based on the order of the data on disk. Using this kludge, I was able to speed up my processing by a factor of 3. I'm guessing the OS is doing some kind of read-ahead when the file is being read sequentially, and the chunks are small enough that several get picked up in a single disk read. This kludge is sufficient for my current usage, however, it's ugly and brittle. It will probably work against dask's optimization algorithm for complex calculations. Is there a better way in dask to control which tasks win in a tie-breaker, in particular for loading chunks from disk? I.e., is there a way to tell dask, "all things being equal, here's the relative order I'd like you to process this group of chunks?"
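For reference, a rough sketch of the kind of callback described above; disk_offset and toc_offsets are hypothetical stand-ins for a lookup built from the file's table of contents, and you should verify which end of the ready list your dask version pops from:

```python
from dask.callbacks import Callback

def disk_offset(key):
    # Hypothetical: map a load-task key (e.g. ("load-block", 0, 0, 1)) to the
    # byte offset of that chunk in the file, via the file's table of contents.
    return toc_offsets[key]  # toc_offsets: hypothetical dict built from the TOC

class DiskOrder(Callback):
    def _start_state(self, dsk, state):
        # 'ready' holds the keys of runnable tasks. The local scheduler treats
        # it as a stack (popping from the end), so place the chunk with the
        # smallest file offset last so it is picked up first.
        loads = [k for k in state["ready"] if k in toc_offsets]
        others = [k for k in state["ready"] if k not in toc_offsets]
        state["ready"][:] = others + sorted(loads, key=disk_offset, reverse=True)

# Usage (arr being your lazily constructed dask array):
# with DiskOrder():
#     arr.sum().compute(scheduler="threads")
```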
Your assessment is correct. As of 2018-06-16 there is not currently any way to add a final tie-breaker. In the distributed scheduler (which works fine on a single machine) you can provide explicit priorities with the priority= keyword, but these take precedence over all other considerations.
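For completeness, this is roughly what the priority= route looks like with the distributed scheduler; the random array here is just a stand-in for your lazily loaded data:

```python
from dask.distributed import Client
import dask.array as da

client = Client()  # the distributed scheduler also works fine on one machine

x = da.random.random((8192, 8192), chunks=(1024, 1024))
total = x.sum()

# priority= assigns an explicit priority to the tasks of this computation;
# note it overrides the scheduler's ordering rather than acting as a tie-breaker.
future = client.compute(total, priority=10)
print(future.result())
```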
I have input data stored as a single large file on S3.
I want Dask to chop the file automatically, distribute to workers and manage the data flow. Hence the idea of using distributed collection, e.g. bag.
On each worker I have a command-line tool (Java) that reads the data from file(s). Therefore I'd like to write a whole chunk of data to a file, call the external CLI/code to process it, and then read the results from an output file. This looks like processing batches of data instead of record-at-a-time.
What would be the best approach to this problem? Is it possible to write partition to disk on a worker and process it as a whole?
PS. It is not necessary, but desirable, to stay in a distributed-collection model, because other operations on the data might be simpler Python functions that process data record by record.
You probably want the read_bytes function. This breaks the file into many chunks cleanly split by a delimiter (like a newline). It gives you back a list of dask.delayed objects that point to those blocks of bytes.
There is more information on this documentation page: http://dask.pydata.org/en/latest/bytes.html
Here is an example from the docstring:
>>> sample, blocks = read_bytes('s3://bucket/2015-*-*.csv', delimiter=b'\n')
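Building on that, here is a sketch of the "write a partition to disk, run the external tool, read its output" pattern; the java command and tool.jar are placeholders for your CLI:

```python
import subprocess
import tempfile

import dask
from dask.bytes import read_bytes

sample, blocks = read_bytes("s3://bucket/2015-*-*.csv", delimiter=b"\n")

@dask.delayed
def run_cli(data):
    # Write one whole partition to a temporary file, run the external
    # (placeholder) Java tool on it, and return the tool's stdout.
    with tempfile.NamedTemporaryFile(suffix=".csv") as f:
        f.write(data)
        f.flush()
        proc = subprocess.run(
            ["java", "-jar", "tool.jar", f.name],
            check=True, capture_output=True,
        )
    return proc.stdout

# read_bytes returns one list of delayed blocks per matched file; flatten them.
results = [run_cli(block) for per_file in blocks for block in per_file]
outputs = dask.compute(*results)
```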
I am a new user of Hadoop and MapReduce, and I would like to create a MapReduce job to compute some measures on images. That is why I would like to know whether I can pass an image as input to MapReduce. If yes, is there any kind of example?
Thanks
No... you cannot pass an image directly to a MapReduce job, as it uses specific datatypes optimized for network serialization. I am not an image processing expert, but I would recommend having a look at the HIPI framework. It allows image processing on top of the MapReduce framework in a convenient manner.
Or if you really want to do it the native Hadoop way, you could first convert the image file into a Hadoop SequenceFile and then use the SequenceFileInputFormat to process it.
Yes, you can totally do this.
With the limited information provided, I can only give you a very general answer.
Either way, you'll need to:
1) Write a custom InputFormat that, instead of handing out chunks of files at HDFS locations (like TextInputFormat and SequenceFileInputFormat do), passes each map task the image's HDFS path name. Reading the image from that won't be too hard.
If you plan to have a Reduce phase in which images are passed around through the framework, you'll need to:
2) Make an "ImageWritable" class that implements Writable (or WritableComparable if you're keying on the image). In your write() method, serialize your image to a byte array: first write an int/long to the output giving the size of the array, then write the array itself as bytes.
In your readFields() method, read the int/long first (which gives the size of the image payload), create a byte array of that size, and then read that many bytes fully into the array.
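To make the byte layout in 2) concrete, here is the same length-prefixed scheme sketched in Python, purely as an illustration; in Hadoop this would live in ImageWritable's write() and readFields() using DataOutput/DataInput:

```python
import struct

def write_image(out, image_bytes):
    # First the payload size as an 8-byte big-endian integer (the "int/long"),
    # then the raw image bytes themselves.
    out.write(struct.pack(">q", len(image_bytes)))
    out.write(image_bytes)

def read_image(inp):
    # Read the length prefix, then exactly that many bytes of image payload.
    (length,) = struct.unpack(">q", inp.read(8))
    return inp.read(length)
```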
I'm not entirely sure what you're doing, but that's how I'd go about it.