Computed result of Dask delayed objects does not give a proper DataFrame - dask

I tried to use dask delayed to speed up a loop; the iteration is done with the map function. The problem is that after dd.compute() the result list is wrapped in an extra level of brackets (a one-element tuple), so I cannot build a proper DataFrame from it. Does anyone have a solution?
def combine(val):
    a = delayed(rss)(val)
    b = delayed(altman)(val)
    df = {'Tiker': val, 'RS': a, 'Alt': b}
    return df

vals = tickers
df = map(combine, vals)
df = dd.compute(df)
df
Output:
([{'Tiker': 'ABDA.JK', 'RS': 0.75, 'Alt': 4.1937988034309255},
{'Tiker': 'ABMM.JK', 'RS': 1.75, 'Alt': 6320.155816168163},
{'Tiker': 'ACES.JK', 'RS': 0.44, 'Alt': 7.431649213502305}],)

This may help beginners like me: the extra level of brackets can be removed by flattening one level of nesting:
from itertools import chain

def flatten(listOfLists):
    # Flatten one level of nesting
    return chain.from_iterable(listOfLists)

# lst is assumed to be the one-element tuple returned by compute()
lst = list(flatten(lst))
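For context, compute returns a tuple with one entry per argument passed to it, so passing a single list gives back a one-element tuple. A minimal sketch of unwrapping it, using the output shown in the question as a literal value so that it runs without Dask:

```python
# The value printed in the question: a one-element tuple wrapping the list.
result = (
    [
        {"Tiker": "ABDA.JK", "RS": 0.75, "Alt": 4.1937988034309255},
        {"Tiker": "ABMM.JK", "RS": 1.75, "Alt": 6320.155816168163},
        {"Tiker": "ACES.JK", "RS": 0.44, "Alt": 7.431649213502305},
    ],
)

# Tuple unpacking removes the outer level; it also fails loudly if the
# tuple ever holds more than one element.
(rows,) = result

print(type(rows).__name__, len(rows))
print(rows[0]["Tiker"])
```

A list of dicts like rows is a shape that pd.DataFrame(rows) accepts directly; result[0] is an equivalent way to unwrap the tuple.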

Iterate on (or access directly) xarray chunks

I'm after a way to iterate on xarray chunks, so something similar to dask.array.blocks but that would give me access to xarray chunks with coordinates and dimensions.
For the record, I'm aware that xarray.map_blocks exists, but what I'm doing maps input chunks to output chunks of unknown shape, so I'd like to write something custom by looping directly on the xarray chunks.
I've tried to look into the xarray.map_blocks source code, since I guess something similar to what I need is in there, but I had a hard time understanding what's going on there.
EDIT:
My use case is that I would like, for each xarray chunk, to get an output xarray chunk of variable length along a new dimension (called foo below), and eventually concatenate them along foo.
This is a mocked scenario that should at least clarify what I'm after.
For now I've solved the problem constructing, from each dask chunk of the DataArray, an "xarray" chunk (but this looks quite convoluted), and then using client.map(fn_on_chunk, xarray_chunks).
import numpy as np
import xarray as xr

n = 1000
x_raster = y_raster = np.arange(n)
time = np.arange(10)
vals_raster = np.arange(n*n*10).reshape(n, n, 10)
# dims are given explicitly; recent xarray versions do not infer them from a dict of coords
da_raster = xr.DataArray(vals_raster, dims=("y", "x", "time"), coords={"y": y_raster, "x": x_raster, 'time': time})
da_raster = da_raster.chunk(dict(x=100, y=100))

def fn_on_chunk(da_chunk):
    # Tried to replicate the fact that I can't know in advance
    # the length of one dimension of the output
    len_range = np.random.randint(10)
    outs = []
    for foo in range(len_range):
        # Do some magic that finds needed coordinates
        # on this particular chunk (fn_magic is a placeholder)
        x_chunk, y_chunk = fn_magic(foo)
        out = da_chunk.sel(x=x_chunk, y=y_chunk)
        out['foo'] = foo
        outs.append(out)
    return xr.concat(outs, dim='foo')
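One way to loop over chunks while keeping coordinates is to turn the chunk sizes into index slices and pass each set of slices to isel. The sketch below only computes the slices, using the standard library. chunk_slices is a hypothetical helper name, and it assumes the chunk layout is available as a mapping from dimension name to chunk lengths (the shape of DataArray.chunksizes in recent xarray versions).

```python
from itertools import accumulate, product


def chunk_slices(chunksizes):
    """Yield one {dim: slice} dict per chunk.

    `chunksizes` maps each dimension name to a tuple of chunk lengths,
    e.g. {"y": (100, 100), "x": (100, 100)}.
    """
    dims = list(chunksizes)
    per_dim = []
    for dim in dims:
        # Cumulative sums of the chunk lengths give the stop offsets.
        stops = list(accumulate(chunksizes[dim]))
        starts = [0] + stops[:-1]
        per_dim.append([slice(a, b) for a, b in zip(starts, stops)])
    # Every combination of per-dimension slices is one chunk.
    for combo in product(*per_dim):
        yield dict(zip(dims, combo))


# Small stand-in for the chunk layout of a chunked DataArray.
layout = {"y": (2, 2), "x": (3, 1)}
blocks = list(chunk_slices(layout))
print(len(blocks))
print(blocks[0])
```

With a chunked array, each block would then be da_raster.isel(**sl) for sl in chunk_slices(da_raster.chunksizes); each such block is still a DataArray carrying its own coordinates and dimensions.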

pyarrow - identify the fragments written or filters used when writing a parquet dataset?

My use case is that I want to pass the file paths or filters to a task in Airflow as an xcom so that my next task can read the data which was just processed.
Task A writes a table to a partitioned dataset and a number of Parquet file fragments are generated --> Task B reads those fragments later as a dataset. I need to only read relevant data though, not the entire dataset which could have many millions of rows.
I have tested two approaches:
1. List modified files right after I finish writing to the dataset. This will provide me with a list of paths which I can call ds.dataset(paths) on during my next task. I can use partitioning.parse() on these paths or check the fragments to get a list of filters used (frag.partition_expression).
   A flaw with this is that I can have files being written in parallel to the same dataset.
2. I can generate the filters used when writing the dataset by turning the table into a pandas dataframe, doing a groupby, and then constructing filters. I am not sure if there is a simpler approach to this. I can then use pq._filters_to_expression() on the results to create a usable filter.
   This is not ideal since I need to fix certain data types which do not get saved properly as an Airflow xcom (no pickling, so everything has to be in JSON format). Also, if I want to partition on a dictionary column, I might need to tweak this function.
def create_filter_list(df, partition_columns):
    """Creates a list of pyarrow filters to be sent through an xcom and evaluated as an expression. Xcom disables pickling, so we need to save timestamp and date values as strings and convert downstream"""
    filter_list = []
    value_list = []
    partition_keys = [df[col] for col in partition_columns]
    for keys, _ in df[partition_columns].groupby(partition_keys):
        if len(partition_columns) == 1:
            if is_jsonable(keys):
                value_list.append(keys)
            elif keys is not None:
                value_list.append(str(keys))
        else:
            if not isinstance(keys, tuple):
                keys = (keys,)
            read_filter = []
            for name, val in zip(partition_columns, keys):
                if type(val) == np.int_:
                    read_filter.append((name, "==", int(val)))
                elif val is not None:
                    read_filter.append((name, "==", str(val)))
            filter_list.append(read_filter)
    if len(partition_columns) == 1:
        if len(value_list) > 0:
            filter_list = [(name, "in", value_list) for name in partition_columns]
    return filter_list
Any suggestions on which approach I should take, or is there a better way to achieve my goal?
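The is_jsonable helper called in create_filter_list is not shown in the question. A minimal sketch of what such a helper might look like, assuming it simply tests whether json.dumps accepts the value:

```python
import json
from datetime import date


def is_jsonable(value):
    """Return True if `value` can be serialized by json.dumps as-is."""
    try:
        json.dumps(value)
        return True
    except (TypeError, ValueError, OverflowError):
        return False


print(is_jsonable("2021-01-01"))      # plain strings serialize fine
print(is_jsonable(3))                 # so do plain ints
print(is_jsonable(date(2021, 1, 1)))  # date objects are rejected by json.dumps
```

Values that fail this check are the ones the function above converts with str() before they are sent through the xcom.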
You can watch this issue (https://issues.apache.org/jira/browse/ARROW-10440), which I believe covers what you want. In the meantime, you could use basename_template as a workaround.
import glob
import os
import pyarrow as pa
import pyarrow.dataset as pads

class TrackingWriter:
    def __init__(self):
        self.counter = 0
        part_schema = pa.schema({'part': pa.int64()})
        self.partitioning = pads.HivePartitioning(part_schema)

    def next_counter(self):
        result = self.counter
        self.counter += 1
        return result

    def write_dataset(self, table, base_dir):
        counter = self.next_counter()
        pads.write_dataset(table, base_dir, format='parquet', partitioning=self.partitioning, basename_template=f'batch-{counter}-part-{{i}}')
        files_written = glob.glob(os.path.join(base_dir, '**', f'batch-{counter}-*'))
        return files_written

table_one = pa.table({'part': [0, 0, 1, 1], 'val': [1, 2, 3, 4]})
table_two = pa.table({'part': [0, 0, 1, 1], 'val': [5, 6, 7, 8]})
writer = TrackingWriter()
print(writer.write_dataset(table_one, '/tmp/mydataset'))
print(writer.write_dataset(table_two, '/tmp/mydataset'))
This is just a rough sketch. You'd probably also want code to run at startup to see what the next free value of counter is. Or you could use a uuid instead of a counter.
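The uuid variant mentioned above could look roughly like this. It is only a sketch: new_batch_names is a hypothetical helper, and it builds just the two strings (the basename_template to pass to write_dataset and the matching glob pattern) without touching pyarrow.

```python
import os
import uuid


def new_batch_names(base_dir):
    """Return (basename_template, glob_pattern) for one write call."""
    batch_id = uuid.uuid4().hex  # unique per call, no shared counter needed
    # "{i}" is left literal: pyarrow replaces it with the file index.
    template = f"batch-{batch_id}-part-{{i}}"
    pattern = os.path.join(base_dir, "**", f"batch-{batch_id}-*")
    return template, pattern


template, pattern = new_batch_names("/tmp/mydataset")
print(template)
print(pattern)
```

Because the identifier is generated per call, parallel writers do not need to coordinate on the next free counter value, and no startup scan of the dataset directory is required.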
A suggestion (not sure if this is optimal for your use case or not):
The key problem is the need to correctly select a subset of the data, and this can be 'fixed' upstream. The function/script that updates the big dataframe can contain a condition to save a temporary copy of the data that is modified and satisfies some requirements in a separate (temporary) path. This file would then be passed to the downstream tasks, which can delete the temporary file once it's processed.

reduce_max function in tensorflow

Screenshot
>>> boxes = tf.random_normal([5])
>>> with s.as_default():
...     s.run(boxes)
...     s.run(keras.backend.argmax(boxes, axis=0))
...     s.run(tf.reduce_max(boxes, axis=0))
...
array([ 0.37312034, -0.97431135, 0.44504794, 0.35789603, 1.2461706 ],
dtype=float32)
3
0.856236
Why am I getting 0.8562? I expect the value to be 1.2461, since that is the largest element, right?
I get the correct answer if I use tf.constant, but not when I use random_normal.
With random_normal, a new boxes is generated each time you call s.run(), so your three results come from three different samples. If you want consistent results, you should call s.run() only once and fetch everything in that call.
result = s.run([boxes, keras.backend.argmax(boxes, axis=0), tf.reduce_max(boxes, axis=0)])
print(result[0])
print(result[1])
print(result[2])
# prints:
[ 0.69957364 1.3192859 -0.6662426 -0.5895929 0.22300807]
1
1.3192859
In addition, the code should be given in text format rather than picture format.
TensorFlow in graph mode is different from numpy because it works with symbolic operations. That means when you create the random_normal, you don't get numeric values but a symbolic op that samples from a normal distribution, so each time you evaluate it you get different numbers.
Every separate s.run() call that depends on this op, through any other operation, draws a fresh sample, and that explains the results you see.
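The same effect can be reproduced without TensorFlow. The following is a plain-Python analogy, not TensorFlow code: boxes is a hypothetical function standing in for the random op, and calling it plays the role of s.run.

```python
import random


def boxes():
    """Stand-in for a random op: every evaluation draws new values."""
    return [random.gauss(0.0, 1.0) for _ in range(5)]


# Separate evaluations, like separate s.run() calls: the maximum is
# taken over a different sample than the one obtained first.
first = boxes()
max_of_other_sample = max(boxes())

# One evaluation reused for everything, like a single s.run([...]) call.
sample = boxes()
argmax = max(range(len(sample)), key=sample.__getitem__)
maximum = max(sample)

print(sample)
print(argmax, maximum)
```

In the second half, the index and the maximum always refer to the same five numbers, which is what fetching all three tensors in one s.run call achieves.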

How to share the same index among multiple dask arrays

I'm trying to build a dask-based ipython application that holds a meta-class consisting of some sub-dask-arrays (all shaped (n_samples, dim_1, dim_2 ...)), and it should be able to select from the sub-dask-arrays through its getitem operator.
In the getitem method I call the da.Array.compute method (the code is still in its very early state), so that I am able to iterate over batches of the sub-arrays.
class MetaClass(object):
    ...
    def __getitem__(self, inds):
        new_m = MetaClass()
        inds = inds.compute()
        for name, var in vars(self).items():
            if isinstance(var, da.Array):
                try:
                    setattr(new_m, name, var[inds])
                except Exception as e:
                    print(e)
            else:
                setattr(new_m, name, var)
        return new_m

# Here I construct the meta-class to work with some directory.
m = MetaClass('/my/data/...')
# m.type is one of the sub-dask-arrays
m2 = m[m.type==2]
It works as expected and I get the sliced arrays, but memory consumption is huge, and I assume that in the background dask is copying the index for each sub-dask-array.
My question is: how do I achieve the same result without using so much memory?
(I tried not to "compute" the "inds" in getitem, but then I get nan-shaped arrays, which cannot be iterated over, and iteration is a must for the application.)
I have been thinking about three possible solutions and would be happy to be advised which of them is the "right" one for me (or to get another solution which I haven't thought of):
1. To use a Dask DataFrame, though I'm not sure how to fit multidimensional dask arrays into it (I would really appreciate some help or even a link that explains how to deal with multidimensional arrays in dd).
2. To forget about the entire MetaClass and use one dask array with a nasty dtype (something like [("type",int,(1,)),("images",np.uint8,(1000,1000))]). Again, I'm not familiar with this and would really appreciate some help with it (I tried to google it, but it's a bit complicated).
3. To share the index as a global inside the calling function (getitem) using property and its getter mechanism (https://docs.python.org/2/library/functions.html#property). The big downside here is that I lose the types of the arrays (a big drawback for representation and everything that needs anything but the data itself).
Thanks in advance!!!
One can call map_blocks on each sub-array with a shared function that keeps the indices in memory.
Here is an example:
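import numpy as np

# inds is assumed to be the computed boolean mask over the first axis,
# shared by all sub-arrays.

def bool_mask(arr, block_info=None):
    # Location of this block along the first axis of the full array
    from_ind, to_ind = block_info[0]["array-location"][0]
    return arr[inds[from_ind:to_ind]]

def getitem(var):
    original_chunks = var.chunks[0]
    tmp_inds = np.cumsum([0] + list(original_chunks))
    from_inds = tmp_inds[:-1]
    to_inds = tmp_inds[1:]
    # The number of selected rows in each chunk gives the new chunk sizes
    new_chunks_0 = np.array(list(map(lambda f, t: inds[f:t].sum(), from_inds, to_inds)))
    new_chunks = tuple([tuple(new_chunks_0.tolist())] + list(var.chunks[1:]))
    return var.map_blocks(bool_mask, dtype=var.dtype, chunks=new_chunks)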
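The chunk bookkeeping in getitem is the part that keeps the output shape known instead of nan. A small worked example of that step in plain Python, with masked_chunks as a hypothetical helper name and a made-up mask:

```python
from itertools import accumulate


def masked_chunks(chunks, mask):
    """Count the selected rows that fall inside each chunk along axis 0."""
    stops = list(accumulate(chunks))
    starts = [0] + stops[:-1]
    return tuple(sum(mask[a:b]) for a, b in zip(starts, stops))


# Ten samples stored in chunks of 4, 4 and 2 rows.
chunks = (4, 4, 2)
mask = [True, False, True, True, False, False, False, True, False, False]
print(masked_chunks(chunks, mask))
```

This prints (3, 1, 0): three rows survive in the first chunk, one in the second and none in the third, and those counts are what gets passed as the first entry of chunks to map_blocks.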

dask equivalent of df.loc[df.index.intersection(mylabels)]

When I run df.loc[mylabels] in dask I get a warning with a link to a page that says:
Warning: Starting in 0.21.0, using .loc or [] with a list with one or more missing labels is deprecated, in favor of .reindex.
This page also says:
Alternatively, if you want to select only valid keys, the following is idiomatic and efficient; it is guaranteed to preserve the dtype of the selection.
In [106]: labels = [1, 2, 3]
In [107]: s.loc[s.index.intersection(labels)]
Out[107]:
1 2
2 3
dtype: int64
Dask indexes do not have an intersection method.
So what is the recommended way to achieve the above effect in dask?
The problem with df.loc[mylabels] is that mylabels contains items not in df.index.
For now it looks like you should continue calling df.loc[labels].
It looks like things have changed upstream and probably dask.dataframe needs to follow a bit. I recommend submitting a bug report to https://github.com/dask/dask/issues/new
