Iterate sequentially over a dask bag - dask

I need to submit the elements of a very large dask.bag to a non-threadsafe store, ie I need something like
for x in dbag:
store.add(x)
I can not use compute since the bag is to large to fit in memory.
I need something more like distributed.as_completed but that works on bags, which distributed.as_completed does not.

I would probably continue to use normal compute, but add a lock
def commit(x, lock=None):
with lock:
store.add(x)
b.map(commit, lock=my_lock)
Where you might create a threading.Lock, or multiprocessing.Lock depending on the kind of processing you're doing
If you want to use as_completed you can convert your bag to futures and use as_completed on them.
from distributed.client import futures_of, as_completed
b = b.persist()
futures = futures_of(b)
for future in as_completed(futures):
for x in future.result():
store.add(x)
You can also convert to a dataframe, which I believe does iterate more sensibly
df = b.to_dataframe(...)
for x in df.iteritems(...):
...

Related

Posting a numpy array to a dask.distributed.Queue object

We've noticed that the communication objects like Queue and Variable don't support dask or numpy arrays.
https://distributed.dask.org/en/latest/api.html#distributed.Queue
It appears that msgpack doesn't like numpy arrays. Would it make sense to use something like https://github.com/lebedov/msgpack-numpy ?
We are trying to post intermediate results from a loop on a worker back to a Queue. We'd like to consume these results in realtime and plot them in a notebook.
Copying documentation from https://docs.dask.org/en/latest/futures.html#coordination-primitives
Queues can also send small pieces of information, anything that is msgpack encodable (ints, strings, bools, lists, dicts, etc.). This can be useful to send back small scores or administrative messages:
def func(x):
try:
...
except Exception as e:
error_queue.put(str(e))
error_queue = Queue()
Queues are mediated by the central scheduler, and so they are not ideal for sending large amounts of data (everything you send will be routed through a central point). They are well suited to move around small bits of metadata, or futures. These futures may point to much larger pieces of data safely:
>>> x = ... # my large numpy array
# Don't do this!
>>> q.put(x)
# Do this instead
>>> future = client.scatter(x)
>>> q.put(future)
# Or use futures for metadata
>>> q.put({'status': 'OK', 'stage=': 1234})
If you’re looking to move large amounts of data between workers, then you might also want to consider the Pub/Sub system described a few sections below.

Batch results of intermediate dask computation

I have a large (10s of GB) CSV file that I want to load into dask, and for each row, perform some computation. I also want to write the results of the manipulated CSV into BigQuery, but it'd be better to batch network requests to BigQuery in groups of say, 10,000 rows each, so I don't incur network overhead per row.
I've been looking at dask delayed and see that you can create an arbitrary computation graph, but I'm not sure if this is the right approach: how do I collect and fire off intermediate computations based on some group size (or perhaps time elapsed). Can someone provide a simple example on that? Say for simplicity we have these functions:
def change_row(r):
# Takes 10ms
r = some_computation(r)
return r
def send_to_bigquery(rows):
# Ideally, in large-ish groups, say 10,000 rows at a time
make_network_request(rows)
# And here's how I'd use it
import dask.dataframe as dd
df = dd.read_csv('my_large_dataset.csv') # 20 GB
# run change_row(r) for each r in df
# run send_to_big_query(rows) for each appropriate size group based on change_row(r)
Thanks!
The easiest thing that you can do is provide a block size parameter to read_csv, which will get you approximately the right number of rows per block. You may need to measure some of your data or experiment to get this right.
The rest of your task will work the same way as any other "do this generic thing to blocks of data-frame": the `map_partitions' method (docs).
def alter_and_send(df):
rows = [change_row(r) for r in df.iterrows()]
send_to_big_query(rows)
return df
df.map_partitions(alter_and_send)
Basically, you are running the function on each piece of the logical dask dataframe, which are real pandas dataframes.
You may actually want map, apply or other dataframe methods in the function.
This is one way to do it - you don't really need the "output" of the map, and you could have used to_delayed() instead.

Operation over two arrays with different numblocks

I am trying to implement take_along_axis for dask array.
What is the standard way to map an operation that takes in a block from dask array A, and the corresponding block from dask array B?
Should I use a rechunk when A.numblocks != B.numblocks?
Rechunking is fine. Dask.array algorithms align chunks internally fairly frequently. You might also consider using functions like map_blocks or atop if they work for you. Dask developers optimize these.
http://dask.pydata.org/en/latest/array-api.html#dask.array.core.atop
http://dask.pydata.org/en/latest/array-api.html#dask.array.core.map_blocks
These are only important though if you're constructing your own algorithms. Normal use of dask.arrays does not really require thinking about rechunking explicitly.

Computing in-place with dask

Short version
I have a dask array whose graph is ultimately based on a bunch of numpy arrays at the bottom, and which applies elementwise operations to them. Is it safe to use da.store to compute the array and store the results back into the original backup numpy arrays, making the whole thing an in-place operation?
If you're thinking "you're using dask wrong" then see the long version below for why I feel the need to do this.
Long version
I'm using dask for an application where the original data is sourced from in-memory numpy arrays that contain data collected from a scientific instrument. The goal is to fill most of the RAM (say 75%+) with the original data, which means that there isn't enough to make an in-memory copy. That makes it semantically a bit like an out-of-core problem, in that any derived value can only be realised in memory in chunks rather than all at once.
Dask is well-suited to this, except for one wrinkle. I'm simplifying a lot, but on most of the data (call it X), we need to apply an element-wise operation f, compute some summary statistics s(f(X)), and use that to compute another result over the data, say t(s(f(X)), f(X)). While all the functions are dask-friendly (can be done on a per-chunk basis), trying to simply run this dask graph would cause f(X) to all be held in memory at once because the chunks are all needed for the second pass. An alternative is to explicitly compute s before asking for t (as suggested by https://github.com/dask/dask/issues/874), and thus pay to compute f(X) twice, but it's a somewhat expensive operation so I'd like to avoid that.
However, once f has been applied, the original data are no longer needed. So I'd like to run da.store(f(X)) and have it store the results in the original backing numpy arrays. Technically I think I know how to set that up, and as long as I can be sure that each piece of data is fully consumed before it is overwritten then there are no race conditions, but I'm worried that I may be breaking an API contract by changing back data underneath dask and that it might go wrong in some way. Is there any way to guarantee that it is safe?
One way I can immediately see this going wrong is if several of the input arrays have the same contents and hence get given the same name in dask, causing them to the unified in the graph. I'm using name=False in da.from_array though, so that shouldn't be an issue.

When are placeholders necessary?

Every TensorFlow example I've seen uses placeholders to feed data into the graph. But my applications work fine without placeholders. According to the documentation, using placeholders is the "best practice", but they seem to make the code unnecessarily complex.
Are there any occasions when placeholders are absolutely necessary?
According to the documentation, using placeholders is the "best practice"
Hold on, this quote is out-of-context and could be misinterpreted. Placeholders are the best practice when feeding data through feed_dict.
Using a placeholder makes the intent clear: this is an input node that needs feeding. Tensorflow even provides a placeholder_with_default that does not need feeding — but again, the intent of such a node is clear. For all purposes, a placeholder_with_default does the same thing as a constant — you can indeed feed the constant to change its value, but is the intent clear, would that not be confusing? I doubt so.
There are other ways to input data than feeding and AFAICS all have their uses.
A placeholder is a promise to provide a value later.
Simple example is to define two placeholders a,b and then an operation on them like below .
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
adder_node = a + b # + provides a shortcut for tf.add(a, b)
a,b are not initialized and contains no data Because they were defined as placeholders.
Other approach to do same is to define variables tf.Variable and in this case you have to provide an initial value when you declare it.
like :
tf.global_variables_initializer()
or
tf.initialize_all_variables()
And this solution has two drawbacks
Performance wise that you need to do one extra step with calling
initializer however these variables are updatable .
in some cases you do not know the initial values for these variables
so you have to define it as a placeholder
Conclusion :
use tf.Variable for trainable variables such as weights (W) and biases (B) for your model or when Initial values are required in
general.
tf.placeholder allows you to create operations and build computation graph, without needing the data. In TensorFlow
terminology, we then feed data into the graph through these
placeholders.
I really like Ahmed's answer and I upvoted it, but I would like to provide an alternative explanation that might or might not make things a bit clearer.
One of the significant features of Tensorflow is that its operation graphs are compiled and then executed outside of the original environment used to build them. This allows Tensorflow do all sorts of tricks and optimizations, like distributed, platform independent calculations, graph interoperability, GPU computations etc. But all of this comes at the price of complexity. Since your graph is being executed inside its own VM of some sort, you have to have a special way of feeding data into it from the outside, for example from your python program.
This is where placeholders come in. One way of feeding data into your model is to supply it via a feed dictionary when you execute a graph op. And to indicate where inside the graph this data is supposed to go you use placeholders. This way, as Ahmed said, placeholder is a sort of a promise for data supplied in the future. It is literally a placeholder for things you will supply later. To use an example similar to Ahmed's
# define graph to do matrix muliplication
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
# this is the actual operation we want to do,
# but since we want to supply x and y at runtime
# we will use placeholders
model = tf.matmul(x, y)
# now lets supply the data and run the graph
init = tf.global_variables_initializer()
with tf.Session() as session:
session.run(init)
# generate some data for our graph
data_x = np.random.randint(0, 10, size=[5, 5])
data_y = np.random.randint(0, 10, size=[5, 5])
# do the work
result = session.run(model, feed_dict={x: data_x, y: data_y}
There are other ways of supplying data into the graph, but arguably, placeholders and feed_dict is the most comprehensible way and it provides most flexibility.
If you want to avoid placeholders, other ways of supplying data are either loading the whole dataset into constants on graph build or moving the whole process of loading and pre-processing the data into the graph by using input pipelines. You can read up on all of this in the TF documentation.
https://www.tensorflow.org/programmers_guide/reading_data

Resources