Override dask scheduler to concurrently load data on multiple workers - dask

I want to run graphs/futures on my distributed cluster which all have a 'load data' root task and then a bunch of training tasks that run on that data. A simplified version would look like this:
from dask.distributed import Client

client = Client(scheduler_ip)
load_data_future = client.submit(load_data_func, 'path/to/data/')
train_task_futures = [client.submit(train_func, load_data_future, params)
                      for params in train_param_set]
Running this as above, the scheduler gets one worker to read the file and then spills that data to disk to share it with the other workers. However, loading the data usually means reading from a large HDF5 file, which can be done concurrently, so I was wondering if there is a way to force all the workers to read this file concurrently (i.e. they all compute the root task) instead of having them wait for one worker to finish and then slowly transfer the data from that worker.
I know there is the client.run() method, which I can use to get all the workers to read the file concurrently, but how would you then feed the data you've read into the downstream tasks?
I cannot use the dask data primitives to concurrently read HDF5 files because I need things like multi-indexes and grouping on multiple columns.

Revisited this question and found a relatively simple solution, though it uses internal API methods and involves a blocking call to client.run(). Using the same variables as in the question:
from distributed import get_worker

client_id = client.id

def load_dataset():
    worker = get_worker()
    data = {'load_dataset-0': load_data_func('path/to/data')}
    info = worker.update_data(data=data, report=False)
    worker.scheduler.update_data(
        who_has={key: [worker.address] for key in data},
        nbytes=info['nbytes'],
        client=client_id)

client.run(load_dataset)
Now if you run client.has_what() you should see that each worker holds the key load_dataset-0. To use this in downstream computations you can simply create a future for the key:
from distributed import Future
load_data_future = Future('load_dataset-0', client=client)
and this can be used with client.compute() or dask.delayed as usual. Indeed the final line from the example in the question would work fine:
train_task_futures = [client.submit(train_func, load_data_future, params)
                      for params in train_param_set]
Bear in mind that this solution uses the internal API methods Worker.update_data and Scheduler.update_data; it works fine as of distributed.__version__ == 1.21.6 but could be subject to change in future releases.

As of today (distributed.__version__ == 1.20.2) what you ask for is not possible. The closest thing would be to compute once and then replicate the data explicitly:

from dask.distributed import wait

future = client.submit(load, path)
wait(future)
client.replicate(future)
You may want to raise this as a feature request at https://github.com/dask/distributed/issues/new

Related

TFF: Remote Executor

We are setting up a federated scenario with Server and Client on different physical machines.
On the server, we have used a Docker container to kickstart:
The above has been borrowed from the Kubernetes tutorial. We believe this creates a 'local executor' [Ref 1], which helps create a gRPC server [Ref 2].
Ref 1:
Ref 2:
Next, on client 1, we call tff.framework.RemoteExecutor, which connects to the gRPC server.
Our understanding based on the above is that the Remote Executor runs on the client, which connects to the gRPC server.
Assuming the above is correct, how can we send a tff.tf_computation from the server to the client and print the output on the client side to ensure the whole setup works well?
Your understanding is definitely correct.
If you construct an ExecutorFactory directly, as seems to be the case in the code above, passing it to tff.framework.set_default_context will install your remote stack as the default mechanism for executing computations in the TFF runtime. You should additionally be able to pass the appropriate channels to tff.backends.native.set_remote_execution_context to handle the remote executor construction and context installation if desired, but the way you are doing it certainly works, and allows for greater customization.
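For concreteness, a minimal, hedged sketch of that second option might look like the following; the worker address is a placeholder and the API shown is from TFF releases of roughly this era, so it may differ in newer versions:

import grpc
import tensorflow_federated as tff

# Placeholder host:port of the gRPC worker started on the remote machine.
channels = [grpc.insecure_channel('client-machine-1:8000')]

# Builds a remote executor stack over these channels and installs it as the
# default execution context, so subsequent TFF computations run remotely.
tff.backends.native.set_remote_execution_context(channels)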
Once you have set this up, running an example end-to-end should be fairly simple. We will set up a computation which takes a set of federated integers, prints on the clients, and sums the integers up. Let:
@tff.tf_computation(tf.int32)
def print_and_return(x):
    # We must use tf.print here, as this logic will be
    # serialized and run on the clients as TensorFlow.
    tf.print('hello world')
    return x

@tff.federated_computation(tff.FederatedType(tf.int32, tff.CLIENTS))
def print_and_sum(federated_arg):
    same_ints = tff.federated_map(print_and_return, federated_arg)
    return tff.federated_sum(same_ints)
Suppose we have N clients; we simply instantiate the set of federated integers, and invoke our computation.
federated_ints = [1] * N
total = print_and_sum(federated_ints)
assert total == N
This should cause the tf.prints defined above to run on the remote machine; as long as tf.print is directed to an output stream which you can monitor, you should be able to see it.
PS: you may note that the federated sum above is unnecessary; it certainly is. The same effect can be had by simply mapping the identity function with the serialized print.

Stream PubSub to Spanner - Wait.on Step

The requirement is to delete the data in the Spanner tables before inserting the data from the Pub/Sub messages. As a MutationGroup does not guarantee the order of execution, I separated the delete mutations into their own set, so there are two sets: one for the Delete mutations and one for the AddReplace mutations.
PCollection<Data> dataJson =
    pipeLine
        .apply(PubsubIO.readStrings().fromSubscription(options.getInputSubscription()))
        .apply("ParsePubSubMessage", ParDo.of(new PubSubToDataFn()))
        .apply(Window.into(FixedWindows.of(Duration.standardSeconds(10))));

SpannerWriteResult deleteResult = dataJson
    .apply("DeleteDataMutation", MapElements.via(......))
    .apply("DeleteData", SpannerIO.write().withSpannerConfig(spannerConfig).grouped());

dataJson
    .apply("WaitOnDeleteMutation", Wait.on(deleteResult.getOutput()))
    .apply("AddReplaceMutation", MapElements.via(...))
    .apply("UpsertInfoToSpanner", SpannerIO.write().withSpannerConfig(spannerConfig).grouped());
This is a streaming Dataflow job, and I have tried multiple windowing strategies, but it never executes the "UpsertInfoToSpanner" step.
How can I fix this issue? Can someone suggest a path forward?
Update:
The requirement is to apply two mutation groups sequentially on the same input data, i.e. read JSON from the Pub/Sub message, delete the existing data from multiple tables with one mutation group, and then insert data from the same JSON Pub/Sub message.
Re-pasting the comment earlier for better visibility:
The Mutation operations within a single MutationGroup are guaranteed to be executed in order within a single transaction, so I don't see what the issue is here... The reason why Wait.on() never releases is that the output stream being waited on is in the global window, so it will never be closed in a streaming pipeline.
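Not part of the original answer, but as a hedged illustration of the point above: Wait.on() can only release once the signal it waits on is in windows that can close. One untested direction, reusing the names from the question, would be to re-window the Spanner write result before waiting on it:

// Untested sketch: move the Wait.on() signal out of the global window by
// re-windowing the Spanner write result (names reused from the question).
PCollection<Void> deleteSignal = deleteResult
    .getOutput()
    .apply("WindowDeleteSignal",
        Window.into(FixedWindows.of(Duration.standardSeconds(10))));

dataJson
    .apply("WaitOnDeleteMutation", Wait.on(deleteSignal))
    .apply("AddReplaceMutation", MapElements.via(...))   // as in the original pipeline
    .apply("UpsertInfoToSpanner",
        SpannerIO.write().withSpannerConfig(spannerConfig).grouped());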

How to use completed_count to track task group completion in Celery?

I am trying to use "completed_count()" to track how many tasks are left in a group in Celery.
My "client" runs this:
from celery import group
from proj import do

wordList = []
with open('word.txt') as wordData:
    for line in wordData:
        wordList.append(line)

readAll = group(do.s(i) for i in wordList)
result = readAll.apply_async()

while not result.ready():
    print(result.completed_count())

result.get()
The word.txt file just contains one word per line.
Then I have the Celery worker(s) set to run the do task as:
from time import sleep

# `app` is the Celery application defined in proj
@app.task(acks_late=True)
def do(word):
    sleep(1)
    return f"I'm doing {word}"
My broker is pyamqp and I use rpc for the backend.
I thought it would print an increasing count of completed tasks on each iteration of the loop on the client side, but all I get are zeros.
The problem is not in the completed_count method. You are getting zeros because result.ready() stays False even after all the tasks have completed. This looks like a bug in the rpc backend; there is an issue about it on GitHub. Consider changing the backend setting to amqp, which works correctly as far as I can see.
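For reference, a minimal sketch of what that backend change might look like in proj.py; the broker URL is a placeholder, and the amqp result backend only exists in older Celery releases (it was removed in Celery 5):

from celery import Celery

app = Celery('proj',
             broker='pyamqp://guest@localhost//',  # placeholder broker URL
             backend='amqp://')                    # switched from 'rpc://' as suggested above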

How to count total number of rows in a file using google dataflow

I would like to know if there is a way to find out the total number of rows in a file using Google Dataflow. Any code sample or pointer would be a great help. Basically, I have a method such as
int getCount(String fileName) {}
So, the above method will return the total count of rows, and its implementation will be Dataflow code.
Thanks
It seems like your use case is one that doesn't require distributed processing, because the file is compressed and hence cannot be read in parallel. However, you may still find it useful to use the Dataflow APIs for the sake of their easy access to GCS and automatic decompression.
Since you also want to get the result out of your pipeline as an actual Java object, you need to use the Direct runner, which runs in-process, without talking to the Dataflow service or doing any distributed processing; in return, however, it provides the ability to extract PCollections into Java objects.
Something like this:
PipelineOptions options = ...;
DirectPipelineRunner runner = DirectPipelineRunner.fromOptions(options);
Pipeline p = Pipeline.create(options);

PCollection<Long> countPC =
    p.apply(TextIO.Read.from("gs://..."))
     .apply(Count.<String>globally());

DirectPipelineRunner.EvaluationResults results = runner.run(p);
long count = results.getPCollection(countPC).get(0);

Using Neo4j with React JS

Can we use the graph database Neo4j with React.js? If not, is there an alternative option for including a graph database in React.js?
Easily, all you need is neo4j-driver: https://www.npmjs.com/package/neo4j-driver
Here is the simplest usage:
neo4j.js
//import { v1 as neo4j } from 'neo4j-driver'
const neo4j = require('neo4j-driver').v1

const driver = neo4j.driver('bolt://localhost', neo4j.auth.basic('username', 'password'))
const session = driver.session()

session
  .run(`
    MATCH (n:Node)
    RETURN n AS someName
  `)
  .then((results) => {
    results.records.forEach((record) => console.log(record.get('someName')))
    session.close()
    driver.close()
  })
It is best practice to always close the session as soon as you have the data. It is inexpensive and lightweight.
It is best practice to only close the driver once your program is done (similar to MongoDB). You will see serious errors if you close the driver at a bad time, which is incredibly important to note if you are a beginner. You will see errors like 'connection to server closed', etc. In async code, for example, if you run a query and close the driver before the results are parsed, you will have a bad time.
You can see in my example that I close the driver afterwards, but only to illustrate proper cleanup. If you run this code in a standalone JS file to test it, you will see that Node.js hangs after the query and you need to press CTRL + C to exit; adding driver.close() fixes that. Normally, the driver is not closed until the program exits/crashes, which is never in a backend API, and not until the user logs out in the frontend.
Knowing this now, you are off to a great start.
Remember: call session.close() immediately every time, and be careful with driver.close().
You could easily put this code in a React component or action creator and render the data.
You will find it no different from hooking up and working with Axios.
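For example, here is a rough sketch of wiring the query into a React component; it assumes a newer neo4j-driver (no .v1 namespace), React hooks, and an illustrative Node label with a name property, so treat it as a starting point rather than a drop-in:

import React, { useEffect, useState } from 'react'
import neo4j from 'neo4j-driver'

const driver = neo4j.driver('bolt://localhost', neo4j.auth.basic('username', 'password'))

export function NodeNames() {
  const [names, setNames] = useState([])

  useEffect(() => {
    const session = driver.session()
    session
      .run('MATCH (n:Node) RETURN n.name AS name')   // illustrative query/property
      .then((results) => {
        setNames(results.records.map((record) => record.get('name')))
        session.close()   // close the session as soon as the data is in
      })
      .catch((error) => {
        console.error(error)
        session.close()
      })
    // note: the driver itself stays open for the lifetime of the app
  }, [])

  return <ul>{names.map((name) => <li key={name}>{name}</li>)}</ul>
}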
You can also run statements in a transaction, which is beneficial for write-locking the affected nodes. You should research that thoroughly first, but the transaction flow looks like this:
const session = driver.session()
const tx = session.beginTransaction()

tx
  .run(query)
  .then(/* same as normal */)
  .catch(/* errors */)

// the difference is that you can chain multiple runs in the same transaction:
const tx1 = await tx.run().then()
// use results
const tx2 = await tx.run().then()

// then, once you are ready to commit the changes:
if (results.good !== true) {
  tx.rollback()
  session.close()
  throw error
}
await tx.commit()
session.close()

const finalResults = { tx1, tx2 }
return finalResults

// in my experience, you have to await tx.commit()
// in async/await syntax, otherwise it may not commit properly;
// the operation is not instant
tl;dr;
Yes, you can!
You are mixing two different technologies together. Neo4j is a graph database and React.js is a front-end framework.
You can connect to Neo4j from JavaScript - http://neo4j.com/developer/javascript/
Interesting topic. I am using the driver in a React app and recently experienced some issues. I was closing the session every time a lifecycle hook completed, as in your example. With more intensive queries I would see a timeout error. Going back to my setup, I decided to experiment by closing the driver in some of the more expensive queries, and it looks like the crashes are gone (this still needs more testing).
If you are deploying a real-world application, I would urge you to think about authentication and authorization when using a DB-to-React setup on its own, as you would have to store the username/password of the Neo4j server in the client. I am looking into options for having the Neo4j server issue a token and accept it for authorization, but the best practice is surely to have a Node.js server in the middle, with something like Passport to handle authentication.
So, all in all, maybe the best scenario is to only use the driver in Node and have the browser always communicate with the Node server using axios...
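To make that last suggestion concrete, here is a rough sketch of the Node-in-the-middle setup; Express, the /nodes endpoint, and the query are assumptions for illustration, not part of the original answer (again assuming a driver version without the .v1 namespace):

const express = require('express')
const neo4j = require('neo4j-driver')

const app = express()
const driver = neo4j.driver('bolt://localhost', neo4j.auth.basic('username', 'password'))

// The browser never talks to Neo4j directly; it only hits this endpoint,
// so the database credentials stay on the server.
app.get('/nodes', async (req, res) => {
  const session = driver.session()
  try {
    const results = await session.run('MATCH (n:Node) RETURN n.name AS name')
    res.json(results.records.map((record) => record.get('name')))
  } catch (error) {
    res.status(500).json({ error: error.message })
  } finally {
    session.close()
  }
})

app.listen(3000)

// In the React app, the browser then only calls this API, e.g.:
// axios.get('/nodes').then((response) => setNames(response.data))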
