Why dask delayed do nothing? - dask

I am using dask to process files line by line. However, dask seems that do not do anything. My code logic is as follows:
import dask
from dask import delayed
from time import sleep
#dask.delayed
def inc(x):
sleep(1)
print(x)
def test():
for i in range(5):
delayed(inc)(i)
dask.compute(test())
However, no any outputs in console. Why?

Your function test does not return anything.
Perhaps you meant to do something like
def test():
out = []
for i in range(5):
out.append(inc(i))
return out
(note that you already decorated inc with delayed, there is no need to call delayed(inc) again)

Related

Display progress on dask.compute(*something) call

I have the following structure on my code using Dask:
#dask.delayed
def calculate(data):
services = data.service_id
prices = data.price
return [services, prices]
output = []
for qid in notebook.tqdm(ids):
r = calculate(parts[parts.quotation_id == qid])
output.append(r)
Turns out that, when I call the dask.compute() method over my output list, I don't have any progress indication. The Diagnostic UI don't "capture" this action, and I'm not even sure that's properly running (judging by my processor usage, I think it's not).
result = dask.compute(*output)
I'm following the "best practices" article from the dask's documentation:
https://docs.dask.org/en/latest/delayed-best-practices.html
What I'm missing?
Edit: I think it's running, because I still got memory leak/high usage warnings. Still no progress indication.
As pointed out in the related post, dask has two methods for displaying the progress: one for "normal" dask, and one for dask.distributed.
Here's a reproducible example:
import random
from time import sleep
import dask
from dask.diagnostics import ProgressBar
from dask.distributed import Client, progress
# simulate work
#dask.delayed
def work(x):
sleep(x)
return True
# generate tasks
random.seed(42)
tasks = [work(random.randint(1,5)) for x in range(50)]
Using plain dask
ProgressBar().register()
dask.compute(*tasks)
produces:
using dask.distributed
client = Client()
futures = client.compute(tasks)
progress(futures)
produces:

Dask opportunistic caching in custom graphs

I have a custom DAG such as:
dag = {'load': (load, 'myfile.txt'),
'heavy_comp': (heavy_comp, 'load'),
'simple_comp_1': (sc_1, 'heavy_comp'),
'simple_comp_2': (sc_2, 'heavy_comp'),
'simple_comp_3': (sc_3, 'heavy_comp')}
And I'm looking to compute the keys simple_comp_1, simple_comp_2, and simple_comp_3, which I perform as follows,
import dask
from dask.distributed import Client
from dask_yarn import YarnCluster
task_1 = dask.get(dag, 'simple_comp_1')
task_2 = dask.get(dag, 'simple_comp_2')
task_3 = dask.get(dag, 'simple_comp_3')
tasks = [task_1, task_2, task_3]
cluster = YarnCluster()
cluster.scale(3)
client = Client(cluster)
dask.compute(tasks)
cluster.shutdown()
It seems, that without caching, the computation of these 3 keys will lead to the computation of heavy_comp also 3 times. And since this is a heavy computation, I tried to implement opportunistic caching from here as follows:
from dask.cache import Cache
cache = Cache(2e9)
cache.register()
However, when I tried to print the results of what was being cached I got nothing:
>>> cache.cache.data
[]
>>> cache.cache.heap.heap
{}
>>> cache.cache.nbytes
{}
I even tried increasing the cache size to 6GB, however to no effect. Am I doing something wrong? How can I get Dask to cache the result of the key heavy_comp?
Expanding on MRocklin's answer and to format code in the comments below the question.
Computing the entire graph at once works as you would expect it to. heavy_comp would only be executed once, which is what you want. Consider the following code you provided in the comments completed by empty function definitions:
def load(fn):
print('load')
return fn
def sc_1(i):
print('sc_1')
return i
def sc_2(i):
print('sc_2')
return i
def sc_3(i):
print('sc_3')
return i
def heavy_comp(i):
print('heavy_comp')
return i
def merge(*args):
print('merge')
return args
dag = {'load': (load, 'myfile.txt'), 'heavy_comp': (heavy_comp, 'load'), 'simple_comp_1': (sc_1, 'heavy_comp'), 'simple_comp_2': (sc_2, 'heavy_comp'), 'simple_comp_3': (sc_3, 'heavy_comp'), 'merger_comp': (merge, 'sc_1', 'sc_2', 'sc_3')}
import dask
result = dask.get(dag, 'merger_comp')
print('result:', result)
It outputs:
load
heavy_comp
sc_1
sc_2
sc_3
merge
result: ('sc_1', 'sc_2', 'sc_3')
As you can see, "heavy_comp" is only printed once, showing that the function heavy_comp has only been executed once.
The opportunistic cache in the core Dask library only works for the single-machine scheduler, not the distributed scheduler.
However, if you just compute the entire graph at once Dask will hold onto intermediate values intelligently. If there are values that you would like to hold onto regardless you might also look at the persist function.

How to let all worker do same task in dask?

I want to let all workers do same task ,like this:
from dask import distributed
from distributed import Client,LocalCluster
import dask
import socket
def writer(filename,data):
with open(filename,'w') as f:
f.writelines(data)
def get_ip(x):
return socket.gethostname()
#writer('/data/1.txt',a)
client = Client('192.168.123.1:8786')
A=client.submit(get_ip, 0,workers=['w1','w2'], pure=False)
print(client.ncores(),
client.scheduler_info()
# dask.config.get('distributed')
)
A.result()
i have 2 workers,but just print 1 workers'hostname
A simple way to achieve what you want is to use the Client.run method
client.run(socket.gethostname)
This runs the function on all workers and returns all results. It does not use the normal task scheduling system, which is designed for a very different purpose from what you want.

Clock.schedule_interval doesn't schedule callback

I am testing the functionality of the kivy.clock.Clock.schedule_interval function.
My schedule_interval isn't calling the test function but rather exits without any errors.
What is it that I'm not understanding? I have correctly modeled this test by the documentation.
from kivy.clock import Clock
class TestClass:
def __init__(self):
print("function __init__.")
schedule = Clock.schedule_interval(self.test, 1)
def test(self, dt):
print("function test.")
if __name__ == '__main__':
a = TestClass()
The expected output should be:
function __init__.
function test.
function test.
function test.
function test.
function test.
function test.
Instead I'm just getting:
function __init__.
The main problem is that your program exits before one second passes. I'm not sure but I also assume that there has to be a kivy app in order for the Clock to work (I tried to make an empty while loop instead of running an app but that didn't help).
Here's an easy fix that gives the desired output:
from kivy.clock import Clock
from kivy.base import runTouchApp
class TestClass:
def __init__(self, **kwargs):
print("function __init__.")
schedule = Clock.schedule_interval(self.test, 1)
def test(self, dt):
print("function test.")
if __name__ == '__main__':
test = TestClass()
runTouchApp() # run an empty app so the program doesn't close
Otherwise consider making TestClass inherit from kivy's App and running it with TestClass().run() - you will achieve the same result.

Kivy - threads, queues, clocks and Python sockets

I'm brand new to Kivy, and also new to GUI, but not new to programming.
I am completely missing the boat, the canoe, and the airplane on using Kivy.
In 30 years of programming, from machine code, assembly, Fortran, C, C++, Java, Python, I've never tried to use a language such as Kivy who's documentation is this thin, because it's so new. I know it'll get better, but I'm trying to use it now.
In my code, I'm trying to implement Queueing, so that I can obtain Python socket data. In normal Python programming, I would have IPC via a Queue - put data in, get data out.
I understand from Kivy, mostly from what I've read in various forums, but can't say I've found it in the documentation at kivy.org, that I can't do the following:
Kivy needs to be in it's own thread.
Nothing in Kivy should sleep.
Nothing in Kivy should do blocking IO.
After a LOT of Google searching, the only thing I've actually found that approaches being useful, is an informative note here on StackOverFlow . However, while it almost solves my problem, the answer assumes I know more about Kivy than I do; I don't know how to incorporate the answer.
If someone could take the time to put together a COMPLETE short demo of using that example, or one of your own unique COMPLETE answers, I would much appreciate it!
Here's some short code I put together, but it doesn't work, because it blocks on the get() call.
from Queue import Queue
from kivy.lang import Builder
from kivy.app import App
from kivy.uix.boxlayout import BoxLayout
from kivy.properties import StringProperty
from kivy.clock import Clock
from threading import Thread
class ClockedQueue(BoxLayout):
text1 = StringProperty("Start")
def __init__(self):
super(ClockedQueue,self).__init__()
self.q = Queue()
self.i=0
Clock.schedule_interval(self.get, 2)
def get(self,dt):
print("get entry")
val = self.q.get()
print(self.i + val)
self.i += 1
class ClockedQueueApp(App):
def build(self):
return ClockedQueue()
class SourceQueue(Queue):
def __init__(self):
q = Queue()
for word in ['First','Second']:
q.put(word)
print("SourceQueue finished.")
def main():
th = Thread(target=SourceQueue)
th.start()
ClockedQueueApp().run()
return 0
if __name__ == '__main__':
main()
Thanks!
Here's some short code I put together, but it doesn't work, because it blocks on the get() call.
So what you really want to do is get items from your queue in a non-blocking way?
There are multiple ways to do this. The simplest seems to be to just check if the queue has any items before getting one - Queue has several methods that help with this, including checking if it is empty or setting whether get is allowed to be blocking (by setting its first argument to False). If you just do this instead of calling get on its own, you won't block things waiting for the queue to have any items - if it's empty or you can't immediately get anything, you just do nothing.
I don't know what you want to do with the items you get from the queue, but if it's short operations that don't take long then you won't need anything more than this. For instance, you could Clock.schedule_interval the get method to happen every frame, do nothing if the queue is empty, or operate on the data if you get something back. No blocking, and no messing with your own threads.
You can also create your own thread and run the blocking code in it, which is general way to deal with blocking issues, especially tasks that can't be split up into short sections that can be performed between frames. I don't know about the details of this, but it should just involve using python threads normally. You can check the source of kivy's UrlRequest for an example, this can download a web source in a background thread.
Edit: Also your SourceQueue is messed up (you override its __init__ to make a new queue that you don't store anywhere), and your clock scheduling has a meaningless third argument false which isn't even defined. I don't know what's going on here, it probably affects what you're trying to do, but doesn't matter to my general answer above.
I was finally able to create something that worked.
Thanks everyone for your suggestions!
Here's the code (because I'm new, Stackoverflow wouldn't let me post it as answering my own question until 5:00 AM)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# threads_and_kivy.py
#
'''threads_and_kivy.py
Trying to build up a foundation that satisfies the following:
- has a thread that will implement code that:
- simulates reads data from a Python socket
- works on the data
- puts the data onto a Python Queue
- has a Kivy mainthread that:
- via class ShowGUI
- reads data from the Queue
- updates a class variable of type StringProperty so it will
update the label_text property.
'''
from threading import Thread
from Queue import Queue, Empty
import time
from kivy.app import App
from kivy.uix.boxlayout import BoxLayout
from kivy.properties import StringProperty
from kivy.lang import Builder
from kivy.clock import Clock
kv='''
<ShowGUI>:
Label:
text: str(root.label_text)
'''
Builder.load_string(kv)
q = Queue()
class SimSocket():
global q
def __init__(self, queue):
self.q = queue
def put_on_queue(self):
print("<-----..threaded..SimSocket.put_on_queue(): entry")
for i in range(10):
print(".....threaded.....SimSocket.put_on_queue(): Loop " + str(i))
time.sleep(1)#just here to sim occassional data send
self.some_data = ["SimSocket.put_on_queue(): Data Loop " + str(i)]
self.q.put(self.some_data)
print("..threaded..SimSocket.put_on_queue(): thread ends")
class ShowGUI(BoxLayout):
label_text = StringProperty("Initial - not data")
global q
def __init__(self):
super(ShowGUI, self).__init__()
print("ShowGUI.__init__() entry")
Clock.schedule_interval(self.get_from_queue, 1.0)
def get_from_queue(self, dt):
print("---------> ShowGUI.get_from_queue() entry")
try:
queue_data = q.get(timeout = 5)
self.label_text = queue_data[0]
for qd in queue_data:
print("SimKivy.get_from_queue(): got data from queue: " + qd)
except Empty:
print("Error - no data received on queue.")
print("Unschedule Clock's schedule")
Clock.unschedule(self.get_from_queue)
class KivyGui(App):
def build(self):
return ShowGUI()
def main():
global q
ss = SimSocket(q)
simSocket_thread = Thread(name="simSocket",target=ss.put_on_queue)
simSocket_thread.start()
print("Starting KivyGui().run()")
KivyGui().run()
return 0
if __name__ == '__main__':
main()

Resources