When using Dask's distributed scheduler I have a task that is running on a remote worker that I want to stop.
How do I stop it? I know about the cancel method, but this doesn't seem to work if the task has already started executing.
If it's not yet running
If the task has not yet started running you can cancel it by cancelling the associated future
future = client.submit(func, *args) # start task
future.cancel() # cancel task
If you are using dask collections then you can use the client.cancel method
x = x.persist() # start many tasks
client.cancel(x) # cancel all tasks
If it is running
However if your task has already started running on a thread within a worker then there is nothing that you can do to interrupt that thread. Unfortunately this is a limitation of Python.
Build in an explicit stopping condition
The best you can do is to build in some sort of stopping criterion into your function with your own custom logic. You might consider checking a shared variable within a loop. Look for "Variable" in these docs: http://dask.pydata.org/en/latest/futures.html
from dask.distributed import Client, Variable
client = Client()
stop = Varible()
stop.put(False)
def long_running_task():
while not stop.get():
... do stuff
future = client.submit(long_running_task)
... wait a while
stop.put(True)
Related
I am using global window with repeated forever after processing time trigger to process streaming data from pub-sub as below :
PCollection<KV<String,SMMessage>> perMSISDNLatestEvents = messages
.apply("Apply global window",Window.<SMMessage>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(1))))
.discardingFiredPanes())
.apply("Convert into kv of msisdn and SM message", ParDo.of(new SmartcareMessagetoKVFn()))
.apply("Get per MSISDN latest event",Latest.perKey()).apply("Write into Redis", ParDo.of(new WriteRedisFn()));
Is there a way to make repeatedly forever apache beam trigger to only execute after the previous execution is completed ? The reason for my question is because the next trigger processing will need to read data from redis, written by the previous trigger execution.
Thank You
So the trigger here would fire at the interval you provided. The trigger is not aware of any downstream processing so it's unable to depend on such steps of your pipeline.
Instead of depending on the trigger for consistency here, you could add a barrier (a DoFn) that exists before the Write step and only gives up execution after you see the previous data in Redis.
You could try and explicitly declare a global window trigger, as the example below:
Trigger subtrigger = AfterProcessingTime.pastFirstElementInPane();
Trigger maintrigger = Repeatedly.forever(subtrigger);
I think that triggers would help you on your case, since it will allow you to create event times, which will run when you or your code trigger them, so you would only run repeatedly forever when a trigger finishes first.
I found this documentation which might guide you on the triggers you are trying to create.
I'm using sidekiq for background job and I enqueue a job like this:
SampleJob.set(wait: waiting_time.to_i.seconds).perform_later(***) ・・・ ①
When waiting_time is nil,
it becomes
SampleJob.set(wait: 0.seconds).perform_later(***)
Of course it works well, but I'm worried about performance because worker enqueued with wait argument is derived by poller,
so I wonder if I should remove set(wait: waiting_time.to_i.seconds) when
waiting_time is nil.
i.e.)
if waiting_time.present?
SampleJob.set(wait: waiting_time.to_i.seconds).perform_later(***)
else
SampleJob.perform_later(***)
end ・・・ ②
Is there any differences in performance or speed between ① and ②?
Thank you in advance.
There is no difference. It looks like this is already considered in the Sidekiq library.
https://github.com/mperham/sidekiq/blob/main/lib/sidekiq/worker.rb#L261
# Optimization to enqueue something now that is scheduled to go out now or in the past
#opts["at"] = ts if ts > now
I am trying to use "completed_count()" to track how many tasks are left in a group in Celery.
My "client" runs this:
from celery import group
from proj import do
wordList=[]
with open('word.txt') as wordData:
for line in wordData:
wordList.append(line)
readAll = group(do.s(i) for i in wordList)
result = readAll.apply_async()
while not result.ready():
print(result.completed_count())
result.get()
The 'word.txt" is just a file with one word on each line.
Then I have the celery worker(s) set to run the do task as:
#app.task(task_acks_late = True)
def do(word):
sleep(1)
return f"I'm doing {word}"
My broker is pyamqp and I use rpc for the backend.
I thought it would print an increasing count of tasks for each loop on the client side but all I get are "0"s.
The problem is not in completed_count method. You are getting zeros because of result.ready() stays False after all the tasks have been completed. It seems like we have a bug with rpc backend, there is an issue on github. Consider to change the backend setting to amqp, it is working correctly as I can see
I'm using Apache Beam on Dataflow through Python API to read data from Bigquery, process it, and dump it into Datastore sink.
Unfortunately, quite often the job just hangs indefinitely and I have to manually stop it. While the data gets written into Datastore and Redis, from the Dataflow graph I've noticed that it's only a couple of entries that get stuck and leave the job hanging.
As a result, when a job with fifteen 16-core machines is left running for 9 hours (normally, the job runs for 30 minutes), it leads to huge costs.
Maybe there is a way to set a timer that would stop a Dataflow job if it exceeds a time limit?
It would be great if you can create a customer support ticket where we would could try to debug this with you.
Maybe there is a way to set a timer that would stop a Dataflow job if
it exceeds a time limit?
Unfortunately the answer is no, Dataflow does not have an automatic way to cancel a job after a certain time. However, it is possible to do this using the APIs. It is possible to wait_until_finish() with a timeout then cancel() the pipeline.
You would do this like so:
p = beam.Pipeline(options=pipeline_options)
p | ... # Define your pipeline code
pipeline_result = p.run() # doesn't do anything
pipeline_result.wait_until_finish(duration=TIME_DURATION_IN_MS)
pipeline_result.cancel() # If the pipeline has not finished, you can cancel it
To sum up, with the help of #ankitk answer, this works for me (python 2.7, sdk 2.14):
pipe = beam.Pipeline(options=pipeline_options)
... # main pipeline code
run = pipe.run() # doesn't do anything
run.wait_until_finish(duration=3600000) # (ms) actually starts a job
run.cancel() # cancels if can be cancelled
Thus, in case if a job was successfully finished within the duration time in wait_until_finished() then cancel() will just print a warning "already closed", otherwise it will close a running job.
P.S. if you try to print the state of a job
state = run.wait_until_finish(duration=3600000)
logging.info(state)
it will be RUNNING for the job that wasn't finished within wait_until_finished(), and DONE for finished job.
Note: this technique will not work when running Beam from within a Flex Template Job...
The run.cancel() method doesn't work if you are writing a template and I haven't seen any successful work around it...
I have 3 ubuntu machine(CPU). my dask scheduler and client both are present on the same machine, whereas the two dask workers are running on other two machines. when I launch first task, it gets scheduled on first worker, but then upon launching second worker, while the first one still executing, it does not get scheduled on second worker. here is the sample client code that I tried.
### client.py
from dask.distributed import Client
import time, sys, os, random
def my_task(arg):
print("doing something in my_task")
time.sleep(2)
print("inside my task..", arg)
print("again doing something in my_task")
time.sleep(2)
print("return some random value")
value = random.randint(1,100)
print("value::", value)
return value
client = Client("172.25.49.226:8786")
print("client::", client)
future = client.submit(my_task, "hi")
print("future result::", future.result())
print("closing the client..")
client.close()
I am running "python client.py" two times almost at the same time from two different terminal/machines. both the client seems to be executing, but it results in exactly the same output which it should not because the return type of the my_task() is a random value. I tested this on ubuntu machines.
However a month back, I was able to run same tasks in parallel on CentOs machines. And now if check back and ran same two tasks from those CentOs machines, the problem persist. This is strange. it did not run in parallel. Not able to figure out this behavior by dask. Am I missing any OS level settings or something else.?
Run the below almost at the same time,
python client.py # from one machine/terminal
python client.py # from another machine/terminal
these two tasks should run in parallel, each task should run on different worker(we have two free workers available), but this is not happening. I can't see any log on the second worker console nor on the scheduler, while the first task continues to execute. At the end I noticed both the tasks finishes exactly at the same time with exactly same output.
However the above client code works well in "parallel" in windows OS, each task running through multiple terminals. but I would like to run it on Ubuntu machines.
By default if you call the same function on the same inputs Dask will assume that this will produce the same value, and only compute it once. You can override this behavior with the pure=False keyword
future = client.submit(func, *args, pure=False)