EvaluateAsync(), serializing evaluate on 2 different sessions - windows-container

I want to have 2 evaluates running parallelly on 2 different devices(and 2 different sessions) that I have created, for which I am using EvaluateAsync()
Code:
std::cout<<" Starting Evaluate " << std::endl;
auto start = high_resolution_clock::now();
auto eval_1= session.EvaluateAsync(binding, L""();
auto eval_2 = session_2.EvaluateAsync( binding_2, L"" )
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>( stop - start );
std::cout << " Ending Evaluate " << duration.count() << std::endl;
Expected behavior:
With only one evaluate call (let's assume only auto eval_1= session.EvaluateAsync(binding, L""(); between time recorded), I know duration is 10 ms.
If the EvaluateAsyn is truly asynchronous I expect with 2 calls, the time should be max of the 2 calls, however, it takes double the time ie 20 ms to execute.

Looks like this question was also asked in the Windows Machine Learning GitHub repo and there are answers there. In summary, asynchronous evaluation is complicated because on the CPU certain operators will still try to use all the threads, which leads to contention, and on GPU the work of queuing the GPU work is still synchronous and and operators that need to run on the CPU will cause the pipeline to stall while it waits for the GPU work.

Related

waitForCompletion(timeout) in Abaqus API does not actually kill the job after timeout passes

I'm doing a parametric sweep of some Abaqus simulations, and so I'm using the waitForCompletion() function to prevent the script from moving on prematurely. However, occassionally the combination of parameters causes the simulation to hang on one or two of the parameters in the sweep for something like half an hour to an hour, whereas most parameter combos only take ~10 minutes. I don't need all the data points, so I'd rather sacrifice one or two results to power through more simulations in that time. Thus I tried to use waitForCompletion(timeout) as documented here. But it doesn't work - it ends up functioning just like an indefinite waitForCompletion, regardless of how low I set the wait time. I am using Abaqus 2017, and I was wondering if anyone else had gotten this function to work and if so how?
While I could use a workaround like adding a custom timeout function and using the kill() function on the job, I would prefer to use the built-in functionality of the Abaqus API, so any help is much appreciated!
It seems like starting from a certain version the timeOut optional argument was removed from this method: compare the "Scripting Reference Manual" entry in the documentation of v6.7 and v6.14.
You have a few options:
From Abaqus API: Checking if the my_abaqus_script.023 file still exists during simulation:
import os, time
timeOut = 600
total_time = 60
time.sleep(60)
# whait untill the the job is completed
while os.path.isfile('my_job_name.023') == True:
if total_time > timeOut:
my_job.kill()
total_time += 60
time.sleep(60)
From outside: Launching the job using the subprocess
Note: don't use interactive keyword in your command because it blocks the execution of the script while the simulation process is active.
import subprocess, os, time
my_cmd = 'abaqus job=my_abaqus_script analysis cpus=1'
proc = subprocess.Popen(
my_cmd,
cwd=my_working_dir,
stdout='my_study.log',
stderr='my_study.err',
shell=True
)
and checking the return code of the child process suing poll() (see also returncode):
timeOut = 600
total_time = 60
time.sleep(60)
# whait untill the the job is completed
while proc.poll() is None:
if total_time > timeOut:
proc.terminate()
total_time += 60
time.sleep(60)
or waiting until the timeOut is reached using wait()
timeOut = 600
try:
proc.wait(timeOut)
except subprocess.TimeoutExpired:
print('TimeOut reached!')
Note: I know that terminate() and wait() methods should work in theory but I haven't tried this solution myself. So maybe there will be some additional complications (like looking for all children processes created by Abaqus using psutil.Process(proc.pid) )

How to get results of tasks when they finish and not after all have finished in Dask?

I have a dask dataframe and want to compute some tasks that are independent. Some tasks are faster than others but I'm getting the result of each task after longer tasks have completed.
I created a local Client and use client.compute() to send tasks. Then I use future.result() to get the result of each task.
I'm using threads to ask for results at the same time and measure the time for each result to compute like this:
def get_result(future,i):
t0 = time.time()
print("calculating result", i)
result = future.result()
print("result {} took {}".format(i, time.time() - t0))
client = Client()
df = dd.read_csv(path_to_csv)
future1 = client.compute(df[df.x > 200])
future2 = client.compute(df[df.x > 500])
threading.Thread(target=get_result, args=[future1,1]).start()
threading.Thread(target=get_result, args=[future2,2]).start()
I expect the output of the above code to be something like:
calculating result 1
calculating result 2
result 2 took 10
result 1 took 46
Since the first task is larger.
But instead I got both at the same time
calculating result 1
calculating result 2
result 2 took 46.3046760559082
result 1 took 46.477620363235474
I asume that is because future2 actually computes in the background and finishes before future1, but it waits until future1 is completed to return.
Is there a way I can get the result of future2 at the moment it finishes ?
You do not need to make threads to use futures in an asynchronous fashion - they are already inherently async, and monitor their status in the background. If you want to get results in the order they are ready, you should use as_completed.
However, fo your specific situation, you may want to simply view the dashboard (or use df.visulalize()) to understand the computation which is happening. Both futures depend on reading the CSV, and this one task will be required before either can run - and probably takes the vast majority of the time. Dask does not know, without scanning all of the data, which rows have what value of x.

Google dataflow streaming pipeline is not distributing workload over several workers after windowing

I'm trying to set up a dataflow streaming pipeline in python. I have quite some experience with batch pipelines. Our basic architecture looks like this:
The first step is doing some basic processing and takes about 2 seconds per message to get to the windowing. We are using sliding windows of 3 seconds and 3 second interval (might change later so we have overlapping windows). As last step we have the SOG prediction that takes about 15ish seconds to process and which is clearly our bottleneck transform.
So, The issue we seem to face is that the workload is perfectly distributed over our workers before the windowing, but the most important transform is not distributed at all. All the windows are processed one at a time seemingly on 1 worker, while we have 50 available.
The logs show us that the sog prediction step has an output once every 15ish seconds which should not be the case if the windows would be processed over more workers, so this builds up huge latency over time which we don't want. With 1 minute of messages, we have a latency of 5 minutes for the last window. When distribution would work, this should only be around 15sec (the SOG prediction time). So at this point we are clueless..
Does anyone see if there is something wrong with our code or how to prevent/circumvent this?
It seems like this is something happening in the internals of google cloud dataflow. Does this also occur in java streaming pipelines?
In batch mode, Everything works fine. There, one could try to do a reshuffle to make sure no fusion etc occurs. But that is not possible after windowing in streaming.
args = parse_arguments(sys.argv if argv is None else argv)
pipeline_options = get_pipeline_options(project=args.project_id,
job_name='XX',
num_workers=args.workers,
max_num_workers=MAX_NUM_WORKERS,
disk_size_gb=DISK_SIZE_GB,
local=args.local,
streaming=args.streaming)
pipeline = beam.Pipeline(options=pipeline_options)
# Build pipeline
# pylint: disable=C0330
if args.streaming:
frames = (pipeline | 'ReadFromPubsub' >> beam.io.ReadFromPubSub(
subscription=SUBSCRIPTION_PATH,
with_attributes=True,
timestamp_attribute='timestamp'
))
frame_tpl = frames | 'CreateFrameTuples' >> beam.Map(
create_frame_tuples_fn)
crops = frame_tpl | 'MakeCrops' >> beam.Map(make_crops_fn, NR_CROPS)
bboxs = crops | 'bounding boxes tfserv' >> beam.Map(
pred_bbox_tfserv_fn, SERVER_URL)
sliding_windows = bboxs | 'Window' >> beam.WindowInto(
beam.window.SlidingWindows(
FEATURE_WINDOWS['goal']['window_size'],
FEATURE_WINDOWS['goal']['window_interval']),
trigger=AfterCount(30),
accumulation_mode=AccumulationMode.DISCARDING)
# GROUPBYKEY (per match)
group_per_match = sliding_windows | 'Group' >> beam.GroupByKey()
_ = group_per_match | 'LogPerMatch' >> beam.Map(lambda x: logging.info(
"window per match per timewindow: # %s, %s", str(len(x[1])), x[1][0][
'timestamp']))
sog = sliding_windows | 'Predict SOG' >> beam.Map(predict_sog_fn,
SERVER_URL_INCEPTION,
SERVER_URL_SOG )
pipeline.run().wait_until_finish()
In beam the unit of parallelism is the key--all the windows for a given key will be produced on the same machine. However, if you have 50+ keys they should get distributed among all workers.
You mentioned that you were unable to add a Reshuffle in streaming. This should be possible; if you're getting errors please file a bug at https://issues.apache.org/jira/projects/BEAM/issues . Does re-windowing into GlobalWindows make the issue with reshuffling go away?
It looks like you do not necessarily need GroupByKey because you are always grouping on the same key. Instead you could maybe use CombineGlobally to append all the elements inside the window in stead of the GroupByKey (with always the same key).
combined = values | beam.CombineGlobally(append_fn).without_defaults()
combined | beam.ParDo(PostProcessFn())
I am not sure how the load distribution works when using CombineGlobally but since it does not process key,value pairs I would expect another mechanism to do the load distribution.

Apache Beam - Sliding Windows outputting multiple windows

I'm taking a PCollection of sessions and trying to get average session duration per channel/connection. I'm doing something where my early triggers are firing for each window produced - if 60min windows sliding every 1 minute, an early trigger will fire 60 times. Looking at the timestamps on the outputs, there's a window every minute for 60minutes into the future. I'd like the trigger to fire once for the most recent window so that every 10 seconds I have an average of session durations for the last 60 minutes.
I've used sliding windows before and had the results I expected. By mixing sliding and sessions windows, I'm somehow causing this.
Let me paint you a picture of my pipeline:
First, I'm creating sessions based on active users:
.apply("Add Window Sessions",
Window.<KV<String, String>> into(Sessions.withGapDuration(Duration.standardMinutes(60)))
.withOnTimeBehavior(Window.OnTimeBehavior.FIRE_ALWAYS)
.triggering(
AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(10))))
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes()
)
.apply("Group Sessions", Latest.perKey())
Steps after this create a session object, compute session duration, etc. This ends with a PCollection(Session).
I create a KV of connection,duration from the Pcollection(Session).
Then I apply the sliding window and then the mean.
.apply("Apply Rolling Minute Window",
Window. < KV < String, Integer >> into(
SlidingWindows
.of(Duration.standardMinutes(60))
.every(Duration.standardMinutes(1)))
.triggering(
Repeatedly.forever(
AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(10)))
)
)
.withAllowedLateness(Duration.standardMinutes(1))
.discardingFiredPanes()
)
.apply("Get Average", Mean.perKey())
It's at this point where I'm seeing issues. What I'd like to see is a single output per key with the average duration. What I'm actually seeing is 60 outputs for the same key for each minute into the next 60 minutes.
With this log in a DoFn with C being the ProcessContext:
LOG.info(c.pane().getTiming() + " " + c.timestamp());
I get this output 60 times with timestamps 60 minutes into the future:
EARLY 2017-12-17T20:41:59.999Z
EARLY 2017-12-17T20:43:59.999Z
EARLY 2017-12-17T20:56:59.999Z
(cont)
The log was printed at Dec 17, 2017 19:35:19.
The number of outputs is always window size/slide duration. So if I did 60 minute windows every 5 minutes, I would get 12 output.
I think I've made sense of this.
Sliding windows create a new window with the .every() function. Setting early firings applies to each window so getting multiple firings makes sense.
In order to fit my use case and only output the "current window", I'm checking c.pane().isFirst() == true before outputting results and adjusting the .every() to control the frequency.

CUDA kernels and memory access (one kernel doesn't execute entirely and the next doesn't get launched)

I'm having trouble here. I launch two kernels , check if some value is the one expected (memcpy to the host), if it is I stop, if it isn't I launch the two kernels again.
the first kernel:
__global__ void aco_step(const KPDeviceData* data)
{
int obj = threadIdx.x;
int ant = blockIdx.x;
int id = threadIdx.x + blockIdx.x * blockDim.x;
*(data->added) = 1;
while(*(data->added) == 1)
{
*(data->added) = 0;
//check if obj fits
int fits = (data->obj_weights[obj] + data->weight[ant] <= data->max_weight);
fits = fits * !(getElement(data->selections, data->selections_pitch, ant, obj));
if(obj == 0)
printf("ant %d going..\n", ant);
__syncthreads();
...
The code goes on after this. But that printf never gets printed, that syncthreads is there just for debugging purposes.
The "added" variable was shared, but since shared memory is a PITA and usually throws bugs in the code, i just removed it for now. This "added" variable isn't the smartest thing to do but it's faster than the alternative, which is checking if any variable within an array is some value on the host and deciding to keep iterating or not.
The getElement, simply does the matrix memory calculation with the pitch to access the right position and returns the element there:
int* el = (int*) ((char*)mat + row * pitch) + col;
return *el;
The obj_weights array has the right size, n*sizeof(int). So does the weight array, ants*sizeof(float). So they aren't out of bounds.
The kernel after this one has a printf right on the beginning, and it doesn't get printed either and after the printf it sets a variable on the device memory, and this memory is copied to the CPU after the kernel finished, and it isn't the right value when I print it in the CPU code. So I think this kernel is doing something illegal and the second one doesn't even get launched.
I'm testing some instances, when I launch 8 blocks and 512 threads, it runs OK. 32 blocks, 512 threads, OK. But 8 blocks and 1024 threads, and this happens, the kernel doesn't work, neither 32 blocks and 1024 threads.
Am I doing something wrong? Memory access? Am I launching too many threads?
edit: tried removing the "added" variable and the while loop, so it should execute just once. Still doesnt work, nothing gets printed, even if the printf is right after the three initial lines and the next kernel also doesn't print anything.
edit: another thing, I'm using a GTX 570, so the "Maximum number of threads per block" is 1024 according to http://en.wikipedia.org/wiki/CUDA. Maybe I'll just stick with 512 maximum or check on how higher I can put this value.
__syncthreads() inside conditional code is only allowed if the condition evaluates identically on all threads of a block.
In your case the condition suffers a race condition and is nondeterministic, so it most probably evaluates to different results for different threads.
printf() output is only displayed after the kernel finishes successfully. In this case it doesn't due to the problem mentioned above, so the output never shows up. You could have figured out this by testing the return codes all CUDA function calls for errors.

Resources