cudaStreamWaitEvent does not seem to wait

I am attempting to write a small demo program with two CUDA streams that make progress and, governed by events, wait for each other. So far the program looks like this:
// event.cu
#include <iostream>
#include <cstdio>
#include <cuda_runtime.h>
#include <cuda.h>
using namespace std;
__global__ void k_A1() { printf("\tHi! I am Kernel A1.\n"); }
__global__ void k_B1() { printf("\tHi! I am Kernel B1.\n"); }
__global__ void k_A2() { printf("\tHi! I am Kernel A2.\n"); }
__global__ void k_B2() { printf("\tHi! I am Kernel B2.\n"); }
int main()
{
    cudaStream_t streamA, streamB;
    cudaEvent_t halfA, halfB;
    cudaStreamCreate(&streamA);
    cudaStreamCreate(&streamB);
    cudaEventCreate(&halfA);
    cudaEventCreate(&halfB);

    cout << "Here is the plan:" << endl <<
        "Stream A: A1, launch 'HalfA', wait for 'HalfB', A2." << endl <<
        "Stream B: Wait for 'HalfA', B1, launch 'HalfB', B2." << endl <<
        "I would expect: A1,B1, (A2 and B2 running concurrently)." << endl;

    k_A1<<<1,1,0,streamA>>>();            // A1!
    cudaEventRecord(halfA,streamA);       // StreamA triggers halfA!
    cudaStreamWaitEvent(streamA,halfB,0); // StreamA waits for halfB.
    k_A2<<<1,1,0,streamA>>>();            // A2!

    cudaStreamWaitEvent(streamB,halfA,0); // StreamB waits for halfA.
    k_B1<<<1,1,0,streamB>>>();            // B1!
    cudaEventRecord(halfB,streamB);       // StreamB triggers halfB!
    k_B2<<<1,1,0,streamB>>>();            // B2!

    cudaEventDestroy(halfB);
    cudaEventDestroy(halfA);
    cudaStreamDestroy(streamB);
    cudaStreamDestroy(streamA);

    cout << "All has been started. Synchronize!" << endl;
    cudaDeviceSynchronize();
    return 0;
}
My grasp of CUDA streams is the following: a stream is a kind of list to which I can add tasks, and these tasks are tackled in series. So in my program I can rest assured that streamA would, in order:
1. Call kernel k_A1
2. Trigger halfA
3. Wait for someone to trigger halfB
4. Call kernel k_A2
and streamB would, in order:
1. Wait for someone to trigger halfA
2. Call kernel k_B1
3. Trigger halfB
4. Call kernel k_B2
Normally both streams might run asynchronously with respect to each other. However, I would like to block streamB until A1 is done and then block streamA until B1 is done.
This appears not to be so simple. On my Ubuntu machine with a Tesla M2090 (CC 2.0), the output of
nvcc -arch=sm_20 event.cu && ./a.out
is
Here is the plan:
Stream A: A1, launch 'HalfA', wait for 'HalfB', A2.
Stream B: Wait for 'HalfA', B1, launch 'HalfB', B2.
I would expect: A1,B1, (A2 and B2 running concurrently).
All has been started. Synchronize!
Hi! I am Kernel A1.
Hi! I am Kernel A2.
Hi! I am Kernel B1.
Hi! I am Kernel B2.
And I really would have expected B1 to be completed before cudaEventRecord(halfB,streamB). Nevertheless, stream A obviously does not wait for the completion of B1, and hence not for the recording of halfB.
What's more: if I delete the cudaEventRecord commands altogether, I would expect the program to deadlock on the cudaStreamWaitEvent commands. But it does not, and it produces the same output. What am I overlooking here?

I think this is because cudaStreamWaitEvent(streamA,halfB,0); was called before halfB was recorded (cudaEventRecord(halfB,streamB);). The cudaStreamWaitEvent call looks for the most recent recording of halfB preceding it; since there is none, it quietly moves forward as a no-op. See the following documentation:
The stream stream will wait only for the completion of the most recent host call to cudaEventRecord() on event. Once this call has returned, any functions (including cudaEventRecord() and cudaEventDestroy()) may be called on event again, and the subsequent calls will not have any effect on stream.
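A minimal sketch of that no-op behavior (hypothetical stream and kernel names, not from the question):
cudaEvent_t ev;
cudaEventCreate(&ev);
// ev has never been recorded, so per the documentation this wait
// acts as if the record had already completed:
cudaStreamWaitEvent(someStream, ev, 0); // returns immediately; the stream is not blocked
someKernel<<<1,1,0,someStream>>>();     // launches without waiting on anything
This is exactly what happens to the wait on halfB in the question: at the time the wait is issued, no cudaEventRecord on halfB has been seen yet.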
I could not find a solution if you have to issue the work depth-first (all of one stream's commands before the other's); however, the following ordering may lead to what you want:
k_A1<<<1,1,0,streamA>>>(d); // A1!
cudaEventRecord(halfA,streamA); // StreamA triggers halfA!
cudaStreamWaitEvent(streamB,halfA,0); // StreamB waits for halfA.
k_B1<<<1,1,0,streamB>>>(d); // B1!
cudaEventRecord(halfB,streamB); // StreamB triggers halfB!
cudaStreamWaitEvent(streamA,halfB,0); // StreamA waits for halfB.
k_A2<<<1,1,0,streamA>>>(d); // A2!
k_B2<<<1,1,0,streamB>>>(d); // B2!
which is confirmed by profiling. Note that I changed the kernel interfaces (the kernels now take an argument d).

From the docs:
If cudaEventRecord() has not been called on event, this call acts as if the record has already completed, and so is a functional no-op.
https://www.cs.cmu.edu/afs/cs/academic/class/15668-s11/www/cuda-doc/html/group__CUDART__STREAM_gfe68d207dc965685d92d3f03d77b0876.html#gfe68d207dc965685d92d3f03d77b0876
So we need to order these lines so that each record appears in the program before the corresponding event wait. That is, for the stream doing the event wait to actually be blocked until the record completes, the record must be issued earlier in the code!
Here's the original code:
k_A1<<<1,1,0,streamA>>>(); // A1!
cudaEventRecord(halfA,streamA); // StreamA triggers halfA!
cudaStreamWaitEvent(streamA,halfB,0); // StreamA waits for halfB.
k_A2<<<1,1,0,streamA>>>(); // A2!
cudaStreamWaitEvent(streamB,halfA,0); // StreamB waits for halfA.
k_B1<<<1,1,0,streamB>>>(); // B1!
cudaEventRecord(halfB,streamB); // StreamB triggers halfB!
k_B2<<<1,1,0,streamB>>>(); // B2!
We see that the record of halfB is issued on the second-to-last line, but the wait on it is issued above, on the third line. No good. So we reorder. The first thing on streamB is that wait, and our only requirement is that it be issued after the record of halfA, so that line can move up to become the third line.
Likewise, k_B1 can follow it directly, and then the cudaEventRecord for halfB can be moved up before the wait on it. Hmm, does this prevent deadlock, I wonder? It does: every wait is now issued after the record it refers to, so no stream can end up waiting on an event that will never fire:
k_A1<<<1,1,0,streamA>>>(); // A1!
cudaEventRecord(halfA,streamA); // StreamA triggers halfA!
cudaStreamWaitEvent(streamB,halfA,0); // StreamB waits for halfA.
k_B1<<<1,1,0,streamB>>>(); // B1!
cudaEventRecord(halfB,streamB); // StreamB triggers halfB!
cudaStreamWaitEvent(streamA,halfB,0); // StreamA waits for halfB.
k_A2<<<1,1,0,streamA>>>(); // A2!
k_B2<<<1,1,0,streamB>>>(); // B2!

Related

Move from EmitterProcessor to Sinks.many()

For some time I have been using EmitterProcessor with its built-in sink, created as follows:
EmitterProcessor<String> emitter = EmitterProcessor.create();
FluxSink<String> sink = emitter.sink(FluxSink.OverflowStrategy.LATEST);
The emitter is published from using Flux.from:
Flux<String> out = Flux
    .from(emitter
        .log(log.getName()));
and the sink can be passed around and populated with strings simply by calling its next method.
Now we see that EmitterProcessor is deprecated.
It is all replaced with Sinks.many(), like this:
Many<String> sink = Sinks.many().unicast().onBackpressureBuffer();
but how to use that to publish from?
The answer was converting the Sinks.Many to a Flux with asFlux():
Flux<String> out = Flux
    .from(sink.asFlux()
        .log(log.getName()));
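For completeness, a sketch of how values are then published into the new sink (my example values, not from the original code). With Sinks.many() there is no separate FluxSink; you emit directly on the sink:
Many<String> sink = Sinks.many().unicast().onBackpressureBuffer();
Flux<String> out = sink.asFlux().log(log.getName());

// tryEmitNext replaces FluxSink.next; it returns a Sinks.EmitResult
// instead of throwing, so the outcome can be checked if needed.
sink.tryEmitNext("hello");
sink.tryEmitNext("world");
sink.tryEmitComplete();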
Also using that for cancel and termination of the flux:
sink.asFlux().doOnCancel(() -> {
    cancelSink(id, request);
});

/* Handle errors, eviction, expiration */
sink.asFlux().doOnTerminate(() -> {
    disposeSink(id);
});
UPDATE: The cancel and terminate don't appear to work, per this question.
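My guess as to why (an assumption, not stated in the original): doOnCancel and doOnTerminate do not attach anything to the sink itself; each call returns a new, decorated Flux, and in the snippet above that decorated Flux is discarded. The hooks only fire if the decorated instance is the one that actually gets subscribed, roughly:
Flux<String> out = sink.asFlux()
        .doOnCancel(() -> cancelSink(id, request)) // fires when THIS chain is cancelled
        .doOnTerminate(() -> disposeSink(id))      // fires on complete or error
        .log(log.getName());
// Hand 'out' to subscribers; hooks on a separate sink.asFlux() chain won't run.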

How to create three infinite loops inside isolates?

I am learning Dart and working with Isolate. I wrote the following code, expecting that it would create three isolates that each run forever:
import 'dart:isolate';

main() {
  Isolate.spawn(echo, "Hello");
  Isolate.spawn(echo, "Hello2");
  Isolate.spawn(echo, "Hello3");
}

void echo(var message) {
  while (true) {
    print(message);
  }
}
But I am getting very strange output (different every time):
$ dart app.dart
Hello
Hello
Hello
Hello
HelloHello2
Hello
Hello3
Hello2
Hello
The VM will terminate the entire program as soon as the main isolate ends. For you, that happens after you have spawned all three isolates. There is nothing keeping the main isolate alive, so the entire program just ends ... eventually, when the isolate is done shutting down. When that is depends on timing, so it can vary quite a lot.
To keep an isolate alive forever, you can create a ReceivePort. Try adding:
var keepalive = ReceivePort();
to your program, then it should keep running forever.
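Put together, a minimal sketch of the fixed program (just that snippet merged into the original code):
import 'dart:isolate';

main() {
  // An open ReceivePort keeps the main isolate, and thus the VM, alive.
  var keepalive = ReceivePort();
  Isolate.spawn(echo, "Hello");
  Isolate.spawn(echo, "Hello2");
  Isolate.spawn(echo, "Hello3");
}

void echo(var message) {
  while (true) {
    print(message);
  }
}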
Also, the printed output is not just a list of whole lines containing hellos; the lines are intermixed.
The three isolates are running concurrently. They all write to the same output (stdout), so the outputs get intermixed. There is no promise that a print call is atomic, and it isn't, so a print call in one isolate can happen in the middle of a print call in another isolate.
What happens here is that print doesn't just print the argument, it also prints a newline afterwards. Those are two different writes to stdout, so it is possible for another isolate to print its message between the "Hello" and the "\n" following it.

How to stop parsing and reset yacc?

I am writing a job control shell. I use Yacc and Lex for parsing. The top rule in my grammar is pipeline_list, which is a list of pipelines separated by semicolons. Thus, examples of pipeline lists are as follows:
cmd1 | cmd2; cmd3; cmd4 | cmd5 <newline>
cmd1 <newline>
<nothing> <newline>
I represent a pipeline with the pipeline rule (shown below). Within that rule, I do the following:
1. Call execute_pipeline() to execute the pipeline. execute_pipeline() returns -1 if anything went wrong in the execution of the pipeline.
2. Check the return value of execute_pipeline(), and if it is -1, STOP parsing the rest of the input AND make sure Yacc starts fresh when called again in the main function (shown below). The rationale for doing this is the following:
Take, for example, the pipeline list cd ..; ls -al. My intent here would be to move one directory up and then list its contents. However, if execution of the first pipeline (i.e., cd ..) fails, then carrying on to execute the second pipeline (i.e., ls -al) would list the contents of the current directory (not the parent's), which is wrong! For this reason, when parsing a pipeline list of n pipelines, if execution of some pipeline k (k < n) fails, then I want to discard the rest of the pipeline list (i.e., pipelines k+1..n) AND make sure the next invocation of yyparse() starts brand new (i.e., receives new input from readline() -- see code below).
I tried the following, but it does not work:
pipeline:
    simple_command_list redirection_list background pipeline_terminator // ignore these
    {
        if (execute_pipeline() == -1)
        {
            // do some stuff
            // then call YYABORT, YYACCEPT, or YYERROR, but none of them works
        }
    }
int main()
{
    while (1)
    {
        char *buffer = readline("> ");
        if (buffer)
        {
            struct yy_buffer_state *bp;
            bp = yy_scan_string(buffer);
            yy_switch_to_buffer(bp);
            yyparse();
            yy_delete_buffer(bp);
            free(buffer);
        } // end if
    } // end while
    return 0;
} // end main()
You can use YYABORT; in an action to abort the current parse and immediately return from yyparse with a failure. You can use YYACCEPT; to immediately return success from yyparse.
Both of these macros can only be used directly in an action in the grammar -- they can't be used in some other function called by the action.
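A sketch of how that fits the grammar and main loop from the question: YYABORT makes yyparse() return 1 immediately, and since every iteration of the loop scans a brand-new buffer from readline(), the next call to yyparse() starts fresh.
pipeline:
    simple_command_list redirection_list background pipeline_terminator
    {
        if (execute_pipeline() == -1)
        {
            /* clean up any per-line state here, then abandon the parse; */
            /* the rest of this pipeline list is discarded */
            YYABORT; /* yyparse() returns 1 immediately */
        }
    }
    ;
In main(), the return value can then be checked if needed:
int status = yyparse(); /* 0 = input accepted, nonzero = aborted or syntax error */
yy_delete_buffer(bp);   /* discards whatever was left of the aborted line */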

Consuming unbounded data in windows with default trigger

I have a Pub/Sub topic + subscription and want to consume and aggregate the unbounded data from the subscription in a Dataflow pipeline. I use a fixed window and write the aggregates to BigQuery.
Reading and writing (without windowing and aggregation) work fine. But when I pipe the data into a fixed window (to count the elements in each window), the window is never triggered, and thus the aggregates are not written.
Here is my word publisher (it uses kinglear.txt from the examples as input file):
public static class AddCurrentTimestampFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        c.outputWithTimestamp(c.element(), new Instant(System.currentTimeMillis()));
    }
}

public static class ExtractWordsFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        String[] words = c.element().split("[^a-zA-Z']+");
        for (String word : words) {
            if (!word.isEmpty()) { c.output(word); }
        }
    }
}
// main:
Pipeline p = Pipeline.create(o); // 'o' are the pipeline options
p.apply("ReadLines", TextIO.Read.from(o.getInputFile()))
    .apply("Lines2Words", ParDo.of(new ExtractWordsFn()))
    .apply("AddTimestampFn", ParDo.of(new AddCurrentTimestampFn()))
    .apply("WriteTopic", PubsubIO.Write.topic(o.getTopic()));
p.run();
Here is my windowed word counter:
Pipeline p = Pipeline.create(o); // 'o' are the pipeline options

BigQueryIO.Write.Bound tablePipe = BigQueryIO.Write.to(o.getTable(o))
    .withSchema(o.getSchema())
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND);

Window.Bound<String> w = Window
    .<String>into(FixedWindows.of(Duration.standardSeconds(1)));

p.apply("ReadTopic", PubsubIO.Read.subscription(o.getSubscription()))
    .apply("FixedWindow", w)
    .apply("CountWords", Count.<String>perElement())
    .apply("CreateRows", ParDo.of(new WordCountToRowFn()))
    .apply("WriteRows", tablePipe);
p.run();
The above subscriber does not work: the window never seems to trigger using the default trigger. However, if I manually define a trigger, the code works and the counts are written to BigQuery:
Window.Bound<String> w = Window
    .<String>into(FixedWindows.of(Duration.standardSeconds(1)))
    .triggering(AfterProcessingTime
        .pastFirstElementInPane()
        .plusDelayOf(Duration.standardSeconds(1)))
    .withAllowedLateness(Duration.ZERO)
    .discardingFiredPanes();
I would like to avoid specifying custom triggers if possible.
Questions:
Why does my solution not work with Dataflow's default trigger?
How do I have to change my publisher or subscriber to trigger windows using the default trigger?
How are you determining the trigger never fires?
Your PubSubIO.Write and PubSubIO.Read transforms should both specify a timestamp label using withTimestampLabel, otherwise the timestamps you've added will not be written to PubSub and the publish times will be used.
Either way, the input watermark of the pipeline will be derived from the timestamps of the elements waiting in PubSub. Once all inputs have been processed, it will stay back for a few minutes (in case there was a delay in the publisher) before advancing to real time.
What you are likely seeing is that all the elements are published in the same ~1 second window (since the input file is pretty small). These are all read and processed relatively quickly, but the 1-second window they are put in will not trigger until after the input watermark has advanced, indicating that all data in that 1-second window has been consumed.
This won't happen for several minutes, which may make it look like the trigger isn't working. The trigger you wrote instead fires after 1 second of processing time, which is much earlier, but there is no guarantee that all the data in the window has been processed by then.
Steps to get better behavior from the default trigger:
1. Use withTimestampLabel on both the write and read pubsub steps (see the sketch after this list).
2. Have the publisher spread the timestamps out further (e.g., run for several minutes and spread the timestamps out across that range).
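A sketch of the first step, using "ts" as my attribute name (an assumption; the exact builder method also depends on SDK version — timestampLabel in the Dataflow 1.x SDK the question uses, withTimestampAttribute in later Apache Beam releases):
// Publisher: write each element's timestamp into a PubSub attribute.
.apply("WriteTopic", PubsubIO.Write.topic(o.getTopic())
        .timestampLabel("ts"));

// Subscriber: read the timestamps back from the same attribute, so the
// watermark is derived from them rather than from the publish times.
p.apply("ReadTopic", PubsubIO.Read.subscription(o.getSubscription())
        .timestampLabel("ts"))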

Optimizing Lua for cyclic execution

I'm executing my Lua script once per program cycle of 10 ms, using the same lua_State (luaL_newstate is called once in my app).
Calling luaL_loadbuffer compiles the script very fast, for sure; still, it seems unnecessary to do this every time the script is executed, since the script does not change.
I tried saving the binary using lua_dump() and then executing it, but lua_pcall() didn't accept the binary for some reason.
Any ideas on how to optimize? (LuaJIT is unfortunately not an option here.)
Jan
You're correct: if the code is not changing, there is no reason to reprocess it. Perhaps you could do something like the following:
luaL_loadbuffer(state, buff, len, name); // TODO: check return value
while (true) {
    // sleep 10 ms
    lua_pushvalue(state, -1); // make another reference to the loaded chunk
    lua_call(state, 0, 0);
}
You'll note that we simply duplicate the function reference on the top of the stack, since lua_call removes the function that it calls from the stack. This way, you do not lose a reference to the loaded chunk.
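One side note on the sketch above (my addition, not part of the original answer): lua_call propagates script errors by a longjmp out of the loop, so for a cyclic executive that must keep running, lua_pcall may be the safer call:
lua_pushvalue(state, -1);             // duplicate the chunk again
if (lua_pcall(state, 0, 0, 0) != 0) {
    // on failure the error message is on the stack; log it and keep cycling
    fprintf(stderr, "lua error: %s\n", lua_tostring(state, -1));
    lua_pop(state, 1);
}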
Executing luaL_loadbuffer compiles the script into a chunk of Lua code, which you can treat as an anonymous function; the function is left at the top of the stack. You can "save" it the way you would any other Lua value: call lua_setglobal(L, name), which pops the chunk off the stack and stores it under that global name. After that, every time you want to call your function (the chunk), you push it onto the Lua stack with lua_getglobal, push the parameters onto the stack, and call lua_pcall(L, nargs, nresults, errfunc). Lua will pop the function and put nresults results onto the stack regardless of how many results your function actually returns: if more are returned, the extras are discarded; if fewer, the missing ones are nil. Example:
int stat = luaL_loadbuffer(L, scriptBuffer, scriptLen, scriptName);
// check status, if ok save it, else handle error
if (stat == 0)
    lua_setglobal(L, scriptName);
...
// re-use later:
lua_getglobal(L, scriptName);
lua_pushinteger(L, 123);
stat = lua_pcall(L, 1, 1, 0);
// check status, if ok get the result off the stack
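As for why lua_dump() followed by lua_pcall() failed (my reading of the question, assuming Lua 5.1): lua_dump leaves nothing callable behind; it streams the precompiled bytes to a writer callback, and those bytes must be loaded back with luaL_loadbuffer (which accepts binary chunks) before there is a function on the stack to call. A sketch with a hypothetical growable buffer:
typedef struct { char *data; size_t len; } DumpBuf; // hypothetical accumulator

static int writer(lua_State *L, const void *p, size_t sz, void *ud) {
    DumpBuf *b = (DumpBuf *)ud;
    b->data = realloc(b->data, b->len + sz); // TODO: check for NULL
    memcpy(b->data + b->len, p, sz);
    b->len += sz;
    return 0; // 0 signals success to lua_dump
}

// With the compiled chunk on top of the stack:
DumpBuf buf = { NULL, 0 };
lua_dump(L, writer, &buf);           // serializes the chunk; the stack is unchanged
// Later (even in another lua_State): reload the binary chunk as a function.
luaL_loadbuffer(L, buf.data, buf.len, "precompiled");
lua_pcall(L, 0, 0, 0);               // now there is a function on the stack to call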
