Using cyclops-react for batching on an async queue stream

I am trying to use cyclops-react to batch the elements from a queue based on size, but also on time, so it doesn't block when there are no elements.
Maybe the functionality is not what I expected, or I am doing something wrong.
The complete code (Groovy), with the producer in another thread, is like this:
Queue<String> queue = QueueFactories.<String>unboundedQueue().build();
new Thread({
while (true) {
sleep(1000)
queue.offer("New message " + System.currentTimeMillis());
}
}).start();
StreamSource.futureStream(queue, new LazyReact(ThreadPools.queueCopyExecutor))
.groupedBySizeAndTime(10,500,TimeUnit.MILLISECONDS)
.forEach({i->println(i + " Batch Time: ${System.currentTimeMillis()}")})
The output is:
[New message 1487673650332, Batch Time: 1487673651356]
[New message 1487673651348, New message 1487673652352, Batch Time: 1487673653356]
[New message 1487673653355, New message 1487673654357, Batch Time: 1487673655362]
[New message 1487673655362, New message 1487673656364, Batch Time: 1487673657365]
But I was expecting one element in each batch, since the delay between offered elements is 1 second while the batch timeout is half a second.
I also tried with an asynchronous stream (Groovy code):
Queue<String> queue = QueueFactories.<String>unboundedQueue().build();
StreamSource.futureStream(queue, new LazyReact(ThreadPools.queueCopyExecutor))
.async()
.groupedBySizeAndTime(10, 500,TimeUnit.MILLISECONDS)
.peek({i->println(i + "Batch Time: ${System.currentTimeMillis()}")}).run();
while (true) {
queue.offer("New message " + System.currentTimeMillis());
sleep(1000)
}
Again, it only emits a batch every 2 seconds, usually waiting for two elements per batch, even though the batch timeout is half a second:
[New message 1487673877780, Batch Time: 1487673878819]
[New message 1487673878811, New message 1487673879812, Batch Time: 1487673880815]
[New message 1487673880814, New message 1487673881819, Batch Time: 1487673882823]
[New message 1487673882823, New message 1487673883824, Batch Time: 1487673884828]
[New message 1487673884828, New message 1487673885831, Batch Time: 1487673886835]
I did a third experiment with a plain (non-future, non-lazy) stream, and this time it worked:
Queue<String> queue = QueueFactories.<String>unboundedQueue().build();
new Thread({
while (true) {
sleep(1000)
queue.offer("New message " + System.currentTimeMillis());
}
}).start();
queue.stream()
.groupedBySizeAndTime(10,500,TimeUnit.MILLISECONDS)
.forEach({i->println(i + " Batch Time " + System.currentTimeMillis())})
Result:
[New message 1487673288017, New message 1487673289027, Batch Time , 1487673289055]
[New message 1487673290029, Batch Time , 1487673290029]
[New message 1487673291033, Batch Time , 1487673291033]
[New message 1487673292037, Batch Time , 1487673292037]
Why does the batching seem to behave incorrectly when you use a future stream?

The difference in behaviour is due to a bug that reduces the efficiency of grouping FutureStreams backed by an async.Queue. Basically, unless the next result is present within the 500 ms limit of the previous one, the Stream will ask the Queue for another value and wait until it arrives. This will be fixed in a future release of cyclops-react.
It is possible to work around this in a couple of ways.
The first is a workaround suggested by Jesus Menendez in the bug report:
queue.stream()
.groupedBySizeAndTime(batchSize, batchTimeoutMillis, TimeUnit.MILLISECONDS)
.futureStream(new LazyReact(ThreadPools.getSequential()))
.async()
.peek(this::executeBatch)
.run();
This avoids the overhead that results in two values being batched together.
Alternatively, we can time out after 500 ms (and not wait until a value arrives in the Queue for batching) by making use of the streamBatch operator:
Queue<String> queue = QueueFactories.<String>unboundedQueue().build();
new Thread(()->{
for(int i=0;i<10;i++){
queue.offer("New message " + i);
sleep(10000);
}
queue.close();
}).start();
long toRun = TimeUnit.MILLISECONDS.toNanos(500l);
queue.streamBatch(new Subscription(), source->{
return ()->{
List<String> result = new ArrayList<>();
long start = System.nanoTime();
while (result.size() < 10 && (System.nanoTime() - start) < toRun) {
try {
String next = source.apply(1l, TimeUnit.MILLISECONDS);
if (next != null) {
result.add(next);
}
}catch(Queue.QueueTimeoutException e){
}
}
start=System.nanoTime();
return result;
};
}).filter(l->l.size()>0)
.futureStream(new LazyReact(ThreadPools.getSequential()))
.async()
.peek(System.out::println)
.run();
In this case we will always group after 500ms and not wait until a value we have asked for arrives in the Queue.
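The size-or-time rule used in the loop above can also be illustrated without cyclops-react. The following is a minimal sketch in plain Java, assuming a java.util.concurrent.BlockingQueue stands in for the cyclops-react Queue; TimedBatcher and nextBatch are hypothetical names chosen for this illustration and are not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: illustrates "emit after maxSize elements OR after the timeout".
final class TimedBatcher {

    // Collects up to maxSize elements but never waits longer than the timeout in total.
    // Returns an empty list if nothing arrived within the window.
    static <T> List<T> nextBatch(BlockingQueue<T> queue, int maxSize,
                                 long timeout, TimeUnit unit) {
        List<T> batch = new ArrayList<>();
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        try {
            while (batch.size() < maxSize) {
                long remaining = deadline - System.nanoTime();
                if (remaining <= 0) {
                    break; // the time window is over, emit what we have
                }
                // poll() gives up once the rest of the window has elapsed
                T next = queue.poll(remaining, TimeUnit.NANOSECONDS);
                if (next == null) {
                    break; // timed out waiting for the next element
                }
                batch.add(next);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt flag, return the partial batch
        }
        return batch;
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        queue.offer("a");
        queue.offer("b");
        queue.offer("c");
        // The first batch is cut by size, the second by the 500 ms timeout.
        System.out.println(nextBatch(queue, 2, 500, TimeUnit.MILLISECONDS));
        System.out.println(nextBatch(queue, 2, 500, TimeUnit.MILLISECONDS));
    }
}
```

The key design choice is that the wait for each element is bounded by the time remaining in the current window, so a slow producer cannot hold a batch open past its timeout.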

Related

Jenkins pipeline - custom timeout behavior

I need custom behavior for the timeout function. For example, when I use:
timeout(time: 10, unit: 'MINUTES') {
doSomeStuff()
}
it terminates the doSomeStuff() function.
What I want to achieve is not to terminate the execution of the function, but to call another function every 10 minutes until doSomeStuff() is done with executing.
I can't use the Build-timeout Plugin from Jenkins since I need to apply this behavior to pipelines.
Any help would be appreciated.
In case anyone else has the same issue: after some research, the only way I found to solve my problem was to modify the notification plugin for the Jenkins pipeline, adding a new field that holds the time (in minutes) by which to delay invoking the URL. In the code where the URL was invoked, I moved those lines into a new thread and let that thread sleep for the required amount of time before executing the remaining code. Something like this:
@Override
public void onStarted(final Run r, final TaskListener listener) {
HudsonNotificationProperty property = (HudsonNotificationProperty) r.getParent().getProperty(HudsonNotificationProperty.class);
int invokeUrlTimeout = 0;
if (property != null && !property.getEndpoints().isEmpty()){
invokeUrlTimeout = property.getEndpoints().get(0).getInvokeUrlTimeout();
}
int finalInvokeUrlTimeout = invokeUrlTimeout;
new Thread(() -> {
sleep(finalInvokeUrlTimeout * 60 * 1000);
Executor e = r.getExecutor();
Phase.QUEUED.handle(r, TaskListener.NULL, e != null ? System.currentTimeMillis() - e.getTimeSpentInQueue() : 0L);
Phase.STARTED.handle(r, listener, r.getTimeInMillis());
}).start();
}
Maybe not the best solution but it works for me, and I hope it helps other people too.
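For the behaviour asked about in the question (let doSomeStuff() keep running and call another function at a fixed interval until it finishes), the general pattern can be sketched in plain Java with a ScheduledExecutorService. This is only an illustration under stated assumptions: PeriodicReminder and runWithReminder are hypothetical names, and whether this can be used as-is inside a pipeline script is not something the answer above addresses.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs a task and calls a reminder periodically while it runs.
final class PeriodicReminder {

    // Runs task on the calling thread and invokes reminder every `period`
    // until task returns, either normally or by throwing.
    static void runWithReminder(Runnable task, Runnable reminder,
                                long period, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // The first reminder fires after one full period, then repeats.
        scheduler.scheduleAtFixedRate(reminder, period, period, unit);
        try {
            task.run();
        } finally {
            scheduler.shutdownNow(); // stop reminding once the task is done
        }
    }

    public static void main(String[] args) {
        runWithReminder(
            () -> {
                try {
                    Thread.sleep(350); // stands in for the long-running doSomeStuff()
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            },
            () -> System.out.println("still running..."),
            100, TimeUnit.MILLISECONDS);
        System.out.println("task finished");
    }
}
```

Unlike a timeout, nothing here interrupts the task; the scheduler only observes that it is still running.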

Reactor Flux and asynchronous processing

I am trying to learn Reactor but I am having a lot of trouble with it. I wanted to do a very simple proof of concept where I simulate calling a slow downstream service one or more times. If you use Reactor and stream the response, the caller doesn't have to wait for all the results.
So I created a very simple controller, but it is not behaving as I expect. When the delay is "inside" my flatMap (inside the method I call), the response is not returned until everything is complete. But when I add a delay after the flatMap, the data is streamed.
Why does this code result in a stream of JSON:
@GetMapping(value = "/test", produces = { MediaType.APPLICATION_STREAM_JSON_VALUE })
Flux<HashMap<String, Object>> customerCards(@PathVariable String customerId) {
Integer count = service.getCount(customerId);
return Flux.range(1, count).
flatMap(k -> service.doRestCall(k)).delayElements(Duration.ofMillis(5000));
}
But this does not
@GetMapping(value = "/test2", produces = { MediaType.APPLICATION_STREAM_JSON_VALUE })
Flux<HashMap<String, Object>> customerCards(@PathVariable String customerId) {
Integer count = service.getCount(customerId);
return Flux.range(1, count).
flatMap(k -> service.doRestCallWithDelay(k));
}
I think I am missing something very basic about the Reactor API. On that note, can anyone point to a good book or tutorial on Reactor? I can't seem to find anything good to learn from.
Thanks
The code inside the flatMap runs on the main thread (that is, the thread the controller runs on). As a result the whole process is blocked and the method doesn't return immediately. Keep in mind that Reactor doesn't impose a particular threading model.
In contrast, according to the documentation, delayElements delays signals and continues them on the parallel default Scheduler. That means the main thread is not blocked and returns immediately.
Here are two corresponding examples:
Blocking code:
Flux.range(1, 500)
.map(i -> {
//blocking code
try {
Thread.sleep(1000);
} catch (InterruptedException e) {
e.printStackTrace();
}
System.out.println(Thread.currentThread().getName() + " - Item : " + i);
return i;
})
.subscribe();
System.out.println("main completed");
Result:
main - Item : 1
main - Item : 2
main - Item : 3
...
main - Item : 500
main completed
Non-blocking code:
Flux.range(1, 500)
.delayElements(Duration.ofSeconds(1))
.subscribe(i -> {
System.out.println(Thread.currentThread().getName() + " - Item : " + i);
});
System.out.println("main Completed");
//sleep main thread in order to be able to print the println of the flux
try {
Thread.sleep(30000);
} catch (InterruptedException e) {
e.printStackTrace();
}
Result:
main Completed
parallel-1 - Item : 1
parallel-2 - Item : 2
parallel-3 - Item : 3
parallel-4 - Item : 4
...
Here is the Project Reactor reference guide.
The delayElements method only delays each Flux element by a given duration; see the javadoc for more details.
I think you should post details about the methods service.doRestCallWithDelay(k) and service.doRestCall(k) if you need more help.

How to make an executor service wait until all threads finish

I use an executor service to launch multiple threads that send requests to an API and get data back. Sometimes I see that some threads haven't finished their job yet when the service has already killed them. How can I force the service to wait until the threads finish their job?
Here is my code:
ExecutorService pool = Executors.newFixedThreadPool(10);
List<Future<List<Book>>> futures = Lists.newArrayList();
final ObjectMapper mapper1 = new ObjectMapper();
for (final Author a : authors) {
futures.add(pool.submit(new Callable<List<Book>>() {
@Override
public List<Book> call() throws Exception {
String urlStr = "http://localhost/api/book?limit=5000&authorId=" + a.getId();
List<JsonBook> Jsbooks = mapper1.readValue(
new URL(urlStr), BOOK_LIST_TYPE_REFERENCE);
List<Book> books = Lists.newArrayList();
for (JsonBook jsonBook : Jsbooks) {
books.add(jsonBook.toAvro());
}
return books;
}
}));
}
pool.shutdown();
pool.awaitTermination(3, TimeUnit.MINUTES);
List<Book> bookList = Lists.newArrayList();
for (Future<List<Book>> future : futures) {
if (!future.isDone()) {
LogUtil.info("future " + future.toString()); // <-- future not finished yet
throw new RuntimeException("Future to retrieve books: " + future + " did not complete");
}
bookList.addAll(future.get());
}
I saw some exceptions at the (!future.isDone()) block. How can I make sure every future is done when the executor service shuts down?
I like to use the countdown latch.
Set the latch to the size that you're iterating and pass that latch into your callables, then in your run / call method have a try/finally block that decrements the countdown latch.
After everything has been enqueued to your executor service, just call your latch's await method, which will block until it's all done. At that time all your callables will be finished, and you can properly shut down your executor service.
This link has an example of how to set it up.
http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/CountDownLatch.html
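The latch approach described above can be sketched as follows. This is a minimal illustration, not the asker's code: LatchExample and runAll are hypothetical names, and simple Callable jobs stand in for the HTTP calls from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: waits on a CountDownLatch before shutting the pool down.
final class LatchExample {

    static <T> List<T> runAll(List<Callable<T>> jobs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch latch = new CountDownLatch(jobs.size());
        List<Future<T>> futures = new ArrayList<>();
        try {
            for (Callable<T> job : jobs) {
                futures.add(pool.submit(() -> {
                    try {
                        return job.call();
                    } finally {
                        latch.countDown(); // count down even if the job throws
                    }
                }));
            }
            latch.await(); // blocks until every job has counted down
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get()); // get() itself waits for completion
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for jobs", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a job failed", e.getCause());
        } finally {
            pool.shutdown(); // safe now: no work is still pending
        }
    }

    public static void main(String[] args) {
        List<Callable<String>> jobs = new ArrayList<>();
        jobs.add(() -> "first");
        jobs.add(() -> "second");
        System.out.println(runAll(jobs, 2));
    }
}
```

Because the latch has no timeout here, the wait is not cut off after a fixed interval the way awaitTermination(3, TimeUnit.MINUTES) is in the question; CountDownLatch.await also has a timed overload if an upper bound is wanted.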

Several parallel batches in Neo4jphp

Is it possible to create several batches at one time?
For example, I have code with a running batch (batch 1). Inside this batch I call a method that has another batch inside it (batch 2). The code is not working.
When I remove the upper batch (batch 1) I have a created node. Maybe there is only 1 batch possible at one time?
The example code is below:
$batch = $client->startBatch();
$widget = NULL;
try {
$widgetLabel = $client->makeLabel('Widget');
$widget = $client->makeNode();
$widget
->setProperty('base_filename', md5(uniqid('', TRUE)))
->setProperty('datetime_added', time())
->setProperty('current_version', 0)
->setProperty('shared', 0)
->setProperty('active', 1)
->save();
// add widget history
$history = Model_History::create($widget, $properties);
if ($history == NULL) {
throw new Exception('Could not create widget history!');
}
$widget->setProperty('current_version', $history->getID());
$widget->save();
$client->commitBatch($batch);
} catch (Exception $e) {
$client->endBatch();
}
The batch 2 is inside the Model_History::create() method. I don't get a valid $widget - Neo4jphp node from this code.
If the second batch is created with another call to $client->startBatch(), it will actually be the same batch object as $batch. If you call $client->commitBatch() from there, it will commit the outer batch (since they are the same).
Don't start a second batch in Model_History::create(). Start the outer batch, go through all your code, and commit it once at the end.

Timeout Notification for Asynchronous Request

I am sending SPARQL queries as asynchronous requests to a SPARQL endpoint, currently DBpedia using the dotNetRDF library. While simpler queries usually work, more complex queries sometimes result in timeouts.
I am looking for a way to handle the timeouts by capturing some event when they occur.
I am sending my queries by using one of the asynchronous QueryWithResultSet overloads of the SparqlRemoteEndpoint class.
As described for SparqlResultsCallback, the state object will be replaced with an AsyncError instance if the asynchronous request failed. This does indicate that there was a timeout, however it seems that it only does so 10 minutes after the request was sent. When my timeout is, for example, 30 seconds, I would like to know 30 seconds later whether the request was successful. (35 seconds are ok, too, but you get the idea.)
Here is a sample application that sends two requests, the first of which is very simple and likely to succeed within the timeout (here set to 120 seconds), while the second one is rather complex and may easily fail on DBpedia:
using System;
using System.Collections.Concurrent;
using VDS.RDF;
using VDS.RDF.Query;
public class TestTimeout
{
private static string FormatResults(SparqlResultSet results, object state)
{
var result = new System.Text.StringBuilder();
result.AppendLine(DateTime.Now.ToLongTimeString());
var asyncError = state as AsyncError;
if (asyncError != null) {
result.AppendLine(asyncError.State.ToString());
result.AppendLine(asyncError.Error.ToString());
} else {
result.AppendLine(state.ToString());
}
if (results == null) {
result.AppendLine("results == null");
} else {
result.AppendLine("results.Count == " + results.Count.ToString());
}
return result.ToString();
}
public static void Main(string[] args)
{
Console.WriteLine("Launched ...");
Console.WriteLine(DateTime.Now.ToLongTimeString());
var output = new BlockingCollection<string>();
var ep = new SparqlRemoteEndpoint(new Uri("http://dbpedia.org/sparql"));
ep.Timeout = 120;
Console.WriteLine("Server == " + ep.Uri.AbsoluteUri);
Console.WriteLine("HTTP Method == " + ep.HttpMode);
Console.WriteLine("Timeout == " + ep.Timeout.ToString());
string query = "SELECT DISTINCT ?a\n"
+ "WHERE {\n"
+ " ?a <http://www.w3.org/2000/01/rdf-schema#label> ?b.\n"
+ "}\n"
+ "LIMIT 10\n";
ep.QueryWithResultSet(query,
(results, state) => {
output.Add(FormatResults(results, state));
},
"Query 1");
query = "SELECT DISTINCT ?v5 ?v8\n"
+ "WHERE {\n"
+ " {\n"
+ " SELECT DISTINCT ?v5\n"
+ " WHERE {\n"
+ " ?v6 ?v5 ?v7.\n"
+ " FILTER(regex(str(?v5), \"[/#]c[^/#]*$\", \"i\")).\n"
+ " }\n"
+ " OFFSET 0\n"
+ " LIMIT 20\n"
+ " }.\n"
+ " OPTIONAL {\n"
+ " ?v5 <http://www.w3.org/2000/01/rdf-schema#label> ?v8.\n"
+ " FILTER(lang(?v8) = \"en\").\n"
+ " }.\n"
+ "}\n"
+ "ORDER BY str(?v5)\n";
ep.QueryWithResultSet(query,
(results, state) => {
output.Add(FormatResults(results, state));
},
"Query 2");
Console.WriteLine("Queries sent.");
Console.WriteLine(DateTime.Now.ToLongTimeString());
Console.WriteLine();
string result = output.Take();
Console.WriteLine(result);
result = output.Take();
Console.WriteLine(result);
Console.ReadLine();
}
}
When I run this, I reproducibly get an output like the following:
13:13:23
Server == http://dbpedia.org/sparql
HTTP Method == GET
Timeout == 120
Queries sent.
13:13:25
13:13:25
Query 1
results.Count == 10
13:23:25
Query 2
VDS.RDF.Query.RdfQueryException: A HTTP error occurred while making an asynchronous query, see inner exception for details ---> System.Net.WebException: The remote server returned an error: (504) Gateway Timeout.
at System.Net.HttpWebRequest.EndGetResponse(IAsyncResult asyncResult)
at VDS.RDF.Query.SparqlRemoteEndpoint.<>c__DisplayClass13.<QueryWithResultSet>b__11(IAsyncResult innerResult)
--- End of inner exception stack trace ---
results == null
Obviously, the exact times will be different, but the crucial point is that the error message based on the second query is received approximately 10 minutes after the request was sent, nowhere near the 2 minutes set for the timeout.
Am I using dotNetRDF incorrectly here, or is it intentional that I have to run an additional timer to measure the timeout myself and react on my own unless any response has been received meanwhile?
No, you are not using dotNetRDF incorrectly; rather, there appears to be a bug where the timeouts set on an endpoint are not honoured when running queries asynchronously. This has been filed as CORE-393.
By the way, even with this bug fixed you won't necessarily get a hard timeout at the set value. Essentially, the value you set for the Timeout property of the SparqlRemoteEndpoint instance is used to set the Timeout property of the .NET HttpWebRequest. The documentation for HttpWebRequest.Timeout states the following:
Gets or sets the time-out value in milliseconds for the GetResponse
and GetRequestStream methods.
So you could wait up to the time-out to make the connection to POST the query and then up to the time-out again to start receiving a response. Once you start receiving a response, the timeout becomes irrelevant and is not respected by the code that processes the response.
Therefore, if you want a hard timeout you are better off implementing it yourself. Longer term this may be something we can add to dotNetRDF, but it is more complex to implement than simply fixing the bug about the timeout not being honoured for the HTTP request.

Resources