I use ext_timed_batch to aggregate batch data into windows. It works; the only problem I have is how to emit the last window after all events have been processed. Take this code:
public static void main(String[] args) {
EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider();
EPRuntime runtime = engine.getEPRuntime();
final EPStatement stmt = engine.getEPAdministrator().createEPL(
"select sum(value) as valueSum " +
"from MyEvent.win:ext_timed_batch(time, 3 msec, 0)");
stmt.addListener((newT, oldT) -> Arrays.stream(newT).forEach(
t -> System.out.println("valueSum=" + t.get("valueSum"))));
runtime.sendEvent(new MyEvent(0, 0));
runtime.sendEvent(new MyEvent(1, 1));
runtime.sendEvent(new MyEvent(2, 2));
runtime.sendEvent(new MyEvent(3, 3));
runtime.sendEvent(new MyEvent(4, 4));
}
MyEvent.java:
public class MyEvent {
public final long time;
public final int value;
public MyEvent(long time, int value) {
this.time = time;
this.value = value;
}
// ... getters
}
Here is the output:
valueSum=3
The output is just one event, which includes the values from events 0, 1 and 2 (whose sum is 3). The aggregate for events 3 and 4 is missing (it would be valueSum=7). How can I process the remaining events at the end?
Things tried
engine.getEPAdministrator().stopAllStatements()
send a forged event with a Long.MAX_VALUE timestamp: this works; however, I'm embedding the Esper engine and want a general-purpose solution. I have no knowledge of the query and no means to forge an event.
use stmt.safeIterator(): it's not meant for this purpose; it returns the events from the last completed window, not from the unfinished one.
send a timer event, runtime.sendEvent(new CurrentTimeEvent(Long.MAX_VALUE)): no effect; it does not affect ext_timed_batch.
Workaround
Send a forged event with a Long.MAX_VALUE timestamp:
runtime.sendEvent(new MyEvent(Long.MAX_VALUE, 0));
This is not a solution in my case, because I'm embedding Esper into another app and I have no knowledge of the event types or of the query (it could be anything, not necessarily a query with ext_timed_batch).
You could just use "stmt.safeIterator()" when you are done processing events. The engine doesn't know what the last event is. The other option is to send an artificial "MyLastEvent" event and declare a context with "output when terminated".
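A minimal sketch of the second suggestion, assuming you can register a MyLastEvent type as the terminating signal (the context name and EPL below are illustrative, not taken from the answer):
// Hypothetical: a context that starts immediately and ends when a MyLastEvent arrives.
engine.getEPAdministrator().createEPL(
        "create context UntilLastEvent start @now end MyLastEvent");
// "output snapshot when terminated" emits the aggregate once, when the context partition ends.
EPStatement ctxStmt = engine.getEPAdministrator().createEPL(
        "context UntilLastEvent " +
        "select sum(value) as valueSum from MyEvent " +
        "output snapshot when terminated");
ctxStmt.addListener((newT, oldT) -> Arrays.stream(newT).forEach(
        t -> System.out.println("valueSum=" + t.get("valueSum"))));
// Sending a MyLastEvent instance then terminates the context and delivers the final sum.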
Related
I have a simple widget subscribed to a Stream of elements.
Each time a new element is received, I would also like to get the previous element and decide which of them passes downstream.
Currently I am using the map operator to store the previous element and calculate the next, like this:
elements.map((e) {
if (this.previous == null) {
this.previous = e;
return e;
}
final next = merge(this.previous, e);
this.previous = e;
return next;
}).listen(...);
How can I do this better and avoid having this.previous?
If you use the rxdart package, there is an extension method called pairwise which, according to the documentation:
Emits the n-th and n-1th events as a pair. The first event won't be emitted until the second one arrives.
Then you should be able to do something along the lines of this:
elements.pairwise().map((pair) => merge(pair.first, pair.last)).listen(...);
Here is one possibility
void main() {
List list = [12, 24, 48, 60];
list.reduce((value, element) {
print(value + element); // Push to another list maybe?
return element;
});
}
If you are working with a stream, try this:
void main() {
var counterStream = Stream<int>.periodic(const Duration(seconds: 1), (x) => x)
.reduce((previous, element) {
print(previous + element); // Push to another stream maybe?
return element;
});
}
I know that in Vaadin 7 (and 8) there were methods to get/set the cursor position of a text field from the server side.
I'm using Vaadin Flow v21, but there are no such methods. How can I get/set the cursor position from the server side synchronously? Setting it seems to work fine using some JavaScript, but I cannot read the actual cursor position, since the call only completes asynchronously. How can I do it synchronously?
I try to read it like this:
public int getSelectionStart() throws InterruptedException
{
Future<Integer> value = UI.getCurrent().getPage().executeJs(
"return $0.shadowRoot.querySelector('[part=\"value\"]').selectionStart;"
, getElement()).toCompletableFuture(Integer.class);
final int[] val = new int[1];
Thread t = new Thread(() ->
{
Integer result = null;
try
{
result = value.get();
}
catch (Exception e)
{
e.printStackTrace();
}
val[0] = result == null ? 0 : result;
});
t.start();
t.join();
return val[0];
}
But the above method gives me an exception saying that the ongoing UI session request has not been closed yet, so basically it cannot execute the JavaScript unless I end the request.
This is how I set the cursor position, which seems to work since I don't need any return value back; ending the request executes the script and sets the proper cursor position.
public void setSelectionStart(int pos)
{
UI.getCurrent().getPage().executeJs("$0.shadowRoot.querySelector('[part=\"value\"]').setSelectionRange($1, $1);", getElement(), pos);
}
Thanks for any help / suggestion!
I hope the documentation for RPC return values could help here: https://vaadin.com/docs/latest/flow/element-api/client-server-rpc/#return-values
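Based on that page, a non-blocking sketch (the readSelectionStart method name and the callback parameter are mine, not part of the Vaadin API): instead of blocking the request thread, the value is handled in a callback once the client responds.
// SerializableConsumer is com.vaadin.flow.function.SerializableConsumer.
public void readSelectionStart(SerializableConsumer<Integer> callback)
{
    UI.getCurrent().getPage().executeJs(
            "return $0.shadowRoot.querySelector('[part=\"value\"]').selectionStart;",
            getElement())
        .then(Integer.class, callback);
}
The current request can then complete normally, and the cursor position arrives in a later round trip instead of as a synchronous return value.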
I am using the rxdart package to handle streams in Dart. I am stuck on a peculiar problem.
Please have a look at this dummy code:
final userId = BehaviorSubject<String>();
Stream<T> getStream(String uid) {
// a sample code that returns a stream
return BehaviorSubject<T>().stream;
}
final Observable<Stream<T>> oops = userId.map((uid) => getStream(uid));
Now I want to convert the oops variable to get only Observable<T>.
I am finding it difficult to explain clearly, but let me try. I have a stream A. I map each output of stream A to another stream B. Now I have a Stream<Stream<B>>, a kind of nested stream. I just want to listen to the latest value produced by this pattern. How may I achieve this?
I will list several ways to flatten the Stream<Stream<T>> into a single Stream<T>.
1. Using pure Dart
As answered by @lrn, this is a pure Dart solution:
Stream<T> flattenStreams<T>(Stream<Stream<T>> source) async* {
await for (var stream in source) yield* stream;
}
Stream<int> getStream(String v) {
return Stream.fromIterable([1, 2, 3, 4]);
}
void main() {
List<String> list = ["a", "b", "c"];
Stream<int> s = flattenStreams(Stream.fromIterable(list).map(getStream));
s.listen(print);
}
Outputs: 1 2 3 4 1 2 3 4 1 2 3 4
2. Using Observable.flatMap
Observable has a method flatMap that flattens the output stream and attaches it to the ongoing stream:
import 'package:rxdart/rxdart.dart';
Stream<int> getStream(String v) {
return Stream.fromIterable([1, 2, 3, 4]);
}
void main() {
List<String> list = ["a", "b", "c"];
Observable<int> s = Observable.fromIterable(list).flatMap(getStream);
s.listen(print);
}
Outputs: 1 2 3 4 1 2 3 4 1 2 3 4
3. Using Observable.switchLatest
Convert a Stream that emits Streams (aka a "Higher Order Stream") into a single Observable that emits the items emitted by the most-recently-emitted of those Streams.
This is the solution I was looking for! I just needed the latest output emitted by the internal stream.
import 'package:rxdart/rxdart.dart';
Stream<int> getStream(String v) {
return Stream.fromIterable([1, 2, 3, 4]);
}
void main() {
List<String> list = ["a", "b", "c"];
Observable<int> s = Observable.switchLatest(
Observable.fromIterable(list).map(getStream));
s.listen(print);
}
Outputs: 1 1 1 2 3 4
It's somewhat rare to have a Stream<Stream<Something>>, so it isn't something that there is much explicit support for.
One reason is that there are several (at least two) ways to combine a stream of streams of things into a stream of things.
Either you listen to each stream in turn, waiting for it to complete before starting on the next, and then emit the events in order.
Or you listen on each new stream the moment it becomes available, and then emit the events from any stream as soon as possible.
The former is easy to write using async/await:
Stream<T> flattenStreams<T>(Stream<Stream<T>> source) async* {
await for (var stream in source) yield* stream;
}
The latter is more complicated because it requires listening on more than one stream at a time, and combining their events. (If only StreamController.addStream allowed more than one stream at a time, then it would be much easier). You can use the StreamGroup class from package:async for this:
import "package:async/async" show StreamGroup;
Stream<T> mergeStreams<T>(Stream<Stream<T>> source) {
var sg = StreamGroup<T>();
source.forEach(sg.add).whenComplete(sg.close);
// This doesn't handle errors in [source].
// Maybe insert
//   .catchError((e, s) => sg.add(Future<T>.error(e, s).asStream()))
// before `.whenComplete` if you worry about errors in [source].
return sg.stream;
}
If you want a Stream<Stream<T>> to return a Stream<T>, you basically need to flatten the stream.
The flatMap function is what you would use here:
public static void main(String[] args) {
    List<String> l = Arrays.asList("a", "b", "c");
    Stream<String> s = l.stream().map(i -> getStream(i)).flatMap(i -> i);
}

static <T> Stream<T> getStream(T uid) {
    // a sample method that returns a stream
    return Stream.of(uid);
}
If you need the first object, then use the findFirst() method
public static void main(String[] args) {
    List<String> l = Arrays.asList("a", "b", "c");
    String str = l.stream().map(i -> getStream(i)).flatMap(i -> i).findFirst().get();
}
You need to call the asyncExpand method on your stream; it lets you transform each element into a sequence of asynchronous events.
I'd like to accomplish the following using Apache Beam:
calculate, every 5 seconds, the number of events read from Pub/Sub in the last minute
The goal is to have a semi-realtime view on the rate data comes in. This can then be expanded towards more complex use cases afterwards.
After searching, I've not come across a way to solve this seemingly simple problem. Things that do not work:
global window + repeated triggers (triggers do not fire when there is no input)
sliding window + withoutDefaults (does not allow empty windows to be emitted apparently)
Any suggestion on how to solve this problem?
As already discussed, Beam does not emit data for empty windows. In addition to the reasons given by Rui Wang, we can add the challenge of how the later stages would handle those empty panes.
Anyway, the specific use case that you describe (monitoring the rolling count of messages) should be possible with some work, even if the metric eventually falls to zero. One possibility would be to publish a steady number of dummy messages which would advance the watermark and fire the panes but are filtered out later within the pipeline. The problem with this approach is that the publishing source needs to be adapted, which might not always be convenient or possible. Another one would be to generate this fake data as another input and co-group it with the main stream. The advantage is that everything can be done in Dataflow without the need to tweak the source or the sink. To illustrate this I provide an example.
The input is divided into two streams. For the dummy one, I used GenerateSequence to create a new element every 5 seconds. I then window the PCollection (the windowing strategy needs to be compatible with the one for the main stream, so I will use the same). Then I map each element to a key-value pair where the value is 0 (we could use other values, as we know which stream each element comes from, but I want to make clear that dummy records are not counted).
PCollection<KV<String,Integer>> dummyStream = p
.apply("Generate Sequence", GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5)))
.apply("Window Messages - Dummy", Window.<Long>into(
...
.apply("Count Messages - Dummy", ParDo.of(new DoFn<Long, KV<String, Integer>>() {
@ProcessElement
public void processElement(ProcessContext c) throws Exception {
c.output(KV.of("num_messages", 0));
}
}));
For the main stream, that reads from Pub/Sub, I map each record to value 1. Later on, I will add all the ones as in typical word count examples using map-reduce stages.
PCollection<KV<String,Integer>> mainStream = p
.apply("Get Messages - Data", PubsubIO.readStrings().fromTopic(topic))
.apply("Window Messages - Data", Window.<String>into(
...
.apply("Count Messages - Data", ParDo.of(new DoFn<String, KV<String, Integer>>() {
@ProcessElement
public void processElement(ProcessContext c) throws Exception {
c.output(KV.of("num_messages", 1));
}
}));
Then we need to join them using a CoGroupByKey (I used the same num_messages key to group counts). This stage will output results when one of the two inputs has elements, therefore unblocking the main issue here (empty windows with no Pub/Sub messages).
final TupleTag<Integer> dummyTag = new TupleTag<>();
final TupleTag<Integer> dataTag = new TupleTag<>();
PCollection<KV<String, CoGbkResult>> coGbkResultCollection = KeyedPCollectionTuple.of(dummyTag, dummyStream)
.and(dataTag, mainStream).apply(CoGroupByKey.<String>create());
Finally, we add all the ones to obtain the total number of messages for the window. If there are no elements coming from dataTag then the sum will just default to 0.
public void processElement(ProcessContext c, BoundedWindow window) {
Integer total_sum = new Integer(0);
Iterable<Integer> dataTagVal = c.element().getValue().getAll(dataTag);
for (Integer val : dataTagVal) {
total_sum += val;
}
LOG.info("Window: " + window.toString() + ", Number of messages: " + total_sum.toString());
}
This should result in one log line per window with the corresponding message count. Note that results from different windows can come unordered (this can happen anyway when writing to BigQuery), and I did not tune the window settings to optimize the example.
Full code:
public class EmptyWindows {
private static final Logger LOG = LoggerFactory.getLogger(EmptyWindows.class);
public static interface MyOptions extends PipelineOptions {
@Description("Input topic")
String getInput();
void setInput(String s);
}
@SuppressWarnings("serial")
public static void main(String[] args) {
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
Pipeline p = Pipeline.create(options);
String topic = options.getInput();
PCollection<KV<String,Integer>> mainStream = p
.apply("Get Messages - Data", PubsubIO.readStrings().fromTopic(topic))
.apply("Window Messages - Data", Window.<String>into(
SlidingWindows.of(Duration.standardMinutes(1))
.every(Duration.standardSeconds(5)))
.triggering(AfterWatermark.pastEndOfWindow())
.withAllowedLateness(Duration.ZERO)
.accumulatingFiredPanes())
.apply("Count Messages - Data", ParDo.of(new DoFn<String, KV<String, Integer>>() {
@ProcessElement
public void processElement(ProcessContext c) throws Exception {
//LOG.info("New data element in main output");
c.output(KV.of("num_messages", 1));
}
}));
PCollection<KV<String,Integer>> dummyStream = p
.apply("Generate Sequence", GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5)))
.apply("Window Messages - Dummy", Window.<Long>into(
SlidingWindows.of(Duration.standardMinutes(1))
.every(Duration.standardSeconds(5)))
.triggering(AfterWatermark.pastEndOfWindow())
.withAllowedLateness(Duration.ZERO)
.accumulatingFiredPanes())
.apply("Count Messages - Dummy", ParDo.of(new DoFn<Long, KV<String, Integer>>() {
@ProcessElement
public void processElement(ProcessContext c) throws Exception {
//LOG.info("New dummy element in main output");
c.output(KV.of("num_messages", 0));
}
}));
final TupleTag<Integer> dummyTag = new TupleTag<>();
final TupleTag<Integer> dataTag = new TupleTag<>();
PCollection<KV<String, CoGbkResult>> coGbkResultCollection = KeyedPCollectionTuple.of(dummyTag, dummyStream)
.and(dataTag, mainStream).apply(CoGroupByKey.<String>create());
coGbkResultCollection
.apply("Log results", ParDo.of(new DoFn<KV<String, CoGbkResult>, Void>() {
@ProcessElement
public void processElement(ProcessContext c, BoundedWindow window) {
Integer total_sum = new Integer(0);
Iterable<Integer> dataTagVal = c.element().getValue().getAll(dataTag);
for (Integer val : dataTagVal) {
total_sum += val;
}
LOG.info("Window: " + window.toString() + ", Number of messages: " + total_sum.toString());
}
}));
p.run();
}
}
Another way to approach this problem is to use a stateful DoFn with a looping timer that fires on each 5-second tick. This looping timer generates the default data necessary for the live monitoring and ensures that each window has at least one event to process.
One issue with the approach described by https://stackoverflow.com/a/54543527/430128 is that, in a system with multiple keys, these "dummy" events need to be generated for every key.
See https://beam.apache.org/blog/looping-timers/. Options 1 and 2 in that article are an external heartbeat source and a generated source in the Beam pipeline, respectively. Option 3 is the looping timer.
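For reference, a minimal sketch of the looping-timer idea (Option 3), assuming the same KV<String, Integer> element type and 5-second tick used in the pipeline above; a complete version would additionally use state to arm the timer only once per key and a condition to stop the loop, as described in the blog post:
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

public class HeartbeatFn extends DoFn<KV<String, Integer>, KV<String, Integer>> {

    private static final Duration TICK = Duration.standardSeconds(5);

    @TimerId("loop")
    private final TimerSpec loopSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

    @ProcessElement
    public void processElement(ProcessContext c, @TimerId("loop") Timer loop) {
        // (Re)arm the timer when real data arrives and pass the element through unchanged.
        loop.set(c.timestamp().plus(TICK));
        c.output(c.element());
    }

    @OnTimer("loop")
    public void onTimer(OnTimerContext c, @TimerId("loop") Timer loop) {
        // Emit a zero-count heartbeat so downstream windows are never empty,
        // then reschedule the timer for the next tick.
        c.output(KV.of("num_messages", 0));
        loop.set(c.timestamp().plus(TICK));
    }
}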
My Dataflow job (using Java SDK 2.1.0) is quite slow; it is going to take more than a day to process just 50 GB. I just pull a whole table from BigQuery (50 GB) and join it with one CSV file from GCS (100+ MB).
https://cloud.google.com/dataflow/model/group-by-key
I use sideInputs to perform the join (the latter approach in the documentation above), although I think using CoGroupByKey is more efficient. However, I'm not sure that is the only reason my job is so slow.
I googled and it looks like the side input cache is set to 100 MB by default. I assume mine is slightly over that limit, so each worker continuously re-reads the side input. To improve this, I thought I could use the setWorkerCacheMb method to increase the cache size.
However, it looks like DataflowPipelineOptions does not have this method and DataflowWorkerHarnessOptions is hidden. Just passing --workerCacheMb=200 in -Dexec.args results in:
An exception occured while executing the Java class.
null: InvocationTargetException:
Class interface com.xxx.yyy.zzz$MyOptions missing a property
named 'workerCacheMb'. -> [Help 1]
How can I use this option? Thanks.
My pipeline:
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
Pipeline p = Pipeline.create(options);
PCollection<TableRow> rows = p.apply("Read from BigQuery",
BigQueryIO.read().from("project:MYDATA.events"));
// Read account file
PCollection<String> accounts = p.apply("Read from account file",
TextIO.read().from("gs://my-bucket/accounts.csv")
.withCompressionType(CompressionType.GZIP));
PCollection<TableRow> accountRows = accounts.apply("Convert to TableRow",
ParDo.of(new DoFn<String, TableRow>() {
private static final long serialVersionUID = 1L;
@ProcessElement
public void processElement(ProcessContext c) throws Exception {
String line = c.element();
CSVParser csvParser = new CSVParser();
String[] fields = csvParser.parseLine(line);
TableRow row = new TableRow();
row = row.set("account_id", fields[0]).set("account_uid", fields[1]);
c.output(row);
}
}));
PCollection<KV<String, TableRow>> kvAccounts = accountRows.apply("Populate account_uid:accounts KV",
ParDo.of(new DoFn<TableRow, KV<String, TableRow>>() {
private static final long serialVersionUID = 1L;
@ProcessElement
public void processElement(ProcessContext c) throws Exception {
TableRow row = c.element();
String uid = (String) row.get("account_uid");
c.output(KV.of(uid, row));
}
}));
final PCollectionView<Map<String, TableRow>> uidAccountView = kvAccounts.apply(View.<String, TableRow>asMap());
// Add account_id from account_uid to event data
PCollection<TableRow> rowsWithAccountID = rows.apply("Join account_id",
ParDo.of(new DoFn<TableRow, TableRow>() {
private static final long serialVersionUID = 1L;
@ProcessElement
public void processElement(ProcessContext c) throws Exception {
TableRow row = c.element();
if (row.containsKey("account_uid") && row.get("account_uid") != null) {
String uid = (String) row.get("account_uid");
TableRow accRow = (TableRow) c.sideInput(uidAccountView).get(uid);
if (accRow == null) {
LOG.warn("accRow null, {}", row.toPrettyString());
} else {
row = row.set("account_id", accRow.get("account_id"));
}
}
c.output(row);
}
}).withSideInputs(uidAccountView));
// Insert into BigQuery
WriteResult result = rowsWithAccountID.apply(BigQueryIO.writeTableRows()
.to(new TableRefPartition(StaticValueProvider.of("MYDATA"), StaticValueProvider.of("dev"),
StaticValueProvider.of("deadletter_bucket")))
.withFormatFunction(new SerializableFunction<TableRow, TableRow>() {
private static final long serialVersionUID = 1L;
@Override
public TableRow apply(TableRow row) {
return row;
}
}).withCreateDisposition(CreateDisposition.CREATE_NEVER)
.withWriteDisposition(WriteDisposition.WRITE_APPEND));
p.run();
Historically, my system has had two user identifiers: a new one (account_id) and an old one (account_uid). Now I need to add the new account_id to our event data stored in BigQuery retroactively, because the old data only has the old account_uid. The accounts table (which maps account_uid to account_id) has already been converted to CSV and stored in GCS.
The last function, TableRefPartition, just stores data into BigQuery's corresponding partition depending on each event timestamp. The job is still running (2017-10-30_22_45_59-18169851018279768913) and the bottleneck looks like the "Join account_id" part.
The throughput of that part (xxx elements/s) goes up and down according to the graph. Also according to the graph, the estimated size of the side input is 106 MB.
If switching to CoGroupByKey improves performance dramatically, I will do so. I was just lazy and thought using sideInputs would make it easier to also handle event data that doesn't have account info.
Try one of:
1) setting the option using some code:
options.as(DataflowWorkerHarnessOptions.class).setWorkerCacheMb(500);
2) having your application register DataflowWorkerHarnessOptions with the PipelineOptionsFactory (options 1 and 2 are combined in the sketch below)
3) having your own options class extend DataflowWorkerHarnessOptions
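For example, options 1 and 2 could be combined like this (a sketch; the 500 MB value is just an example and DataflowWorkerHarnessOptions comes from the Dataflow runner dependency):
// Register the hidden options interface before parsing arguments so that
// --workerCacheMb is accepted on the command line.
PipelineOptionsFactory.register(DataflowWorkerHarnessOptions.class);
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
// Or set it programmatically through the options view.
options.as(DataflowWorkerHarnessOptions.class).setWorkerCacheMb(500);
Pipeline p = Pipeline.create(options);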
There's a few things you can do to improve the performance of your code:
Your side input is a Map<String, TableRow>, but you're using only a single field of the TableRow (accRow.get("account_id")). How about making it a Map<String, String> instead, with the value being the account_id itself? That will likely be quite a bit more efficient than the bulky TableRow object.
You could extract the value of the side input into a lazily initialized member variable in your DoFn, to avoid repeated invocations of .sideInput(). Both ideas are sketched below.
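A sketch of both suggestions together; uidToAccountId is a hypothetical PCollectionView<Map<String, String>> built with View.asMap() from KV<account_uid, account_id> pairs, and caching the map in a member variable assumes a batch pipeline in the global window, where the side input value is the same for every element:
PCollection<TableRow> rowsWithAccountID = rows.apply("Join account_id",
    ParDo.of(new DoFn<TableRow, TableRow>() {
        private static final long serialVersionUID = 1L;
        // Lazily initialized cache of the side input map (resolved once per DoFn instance).
        private transient Map<String, String> accounts;

        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
            if (accounts == null) {
                accounts = c.sideInput(uidToAccountId); // hypothetical Map<String, String> view
            }
            TableRow row = c.element().clone();
            Object uid = row.get("account_uid");
            if (uid != null) {
                String accountId = accounts.get((String) uid);
                if (accountId != null) {
                    row.set("account_id", accountId);
                }
            }
            c.output(row);
        }
    }).withSideInputs(uidToAccountId));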
That said, this performance is unexpected and we are investigating whether there's something else going on.