Apache Flink metric to count late elements - stream

I'd like to measure how many events arrive within the allowed lateness, grouped by a particular feature of the event. We suspect that a particular type of event has far more late arrivals, and we would like to verify this.
The place I thought of for making the measurement is the onElement method of our custom trigger, since that is where we know whether an event is late or not. However, with SlidingEventTimeWindows this means a single element can be counted multiple times if it is late by more than one slide.
Any suggestions?

You might do this separately from the windowing. You could set the allowed lateness to zero, and divert all late events to a side output. You can then key that stream of late events by the feature(s) of interest, and use a RichFlatMapFunction or KeyedProcessFunction to count the events, which can then be reported as a custom metric, or sent to a sink.
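As a rough, untested sketch of that pipeline shape (MyEvent, getKey(), getFeature(), MyWindowResult and MyWindowFunction are placeholders for your own types; the window sizes are arbitrary):
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

// Tag for the late-data side output.
final OutputTag<MyEvent> lateTag = new OutputTag<MyEvent>("late-events") {};

SingleOutputStreamOperator<MyWindowResult> mainResults = events
    .keyBy(MyEvent::getKey)
    .window(SlidingEventTimeWindows.of(Time.minutes(10), Time.minutes(1)))
    .allowedLateness(Time.seconds(0))     // zero allowed lateness...
    .sideOutputLateData(lateTag)          // ...so a dropped late element reaches the side output exactly once
    .process(new MyWindowFunction());     // your normal windowed logic

// Key the late events by the feature of interest and keep a running count per feature.
DataStream<Tuple2<String, Long>> lateCountsPerFeature = mainResults
    .getSideOutput(lateTag)
    .keyBy(MyEvent::getFeature)
    .flatMap(new RichFlatMapFunction<MyEvent, Tuple2<String, Long>>() {
        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("late-count", Long.class));
        }

        @Override
        public void flatMap(MyEvent event, Collector<Tuple2<String, Long>> out) throws Exception {
            Long current = count.value();
            long updated = (current == null ? 0L : current) + 1;
            count.update(updated);
            // Emit to a sink, or increment a Counter obtained from
            // getRuntimeContext().getMetricGroup() here if you prefer a custom metric.
            out.collect(Tuple2.of(event.getFeature(), updated));
        }
    });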

Related

How to find recurrence exceptions through Microsoft Outlook Calendar Graph API

I'm currently using the Microsoft Graph API to sync calendar events to my local application. On my end, I don't care to save each individual occurrence in a series, but prefer instead to just save the series master and then extrapolate out the instances of the series myself. For this reason, I am using the /me/events call rather than the /me/calendarView call.
My problem is when editing a single occurrence in a series. After editing the single occurrence, I make the /me/events call and I can see the newly added "Exception" type -- which is great. However, I don't see how to relate that new event back to which occurrence was changed to cause the exception.
For example, if I have a weekly meeting on Monday at noon, and I change today's meeting from noon to 2:00, it's pretty easy to tell that today's meeting is the one that changed. But if I change today's meeting to Friday, how can I tell that it was today's meeting that changed and not next week's? Keep in mind that I am only storing the master, and not every single calendarView occurrence.
Another example is if I delete an occurrence. In this case, the /me/calendarView call will simply not return that occurrence anymore. No exception type is generated. And the series master returned from the /me/events call doesn't change at all to indicate that a date is missing.
The format that I'm used to is something like the iCal/vCal format, where there is a start date, end date, and then a list of exception dates. Using that format, I can easily tell from the series master which dates to skip, without needing to "render" the entire occurrence and skip the exceptions. And if an occurrence is deleted, it is added to the EXDATE list and then it is never considered on rendering. Does the Microsoft Graph API not have an easy way to see these changed/deleted occurrences?
I was having a similar issue, but I think I've now realized that Microsoft does not allow recurring events to move later than the next instance, or earlier than the preceding one (at least while using Outlook calendar in the browser). So you can always assume that the 3rd event is 3rd in the series, the 4th is 4th, etc.
So as long as you know the series number, you can locate it by getting all of the instances with /me/events/[event_id]/instances?startDateTime=[start_date_time]&endDateTime=[end_date_time].
The error in Outlook Calendar when I do this isn't very clear, so maybe something else is up, but I am able to move the exception events otherwise. Unfortunately, I'm not sure if there's a definite way to know what end_date_time to use, as events can be moved indefinitely later.
Based on the response object from the event marked as an exception, you can use the seriesMasterId to relate the exception to its parent recurrence.
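If you are not using the Graph SDK, a minimal sketch of that lookup over plain HTTPS might look like this (accessToken is assumed to come from your existing OAuth flow; JSON parsing is left to whatever library you already use):
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

static String fetchEvents(String accessToken) throws IOException, InterruptedException {
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://graph.microsoft.com/v1.0/me/events"))
        .header("Authorization", "Bearer " + accessToken)
        .GET()
        .build();
    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    // Each returned event carries "type" and "seriesMasterId"; events with
    // type == "exception" point back to their recurring series via seriesMasterId.
    return response.body();
}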

Is there a way to tell Google Cloud Dataflow that the data coming in is already ordered?

We have an input data source that is approximately 90 GB (it can be either a CSV or XML, it doesn't matter) that contains an already ordered list of data. For simplicity, you can think of it as having two columns: time column, and a string column. The hundreds of millions of rows in this file are already ordered by the time column in ascending order.
In our Google Cloud Dataflow pipeline, we have modeled each row as an element in our PCollection, and we apply DoFn transformations to the string field (e.g. counting the number of uppercase characters in the string). This works fine.
However, we then need to apply functions that are calculated over a block of time (e.g. five minutes) with a one-minute overlap. So we are thinking about using a sliding windowing function (even though the data is bounded).
However, the calculation logic that needs to be applied over these five-minute windows assumes that the data is ordered logically (i.e. ascending) by the time field. My understanding is that even when using these windowing functions, one cannot assume that the PCollection elements within each window are ordered in any way, so one would need to manually iterate through each window and reorder the elements, right? This seems like a huge waste of computational power, since the incoming data is already ordered. So is there a way to teach/inform Google Cloud Dataflow that the input data is ordered, so that it maintains that order even within the windows?
On a minor note, I had another question: my understanding is that if the data source is unbounded, there is never an "overall aggregation" function that would ever execute, as it never really makes sense (since there is no end to the incoming data); however, if one uses a windowing function on bounded data, there is a true end state which corresponds to when all the data has been read from the CSV file. Therefore, is there a way to tell Google Cloud Dataflow to do a final calculation once all the data has been read in, even though we are using a windowing function to divide the data up?
SlidingWindows sounds like the right solution for your problem. The ordering of the incoming data is not preserved across a GroupByKey, so informing Dataflow of that would not be useful currently. However, the batch Dataflow runner does already sort by timestamp in order to implement windowing efficiently, so for simple windowing like SlidingWindows, your code will see the data in order.
If you want to do a final calculation after doing some windowed calculations on a bounded data set, you can re-window your data into the global window again, and do your final aggregation after that:
p.apply(Window.into(new GlobalWindows()));
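Put together, the overall shape might look roughly like this. It's a sketch against the Apache Beam/Dataflow Java SDK, where rows, ParsedRow, WindowResult, Summary, PerWindowCombineFn and FinalSummaryFn are placeholders, and five-minute windows starting every four minutes are one reading of "five minutes with a one-minute overlap":
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Five-minute windows that start every four minutes overlap their neighbours by one minute.
PCollection<ParsedRow> windowed = rows
    .apply(Window.<ParsedRow>into(
        SlidingWindows.of(Duration.standardMinutes(5))
                      .every(Duration.standardMinutes(4))));

// One result per sliding window, via a placeholder CombineFn.
PCollection<WindowResult> perWindowResults = windowed
    .apply(Combine.globally(new PerWindowCombineFn()).withoutDefaults());

// Re-window into the global window and run one final aggregation over the whole bounded input.
PCollection<Summary> finalResult = perWindowResults
    .apply(Window.<WindowResult>into(new GlobalWindows()))
    .apply(Combine.globally(new FinalSummaryFn()));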

Custom Date queries using Cumulocity API

Is it possible to aggregate measurements or create custom queries beyond the standard dateFrom/dateTo queries?
As an example, I have measurements with a time delta of 1 minute (2015-01-01T05:05:00, 2015-01-01T05:06:00, 2015-01-01T05:07:00, ...) and I would like to query the measurements at 15-minute intervals (2015-01-01T05:15:00, 2015-01-01T05:30:00, 2015-01-01T05:45:00, ...).
So far I have only come up with these solutions:
1. Using the standard API request, as in
https://tenant.cumulocity.com/measurement/measurements?dateFrom=2015-10-01&dateTo=2015-11-05
and then throwing away most of the data, which wastes a massive amount of time loading data I do not need.
2. Using CEP (Cumulocity Event Language) to generate a new measurement every 15 minutes from the nearest 1-minute measurement, which seems like a bit of overkill and not very elegant.
3. Batch requesting the exact minute, as in
https://tenant.cumulocity.com/measurement/measurements?dateFrom=2015-11-05T05:15:00%2B01:00&dateTo=2015-11-05T05:16:00%2B01:00
which results in a massive number of API requests and also does not seem very efficient.
4. Using the /measurements/series endpoint, which only gives me all series, including those I do not want, and (as far as I can tell) only offers hourly and daily aggregation options.
Is there a better way of doing this?
You have captured nearly all of the mechanisms that are currently available. There is one more possibility -- not sure if this is an option for you:
Mark the fifteenth measurement when sending it from the device, using e.g. a different type.
I would normally use option 2. It's actually quite efficient; it's similar to a materialized view in traditional SQL, plus you can use the data everywhere and in all widgets.
Good luck :-)
Cheers,
André
I would prefer the CEP solution. The rule wouldn't be that complicated. You would of course then store these measurements twice, which is not that nice, but having your desired measurement with a specific type or fragment will give you the fastest way to query it.
Instead of copying the measurement, you could just add a special fragment to the measurement every 15 minutes in the CEP rule. You cannot update measurements, so you would have to delete the incoming measurement every 15 minutes and then create a new measurement with exactly the same values but an added fragment (e.g. "aggregatedMeasurement": {}).
Your query then looks like this:
https://tenant.cumulocity.com/measurement/measurements?dateFrom=2015-10-01&dateTo=2015-11-05&fragmentType=aggregatedMeasurement
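For what it's worth, issuing that query from Java is just a plain GET with Basic auth; in this sketch the tenant, user and password are placeholders:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

static String fetchAggregatedMeasurements() throws Exception {
    // Cumulocity Basic auth uses the form tenant/user:password (placeholder credentials here).
    String auth = Base64.getEncoder()
        .encodeToString("tenant/user:password".getBytes(StandardCharsets.UTF_8));
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://tenant.cumulocity.com/measurement/measurements"
            + "?dateFrom=2015-10-01&dateTo=2015-11-05&fragmentType=aggregatedMeasurement"))
        .header("Authorization", "Basic " + auth)
        .header("Accept", "application/json")
        .GET()
        .build();
    return HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString())
        .body();
}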
One more idea for point 3:
You could use SmartREST to create a template with the query string and leave the dateFrom and dateTo as placeholders.
From the client side you then would have to make only one request using the bulking feature in SmartREST.
On the server side this would still be transformed into the individual requests, so you wouldn't gain anything in speed.

How to exchange TList items during a for-loop

I have a TList whose items are continuously processed by many for-loops. I sometimes need to exchange items in the list in order to rearrange the order of the visual representation of the list (in a StringGrid).
How do I exchange these items?
My preliminary thoughts are:
During the for-loop, I think the items should not be exchanged.
If I do the exchange in a Timer's OnTimer event, having set the Timer's interval to a very short value (e.g. 1 millisecond), then I think the for-loop will only have an intermission of that one millisecond.
Will this work? Or are there better alternatives?
As long as you ensure that the count of the items in a TList does not change, exchanging items is perfectly fine during a for-loop. Note that, depending on the index of the items that are about to be exchanged, some of the items may not be processed or may be processed twice.
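To illustrate the skipped/double-processed case (shown in Java only because the index arithmetic is identical to a Delphi for-loop over a TList):
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SwapDuringLoop {
    public static void main(String[] args) {
        List<String> items = new ArrayList<>(List.of("A", "B", "C", "D", "E"));
        for (int i = 0; i < items.size(); i++) {
            System.out.println("processing " + items.get(i));
            if (i == 2) {
                Collections.swap(items, 1, 4);  // "E" moves into an already-visited slot and is skipped,
                                                // "B" moves ahead and is processed a second time
            }
        }
        // Prints: A, B, C, D, B -- "E" is never processed.
    }
}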
If the exchange operation is not called from within the for-loop, then an already started for-loop will run until it is done. You cannot expect to "break in" with a Timer, because that Timer's message will not be processed until the for-loop and all surrounding code is done.
So, the solution for your problem could be:
exchange the items within the for-loop,
use a threading solution to be able to do two different things simultaneously on one list (this may require some learning about threads),
wait until the for-loop is done, and exchange then,
split the for-loop in multiple slices to reduce the time needed, or
use a timer to start multiple for-loops in order to give your program some breathing time in between.

Efficiently retrieving ice_cube schedules for a given time period

I'm looking into using Ice Cube https://github.com/seejohnrun/ice_cube for recurring events.
My question is, if I then need to get any events that fall within a given time period (say, on a day or within a week), is there any better way than to loop through them all like this:
items = Records.find(:all)
items.each do |item|
  schedule = item.schedule
  if schedule.occurs_on?(Date.today)
    # if today is a recurrence, add to array
  end
end
This seems horribly inefficient but I'm not sure how else to go about it.
That's one approach, but what people more often end up doing is denormalizing their schedules into a format that is conveniently queryable.
You might have a collection called something like ScheduleOccurrences that you build each week, and then query that instead.
It's unfortunate that it has to work this way, but sticking to the iCal way of managing schedules means IceCube needs to format its data in certain ways (specifically, ways that line up with the requirements of the iCal RFC).
I've been doing some thinking recently about what a library would look like that shook off some of those restrictions, for greater flexibility like this, but it's definitely still a bit off.
Hope this helps
I faced a similar problem and here was my approach:
Create a column on the Event table to store the next occurrence date, and write a method which stores that value after_save (the next occurrence is available through ice_cube; perhaps index the column too for faster querying).
Then you can query the database for occurrences happening in the timeframe you need. See below:
Event.where(next_occurrence: Date.today.all_day)
Store EventOccurrences in a separate table.
Update the next_occurrence column for the rows returned by your query, or something similar. This works for me because I'm running a daily job, so the next_occurrence update runs regularly, but you may need to tweak it a bit.
