Dataflow - Fixed Window - AfterProcessingTime trigger - google-cloud-dataflow

I am using a fixed window of 60 seconds with a trigger time of 10 seconds. I am seeing a few unexpected results. Could you please help me understand how exactly it works? All the details are provided below.
My input to the Pub/Sub topic is:
name score publish timestamp (I publish one element every 5 seconds)
Laia 30 2021-04-10 09:38:29.708000+0000
Victor 20 2021-04-10 09:38:34.695000+0000
Victor 50 2021-04-10 09:38:39.703000+0000
Laia 40 2021-04-10 09:38:44.701000+0000
Victor 10 2021-04-10 09:38:49.711000+0000
Victor 40 2021-04-10 09:38:54.721000+0000
Laia 40 2021-04-10 09:38:59.715000+0000
Laia 50 2021-04-10 09:39:04.741000+0000
Laia 20 2021-04-10 09:39:09.867000+0000
Laia 20 2021-04-10 09:39:14.749000+0000
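My code:
import apache_beam as beam
from apache_beam import GroupByKey
from apache_beam.transforms.trigger import AccumulationMode, AfterProcessingTime

# words is the PCollection of (name, score) pairs read from Pub/Sub
window_withTrigger = (words
    | "window" >> beam.WindowInto(beam.window.FixedWindows(60),
                                  trigger=AfterProcessingTime(1 * 10),
                                  accumulation_mode=AccumulationMode.ACCUMULATING)
    | "Group" >> GroupByKey())
window_withoutTrigger = (words
    | "window" >> beam.WindowInto(beam.window.FixedWindows(60))
    | "Group" >> GroupByKey())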
Output for window_withTrigger:
Laia [30]
Victor [20, 50, 10, 40]
Laia [50, 20, 20]
Output for window_withoutTrigger:
Laia [30, 40, 40]
Victor [20, 50, 10, 40]
Laia [50, 20, 20]
Without the trigger I get all 10 elements that I published to the topic, but with the trigger I get only 8. I notice that with the trigger it does not emit results every 10 seconds if the key does not change, i.e. it emits a result only when the input name changes from Laia to Victor, and once it has emitted for one key in a window it does not emit again even if I publish more elements with the same key.

You are probably dropping elements because the trigger is not wrapped in Repeatedly.
Here is another answer where this is explained. The basic idea is that if you don't add Repeatedly, the trigger fires only once, so elements that arrive for the same key and window after that firing are not emitted.
Official doc.

Averaging a Data Series in a Google Sheet to a single entry per period regardless of the number of samples in the larger period?

I have a small data set of ~200 samples taken over twenty years, with two columns of data that sometimes have multiple entries for a period (i.e. age or date). When I plot it, even though the data spans 20 years, the graph heavily reflects the number of samples in each period and not the period itself. For example, during age 23 there may be 2 or 3 samples, 1 for age 24, 20 for age 25, and 10 for age 35. The number of samples depended entirely on the need for additional data at the time, so there is no consistency in the sample rate.
How do I get a max or an average/max for each period (age) and ensure there is only one entry per period in the sheet (about one entry per year) without having to create a separate sheet full of separate queries and charting off of that?
What I have tried in Google Sheets (where my data is) is choosing "aggregate" on the chart's x-series (which is the age period). This helps flatten the graph a bit, but doesn't reduce the series.
A read-only link to the spreadsheet is HERE for reference.
The data looks something like this:
3/27/2013 36.4247 2.5 29.3
4/10/2013 36.4630 1.8 42.8
4/15/2013 36.4767 2.2 33.9
5/2/2013 36.5233 2.2 33.9
5/21/2013 36.5753 1.91 39.9
5/29/2013 36.5973 1.94 39.2
7/29/2013 36.7644 1.98 38.3
10/25/2013 37.0055 1.7 45.6
2/28/2014 37.3507 1.85 50 41.3
6/1/2014 37.6055 1.98 38 38.1
12/1/2014 38.1068 37
6/1/2015 38.6055 2.18 34 33.9
12/11/2015 39.1342 3.03 23 23.1
12/14/2015 39.1425 3.18 22 21.9
12/15/2015 39.1452 3.44 20 20.0
12/17/2015 39.1507 3.61 19 18.9
12/21/2015 39.1616 3.62 19 18.8
12/23/2015 39.1671 3.32 21 20.8
12/25/2015 39.1726 3.08 23 22.7
12/28/2015 39.1808 3.12 22 22.4
12/29/2015 39.1836 2.97 24 23.7
12/30/2015 39.1863 3.57 19 19.1
12/31/2015 39.1890 3.37 20 20.5
1/1/2016 39.1918 3.37 20 20.5
1/3/2016 39.1973 2.65 27 27.0
1/4/2016 39.2000 2.76 26 25.8
try:
=QUERY(SORTN(SORT({YEAR($A$6:$A), B6:B}, 1, 0, 2, 0), 9^9, 2, 1, 1),
"where Col1 <> 1899")
demo spreadsheet
and build a chart from there
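As far as I can tell, the formula above keeps one row per calendar year, namely the one with the largest value in the second column, and the "where Col1 <> 1899" clause filters out blank date cells. The same idea is sketched below in Python for readers who want to see the logic spelled out. The function name max_per_year is made up, and pairing each date with the third column of the sample data is an assumption about which column is being charted.

```python
from datetime import datetime

# A few (date, value) pairs copied from the sample data in the question.
rows = [
    ("3/27/2013", 2.5),
    ("4/10/2013", 1.8),
    ("10/25/2013", 1.7),
    ("2/28/2014", 1.85),
    ("6/1/2014", 1.98),
    ("6/1/2015", 2.18),
    ("12/21/2015", 3.62),
    ("1/1/2016", 3.37),
    ("1/4/2016", 2.76),
]


def max_per_year(samples):
    """Return one (year, max value) pair per calendar year, sorted by year."""
    best = {}
    for date_text, value in samples:
        year = datetime.strptime(date_text, "%m/%d/%Y").year
        # Keep only the largest value seen so far for this year.
        if year not in best or value > best[year]:
            best[year] = value
    return sorted(best.items())


if __name__ == "__main__":
    for year, value in max_per_year(rows):
        print(year, value)
```

Swapping the comparison for a running sum and count would give a per-year average instead of a maximum.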

How to create "on/off" graphs with Highcharts?

I've read the documentation quite a few times, but I just can't seem to find a way to make a graph like this. Perhaps it's because I don't know what it's called, so I'm not even sure what to look for. Let me try to explain what I'm trying to do.
Normally if you have a series of points like this:
3 May, 5:00 PM ---> 0
3 May, 5:20 PM ---> 3
4 May, 5:00 PM ---> 0
4 May, 5:20 PM ---> 3
If you make a standard line graph, Highcharts will draw the value increasing gradually between the two points. So I end up with this:
But the problem is, the values being shown are actually values changing at a point in time. In other words, what I want is this:
And even more importantly, it seems the spacing between time isn't correct. You'll notice that it creates a perfect zigzag, even though the times between the first and second point is 20 minutes (5PM to 5:20 PM), and the second point and 3rd point is 23 hours and 40 minutes (3 May 5:20 PM and 4 May 5PM). So what I really want is this:
Any idea what a graph like this is called?
Any idea how to make it using HighCharts?
UPDATE
The only solution I can think of right now, is to fake points between the real points. so for example if the value is 0 at 5PM and turns to 3 at 5:20 PM, then I will add 19 points in between these two. So at 5:01 I will make it 0, and 5:02 I will also make it 0, and 5:03 etc. Until 5:19. But even this method will result in a SLIGHTLY skewed line going up from 5:19 to 5:20. Which is what I'm actually trying to avoid.
Any ideas?
UPDATE 2
The "step : left" solution has definitely solved half of my problem, but for some reason I still have this:
You should now see that even though I have steps, they are not quite making the expected spacing. For 17:13 on 5 May, I expect the graph to be closer to the 6 May mark, than to the 5 May mark.
Any ideas as to why this is happening?
UPDATE 3
I created a JSFiddle for my problem: https://jsfiddle.net/coderama/ubz7m0Lh/4/
UPDATE 4
Based on wergeld's input, it seems using "ordinal" on the x axis is the way to go --> http://api.highcharts.com/highstock#xAxis.ordinal
But it produces a pretty weird graph: https://jsfiddle.net/coderama/6tz8h53x/1/
I'll keep looking, but at least it feels like there's progress being made!
What you are looking for is the step option. You can set up something like:
$(function() {
    $('#container').highcharts({
        title: {
            text: 'Step line types, with null values in the series'
        },
        xAxis: {
            type: 'datetime',
            tickInterval: 86400000 // one day in milliseconds
        },
        series: [{
            data: [
                // Months are zero-based in Date.UTC, so 4 is May
                [Date.UTC(2016, 4, 3, 17, 0), 0],
                [Date.UTC(2016, 4, 3, 20, 0), 3],
                [Date.UTC(2016, 4, 4, 17, 0), 0],
                [Date.UTC(2016, 4, 5, 18, 0), 3],
                [Date.UTC(2016, 4, 5, 19, 0), 0],
                [Date.UTC(2016, 4, 6, 20, 0), 3],
                [Date.UTC(2016, 4, 7, 17, 0), 0]
            ],
            step: 'left'
        }]
    });
});
The step option tells Highcharts how to go from a given point to the next point.

How to think about weights in Myrrix

I have the following input for Myrrix:
11, 101, 1
11, 102, 1
11, 103, 1
11, 104, 1000
11, 105, 1000
11, 106, 1000
12, 101, 1
12, 102, 1
12, 103, 1
12, 222, 1
13, 104, 1000
13, 105, 1000
13, 106, 1000
13, 333, 1000
I am looking for items to recommend to user 11. The expectation is that item 333 will be recommended first (because of the higher weights for user 13 and items 104, 105, 106).
Here are the recommendation results from Myrrix:
11, 222, 0.04709
11, 333, 0.0334058
Notice that item 222 is recommended with strength 0.047, but item 333 is only given a strength of 0.033 --- the opposite of the expected results.
I also would have expected the difference in strength to be larger (since 1000 and 1 are so different), but obviously that's moot when the order isn't even what I expected.
How can I interpret these results and how should I think about the weight parameter? We are working with a large client under a tight deadline and would appreciate any pointers.
It's hard to judge based on a small and synthetic data set. I think the biggest factor here will be the parameters: what is the number of features, and what is lambda? I would expect features = 2 here. If it's higher, I think you quickly over-fit, and the results are mostly the noise left over after the model perfectly explains that user 11 doesn't interact with 222 and 333.
The values are quite low, suggesting both of these are not likely results, and so their order may be more noise than anything. Do you see different results if the model is rebuilt from another random starting point?

Can I set Jenkins' "Build periodically" to build every other Tuesday starting March 13?

I want to schedule Jenkins to run a certain job at 8:00 am every Monday, Wednesday, Thursday and Friday, and at 8:00 am every other Tuesday.
Right now, the best I can think of is:
# 8am every Monday, Wednesday, Thursday, and Friday:
0 8 * * 1,3-5
# 8am on specific desired Tuesdays, one line per month:
0 8 13,27 3 2
0 8 10,24 4 2
0 8 8,22 5 2
0 8 5,19 6 2
0 8 3,17,31 7 2
0 8 14,28 8 2
0 8 11,25 9 2
0 8 9,23 10 2
0 8 6,20 11 2
0 8 4,18 12 2
which is fine (if ugly) for the remainder of 2012, but it almost certainly won't do what I want in 2013.
Is there a more concise way to do this, or one that's year-independent?
This is something that comes up quite often, see e.g. this document, this forum thread or this stackoverflow question.
The answer is basically no. What I would do in your situation is run the job every Tuesday and have the first build step decide whether to actually run, e.g. by checking whether a file exists and only running (and creating the file) if it doesn't. If the file exists, it is deleted so that the job can run the next time this check occurs. You would of course also have to check whether it's Tuesday.
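The "check in the first build step" idea can be sketched as a small script. Note that this sketch swaps the marker-file toggle for a date calculation: it counts the days since a known "on" Tuesday and runs only when that count is a multiple of 14. The names ANCHOR and should_run are made up for illustration, and the anchor of 13 March 2012 is taken from the question.

```python
from datetime import date

# First Tuesday on which the job should run (taken from the question).
ANCHOR = date(2012, 3, 13)


def should_run(today, anchor=ANCHOR):
    """Return True on the anchor date and every 14 days after it."""
    days = (today - anchor).days
    # A multiple of 14 days from a Tuesday is always a Tuesday, so this
    # also covers the "is it Tuesday?" check.
    return days >= 0 and days % 14 == 0


if __name__ == "__main__":
    print("run" if should_run(date.today()) else "skip")
```

Unlike the marker file, this keeps no state between builds, so a manually triggered or failed build does not shift the cadence.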
I got you fam: crontab.guru
10 22 1-7,14-21,28-31 * 6
If you give up on every other Tuesday and can be satisfied with the first and third Tuesdays of each month, the following should work:
0 9 1-7 * 2
0 9 15-21 * 2
You're running every day from 1-7, but only on Tuesday, and every day from 15-21, again only on Tuesday. A Tuesday will occur only once in each of those intervals.
Yes, it's not strictly every other week, as a 5-Tuesday month will throw off your cadence, but here you have a predictable job schedule that doesn't need to be adjusted in Jenkins as time goes on.
I use Excel to generate the cron expressions. The following formulas generate every other Monday at 8:00 AM starting from Oct 22.
   A        B            C          D
1  41204    =MONTH(A1)   =DAY(A1)   =CONCATENATE("0 8 ", C1, " ", B1, " 1")
2  =A1+14   =MONTH(A2)   =DAY(A2)   =CONCATENATE("0 8 ", C2, " ", B2, " 1")
This generates
   A        B    C    D
1  22-Oct   10   22   0 8 22 10 1
2  5-Nov    11   5    0 8 5 11 1
Just auto fill Row 2 to get additional days. I'm not sure how many separate expressions you can give to Jenkins. I know it works with 26 expressions.
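The same list can be generated with a short script instead of a spreadsheet. This is only a sketch under the same assumptions as the formulas above (a start date of 22 October 2012, which is what the serial number 41204 represents, and a 14-day step); the function name biweekly_cron_lines is made up for illustration.

```python
from datetime import date, timedelta


def biweekly_cron_lines(start, count, hour=8, minute=0):
    """Build one cron line per occurrence, 14 days apart, starting at `start`."""
    lines = []
    for i in range(count):
        day = start + timedelta(days=14 * i)
        # Cron day-of-week: 0 = Sunday ... 6 = Saturday.
        dow = day.isoweekday() % 7
        lines.append(f"{minute} {hour} {day.day} {day.month} {dow}")
    return lines


if __name__ == "__main__":
    for line in biweekly_cron_lines(date(2012, 10, 22), 6):
        print(line)
```

As with the spreadsheet, the generated lines name specific dates, so the list has to be regenerated once it runs out.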

How to find date of end of week in Ruby?

I have the following date object in Ruby
Date.new(2009, 11, 19)
How would I find the next Friday?
You could use end_of_week (AFAIK only available in Rails)
>> Date.new(2009, 11, 19).end_of_week - 2
=> Fri, 20 Nov 2009
But this might not work, depending on what exactly you want. Another way could be:
>> d = Date.new(2009, 11, 19)
>> (d..(d+7)).find{|d| d.cwday == 5}
=> Fri, 20 Nov 2009
Let's assume you want the following Friday if d is already a Friday:
>> d = Date.new(2009, 11, 20) # was friday
>> ((d+1)..(d+7)).find{|d| d.cwday == 5}
=> Fri, 27 Nov 2009
Old question but if you're using rails you can now do the following to get next Friday.
Date.today.sunday + 5.days
Likewise you can do the following to get this Friday.
Date.today.monday + 4.days
d = Date.new(2009, 11, 19)
d += (5 - d.wday) > 0 ? 5 - d.wday : 7 + 5 - d.wday
Rails' ActiveSupport::CoreExtensions::Date::Calculations has methods that can help you. If you're not using Rails, you could just require ActiveSupport.
As Ruby's modulo operation (%) returns a non-negative result when your divisor is positive, you can do this:
some_date = Date.new(2009, 11, 19)
next_friday = some_date + (5 - some_date.cwday) % 7
The only issue I can see here is that if some_date is a Friday, next_friday will be the same date as some_date. If that's not the desired behavior, a slight modification can be used instead:
some_date = Date.new(...)
day_increment = (5 - some_date.cwday) % 7
day_increment = 7 if day_increment == 0
next_friday = some_date + day_increment
This code doesn't rely on additional external dependencies, and relies mostly on integer arithmetic.
