Apache Flink: How to get timestamp of events in ingestion time mode?

I am wondering whether it is possible to obtain the timestamp of a record when using Flink's ingestion time mode. Consider the following Flink code example (https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/scala/org/apache/flink/streaming/scala/examples/join/WindowJoinSampleData.scala),
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)
val grades = WindowJoinSampleData.getGradeSource(env, rate)
val salaries = WindowJoinSampleData.getSalarySource(env, rate)
val joined = joinStreams(grades, salaries, windowSize)
...
case class Grade(name: String, level: Int)
case class Salary(name: String, salary: Int)
By default, neither Grade nor Salary contains a timestamp field. However, since Flink allows you to use ingestion time to assign the wall-clock timestamp to the records in a data stream, is it possible to obtain that timestamp at runtime? For example, here is what I am trying to do:
val oldDatastream = env.addSource... // Using ingestion time
val newDatastream = oldDatastream.map{record =>
val ts = getRecordTimestamp(record)
// do some thing with ts
}
Thanks for any help.

Use a ProcessFunction, which gives you a Context that you can use to get the element's timestamp (whether it is ingestion, processing, or event time).
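A minimal sketch of that approach, shown with Flink's Java API for concreteness (the Scala API is analogous); `Grade` and the source setup are stand-ins from the question, not a complete job:

```java
// Sketch: with IngestionTime set, Flink stamps each record at the source,
// and ProcessFunction's Context exposes that timestamp.
DataStream<Grade> grades = env.addSource(/* ... */);

DataStream<Tuple2<Grade, Long>> withTimestamps = grades.process(
    new ProcessFunction<Grade, Tuple2<Grade, Long>>() {
        @Override
        public void processElement(Grade value, Context ctx,
                                   Collector<Tuple2<Grade, Long>> out) {
            Long ts = ctx.timestamp(); // the record's ingestion-time timestamp (epoch millis)
            out.collect(Tuple2.of(value, ts));
        }
    });
```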

Related

How do I structure f# code in idiomatic manner to cater for input states (dependency)

Whilst I am learning F#, I am trying to build a payroll processing engine to put into practice what I am learning.
On a high level, the payroll pipeline can be summarised as having the following steps
Input Earnings
Apply deductions on the earnings if any
Apply taxes on earnings after step 2
Apply any post tax deductions
I have got the following code that calculates the payroll for an employee
module Payroll =
    let calculate (payPeriods: PayPeriod list, employee: Employee, payrollEntries: Entry list) =
        // implementations, function calls go here
Now looking at step 3 above, you will see that we need to get tax rates (the steps have been greatly simplified) to perform the calculation.
Do we pass the tax rates as a parameter, or is there another idiomatic way to achieve this? The tax rates may be injected from a datastore.
How do I manage the tax part? Do I inject the taxes as a parameter, or do I pass in a function that retrieves them?
It is hard to answer your question without any calculations, but if the question is about structuring the code in a very general way, then I can give an example vaguely inspired by your question.
For simplicity, my earnings will be just a float:
type Earnings =
    { Amount : float }
There are also some environment parameters such as the tax and deductions:
type Environment =
    { Deductions : float
      Tax : float }
Your core logic can be written as pure functions taking the Environment and Earnings:
let applyDeductions env earnings =
    { earnings with Amount = earnings.Amount - env.Deductions }

let applyTaxes env earnings =
    { earnings with Amount = earnings.Amount * (1.0 - env.Tax) }
To read input, you could read stuff from a console or a file, but here I'll just return a constant sample:
let readInput () =
    { Amount = 5000.0 }
Now, the main function initializes the environment (possibly from a file), reads the input and passes the env and the earnings to all the processing functions in a pipeline:
let run () =
    let env = { Deductions = 1000.0; Tax = 0.2 }
    let earnings =
        readInput ()
        |> applyDeductions env
        |> applyTaxes env
    printfn "Final: %f" earnings.Amount
This is way simpler than what your snippet suggests, but the structure should work pretty much the same.

Apache Beam - Sliding Windows Only Emit Earliest Active Window

I'm trying to use Apache Beam (via Scio) to run a continuous aggregation of the last 3 days of data (processing time) from a streaming source and output results from the earliest active window every 5 minutes. "Earliest" means the window with the earliest start time; "active" means that the end of the window hasn't yet passed. Essentially I'm trying to get a 'rolling' aggregation by dropping the non-overlapping period between sliding windows.
A visualization of what I'm trying to accomplish with an example sliding window of size 3 days and period 1 day:
early firing - ^ no firing - x
|
** stop firing from this window once time passes this point
^ ^ ^ ^ ^ ^ ^ ^
| | | | | | | | ** stop firing from this window once time passes this point
w1: +====================+^ ^ ^
x x x x x x x | | |
w2: +====================+^ ^ ^
x x x x x x x | | |
w3: +====================+
time: ----d1-----d2-----d3-----d4-----d5-----d6-----d7---->
I've tried using sliding windows (size=3 days, period=5 min), but they produce a new window for every 3 day/5 min combination in the future and emit early results for every window. I tried using trigger = AfterWatermark.pastEndOfWindow(), but I need early results when the job first starts. I've tried comparing the pane data (isLast, timestamp, etc.) between windows, but they seem identical.
My most recent attempt, which seems somewhat of a hack, involved attaching window information to each key in a DoFn, re-windowing into a fixed window, and attempting to group and reduce to the oldest window from the attached data, but the final reduceByKey doesn't seem to output anything.
DoFn to attach window information
// ValueType is just a case class I'm using for objects
type DoFnT = DoFn[KV[String, ValueType], KV[String, (ValueType, Instant)]]

class Test extends DoFnT {
  // Window.toString looks like the following:
  // [2020-05-16T23:57:00.000Z..2020-05-17T00:02:00.000Z)
  def parseWindow(window: String): Instant = {
    Instant.parse(
      window
        .stripPrefix("[")
        .stripSuffix(")")
        .split("\\.\\.")(1))
  }

  @ProcessElement
  def process(
      context: DoFnT#ProcessContext,
      window: BoundedWindow): Unit = {
    context.output(
      KV.of(
        context.element().getKey,
        (context.element().getValue, parseWindow(window.toString))
      )
    )
  }
}
sc
  .pubsubSubscription(...)
  .keyBy(_.key)
  .withSlidingWindows(
    size = Duration.standardDays(3),
    period = Duration.standardMinutes(5),
    options = WindowOptions(
      accumulationMode = DISCARDING_FIRED_PANES,
      allowedLateness = Duration.ZERO,
      trigger = Repeatedly.forever(
        AfterWatermark.pastEndOfWindow()
          .withEarlyFirings(
            AfterProcessingTime
              .pastFirstElementInPane()
              .plusDelayOf(Duration.standardMinutes(1))))))
  .reduceByKey(ValueType.combineFunction())
  .applyPerKeyDoFn(new Test())
  .withFixedWindows(
    duration = Duration.standardMinutes(5),
    options = WindowOptions(
      accumulationMode = DISCARDING_FIRED_PANES,
      trigger = AfterWatermark.pastEndOfWindow(),
      allowedLateness = Duration.ZERO))
  .reduceByKey((x, y) => if (x._2.isBefore(y._2)) x else y)
  .saveAsCustomOutput(
    TextIO.write()...
  )
Any suggestions?
First, regarding processing time: If you want to window according to processing time, you should set your event time to the processing time. This is perfectly fine - it means that the event you are processing is the event of ingesting the record, not the event that the record represents.
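As a sketch of that first point, Beam's WithTimestamps transform can assign the processing time of arrival as each element's event time; here `ValueType` and `input` are stand-ins from the question, not a complete pipeline:

```java
// Sketch only: stamp each element with the wall-clock time at which
// it is processed, so processing time becomes the event time.
PCollection<ValueType> stamped =
    input.apply(WithTimestamps.of((ValueType v) -> Instant.now()));
```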
Now you can use sliding windows off-the-shelf to get the aggregation you want, grouped the way you want.
But you are correct that it is a bit of a headache to trigger the way you want. Triggers are not easily expressive enough to say "output the last 3 day aggregation but only begin when the window is 5 minutes from over" and even less able to express "for the first 3 day period from pipeline startup, output the whole time".
I believe a stateful ParDo(DoFn) will be your best choice. State is partitioned per key and window. Since you want to have interactions across 3 day aggregations you will need to run your DoFn in the global window and manage the partitioning of the aggregations yourself. You tagged your question google-cloud-dataflow and Dataflow does not support MapState so you will need to use a ValueState that holds a map of the active 3 day aggregations, starting new aggregations as needed and removing old ones when they are done. Separately, you can easily track the aggregation from which you want to periodically output, and have a timer callback that periodically emits the active aggregation. Something like the following pseudo-Java; you can translate to Scala and insert your own types:
DoFn<> {
  @StateId("activePeriod") StateSpec<ValueState<Period>> activePeriod = StateSpecs.value();
  @StateId("accumulators") StateSpec<ValueState<Map<Period, Accumulator>>> accumulators = StateSpecs.value();
  @TimerId("nextPeriod") TimerSpec nextPeriod = TimerSpecs.timer(TimeDomain.EVENT_TIME);
  @TimerId("output") TimerSpec outputTimer = TimerSpecs.timer(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void process(
      @Element element,
      @TimerId("nextPeriod") Timer nextPeriod,
      @TimerId("output") Timer output,
      @StateId("activePeriod") ValueState<Period> activePeriod,
      @StateId("accumulators") ValueState<Map<Period, Accumulator>> accumulators) {
    // Set nextPeriod if it isn't already running
    // Set output if it isn't already running
    // Set activePeriod if it isn't already set
    // Add the element to the appropriate accumulator
  }

  @OnTimer("nextPeriod")
  public void onNextPeriod(
      @TimerId("nextPeriod") Timer nextPeriod,
      @StateId("activePeriod") ValueState<Period> activePeriod) {
    // Set activePeriod to the next one
    // Clear the period we will never read again
    // Reset the timer (there's a one-time change in this logic after the first window; add a flag for this)
  }

  @OnTimer("output")
  public void onOutput(
      @TimerId("output") Timer output,
      @StateId("activePeriod") ValueState<Period> activePeriod,
      @StateId("accumulators") ValueState<Map<Period, Accumulator>> accumulators) {
    // Output the current accumulator for the active period
    // Reset the timer
  }
}
I do have some reservations about this, because the outputs we are working so hard to suppress are not comparable to the outputs that are "replacing" them. I would be interested in learning more about the use case; it is possible there is a more straightforward way to express the result you are interested in.

Linear Regression Using CEP

What is the best approach to achieve a linear regression using CEP? We have tried two different options.
We do want to have the algorithm working in real time.
Basic code for both approaches:
create context IntervalSpanning3Seconds start @now end after 30 sec;

create schema measure (
    temperature float,
    water float,
    _hours float,
    persons float,
    production float
);
@Name("gattering_measures")
insert into measure
select
    cast(getNumber(m, "measurement.bsk_mymeasurement.temperature.value"), float) as temperature,
    cast(getNumber(m, "measurement.bsk_mymeasurement.water.value"), float) as water,
    cast(getNumber(m, "measurement.bsk_mymeasurement._hours.value"), float) as _hours,
    cast(getNumber(m, "measurement.bsk_mymeasurement.persons.value"), float) as persons,
    cast(getNumber(m, "measurement.bsk_mymeasurement.production.value"), float) as production
from MeasurementCreated m
where m.measurement.type = "bsk_mymeasurement";
1. Using the function stat:linest
@Name("get_data")
context IntervalSpanning3Seconds
select * from measure.stat:linest(water, production, _hours, persons, temperature)
output snapshot when terminated;
EDIT: The problem here is that "get_data" seems to get executed for each measurement rather than for the entire collection of measurements.
2. Get the data and pass it to a JavaScript function.
create expression String exeReg(data) [
    var f = f(data)
    function f(d){
        .....
        // return the linear regression as a string
    }
    return f
];
@Name("get_data")
insert into CreateEvent
select
    "bsk_outcome_linear_regression" as type,
    exeReg(m) as text,
    ....
from measure m;
EDIT: Here, I would like to know what the type of the variable passed to the exeReg() function is, and how I should iterate over it. An example would be nice.
I'll appreciate any help.
Using JavaScript would mean that the script computes a new result (recomputes) for each collection it receives. Instead of recomputing, the #linest data window is a good choice. Or you can add a custom aggregation function or custom data window to the engine if there is certain code you want to use. Below is how the script can receive multiple events, for the case when a script is desired.
create expression String exeReg(data) [
    .... script here ...
];

select exeReg(window(*)) ....
from measure#time(10 seconds);

F#: Iterate through highscore file and pick top 3

On my mission to master F# I'm creating a pocket game.
I'm at the point where I want to implement some sort of a highscore list.
So far I'm writing Name, Score, and Time to a file, which is then read into the application, displaying all previous scores. This isn't ideal, as the list grows pretty quickly.
I want to pick out just the top 3 scores, not caring about Name or Time.
Question: Should I read the file into an array/list and from there pick out the top scores or is there a nicer way to pick out the top scores directly from the file?
Pointers, code, tips, and tricks are warmly welcome.
let scoreFile = sprintf ("Name: %s\nTime: %i sec\nScore: %d\n\n") name stopWatch.Elapsed.Seconds finalScore
let readExistingFile = File.ReadAllText ("hiscore.txt")
File.WriteAllText ("hiscore.txt", scoreFile + readExistingFile)
let msg = File.ReadAllText ("hiscore.txt")
printfn "\nHighscores:\n\n%s\n\n\n\nPress ANY key to quit." msg
Should I read the file into an array/list and from there pick out the top scores or is there a nicer way to pick out the top scores directly from the file?
Unless the scores are already sorted in the file, you'll have to look through them all to find out what the Top 3 is. The way your file is written right now, parsing the data back might be a bit hard - scores are stored on multiple lines, so you'd have to handle that.
Assuming the file doesn't have to be human-friendly, I'd go with a list of comma-separated values instead. It's harder for a human to read by opening the file, but it makes it a lot easier to parse in your program. For example, if the lines look like Name,Time,Score, they can be parsed like this:
type ScoreData = {
    Name : string
    Time : string // could be a DateTime, using string for simplicity
    Score : int
}

let readHighScores file =
    File.ReadAllLines file
    |> Array.choose (fun line ->
        match line.Split ',' with
        | [| name; time; score |] ->
            {
                Name = name
                Time = time
                Score = (int)score // This will crash if the score isn't an integer - see paragraph below.
            }
            |> Some
        | _ ->
            // Line doesn't match the expected format, we'll just drop it
            None
    )
    |> Array.sortBy (fun scoreData -> -scoreData.Score) // Negative score, so that the highest score comes first
    |> Seq.take 3
This will read through your file and output the three largest scores. Using Array.choose allows you to keep only the lines that match the format you're expecting. This also lets you add extra validation as needed, such as making sure that the score is an integer, or parsing the Time into a System.DateTime instead of storing it as a string.
You can then print your high scores by doing something like this:
let highScores = readHighScores "hiscore.txt"

printfn "High scores:"
highScores
|> Seq.iteri (fun index data ->
    printfn "#%i:" (index + 1)
    printfn "    Name:  %s" data.Name
    printfn "    Time:  %s" data.Time
    printfn "    Score: %i" data.Score
)
This calls the previously defined function and prints each of the scores returned - the top 3, in this case. Using Seq.iteri, you can include the index in the output in addition to the score data itself. Using some data I made up, it ends up looking like this:
High scores:
#1:
Name: Awots
Time: 2015-06-15
Score: 2300
#2:
Name: Roujo
Time: 2016-03-01
Score: 2200
#3:
Name: Awots
Time: 2016-03-02
Score: 2100
Now, there might be a way to do this without loading the entire file into memory at once, but I don't think it'd be worth it unless you have a really large file - in which case you might want to either keep the file sorted or use a storage method better suited to the task, like a database.
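For what it's worth, the "without loading everything" idea can be sketched with a bounded min-heap: keep at most 3 scores while streaming through the data, evicting the smallest. This is illustrative Java rather than F#, with made-up scores:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopScores {
    // Keep only the top n scores using a min-heap: the smallest kept
    // score sits at the head, so it is cheap to evict when a better
    // score arrives. Memory use is O(n), not O(file size).
    static List<Integer> topN(int[] scores, int n) {
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        for (int s : scores) {
            heap.offer(s);
            if (heap.size() > n) heap.poll(); // drop the smallest kept score
        }
        List<Integer> top = new ArrayList<>(heap);
        Collections.sort(top, Collections.reverseOrder());
        return top;
    }

    public static void main(String[] args) {
        int[] scores = {2300, 1500, 2200, 900, 2100};
        System.out.println(topN(scores, 3)); // [2300, 2200, 2100]
    }
}
```

The same pattern would work line-by-line over the file, so only three records are ever held in memory.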

Force multiple evaluations of a function in F#

I am trying to develop a random number "generator" in F#.
I successfully created the following function:
let draw () =
    let rand = new Random()
    rand.Next(0, 36)
This works fine and it generates a number between 0 and 36.
However, I am trying to create a function that runs this function several times.
I tried the following
let multipleDraws (n: int) =
    [ for i in 1..n -> draw() ]
However, I effectively get a single result repeated, as if draw were evaluated only once in the for comprehension.
How could I force multiple executions of the draw function?
The problem is with the Random type. It uses the computer's time to generate a seed and then generates the random numbers from it. Since the calls happen at practically the same time, the same seed is generated and the same numbers are returned.
This will solve your problem:
let draw =
    let rand = new Random()
    fun () ->
        rand.Next(0, 36)
And then:
let multipleDraws (n: int) =
    [ for i in 1..n -> draw() ]
Adding this to help explain Ramon's answer.
This code uses a lambda function.
let draw =
    let rand = new Random()
    fun () ->
        rand.Next(0, 36)
It may be easier to understand what's happening if you give the lambda function a name.
let draw =
    let rand = new Random()
    let next () =
        rand.Next(0, 36)
    next
The variable draw is being assigned the function next. You can move rand and next out of the scope of draw to see the assignment directly.
let rand = new Random()

let next () =
    rand.Next(0, 36)

let draw = next
You can see from the above code that in Ramon's answer new Random is only called once while it is called many times in SRKX's example.
As mentioned by Ramon, Random generates a sequence of numbers based on a seed; it will always generate the same sequence if you use the same seed. You can pass Random a seed like this: new Random(2). If you do not pass a value, it uses the current time. So if you call new Random multiple times in a row without a seed, the instances will most likely share the same seed (because the time hasn't changed), and if the seed doesn't change, the first random number of each sequence will always be the same. If you run SRKX's original code and call multipleDraws with a large enough number, the time will change during the loop and you will get back a sequence of numbers that changes every so often.
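The "same seed, same sequence" behavior is easy to demonstrate. This small demo uses java.util.Random rather than .NET's System.Random, but the seeding principle it illustrates is the same:

```java
import java.util.Random;

public class SeedDemo {
    public static void main(String[] args) {
        // Two generators constructed with the same explicit seed
        // produce exactly the same sequence of numbers.
        Random a = new Random(2);
        Random b = new Random(2);
        for (int i = 0; i < 5; i++) {
            int x = a.nextInt(36);
            int y = b.nextInt(36);
            System.out.println(x + " == " + y + " : " + (x == y)); // always true
        }
    }
}
```

This is exactly why Ramon's fix works: creating the Random once and reusing it means you walk further along one sequence instead of restarting an identically seeded sequence on every call.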
