Apache Beam Streaming Lag Operator - google-cloud-dataflow

I'm thinking of building a pipeline that has a LAG operator like the one in SQL, but I'm not sure whether it's possible.
To be clearer, let's say I have a stream of data like this:
# sensor_name, temperature
("station 1", 30.0)
("station 1", 31.0)
("station 1", 32.0)
("station 1", 33.0)
("station 2", 30.0)
("station 2", 31.0)
("station 2", 32.0)
and I apply a PTransform so that the output becomes:
("station 1", {"now":30.0, "before":None})
("station 1", {"now":31.0, "before":30.0})
("station 1", {"now":32.0, "before":31.0})
("station 1", {"now":33.0, "before":32.0})
("station 2", {"now":30.0, "before":None})
("station 2", {"now":31.0, "before":30.0})
("station 2", {"now":32.0, "before":31.0})
Is it possible to do so? thanks!

Here is a working sample that uses the public Pub/Sub topic for taxi rides.
This is the stateful DoFn:
import logging

import apache_beam as beam
from apache_beam.coders import FloatCoder, TupleCoder
from apache_beam.transforms.userstate import BagStateSpec


class UpdateLast(beam.DoFn):
    # Per-key state: a bag holding a single (meter_reading, timestamp) tuple
    RIDE_TRACK = BagStateSpec('rides', TupleCoder((FloatCoder(), FloatCoder())))

    def process(self,
                element,
                timestamp_param=beam.DoFn.TimestampParam,
                ride_state=beam.DoFn.StateParam(RIDE_TRACK)):
        key = element[0]
        meter_reading = element[1]
        timestamp = float(timestamp_param)
        bag_content = [x for x in ride_state.read()]
        if not bag_content:
            # First element seen for this key: there is no previous value yet
            logging.info("Generating entry %s for key %s", (meter_reading, timestamp), key)
            ride_state.add((meter_reading, timestamp))
            output = {"now": meter_reading, "before": None}
            yield (key, output)
        else:
            # There should only be one element in the bag
            bag_ride = bag_content[0]
            old_meter = bag_ride[0]
            old_timestamp = bag_ride[1]
            # We only need to check if the element is more recent
            if timestamp > old_timestamp:
                # Update bag
                ride_state.clear()
                ride_state.add((meter_reading, timestamp))
                output = {"now": meter_reading, "before": old_meter}
                logging.info("KEY %s: updating from %s to %s", key, old_meter, meter_reading)
                yield (key, output)
            else:
                # Invert old and new if element is old
                output = {"now": old_meter, "before": meter_reading}
                yield (key, output)
And here is a pipeline you can use to test it:
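import json
import logging

import apache_beam as beam
from apache_beam import Map, ParDo
from apache_beam.io import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.runners import DataflowRunner

# bucket, project and region are not defined here: set them to your own
# GCS bucket path, project ID and region before running
options = PipelineOptions(
    temp_location=f"{bucket}/tmp/",
    project=project,
    region=region,
    streaming=True,
    job_name="statedofn",
    num_workers=4,
    max_num_workers=20,
)

p = beam.Pipeline(DataflowRunner(), options=options)
topic = "projects/pubsub-public-data/topics/taxirides-realtime"

pubsub = (p | "Read Topic" >> ReadFromPubSub(topic=topic)
            | "Json Loads" >> Map(json.loads)
            | beam.Filter(lambda x: x["ride_status"] == "enroute")
            | "KV" >> Map(lambda x: (x["ride_id"], x["meter_reading"]))
          )

state_df = (pubsub | "Stateful Do Fn" >> ParDo(UpdateLast())
                   | Map(logging.info)
            )

p.run()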
output:
('052b8a40-1c57-4a3c-a012-73ffeddb1f02', {'now': 9.875244, 'before': 9.857124})
('835a9a99-c2fc-4f3d-9284-59098827fe05', {'now': 26.973698, 'before': 26.940273})
('952c0fa5-2bb8-4c9a-b38c-72d66dedfddc', {'now': 17.828278, 'before': 17.808857})
('952c0fa5-2bb8-4c9a-b38c-72d66dedfddc', {'now': 17.847698, 'before': 17.828278})
('d5641df2-2fd8-4416-bde7-4def6d477a29', {'now': 2.3575556, 'before': 2.3346667})
('d5641df2-2fd8-4416-bde7-4def6d477a29', {'now': 2.3804445, 'before': 2.3575556})
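The per-key logic of the stateful DoFn can also be tried locally without Beam. The following is only a minimal sketch in plain Python, assuming the elements of each key arrive in order; the function name lag_per_key and the dictionary standing in for Beam's per-key state are illustrative and not part of the Beam API.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple


def lag_per_key(
    elements: Iterable[Tuple[str, float]],
) -> Iterator[Tuple[str, Dict[str, Optional[float]]]]:
    """Pair each value with the previous value seen for the same key."""
    # Stand-in for Beam's per-key state: key -> last value seen
    last_seen: Dict[str, float] = {}
    for key, value in elements:
        # "before" is None for the first element of a key
        yield key, {"now": value, "before": last_seen.get(key)}
        last_seen[key] = value


# Interleaved readings from two sensors (made-up sample data)
readings = [
    ("station 1", 30.0),
    ("station 1", 31.0),
    ("station 2", 30.0),
    ("station 1", 32.0),
    ("station 2", 31.0),
]
result = list(lag_per_key(readings))
for item in result:
    print(item)
```

Unlike the DoFn, this sketch does not handle late or out-of-order elements; in the streaming pipeline the timestamp comparison takes care of that case.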

Cross join with Deedle

I'm trying to learn some F# and Deedle by analyzing my electricity costs.
Suppose I have two frames, one containing my electricity usage:
let consumptionsByYear =
    [ (2019, "Total", 500); (2019, "Day", 200); (2019, "Night", 300);
      (2020, "Total", 600); (2020, "Day", 250); (2020, "Night", 350) ]
    |> Frame.ofValues
        Total Day Night
2019 -> 500   200 300
2020 -> 600   250 350
The other contains two plans with different pricing structures (either a flat fee or a fee that varies with the time of day):
let prices =
    [ ("Plan A", "Base fee", 50); ("Plan A", "Fixed price", 3); ("Plan A", "Day price", 0); ("Plan A", "Night price", 0);
      ("Plan B", "Base fee", 40); ("Plan B", "Fixed price", 0); ("Plan B", "Day price", 5); ("Plan B", "Night price", 2) ]
    |> Frame.ofValues
          Base fee Fixed price Day price Night price
Plan A -> 50       3           0         0
Plan B -> 40       0           5         2
Previously I have solved this in SQL using a cross join and in Excel using nested joins. To copy those, I found Frame.mapRows, but constructing the expected output seems very tedious using it:
let costs = consumptionsByYear
            |> Frame.mapRows (fun _year cols ->
                ["Total price" => (prices?``Base fee``
                    + (prices?``Fixed price`` |> Series.mapValues ((*) (cols.GetAs<float>("Total"))))
                    + (prices?``Day price`` |> Series.mapValues ((*) (cols.GetAs<float>("Day"))))
                    + (prices?``Night price`` |> Series.mapValues ((*) (cols.GetAs<float>("Night"))))
                    )]
                |> Frame.ofColumns)
            |> Frame.unnest
               Total price
2019 Plan A -> 1550
     Plan B -> 1640
2020 Plan A -> 1850
     Plan B -> 1990
Is there a better way or even small improvements?
I'm not a Deedle expert, but I think this is basically:
A dot product of two matrices: consumptionsByYear and the periodic day/night prices,
Followed by the addition of the constant base prices.
In other words:
consumptionsByYear       periodicPrices                basePrices
----------------------   ---------------------------   ------------------------------
|         Day  Night |   |          Plan A  Plan B |   |             Plan A  Plan B |
| 2019 -> 200  300   | * | Day   -> 3       5      | + | Base fee -> 50      40     |
| 2020 -> 250  350   |   | Night -> 3       2      |   ------------------------------
----------------------   ---------------------------
With that approach in mind, here's how I would do it:
open Deedle
open Deedle.Math

let consumptionsByYear =
    [ (2019, "Day", 200); (2019, "Night", 300)
      (2020, "Day", 250); (2020, "Night", 350) ]
    |> Frame.ofValues

let basePrices =
    [ ("Plan A", "Base fee", 50)
      ("Plan B", "Base fee", 40) ]
    |> Frame.ofValues
    |> Frame.transpose

let periodicPrices =
    [ ("Plan A", "Day", 3); ("Plan A", "Night", 3)
      ("Plan B", "Day", 5); ("Plan B", "Night", 2) ]
    |> Frame.ofValues
    |> Frame.transpose

// repeat the base prices for each year
let basePricesExpanded =
    let row = basePrices.Rows.["Base fee"]
    consumptionsByYear
    |> Frame.mapRowValues (fun _ -> row)
    |> Frame.ofRows

let result =
    Matrix.dot(consumptionsByYear, periodicPrices) + basePricesExpanded

result.Print()
Output is:
        Plan A Plan B
2019 -> 1550   1640
2020 -> 1850   1990
A few changes I made for simplicity:
consumptionsByYear:
    I mapped the years from integers to strings in order to make the matrices compatible.
    I removed the Total column, since it can be derived from the other two.
prices:
    I broke this into two separate frames, one for the periodic prices and another for the base prices, and then transposed them to enable matrix multiplication.
    I changed Day price to Day and Night price to Night to make the matrices compatible.
    I got rid of the Fixed price column, since it can be represented in the Day and Night columns.
Update: As of Deedle 2.4.2, it is no longer necessary to map the years to strings. I've modified my solution accordingly.
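As a sanity check on the arithmetic, the same "dot product plus base fee" calculation can be written out in plain Python. This is only a sketch: nested dictionaries stand in for the Deedle frames, the variable names are made up for illustration, and none of it is Deedle API.

```python
# Usage per year and period (the Total column is left out, as in the answer)
consumptions_by_year = {
    2019: {"Day": 200, "Night": 300},
    2020: {"Day": 250, "Night": 350},
}
# Price per unit for each plan and period (the fixed price is folded into Day/Night)
periodic_prices = {
    "Plan A": {"Day": 3, "Night": 3},
    "Plan B": {"Day": 5, "Night": 2},
}
base_prices = {"Plan A": 50, "Plan B": 40}

# Dot product of each year's usage with each plan's prices, plus the base fee
costs = {
    year: {
        plan: sum(usage[period] * prices[period] for period in usage) + base_prices[plan]
        for plan, prices in periodic_prices.items()
    }
    for year, usage in consumptions_by_year.items()
}

for year, by_plan in costs.items():
    print(year, by_plan)
```

The resulting numbers agree with the frame printed by the F# code.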

Lua Table - Search for items that start with a letter

I have this table:
animals = {
    {sname = "bunny", name = "bunny hase", size = 4, size2 = 8, size3 = 9},
    {sname = "mouse", name = "Micky Mouse", size = 1, size2 = 12, size3 = 22},
    {sname = "cow", name = "Die Kuh", size = 30, size2 = 33, size3 = 324
}
I can search it for a specific entry:
for _,v in pairs(animals) do
    if v.sname == "bunny" then
        print(v.sname, v.name, v.size, v.size2, v.size3)
        break
    end
end
and get the result:
bunny bunny hase 4 8 9
Now I want to search my table by a single starting letter, for example "b", so that it shows me all the entries starting with the letter "b" in the same form as the result above.
I found no solution. Can you help me?
First: the animals table needs a closing } on the last entry ;-)
Put it into an interactive Lua console (lua -i) and play around with it...
>animals = {
{sname = "bunny", name = "bunny hase", size = 4, size2 = 8, size3 = 9},
{sname = "mouse", name = "Micky Mouse", size = 1, size2 = 12, size3 = 22},
{sname = "cow", name = "Die Kuh", size = 30, size2 = 33, size3 = 324}
}
-- Now set a __call metamethod on same table
>setmetatable(animals,{__call=function(tab,...)
    local args={...}
    for key, value in pairs(tab) do
        if value.sname:find(args[1],1) then print(key,'=',value.sname) end
    end
end})
table: 0x565c4a00
-- Lets try it once
>animals('b')
1 = bunny
-- Next one
>animals('c')
3 = cow
-- Last one
>animals('m')
2 = mouse
Using metatables keeps your stuff together.
Another good place is the __index metamethod, which can hold all the functions you need for that table so that they can be called like the string functions on a string
(like value.sname:find(args[1],1)).
This leads to the heart of what find should do.
In the first example it searches the whole sname for a match of the pattern.
Have a look at Lua patterns, which can also be useful here.
Maybe a ^, which anchors the match to the beginning, sounds smart?
So construct the find pattern like this: '^'..args[1]
...and use more than one letter if you have a cow, a crow, a frog and a fish among your animals.
Example with a function named find in __index:
>animals = {
{sname = "bunny", name = "bunny hase", size = 4, size2 = 8, size3 = 9},
{sname = "mouse", name = "Micky Mouse", size = 1, size2 = 12, size3 = 22},
{sname = "cow", name = "Die Kuh", size = 30, size2 = 33, size3 = 324}
}
-- Place a find function into __index
>setmetatable(animals,{__index={find=function(tab,...)
    local args={...}
    for key, value in pairs(tab) do
        if value.sname:find('^'..args[1]) then print(key,'=',value.sname) end
    end
end}})
table: 0x565c3db0
-- first
>animals:find('c')
3 = cow
-- next
>animals:find('m')
2 = mouse
-- last
>animals:find('b')
1 = bunny
If you would like to print all the key values, extend the print() in find().
Stop, I found an issue...
Look here, I prefer the first solution:
animals = {
{sname = "bunny", name = "bunny hase", size = 4, size2 = 8, size3 = 9},
{sname = "mouse", name = "Micky Mouse", size = 1, size2 = 12, size3 = 22},
{sname = "cow", name = "Die Kuh", size = 30, size2 = 33, size3 = 324}
}
-- Now set a __call metamethod on same table
setmetatable(animals,{__call=function(tab,...)
    local args={...}
    for v,k in pairs(tab) do
        if k.sname:find(args[1],1) then print(v,'=',k.sname) end
    end
end})
-- Search for entries starting with u...
-- there should be no result, but...
animals('u')
I get this result:
1 = bunny
2 = mouse
That should not be the result!
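The unexpected matches come from the unanchored search: find(args[1],1) looks for the letter anywhere in sname, and both "bunny" and "mouse" contain a "u". Anchoring the pattern, as in '^'..args[1] from the __index example, restricts the match to the beginning of the string. The difference can be sketched in Python (used here purely for illustration; contains and starts_with are made-up helper names, and the table is reduced to the fields that matter):

```python
animals = [
    {"sname": "bunny", "name": "bunny hase"},
    {"sname": "mouse", "name": "Micky Mouse"},
    {"sname": "cow", "name": "Die Kuh"},
]


def contains(letters: str) -> list:
    """Unanchored search: matches anywhere in sname."""
    return [a["sname"] for a in animals if letters in a["sname"]]


def starts_with(prefix: str) -> list:
    """Anchored search: matches only at the beginning of sname."""
    return [a["sname"] for a in animals if a["sname"].startswith(prefix)]


print(contains("u"))     # the behaviour seen with the unanchored find
print(starts_with("u"))  # the anchored variant finds nothing for "u"
print(starts_with("b"))
```

In the Lua code the equivalent change would be to pass '^'..args[1] to find in the __call version as well.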

ggplotly tooltip is showing data twice

I have 2 datasets included in one chart using ggplot. I am using ggplotly to create a tooltip but the information in the tooltips for the 2 points is showing twice. The following code is a little lengthy but will recreate the chart:
library(ggplot2)
library(plotly)
AreaName <- c("A", "B", "C", "A", "B", "C")
Timeperiod <- c("2018", "2018", "2018", "2019", "2019", "2019")
Value <- c(11.5, 39.3, 9.4, 14.2, 40.7, 19.1)
df <- data.frame(cbind(AreaName, Timeperiod, Value), stringsAsFactors = F)
df$Value <- as.numeric(df$Value)
AreaName <- c("A", "A")
Timeperiod <- c("2019", "2020")
qtr <- c("Q1-Q2", "Q1-Q2")
Value <- c(15.6, 10.2)
df2 <- data.frame(cbind(Timeperiod, qtr, AreaName, Value), stringsAsFactors = F)
df2$Value <- as.numeric(df2$Value)
ggp <- ggplotly(ggplot(data = df, aes(x=Timeperiod, y=Value, group = AreaName, colour = AreaName, text = paste("Area name: ", AreaName, "<br>Time period: ", Timeperiod, "<br>Rate: ", round(Value,1), "per 100,000"))) +
  geom_line() +
  geom_point() +
  geom_point(data = df2, aes(shape = c(paste(AreaName, qtr, Timeperiod)),text = paste("Area name: ", AreaName, "<br>Quarter: ", qtr, "<br>Time period: ", Timeperiod, "<br>Rate: ", round(Value,1), "per 100,000"))) +
  scale_shape_manual(values = c(18, 17)) +
  theme(axis.text.x = element_text(vjust = 0.5), axis.title.x = element_blank()) +
  labs(y = "Crude rate per 100,000 persons all ages", colour = "Area", shape = "") +
  guides(shape = guide_legend(order = 2),colour = guide_legend(order = 1)) +
  expand_limits(y=0), tooltip = "text")
ggpNames <- unique(df$AreaName)
legs <- paste(df2$AreaName, df2$qtr, df2$Timeperiod)
ggpNames <- c(ggpNames,legs)
for (i in 1:length(ggp$x$data)) { # this goes over all places where legend values are stored
  n1 <- ggp$x$data[[i]]$name # and this is how the value is stored in plotly
  n2 <- " "
  for (j in 1:length(ggpNames)) {
    if (grepl(x = n1, pattern = ggpNames[j])) {n2 = ggpNames[j]} # if the plotly legend name contains the original value, replace it with the original value
  }
  ggp$x$data[[i]]$name <- n2 # now is the time for actual replacement
  if (n2 == " ") {ggp$x$data[[i]]$showlegend = FALSE} # sometimes plotly adds to the legend values that we don't want, this is how to get rid of them, too
}
ggp %>% config(displaylogo = FALSE, modeBarButtonsToRemove = list("autoScale2d", "resetScale2d","select2d", "lasso2d", "zoomIn2d", "zoomOut2d", "toggleSpikelines", "zoom2d", "pan2d"))
ggp
Does anyone have an elegant solution to this?
Thanks
Do not define text in geom_point for the second dataframe df2. Then you will get only one tooltip for those two points.
ggp <- ggplotly(ggplot(data = df, aes(x=Timeperiod, y=Value, group = AreaName, colour = AreaName, text = paste("Area name: ", AreaName, "<br>Time period: ", Timeperiod, "<br>Rate: ", round(Value,1), "per 100,000"))) +
  geom_line() +
  geom_point() +
  geom_point(data = df2, aes(shape = c(paste(AreaName, qtr, Timeperiod)) #,
    #text = paste("Area name: ", AreaName, "<br>Quarter: ", qtr, "<br>Time period: ", Timeperiod, "<br>Rate: ", round(Value,1), "per 100,000")
  )) +
  scale_shape_manual(values = c(18, 17)) +
  theme(axis.text.x = element_text(vjust = 0.5), axis.title.x = element_blank()) +
  labs(y = "Crude rate per 100,000 persons all ages", colour = "Area", shape = "") +
  guides(shape = guide_legend(order = 2),colour = guide_legend(order = 1)) +
  expand_limits(y=0), tooltip = "text")
ggpNames <- unique(df$AreaName)
legs <- paste(df2$AreaName, df2$qtr, df2$Timeperiod)
ggpNames <- c(ggpNames,legs)
for (i in 1:length(ggp$x$data)) { # this goes over all places where legend values are stored
  n1 <- ggp$x$data[[i]]$name # and this is how the value is stored in plotly
  n2 <- " "
  for (j in 1:length(ggpNames)) {
    if (grepl(x = n1, pattern = ggpNames[j])) {n2 = ggpNames[j]} # if the plotly legend name contains the original value, replace it with the original value
  }
  ggp$x$data[[i]]$name <- n2 # now is the time for actual replacement
  if (n2 == " ") {ggp$x$data[[i]]$showlegend = FALSE} # sometimes plotly adds to the legend values that we don't want, this is how to get rid of them, too
}
ggp %>% config(displaylogo = FALSE, modeBarButtonsToRemove = list("autoScale2d", "resetScale2d","select2d", "lasso2d", "zoomIn2d", "zoomOut2d", "toggleSpikelines", "zoom2d", "pan2d"))
ggp

Exporting to text in xojo

I want to export data from a listbox,
Listbox1.AddRow "001", "Orange", "1.00","Arief"
Listbox1.AddRow "001", "Apple", "1.00","Arief"
Listbox1.AddRow "001", "Banana", "1.00","Arief"
Listbox1.AddRow "004", "Orange", "1.00","Arief"
Listbox1.AddRow "005", "Apple", "1.00","Brandon"
Listbox1.AddRow "006", "Banana", "1.00","Brenda"
dim f as folderitem
dim tisx as TextOutputStream
f = new folderitem("item.txt")
tisx = f.CreateTextFile
dim Last_first_word as String
dim maxRow as Integer = Listbox1.listcount-1
for row as integer = 0 to maxRow
  if Listbox1.Cell(row,0)<> Last_first_word then
    tisx.WriteLine ""
    tisx.writeline listBox1.cell(row,0)
    tisx.WriteLine listBox1.cell(row,1)+" "+listBox1.cell(row,2)
    Last_first_word=Listbox1.Cell(row,0)
  else
    tisx.WriteLine listBox1.cell(row,1)+" "+listBox1.cell(row,2)
  end if
next
tisx.Close
I want to group all the items that have the same code and put the name at the end of each group.
How can I make the result look like this?
001
Orange 1.00
Apple 1.00
Banana 1.00
Arief
004
Orange 1.00
Arief
005
Apple 1.00
Brandon
006
Banana 1.00
Brenda
Thanks
Regards,
Arief
You'll need to also save the name so you can display it before you move onto a new group of data. Only a minor tweak to your code was needed:
Listbox1.DeleteAllRows
ListBox1.AddRow("001", "Orange", "1.00", "Arief")
ListBox1.AddRow("001", "Apple", "1.00", "Arief")
ListBox1.AddRow("001", "Banana", "1.00", "Arief")
ListBox1.AddRow("004", "Orange", "1.00", "Arief")
ListBox1.AddRow("005", "Apple", "1.00", "Brandon")
ListBox1.AddRow("006", "Banana", "1.00", "Brenda")
Dim f As FolderItem
Dim tisx As TextOutputStream
f = SpecialFolder.Desktop.Child("item.txt")
tisx = f.CreateTextFile
Dim Last_first_word As String
Dim lastName As String
Dim maxRow As Integer = Listbox1.ListCount - 1
For row As Integer = 0 To maxRow
  If Listbox1.Cell(row, 0) <> Last_first_word Then
    If lastName <> "" Then tisx.WriteLine(lastName)
    tisx.WriteLine("")
    tisx.WriteLine(ListBox1.Cell(row, 0))
    tisx.WriteLine(ListBox1.Cell(row, 1) + " " + ListBox1.Cell(row, 2))
    Last_first_word = ListBox1.Cell(row, 0)
    lastName = ListBox1.Cell(row, 3)
  Else
    tisx.WriteLine(ListBox1.Cell(row, 1) + " " + ListBox1.Cell(row, 2))
  End If
Next
If lastName <> "" Then tisx.WriteLine(lastName)
tisx.Close
The data has to be sorted by that group number in order for this to work.
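The reason for the sorting requirement is that the loop only compares each row with the one before it, so rows of the same group must be adjacent. The same idea can be sketched in Python with itertools.groupby, which also merges only adjacent rows; the function name export_lines is made up for this illustration, and the sketch builds a list of lines instead of writing a file.

```python
from itertools import groupby

rows = [
    ("001", "Orange", "1.00", "Arief"),
    ("001", "Apple", "1.00", "Arief"),
    ("001", "Banana", "1.00", "Arief"),
    ("004", "Orange", "1.00", "Arief"),
    ("005", "Apple", "1.00", "Brandon"),
    ("006", "Banana", "1.00", "Brenda"),
]


def export_lines(rows):
    """Return the export as a list of lines: code, items, name, blank line."""
    lines = []
    # groupby only merges adjacent rows, so sort by code first
    # (sorted is stable, so the order inside each group is kept)
    by_code = sorted(rows, key=lambda r: r[0])
    for code, group in groupby(by_code, key=lambda r: r[0]):
        group = list(group)
        lines.append(code)
        lines.extend(f"{item} {price}" for _, item, price, _ in group)
        # Name taken from the first row of the group, as in the Xojo code
        lines.append(group[0][3])
        lines.append("")
    return lines


lines = export_lines(rows)
print("\n".join(lines))
```

Here the blank separator line comes after each group rather than before it, which is a small difference from the Xojo version.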

Display array in tabular form

I have an array like this -
arr = ["0.5", " 2016-08-25 11:02:00 +0530", " test 1",
" 0.75", " 2016-08-25 11:02:00 +0530", " test 2"]
and I want it to be displayed in a tabular form like this -
0.5 11:02 test 1
0.75 11:02 test 2
a = ["0.5", " 2016-08-25 11:02:00 +0530", " test 1", " 0.75", " 2016-08-25 11:02:00 +0530", " test 2"]
a.each_slice(3) do |x, y, z|
  puts "#{x.strip} #{y[/\d\d:\d\d/]} #{z.strip}"
end
Another approach:
require 'time' # needed for Time.parse

arr = ["0.5", " 2016-08-25 11:02:00 +0530", " test 1", " 0.75", " 2016-08-25 11:02:00 +0530", " test 2"]
arr.each_slice(3).map do |x|
  # replace the full timestamp with hours and minutes only
  x[1] = Time.parse(x[1]).strftime("%H:%M")
  x.map(&:strip)
end.each { |y| puts y.join(' ') }
0.5 11:02 test 1
0.75 11:02 test 2
I joined the elements of arr into a string with a space between each element, then scanned the string, saving the results to three capture groups, which produced an array containing two three-element arrays. Lastly, I joined the three elements of each of the two arrays and printed the result using puts.
r = /
    (\d+\.\d+)      # match a float in capture group 1
    .+?             # match >= 1 of any characters, lazily (?)
    (\d{1,2}:\d{2}) # match the time in capture group 2
    .+?             # match >= 1 of any characters, lazily (?)
    (test\s\d+)     # match 'test' followed by >= 1 digits in capture group 3
    /x              # free-spacing regex definition mode
puts arr.join(' ').scan(r).map { |a| a.join(" ") }
prints
0.5 11:02 test 1
0.75 11:02 test 2
The three steps are as follows.
a = arr.join(' ')
#=> "0.5 2016-08-25 11:02:00 +0530 test 1 0.75 2016-08-25 11:02:00 +0530 test 2"
b = a.scan(r)
#=> [["0.5", "11:02", "test 1"],
# ["0.75", "11:02", "test 2"]]
c = b.map { |a| a.join(" ") }
#=> ["0.5 11:02 test 1", "0.75 11:02 test 2"]
Then puts c prints the result shown above.
