Apache Beam - Start sliding window from first element - google-cloud-dataflow

I am trying to develop a Dataflow pipeline that applies a sliding window to bounded and streaming datasets using the Apache Beam Python SDK. The pipeline is as follows:
Reading the data
Assigning timestamps
Windowing using SlidingWindows() with a size of 3 and a period of 1
Grouping the elements in each window
Printing the output
The sample data:
data = [{'serverID': 'server_1', 'CPU_Utilization': 0, 'timestamp': 1},
{'serverID': 'server_1', 'CPU_Utilization': 1, 'timestamp': 2},
{'serverID': 'server_1', 'CPU_Utilization': 2, 'timestamp': 3},
{'serverID': 'server_1', 'CPU_Utilization': 3, 'timestamp': 4}]
The Beam pipeline:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions

pipeline_options = PipelineOptions()
pipeline_options.view_as(SetupOptions).save_main_session = True

with beam.Pipeline(options=pipeline_options) as p:
    events = (p
              | 'Create Events' >> beam.Create(data)
              | 'Add Timestamps' >> beam.Map(
                  lambda x: beam.window.TimestampedValue(x, x['timestamp']))
              | 'PairWithOne' >> beam.Map(lambda x: (None, x))
              | 'Sliding Window' >> beam.WindowInto(beam.window.SlidingWindows(3, 1))
              | 'Group by key' >> beam.GroupByKey()
              | 'Print' >> beam.Map(print))
The output I got:
(None, [{'serverID': 'server_1', 'CPU_Utilization': 0, 'timestamp': 1}, {'serverID':
'server_1', 'CPU_Utilization': 1, 'timestamp': 2}, {'serverID': 'server_1',
'CPU_Utilization': 2, 'timestamp': 3}])
(None, [{'serverID': 'server_1', 'CPU_Utilization': 0, 'timestamp': 1}, {'serverID':
'server_1', 'CPU_Utilization': 1, 'timestamp': 2}])
(None, [{'serverID': 'server_1', 'CPU_Utilization': 0, 'timestamp': 1}])
(None, [{'serverID': 'server_1', 'CPU_Utilization': 1, 'timestamp': 2}, {'serverID':
'server_1', 'CPU_Utilization': 2, 'timestamp': 3}, {'serverID': 'server_1',
'CPU_Utilization': 3, 'timestamp': 4}])
(None, [{'serverID': 'server_1', 'CPU_Utilization': 2, 'timestamp': 3}, {'serverID':
'server_1', 'CPU_Utilization': 3, 'timestamp': 4}])
(None, [{'serverID': 'server_1', 'CPU_Utilization': 3, 'timestamp': 4}])
The expected output (i.e., windows whose start timestamp is earlier than the timestamp of the first element in the data should be discarded):
(None, [{'serverID': 'server_1', 'CPU_Utilization': 0, 'timestamp': 1}, {'serverID':
'server_1', 'CPU_Utilization': 1, 'timestamp': 2}, {'serverID': 'server_1',
'CPU_Utilization': 2, 'timestamp': 3}])
(None, [{'serverID': 'server_1', 'CPU_Utilization': 1, 'timestamp': 2}, {'serverID':
'server_1', 'CPU_Utilization': 2, 'timestamp': 3}, {'serverID': 'server_1',
'CPU_Utilization': 3, 'timestamp': 4}])
(None, [{'serverID': 'server_1', 'CPU_Utilization': 2, 'timestamp': 3}, {'serverID':
'server_1', 'CPU_Utilization': 3, 'timestamp': 4}])
(None, [{'serverID': 'server_1', 'CPU_Utilization': 3, 'timestamp': 4}])
I have also tried the AfterCount(n) trigger, but it does not emit the data when the number of data points is less than n.
Any help on this would be really appreciated.

It's probably working as intended. Could you try validating the output by writing it to a file instead of using print? It could be that multiple panes are fired and printed while the accumulation_mode is discarding.
Also, you can try using Beam Notebooks when debugging streaming pipelines. Use something like show(pcoll, include_window_info=True) to visualize the event time, windows and pane info of a PCollection instead of printing them.
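Not part of the answer above, but as a hedged sketch of how the leading partial windows could be dropped (the names DropEarlyWindows and min_start are hypothetical): after WindowInto, a DoFn can read the current window via beam.DoFn.WindowParam and keep only elements whose window starts at or after the first element's timestamp.
# Hypothetical sketch only: drop windows that start before a minimum timestamp.
class DropEarlyWindows(beam.DoFn):
    def __init__(self, min_start):
        self.min_start = min_start  # e.g. 1, the timestamp of the first element

    def process(self, element, window=beam.DoFn.WindowParam):
        if window.start >= self.min_start:
            yield element

# ... | 'Sliding Window' >> beam.WindowInto(beam.window.SlidingWindows(3, 1))
#     | 'Drop Early Windows' >> beam.ParDo(DropEarlyWindows(1))
#     | 'Group by key' >> beam.GroupByKey()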

Related

How to predict Total Hours needed with List as Input?

I am struggling with the following problem:
I have a dataset of different products (cars) that have certain work orders open at a given time. I know from historical data how much time this work has taken in total.
Now I want to predict it for another car (e.g. Car 3).
Which type of algorithm or regression should I use for this?
My idea was to transform this row-based dataset into a column-based one with binary values, e.g. Brake: 0/1, Screen: 0/1. But then I will have a lot of inputs, as the number of possible inputs is 100-200.
Here's a quick idea using multi-factor regression for 30 jobs, each of which is some random combination of 6 tasks with a "true cost" for each task. We can regress against the task selections in each job to estimate the cost coefficients that best explain the total job costs.
This is first done with no "noise" in the system (task costs are exact), then with some random noise added.
A more thorough job would include examining the R-squared value and plotting the residuals to ensure linearity.
In [1]: from sklearn import linear_model
In [2]: import numpy as np
In [3]: jobs = np.random.binomial(1, 0.6, (30, 6))
In [4]: true_costs = np.array([10, 20, 5, 53, 31, 42])
In [5]: jobs
Out[5]:
array([[0, 1, 1, 1, 1, 0],
[1, 0, 0, 1, 0, 1],
[1, 1, 0, 1, 0, 0],
[1, 0, 1, 1, 1, 1],
[1, 1, 0, 0, 1, 1],
[0, 1, 0, 0, 1, 0],
[1, 0, 0, 1, 1, 0],
[1, 1, 1, 1, 0, 1],
[1, 0, 0, 1, 0, 1],
[0, 1, 0, 1, 0, 0],
[0, 0, 1, 0, 1, 1],
[1, 0, 1, 1, 1, 1],
[0, 1, 1, 1, 1, 1],
[1, 0, 1, 1, 1, 1],
[0, 1, 1, 0, 1, 0],
[1, 0, 1, 0, 1, 0],
[1, 1, 1, 1, 1, 1],
[1, 0, 1, 0, 0, 1],
[0, 1, 0, 1, 1, 0],
[1, 1, 1, 0, 1, 0],
[1, 1, 1, 1, 1, 0],
[1, 0, 1, 0, 0, 1],
[0, 0, 0, 1, 1, 1],
[1, 1, 0, 1, 1, 1],
[1, 0, 1, 1, 0, 1],
[1, 1, 1, 1, 1, 1],
[1, 0, 1, 1, 1, 1],
[0, 0, 1, 1, 0, 0],
[1, 1, 0, 0, 1, 1],
[1, 1, 1, 1, 0, 0]])
In [6]: tot_job_costs = jobs @ true_costs
In [7]: reg = linear_model.LinearRegression()
In [8]: reg.fit(jobs, tot_job_costs)
Out[8]: LinearRegression()
In [9]: reg.coef_
Out[9]: array([10., 20., 5., 53., 31., 42.])
In [10]: np.random.normal?
In [11]: noise = np.random.normal(0, scale=5, size=30)
In [12]: noisy_costs = tot_job_costs + noise
In [13]: noisy_costs
Out[13]:
array([113.94632664, 103.82109478, 78.73776288, 145.12778089,
104.92931235, 48.14676751, 94.1052639 , 134.64827785,
109.58893129, 67.48897806, 75.70934522, 143.46588308,
143.12160502, 147.71249157, 53.93020167, 44.22848841,
159.64772255, 52.49447057, 102.70555991, 69.08774251,
125.10685342, 45.79436364, 129.81354375, 160.92510393,
108.59837665, 149.1673096 , 135.12600871, 60.55375843,
107.7925208 , 88.16833899])
In [14]: reg.fit(jobs, noisy_costs)
Out[14]: LinearRegression()
In [15]: reg.coef_
Out[15]:
array([12.09045186, 19.0013987 , 3.44981506, 55.21114084, 33.82282467,
40.48642199])
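As a hedged follow-up to the R-squared/residuals suggestion above (not part of the original answer), the fit quality could be checked with scikit-learn's built-in score and the residuals of the noisy fit:
# Sketch only, reusing reg, jobs and noisy_costs from the session above.
r2 = reg.score(jobs, noisy_costs)            # coefficient of determination (R-squared)
residuals = noisy_costs - reg.predict(jobs)  # spread should be roughly the noise scale (5)
print(r2, residuals.std())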

Implementing WeightedRandomSampler on imbalanced data set: RuntimeError: invalid multinomial distribution

I am trying to implement a weighted sampler for a very imbalanced data set. There are 182 different classes. Here is an array of the bin counts per class:
array([69487, 5770, 5753, 138, 4308, 10, 1161, 29, 5611,
350, 7, 183, 218, 4, 3, 3872, 5, 950,
33, 3, 443, 16, 20, 330, 4353, 186, 19,
122, 546, 6, 44, 6, 3561, 2186, 3, 48,
8440, 338, 9, 610, 74, 236, 160, 449, 72,
6, 37, 1729, 2255, 1392, 12, 1, 3426, 513,
44, 3, 28, 12, 9, 27, 5, 75, 15,
3, 21, 549, 7, 25, 871, 240, 128, 28,
253, 62, 55, 12, 8, 57, 16, 99, 6,
5, 150, 7, 110, 8, 2, 1296, 70, 1927,
470, 1, 1, 511, 2, 620, 946, 36, 19,
21, 39, 6, 101, 15, 7, 1, 90, 29,
40, 14, 1, 4, 330, 1099, 1248, 1146, 7414,
934, 156, 80, 755, 3, 6, 6, 9, 21,
70, 219, 3, 3, 15, 15, 12, 69, 21,
15, 3, 101, 9, 9, 11, 6, 32, 6,
32, 4422, 16282, 12408, 2959, 3352, 146, 1329, 1300,
3795, 90, 1109, 120, 48, 23, 9, 1, 6,
2, 1, 11, 5, 27, 3, 7, 1, 3,
70, 1598, 254, 90, 20, 120, 380, 230, 180,
10, 10])
In some classes there is as little as a single instance. I am trying to implement a WeightedRandomSampler from torch for this dataset. However, as the class imbalance is so large, when I calculate the weights using
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

count_occr = np.bincount(dataset.y)
lbl_weights = 1. / count_occr
weights = np.array(lbl_weights)
weights = torch.from_numpy(weights)
sampler = WeightedRandomSampler(weights.type('torch.DoubleTensor'), len(weights*2))
I get two error messages:
RuntimeWarning: divide by zero encountered in true_divide
and
RuntimeError: invalid multinomial distribution (encountering probability entry = infinity or NaN)
Does anyone have a workaround for this? I was considering multiplying the lbl_weights by some scalar, however I am not sure whether this is a viable option.
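No answer is recorded for this question. Purely as a hedged sketch (an assumption, not from the thread): the infinities come from label indices that never occur, for which np.bincount returns 0; zeroing those entries and expanding the class weights into per-sample weights avoids both messages.
# Sketch, assuming dataset.y holds integer class labels.
count_occr = np.bincount(dataset.y)
lbl_weights = np.zeros(len(count_occr))
nonzero = count_occr > 0
lbl_weights[nonzero] = 1.0 / count_occr[nonzero]     # no divide-by-zero, no inf entries

sample_weights = lbl_weights[dataset.y]              # one weight per sample, not per class
sampler = WeightedRandomSampler(torch.from_numpy(sample_weights).double(),
                                num_samples=len(sample_weights))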

Normalizing data that has low variation

I ran several benchmarks on a CPU cache simulation to count reuse distances (the distance between two accesses of the same cache entry during the execution of the program). Here are two examples of the data I got from two different benchmarks:
Benchmark 1:
"reuse_dist_counts": {
"-1": 340,
"0": 623,
"1": 930,
"100": 1,
"107": 1,
"114": 1,
"121": 1,
"128": 1,
"135": 1,
"142": 1,
"149": 1,
"156": 1,
"163": 1,
"170": 1,
"177": 1,
"184": 1,
"191": 1,
"198": 1,
"2": 617,
"205": 1,
"212": 1,
"219": 1,
"226": 1,
"233": 1,
"240": 1,
"247": 1,
"254": 1,
"261": 1,
"268": 1,
"275": 1,
"282": 1,
"289": 1,
"296": 1,
"3": 617,
"303": 1,
"310": 1,
"311": 1,
"314": 1,
"4": 1,
"48": 1,
"55": 1,
"62": 1,
"69": 1,
"76": 1,
"79": 1,
"86": 1,
"93": 1
}
Benchmark2:
"reuse_dist_counts": {
"-1": 58,
"0": 128,
"1": 320,
"11": 17,
"12": 1,
"13": 2,
"14": 4,
"15": 18,
"16": 14,
"17": 13,
"18": 13,
"19": 16,
"2": 256,
"20": 16,
"21": 17,
"22": 17,
"23": 2,
"24": 1,
"25": 3,
"26": 3,
"27": 2,
"28": 3,
"29": 2,
"3": 289,
"30": 2,
"31": 2,
"34": 2,
"35": 6,
"38": 2,
"39": 2,
"4": 198,
"40": 4,
"41": 1,
"43": 1,
"44": 2,
"45": 1,
"47": 1,
"48": 2,
"5": 63,
"50": 1,
"6": 81,
"7": 106,
"8": 1
}
The data has the form a: b, where a is a reuse distance and b is how many times that reuse distance was observed during the execution.
As I'm going to design a neural network based on this data, I need to find a good normalization to represent it in a different way. Any idea how to normalize data with such low variation, and how to represent its feature vectors?
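No answer is included here; purely as a hedged sketch of one common option (an assumption on my part, not from the thread), heavily skewed counts like these can be log-compressed and turned into a fixed-length distribution over reuse-distance bins:
import numpy as np

# Hypothetical helper (the name and the max_dist cutoff are assumptions): turn a
# {reuse_distance: count} dict into a fixed-length, log-compressed distribution.
def reuse_hist_features(reuse_dist_counts, max_dist=512):
    vec = np.zeros(max_dist + 2)                      # bin 0 holds the "-1" (never reused) count
    for dist_str, count in reuse_dist_counts.items():
        d = int(dist_str)
        idx = 0 if d == -1 else min(d, max_dist) + 1  # clamp long distances into the last bin
        vec[idx] += count
    vec = np.log1p(vec)                               # compress the huge spread between counts
    return vec / vec.sum()                            # normalize so the vector sums to 1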

InfluxDB: Points are Missing

When running the following piece of code:
!pip install influxdb-client
from influxdb_client import InfluxDBClient, BucketRetentionRules,WritePrecision
import influxdb_client
import time
client = InfluxDBClient(url="http://localhost:8086", token= "my_password", org="primary")
write_api = client.write_api()
query_api = client.query_api()
write_api.write("myFirstBucket", "primary", ["myMeasurement,location=coyote_creek water_level=1"],write_precision=WritePrecision.MS)
time.sleep(1)
write_api.write("myFirstBucket", "primary", ["myMeasurement,location=coyote_creek water_level=2"],write_precision=WritePrecision.MS)
time.sleep(1)
write_api.write("myFirstBucket", "primary", ["myMeasurement,location=coyote_creek water_level=3"],write_precision=WritePrecision.MS)
time.sleep(1)
write_api.write("myFirstBucket", "primary", ["myMeasurement,location=coyote_creek water_level=4"],write_precision=WritePrecision.MS)
time.sleep(1)
tables = query_api.query('from(bucket:"myFirstBucket") |> range(start: -15s)')  # last 15 seconds
for table in tables:
    print(table)
    for row in table.records:
        print(row.values)
I get:
FluxTable() columns: 9, records: 2
{'result': '_result', 'table': 0, '_start': datetime.datetime(2022, 3, 2, 11, 32, 18, 591779, tzinfo=tzutc()), '_stop': datetime.datetime(2022, 3, 2, 11, 32, 33, 591779, tzinfo=tzutc()), '_time': datetime.datetime(2022, 3, 2, 11, 32, 30, 595000, tzinfo=tzutc()), '_value': 2.0, '_field': 'water_level', '_measurement': 'myMeasurement', 'location': 'coyote_creek'}
{'result': '_result', 'table': 0, '_start': datetime.datetime(2022, 3, 2, 11, 32, 18, 591779, tzinfo=tzutc()), '_stop': datetime.datetime(2022, 3, 2, 11, 32, 33, 591779, tzinfo=tzutc()), '_time': datetime.datetime(2022, 3, 2, 11, 32, 32, 590000, tzinfo=tzutc()), '_value': 3.0, '_field': 'water_level', '_measurement': 'myMeasurement', 'location': 'coyote_creek'}
Why do I get only two records? I would expect four records: one record for each write command!
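There is no answer recorded above. As a hedged guess only: client.write_api() with no arguments uses the library's batching (asynchronous) mode, so points that have not been flushed yet will not appear in an immediate query. A minimal sketch of switching to synchronous writes:
from influxdb_client.client.write_api import SYNCHRONOUS

# Sketch: a synchronous write API persists each point before the call returns,
# so a query issued right afterwards should see all four records.
write_api = client.write_api(write_options=SYNCHRONOUS)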

Sort by doesn't sort when includes is used in rails

I've got a distance attribute in my User model:
attr_accessor :distance
So when I calculate the distance for each user and store it in distance, I can sort them like:
users.sort_by!(&:distance)
And the users get sorted by distance appropriately. But when I include other associations, i.e.:
users.includes(:photo).sort_by!(&:distance)
this doesn't sort the users at all. Why is this? How can I sort by distance but include the association as well?
The ! in the sort_by! method indicates that the object itself is changed rather than a different object being returned.
When you call users.includes(:photo), that method returns a different object. So what you are actually doing is like:
users2 = users.includes(:photo)
users2.sort_by!(&:distance)
This is why the users object is not sorted after you call sort_by!. A better way to do it might be:
users = users.includes(:photo).sort_by(&:distance)
Well, it does for me. I use User, not users:
User.includes(:photo).sort_by!(&:distance)
What does the users variable hold, anyway? Try User.
Edited with my example; here I use Enquiry in place of User and score in place of distance.
1.9.3p385 :059 > Enquiry.all.sort_by!(&:score).map &:score
Enquiry Load (0.7ms) SELECT `enquiries`.* FROM `enquiries`
=> [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 7, 8, 8, 10, 10]
1.9.3p385 :060 > Enquiry.includes(:follow_ups).sort_by!(&:score).map &:score
Enquiry Load (0.1ms) SELECT `enquiries`.* FROM `enquiries`
FollowUp Load (0.1ms) SELECT `follow_ups`.* FROM `follow_ups` WHERE `follow_ups`.`enquiry_id` IN (55, 64, 65, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 85, 86, 89, 91, 92, 93, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127)
=> [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 7, 8, 8, 10, 10]
1.9.3p385 :057 > enquiries = Enquiry.where(status_id: [1,2,3])
1.9.3p385 :061 > enquiries.includes(:follow_ups).sort_by!(&:score).map &:score
Enquiry Load (0.5ms) SELECT `enquiries`.* FROM `enquiries` WHERE `enquiries`.`status_id` IN (1, 2, 3)
FollowUp Load (0.2ms) SELECT `follow_ups`.* FROM `follow_ups` WHERE `follow_ups`.`enquiry_id` IN (68, 75, 78, 91, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 113, 114, 115, 116, 117, 120, 122, 123, 124, 125, 126, 127)
=> [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 5, 5, 5, 5, 5, 5, 5, 5, 7, 8, 8, 10]
Note: your question is wrong and you downvoted me.
I believe you should do
users.sort_by!(&:distance).includes(:photo)
Use this:
User.includes(:photo).sort_by!(&:distance)
includes is used on the model, not on an array.
So User is the model name and users is an array.
Arrays have an include? method, and
models have an includes method.
So use this
User.includes(:photo).sort_by!(&:distance)
instead of
users.includes(:photo).sort_by!(&:distance)
