Simple way to analyze data based on common key - google-cloud-dataflow

What would be the simplest way to process all the records that were mapped to a specific key and output multiple records for that data?
As a synthetic example, assume my key is a date and the values are intra-day timestamps with measured temperatures. I'd like to classify each temperature as high/average/low within its day (i.e. above/within/below one standard deviation from the day's average).
The output would be the original temperatures with their new classifications.
Combine.PerKey(CombineFn) allows only one output per key, via the #extractOutput() method.
Thanks

CombineFns are restricted to a single output value because that allows the system to do additional parallelization: combining different subsets of the values separately, and then combining their intermediate results in an arbitrary tree reduction pattern, until a single result value is produced for each key.
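To make that concrete, here is a hedged sketch of such a CombineFn computing per-day mean and standard deviation (Stats is a hypothetical value class, and a real pipeline would also need a coder for the double[] accumulator):

public static class ComputeStatsFn
    extends Combine.CombineFn<Float, double[], Stats> {
  // Accumulator layout: { count, sum, sum of squares }.
  @Override public double[] createAccumulator() { return new double[3]; }

  @Override public double[] addInput(double[] acc, Float t) {
    acc[0] += 1; acc[1] += t; acc[2] += t * t;
    return acc;
  }

  // Partial accumulators can be merged in any order and tree shape,
  // which is exactly what permits the parallelization described above.
  @Override public double[] mergeAccumulators(Iterable<double[]> accs) {
    double[] merged = new double[3];
    for (double[] a : accs) {
      merged[0] += a[0]; merged[1] += a[1]; merged[2] += a[2];
    }
    return merged;
  }

  // Called once per key, hence the single output value.
  @Override public Stats extractOutput(double[] acc) {
    double mean = acc[1] / acc[0];
    return new Stats(mean, Math.sqrt(acc[2] / acc[0] - mean * mean));
  }
}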
If your values per key don't fit in memory (so you can't use the GroupByKey-ParDo pattern that Jeremy suggests) but the computed statistics do fit in memory, you could also do something like this:
(1) Use Combine.perKey() to calculate the stats per day
(2) Use View.asIterable() to convert those into PCollectionViews.
(3) Reprocess the original input with a ParDo that takes the statistics as side inputs
(4) In that ParDo's DoFn, have startBundle() take the side inputs and build up an in-memory data structure mapping days to statistics that can be used to do lookups in processElement.
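A minimal sketch of steps (1) through (4), reusing the hypothetical ComputeStatsFn and Stats from above; it uses View.asMap(), which (where your SDK version has it) stands in for the asIterable-plus-startBundle map building of step (4):

PCollection<KV<Integer, Float>> temps = ...; // day -> temperature readings

// (1) Per-day statistics.
PCollection<KV<Integer, Stats>> stats =
    temps.apply(Combine.perKey(new ComputeStatsFn()));

// (2) Materialize the statistics as a side input.
final PCollectionView<Map<Integer, Stats>> statsView =
    stats.apply(View.asMap());

// (3) + (4) Reprocess the original input, looking up each day's stats.
PCollection<String> classified = temps.apply(
    ParDo.withSideInputs(statsView)
         .of(new DoFn<KV<Integer, Float>, String>() {
           @Override
           public void processElement(ProcessContext c) {
             Stats s = c.sideInput(statsView).get(c.element().getKey());
             float t = c.element().getValue();
             String label = t > s.mean + s.stddev ? "high"
                          : t < s.mean - s.stddev ? "low" : "average";
             c.output(c.element().getKey() + "," + t + "," + label);
           }
         }));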

Why not use a GroupByKey operation followed by a ParDo? The GroupByKey groups all the values with a given key, and applying a ParDo afterwards lets you process all of those values together and output multiple records per key.
In your temperature example, the output of the GroupByKey would be a PCollection of KV<Integer, Iterable<Float>> (I'm assuming you use an Integer to represent the day and a Float for the temperature). You could then apply a ParDo to process each of these KVs. For each KV you could iterate over the Float values representing the temperatures and compute the high/average/low statistics for that day. You could then classify each temperature reading using those stats and output a record representing the classification. This assumes the number of measurements for each day is small enough to fit comfortably in memory.
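A hedged sketch of that approach (the output format and field types are assumptions; the Iterable is traversed twice, which is fine as long as one day's readings fit in memory):

PCollection<KV<Integer, Float>> temps = ...; // day -> temperature readings

PCollection<String> classified = temps
    .apply(GroupByKey.<Integer, Float>create())
    .apply(ParDo.of(new DoFn<KV<Integer, Iterable<Float>>, String>() {
      @Override
      public void processElement(ProcessContext c) {
        int day = c.element().getKey();
        // First pass: mean and standard deviation for the day.
        double sum = 0, sumSq = 0;
        long n = 0;
        for (float t : c.element().getValue()) {
          sum += t; sumSq += t * t; n++;
        }
        double mean = sum / n;
        double stddev = Math.sqrt(sumSq / n - mean * mean);
        // Second pass: classify and emit one record per reading.
        for (float t : c.element().getValue()) {
          String label = t > mean + stddev ? "high"
                       : t < mean - stddev ? "low" : "average";
          c.output(day + "," + t + "," + label);
        }
      }
    }));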

Related

Read parquet in chunks according to ordered column index with pyarrow

I have a dataset composed of multiple parquet files clip1.parquet, clip2.parquet, .... Each row corresponds to a point in some frame, and there's an ordered column frame specifying the corresponding frame: 1,1,...,1,2,2,...,2,3,3,...,3,.... There are several thousand rows for each frame, but the exact number is not necessarily the same. Frame numbers do not reset in each clip.
What is the fastest way to iteratively read all rows belonging to one frame?
Loading the whole dataset into memory is not possible. I assume a standard row filter will check against all rows, which is not optimal (I know they are ordered by frame). I was thinking it might be possible to match a row group to each frame, but I wasn't sure if that's good practice, or even possible with different-sized groups.
Thanks!
It is reasonable in your case to treat the frame column as your index, and you can specify this when loading. If you scan the metadata of all the files (fast for local data, but not enabled by default), then Dask will know the min and max frame values for each file. Selecting on the index will therefore only read the files that contain at least some corresponding values.
import dask.dataframe as dd

df = dd.read_parquet("clip*.parquet", index="frame", calculate_divisions=True)
df[df.index == 1]  # do something with this
Alternatively, you can specify filters in read_parquet if you want even more control, and you would make a new dataframe object for each iteration.
Note, however, that a groupby might do what you want without having to iterate over the frame numbers. Dask is pretty smart about loading only part of the data at a time and aggregating partial results from each partition. How well this works depends on how complicated a computation you want to run on each row set.
I should mention that both parquet backends support all of these options; you don't specifically need pyarrow.

Difference of measurement filtered by tag in InfluxDB

In InfluxDB v1.3, I have a measurement with one field and a tag that can take two values.
I would like to compute (x where mytag=y) - (x where mytag=z), using the last value of each series when needed (something like an as-of join, cf. http://code.kx.com/wiki/Reference/aj). I would like to do this in one query, if possible.
If the above is not possible, is there a different schema (e.g. using separate measurements) where what I would like to do is feasible? If so, can you please elaborate on the structure and the query?
SELECT difference(mean(x))
FROM <measurement>
WHERE time > now() - 1h AND (mytag='y' OR mytag='z')
GROUP BY time(60s), mytag
Functions like difference require an aggregate query (group by time()) as well as an aggregation function for the values within the grouped window (mean above).
The difference() function then shows the differences between sequential aggregated values for the specified time period, additionally grouped by the two specified tag values.
These can be adjusted depending on your data.

Alteryx: Creating multiple Histograms from one dataset

I have a data set that contains the following information: date, item #, and the unit price for that item on that date. What I would like to create is one histogram per item (my data set has 17 unique items), charting the frequency of the unit prices. Is this possible in Alteryx?
What you really want is the ability to group by item within your data set. I think the closest thing to this for your specific use case is the Summarize tool: you can group by item and then use the percentile operation to generate several points within the data range to add to a histogram.

Is there a way to tell Google Cloud Dataflow that the data coming in is already ordered?

We have an input data source that is approximately 90 GB (it can be either a CSV or XML, it doesn't matter) that contains an already ordered list of data. For simplicity, you can think of it as having two columns: time column, and a string column. The hundreds of millions of rows in this file are already ordered by the time column in ascending order.
In our Google Cloud Dataflow pipeline, we have modeled each row as an element in our PCollection, and we apply DoFn transformations to the string field (e.g. count the number of characters that are uppercase in the string, etc.). This works fine.
However, we then need to apply functions that are supposed to be calculated over a block of time (e.g. five minutes) with a one-minute overlap. So we are thinking about using a sliding windowing function (even though the data is bounded).
However, the calculation logic that needs to be applied over these five-minute windows assumes that the data is ordered (i.e. ascending) by the time field. My understanding is that even when using these windowing functions, one cannot assume that the PCollection elements within each window are ordered in any way, so one would need to manually iterate through every window and reorder its contents, right? This seems like a huge waste of computational power, since the incoming data is already ordered. So is there a way to inform Google Cloud Dataflow that the input data is ordered, so that it maintains that order even within the windows?
On a minor note, I had another question: my understanding is that if the data source is unbounded, an "overall aggregation" function would never execute, as it would never really make sense (there is no end to the incoming data); however, if one uses a windowing function on bounded data, there is a true end state corresponding to when all the data has been read from the CSV file. Therefore, is there a way to tell Google Cloud Dataflow to do a final calculation once all the data has been read in, even though we are using a windowing function to divide the data up?
SlidingWindows sounds like the right solution for your problem. The ordering of the incoming data is not preserved across a GroupByKey, so informing Dataflow of that would not be useful currently. However, the batch Dataflow runner does already sort by timestamp in order to implement windowing efficiently, so for simple windowing like SlidingWindows, your code will see the data in order.
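For reference, a sketch of that windowing (assuming the time column has already been attached as each element's timestamp, e.g. via outputWithTimestamp in an earlier DoFn; rows is a placeholder name):

PCollection<String> windowed = rows.apply(
    Window.<String>into(
        SlidingWindows.of(Duration.standardMinutes(5))   // window size
            .every(Duration.standardMinutes(4))));       // new window every 4 minutes, i.e. a 1-minute overlap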
If you want to do a final calculation after doing some windowed calculations on a bounded data set, you can re-window your data into the global window again, and do your final aggregation after that:
p.apply(Window.into(new GlobalWindows()));
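Fleshed out slightly (a sketch; perWindowResults stands in for the output of your sliding-window stage, and Sum for whatever final aggregation you actually need):

PCollection<Double> perWindowResults = ...;

PCollection<Double> finalResult = perWindowResults
    .apply(Window.<Double>into(new GlobalWindows()))
    .apply(Combine.globally(new Sum.SumDoubleFn()));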

Calculating duration between a start and end event in InfluxDB

I have two write points for InfluxDB, one is the start and the other is the end. I just need to determine the duration between those two events, and make queries around it. InfluxDB has a difference() aggregate function, but it doesn't work on the time meta field.
Is supplying a custom timestamp value the only way to accomplish this?
As per "Can I perform mathematical operations against timestamps?"
No:
"Currently, it is not possible to execute mathematical operators against timestamp values in InfluxDB. Most time calculations must be carried out by the client receiving the query results."
and yes, maybe:
The function ELAPSED() returns the difference between subsequent timestamps in a single field.
So it depends on the shape of your data.
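For example (a sketch; "value" and "timeseries" are placeholder names):
SELECT ELAPSED("value", 1s) FROM "timeseries"
This returns the gap, in seconds, between each pair of subsequent points in that field, so with exactly one start point and one end point it yields the duration directly.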
If you write only the two entries mentioned, then you can follow the steps below:
Limit the result to two (e.g. select * from timeseries limit 2)
Extract the time from the result set
Take the difference between the time
