Read parquet in chunks according to ordered column index with pyarrow - dask

I have a dataset composed of multiple parquet files clip1.parquet, clip2.parquet, .... Each row corresponds to a point in some frame, and there is an ordered column frame specifying the corresponding frame: 1,1,...,1,2,2,...,2,3,3,...,3,.... There are several thousand rows per frame, but the exact number is not necessarily the same. Frame numbers do not reset from one clip to the next.
What is the fastest way to iteratively read all rows belonging to one frame?
Loading the whole dataset into memory is not possible. I assume a standard row filter will check against all rows, which is not optimal (I know they are ordered by frame). I was thinking it might be possible to match a row group to each frame, but I wasn't sure whether that is good practice, or even possible with groups of different sizes.
Thanks!

It is reasonable in your case to treat the frame column as your index, and you can specify this when loading. If you scan the metadata of all the files (this is fast for local data, but not on by default), then dask will know the minimum and maximum frame values in each file. Selecting on the index will therefore only read the files that contain at least some matching values.
df = dd.read_parquet("clip*.parquet", index="frame", calculate_divisions=True)
df[df.index == 1]  # do something with this
Alternatively, you can pass filters to read_parquet if you want even more control; in that case you would make a new dataframe object for each iteration.
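For example, a minimal sketch of that filter-based loop (the frame range here is illustrative; in practice you could take it from the parquet metadata):

import dask.dataframe as dd

for f in range(1, 1001):  # illustrative range of frame numbers
    # Only row groups whose frame statistics overlap f need to be read
    frame_df = dd.read_parquet(
        "clip*.parquet",
        filters=[("frame", "==", f)],
    ).compute()
    # ... process frame_df ...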
Note, however, that a groupby might do what you want, without having to iterate over the frame numbers. Dask is pretty smart about loading only part of the data at a time and aggregating partial results from each partition. How well this works depends on how complicated an algorithm you want to do to each row set.
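A rough sketch of the groupby route (the column x and the per-frame computation are placeholders):

import dask.dataframe as dd

df = dd.read_parquet("clip*.parquet")

def process_frame(frame_rows):
    # frame_rows is a pandas DataFrame holding every row of one frame
    return frame_rows["x"].mean()  # placeholder computation

result = df.groupby("frame").apply(process_frame, meta=("x", "f8")).compute()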
I should mention that both parquet backends support all of these options; you don't specifically need pyarrow.

Related

Tableau has two data tables, how can I make a filter only apply to a single data table

I currently have two data tables, each linked to a date table. The data tables are from Salesforce. I can calculate the number of a certain case type per quarter without issue. I can also calculate the running sum over quarters to show the instrument install base increasing. I want to divide the number of cases per quarter by the install base. This calculation works, but when I apply a filter to see different types of cases per instrument, the filter impacts the install base as well. I would like to keep the install base consistent. I tried different LODs, but no luck. Any suggestions on filters and LODs, and where to place them in Tableau, would be appreciated.
One option is to use a parameter for filtering and then have a calculated field that changes based on the parameter value in one table but not in the other. However, this type of filter would affect all worksheets that use the same data.

Extract Last Value as Metric from Table Calculation in Tableau?

I have raw data in Tableau that looks like:
Month,Total
2021-08,17
2021-09,34
2021-10,41
2021-11,26
2021-12,6
And by using the following calculation
RUNNING_SUM(
COUNTD(IF [Inserted At]>=[Parameters].[Start Date]
AND [Inserted At]<=[End Date]
THEN [Id] ELSE NULL END
))
/
LOOKUP(RUNNING_SUM(
COUNTD(IF [Inserted At]>=[Parameters].[Start Date]
AND [Inserted At]<=[End Date]
THEN [Id] ELSE NULL END
)),-1)*100-100
I get
Month,My_Calc
2021-08,NULL
2021-09,200
2021-10,80.4
2021-11,28.3
2021-12,5.1
And all I really want is 5.1 (last monthly value) as one big metric (% Month-Over-Month Growth).
How can I accomplish this?
I'm relatively new to Tableau and don't know how to use calculated fields in conjunction with date groupings to express that I want to calculate month-over-month growth. I've tried the native year-over-year growth running-total table calculation, but that didn't give the same result, since I think my calculation method is different.
First a brief table calc intro, and then the answer at the end.
Most calculations in Tableau are actually performed by the data source (e.g. database server), and the results are then returned to Tableau (i.e. the client) for presentation. This separation of responsibilities allows high performance, even when facing very large data sets.
By contrast, table calculations operate on the table of query results that were returned from the server. They are executed late in the order-of-operations pipeline. That is why table calcs operate on aggregated data -- i.e. you have to ask for WINDOW_SUM(SUM([Sales])) and not WINDOW_SUM([Sales]).
Table calcs give you an opportunity to make final passes of calculation over the query results returned from the data source before presentation to the user. You can, for instance, calculate a running total, or make the visualization layout depend dynamically on the contents of the query results. This flexibility comes at a cost: the calculation itself is only one part of defining a table calc. You also have to specify how the calculation is applied to the table of summary results, known as partitioning and addressing. The Tableau online help has a useful definition of partitioning and addressing.
Essentially, table calcs are applied to blocks of summary data at a time, aka vectors or windows. Partitioning is how you tell Tableau how you wish to break up the summary query results into windows for purposes of applying your table calc. Addressing is how you specify the order in which you wish to traverse those partitions. Addressing is important for some table calcs, such as RUNNING_SUM, and unimportant for others, such as WINDOW_SUM.
Besides understanding partitioning and addressing very well, it is also helpful to learn about the functions INDEX(), SIZE(), FIRST(), LAST(), WINDOW_SUM(), LOOKUP() and (eventually) PREVIOUS_VALUE() to really understand table calcs. If you really understand them, you'll be able to implement all of these functions using just two of them as the fundamental ones.
Finally, to partially address your question:
You can use the boolean formula LAST() = 0 to tell if you are at the last value of your partition. If you use that formula as a filter, you can hide all the other values. You'll have to get partitioning and addressing specified correctly. You would essentially be fetching a batch of data from your server, using it in calculations on the client side, but only displaying part of it. This can be a bit brittle depending on which fields are on which shelves, but it can work.
Normally, it is more efficient to use a calculation that can be performed server-side, such as an LOD calc, if that allows you to avoid fetching data only for client-side calculations. But if the data is already fetched for another purpose, or if the calculation requires table calc features, such as the ability to depend on the order of the values, then table calcs are a good tool.
However you do it, the % month-to-month change from 2021.11 (a value of 26) to the value for 2021.12 (a value of 6) is not 5.1%.
It's (( 6 - 26 ) / 26) * 100 = -76.9 %
OK, starting from scratch, this works for me. (I don't know how to get exactly the table format I want without using Show Me and flipping the axes, but it works. Anyone else?)
drag Date to rows, change it to combined Month(Date)
drag sales to column shelf
in showme select TEXT-TABLES
flip rows for columns using tool bar
that gets a table like the one you show above
Drag Sales to color (this is a trick to simply hold it for a minute),
click the down-arrow on the new SALES pill in the mark card,
select "Add a table calculation",
select Running Total, of SUM, compute using Table(down), but don't close this popup window yet.
click Add Secondary Calculation checkbox at the bottom
select Percent Different From
compute using table down
relative to Previous
Accept your work by closing the popup (x).
NOW, change the new pill in the mark card from color to text
you can see the 5.1% at the bottom. Almost done.
Reformat again by clicking table in ShowMe
and flipping axes.
click the sales column header and hide it
create a new calculated field
label 'rows-from-bottom'
formula = last()
close the popup
drag the new pill rows-from-bottom to the filters shelf
select range 0 to 0
close the popup.
Done.
For the next two weeks you can see the finished workbook here
https://public.tableau.com/app/profile/wade.schuette/viz/month-to-month/hiderows?publish=yes

elasticsearch - time series - search time intersection between two indexes

I have two indexes. Each one of them has time and value.
(screenshot of the two indexes' data structure -- each entry has a time and a val -- omitted)
In the example above I would like to find a specific time where Index2.val - Index1.val > 70.
Note that a value does not change after its last entry, which means that if a value is set to 20 on 1-1-14, it will still be 20 on 2-1-14 if no newer entry exists.
One solution would be to fetch both of the vectors and do it with a linear algorithm, but I suppose the performance would be bad.
Is there an out-of-the-box solution for that?
Thanks
David
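For reference, the linear scan mentioned in the question could look roughly like this once both series have been fetched (a sketch only; the (time, value) layout, the forward-fill rule, and the threshold of 70 come from the example above, everything else is illustrative):

def first_time_exceeding(idx1, idx2, threshold=70):
    # idx1 and idx2 are lists of (time, value) pairs sorted by time,
    # e.g. the hits of two sorted queries. A value carries forward
    # until the next entry, as described in the question.
    i = j = 0
    v1 = v2 = None
    while i < len(idx1) or j < len(idx2):
        t1 = idx1[i][0] if i < len(idx1) else None
        t2 = idx2[j][0] if j < len(idx2) else None
        if t2 is None or (t1 is not None and t1 <= t2):
            t, v1 = idx1[i]
            i += 1
        else:
            t, v2 = idx2[j]
            j += 1
        if v1 is not None and v2 is not None and v2 - v1 > threshold:
            return t
    return None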

Is there a way to tell Google Cloud Dataflow that the data coming in is already ordered?

We have an input data source that is approximately 90 GB (it can be either a CSV or XML, it doesn't matter) that contains an already ordered list of data. For simplicity, you can think of it as having two columns: time column, and a string column. The hundreds of millions of rows in this file are already ordered by the time column in ascending order.
In our Google Cloud Dataflow pipeline, we have modeled each row as an element in our PCollection, and we apply DoFn transformations to the string field (e.g. count the number of characters that are uppercase in the string, etc.). This works fine.
However, we then need to apply functions that are supposed to be calculated for a block of time (e.g. five minutes) with a one minute overlap. So, we are thinking about using a sliding windowing function (even though the data is bounded).
However, the calculation logic that needs to be applied over these five-minute windows assumes that the data is ordered logically (i.e. ascending) by the time field. My understanding is that even when using these windowing functions, one cannot assume that within each window the PCollection elements are ordered in any way -- so one would need to manually iterate through every window and reorder it, right? However, this seems like a huge waste of computational power, since the incoming data is already ordered. So is there a way to tell Google Cloud Dataflow that the input data is ordered, so that it maintains that order even within the windows?
On a minor note, I had another question: my understanding is that if the data source is unbounded, there is never an "overall aggregation" that would execute, as that never really makes sense (there is no end to the incoming data); however, if one uses a windowing function on bounded data, there is a true end state, which corresponds to when all the data has been read from the CSV file. Therefore, is there a way to tell Google Cloud Dataflow to do a final calculation once all the data has been read in, even though we are using a windowing function to divide the data up?
SlidingWindows sounds like the right solution for your problem. The ordering of the incoming data is not preserved across a GroupByKey, so informing Dataflow of that would not be useful currently. However, the batch Dataflow runner does already sort by timestamp in order to implement windowing efficiently, so for simple windowing like SlidingWindows, your code will see the data in order.
If you want to do a final calculation after doing some windowed calculations on a bounded data set, you can re-window your data into the global window again, and do your final aggregation after that:
p.apply(Window.into(new GlobalWindows()));
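As an illustration only, here is roughly what that shape looks like in the Apache Beam Python SDK (the snippet above is from the Java SDK; the file path, the CSV parsing, and the per-window computation below are all placeholders):

import apache_beam as beam
from apache_beam.transforms import window

def to_timestamped(row):
    # Hypothetical CSV layout: "epoch_seconds,text"
    ts, text = row.split(",", 1)
    return window.TimestampedValue(text, float(ts))

with beam.Pipeline() as p:
    per_window = (
        p
        | "Read" >> beam.io.ReadFromText("input.csv")  # placeholder path
        | "AddTimestamps" >> beam.Map(to_timestamped)
        | "Slide" >> beam.WindowInto(window.SlidingWindows(size=300, period=60))
        | "UppercaseCount" >> beam.Map(lambda s: sum(1 for c in s if c.isupper()))
        | "SumPerWindow" >> beam.CombineGlobally(sum).without_defaults()
    )
    # Re-window into the global window so one final aggregation runs
    # over all of the per-window results.
    final = (
        per_window
        | "GlobalWindow" >> beam.WindowInto(window.GlobalWindows())
        | "FinalSum" >> beam.CombineGlobally(sum)
    )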

Simple way to analyze data based on common key

What would be the simplest way to process all the records that were mapped to a specific key and output multiple records for that data?
For example (a synthetic example), assume my key is a date and the values are intra-day timestamps with measured temperatures. I'd like to classify the temperatures into high/average/low within the day (that is, above/below one standard deviation from the day's average).
The output would be the original temperatures with their new classifications.
Using Combine.perKey(CombineFn) allows only one output per key via the #extractOutput() method.
Thanks
CombineFns are restricted to a single output value because that allows the system to do additional parallelization: combining different subsets of the values separately, and then combining their intermediate results in an arbitrary tree reduction pattern, until a single result value is produced for each key.
If your values per key don't fit in memory (so you can't use the GroupByKey-ParDo pattern that Jeremy suggests) but the computed statistics do fit in memory, you could also do something like this:
(1) Use Combine.perKey() to calculate the stats per day
(2) Use View.asIterable() to convert those into PCollectionViews.
(3) Reprocess the original input with a ParDo that takes the statistics as side inputs
(4) In that ParDo's DoFn, have startBundle() take the side inputs and build up an in-memory data structure mapping days to statistics that can be used to do lookups in processElement.
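Roughly, steps (1)-(4) could look like this in the Beam Python SDK (shown only as an illustration of the shape; the Python analogue of the PCollectionView side input here is beam.pvalue.AsDict, and the sample readings are made up):

import apache_beam as beam

class DayStatsFn(beam.CombineFn):
    # Accumulator: (count, sum, sum of squares) -- mergeable pieces,
    # which is what lets a CombineFn be combined in any tree order.
    def create_accumulator(self):
        return (0, 0.0, 0.0)

    def add_input(self, acc, temp):
        n, s, ss = acc
        return (n + 1, s + temp, ss + temp * temp)

    def merge_accumulators(self, accs):
        counts, sums, sumsqs = zip(*accs)
        return (sum(counts), sum(sums), sum(sumsqs))

    def extract_output(self, acc):
        n, s, ss = acc
        mean = s / n if n else 0.0
        std = max(0.0, ss / n - mean * mean) ** 0.5 if n else 0.0
        return (mean, std)

def classify(reading, stats):
    day, temp = reading
    mean, std = stats[day]
    if temp > mean + std:
        return (day, temp, "high")
    if temp < mean - std:
        return (day, temp, "low")
    return (day, temp, "average")

with beam.Pipeline() as p:
    readings = p | "Read" >> beam.Create([(1, 20.5), (1, 31.0), (1, 12.0)])
    stats = readings | "StatsPerDay" >> beam.CombinePerKey(DayStatsFn())
    classified = readings | "Classify" >> beam.Map(
        classify, stats=beam.pvalue.AsDict(stats))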
Why not use a GroupByKey operation followed by a ParDo? The GroupBy would group all the values with a given key. Applying a ParDo then allows you to process all the values with a given key. Using a ParDo you can output multiple values for a given key.
In your temperature example, the output of the GroupByKey would be a PCollection of KV<Integer, Iterable<Float>> (I'm assuming you use an Integer to represent the day and a Float for the temperature). You could then apply a ParDo to process each of these KVs. For each KV you could iterate over the Floats representing the temperatures and compute the high/average/low statistics. You could then classify each temperature reading using those stats and output a record representing the classification. This assumes the number of measurements for each day is small enough to easily fit in memory.
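A rough sketch of that GroupByKey-then-ParDo shape, again using the Beam Python SDK purely for illustration (the sample readings are made up):

import statistics
import apache_beam as beam

def classify_day(element):
    # element is (day, iterable of temperatures) produced by GroupByKey;
    # all readings for one day must fit in memory here.
    day, temps = element
    temps = list(temps)
    mean = statistics.mean(temps)
    std = statistics.pstdev(temps)
    for t in temps:
        if t > mean + std:
            label = "high"
        elif t < mean - std:
            label = "low"
        else:
            label = "average"
        yield (day, t, label)  # multiple outputs per key

with beam.Pipeline() as p:
    (p
     | "Read" >> beam.Create([(1, 20.5), (1, 31.0), (1, 12.0), (2, 18.0)])
     | "GroupByDay" >> beam.GroupByKey()
     | "Classify" >> beam.FlatMap(classify_day))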
