I have a financial time series and I want to make a new dataset out of it. I want to take every 20 data points (rows) and replace them with a single data point, like this:
[mean of those 20 data points, standard deviation of those 20 data points].
I actually think I need a Gaussian model for the variation, i.e. the standard deviation.
I use Python 3.
My dataset is laid out so that the first column is the index (number of days) and the second column is the close prices.
I do not know how to write the code that takes every 20 data points and replaces them with the values I described above.
If the data points are stored in a DataFrame, say df, you can group every 20 consecutive rows using groupby with floor division on the index (note the //; with plain / in Python 3 every row would get a unique float key and end up in its own group):
df.groupby(df.index // 20)
You can then compute the mean and standard deviation of the groups as follows, and concatenate the two results if you need to:
df.groupby(df.index // 20).mean()
df.groupby(df.index // 20).std()
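For reference, a minimal self-contained sketch that produces both statistics side by side in one step. The column name close and the example data are assumptions, and it assumes a default 0-based integer index (otherwise group on np.arange(len(df)) // 20 instead):

import pandas as pd

# hypothetical data: 100 days of close prices (swap in your own series)
df = pd.DataFrame({'close': range(100)})

# floor-divide the row number so each block of 20 rows shares one group label,
# then collapse each block to its mean and standard deviation
reduced = df.groupby(df.index // 20)['close'].agg(['mean', 'std'])
print(reduced)

Each row of reduced is the [mean, std] pair for one block of 20 days.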
So in my current project, I am analyzing different ML models based on their quality. Right now, I'd like to put that quality in the context of the time a model needs to train. I track quality using an F1 score and I also log the time needed. I've been researching the best way to define some kind of time-quality ratio, but I am unsure how to get there.
I've been thinking of creating a chart that has the F1 scores on the y-axis and the time needed on the x-axis (or the other way around, I don't mind either, but this made the most sense to me), but I struggle to set that up in Google Sheets. My data currently looks something like this (all values are imagined and could vary):
First Dataset

              Time (in Min)   Quality (F1 Score)
Iteration 1         5                0
Iteration 2         8                0.1
Iteration 3        11                0.2
Iteration 4        21                0.5
Iteration 5        20                0.8
Iteration 6        21                1
And I'd like a chart (this one was created manually in GeoGebra) similar to this:
I'm aware I can manually pick my x-axis but was wondering what the best way would be to achieve this - if at all.
You can try a Line chart like this, with the Time column as the x-axis and the F1 score as the data series:
I have some "complex" calculation that I currently do row-by-row and store in a helper column. In the end I simply run a sum on the helper column to calculate the total value of that calculation.
I would like to simply have one field, where I do the calculation of the total value - without needing the helper column.
To be concrete, I am calculating exertion load (XL): http://www.strongur.io/monitoring-training-stress-with-exertion-load/
As input data I get the weight lifted, the repetitions performed, and how many reps were left in the tank before failure (RIR) is reached. I calculate the XL of a set by expanding the reps performed into a range (3 reps becomes [1,2,3]) and then running an ArrayFormula on that range to calculate each rep's distance to failure (rep 1 is further from failure than rep 3) and, from that, the XL of that single rep. I then use a sum to calculate the total exertion load of the given set.
Unfortunately, this approach does not scale to the "single field" solution, as I cannot nest another ArrayFormula around it. I am not sure where to go from here; my spreadsheet experience is limited.
I think I am missing something here from a conceptual perspective. I've done some googling and have seen matrices mentioned; would that be the right direction for this kind of thing? I would like to avoid having to write a JavaScript function just for this use case.
Thanks in advance for any tips/pointers! :)
Best Regards,
Simon
Sample Spreadsheet: https://docs.google.com/spreadsheets/d/1CNYxsQKo_CUIsstCDbcjoojL6WK46rg9ONybviFxAGs/edit?usp=sharing
Use:
=ARRAYFORMULA(SUM(IF(B2:B="";;IF(COLUMN(1:1)>C2:C;;
B2:B*EXP(-0,215*(D2:D+C2:C-COLUMN(1:1)))))))
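For reference, here is a minimal Python sketch of the per-set calculation that formula performs, assuming columns B, C and D hold the weight lifted, the reps performed and the RIR, with the 0.215 decay constant taken from the formula above:

import math

def exertion_load(weight, reps, rir, decay=0.215):
    # rep i (1-based) is (rir + reps - i) reps away from failure;
    # reps closer to failure are discounted less and so contribute more XL
    return sum(weight * math.exp(-decay * (rir + reps - i))
               for i in range(1, reps + 1))

# hypothetical example: 100 kg for 3 reps with 2 reps in reserve
print(exertion_load(100, 3, 2))

The spreadsheet formula does the same thing across all rows at once: COLUMN(1:1) plays the role of the rep counter i, and the outer SUM adds up the per-rep loads.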
Working in Google Sheets, I'm making a gradebook. In the gradebook there are different assignment types that have different weights, which can be chosen from a drop-down. I would like to:
Average assignments of the same type (there will be 3 values).
Weight them appropriately (0.1 for Baseline, 0.7 for Critical, 0.2 for Accelerated).
Add all the values together into one grade percentage.
Display them on the grade report sheet for the appropriate student.
I would like for this to be dynamic, so that if I change the assignment type (or any other values) the grade will change appropriately.
My MWE can be found here.
The AVERAGEIF function will do what you need. A formula like
=AVERAGEIF(B$3:G$3,"Baseline",B4:G4)
sitting next to your student in the 4th row (this is draggable, so the row will update) checks against the assignment-type values in your 3rd row (the 3 is fixed by the dollar sign, so it stays put while dragging) and will accomplish your goal 1.
Once you do that for each of the averages, you can define one more column as =0.1*baseline_avg + 0.7*critical_avg + 0.2*accelerated_avg (not exactly that, I am pseudocoding; substitute the cells holding the three averages) to achieve your other three objectives, I think.
I have a data set that contains the following information: date, item #, and the unit price for that item on that date. What I would like to create is one histogram per item (my dataset has 17 unique items), charting the frequency of the unit prices. Is this possible in Alteryx?
What you really want is the ability to group by item within your data set. I think the closest thing to this for your specific use case is the Summarize tool. You can group by item and then use the percentile operation to generate several points within the data range to add to a histogram.
What would be the simplest way to process all the records that were mapped to a specific key and output multiple records for that data?
For example (a synthetic one): assume my key is a date and the values are intra-day timestamps with measured temperatures. I'd like to classify each temperature as high/average/low within its day (i.e., above or below one standard deviation from the day's average).
The output would be the original temperatures with their new classifications.
Using Combine.PerKey(CombineFn) allows only one output per key, via the #extractOutput() method.
Thanks
CombineFns are restricted to a single output value because that allows the system to do additional parallelization: combining different subsets of the values separately, and then combining their intermediate results in an arbitrary tree reduction pattern, until a single result value is produced for each key.
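To make that restriction concrete, here is a rough sketch of such a single-output CombineFn, written against the Beam Python SDK rather than the Java API used in this thread (Python spells the methods create_accumulator/add_input/merge_accumulators/extract_output). It collapses all of a key's temperatures into one (mean, stddev) pair:

import apache_beam as beam

class MeanStdCombineFn(beam.CombineFn):
    # accumulate (count, sum, sum of squares); these merge associatively,
    # which is what lets the runner combine partial subsets in any order
    def create_accumulator(self):
        return (0, 0.0, 0.0)

    def add_input(self, acc, x):
        n, s, ss = acc
        return (n + 1, s + x, ss + x * x)

    def merge_accumulators(self, accs):
        counts, sums, sumsqs = zip(*accs)
        return (sum(counts), sum(sums), sum(sumsqs))

    def extract_output(self, acc):
        n, s, ss = acc
        if n == 0:
            return (float('nan'), float('nan'))
        mean = s / n
        var = max(ss / n - mean * mean, 0.0)  # clamp float rounding error
        return (mean, var ** 0.5)

Note how extract_output necessarily returns one value per key, which is exactly the limitation the question runs into.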
If your values per key don't fit in memory (so you can't use the GroupByKey-ParDo pattern that Jeremy suggests) but the computed statistics do fit in memory, you could also do something like this:
(1) Use Combine.perKey() to calculate the stats per day
(2) Use View.asIterable() to convert those into PCollectionViews.
(3) Reprocess the original input with a ParDo that takes the statistics as side inputs
(4) In that ParDo's DoFn, have startBundle() take the side inputs and build up an in-memory data structure mapping days to statistics that can be used to do lookups in processElement.
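A rough sketch of steps (1) to (3) in the Beam Python SDK, reusing the MeanStdCombineFn sketched above (in Python the side input is passed straight to the processing function via AsDict, which stands in for the View.asIterable() plus startBundle() lookup structure described here; the example data is hypothetical):

import apache_beam as beam

with beam.Pipeline() as p:
    readings = p | 'Read' >> beam.Create([(1, 20.5), (1, 31.0), (1, 10.0)])

    # (1) combine per key to get the per-day statistics
    stats = readings | 'Stats' >> beam.CombinePerKey(MeanStdCombineFn())

    # (2)+(3) reprocess the raw readings with the stats as a side input
    def classify(reading, stats_by_day):
        day, temp = reading
        mean, std = stats_by_day[day]
        if temp > mean + std:
            return day, temp, 'high'
        if temp < mean - std:
            return day, temp, 'low'
        return day, temp, 'average'

    classified = readings | 'Classify' >> beam.Map(
        classify, beam.pvalue.AsDict(stats))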
Why not use a GroupByKey operation followed by a ParDo? The GroupByKey would group all the values with a given key. Applying a ParDo then allows you to process all the values for that key, and unlike a CombineFn, a ParDo can output multiple values per input.
In your temperature example, the output of the GroupByKey would be a PCollection of KV<Integer, Iterable<Float>> (I'm assuming you use an Integer to represent the day and a Float for the temperature). You could then apply a ParDo to process each of these KVs: for each one, iterate over the Floats representing the temperatures and compute the high/average/low thresholds, then classify each temperature reading against those stats and output a record representing the classification. This assumes the number of measurements for each day is small enough to easily fit in memory.
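For illustration, a minimal sketch of this GroupByKey-ParDo pattern in the Beam Python SDK (the answer above is written against the Java API; the day/temperature pairs here are hypothetical):

import statistics
import apache_beam as beam

def classify_day(kv):
    day, temps = kv
    temps = list(temps)  # assumes one day's readings fit in memory
    mean = statistics.mean(temps)
    std = statistics.pstdev(temps)
    for t in temps:
        if t > mean + std:
            yield day, t, 'high'
        elif t < mean - std:
            yield day, t, 'low'
        else:
            yield day, t, 'average'

with beam.Pipeline() as p:
    classified = (p
                  | beam.Create([(1, 20.5), (1, 31.0), (1, 10.0), (2, 15.0)])
                  | beam.GroupByKey()  # one element per day: (day, [temps])
                  | beam.FlatMap(classify_day))  # multiple outputs per key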