I am currently calculating the average for a single dimension in a Druid data source using a timeseries query via pydruid. This is based on an example in the documentation (https://github.com/druid-io/pydruid):
from pydruid.client import PyDruid
from pydruid.utils.aggregators import count, doublesum
from pydruid.utils.postaggregator import Field

client = PyDruid('http://localhost:8082', 'druid/v2')  # broker URL and endpoint for your cluster
client.timeseries(
    datasource='test_datasource',
    granularity='hour',
    intervals='2019-05-13T11:00:00.000/2019-05-23T17:00:00.000',
    aggregations={
        'sum': doublesum('dimension_name'),
        'count': count('rows')
    },
    post_aggregations={
        'average': (
            Field('sum') / Field('count')
        )
    }
)
My problem is that I don't know what count('rows') is doing. It seems to give the total row count for the datasource and is not filtered on the dimension, so I don't know whether the average will be wrong if some rows have a null value for the dimension in question.
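One idea I have been toying with is to count only the rows where the dimension is non-null, using pydruid's filtered aggregator. This is just an untested sketch, and it assumes that a Druid selector filter with a null value matches the null rows:

from pydruid.utils.aggregators import count, doublesum, filtered
from pydruid.utils.filters import Filter

# assumption: a selector filter with value=None matches rows where
# dimension_name is null, so negating it keeps only non-null rows
not_null = Filter(
    type='not',
    field=Filter(type='selector', dimension='dimension_name', value=None)
)
aggregations = {
    'sum': doublesum('dimension_name'),
    'count': filtered(not_null, count('rows'))
}

I have not verified that this handles nulls the way I expect, which is part of what I'm asking.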
I was wondering whether anyone knew how to calculate the average correctly?
Many thanks
I created a calculated field using Fixed LOD to get the distinct values of a field called Estimated ARR because there are duplicates due to joining two tables (with SQL).
The calculation, Distinct Estimated ARR, is:
{FIXED [Estimated ARR] : AVG([Estimated ARR]) }
And when I sum the values in the view, only 2 of the 4 rows show correct values. I'm not sure why this is happening.
This is showing what the ARR values should be for each Customer Segment (using the data before the join):
This shows the sum of the estimated ARR (not distinct due to duplicates in joined dataset) with the sum of the distinct estimated ARR, which come from my LOD expression. As you can see, Customer Segments A and C are showing values that are off by a few hundred thousand, while B and D have accurate sums.
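To make the setup concrete, here is a toy pandas reproduction of the duplication (made-up names and numbers, not my real data), with a group-by standing in for the FIXED LOD:

import pandas as pd

# two customers; the join duplicates customer A's row (hypothetical data)
orders = pd.DataFrame({'Customer': ['A', 'B'], 'EstimatedARR': [100.0, 200.0]})
payments = pd.DataFrame({'Customer': ['A', 'A', 'B']})   # A matches twice
joined = orders.merge(payments, on='Customer')

print(joined['EstimatedARR'].sum())   # 400.0 -- the plain SUM double-counts A
# rough analogue of {FIXED [Estimated ARR] : AVG([Estimated ARR])}:
# one value per distinct ARR
print(joined.groupby('EstimatedARR')['EstimatedARR'].mean().sum())   # 300.0

Note that grouping on the ARR value itself would collapse two customers that happen to share the same ARR into one value, which might matter here.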
Please let me know if any other info is needed. I was trying to keep this as short as possible.
Got the answer here for anyone who is curious! https://community.tableau.com/s/feed/0D58b0000A9ZJhoCQG?t=1664819504200
What is the range of the pacf function in Python? I'd assumed it would be [-1, 1], like Pearson's correlation and the autocorrelation, but when I tried it on my data I saw values like -5.
Can anyone tell me why pacf has a different range, and what exactly is this range?
Edit:
The functions I'm using are:
from statsmodels.graphics.tsaplots import plot_pacf, plot_acf
from statsmodels.tsa.stattools import pacf, acf
I'm using the data from this hackathon. This is a time series of weekly sales across different stores and departments. I have checked the pacf values for the data of one store and one department.
Here's the code for getting the pacf values:
# getting the data for just one store & dept
s1d1 = data[(data.Store==1)&(data.Dept==1)].sort_values('Date').reset_index(drop=True)
# differencing the series by 1 to make it stationary
s1d1['Weekly_Sales_shifted'] = s1d1.Weekly_Sales.shift(1)
s1d1['Weekly_Sales_differenced'] = s1d1.Weekly_Sales - s1d1.Weekly_Sales_shifted
# dropping the first record since it will have a nan in differenced column
s1d1 = s1d1.dropna(axis=0, subset=['Weekly_Sales_differenced'], how='any')
# getting the pacf values
pacf_values = pacf(s1d1['Weekly_Sales_differenced'], nlags=53)
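For reference, here is a quick synthetic sanity check I put together alongside this (my own experiment, not from the statsmodels docs), to see whether estimates leave [-1, 1] at high lags on a series of similar length:

import numpy as np
from statsmodels.tsa.stattools import pacf

# white noise of roughly the same length as one store/dept series
rng = np.random.default_rng(0)
noise = rng.normal(size=143)

values = pacf(noise, nlags=53)   # same nlags as above
print(values.min(), values.max())   # check whether any estimate leaves [-1, 1]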
I would like to get an average percentage out of my sample; however, I need to apply several conditions. I tried to use AVERAGE and AVERAGEIF together with FILTER, but everything returns an error, and I think I'm combining the formulas incorrectly.
You can find my test sheet here.
The rules I need to apply:
The score for individual rows is in column N of the "Data" sheet, and the total result should appear in column E of the "Calculation" sheet.
As the sample is huge in real life, I need to narrow it down with several conditions:
keep only items where the code/ID starts with 0: Data!A:A&"", "^0.+"
keep only items matching the date in the Calculation sheet: Data!C:C=$B3
keep only items with the specific name: Data!B:B=$A3
Any idea how to get the average % out of items with specific filters?
UPDATE
Expected results: I want the total average for a specific date, name, and ID. Say I applied these filters; then I would see only the final average percentage:
Test =100%
Test = 0%
Test = 100%
Total Average %: 66.7%
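To pin down what I mean, here is the same filtering logic in pandas (toy rows; the column names are just stand-ins for my sheet's columns A, B, C, and N):

import pandas as pd

# toy stand-in for the Data sheet: ID=col A, Name=col B, Date=col C, Score=col N
data = pd.DataFrame({
    'ID':    ['0123', '0456', '0789', '9999'],
    'Name':  ['Test', 'Test', 'Test', 'Test'],
    'Date':  ['2022-10-01'] * 4,
    'Score': [1.0, 0.0, 1.0, 0.5],
})

mask = (
    data['ID'].str.match(r'^0.+')       # ID starts with 0
    & (data['Name'] == 'Test')          # name matches A3
    & (data['Date'] == '2022-10-01')    # date matches B3
)
print(data.loc[mask, 'Score'].mean())   # 0.666... -> 66.7%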
Also, I think the best way would be to use AVERAGEIFS, but I'm getting the error "Array arguments to AVERAGEIFS are of different size".
=AVERAGEIFS(Data!N:N,Data!B:B=$A3,Data!C:C=$B3,Data!A:A&"", "^0.+")
=IFERROR(AVERAGEIFS(Data!N3:N,Data!B3:B,A3,Data!C3:C,B3,ARRAYFORMULA(if(LEN(Data!A3:A),REGEXMATCH(Data!A3:A,"^0.+"),"")),TRUE),"")
or
=IFERROR(AVERAGE(FILTER(Data!N3:N,Data!B3:B=A3,Data!C3:C=B3,REGEXMATCH(Data!A3:A,"^0.+"))),"")
or
=IFERROR(INDEX(QUERY({Data!A3:C,Data!N3:N},"select avg(Col4) where Col1 starts with '0' and Col2 = '"&A3&"' and Col3 = '"&B3&"'"),2,0),"")
Is it possible to get percentile on aggregated data in Influxdb?
Say, my data is
db,label1=value1 measure1_count=20 measure1_mean=0.8 140000000000
db,label1=value1 measure1_count=8 measure1_mean=0.9 140000001000
db,label1=value1 measure1_count=15 measure1_mean=0.4 140000002000
Is it possible to do a percentile on the above data in InfluxDB 1 or 2?
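For intuition, here is a quick Python illustration (my own toy arithmetic, nothing Influx-specific) of why the overall mean is exactly recoverable from these per-point aggregates but an exact percentile is not:

# per-interval aggregates from the sample line protocol above
counts = [20, 8, 15]
means = [0.8, 0.9, 0.4]

# the overall mean is exactly recoverable as a count-weighted average
overall_mean = sum(c * m for c, m in zip(counts, means)) / sum(counts)
print(overall_mean)   # ~0.679

# a percentile is not recoverable: very different raw samples can produce
# these same (count, mean) pairs while having different medians or p95s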
InfluxDB provides the MEDIAN() aggregate function for calculating the median.
select MEDIAN(Value) from ProcessData group by TagName
Note: MEDIAN() is nearly equivalent to PERCENTILE(field_key, 50), except MEDIAN() returns the average of the two middle field values if the field contains an even number of values.
https://docs.influxdata.com/influxdb/v1.8/query_language/functions/#:~:text=Note%3A%20MEDIAN()%20is%20nearly,an%20even%20number%20of%20values.
Google Sheets average (avg) Query will fail with the error AVG_SUM_ONLY_NUMERIC if any column in the dataset is empty. How can you overcome this?
Essentially, this occurs because the query is being run on a dynamically generated dataset, so it's impossible to know beforehand which columns are empty. Moreover, the query output "layout" must not change: if a column is empty, the query should return blank or 0 for that column.
Let's give it a look
Scenario: a Google Sheet is being used to insert markings for students tests.
When a single test is done by students, the teacher assigns multiple grades for it: for instance, one marking for writing, one for comprehension, etc.
The sheet should finally build columns containing an average for all the markings assigned within the same date.
For instance, in the above sheet (link here), columns with markings given on December 16th (cols B,G,M,R,V) should be averaged in column AE.
Thanks to brilliant user Marikamitsos, this is achieved with the following query in cell AE4:
=ARRAYFORMULA(QUERY(TRANSPOSE(QUERY(TRANSPOSE(FILTER(B4:Z,B3:Z3=AE3)),
"select "&TEXTJOIN(",", 1, IF(LEN(A4:A),
"avg(Col"&ROW(A4:A)-ROW(A4)+1&")", )))&""),
"select Col2")*1)
How does the above work?
The dataset is filtered by date
The filtered dataset is transposed and an avg Query is run on it
The resulting dataset is queried again to easily filter out labels
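A rough pandas mirror of those steps (toy marks and made-up student names), just to pin the logic down:

import numpy as np
import pandas as pd

# rows = students, columns = individual markings; dates plays the role of row 3
marks = pd.DataFrame(
    [[7.0, 9.0, 6.0], [np.nan, 8.0, np.nan]],
    index=['Alice', 'Bob'],
    columns=['m1', 'm2', 'm3'],
)
dates = pd.Series(['2022-12-16', '2022-12-16', '2022-10-28'], index=marks.columns)

# step 1: filter columns by date; steps 2-3: average each student's row
dec16 = marks.loc[:, dates == '2022-12-16']
print(dec16.mean(axis=1))   # Alice 8.0, Bob 8.0 -- pandas skips NaNs

oct28 = marks.loc[:, dates == '2022-10-28']
print(oct28.mean(axis=1))   # Bob has no marking at all -> NaN; the Sheets Query errors instead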
All this works fine until a student has no markings for a given date, as occurs in cell AG4: student Bob has no markings for the October 28th test, and the query throws the error AVG_SUM_ONLY_NUMERIC.
Could there be a way to insert a 0 in the filtered dataset FILTER(B4:Z,B3:Z3=AE3) so that ONLY empty rows are set to 0? This would prevent the query from failing while avoiding altering the dataset layout.
Or could there be a way to ignore zeroes in avg query?
NOTE: students cannot be graded with '0' when skipping a test!
See if this works
=ARRAYFORMULA(QUERY(TRANSPOSE(QUERY(TRANSPOSE(FILTER(B4:Z+0,B3:Z3=AG3)),
"select "&TEXTJOIN(",", 1, IF(LEN(A4:A),
"avg(Col"&ROW(A4:A)-ROW(A4)+1&")", )))&""),
"select Col2")*1)