I have the following columns:
Brand Reviews Rating
Test 25 4.5
Test 26 3.5
Test 41 4.3
Test2 20 2.1
Test2 15 4
Test3 29 5
Test3 22 3.3
I would like to calculate the total number of reviews just for the brand Test. How can I proceed?
Try:
=SUMIF(A:A,"Test",B:B)
(thanks to @ScottCraner). =AVERAGEIF(A:A,"Test",B:B) with the same arguments would give the average number of reviews per row rather than the total.
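For the sample data above, =SUMIF(A:A,"Test",B:B) returns 92 (25 + 26 + 41). Note the criteria match is not case-sensitive, so "Test" also matches TEST; and if you wanted the number of review rows rather than their sum, =COUNTIF(A:A,"Test") returns 3.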
I have a problem. I want to predict, when an order comes in, in how many days the customer will place another order.
I have already created my target variable, next_purchase_in_days, which specifies in how many days the customer will place an order again; this is what I would like to predict.
Since I have too few features, I want to do feature engineering. I would like a feature for how many orders the customer has placed in the last 90 days. So far I have calculated this by counting back 90 days from today's date.
Or is it better to record, per row, how many orders the customer had placed in the 90 days before that row's date? Please see the example below.
So: does it make more sense to calculate this once from today's date and include it as a feature, or should it be recalculated for each row?
customerId fromDate next_purchase_in_days
0 1 2021-02-22 24
1 1 2021-03-18 4
2 1 2021-03-22 109
3 1 2021-02-10 12
4 1 2021-09-07 133
8 3 2022-05-17 61
10 3 2021-02-22 133
11 3 2021-02-22 133
Example
# What I have
customerId fromDate next_purchase_in_days purchase_in_last_90_days
0 1 2021-02-22 24 0
1 1 2021-03-18 4 0
2 1 2021-03-22 109 0
3 1 2021-02-10 12 0
4 1 2021-09-07 133 0
8 3 2022-05-17 61 1
10 3 2021-02-22 133 1
11 3 2021-02-22 133 1
# Or does this make more sense?
customerId fromDate next_purchase_in_days purchase_in_last_90_days
0 1 2021-02-22 24 1
1 1 2021-03-18 4 2
2 1 2021-03-22 109 3
3 1 2021-02-10 12 0
4 1 2021-09-07 133 0
8 3 2022-05-17 61 1
10 3 2021-02-22 133 0
11 3 2021-02-22 133 0
You can address this in a number of ways, but something interesting to consider is the interaction between Date & Customer ID.
Dates have meaning to humans beyond just timekeeping. They carry emotional and cultural importance: holidays, weekends, seasons, anniversaries, etc. So there is a conditional relationship between the probability of a purchase and events: P(x|E).
Customer Ids theoretically represent a single person, or at the very least a single business with a limited number of people responsible for purchasing.
Certain people/corporations are just more likely to spend.
So here are a number of ways to address this:
Find a list of holidays relevant to the users. For instance, if they are US based, find a list of US-recognized holidays. Then create a feature based on each date: Date_Till_Next_Holiday (or DTNH for short).
Dates also have cyclical aspects that can encode probability: day of the year (1-365), day of the week (1-7), week number (1-52), month (1-12), quarter (1-4). I would create additional columns encoding each of these.
To address the customer interaction, keep a running total of past purchases. You could call it Purchases_to_date; it would be an integer (0...n) where n is the number of previous purchases.
I made a notebook to show you how to do running totals.
Humans tend to share purchasing patterns with other humans. You could run a k-means clustering algorithm that splits customers into 3-4 groups based on all the previous info, and then use their cluster number as a feature. Sklearn-Kmeans
So based on all that, you could engineer 8 different columns. I would then run Principal Component Analysis (PCA) to reduce them to 3-4 features.
You can use Sklearn-PCA to do the PCA.
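To make the running-total and cyclical-date ideas concrete, here is a minimal pandas sketch; the customerId and fromDate names come from the question, while the abbreviated data and the new column names are only illustrative:

import pandas as pd

# Toy frame using the question's column names.
df = pd.DataFrame({
    "customerId": [1, 1, 1, 1, 1, 3, 3, 3],
    "fromDate": pd.to_datetime([
        "2021-02-10", "2021-02-22", "2021-03-18", "2021-03-22",
        "2021-09-07", "2021-02-22", "2021-02-22", "2022-05-17"]),
}).sort_values(["customerId", "fromDate"])

# Running total of previous purchases per customer:
# 0 for the first order, 1 for the second, and so on.
df["Purchases_to_date"] = df.groupby("customerId").cumcount()

# Cyclical aspects of each date.
df["day_of_year"] = df["fromDate"].dt.dayofyear                       # 1-365
df["day_of_week"] = df["fromDate"].dt.dayofweek + 1                   # 1-7
df["week_number"] = df["fromDate"].dt.isocalendar().week.astype(int)  # 1-52
df["month"] = df["fromDate"].dt.month                                 # 1-12
df["quarter"] = df["fromDate"].dt.quarter                             # 1-4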
I have a table like this
Row time viewCount
1 00:00:00 31
2 00:00:01 44
3 00:00:02 78
4 00:00:03 71
5 00:00:04 72
6 00:00:05 73
7 00:00:06 64
8 00:00:07 70
I would like to aggregate this into
Row time viewCount
1 00:00:00 31
2 00:15:00 445
3 00:30:00 700
4 00:45:00 500
5 01:00:00 121
6 01:15:00 475
.
.
.
Please help. Thanks in advance
Supposing that you actually have a TIMESTAMP column, you can use an approach like this:
#standardSQL
SELECT
TIMESTAMP_SECONDS(
UNIX_SECONDS(timestamp) -
MOD(UNIX_SECONDS(timestamp), 15 * 60)
) AS time,
SUM(viewCount) AS viewCount
FROM `project.dataset.table`
GROUP BY time;
It relies on conversion to and from Unix seconds in order to compute the 15-minute intervals. Note, however, that unlike Mikhail's solution below, it will not produce a zero-count row for an empty 15-minute interval (it's not clear whether this is important to you).
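To make the arithmetic concrete: a row stamped at 00:17:30 is 1,050 seconds past midnight, and 1050 MOD (15 * 60) = 150, so the query subtracts 150 seconds and the row lands in the 00:15:00 bucket.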
Below is for BigQuery Standard SQL
Note: you provided a simplified example of your data, and the query below follows it, so instead of 15-minute aggregation it uses 2-second aggregation; this lets you easily test and play with it. It can easily be adjusted to 15 minutes by changing SECOND to MINUTE in 3 places and 2 to 15 in 3 places.
Also, this example uses the TIME data type for the time field, as in your example, so it is limited to a 24-hour period. Most likely your real data has DATETIME or TIMESTAMP; in that case you will also need to replace all TIME_* functions with the respective DATETIME_* or TIMESTAMP_* functions.
So, finally - the query is:
#standardSQL
WITH `project.dataset.table` AS (
SELECT TIME '00:00:00' time, 31 viewCount UNION ALL
SELECT TIME '00:00:01', 44 UNION ALL
SELECT TIME '00:00:02', 78 UNION ALL
SELECT TIME '00:00:03', 71 UNION ALL
SELECT TIME '00:00:04', 72 UNION ALL
SELECT TIME '00:00:05', 73 UNION ALL
SELECT TIME '00:00:06', 64 UNION ALL
SELECT TIME '00:00:07', 70
),
period AS (
SELECT MIN(time) min_time, MAX(time) max_time, TIME_DIFF(MAX(time), MIN(time), SECOND) diff
FROM `project.dataset.table`
),
checkpoints AS (
SELECT TIME_ADD(min_time, INTERVAL step SECOND) start_time, TIME_ADD(min_time, INTERVAL step + 2 SECOND) end_time
FROM period, UNNEST(GENERATE_ARRAY(0, diff + 2, 2)) step
)
SELECT start_time time, SUM(viewCount) viewCount
FROM checkpoints c
JOIN `project.dataset.table` t
ON t.time >= c.start_time AND t.time < c.end_time
GROUP BY start_time
ORDER BY start_time, time
and result is:
Row time viewCount
1 00:00:00 75
2 00:00:02 149
3 00:00:04 145
4 00:00:06 134
I'm sure this is blatantly simple to many of you, but as a noob I am having an issue wrapping my brain around it.
I have a model called Season where the data looks like this (all of the field names are lower case; they are shown capitalized below for readability):
Record Sport SeasonNumber
1 Football 84
2 Baseball 76
3 Basketball 52
4 Hockey 26
5 Football 85
6 Baseball 77
7 Basketball 53
8 Hockey 27
Because of the way the data is added, I know the order of the Sport column is always going to be the same.
I have another model we'll call Coach that looks like this (only the relevant fields that I am going to ask about are shown for brevity.)
Record Sport Season CoachName + otherdata
1 Football 84 Joe Smith
2 Football 84 Bob Jones
3 Football 84 Alex Trebek
4 Football 84 Computer
5 Football 84 Computer
6 Football 84 Computer
7 Baseball 76 Hank Aaron
8 Baseball 76 Computer
9 Football 85 Joe Smith
10 Football 85 Bob Jones
11 Football 85 Computer
12 Football 85 Computer
13 Football 85 Computer
14 Football 85 Sam Spade
... etc.
What I want to do is, for the most recent "combo" of Sport/SeasonNumber (four sports, only the latest season of each), get a count of the CoachNames that are not "Computer".
So first I pull the last 4 records from the Season table
@seasons = Season.last(4)
Now I want to iterate through the @seasons array to get the count of human coaches during the latest season for each sport. If I only had a single "set" of data (e.g. "Football, 87"), I could do this:
Controller
# assuming @seasons held just a single Season record here
sp = @seasons.sport
sn = @seasons.seasonnumber
humancount = Coach.human(sp, sn)
Model
scope :people, -> { where.not(coachname: "Computer") }
def self.human(sp, sn)
  Coach.people.where("sport = ? AND season = ?", sp, sn).count
end
But what I want to get back is an array that is a count of "football humans", "baseball humans", "basketball humans", and "hockey humans". How do I pass the results/elements of @seasons to the method and get the array in return? I'm trying to avoid having to write one query specific for football, one for baseball, one for basketball, and one for hockey.
I think the answer lies with using .map but I'm just not grokking it. Or maybe I just need to loop through each "record" in the @seasons array?
num_human_football_coaches = Coach.where("coachname != ?", "Computer").where(sport: "Football").count
You should do it like this:
Coach.where("coachname != ?", "Computer").group(:sport).count
This returns a hash of counts keyed by sport, e.g. { "Football" => 6, "Baseball" => 1 } for the sample rows, which is essentially what you want (though note it does not restrict to the latest season of each sport).
I ended up just doing it by "brute force", pulling each sport's latest season number out of @seasons:
football_sn = @seasons.find { |s| s.sport == "Football" }.seasonnumber
baseball_sn = @seasons.find { |s| s.sport == "Baseball" }.seasonnumber
basketball_sn = @seasons.find { |s| s.sport == "Basketball" }.seasonnumber
footballhumancount = Coach.human("Football", football_sn)
baseballhumancount = Coach.human("Baseball", baseball_sn)
basketballhumancount = Coach.human("Basketball", basketball_sn)
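The .map version the question was reaching for is the same query wrapped in a block. A sketch, assuming the Season fields and the Coach.human method shown above:

# Builds e.g. { "Football" => 3, "Baseball" => 1, ... } for the
# latest season of each sport.
humancounts = @seasons.map { |s| [s.sport, Coach.human(s.sport, s.seasonnumber)] }.to_h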
I have a massive dataset and am preparing a dashboard based on this dataset.
On my dashboard, I have a drop-down menu that allows me to select a month of my choice, from Jan to Apr.
Visitor Jan Feb Mar Apr
Jenny 2 3 0 1
Peter 2 0 1 3
Charley 0 2 4
Charley 1 2 2 3
Sam 1 4 2 3
Peter 2 2 5 0
John 3 3 6 9
Robin 4 0 7 0
I am looking for a formula that will give me the number of unique visitors who have been active at least once in the month that I choose from the drop-down menu.
Hoping this is really clear, but if not, please feel free to shoot back your questions.
This may be easier with Excel 2013, but if the results you want from your example are 6, 5, 5, and 5 for Jan to Apr respectively, then perhaps:
Create a PivotTable from multiple consolidation ranges (an example of how is here), and for VALUES choose Sum of Value.
Then count the non-zero values in the PivotTable by column, with a formula such as:
=COUNTIF(H5:H10,">"&0)
The above, however, would not be convenient to repeat each month, though a whole year might be prepared at one time.
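Alternatively, a single array formula can count the unique visitors with a non-zero value in the chosen month. A sketch, assuming visitor names in A2:A9, the month columns in B2:E9 with headers in B1:E1, and the drop-down cell in G1:

=SUM(IF(FREQUENCY(IF(INDEX($B$2:$E$9,0,MATCH($G$1,$B$1:$E$1,0))>0,MATCH($A$2:$A$9,$A$2:$A$9,0)),ROW($A$2:$A$9)-ROW($A$2)+1),1))

confirmed with Ctrl+Shift+Enter. INDEX/MATCH picks the selected month's column, the inner MATCH maps each row to its visitor's first occurrence, and FREQUENCY counts each qualifying visitor once; for the sample data this gives 6 for Jan and 5 for Feb, Mar, and Apr.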
I need to compute a mean with the AGGREGATE function, by ID and year, with a condition. It should be simple, but I couldn't make it work.
An example:
ID year result
1 2011 50
1 2012 68
1 2012 45
2 2011 12
2 2011 80
2 2012 20
but I don't understand where to put the condition:
AGGREGATE
/OUTFILE='test'
/BREAK=CUSTOMER_ID CUSTOMERIDCD year
/test_mean_over60=MEAN(result) /* IF result > 60 ?? */
/N_BREAK=N.
You can't do conditional statements in AGGREGATE. One way to accomplish your end goal though is to use TEMPORARY. and SELECT IF before the aggregate. Example below:
DATA LIST FREE / Id year result.
BEGIN DATA
1 2011 50
1 2012 68
1 2012 45
2 2011 12
2 2011 80
2 2012 20
END DATA.
DATASET DECLARE test.
TEMPORARY.
SELECT IF result > 60.
AGGREGATE OUTFILE='test'
/BREAK = ID year
/test_mean_over60 = MEAN(result)
/N_BREAK=N.
EXECUTE.
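If you would rather keep N_BREAK as the count of all rows per break group, an alternative to TEMPORARY and SELECT IF is a conditional copy of result that is system-missing when the condition fails, since MEAN in AGGREGATE ignores missing values. A sketch (result_gt60 is an illustrative name):

* Conditional copy: system-missing unless result > 60.
IF (result > 60) result_gt60 = result.
DATASET DECLARE test2.
AGGREGATE OUTFILE='test2'
/BREAK = Id year
/test_mean_over60 = MEAN(result_gt60)
/N_BREAK = N.
EXECUTE.

Here break groups with no result above 60 get a system-missing mean instead of being dropped.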