Ruby on Rails: Query using array results

I'm sure this is blatantly simple to many of you, but as a noob I am having an issue wrapping my brain around it.
I have a model called Season where the data looks like this (the field names are all lowercase; they are shown below with caps for readability):
Record Sport SeasonNumber
1 Football 84
2 Baseball 76
3 Basketball 52
4 Hockey 26
5 Football 85
6 Baseball 77
7 Basketball 53
8 Hockey 27
Because of the way the data is added, I know the order of the Sport column is always going to be the same.
I have another model we'll call Coach that looks like this (only the relevant fields that I am going to ask about are shown for brevity.)
Record Sport Season CoachName + otherdata
1 Football 84 Joe Smith
2 Football 84 Bob Jones
3 Football 84 Alex Trebek
4 Football 84 Computer
5 Football 84 Computer
6 Football 84 Computer
7 Baseball 76 Hank Aaron
8 Baseball 76 Computer
9 Football 85 Joe Smith
10 Football 85 Bob Jones
11 Football 85 Computer
12 Football 85 Computer
13 Football 85 Computer
14 Football 85 Sam Spade
... etc.
What I am wanting to do is, for the most recent "combo" of Sport/SeasonNumber [Four sports, only the latest season of each], get a count of the number of CoachNames that are not "Computer".
So first I pull the last 4 records from the Season table
@seasons = Season.last(4)
Now I want to iterate through the @seasons array to get the count of human coaches during the latest season for each sport. If I only had a single "set" of data (e.g. "Football, 87"), I could do this:
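Controller
# assumes @seasons holds a single Season record here
sp = @seasons.sport
sn = @seasons.seasonnumber
humancount = Coach.human(sp, sn)
Model
scope :people, -> { where.not(coachname: "Computer") }
def self.human(sp, sn)
  # Filter first, then count: .count returns an Integer, so .where cannot be chained after it
  people.where(sport: sp, season: sn).count
end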
But what I want to get back is an array that is a count of "football humans", "baseball humans", "basketball humans", and "hockey humans". How do I pass the results/elements of @seasons to the method and get the array in return? I'm trying to avoid having to write one query specific for football, one for baseball, one for basketball, and one for hockey.
I think the answer lies with using .map but I'm just not grokking it. Or maybe I just need to loop through each "record" in the @seasons array?

num_human_football_coaches = Coach.where("coachname != ?", "Computer").where(sport: "Football").count

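You can do it with a single grouped query:
Coach.where("coachname != ?", "Computer").group(:sport).count
This returns a hash that maps each sport to its count of human coaches. Note that it counts across all seasons, so add a condition on season if you only want the latest one.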
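The .map idea raised in the question can be sketched in plain Ruby. This is only an illustration under assumptions: the Structs and constants below are stand-ins for the Season and Coach models (no Rails or database involved), the rows are a reduced copy of the question's sample data, and human_count and human_counts are hypothetical helper names. SEASONS.last(2) plays the role of Season.last(4) on this smaller data set.

```ruby
# Plain-Ruby stand-ins for the two tables (no Rails, no database).
# Rows are a reduced copy of the sample data in the question.
Season = Struct.new(:sport, :seasonnumber)
Coach  = Struct.new(:sport, :season, :coachname)

SEASONS = [
  Season.new("Football", 84),
  Season.new("Baseball", 76),
  Season.new("Football", 85)
].freeze

COACHES = [
  Coach.new("Football", 84, "Joe Smith"),
  Coach.new("Football", 84, "Computer"),
  Coach.new("Baseball", 76, "Hank Aaron"),
  Coach.new("Baseball", 76, "Computer"),
  Coach.new("Football", 85, "Joe Smith"),
  Coach.new("Football", 85, "Bob Jones"),
  Coach.new("Football", 85, "Computer"),
  Coach.new("Football", 85, "Sam Spade")
].freeze

# Counterpart of Coach.human(sp, sn): non-"Computer" coaches for one
# sport/season pair.
def human_count(coaches, sport, season)
  coaches.count do |c|
    c.sport == sport && c.season == season && c.coachname != "Computer"
  end
end

# One count per season record, keyed by sport. With ActiveRecord the block
# body would presumably call Coach.human(s.sport, s.seasonnumber) instead.
def human_counts(seasons, coaches)
  seasons.to_h { |s| [s.sport, human_count(coaches, s.sport, s.seasonnumber)] }
end

p human_counts(SEASONS.last(2), COACHES)
```

Returning a hash keyed by sport rather than a bare array keeps each count attached to its sport, so the result does not depend on the row order of the Season table.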

I ended up just doing it by "brute force".
sn=@seasons.seasonnumber
footballhumancount=Coach.human("football",sn)
baseballhumancount=Coach.human("baseball",sn)
basketballhumancount=Coach.human("basketball",sn)

Related

Create calculated field with condition from another column

I have data that shows student name and the years they attended the school along with the grades they achieved each year.
I need to visualize this in a horizontal bar graph with names on the y-axis and grades on the x-axis
My dataset looks somewhat like this:
Name Year Grade
John 2016 79
John 2017 65
Smith 2018 87
Smith 2019 56
Mary 2017 92
Jack 2016 95
Jack 2017 75
I want a dropdown/parameter based on the year that filters my data so that it shows the names and grades of the selected year only.
So if I were to select 2017, I want the data to look like this:
Name Grade
John 65
Mary 92
Jack 75
I've tried something like this with no luck in the 'create calculated field' dialog box:
If [Parameters].[By Year] == "2017"
THEN
[Name] = WHEN [year] = "2017"
END
Your calculated field (used as the filter) should look like this:
IF [Year] = [Parameters].[By Year]
THEN [Grade]
ELSE NULL
END
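To see what this calculated field does row by row, here is a small plain-Ruby sketch of the same logic outside Tableau. It assumes the year parameter and the Year column hold the same type (integers here), and grade_for_year and grades_in are hypothetical helper names.

```ruby
# Rows from the sample dataset in the question.
GRADES = [
  { name: "John",  year: 2016, grade: 79 },
  { name: "John",  year: 2017, grade: 65 },
  { name: "Smith", year: 2018, grade: 87 },
  { name: "Smith", year: 2019, grade: 56 },
  { name: "Mary",  year: 2017, grade: 92 },
  { name: "Jack",  year: 2016, grade: 95 },
  { name: "Jack",  year: 2017, grade: 75 }
].freeze

# Mirrors IF [Year] = [By Year] THEN [Grade] ELSE NULL END:
# the grade survives only on rows whose year equals the parameter.
def grade_for_year(row, selected_year)
  row[:year] == selected_year ? row[:grade] : nil
end

# Dropping the nil rows corresponds to excluding NULLs from the view.
def grades_in(rows, selected_year)
  rows.filter_map do |row|
    grade = grade_for_year(row, selected_year)
    [row[:name], grade] if grade
  end
end

p grades_in(GRADES, 2017)
```

For 2017 this leaves John, Mary and Jack, which matches the table the question asks for.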

calculating attrition rate over time per different attributes in Tableau

I need help with a project I am working on in Tableau. I have a CSV file which I can load in Tableau. I need to calculate the attrition rate for different attributes such as gender, age, etc. Could someone please help me with that? I have been trying for hours without any success.
Below is a sample of what the dataset looks like
Employee ID  date hired  termination date  age  gender  length of service  status      job title
12           02/21/2018  04/29/2022        38   F       4                  Terminated  auditor
17           08/28/1989  01/01/2023        52   M       32                 Active      CEO
41           04/21/2013  10/21/2014        21   M       1                  Terminated  Cashier
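One possible reading of the calculation, sketched in plain Ruby rather than Tableau: it assumes attrition rate means terminated employees divided by all employees in the group, which the question does not define, and it ignores the "over time" dimension. attrition_by is a hypothetical helper name, and only the columns needed for the sketch are kept.

```ruby
# The three sample rows from the question (subset of columns).
EMPLOYEES = [
  { id: 12, gender: "F", age: 38, status: "Terminated" },
  { id: 17, gender: "M", age: 52, status: "Active" },
  { id: 41, gender: "M", age: 21, status: "Terminated" }
].freeze

# Assumed definition: terminated employees / all employees in the group.
def attrition_by(rows, attribute)
  rows.group_by { |r| r[attribute] }.transform_values do |group|
    group.count { |r| r[:status] == "Terminated" }.fdiv(group.size)
  end
end

p attrition_by(EMPLOYEES, :gender)
```

The same helper works for any other column, for example attrition_by(EMPLOYEES, :age), although a continuous attribute like age would normally be binned first.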

Development of a feature per row or from today's date

I have a problem. When an order comes in, I want to predict in how many days the customer will place the next one.
I have already created my target variable next_purchase_in_days. This specifies in how many days the customer will place an order again. And I would like to predict this.
Since I have too few features, I want to do feature engineering. I would like to specify how many orders the customer has placed in the last 90 days. For example, I have calculated back from today's date how many orders the customer has placed in the last 90 days.
Is it better to say per row how many orders the customer has placed? Please see below for the example.
So does it make more sense to calculate this from today's date and include it as a feature or should it be recalculated for each row?
customerId fromDate next_purchase_in_days
0 1 2021-02-22 24
1 1 2021-03-18 4
2 1 2021-03-22 109
3 1 2021-02-10 12
4 1 2021-09-07 133
8 3 2022-05-17 61
10 3 2021-02-22 133
11 3 2021-02-22 133
Example
# What I have
customerId fromDate next_purchase_in_days purchase_in_last_90_days
0 1 2021-02-22 24 0
1 1 2021-03-18 4 0
2 1 2021-03-22 109 0
3 1 2021-02-10 12 0
4 1 2021-09-07 133 0
8 3 2022-05-17 61 1
10 3 2021-02-22 133 1
11 3 2021-02-22 133 1
# Or does this make more sense?
customerId fromDate next_purchase_in_days purchase_in_last_90_days
0 1 2021-02-22 24 1
1 1 2021-03-18 4 2
2 1 2021-03-22 109 3
3 1 2021-02-10 12 0
4 1 2021-09-07 133 0
8 3 2022-05-17 61 1
10 3 2021-02-22 133 0
11 3 2021-02-22 133 0
You can address this in a number of ways, but something interesting to consider is the interaction between Date & Customer ID.
Dates have meaning to humans beyond just timekeeping. They carry emotional and cultural importance: holidays, weekends, seasons, anniversaries, etc. So there is a conditional relationship between the probability of a purchase and events: P(x|E)
Customer Ids theoretically represent a single person, or at the very least a single business with a limited number of people responsible for purchasing.
Certain people/corporations are just more likely to spend.
So here are a number of ways to address this:
Find a list of holidays relevant to the users. For instance, if they are US based, find a list of US recognized holidays. Then create a feature based on each date: Date_Till_Next_Holiday (or DTNH for short).
Dates also have cyclical aspects that can encode probability: day of the year (1-365), day of the week (1-7), week number (1-52), month (1-12), quarter (1-4). I would create additional columns encoding each of these.
To address the customer interaction, keep a running total of past purchases. You could call it Purchases_to_date; it would be an integer (0...n) where n is the number of previous purchases. I made a notebook to show you how to do running totals.
Humans tend to share purchasing patterns with other humans. You could run a k-means clustering algorithm that splits customers into 3-4 groups based on all the previous info, and then use their cluster number as a feature. Sklearn-Kmeans
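Two of the suggestions above, the cyclical calendar columns and the Purchases_to_date running total, can be sketched in plain Ruby with only the standard library. The data is customer 1 from the question's table; calendar_features and with_running_totals are hypothetical helper names, and the week number uses the ISO 8601 convention, which is an assumption.

```ruby
require "date"

# Customer 1's order dates from the question's sample table.
ORDERS = [
  { customer_id: 1, from_date: Date.new(2021, 2, 22) },
  { customer_id: 1, from_date: Date.new(2021, 3, 18) },
  { customer_id: 1, from_date: Date.new(2021, 3, 22) },
  { customer_id: 1, from_date: Date.new(2021, 2, 10) },
  { customer_id: 1, from_date: Date.new(2021, 9, 7) }
].freeze

# Cyclical calendar features derived from a single date.
def calendar_features(date)
  {
    day_of_year: date.yday,            # 1..366
    day_of_week: date.cwday,           # 1 (Monday)..7 (Sunday)
    week_number: date.cweek,           # ISO 8601 week, 1..53
    month: date.month,                 # 1..12
    quarter: (date.month - 1) / 3 + 1  # 1..4
  }
end

# Adds purchases_to_date (orders the same customer placed strictly before
# this row's date) plus the calendar features to every row.
def with_running_totals(orders)
  orders.map do |order|
    earlier = orders.count do |other|
      other[:customer_id] == order[:customer_id] &&
        other[:from_date] < order[:from_date]
    end
    order.merge(purchases_to_date: earlier)
         .merge(calendar_features(order[:from_date]))
  end
end

with_running_totals(ORDERS).each { |row| p row }
```

Counting only strictly earlier orders means each row sees just what was known when that order arrived.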
So based on all that you could engineer 8 different columns. I would then run Principal Component Analysis (PCA) to reduce that to 3-4 features.
You can use Sklearn-PCA to do PCA.
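Coming back to the question's two candidate tables: a value counted back from today is identical for every row of a customer and can include orders placed after the row's own date, whereas a per-row count only uses orders that existed when that order came in. A minimal plain-Ruby sketch of the per-row variant follows, using customer 1's dates from the question. orders_in_window and per_row_counts are hypothetical names, and excluding same-day orders is an assumption.

```ruby
require "date"

# Customer 1's order dates, in the row order used in the question.
DATE_STRINGS = %w[2021-02-22 2021-03-18 2021-03-22 2021-02-10 2021-09-07].freeze
ORDER_DATES = DATE_STRINGS.map { |s| Date.iso8601(s) }.freeze

# Number of orders placed in the `window` days before `date`,
# not counting orders on `date` itself.
def orders_in_window(dates, date, window = 90)
  dates.count { |d| d < date && (date - d) <= window }
end

# One count per row, each relative to that row's own date.
def per_row_counts(dates, window = 90)
  dates.map { |d| orders_in_window(dates, d, window) }
end

p per_row_counts(ORDER_DATES)
```

For customer 1 this gives 1, 2, 3, 0, 0, the same values as the second table in the question.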

List unique dates and add line at the beginning of a new month

I have a long (multiple thousand lines and growing) list of data in Sheets which has a date and additional columns with data. Here's a simplified example of this list (=TAB1):
Date Number Product-ID
02.09.2021 123 1
02.09.2021 2 1
01.09.2021 15 1
01.09.2021 675 2
01.09.2021 45 2
01.09.2021 52 1
31.08.2021 2 1
31.08.2021 78 1
31.08.2021 44 1
31.08.2021 964 2
30.08.2021 1 2
29.08.2021 ...
...
Three remarks:
The date is formatted to European standard DD.MM.YYYY
There definitely is more than one line per day per product (could be a big number depending on the day)
(for the formulas below) In the European standard Sheets uses ; instead of , as in =IF(A;B;C)
In a different tab (=TAB2), I want to add up all the numbers for a unique date for Product-ID 1. So far I've done it like this:
Date Sum (if Product-ID=1)
=UNIQUE('TAB1'!A2:A) =ARRAYFORMULA(SUMIF('TAB1'!A:A&'TAB1'!C:C;A2:A&"1";'TAB1'!B:B))
02.09.2021 125
01.09.2021 67
31.08.2021 124
30.08.2021 1
29.08.2021 ...
...
This works fine so far. Here's what I want to do now:
For every month (here: August and September 2021) I need an additional line above the current date (in this case: above 02.09.2021) AND above each completed month that sums column B over the whole month. Here's what it should look like:
Date Sum (if Product-ID=1)
September 2021 192
02.09.2021 125
01.09.2021 67
August 2021 125
31.08.2021 124
30.08.2021 1
29.08.2021 ...
Of course, the line for the next day (03.09.2021) should be added above 02.09.2021 and below the sum for the month when it's automatically added to TAB1 on the next day.
I tried to play around with something like =IF(DAY(UNIQUE('TAB1'!A2:A))=1;...;...) but didn't get far.
Does anyone have an idea how to realize something like this?
You want to learn about QUERY().
Put this in cell A1 of an empty tab:
=QUERY('TAB1'!A2:C,"select A,SUM(B) where C = 1 group by A")
It makes a very big difference whether your product IDs are text or numbers. The formula above was written as if they are numbers, but you might have just been simplifying. If they are text, you would write it like this:
=QUERY('TAB1'!A2:C,"select A,SUM(B) where C = '1XYZ' group by A")
Note the single quotes.
If the IDs are a mix of text and numbers, then you need to force them all to text values in the original data by highlighting the IDs column and choosing Format>Number>Plain Text from the menu bar.
UPDATE:
I understand the requirements better now for intermixing a cumulative month total into the output. This may work.
=ARRAYFORMULA(QUERY({QUERY({EOMONTH('TAB1'!A2:A,0),'TAB1'!B2:C},"select 'Total',Col1,SUM(Col2) where Col3 = 1 group by 'Total',Col1 label 'Total''',SUM(Col2)''",0);QUERY('TAB1'!A2:C,"select '',A,SUM(B) where C = 1 group by '',A label '''',SUM(B)''",0)},"order by Col2,Col1",0))
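The layout being asked for (a month total line followed by that month's daily sums, newest first) can be spelled out in plain Ruby to make the intended grouping explicit. This is a sketch of the logic only, not a Sheets formula. It uses the first ten rows of the TAB1 example, so the August total here covers 31.08.2021 alone, and monthly_report is a hypothetical helper name.

```ruby
require "date"

# [date, number, product_id] rows taken from the top of TAB1 in the question.
TAB1 = [
  ["02.09.2021", 123, 1], ["02.09.2021", 2, 1],
  ["01.09.2021", 15, 1],  ["01.09.2021", 675, 2],
  ["01.09.2021", 45, 2],  ["01.09.2021", 52, 1],
  ["31.08.2021", 2, 1],   ["31.08.2021", 78, 1],
  ["31.08.2021", 44, 1],  ["31.08.2021", 964, 2]
].freeze

# Returns [label, sum] lines: newest month first, each month's total
# followed by its daily totals, newest day first.
def monthly_report(rows, product_id)
  daily = Hash.new(0)
  rows.each do |date_text, number, pid|
    daily[Date.strptime(date_text, "%d.%m.%Y")] += number if pid == product_id
  end

  months = daily.group_by { |date, _| [date.year, date.month] }
  months.sort.reverse.flat_map do |_, days|
    total = days.sum { |_, sum| sum }
    header = [days.first.first.strftime("%B %Y"), total]
    [header] + days.sort.reverse.map { |date, sum| [date.strftime("%d.%m.%Y"), sum] }
  end
end

monthly_report(TAB1, 1).each { |label, sum| puts "#{label} #{sum}" }
```

Grouping by year and month, rather than by month alone, keeps the same month in different years apart once the list spans more than one year.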

Calculate Average For Specific Columns

I have the following columns:
Brand Reviews Rating
Test 25 4.5
Test 26 3.5
Test 41 4.3
Test2 20 2.1
Test2 15 4
Test3 29 5
Test3 22 3.3
I would like to calculate the total amount of reviews just for the brand TEST. How can I proceed?
Try AVERAGEIF for the average, or SUMIF for the total (thanks to @ScottCraner):
=AVERAGEIF(A:A,"Test",B:B)
=SUMIF(A:A,"Test",B:B)
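For readers who want to check the numbers, the two formulas correspond to the following plain-Ruby sketch over the sample rows. sum_if and average_if are hypothetical helper names, and the comparison here is an exact string match.

```ruby
# [brand, reviews, rating] rows from the question.
ROWS = [
  ["Test", 25, 4.5], ["Test", 26, 3.5], ["Test", 41, 4.3],
  ["Test2", 20, 2.1], ["Test2", 15, 4.0],
  ["Test3", 29, 5.0], ["Test3", 22, 3.3]
].freeze

# SUMIF(A:A,"Test",B:B): total of column B where column A matches.
def sum_if(rows, brand)
  rows.select { |b, _, _| b == brand }.sum { |_, reviews, _| reviews }
end

# AVERAGEIF(A:A,"Test",B:B): mean of column B where column A matches.
def average_if(rows, brand)
  matches = rows.select { |b, _, _| b == brand }
  matches.sum { |_, reviews, _| reviews }.fdiv(matches.size)
end

puts sum_if(ROWS, "Test")
puts average_if(ROWS, "Test")
```

The total for "Test" is 25 + 26 + 41 = 92 reviews, and the average is that total divided by its three rows.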
