How to Calculate Word Frequency by Date in Dataframe - twitter

So I have a DataFrame of thousands of tweets that I was able to clean and tokenize.
- Each row is an individual tweet
- Column 1 is the day it was tweeted (example: 7_23_2020)
- Column 2 is the tokenized bag of words for that tweet
Goal: How can I tabulate the word frequencies for each day and then graph them?
For example: what are the frequencies of the top 10 words from 7_23_2020 to 7_30_2020, showing how often each word was used on each day?
Originally I needed to calculate the PMI (Pointwise Mutual Information) between the dates and the most common words, but I think I am settling for word frequencies over time for now.
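One possible way to tabulate this is sketched below, using only the Python standard library. The sample rows, the function names and the `%m_%d_%Y` date format are assumptions for illustration; the `(date, tokens)` pairs stand in for the two DataFrame columns described in the question.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical sample rows: (date string, tokenized tweet), mirroring the two columns.
rows = [
    ("7_23_2020", ["covid", "mask", "news"]),
    ("7_23_2020", ["mask", "school"]),
    ("7_24_2020", ["covid", "covid", "covid", "mask"]),
]

def daily_word_counts(rows):
    """Return {date: Counter of words} built from (date, tokens) rows."""
    counts = defaultdict(Counter)
    for date, tokens in rows:
        counts[date].update(tokens)
    return counts

def top_words(counts, n):
    """Return the n most frequent words summed over all dates."""
    total = Counter()
    for counter in counts.values():
        total.update(counter)
    return [word for word, _ in total.most_common(n)]

counts = daily_word_counts(rows)
top = top_words(counts, 2)

# Sort dates chronologically; plain string sorting would misorder e.g. 7_9 and 7_10.
dates = sorted(counts, key=lambda d: datetime.strptime(d, "%m_%d_%Y"))

# One entry per day, one value per top word; a word absent that day counts as 0.
table = {date: [counts[date][word] for word in top] for date in dates}
```

Each entry of `table` is one day's counts for the top words, in the same order as `top`, so it can be handed to whatever plotting tool is preferred as one line or bar group per word.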

Related

Complex Google Sheets Puzzle - Help appreciated - Link to sheet included

I have a finance sheet that tracks the following in different columns:
(A) Amount Already Built Into Budget for the Year [Purple]
(B) Amount Spent Year-to-Date [Red]
(C,E,G,I) Q1-Q4 Reimbursement amounts [Green]
(D,F,H,J) Q1-Q4 Hidden columns to be used to help create this function and to tick on and off based on reimbursement amount input [Gold]
(K) Reimbursement Remaining [Blue]
The amount already built into the budget needs to be divided by 4 and to show up on each quarter as reimbursed. That amount will be entered into each quarter by default with a code that divides Column A by 4. The user will replace that value each quarter by adding the column K value to the value for that quarter.
Each quarter, the user should be able to add the value in column K to the appropriate quarter and end up with zero balance in column K.
The Amount Spent Column will update monthly and include:
Expenses built into the budget to-date
Additional expenses to-date
The goal of this sheet is to allow someone to input how much was actually reimbursed in the Q1-Q4 Reimbursement amounts [Green] and to provide a tool for that person to know how much needs to be reimbursed at any given time in the Reimbursement Remaining [Blue] column.
Column K needs to still be able to function if all expenses appear in the actuals Q4--meaning, Column K will need to be zero for Q1-Q3 and only show a balance if the sum of the actuals recorded exceeds what was built into the budget.
Wow that was hard to write out.
What is a formula that could go in Column K to make this work?
I hope this makes sense to someone!
-Alfred
try:
=if(C2+E2+G2+I2=A2+B2,0,
if(AND(D2+F2+H2+J2=0,D2=0),B2-C2,
if(AND(D2+F2+H2+J2=1,D2=1),B2-C2,
if(AND(D2+F2+H2+J2=1,F2=1),B2-C2-E2,
if(AND(D2+F2+H2+J2=1,H2=1),B2-C2-E2-G2,
if(AND(D2+F2+H2+J2=1,J2=1),B2-C2-E2-G2-I2, 0)
)))))

Count string occurrence IF 2 conditions are met

I have this data set (example)
It shows different responses for different towns and which industries the respondent thinks have job openings (e.g. Akron has 2 rows (2 respondents), with different industries given)
Now I want to chart/gather from that data set, the count of each industry for each town: So using this table setup, I want a formula that shows, for example, how many times "restaurant" was listed for Akron...
So here's what I want a formula to give me, in cell B63, ("Akron" row, "restaurant" column) for ex. This is paraphrasing in plain English what I want it to do. I have tried countless variations of COUNTIFS, IFS, MATCH, SUM, INDIRECT, LOOKUPS... etc and have not been able to get the numbers that reflect the data given.
=count if ("restaurant" appears in the range B53:D60 AND if those occurrences are in rows starting with "Akron" (in the range A53:A60))
The main hang-up, obviously, is that these 2 criteria cover 2 differently sized ranges (not something COUNTIFS likes...). So how can I get around that barrier?
A final note: the imgs/ex given are small representations of a much larger range that I'm actually working on... so yes, I could make a nifty formula for each town that has just the town rows as my range...(COUNTIF likes this approach!) but I've got many more towns of various row counts... it takes too long to make a town specific range or diff formula for each town... I prefer 1 formula that looks through all the town rows/range and all the "Industries" range...
=COUNTIF(FILTER(DATA_RANGE,TOWN_NAME="TARGET_TOWN_NAME"),"=TARGET_INDUSTRY")
Example use
=COUNTIF(FILTER(A2:C,A2:A="A"),"=1")
Counts the responses of 1 under town A.
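The logic of the COUNTIF(FILTER(...)) formula above can be sketched in Python: first keep only the rows for the target town, then count matching cells across the remaining columns. The sample rows and the function name `count_for_town` are hypothetical, and this sketch compares strings exactly.

```python
# Hypothetical survey rows: town first, then up to three industries per respondent.
rows = [
    ["Akron", "restaurant", "retail", ""],
    ["Akron", "restaurant", "", ""],
    ["Canton", "retail", "restaurant", "health"],
]

def count_for_town(rows, town, industry):
    """Count cells equal to `industry` in rows whose first cell equals `town`."""
    return sum(
        cell == industry
        for row in rows if row[0] == town   # the FILTER step
        for cell in row[1:]                 # the COUNTIF step over the industry columns
    )

akron_restaurants = count_for_town(rows, "Akron", "restaurant")
```

Because the town filter runs first, the two criteria no longer need ranges of the same size.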

Machine learning model with varying input shape as time changes

I am trying to predict the bookings of a stand-up comedian cafe. There are a lot of features I can use that have an effect on the number of sales (e.g. day of the year, weather, average sales last month, day of the week, average sales on that specific day of the week, etc.).
However, one of the features that correlates most with the actual number of sales is the number of tickets already sold before the deadline. Customers are able to start making reservations 120 hours (5 days) before the actual ordering deadline (11:00 AM on the day of the show).
I would prefer to use this data as input for my machine learning algorithm. Currently I have created 120 columns in the dataframe. The columns cover the 120 hours before the deadline, up to the deadline itself. Column "hour_98" therefore shows the accumulated sales about 4 days (98 hours) before the deadline. Column "hour_24" shows the accumulated sales 24 hours before the deadline, etc.
If I now want to predict the sales 24 hours before the deadline, the columns "hour_24" through "hour_0" all contain "NaN" values. Since algorithms can't deal with NaN values, I currently give these columns a value of 0. However, I think this is too simplistic and will result in a bad prediction model.
How do we deal with a changing input shape since we obtain more data if we get closer to the deadline of ordering?
Now from what I understand, you have a fixed number of columns, each representing the data from a predefined hour before the deadline. So in a sense the input data shape never changes, only the validity of some input features changes.
Provided you have a fixed input shape with changing validity of the features (NaNs), you can get around that issue by using a mask for each input feature.
For example, a valid hour_24 can be represented as hour_24 = 20 and mask_24 = 1, and an invalid hour_24 can be represented as hour_24 = 0 (or whatever) and mask_24 = 0.
The algorithm itself will need to learn when to ignore a given feature based on that feature's mask.
This answer explains in more detail how to mask input.
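A minimal sketch of building such mask features, assuming each sample is a plain dict keyed by the `hour_*` column names from the question. The helper name `add_masks` and the sample values are hypothetical.

```python
import math

def add_masks(row, fill=0.0):
    """Return a copy of `row` where each missing hour_* value is replaced by
    `fill` and a matching mask_* feature records whether it was observed."""
    out = dict(row)
    for name, value in row.items():
        if not name.startswith("hour_"):
            continue
        missing = value is None or (isinstance(value, float) and math.isnan(value))
        suffix = name[len("hour_"):]
        out[name] = fill if missing else value
        out["mask_" + suffix] = 0 if missing else 1
    return out

# Hypothetical sample taken 24 hours before the deadline: later hours are still unknown.
row = {"hour_26": 18, "hour_25": 20, "hour_24": float("nan"), "hour_0": float("nan")}
features = add_masks(row)
```

The input width stays fixed (every `hour_*` column gains one `mask_*` column), and the fill value no longer has to double as a signal for "not observed yet".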

Google Sheets: Dense Ranking from sorted values

I have a simple table with 3 columns:
[Name] [Score] [Rank]
For the 3rd column, I'm using the following formula to rank each row according to the score:
=RANK(C9,$C$9:$C$28,0)
The problem is that the formula isn't returning the values I'd expect. For example on the last row it returns 19 when it should be 5.
I found other formulas for ranking (RANK.EQ, etc.) but same issue happens.
Here is the Google Sheet to see it in context:
https://docs.google.com/spreadsheets/d/1P1m7UHPPIcQLQkzpnk-SI1y7-0mhKytCWDjA6FJzFrM/edit?usp=sharing
Any guidance appreciated
The results you want can be achieved with a simple MATCH formula:
=match(round(C9,0),NamedRange1,0)
Provided an array (named NamedRange1 for above) is created, say with:
=sort(unique(round(C9:C28,0)),1,0)
I think the result is as intended. Check this Ranking Wikipedia page (called 'standard competition ranking'). It says:
Standard competition ranking ("1224" ranking)
In competition ranking, items that compare equal receive the same
ranking number, and then a gap is left in the ranking numbers. The
number of ranking numbers that are left out in this gap is one less
than the number of items that compared equal. Equivalently, each
item's ranking number is 1 plus the number of items ranked above it.
This ranking strategy is frequently adopted for competitions, as it
means that if two (or more) competitors tie for a position in the
ranking, the position of all those ranked below them is unaffected
(i.e., a competitor only comes second if exactly one person scores
better than them, third if exactly two people score better than them,
fourth if exactly three people score better than them, etc.).
Thus if A ranks ahead of B and C (which compare equal) which are both
ranked ahead of D, then A gets ranking number 1 ("first"), B gets
ranking number 2 ("joint second"), C also gets ranking number 2
("joint second") and D gets ranking number 4 ("fourth").
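The "1 plus the number of items ranked above it" definition quoted above can be written directly in Python. This is only an illustrative sketch with a hypothetical function name and made-up scores, ranking the highest score first.

```python
def competition_rank(scores):
    """Standard competition ("1224") ranking, highest score first:
    each rank is 1 plus the number of strictly greater scores."""
    return [1 + sum(other > s for other in scores) for s in scores]

ranks = competition_rank([90, 80, 80, 70])  # [1, 2, 2, 4]
```

The two tied scores share rank 2 and rank 3 is skipped, which is the gap the question is running into.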
What you want is 'dense ranking' and it can be achieved by pnuts's answer or something like this:
set G9 to 1
set G10 to =if(round(C10,0)<round(C9,0), G9+1, G9)
copy G10 and paste it into G11:G28
Sample sheet is here.
Thanks to @pnuts and @sangboklee for your solutions. I think I have a good solution now. It is pnuts's solution, just simplified:
=match(round($C9,0),sort(unique(round($C$9:$C$28,0)),1,false),0)
This essentially "embeds" the created array within a single formula, that can be applied to all rows. And as a bonus, the values don't even have to be sorted.
Please check for correctness folks, but I think this works. I've updated the linked Google Sheet from the original question description (it's "Solution 2b").
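For checking the result outside the sheet, the same idea (round, take the unique values, sort descending, then look up each value's position) can be sketched in Python. The function name and scores are hypothetical. Note that Python's `round` rounds exact halves to the nearest even number, which may differ from the sheet's ROUND on values ending in .5.

```python
def dense_rank(scores):
    """Dense ("1223") ranking on scores rounded to whole numbers, highest first."""
    rounded = [round(s) for s in scores]
    # Equivalent of sort(unique(round(...)), 1, false): distinct values, descending.
    ordered = sorted(set(rounded), reverse=True)
    # Equivalent of MATCH(..., 0): 1-based position of each value in that list.
    position = {value: i + 1 for i, value in enumerate(ordered)}
    return [position[r] for r in rounded]

ranks = dense_rank([90.2, 80.4, 79.6, 70.0])  # [1, 2, 2, 3]
```

As with the formula, the input does not need to be sorted, because each row's rank comes from a lookup rather than from the row above it.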

Average wins per player

I'm not really good with Java, even less with Sheets, and I need help with this:
I want to create a list of each player's average wins from a list of games that includes several other players:
Example (I want to get the average on the right):
Conceptually this would be "for each player, check whether the player matches and whether he won (ratio 1:1), then continue until there are no more games (or the end of the array)".
It's for a team game and we use Google Sheets a lot for it; I wanted some stats too.
JavaScript != Java.
Additionally, there's no JavaScript involved here if you're just using Sheets.
=AVERAGE(COUNTIF(A2:A7, "Win")/COUNTA(A2:A7))
Steps for understanding:
COUNTIF all cells in a range containing the text "Win".
COUNTA all cells in the same range, regardless of what they contain.
Divide the first count by the second to get the win rate. The outer AVERAGE receives only that single number, so it returns it unchanged and can be omitted.
A2:A7 is just an example and should be replaced with whatever range your RESULT column takes up.
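The formula above covers one player's results column. For several players in one list, the same wins-divided-by-games calculation can be sketched in Python as follows; the `(player, result)` layout, the names and the function `win_rates` are assumptions, since the original example image is not available.

```python
from collections import defaultdict

def win_rates(games):
    """Return {player: wins / games played} from (player, result) pairs."""
    wins = defaultdict(int)
    played = defaultdict(int)
    for player, result in games:
        played[player] += 1          # COUNTA: every recorded game counts
        if result == "Win":
            wins[player] += 1        # COUNTIF: only games marked "Win"
    return {player: wins[player] / played[player] for player in played}

# Hypothetical game log.
games = [("Ann", "Win"), ("Bob", "Loss"), ("Ann", "Loss"), ("Bob", "Loss"), ("Ann", "Win")]
rates = win_rates(games)
```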
