Find frequency of words in a column in Google Sheets and lookup another value from a different column using formulae - google-sheets

I have 2 columns of data in a Google Sheet. Column1 is unique words or sentences (words are repeated in sentences) and the Column2 is a numeric value next to each (say votes). I am trying to get a list of unique words from Column1 and then the sum of values (votes) from Column2 when the word was present either on its own or in a sentence.
Here is a sample of the data I am working with in Google Sheets:
Term         | Votes
-------------+------
apple        | 20
apple eat    | 100
orange       | 30
orange rules | 40
rule why     | 50
This is what the end result looks like:
Word   | Votes
-------+------
apple  | 120
eat    | 100
orange | 70
rules  | 40
rule   | 50
why    | 50
The way I am doing it now is quite long and I am not sure if this is the best solution.
Here's my solution:
JOIN values in Column1 using a delimiter " " and then SPLIT them using the same delimiter and then TRANSPOSE them into a column all in one step. This way I have a list of all the words used in Column1 in say Column3.
In Column4 pull out all the UNIQUE values and then do a COUNTIF for the unique values from Column3. This way I am able to get the frequency of each unique word by referring to the list of all words.
In order to get the sum of Votes I have to TRANSPOSE Column4 and then QUERY Column1 and Column2 by using dynamic text in the formula. The formula looks like =QUERY(Column1:Column2, "SELECT SUM(Column2) WHERE Column1 CONTAINS '" & referenceToUniqueWord & "'", 1). The reason I have to transpose first is that the QUERY formula outputs 2 cells of data, i.e. the header text sumColumn2 and the number (the sum of votes). Since I get two cells of data for each unique word, I cannot drag the formula down, so I have to do it horizontally.
I finally get three rows of data after the last step:
One row is just transposed Column4 (all the unique words). Second row is just the text sumColumn2 from using the QUERY formula. And third row is the actual sum of votes, resulting from individual QUERY formulae. I then transpose these rows to columns and to get my final table I VLOOKUP the frequency values arrived at earlier.
This approach is lengthy and prone to errors. It also doesn't work if the list is large: the initial JOIN fails with an error saying the limit of 50,000 characters was reached. Any ideas to make it better are welcome. I know this can be done much more easily using Scripts, but I'd prefer to have it done using only formulae.

try:
=ARRAYFORMULA(QUERY(SPLIT(TRANSPOSE(SPLIT(QUERY(TRANSPOSE(QUERY(
IF(IFERROR(SPLIT(A:A, " "))="",,"♠"&SPLIT(A:A, " ")&"♦"&B:B)
,,999^99)),,999^99), "♠")), "♦"),
"select Col1,sum(Col2)
group by Col1
order by sum(Col2) desc
label sum(Col2)''"))
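To make it easier to check what the formula above should return, here is a minimal Python sketch of the same logic (split every term into words, attach the row's votes to each word, then group and sum). It is only an illustration outside Sheets, not part of the formula; the function name votes_per_word is made up, and the tie-break by word is an assumption added to make the output order deterministic.

```python
from collections import defaultdict

# Sample rows from the question: (term, votes).
rows = [
    ("apple", 20),
    ("apple eat", 100),
    ("orange", 30),
    ("orange rules", 40),
    ("rule why", 50),
]

def votes_per_word(rows):
    """Sum the votes of every row in which a word appears as a whole token."""
    totals = defaultdict(int)
    for term, votes in rows:
        # Split on whitespace, roughly like SPLIT(A:A, " ").
        for word in term.split():
            totals[word] += votes
    # Highest total first, like "order by sum(Col2) desc".
    # Ties are broken alphabetically (an assumption, QUERY does not promise this).
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))

result = votes_per_word(rows)
for word, total in result:
    print(word, total)
```

Because the words are matched as whole tokens, "rule" does not pick up the votes of "orange rules", which is what the expected table in the question shows.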

Related

Combine a SPLIT formula with a formula that chooses N unique words from the SPLIT outcome

I got a sentence which I SPLIT into words without the punctuation. Next I want to choose three random, but unique words from that split. I use the formula as seen in cell I2. Is it possible to combine both the SPLIT formula and the other formula into one (big) formula?
SPLIT formula:
=ARRAYFORMULA(REGEXREPLACE(SPLIT(A2," "),"[,.?!]",""))
Formula to choose three random unique words:
=ARRAYFORMULA(ARRAY_CONSTRAIN(SPLIT(FLATTEN(QUERY(QUERY(QUERY(SPLIT(FLATTEN(
ROW(B2:G2)&"×"&RANDARRAY(ROWS(B2:G2), COLUMNS(B2:G2))&"×"&B2:G2), "×"),
"select max(Col3) group by Col2 pivot Col1"),
"offset 1", 0),,9^9)), " "), 9^9, 3))
I understand that you want to get 3 random unique words from a string.
In what follows I am going to demonstrate how to get truly random words when the sheet is modified, while also handling exceptions, punctuation and more.
Solution:
Notes:
This solution handles punctuation (notice the characters highlighted in yellow).
To get N unique random words, just replace the [n] argument of the SORTN function with a cell reference.
Paste this formula in B2.
=ArrayFormula(IF(A2="",,JOIN(" ,",TRANSPOSE(QUERY(SORTN({RANDARRAY(COUNTA(UNIQUE(SPLIT(TRIM(REGEXREPLACE(A2,"[[:punct:]]",""))," ")))),TRANSPOSE(UNIQUE(SPLIT(TRIM(REGEXREPLACE(A2,"[[:punct:]]",""))," ")))},3,,1,RANDBETWEEN(0,1))," Select Col2 ")))))
Explanation:
1 - We need UNIQUE(SPLIT(TRIM(REGEXREPLACE(A2,"[[:punct:]]",""))," ")) to replace punctuation with nothing (""), TRIM leading, trailing and repeated spaces, SPLIT the string with " " as the delimiter, and then get the UNIQUE columns resulting from SPLIT, which is He|is|cunning|as|a|fox. We then TRANSPOSE the output, like this: TRANSPOSE(UNIQUE([Output])), to join it with the random numbers column later.
2 - We need an array {} that contains He|is|cunning|as|a|fox and a column of random numbers, like this: { RANDARRAY , He|is|cunning|as|a|fox }.
To get the column with random numbers: RANDARRAY(COUNTA(UNIQUE(SPLIT(TRIM(REGEXREPLACE(A2,"[[:punct:]]",""))," "))))
RANDARRAY takes [columns] set to 1 and [rows] set to COUNTA(UNIQUE(SPLIT(TRIM(REGEXREPLACE(A2,"[[:punct:]]",""))," "))) which is the COUNTA( He|is|cunning|as|a|fox )
3 - Now we have to SORTN the output with [n] set to 3, meaning 3 words in this case. To get N unique random words, just replace [n] with a cell reference.
[sort_column] is set to 1, the column of random numbers, and [is_ascending] is set to RANDBETWEEN(0,1) to get either 0 or 1: 0 means FALSE (sort descending) and 1 means TRUE (sort ascending).
4 - QUERY " Select Col2 ", the randomized column of words.
5 - TRANSPOSE the column.
6 - JOIN with " ,"
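For comparison, the six steps above can be sketched in Python. This is only an illustration of the idea (strip punctuation, split, de-duplicate, pick N at random, join), not a translation of the formula; the function name pick_unique_words, the sample sentence and the fixed seed are assumptions made for the example. Note that the sketch de-duplicates explicitly with dict.fromkeys.

```python
import random
import string

def pick_unique_words(sentence, n=3, rng=None):
    """Return up to n distinct words from sentence, chosen at random."""
    rng = rng or random.Random()
    # Remove ASCII punctuation, similar in spirit to REGEXREPLACE(A2, "[[:punct:]]", "").
    cleaned = sentence.translate(str.maketrans("", "", string.punctuation))
    # split() also drops leading, trailing and repeated spaces (the TRIM step).
    # dict.fromkeys keeps the first occurrence of each word (the UNIQUE step).
    unique_words = list(dict.fromkeys(cleaned.split()))
    # sample() never returns the same element twice.
    return rng.sample(unique_words, min(n, len(unique_words)))

# Hypothetical input; the seed only makes the run repeatable.
picked = pick_unique_words("He is cunning, as a fox!", n=3, rng=random.Random(0))
print(" ,".join(picked))
```

Dropping the rng argument gives a fresh pick on every call, which is closer to how the formula behaves when the sheet recalculates.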
After researching for a while I came across the use of array_constrain to pick a fixed number of results and sort with randarray to randomize the outcome.
=ARRAY_CONSTRAIN(
transpose(SORT(transpose(ARRAYFORMULA(REGEXREPLACE(SPLIT(A2," "),"[,.?!]",""))),
randarray(COUNTA(ARRAYFORMULA(REGEXREPLACE(SPLIT(A2," "),"[,.?!]","")))),true))
,1,3)
If anyone happens to have a better solution to this, I would gladly see a response.

Google Sheets Combine a column with duplicates and update total sum in another column

This might be something fairly simple but struggling to find a way to do it.
In Column B, I have a list of foods required.
In Column C, I have the amount needed.
In Column D, I have g (for grams), ml (for millilitres) etc.
I would like to combine the duplicates in Column B and update the totals from Column C, with the g or ml in Column D beside it.
The list I have has been created by using an array formula based on dropdowns in another sheet.
I have seen people using UNIQUE formula in 1 column (this works) and then a SUMIF formula in another column and then a JOIN formula in another... I tried this but the SUMIF is always returning 0.
Would someone please be able to advise on how I can do this?
TIA :D
It's hard to be sure exactly what you need without seeing the data. But based on my understanding of solely what you've posted, this QUERY formula should generate a condensed mini-report:
=QUERY({B2:D},"Select Col1, SUM(Col2), Col3 WHERE Col1 Is Not Null GROUP BY Col1, Col3 LABEL SUM(Col2) ''")
In plain English, this means "Arrange the data from the range B2:D in the same order as the raw data, but sum the second column's data according to matches in both the first and third columns. Only return results for the raw data where the first column is not blank. Replace the default 'sum' header on the second column with nothing; I don't need it."
This formula assumes that every ingredient will always be attached to the same measurement (e.g., 'salt' in Col B is always paired with 'mg' in Col D, etc.). If this is not the case, you will wind up with ingredients being listed as many times as there are different measures in Col D.
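If it helps to see the grouping outside of QUERY, here is a small Python sketch of the same idea: sum the amounts for every (food, unit) pair and skip blank food names. The data and the function name combine are hypothetical, since the original sheet was not shared.

```python
from collections import defaultdict

# Hypothetical rows standing in for columns B, C and D: (food, amount, unit).
items = [
    ("flour", 200, "g"),
    ("milk", 250, "ml"),
    ("flour", 300, "g"),
    ("", 0, ""),          # blank row, skipped like "WHERE Col1 Is Not Null"
    ("salt", 5, "g"),
    ("milk", 100, "ml"),
]

def combine(items):
    """Total the amount for each (food, unit) pair."""
    totals = defaultdict(float)
    for food, amount, unit in items:
        if food:
            totals[(food, unit)] += amount
    # One output row per pair, sorted by food and then unit.
    return [(food, total, unit) for (food, unit), total in sorted(totals.items())]

combined = combine(items)
for food, total, unit in combined:
    print(food, total, unit)
```

As with the formula, a food that appears with two different units would produce two output rows, one per unit.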

Excel - sum given a condition in a relative column

Imagine I have this (time)sheet:
Hours | Text
------+----------------------
3 | fixing PRA-345
4.5 | refactoring PRA-222
5 | PRA-345 and stuff
And I want to calculate how much cumulative time one has spent on a ticket with a given number.
In other words sum the hours based on the text in a neighbouring cell.
Can you do it without an extra column? What I did was make an extra column that returned either the number of hours, if the given text was present (via REGEXMATCH), or 0. And then I ran a SUM on that column. Having this solved without an extra column would be nice ;)
Expected output
In my case it would be enough to find the total sum of hours for a given string. So if a cell (say D1) has the hardwired text, such as "PRA-345", I want the cell to its right (E1) to display the total hours (8 in this case).
Is this what you need?
=sum(filter(B5:B,regexmatch(C5:C,E5)))
Reference:
FILTER
SUM
Instead you can try
=QUERY({A1:A11,ArrayFormula(REGEXEXTRACT(B1:B11,"PRA-\d+"))},
"select Col2, sum(Col1) where Col1 is not null
group by Col2 label Col2 'Tickets', sum(Col1) 'Sums' ",1)
Functions used:
QUERY
ArrayFormula
REGEXEXTRACT
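Both answers can be checked against the sample timesheet with a short Python sketch. The first function mirrors the SUM/FILTER/REGEXMATCH idea for one ticket, the second mirrors the QUERY/REGEXEXTRACT idea for all tickets at once. The function names are made up for this illustration.

```python
import re
from collections import defaultdict

# The sample timesheet from the question: (hours, text).
timesheet = [
    (3, "fixing PRA-345"),
    (4.5, "refactoring PRA-222"),
    (5, "PRA-345 and stuff"),
]

TICKET = re.compile(r"PRA-\d+")  # same pattern as in REGEXEXTRACT

def hours_for(ticket, rows):
    """Sum the hours of every row whose text mentions the given ticket."""
    pattern = re.compile(re.escape(ticket))
    return sum(hours for hours, text in rows if pattern.search(text))

def hours_per_ticket(rows):
    """Group by the first ticket number found in each row and sum the hours."""
    totals = defaultdict(float)
    for hours, text in rows:
        match = TICKET.search(text)
        if match:
            totals[match.group()] += hours
    return dict(totals)

print(hours_for("PRA-345", timesheet))
print(hours_per_ticket(timesheet))
```

Like REGEXEXTRACT, hours_per_ticket only looks at the first ticket number in a row, so a row that mentions two tickets is counted for the first one only.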

How to Sort a Query by a Column Containing Numbers as Text

I have the following problem where I'm querying from one tab to another, and then trying to sort one of the columns by rating (AA), and days past (X, which is a negative number since it represents the amount of days past a deadline). The querying looks as follows:
=QUERY('(Name of Tab1)'!K7:AA,"SELECT K, N , X, Z, AA WHERE X != 'Closed' ORDER BY X ASC")
The issue is that I'm getting sorts for Column X that look like this:
-279.00
-3.00
-10.00
-106.00
-11.00
-12.00
-12.00
-13.00
-14.00
-144.00
-149.00
Clearly, this isn't the sort I want and it's pretty evident that it's reading it as a string and not an int. However, whenever I try to use SQL functions like cast as int, it doesn't work.
How can I convert these values into an int so then it sorts everything properly? Thanks in advance.
The issue you face has nothing to do with the "numbers" being negative.
It is because the column/cells containing the numbers are formatted as text.
Text cells are sorted character by character (alphabetically), not by their numeric value.
Please use the following formula:
=QUERY({'Name of Tab1'!K7:W44,ARRAYFORMULA('Name of Tab1'!X7:X44*1),'Name of Tab1'!Y7:AA44},
"SELECT Col1, Col4, Col14 , Col16, Col17 WHERE Col14 is not null and Col1<>'' ORDER BY Col14")
How the formula works:
We split our range into 3 parts
The part before our "numbers" column 'Name of Tab1'!K7:W44
Our "numbers" column ARRAYFORMULA('Name of Tab1'!X7:X44*1)
The last part 'Name of Tab1'!Y7:AA44
Because we now have our 3 ranges in curly brackets {} we cannot use column letters. Instead, we must use Col1, Col4 etc, where Col1 is the 1st column in our combined range (previously K), Col4 is our previous N column, Col14 is our previous X column and so on.
About our "numbers" column ARRAYFORMULA('Name of Tab1'!X7:X44*1).
Multiplying by 1 inside an ArrayFormula converts every text cell that contains a number into an actual number, while cells with other text (in our case Closed) result in #VALUE!, which gets skipped by WHERE Col14 is not null (instead of our original WHERE X != 'Closed').
Functions used:
QUERY
ArrayFormula
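The difference between sorting text and sorting numbers is easy to reproduce outside Sheets. The Python sketch below uses a few of the values from the question plus one Closed cell; to_number is a hypothetical helper that plays the role of the *1 conversion, returning None where Sheets would produce #VALUE!.

```python
# A few of the values from the question, stored as text, plus one "Closed" cell.
raw = ["-3.00", "-10.00", "-106.00", "Closed", "-279.00"]

def to_number(cell):
    """Convert a text cell to a number, or None if it is not numeric."""
    try:
        return float(cell)
    except ValueError:
        return None

# Sorting the text compares character by character, so "-279.00" lands before "-3.00".
as_text = sorted(cell for cell in raw if cell != "Closed")

# Converting first gives a numeric sort; non-numeric cells are dropped,
# like "WHERE Col14 is not null" in the formula.
as_numbers = sorted(n for n in map(to_number, raw) if n is not None)

print(as_text)
print(as_numbers)
```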

Sum 5 largest numbers in each row, dynamically

I have a league table with Column A displaying a list of championship entrants.
In the corresponding row are the entrant's various race results (points scores), i.e. ColC shows Race 1, ColD Race 2 etc.
I want to sum, per row (entrant), the 5 largest scores and show the total in Col B.
The following formula works fine entered line by line,
=ArrayFormula(SUM(IFERROR(LARGE($H5:$AE5,{1,2,3,4,5,6}),0)))
However, I want it to be a dynamic array formula that self populates, should new entrants be added. Something like (though this doesn't work):
=arrayformula(If(A2:A<>"",ArrayFormula(SUM(IFERROR(LARGE($H5:$AE5,{1,2,3,4,5,6}),0))),""))
I've been trying to use MMULT, and a few other haphazard ideas, unsuccessfully.
Test sheet can be used here;
https://docs.google.com/spreadsheets/d/18tmKdwAcXoDQrQxSDSnzgK6A5Erj22oSXcxwUt_lq4o/edit?usp=sharing
This should work even with hundreds or thousands of rows. You can find it on the new tab called mk.help
=Arrayformula({"TEST";if(A3:A="",,VLOOKUP(A3:A,query({query(vlookup(SEQUENCE(COUNTA(A3:A)*10,1,0)/10+3,{row(A3:A),A3:A,D3:M},mod(SEQUENCE(COUNTA(A3:A)*10,1,0),10)*{0,1}+{2,3}),"order by Col1,Col2 desc"),Mod(SEQUENCE(COUNTA(A3:A)*10,1,0),10)},"select Col1,Sum(Col2) where Col3<5 group by Col1"),2,0))})
In B3 put this formula:
=arrayformula(query({transpose(split(textjoin(",",false,{left("",row(A3:A5))} & join(",",column(D3:M3)-column(D3))),",",true,false)),sort(split(transpose(split(textjoin("*",false,{row(B3:B5) & "^" & D3:M5}),"*",true,false)),"^",true,false),1,true,2,false)},"Select sum(Col3) where Col1<=4 group by Col2 label sum(Col3) ''"))
Note that this version only covers rows 3 to 5; if your data goes beyond row 5 you must extend the ranges A3:A5, B3:B5 and D3:M5 accordingly.
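To check the totals that either formula should produce, here is a minimal Python sketch of "sum the 5 largest scores per entrant". The league data and the name top_n_total are hypothetical, and None stands for a race the entrant did not score in.

```python
import heapq

# Hypothetical league table: entrant -> race scores (None = no result).
scores = {
    "Entrant A": [10, 8, None, 6, 9, 7, 5],
    "Entrant B": [4, None, None, 12],
}

def top_n_total(row, n=5):
    """Sum the n largest scores in a row, ignoring missing results."""
    played = [score for score in row if score is not None]
    # nlargest returns fewer than n values when the row is short,
    # which matches wrapping LARGE in IFERROR(..., 0).
    return sum(heapq.nlargest(n, played))

totals = {name: top_n_total(row) for name, row in scores.items()}
print(totals)
```

New entrants are handled simply by adding a row to the table, which is the behaviour the self-populating array formula is meant to give inside the sheet.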
