How to list most frequent text values within a range? - google-sheets

I'm an intermediate Excel user trying to solve an issue that feels a little over my head. Basically, I'm working with a spreadsheet that contains a number of orders, each associated with a customer account # and with up to 5 metadata "tags". I want to be able to use the customer account # to pull the 5 most commonly occurring metadata tags, in order of frequency.
Here is a mock up of the first set of data
Account Number Order Number Metadata
5043 1 A B C D
4350 2 B D
4350 3 B C
5043 4 A D
5043 5 C D
1204 6 A B
5043 7 A D
1204 8 D B
4350 9 B D
5043 10 A C D
and the end result I'm trying to create
Account Number Most Common Tag 2nd 3rd 4th 5th
5043 A C B N/A
4350 B D C N/A N/A
1204 B A C N/A N/A
I was trying to work with the formula suggested here:
=ARRAYFORMULA(INDEX(A1:A7,MATCH(MAX(COUNTIF(A1:A7,A1:A7)),COUNTIF(A1:A7,A1:A7),0)))
But I don't know how to a) use the customer account # as a precondition for counting the text values within the range, b) get around the fact that the MATCH formula only wants to work with a single column of data, and c) read the 2nd, 3rd, 4th, and 5th most common values from this range.
The way I'm formatting this data isn't set in stone. I suspect the way I'm organizing this information is holding me back from simpler solutions, so any suggestions on re-thinking my organization would be just as helpful as insights on how to create a formula to do this.

Implementing this kind of frequency analysis using built-in functions is likely to be a frustrating exercise. Since you are working with Google Sheets, take advantage of custom functions, which are written in JavaScript and placed into a script bound to the sheet (Tools > Script Editor).
The function I wrote for this purpose is below. Entering something like =tagfrequency(A2:G100) in the sheet will produce the desired output:
+----------------+-----------------+-----+-----+-----+-----+
| Account Number | Most Common Tag | 2nd | 3rd | 4th | 5th |
| 5043 | D | A | C | B | N/A |
| 4350 | B | D | C | N/A | N/A |
| 1204 | B | A | D | N/A | N/A |
+----------------+-----------------+-----+-----+-----+-----+
Custom function
function tagFrequency(arr) {
  var dict = {}; // the object in which to store tag counts
  for (var i = 0; i < arr.length; i++) {
    var acct = arr[i][0];
    if (acct == '') {
      continue; // ignore empty rows
    }
    if (!dict[acct]) {
      dict[acct] = {}; // new account number
    }
    for (var j = 2; j < arr[i].length; j++) {
      var tag = arr[i][j];
      if (tag) {
        if (!dict[acct][tag]) {
          dict[acct][tag] = 0; // new tag
        }
        dict[acct][tag]++; // increment tag count
      }
    }
  }
  // end of recording, begin sorting and output
  var output = [['Account Number', 'Most Common Tag', '2nd', '3rd', '4th', '5th']];
  for (acct in dict) {
    var tags = dict[acct];
    var row = [acct].concat(Object.keys(tags).sort(function (a, b) {
      return (tags[a] < tags[b] ? 1 : (tags[a] > tags[b] ? -1 : (a > b ? 1 : -1)));
    })); // sorting by tag count, then tag name
    while (row.length < 6) {
      row.push('N/A'); // add N/A if needed
    }
    output.push(row); // add row to output
  }
  return output;
}

You also could get this report:
Account Number Tag count
1204 B 2
1204 A 1
1204 D 1
4350 B 3
4350 D 2
4350 C 1
5043 D 5
5043 A 4
5043 C 3
5043 B 1
with the formula:
=QUERY(
{TRANSPOSE(SPLIT(JOIN("",ArrayFormula(REPT(FILTER(A2:A,A2:A<>"")&",",5))),",")),
TRANSPOSE(SPLIT(ArrayFormula(CONCATENATE(FILTER(C2:G,A2:A<>"")&" ,")),",")),
TRANSPOSE(SPLIT(rept("1,",counta(A2:A)*5),","))
},
"select Col1, Col2, Count(Col3) where Col2 <>' ' group by Col1, Col2
order by Col1, Count(Col3) desc label Col1 'Account Number', Col2 'Tag'")
The formula will count the number of occurrences of any tag.
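For comparison, the same long-format report can be sketched in plain JavaScript. This is only an illustration, not part of either answer: the helper name `tagCounts` and the `sample` array are hypothetical, rows are assumed to be laid out as [account, order number, tag1..tag5], and tags are assumed not to contain the `|` character used as a key separator.

```javascript
// Hypothetical helper: returns rows of [account, tag, count].
// Assumes each input row is [account, orderNumber, tag1, ..., tag5].
function tagCounts(rows) {
  var counts = {}; // key "account|tag" -> number of occurrences
  rows.forEach(function (row) {
    var acct = row[0];
    if (acct === '') return; // skip empty rows
    row.slice(2).forEach(function (tag) {
      if (tag === '' || tag == null) return; // skip empty tag cells
      var key = acct + '|' + tag;
      counts[key] = (counts[key] || 0) + 1;
    });
  });
  return Object.keys(counts).map(function (key) {
    var parts = key.split('|');
    return [parts[0], parts[1], counts[key]];
  }).sort(function (a, b) {
    // account ascending (compared as text), then count descending, then tag name
    if (a[0] !== b[0]) return a[0] < b[0] ? -1 : 1;
    if (a[2] !== b[2]) return b[2] - a[2];
    return a[1] < b[1] ? -1 : (a[1] > b[1] ? 1 : 0);
  });
}

// The mock-up data from the question.
var sample = [
  [5043, 1, 'A', 'B', 'C', 'D', ''],
  [4350, 2, 'B', 'D', '', '', ''],
  [4350, 3, 'B', 'C', '', '', ''],
  [5043, 4, 'A', 'D', '', '', ''],
  [5043, 5, 'C', 'D', '', '', ''],
  [1204, 6, 'A', 'B', '', '', ''],
  [5043, 7, 'A', 'D', '', '', ''],
  [1204, 8, 'D', 'B', '', '', ''],
  [4350, 9, 'B', 'D', '', '', ''],
  [5043, 10, 'A', 'C', 'D', '', '']
];
var report = tagCounts(sample); // first row: ['1204', 'B', 2]
```

Note that account numbers come back as text here, because they pass through the string key.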

Related

SUMPRODUCT but only on nonblank cells

Given:
Lp | COL1 | COL 2 | COL 3
ROW 1 | X | | X
ROW 2 | | X | X
ROW 3 | X | X |
ROW 4 | | |
ROW 5 | 1 | 1.5 | 2
ROW 6 | 2 | 1 | 3
I would like to use SUMPRODUCT of Row 1 with Row 5 (and then Row 6) but only in the places where row has X (or rather where it is non empty).
Expected result for Row 1: 1 * 2 + 2 * 3 = 8 (because first and last column is not empty)
Expected result for Row 2: 1.5 * 1 + 2 * 3 = 7.5 (second and last col not empty)
Expected result for Row 3: 1 * 2 + 1.5 * 1 = 3.5 (first and second non empty)
Expected result for Row 4: 0
I appreciate your help.
Use:
=SUMPRODUCT(($B$6:$D$6)*($B$7:$D$7)*(B2:D2<>""))
You can achieve the same thing without SUMPRODUCT.
Create another three columns COL1',2',3', replace
every X with the corresponding product using IF condition.
For example, at COL1',ROW1 you write a formula such as =IF(A1="X", A$5*A$6, 0)
(here A1 is COL1,ROW1)
and drag it to fill COL1',2',3'.
Then you do SUM over COL1',2',3'.
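As a cross-check of the expected results in the question, the masking idea can be sketched in JavaScript. The function name `maskedSumProduct` is hypothetical, and the arrays simply copy rows 1 to 6 of the example table.

```javascript
// Hypothetical helper mirroring SUMPRODUCT(a * b * (mask <> "")):
// a column contributes a[i] * b[i] only where the mask cell is non-empty.
function maskedSumProduct(mask, a, b) {
  var total = 0;
  for (var i = 0; i < mask.length; i++) {
    if (mask[i] !== '') {
      total += a[i] * b[i];
    }
  }
  return total;
}

var row5 = [1, 1.5, 2];
var row6 = [2, 1, 3];
var results = [
  maskedSumProduct(['X', '', 'X'], row5, row6), // 8
  maskedSumProduct(['', 'X', 'X'], row5, row6), // 7.5
  maskedSumProduct(['X', 'X', ''], row5, row6), // 3.5
  maskedSumProduct(['', '', ''], row5, row6)    // 0
];
```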

gSheets: How to use SPLIT in ARRAYFORMULA over columns

For numbers x and y, I have cell data formatted as x#y.
An example row:
| A | B | C | D |
| ------ | ------ | ----- | ------ |
|10#100 | 10#120 | 8#150 | 5#175 |
I want to parse this type of row into two quantities: the sum of the x's and sum of y's.
With my example, I should have two cells:
33 and 545
Basically, I want to SUM the resulting array of SPLIT applied to each cell in A1:D1.
My attempt
=SUM(ARRAYFORMULA(SPLIT(A1:D1, "#")))
Unfortunately, this approach doesn't allow me to specify whether I want x or y (when I call SPLIT) and it seems to be returning x + y, rather than sum(i=1 to 4) x_i.
Try this:
=index(query(arrayformula(split(transpose(A1:D1), "#")),"select sum(Col1),sum(Col2) ",0),2)
Another option:
=ArrayFormula({SUM(INDEX(SPLIT(TRANSPOSE(A1:D1),"#"),0,1)),SUM(INDEX(SPLIT(TRANSPOSE(A1:D1),"#"),0,2))})
use:
=SUMPRODUCT(SPLIT(JOIN("#",A1:D1),"#"),ISEVEN(SEQUENCE(1,COUNTA(A1:D1)*2)-1))
For the second cell (the sum of the y's), replace ISEVEN with ISODD.
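If a script is an option, the same split-and-sum can be sketched in JavaScript, for example as the basis of a custom function. The name `sumPairs` is hypothetical, and every cell is assumed to contain exactly one separator with a number on each side.

```javascript
// Hypothetical helper: sums the x parts and the y parts of "x#y" cells.
// Assumes each cell holds two numbers joined by the separator.
function sumPairs(cells, sep) {
  var sums = [0, 0];
  cells.forEach(function (cell) {
    var parts = String(cell).split(sep);
    sums[0] += Number(parts[0]); // running sum of the x's
    sums[1] += Number(parts[1]); // running sum of the y's
  });
  return sums;
}

var totals = sumPairs(['10#100', '10#120', '8#150', '5#175'], '#'); // [33, 545]
```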
use:
=ARRAYFORMULA(QUERY(QUERY(SPLIT(TRANSPOSE(A1:D1); "#");
"select sum(Col1),sum(Col2)"); "offset 1"; 0))

How to grab a value corresponding to a particular date, if the date is before/after the dates in the table?

I have a Google Sheet table with a number of inventory additions:
Date | Product | New Units | # Total Units
-----------|---------|-----------|---------------
1/11/2017 | Coke | 14 | 14
1/31/2017 | Pepsi | 6 | 6
2/12/2017 | Coke | 3 | 17
3/13/2017 | Coke | 12 | 29
3/13/2017 | Pepsi | 13 | 19
e.g., on Feb 12th 2017, I received 3 new units of Coke, for a total of 17 units. I'd like to be able to say for any given product and any given date, how many units of that product did I have on that date?
For example, given the following list of dates in a separate sheet, based on the data above, I'd hope to see this output:
Date | Coke | Pepsi
-----------|------|-------
1/10/2017 | 0 | 0
1/11/2017 | 14 | 0
2/10/2017 | 14 | 6
2/15/2017 | 17 | 6
3/15/2017 | 29 | 19
Is there a formula or formulas I could use to calculate values for B2:B6 and C2:C6?
paste in G3 (skip the 1st avail row to avoid #REF!) then drag down, right and up
=ARRAYFORMULA(IF($F3<MIN($A$2:$A), 0, IFERROR(IFERROR(
QUERY($A$2:$D,
"select D where A >= date '"&TEXT($F2, "yyyy-mm-dd")&"'
and A <= date '"&TEXT($F3, "yyyy-mm-dd")&"'
and B = '"&G$1&"' ", 0),
QUERY($A$2:$D,
"select D where A >= date '"&TEXT($F1, "yyyy-mm-dd")&"'
and A <= date '"&TEXT($F3, "yyyy-mm-dd")&"'
and B = '"&G$1&"' ", 0)), 0)))
paste in G3 (skip the 1st avail row to avoid #REF!) then drag down, right and up
=ARRAYFORMULA(IF($F2<MIN($A$2:$A), 0, IFERROR(IFERROR(
QUERY(TO_TEXT({VALUE($A$2:$A), $B$2:$D}),
"select Col4 where Col1 >= '"&VALUE($F1)&"'
and Col1 <= '"&VALUE($F2)&"'
and Col2 = '"&G$1&"' ", 0),
QUERY(TO_TEXT({VALUE($A$2:$A), $B$2:$D}),
"select Col4 where Col1 >= '"&VALUE(#REF!)&"'
and Col1 <= '"&VALUE($F2)&"'
and Col2 = '"&G$1&"' ", 0)), 0)))
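The underlying lookup (take the running total from the latest entry on or before the given date, or 0 if there is none) can be sketched in JavaScript. This is an illustration under assumptions rather than a translation of the formulas above: the name `unitsOnDate` is hypothetical, and dates are assumed to be ISO `yyyy-mm-dd` strings so that plain string comparison orders them correctly.

```javascript
// Hypothetical helper. Each row is [date, product, newUnits, totalUnits],
// with dates as ISO 'yyyy-mm-dd' strings (an assumption of this sketch).
function unitsOnDate(rows, product, date) {
  var best = null; // latest matching row on or before `date`
  rows.forEach(function (row) {
    if (row[1] !== product || row[0] > date) return;
    if (best === null || row[0] >= best[0]) best = row;
  });
  return best === null ? 0 : best[3];
}

var inventory = [
  ['2017-01-11', 'Coke', 14, 14],
  ['2017-01-31', 'Pepsi', 6, 6],
  ['2017-02-12', 'Coke', 3, 17],
  ['2017-03-13', 'Coke', 12, 29],
  ['2017-03-13', 'Pepsi', 13, 19]
];
var cokeFeb15 = unitsOnDate(inventory, 'Coke', '2017-02-15');  // 17
var pepsiJan10 = unitsOnDate(inventory, 'Pepsi', '2017-01-10'); // 0
```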

number of connected nodes to specific nodes in a path

I have a cypher query (below).
It works but I was wondering if there's a more elegant way to write this.
Based on a given starting node, the query tries to:
Find the following pattern/motif: (inputko)-->(:cpd)-->(ko2:ko)-->(:cpd)-->(ko3:ko).
For each motif/pattern found, find the connected nodes with the label contigs for the following nodes in the pattern: [inputko, ko2, ko3].
Return a summary of the 3 nodes and their connected contigs, i.e. the name property .ko of the 3 nodes and the number of connected :contigs nodes in each of the (inputko)-->(:cpd)-->(ko2:ko)-->(:cpd)-->(ko3:ko) motifs that were found.
+--------------------------------------------------------------------------+
| KO1 | KO1count | KO2 | KO2count | KO3 | KO3count |
+--------------------------------------------------------------------------+
| "ko:K00001" | 102 | "ko:K14029" | 512 | "ko:K03736" | 15 |
| "ko:K00001" | 102 | "ko:K00128" | 792 | "ko:K12972" | 7 |
| "ko:K00001" | 102 | "ko:K00128" | 396 | "ko:K01624" | 265 |
| "ko:K00001" | 102 | "ko:K03735" | 448 | "ko:K00138" | 33 |
| "ko:K00001" | 102 | "ko:K14029" | 512 | "ko:K15228" | 24 |
+--------------------------------------------------------------------------+
I'm puzzled about the syntax for operating on each match.
From the documentation the foreach clause doesn't seem to be what I need.
Any ideas guys?
The FOREACH clause is used to update data within a collection, whether
components of a path, or result of aggregation.
Collections and paths are key concepts in Cypher. To use them for
updating data, you can use the FOREACH construct. It allows you to do
updating commands on elements in a collection — a path, or a
collection created by aggregation.
START
inputko=node:koid('ko:\"ko:K00001\"')
MATCH
(inputko)--(c1:contigs)
WITH
count(c1) as KO1count, inputko
MATCH
(inputko)-->(:cpd)-->(ko2:ko)-->(:cpd)-->(ko3:ko)
WITH
inputko.ko as KO1,
KO1count,
ko2,
ko3
MATCH
(ko2)--(c2:contigs)
WITH
KO1,
KO1count,
ko2.ko as KO2,
count(c2) as KO2count,
ko3
MATCH
(ko3)--(c3:contigs)
RETURN
KO1,
KO1count,
KO2,
KO2count,
ko3.ko AS KO3,
count(c3) AS KO3count
LIMIT
5;
I realised that I have to use distinct, as in count(distinct cX), to get an accurate count. I do not know why.
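One possible explanation, not confirmed anywhere in the thread: if the same (ko2, ko3) pair is reached through more than one :cpd node, the path MATCH yields several identical rows, and each of them is then expanded by every connected contig, so a plain count is multiplied. The table in the question is consistent with this, since "ko:K00128" appears with both 792 and 396. The toy data below is entirely hypothetical and only imitates that row expansion in JavaScript.

```javascript
// Hypothetical toy data: ko2 'K2' reaches ko3 'K3' through two different
// cpd nodes, so the path match yields two identical (ko2, ko3) rows.
var pathRows = [
  { ko2: 'K2', ko3: 'K3' }, // via one cpd node
  { ko2: 'K2', ko3: 'K3' }  // via another cpd node
];
var contigsOf = { K2: ['c1', 'c2', 'c3'] };

// Expanding every path row by the contigs of ko2 imitates
// MATCH (ko2)--(c2:contigs) applied to each incoming row.
var expanded = [];
pathRows.forEach(function (row) {
  contigsOf[row.ko2].forEach(function (c) {
    expanded.push(c);
  });
});

var plainCount = expanded.length; // like count(c2): 6
var distinctCount = expanded.filter(function (c, i) {
  return expanded.indexOf(c) === i; // keep the first occurrence only
}).length; // like count(DISTINCT c2): 3
```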
I am not sure how elegant this is but I think it does give you some notion about how you could extend your query for n ko nodes in a path and still return the data as you have laid it out below. It should also demonstrate the power of combining the with directive and collections.
// match the ko/cpd node paths starting with K00001
match p=(ko1:ko {name:'K00001' } )-->(:cpd)-->(ko2:ko)-->(:cpd)-->(ko3:ko)
// remove the cpd nodes from each path and name the collection row
with collect([n in nodes(p) where labels(n)[0] = 'ko' | n]) as row
// create a range for the number of rows and number of ko nodes per row
with row
, range(0, length(row)-1, 1) as idx
, range(0, 2, 1) as idx2
// iterate over each row and node in the order it was collected
unwind idx as i
unwind idx2 as j
with i, j, row[i][j] as ko_n
// find all of the contigs nodes attached to each ko node
match (ko_n)--(:contigs)
// group the ko node data together in a collection preserving the order and the count
with i, [j, ko_n.name, count(*)] as ko_set
order by i, ko_set[0]
// re-collect the ko node sets as ko rows
with i, collect(ko_set) as ko_row
order by i
//return the original paths in the ko node order with the counts
return reduce( ko_str = "", ko in ko_row |
case
when ko_str = "" then ko_str + ko[1] + ", " + ko[2]
else ko_str + ", " + ko[1] + ", " + ko[2]
end) as `KO-Contigs Counts`
The foreach directive in Cypher is strictly for mutating data. For instance, you could use one query to collect the contigs counts per ko node.
This is a bit convoluted and you would never update the number of contigs on a ko node like this but it illustrates the use of foreach in cypher.
match (ko:ko)-->(:contigs)
with ko,count(*) as ct
with collect(ko) as ko_nodes, collect(ct) as ko_counts
with ko_nodes, ko_counts, range(0,length(ko_nodes)-1, 1) as idx
foreach ( i in idx |
set (ko_nodes[i]).num_contigs = ko_counts[i] )
A simpler way to perform the above update task on each ko node would be to do something like this...
match (ko:ko)-->(:contigs)
with ko, count(*) as ct
set ko.num_contigs = ct
If you were to carry the number of contigs on each ko node, then you could perform a query like this to return the number of contigs for each ko node along each path:
// match all the paths starting with K00001
match p=(ko1:ko {name:'K00001' } )-->(:cpd)-->(ko2:ko)-->(:cpd)-->(ko3:ko)
// build a csv line per path
return reduce( ko_str = "", ko in nodes(p) | ko_str +
// using just the ko nodes in the path
// exclude the cpd nodes
case
when labels(ko)[0] = "ko" then ko.name + ", " + toString(ko.num_contigs) + ", "
else ""
end
) as `KO-Contigs Counts`

How to get a single value from a cell-range by matching multiple columns and rows

I'm struggling with this one.
Here is data from 'sheet1':
|| A B C D E
=========================================
1 || C1 C2 X1 X2 X3
.........................................
2 || a b 1 2 3
3 || a d 10 11 12
4 || c d 4 5 6
5 || c f 13 14 15
6 || e f 7 8 9
7 || e b 16 17 18
Here's data in "sheet2":
|| A B C D
=================================
1 || C1 C2 C3 | val
.................................
2 || a d X2 | ?
3 || c f X1 | ?
4 || e b X3 | ?
Note that column C in sheet2 actually has values equal to user column names in sheet1.
I simply want to match columns A, B and C in sheet2 with columns A and B and row 1 in sheet1 to find the values in the last column:
|| A B C D
=================================
1 || C1 C2 C3 | val
.................................
2 || a d X2 | 11
3 || c f X1 | 13
4 || e b X3 | 18
I've been playing with OFFSET() and MATCH() but can't seem to lock down on one cell using multiple search criteria. Can someone help please?
I would use this function in sheet2 D2 field:
=index(filter(sheet1!C:E,sheet1!A:A=A2,sheet1!B:B=B2),1,match(C2,sheet1!$C$1:$E$1,0))
Explanation:
The FILTER function returns the X1, X2, X3 values (columns C, D, E of sheet1) of the row that matches these two conditions:
C1 is "a"
C2 is "d"
So it will give back an array: [10,11,12], which holds the values of the X1, X2, X3 columns (C, D, E) of sheet1 in the matching row.
Then the INDEX function takes this array. Now we only need to determine which value to pick. The MATCH function does this by looking up the third condition, C3 (in this case "X2"), in the header row of sheet1. In this example it gives back 2, as X2 is in the 2nd position of sheet1!C1:E1.
So the INDEX function will give back the 2nd element of the array [10,11,12], which is 11, the desired value.
Hope this helps.
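The FILTER, MATCH and INDEX steps described above can also be sketched in JavaScript, which may help in seeing what each part contributes. The function name `lookup` is hypothetical, and the `sheet1` array simply copies the example data, header row included.

```javascript
// Hypothetical helper: finds the row whose first two cells equal key1 and
// key2 (the FILTER step), finds the column by header name (the MATCH step),
// and returns the cell at their intersection (the INDEX step).
function lookup(table, key1, key2, header) {
  var col = table[0].indexOf(header); // position of the header in row 1
  if (col === -1) return null;        // unknown column name
  for (var r = 1; r < table.length; r++) {
    if (table[r][0] === key1 && table[r][1] === key2) {
      return table[r][col];
    }
  }
  return null; // no row matched both keys
}

var sheet1 = [
  ['C1', 'C2', 'X1', 'X2', 'X3'],
  ['a', 'b', 1, 2, 3],
  ['a', 'd', 10, 11, 12],
  ['c', 'd', 4, 5, 6],
  ['c', 'f', 13, 14, 15],
  ['e', 'f', 7, 8, 9],
  ['e', 'b', 16, 17, 18]
];
var values = [
  lookup(sheet1, 'a', 'd', 'X2'), // 11
  lookup(sheet1, 'c', 'f', 'X1'), // 13
  lookup(sheet1, 'e', 'b', 'X3')  // 18
];
```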

Resources