Multi-level and multi-argument Index in Google Sheets - google-sheets

I am writing a sheet where I am trying to create a multi level Index that searches through 5 different columns with 3 pieces of data. So for example:
x = 40
y = 5000
z = 20000
Column1 | Column2 | Column3 | Column4 | Column5 | Column6
13 | 29 | 0 | 0 | 0 | Yes
30 | 870 | 0 | 0 | 0 | No
10 | 870 | 0 | 30000 | 1 | Blue
10 | 870 | 30001 | 100000 | 1 | Yes
10 | 870 | 100001 | 300000 | 1 | Unknown
Here's a sample set of my data, what I need is to compare
the variable x to columns 1 and then 2 (x must fall between these values)
variable y to columns 3 and 4 (y must fall between these values)
and then finally z to column 5 (z must be above these values)
In each of these cases I need to know whether the variable is lower or higher than the value in the column. Finally, I need the matching data from column 6 to be returned as a result in my sheet. At the moment I have a simply IMMENSE list of nested IF statements which consider all of these criteria separately, but it doesn't lend itself very well to editing when changes need to be made to the values.
I've looked at every single page on the internet (every... single... page...) and can't seem to find the solution to my issue. Most solutions I have found are either using a single data point, using multiple data points against a single range or simply don't seem to work. The latest iteration I have tried is:
=INDEX('LTV Data'!$N$3:$N$10, MATCH($D$5 & $G$8 & $G$12, ARRAYFORMULA($D$5 <= 'LTV Data'!$H$3:$H$10 & $D$5 >= 'LTV Data'!$I$3:$I$10 & $G$12 <= 'LTV Data'!$J$3:$J$10 & $G$12 >= 'LTV Data'!$K$3:$K$10 & $G$8 <= 'LTV Data'!$L$3:$L$10), 0), 7)
But this only produces an error as the separate values I want to test against are concatenated and the Match can't find that string. I'm also unsure about the greater than and less than symbols as to how valid that syntax is. Is anyone able to shed some light on how I can achieve the result I need in a more elegant way than the mass of IFS, ANDS + ORs I have right now? Like I said, it works but it sure ain't pretty ;)
Thanks a bunch in advance!
ETA: So with the above variables the result I would like would be 'Blue'. This is because x falls between columns 1 and 2, y falls between columns 3 and 4 and z is higher than column 5 on the third row. This is all contained in the MATCH statement in the example code above. Please see the MATCH statement to see the comparisons I am trying to make.

You need to put the different criteria together using multiplication if you want to get the effect of an AND in an array:
=INDEX(F2:F10,MATCH(1,(A2:A10<x)*(B2:B10>x)*(C2:C10<y)*(D2:D10>y)*(E2:E10<z),0))
or
=INDEX(F2:F10,MATCH(1,(A2:A10<=x)*(B2:B10>=x)*(C2:C10<=y)*(D2:D10>=y)*(E2:E10<=z),0))
to include the equality (I have used named ranges for x, y and z).
Works in Google Sheets and (if entered as an array formula) Excel.
In Google Sheets you also have the option of using a filter
=filter(F2:F10,A2:A10<=x,B2:B10>=x,C2:C10<=y,D2:D10>=y,E2:E10<=z)
but then you aren't guaranteed to get just one row.
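To make the multiplication trick concrete, here is a minimal Python sketch of the same logic, using the sample rows from the question. The function names (`row_flag`, `first_match`, `all_matches`) are made up for illustration; this is not how Sheets evaluates the formula internally, only a model of what the formula computes.

```python
# Rows copied from the question: (col1, col2, col3, col4, col5, col6).
ROWS = [
    (13, 29, 0, 0, 0, "Yes"),
    (30, 870, 0, 0, 0, "No"),
    (10, 870, 0, 30000, 1, "Blue"),
    (10, 870, 30001, 100000, 1, "Yes"),
    (10, 870, 100001, 300000, 1, "Unknown"),
]


def row_flag(row, x, y, z):
    """Multiply the five comparisons; the product is 1 only if all hold."""
    lo_x, hi_x, lo_y, hi_y, min_z, _ = row
    return (lo_x <= x) * (hi_x >= x) * (lo_y <= y) * (hi_y >= y) * (min_z <= z)


def first_match(rows, x, y, z):
    """Like INDEX(..., MATCH(1, flags, 0)): label of the first row flagged 1."""
    flags = [row_flag(row, x, y, z) for row in rows]
    if 1 not in flags:
        return None  # the spreadsheet would show #N/A here
    return rows[flags.index(1)][5]


def all_matches(rows, x, y, z):
    """Like FILTER: every matching label, possibly none or several."""
    return [row[5] for row in rows if row_flag(row, x, y, z) == 1]


print(first_match(ROWS, 40, 5000, 20000))
print(all_matches(ROWS, 40, 5000, 20000))
```

With x = 40, y = 5000 and z = 20000 only the third row passes all five comparisons, so both calls agree on 'Blue'. They differ when several rows match: `first_match` keeps the first one, `all_matches` returns them all.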

Related

Google sheets COUNTIF excluding hidden rows

In google sheets, I have a list of strings (1 per row) where each string is split with 1 character per column, so my sheet looks something like below:
  | A | B | C | D | E | F
--+---+---+---+---+---+---
1 | F | R | A | N | K |
2 | P | A | S | S | 1 | 2
I then have this sheet filtered, so I can select only the rows where the first character is F, for example. On another sheet in the same workbook, I have a table of how often each character appears in each column, that looks something like this:
  | A    | B       | C   | D   | E   | F
--+------+---------+-----+-----+-----+---
1 | Char | Overall | 1   | 2   | 3   |
2 | A    | 979     | 141 | 304 | 165 |
3 | B    | 281     | 173 | 69  | 15  |
I would like to have this table dynamically update, so that when I filter the first sheet my table shows the frequency only for the strings that meet the filter.
In Excel, this can be accomplished using a combination of SUMPRODUCT and SUBTOTAL, but this doesn't work in Google Sheets. I've seen this done in Sheets using helper columns, but I would like the solution to work for an arbitrary number of strings of different lengths without having to change the sheet. Can this be done in Google Sheets?
Thanks!
Hidden cells are assigned with the value 0. One way to solve this is by adding a "helper" column in column A and set all the values in it to 1.
| A | B | C | D | E | F | G
--+--------+------+---+---------+-----+-----+-----
1 | Helper | Char | | Overall | 1 | 2 | 3
--+--------+------+---+---------+-----+-----+-----
2 | 1 | A | | 979 | 141 | 304 | 165
3 | 1 | B | | 281 | 173 | 69 | 15
Now, instead of using COUNTIF, use the COUNTIFS formula, where the second condition is A2:A = 1. For example:
=COUNTIFS([YOUR_CONDITION], A2:A,"=1")
The column A values of hidden rows will calculate as 0 and will therefore not be counted.

How to check if subsequent non-empty cells contains the same value?

I would like to count if in the column there are N subsequent elements of a certain value (in this example it's 1). So for the below example, it should return true for A, but false for B. Empty cells shall be ignored.
A B
+---+---+
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| | |
| 1 | 2 |
| | 1 |
+---+---+
| Y | N | <-- RESULT
+---+---+
Ideally, I would be able to mark all of those 4 or more subsequent cells.
Have a look at this one as well - you need to pull it across for columns B, C etc.
=ArrayFormula(max(frequency(if(filter(A:A,A:A<>"")=1,sequence(count(A:A)),""),if(filter(A:A,A:A<>"")<>1,sequence(count(A:A)),"")))>=4)
So how does this work? It goes all the way back to #Barry Houdini, who as far as I know was the first to post this elegant method to find the longest sequence of repeated values within a range in Excel.
It uses Frequency. Frequency counts the number of values which fall within a series of bins defined by cut points
Frequency(<Values>,<Cut points>)
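So if your cut points were 10,20,30 and your values were 12,15,25,35,36,37, the counts would be 0, 2, 1 and 3: nothing at or below 10, two values between 10 and 20, one between 20 and 30, and three above 30. Note that Frequency always returns one more count than there are cut points, because the last count covers everything above the highest cut point.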
#Barry's insight was to realise that if you made the row numbers for values that you didn't want to count be the cut points and the row numbers for values you did want to count (in this case 1) be the values, then you could use Frequency to count up the number of consecutive values of interest. This led him to this formula (in his case looking for consecutive zeroes bounded by ones):
=MAX(FREQUENCY(IF((A2:A100=0)*(A2:A100<>""),ROW(A2:A100)),IF(A2:A100=1,ROW(A2:A100))))
which is easily adapted for the present situation simply by changing 0 to 1 and 1 to <>1, filtering out the blanks and using Sequence instead of row number.
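The idea can be sketched outside the spreadsheet as well. The Python below is only an illustration: `frequency` is a hand-written stand-in that mimics the bucket behaviour of the spreadsheet function (values up to and including a cut point fall into that cut point's bucket, plus one overflow bucket), and `longest_run` is a hypothetical helper name.

```python
from bisect import bisect_left


def frequency(values, cut_points):
    """Count values per bucket: (prev cut, cut] for each cut, plus overflow."""
    cuts = sorted(cut_points)
    counts = [0] * (len(cuts) + 1)
    for value in values:
        # Index of the first cut point that is >= value.
        counts[bisect_left(cuts, value)] += 1
    return counts


def longest_run(column, target=1):
    """Longest run of `target` among the non-blank cells of a column."""
    non_blank = [cell for cell in column if cell != ""]
    # Positions of wanted cells become the values ...
    values = [i for i, cell in enumerate(non_blank, 1) if cell == target]
    # ... and positions of unwanted cells become the cut points.
    cuts = [i for i, cell in enumerate(non_blank, 1) if cell != target]
    return max(frequency(values, cuts))


col_a = [1, 1, 1, "", 1, ""]
col_b = [1, 1, 1, "", 2, 1]
print(longest_run(col_a) >= 4)
print(longest_run(col_b) >= 4)
```

For column A the blanks are dropped first, leaving four consecutive 1s, so the test passes. In column B the 2 acts as a cut point and splits the 1s into runs of three and one, so it fails.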
try:
=ARRAYFORMULA(IF(TRANSPOSE(MMULT(TRANSPOSE(N(REGEXMATCH(""&
IF(A1:B6="", 1, A1:B6), "1"))),
SEQUENCE(ROWS(A1:B6), 1)^0))/ROWS(A1:B6)=1, "Y", "N"))

How to use criteria based on a set of numbers with SUMIFS?

I have a table like this
A | B | C | Val
5 | 5 | 5 | 10
5 | 4 | 5 | 20
4 | 4 | 4 | 5
3 | 3 | 4 | 7
Is there any way sumifs can use criteria based on a set of numbers?
E.g. I want to sum the Val column based on these cells:
A: 5
B: 4,5 -> means get 4 or 5
C: 5
Then the result is 10 + 20 = 30
A: 3,4,5
B: 3,4
C: 4
Then the result is 5 + 7 = 12
You can combine SUMPRODUCT and MMULT functions to get dynamic formula:
=SUMPRODUCT(
($D$2:$D$5)
*MMULT(--($A$2:$A$5=$I$2:$K$2),ROW($A$1:$A$3)^0)
*MMULT(--($B$2:$B$5=$I$3:$K$3),ROW($A$1:$A$3)^0)
*MMULT(--($C$2:$C$5=$I$4:$K$4),ROW($A$1:$A$3)^0)
)
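As a rough model of what this formula computes, the Python sketch below replays it on the question's table. Each MMULT term counts how many allowed values a row's cell equals (an OR within one column), and multiplying the three counts gives an AND across columns. The names are hypothetical, and the sketch assumes each criteria list has no duplicates, since a repeated value would be counted twice, just as in the formula.

```python
# (A, B, C, Val) rows from the question.
TABLE = [
    (5, 5, 5, 10),
    (5, 4, 5, 20),
    (4, 4, 4, 5),
    (3, 3, 4, 7),
]


def hits(cell, allowed):
    """Row-wise MMULT of the equality matrix: how many allowed values match."""
    return sum(cell == want for want in allowed)


def sum_with_sets(table, allowed_a, allowed_b, allowed_c):
    """SUMPRODUCT of Val with the three per-column hit counts."""
    total = 0
    for a, b, c, val in table:
        total += val * hits(a, allowed_a) * hits(b, allowed_b) * hits(c, allowed_c)
    return total


print(sum_with_sets(TABLE, [5], [4, 5], [5]))
print(sum_with_sets(TABLE, [3, 4, 5], [3, 4], [4]))
```

The two calls correspond to the two examples in the question, where the expected sums are 30 and 12.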
There's no simple way to do it with SUMIFS as it does not let you express OR conditions with Array literals like in Excel.
However, Google Sheets does offer the QUERY function.
Assume that the criteria for A are in H1:1, for B in H2:2 and for C in H3:3:
=QUERY(A:D,"select sum(D) where
(A="&TEXTJOIN(" or A=",1,H1:1)&") and
(B="&TEXTJOIN(" or B=",1,H2:2)&") and
(C="&TEXTJOIN(" or C=",1,H3:3)&")")
This gives you a header too. To get just the sum you can INDEX it:
=INDEX(
QUERY(A:D,"select sum(D) where
(A="&TEXTJOIN(" or A=",1,H1:1)&") and
(B="&TEXTJOIN(" or B=",1,H2:2)&") and
(C="&TEXTJOIN(" or C=",1,H3:3)&")")
,2)
With this, you don't have to worry about headers or the number of criteria.
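It can help to see the query text that the TEXTJOIN calls produce. The Python sketch below builds the same string with `str.join`; `build_query` is a made-up helper, and the sketch assumes numeric criteria (text criteria would need quoting inside the query).

```python
def build_query(criteria):
    """Build 'select sum(D) where (A=.. or A=..) and (B=..) ...' from a dict."""
    groups = []
    for column, values in criteria.items():
        # Same effect as TEXTJOIN(" or A=", 1, H1:1) prefixed with "(A=".
        joined = f" or {column}=".join(str(value) for value in values)
        groups.append(f"({column}={joined})")
    return "select sum(D) where " + " and ".join(groups)


print(build_query({"A": [3, 4, 5], "B": [3, 4], "C": [4]}))
```

For the second example in the question this prints `select sum(D) where (A=3 or A=4 or A=5) and (B=3 or B=4) and (C=4)`, which is the string QUERY receives.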
For a way to do it without QUERY see #basic's solution.
I think you can do it with relatively simple SUMPRODUCT(MMULT(... as shown here on this sheet.
=SUMPRODUCT(E4:E,MMULT(N(REGEXMATCH(TO_TEXT(B4:D),B2:D2)),SEQUENCE(3,1,1,0))=3)
Note the use of the number 3 in two places which is indicative of how wide your data range is. It can be made dynamic, but it's easier to see how the formula works with the 3s as "hard" numbers for now.

Can logistic regression be used for variables containing lists?

I'm pretty new to machine learning and I was wondering whether certain algorithms/models (e.g. logistic regression) can handle lists as values for their variables. Until now I've always used pretty standard datasets, where you have a couple of variables, associated values and then a classification for that set of values (see example 1). However, I now have a similar dataset but with lists for some of the variables (see example 2). Is this something logistic regression models can handle, or would I have to do some kind of feature extraction to transform this dataset into a normal dataset like example 1?
Example 1 (normal):
+---+------+------+------+-----------------+
| | var1 | var2 | var3 | classification |
+---+------+------+------+-----------------+
| 1 | 5 | 2 | 526 | 0 |
| 2 | 6 | 1 | 686 | 0 |
| 3 | 1 | 9 | 121 | 1 |
| 4 | 3 | 11 | 99 | 0 |
+---+------+------+------+-----------------+
Example 2 (lists):
+-----+-------+--------+---------------------+-----------------+--------+
| | width | height | hlines | vlines | class |
+-----+-------+--------+---------------------+-----------------+--------+
| 1 | 115 | 280 | [125, 263, 699] | [125, 263, 699] | 1 |
| 2 | 563 | 390 | [11, 211] | [156, 253, 399] | 0 |
| 3 | 523 | 489 | [125, 255, 698] | [356] | 1 |
| 4 | 289 | 365 | [127, 698, 11, 136] | [458, 698] | 0 |
| ... | ... | ... | ... | ... | ... |
+-----+-------+--------+---------------------+-----------------+--------+
To provide some additional context on my specific problem: I'm attempting to represent drawings. Drawings have a width and height (regular variables), but drawings also have a set of horizontal and vertical lines, for example (represented as a list of their coordinates on their respective axis). This is what you see in example 2. The actual dataset I'm using is even bigger, also containing variables which hold lists containing the thicknesses for each line, lists containing the extension for each line, lists containing the colors of the spaces between the lines, etc. In the end I would like my logistic regression to pick up on what results in nice drawings. For example, if there are too many lines too close together, the drawing is not nice. The model should pick up by itself on these 'characteristics' of what makes a nice or a bad drawing.
I didn't include these as the way this data is set up is a bit confusing to explain, and if I can solve my question for the above dataset I feel like I can use the principle of this solution for the remaining dataset as well. However, if you need additional (full) details, feel free to ask!
Thanks in advance!
No, it cannot directly handle that kind of input structure. The input must be a homogeneous 2D array. What you can do is come up with new features that capture some of the relevant information contained in the lists. For instance, for the lists that contain the coordinates of the lines along an axis, new features (other than the actual values themselves) could be the spacing between lines, the total number of lines, or some statistics such as the mean location.
So the way to deal with this is through feature engineering. This is in fact, something that has to be dealt with in most cases. In many ML problems, you may not only have variables which describe a unique aspect or feature of each of the data samples, but also many of them might be aggregates from other features or sample groups, which might be the only way to go if you want to consider certain data sources.
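As a minimal sketch of that kind of feature engineering, the Python below turns each variable-length list into a fixed set of numbers (count, mean position, smallest gap, mean gap), so every drawing becomes a row of the same length. The feature choice, the function names and the `no_gap` placeholder for lists with fewer than two lines are all assumptions made for illustration, not a recommendation for this particular dataset.

```python
from statistics import mean


def line_features(coords, no_gap=0.0):
    """Summarise a list of line coordinates as fixed-size numeric features."""
    ordered = sorted(coords)
    # Distances between neighbouring lines along the axis.
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return {
        "count": len(ordered),
        "mean_pos": mean(ordered) if ordered else 0.0,
        # no_gap is an arbitrary placeholder when there is no pair of lines.
        "min_gap": min(gaps) if gaps else no_gap,
        "mean_gap": mean(gaps) if gaps else no_gap,
    }


def flatten(drawing):
    """Turn one drawing (dict with list-valued fields) into a flat numeric row."""
    row = [drawing["width"], drawing["height"]]
    for key in ("hlines", "vlines"):
        feats = line_features(drawing[key])
        row += [feats["count"], feats["mean_pos"], feats["min_gap"], feats["mean_gap"]]
    return row


# Row 4 of example 2 in the question.
drawing = {"width": 289, "height": 365,
           "hlines": [127, 698, 11, 136], "vlines": [458, 698]}
print(flatten(drawing))
```

Rows built this way all have ten columns, so they can be stacked into the homogeneous 2D array that a logistic regression expects. A feature such as `min_gap` is one way to expose "lines too close together" to the model.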
Wow, great question. I had never considered this, but having seen other people's responses, I would have to concur, 100%. Convert the lists into a data frame and run your code on that object.
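import pandas as pd

# The first inner list holds the column names; the rest are data rows.
data = [["col1", "col2", "col3"], [0, 1, 2], [3, 4, 5]]
column_names = data.pop(0)  # remove the header row and keep it for the columns
df = pd.DataFrame(data, columns=column_names)
print(df)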
Result:
   col1  col2  col3
0     0     1     2
1     3     4     5
You can easily do any multi regression on the fields/features of the data frame and you'll get what you need. See the link below for some ideas of how to get started.
https://pythonfordatascience.org/logistic-regression-python/
Post back if you have additional questions related to this. Or, start a new post if you have similar, but unrelated, questions.

How to count distinct errors in google sheets/excel (NON-SQL)?

Column A (Error.Type) |
(Or any other text) | Column B (Message)
> 1 | #REF! |
> 1 | #REF! | total for error 1 is 2
> 2 | #DIV/0! |
> 2 | "Hello World" | total for error 2 is 2
> 3 | "Foobar" |
> 3 | "Something" |
> 3 | "Else" | total for error 3 is 3
// the number is based on error.type, there are other columns
The grand total I need to return is 2 + 2 + 3 = 7
=(countif(k:k,k1)>1) <- this code will return all under conditional formatting
but =count((countif(k:k,k1)>1)) is a circular dependency and I don't know why?
A circular dependency occurs when a formula includes a reference to the cell that holds it. For example, if cell A1 contains =A1, the result is that error message.
=count((countif(k:k,k1)>1)) will return the circular dependency error if it's included on any cell on the K column, on any other column it should work fine.
It sounds like what you need is the COUNTIFS() formula. For example, the following function will return TRUE for all the rows except the ones with 1 and #REF! in your example, thus giving you your uniques.
=COUNTIFS($A:$A,$A1,$B:$B,$B1)=1
Returning this TRUE value for conditional formatting will apply whatever formatting you need to the range it is applied to. This assumes that whatever value you have in K isn't really needed.
For distinct counts, you can use a helper column (say somewhere in the far right) like this. It will return 1 for each distinct.
| D |
=COUNTIFS(A$1:A1,A1,B$1:B1,B1)
Then this formula will get the total distincts.
=COUNTIF(D:D,1)
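The helper column is a running count: each row counts how many times its own (A, B) pair has appeared so far, so only the first occurrence gets a 1. Below is a small Python sketch of that logic, using the rows from the question with the error values written as plain strings (an assumption made for illustration). One difference worth noting: this comparison is case-sensitive, whereas text criteria in COUNTIFS are, as far as I know, matched without regard to case.

```python
def running_counts(pairs):
    """Like =COUNTIFS(A$1:A1,A1,B$1:B1,B1) filled down: occurrences so far."""
    seen = {}
    counts = []
    for pair in pairs:
        seen[pair] = seen.get(pair, 0) + 1
        counts.append(seen[pair])
    return counts


def distinct_total(pairs):
    """Like =COUNTIF(D:D,1): number of rows that are a first occurrence."""
    return running_counts(pairs).count(1)


rows = [
    (1, "#REF!"), (1, "#REF!"),
    (2, "#DIV/0!"), (2, "Hello World"),
    (3, "Foobar"), (3, "Something"), (3, "Else"),
]
print(running_counts(rows))
print(distinct_total(rows))
```

Only the second (1, #REF!) row gets a running count of 2, so the helper approach reports six distinct pairs for this data.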
