How to define a custom similarity measure - machine-learning

I need some help defining a custom similarity measure.
I have a dataset whose elements are defined by 4 attributes.
As an example, consider the following two items:
Element 1:
A1: "R1", "R3", "R4", "R7"
A2: "H1"
A3: "F1", "F2"
A4: "aaa", "bbb"
Element 2:
A1: "R1", "R2"
A2: "H1"
A3: "F1", "F2"
A4: "aaa", "bbb", "ccc", "ddd", "eee", "fff"
I have to implement a similarity measure that satisfies the following conditions:
1 - If the A2 value is the same, the two elements must belong to the same cluster.
2 - If two elements have at least one common value on A4, the two elements must belong to the same cluster.
I need to use a sort of weighted Jaccard measure. Is it mathematically correct to define a similarity measure that sums the Jaccard similarity of each attribute and then adds a high weight if conditions 1 and 2 are satisfied for A2 and A4?
If so, how can I transform the similarity matrix into a distance matrix?

(1) Distance = 1 - similarity. This is a common convention when the similarity lies in the [0, 1] range.
(2) Summing the distances of the attributes is valid, although you may wish to scale it back to the [0, 1] range.
(3) Putting a high weight is not correct for what you've described. If the A2 or A4 values show a match, simply set the distance to 0. The clustering is a requirement, not merely strong advice. Is there some other semantic in your distance function that keeps you from taking this route?
FYI, the basic requirements for the distance function D of a metric are:
D(a, a) = 0
D(a,b) = D(b,a)
D(a,b) + D(b,c) >= D(a,c)
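As a rough illustration of points (1) to (3), here is a minimal Python sketch. It assumes each element is a dict mapping attribute names to sets, uses equal weights by default, and averages the per-attribute Jaccard similarities so the result stays in [0, 1]. The names `jaccard`, `soft_distance` and `distance` are made up for this example. Note that forcing the distance to 0 on a match can break the triangle inequality (a matches b on A2, b matches c on A4, while a and c share nothing), so the result is a dissimilarity rather than a true metric.

```python
ATTRS = ["A1", "A2", "A3", "A4"]

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def soft_distance(x, y, weights=None):
    """1 - weighted mean of per-attribute Jaccard similarities (in [0, 1])."""
    weights = weights or {k: 1.0 for k in ATTRS}  # assumption: equal weights
    total = sum(weights[k] for k in ATTRS)
    sim = sum(weights[k] * jaccard(x[k], y[k]) for k in ATTRS) / total
    return 1.0 - sim

def distance(x, y, weights=None):
    """Distance with the two hard requirements from the question applied."""
    same_a2 = set(x["A2"]) == set(y["A2"])
    shares_a4 = bool(set(x["A4"]) & set(y["A4"]))
    if same_a2 or shares_a4:
        return 0.0  # must end up in the same cluster
    return soft_distance(x, y, weights)

e1 = {"A1": {"R1", "R3", "R4", "R7"}, "A2": {"H1"},
      "A3": {"F1", "F2"}, "A4": {"aaa", "bbb"}}
e2 = {"A1": {"R1", "R2"}, "A2": {"H1"},
      "A3": {"F1", "F2"}, "A4": {"aaa", "bbb", "ccc", "ddd", "eee", "fff"}}
print(soft_distance(e1, e2), distance(e1, e2))
```

For the two example elements the hard rule applies (same A2, shared A4 values), so `distance` returns 0 regardless of the softer Jaccard part.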

Related

Converting formula to ARRAYFORMULA issues with SUM and INDEX

I have a scoring spreadsheet for a competition I'm working on. Competitors' places/ranks are converted into points towards the overall series based on a chart of corresponding values. For ties, the sum of the points covered by all of the tied places is split evenly among the tied competitors (e.g. for a 2-way tie for 3rd, if 3rd usually gets 10 points and 4th usually gets 8, each tied competitor receives (10+8)/2 = 9 points, 2 being the number of tied competitors).
I have a formula which does this exact calculation:
=IFERROR(IF(ISBLANK($A4:$A),,SUM(INDEX(SeriesPoints, E4:E):INDEX(SeriesPoints, MIN(E4:E + COUNTIF(E$4:E, E4:E) - 1, ROWS(SeriesPoints)))) / COUNTIF(E$4:E, E4:E), 0))
Where 'SeriesPoints' is a 2 column array; column 1 is the places/ranks (1:125) and column 2 is their corresponding point values. Column 'E' is the competitors' rank from the competition.
I have been unable to convert this formula to an ARRAYFORMULA() so I can avoid dragging it down the entire sheet (possibly up to 1000+ competitors over the series).
I'm mildly proficient with MMULT(), so I understood that would be a good approach for switching out SUM(), however, I haven't been able to create a matrix of the values to be summed.
INDEX():INDEX() doesn't work with ARRAYFORMULA() so I've tried switching to VLOOKUP(). With VLOOKUP() I've been able to produce the start and end values of the range of values for a tie, but not the full list. For example, if there is a 3-way tie for 4th, I can produce the respective points for 4th and 6th (the bounds of the tie).
In an attempt to list out even just the numbers from 4:6, I've hit a wall converting what would be a simple ROW() or SEQUENCE() formula to a matrix/array.
The following formula produces an array of the upper and lower bounds of ties or the single place should there be no tie, although the single place gets repeated.
=ARRAYFORMULA(IF(COUNTIF(E$4:E,E4:E)=1,E4:E,{E4:E,E4:E+COUNTIF(E$4:E,E4:E)-1}))
I'm assuming if I can get VLOOKUP({#:#}) to fill properly, I'll be where I need to be.
From here, I feel confident in my abilities to wrap a VLOOKUP() for the actual point values, an MMULT() to sum across these rows for the total, then a simple division to produce the correct point value.
Spreadsheet: https://docs.google.com/spreadsheets/d/1lpNewR3p4i7ZHmlFGLlG1tLuxgO-6onSeH8mWTeclBw/edit?usp=sharing
Currently, my workspace is off to the right. The original formula is in F4 and my test formulas work on column G instead of E.
So for sample placements of 1, 1, 3, 3, 3, 6, 7, 8 and sample point values of 1000, 850, 738, 663, 633, 603, 573, 550, I expect the output to be 925 for the two competitors tied for 1st, 678 for the three tied for 3rd, 603 for 6th, 573 for 7th, and 550 for 8th.
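To make the expected numbers easy to check outside the spreadsheet, here is a small Python sketch of the same tie-splitting rule. It assumes `points[i]` holds the points for place `i + 1` (the second column of SeriesPoints); the function name `tie_points` is made up for this example.

```python
from collections import Counter

def tie_points(ranks, points):
    """Split the points of all places covered by a tie evenly among the tied."""
    counts = Counter(ranks)
    result = []
    for r in ranks:
        n = counts[r]
        # Places r .. r+n-1, clipped to the table length
        # (mirrors MIN(..., ROWS(SeriesPoints)) in the original formula).
        covered = points[r - 1 : min(r + n - 1, len(points))]
        result.append(sum(covered) / n)
    return result

ranks = [1, 1, 3, 3, 3, 6, 7, 8]
points = [1000, 850, 738, 663, 633, 603, 573, 550]
print(tie_points(ranks, points))
```

With the sample data this gives 925 for the two 1st places, 678 for the three 3rd places, and 603, 573, 550 for the rest.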
I'd appreciate any and all help!
=ARRAYFORMULA(IFERROR(IFERROR(VLOOKUP(G4:G, QUERY({INDIRECT("G4:G"&counta(A4:A)+3),
VLOOKUP(ROW(INDIRECT("A1:A"&COUNTA(A4:A))), SeriesPoints, 2, 0)},
"select Col1,sum(Col2) group by Col1 label sum(Col2)''", 0), 2, 0))/
IFERROR(VLOOKUP(G4:G, QUERY(G4:G,
"select G,count(G) where G is not NULL group by G label count(G)''", 0), 2, 0))))

How to get weighted average while skipping rows with a string on Google Sheets?

I want to get the weighted average of the values in column A with the weights in column B. The problem is that column A might contain string values, and I want to skip those rows in the calculation. Unlike =AVERAGEIF, the function =AVERAGE.WEIGHTED has no built-in way to do this.
How do I do it? And how would I do it if column B could also have strings (for future proofing)?
=AVERAGE.WEIGHTED(FILTER(A:A,ISNUMBER(A:A)),FILTER(B:B,ISNUMBER(A:A)))
=SUM(FILTER(A:A*B:B / sum(FILTER(B:B,ISNUMBER(A:A))),ISNUMBER(A:A)))
=SUM(FILTER(A:A*B:B / (sum(B:B) - SUMIF(A:A,"><", B:B)),ISNUMBER(A:A)))
In case both columns contain strings, add one more condition to each formula:
=AVERAGE.WEIGHTED(FILTER(A:A,ISNUMBER(A:A),ISNUMBER(B:B)),FILTER(B:B,ISNUMBER(A:A), ISNUMBER(B:B)))
So let us suppose our numbers (we hope) sit in A2:A4 and the weights are in B2:B4. In C2, place
=if(and(ISNUMBER(A2),isnumber(B2)),A2,"")
and drag that down to C4 to keep only actual numbers for which we have weights.
Similarly in D2 (and drag to D4), use
=if(and(ISNUMBER(A2),isnumber(B2)),B2,"")
and in E2 (and drag to E4) we need
=if(and(ISNUMBER(C2),isnumber(D2)),C2*D2,"")
then our weighted average will be:
=iferror(sum(E2:E4)/sum(D2:D4))
You can hide the working columns or place them in the middle of nowhere if you wish.
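The logic shared by all of the formulas above (keep a row only if both the value and its weight are numbers, then divide the sum of the products by the sum of the kept weights) can be sketched in Python as follows. The function names are made up for this example, and returning `None` when no usable weight remains is an assumption standing in for the blank that IFERROR produces.

```python
from numbers import Real

def is_number(v):
    """True for ints and floats, False for strings, None and booleans."""
    return isinstance(v, Real) and not isinstance(v, bool)

def weighted_average(values, weights):
    # Keep only rows where both the value and the weight are numeric.
    pairs = [(v, w) for v, w in zip(values, weights)
             if is_number(v) and is_number(w)]
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        return None  # nothing usable to average
    return sum(v * w for v, w in pairs) / total_weight

print(weighted_average([10, "n/a", 30], [1, 5, 3]))
```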

How to get the sum of a column up to a certain value?

I have a google sheet that I am using to try and calculate leveling and experience points. Column A has the level and Column B has the exp needed to reach the next level. i.e. To get to Level 3 you need 600 exp.
A B
1 200
2 400
3 600
...
99 19800
In cell I2 I have an integer for an amount of exp (e.g. 2000); in cell J2 I want to figure out what level someone would be at if they started from 0.
Put this in column J and drag down as required. ROUNDDOWN(I2,-2) rounds I2 down to the nearest 100. INDEX/MATCH finds a match in column B and returns the value in column A of the matched row.
=index(A2:A100,match(ROUNDDOWN(I2,-2),B2:B100,0))
Using a helper column (for example Z): put =sum(B$1:B1) in cell Z1 and drag down. This computes the cumulative sum required for each level. In J2, use the formula
=vlookup(I2, {Z:Z, A:A}, 2) + 1
which looks up I2 in column Z and returns the level in column A for the nearest match that is less than or equal to the search key. It adds 1 to find the level that would be reached, because your table is offset: the entry against level N is about achieving level N+1.
You may want to put 0 0 on top of the table to correctly handle amounts under 200, or treat them with a separate IF condition.
Using algebra
In your specific scenario, the point amount required for level N can be computed as
200*(1+2+3+...+N-1) = 200*(N-1)*N/2 = 100*(N-1/2)^2 - 25
So, given x amount of points, we can find N directly with algebra:
N = floor(sqrt((x+25)/100)+1/2)
which means that the formula
=floor(sqrt((I2 + 25) / 100) + 1/2)
will have the desired effect in cell J2, without the need for an extra column and vlookup.
However, the second approach only works for these specific point values.
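Both approaches can be cross-checked with a short Python sketch. It assumes the table from the question (200, 400, ..., 19800 exp per level) and that a player starts at level 1 with 0 exp; `level_by_lookup` and `level_by_formula` are hypothetical names.

```python
import math
from bisect import bisect_right
from itertools import accumulate

def level_by_lookup(exp, per_level):
    """Cumulative-sum lookup, like the helper column plus VLOOKUP."""
    cumulative = list(accumulate(per_level))  # total exp needed for level 2, 3, ...
    # Number of thresholds already reached, plus the starting level 1.
    return bisect_right(cumulative, exp) + 1

def level_by_formula(exp):
    """Closed form N = floor(sqrt((x + 25) / 100) + 1/2)."""
    return math.floor(math.sqrt((exp + 25) / 100) + 0.5)

per_level = [200 * n for n in range(1, 100)]
print(level_by_lookup(2000, per_level), level_by_formula(2000))
```

The lookup version keeps working if the exp table changes, while the closed form is tied to the 200 * N pattern.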

Compute subranks in spreadsheet column in combination with ArrayFormula (Google Sheets)

I'm trying to find the inverse rank within categories using an ArrayFormula. Let's suppose a sheet containing
A B C
---------- -----
1 0.14 2
1 0.26 3
1 0.12 1
2 0.62 2
2 0.43 1
2 0.99 3
Columns A:B are input data, with an unknown number of useful rows filled-in manually. A is the classifier categories, B is the actual measurements.
Column C is the inverse ranking of B values, grouped by A. This can be computed for a single cell, and copied to the rest, with e.g.:
=1+COUNTIFS($B$2:$B,"<" & $B2, $A$2:$A, "=" & $A2)
However, if I try to use ArrayFormula:
=ARRAYFORMULA(1+COUNTIFS($B$2:$B,"<" & $B2:$B, $A$2:$A, "=" & $A2:$A))
It only computes one row, instead of filling all the data range.
A solution using COUNT(FILTER(...)) instead of COUNTIFS fails likewise.
I want to avoid copy/pasting the formula since the rows may grow in the future and forgetting to copy again could cause obscure miscalculations. Hence I would be glad for help with a solution using ArrayFormula.
Thanks.
I don't see a solution with array formulas available in Sheets. Here is an array solution with a custom function, =inverserank(A:B). The function, given below, should be entered in Script Editor (Tools > Script Editor). See Custom Functions in Google Sheets.
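function inverserank(arr) {
  // Drop empty rows (where the category in column A is blank).
  arr = arr.filter(function(r) {
    return r[0] !== "";
  });
  // For each row r1, count the rows r2 in the same category with a smaller value.
  return arr.map(function(r1) {
    return arr.reduce(function(rank, r2) {
      return rank + (r2[0] == r1[0] && r2[1] < r1[1] ? 1 : 0);
    }, 1); // start at 1 so the smallest element gets rank 1
  });
}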
Explanation: the two-dimensional array of values in A:B is
(1) filtered, to get rid of empty rows (where the A entry is blank);
(2) mapped, by a function that takes every row r1 and then
(3) reduces the array, counting each row (r2) only if it has the same category as r1 and a smaller value. It returns the count plus 1, so the smallest element gets rank 1.
No tie-breaking is implemented: for example, if there are two smallest elements, they both get rank 1, and there is no rank 2; the next smallest element gets rank 3.
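For readers who want to try the counting logic outside Apps Script, here is a minimal Python sketch of the same idea. It assumes each row is a (category, value) pair; `inverse_rank` is a made-up name. Like the custom function, it is quadratic in the number of rows and gives tied values the same rank.

```python
def inverse_rank(rows):
    """Rank of each value within its category; the smallest value gets rank 1."""
    # Skip empty rows (blank category), as the custom function does.
    rows = [r for r in rows if r[0] != ""]
    return [
        1 + sum(1 for c2, v2 in rows if c2 == c1 and v2 < v1)
        for c1, v1 in rows
    ]

data = [(1, 0.14), (1, 0.26), (1, 0.12), (2, 0.62), (2, 0.43), (2, 0.99)]
print(inverse_rank(data))
```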
Well this does give an answer, but I had to go through a fairly complicated manoeuvre to find it:
=ArrayFormula(iferror(VLOOKUP(row(A2:A),{sort({row(A2:A),A2:B},2,1,3,1),row(A2:A)},4,false)-rank(A2:A,A2:A,true),""))
So
Sort cols A and B with their row numbers.
Use a lookup to find where those sorted row numbers now are: their position gives the rank of that row in the original data plus 1 (3,4,2,6,5,7).
Return the new row number.
Subtract the rank obtained just by ranking on column A (1,1,1,4,4,4) to get the rank within each group.
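The same sort-then-subtract idea can be sketched in Python. It assumes rows are (category, value) pairs with comparable categories; `rank_within_group` is a hypothetical name. Each row's 1-based position in the data sorted by category and then value, minus the number of rows in smaller categories, is its rank inside its own group.

```python
def rank_within_group(rows):
    # Position of every row after sorting by (category, value), 1-based.
    order = sorted(range(len(rows)), key=lambda i: (rows[i][0], rows[i][1]))
    position = {i: p for p, i in enumerate(order, start=1)}
    # Subtract the number of rows that belong to smaller categories.
    return [
        position[i] - sum(1 for c, _ in rows if c < rows[i][0])
        for i in range(len(rows))
    ]

data = [(1, 0.14), (1, 0.26), (1, 0.12), (2, 0.62), (2, 0.43), (2, 0.99)]
print(rank_within_group(data))
```

Unlike the counting approach, sort positions are all distinct, so tied values receive consecutive ranks rather than a shared one.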
In the particular case where the classifiers (col A) are whole numbers and the measurements (col B) are fractions, you could just add the two columns and use rank:
=ArrayFormula(iferror(rank(A2:A+B2:B,if(A2:A<>"",A2:A+B2:B),true)-rank(A2:A,A2:A,true)+1,""))
My version of an array formula; it also works when column A contains text:
=ARRAYFORMULA(RANK(ARRAY_CONSTRAIN(VLOOKUP(A1:A,{UNIQUE(FILTER(A1:A,A1:A<>"")),ROW(INDIRECT("a1:a"&COUNTUNIQUE(A1:A)))},2,)*1000+B1:B,COUNTA(A1:A),1),ARRAY_CONSTRAIN(VLOOKUP(A1:A,{UNIQUE(FILTER(A1:A,A1:A<>"")),ROW(INDIRECT("a1:a"&COUNTUNIQUE(A1:A)))},2,)*1000+B1:B,COUNTA(A1:A),1),1) - COUNTIF(A1:A,"<"&OFFSET(A1,,,COUNTA(A1:A))))

An idea for a complex function that finds in a sorted list of integers a number not bigger than a given one

I have, say, in A1 a text containing a sorted (possibly reversed) list of integers separated by some non-digit characters, for example "10, 123, 230, 750, 1034, 2003, 10101"; in B1 I have an integer n. I need a formula, not involving other cells, that returns:
n if n belongs to the list in A1;
otherwise, if n is not bigger than the maximum value in A1, the smallest value in A1 that is bigger than n (e.g., for n = 567 the returned value must be 750);
otherwise, an error.
In my opinion, the only way to solve the problem involves regexp substitution (which Google Sheets supports), but so far I can't find a reasonable way to proceed.
Does anyone have a (different) idea?
Please try:
=index(SORT(TRANSPOSE(SPLIT(A1,", ")),1,0),
MATCH(B1,SORT(TRANSPOSE(SPLIT(A1,", ")),1,0),-1))
in this formula I used search_type = -1 for match function:
MATCH(search_key, range, search_type)
search_key - The value to search for. For example, 42, "Cats", or I24.
range - The one-dimensional array to be searched. If a range with both height and width greater than 1 is used, MATCH will return #N/A!.
search_type - [ OPTIONAL - 1 by default ] - The manner in which to search.
1, the default, causes MATCH to assume that the range is sorted in ascending order and return the largest value less than or equal to search_key.
0 indicates exact match, and is required in situations where range is not sorted.
-1 causes MATCH to assume that the range is sorted in descending order and return the smallest value greater than or equal to search_key.
Simplify the case
Suppose you have a cell with text that is already sorted in descending order.
The formula would then be:
=index(SPLIT(A1,", "),MATCH(B1,SPLIT(A1,", "),-1))
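The lookup that MATCH performs with search_type = -1 (the smallest value greater than or equal to the key) can be sketched in Python with a binary search. This is only an illustration of the logic, not a Sheets formula; `smallest_at_least` is a made-up name, and it assumes the list contains non-negative integers separated by any non-digit characters.

```python
import re
from bisect import bisect_left

def smallest_at_least(text, n):
    """Return n if it is in the list, else the next bigger value, else raise."""
    # Extract the integers and sort ascending, so the input order does not matter.
    values = sorted(int(tok) for tok in re.findall(r"\d+", text))
    i = bisect_left(values, n)  # index of the first value >= n
    if i == len(values):
        raise ValueError(f"{n} is bigger than every value in the list")
    return values[i]

print(smallest_at_least("10, 123, 230, 750, 1034, 2003, 10101", 567))
```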
Please try:
=if(isnumber(find(B1,A1)),B1,index(split(A1,","),match(B1,split(A1,","),1)+1))
The above won't work for numbers lower than the first one, but if required it could be expanded to:
=if(B1<1*left(A1,find(",",A1)-1),1*left(A1,find(",",A1)-1),if(isnumber(find(B1,A1)),B1,index(split(A1,","),match(B1,split(A1,","),1)+1)))
