Calculate Positional Difference based on row for string values for two tables - google-sheets

Table 1:
Position
Team
1
MCI
2
LIV
3
MAN
4
CHE
5
LEI
6
AST
7
BOU
8
BRI
9
NEW
10
TOT
Table 2
Position
Team
1
LIV
2
MAN
3
MCI
4
CHE
5
AST
6
LEI
7
BOU
8
TOT
9
BRI
10
NEW
Output I'm looking for is
Position difference = 10 as that is the total of the positional difference. How can I do this in excel/google sheets? So the positional difference is always a positive even if it goes up or down. Think of it as a league table.
Table 2 New (using formula to find positional difference):
Position
Team
Positional Difference
1
LIV
1
2
MAN
1
3
MCI
2
4
CHE
0
5
AST
1
6
LEI
1
7
BOU
0
8
TOT
2
9
BRI
1
10
NEW
1

Try this:
=IFNA(ABS(INDEX(A:B,MATCH(E2,B:B,0),1)-D2),"-")
Assuming that table 1 is at columns A:B:

Related

How To Skip Down by 1 Row/Cell The Formula Output and Remove The Last Sequential Output Before 1's Google Sheets?

I've got these 3 groups of data in range F2:G22 as below
(3 groups as minimal example, in reality many thousands of groups, and recurrent similar datasets expected in the future):
I need to number each group's rows sequentially, starting over at 1 at each new group.
The expected result would be like in range E1:E22.
I tried the following formula n cell C2 , then in cell D3:
=INDEX(IF(A2:A22="",COUNTIFS(B2:B22&A2:A22, B2:B22&A2:A22, ROW(B2:B22), "<="&ROW(B2:B22)),1))
In C2:
In D3:
That fixed partially the sequence issue, but there's still 2 issues I can't find remedy for.
1st remaining issue:
I'd prefer not having to manually do the C2 to D3 step each time I get new similar data (but would accomodate if there's no simple solution to this issue).
Is there a simple way to modify the formula to make it output the correct sequencing from C2 ?
2nd remaining issue:
At rows 7, 14 and 23 there still remain unecessary ending numbering for these intermediary rows in D7 , D14 , and D23:
I could only think of an extra manual step of filtering out the non-blank rows in Column A to fix this 2nd issue (i.e. Highlighting Column A > Data tab > Create Filter > Untick all > Tick Blanks > Copy All > Paste In new Sheet).
But would there be a way to do it in the same formula? I'm not seeing the way to add the proper filter or using another method in the formula.
Any help is greatly appreciated.
EDIT (Sorry for Forgotten Sample):
Formula Input A
Formula Input B
Formula Output 1
Formula Output 2
EXPECTED RESULT
rockinfreakshow
ztiaa
DATA
DATA BY GROUP
7
1
1
7
7
2
1
1
1
2
Element-1
Group-1
7
3
2
2
2
3
Element-2
Group-1
7
4
3
3
3
4
Element-3
Group-1
7
5
4
4
4
5
Element-4
Group-1
8
1
5
6
8
8
2
1
1
1
7
Element-1
Group-2
8
3
2
2
2
8
Element-2
Group-2
8
4
3
3
3
9
Element-3
Group-2
8
5
4
4
4
10
Element-4
Group-2
8
6
5
5
5
11
Element-5
Group-2
8
7
6
6
6
12
Element-6
Group-2
9
1
7
13
9
9
2
1
1
1
14
Element-1
Group-3
9
3
2
2
2
15
Element-2
Group-3
9
4
3
3
3
16
Element-3
Group-3
9
5
4
4
4
17
Element-4
Group-3
9
6
5
5
5
18
Element-5
Group-3
9
7
6
6
6
19
Element-6
Group-3
9
8
7
7
7
20
Element-7
Group-3
9
9
8
8
8
21
Element-8
Group-3
9
Can you try:
=INDEX(LAMBDA(y,z,
IF(LEN(z),COUNTIFS(y,y,ROW(z),"<="&ROW(z)),))
(LOOKUP(ROW(G2:G),FILTER(ROW(G2:G),BYROW(G2:G,LAMBDA(z,IF(z<>OFFSET(z,-1,0),row(z),0))))),G2:G))
You can simply use SCAN.
=SCAN(,G2:G,LAMBDA(a,c,IF(c="",,a+1)))
Sample sheet

How to use multi inputs or multi features for RNN or LSTM

I have a time series dataframe like blow, and the number in it is meaning less, and I have some problems when applying LSTM.
I have saw some demos of LSTM, mostly use this pattern: [y_{t-2},y_{t-1},y_{t}] to predict [y_{t+1}], but just as the dataframe blow, I also have featureA, featureB, featureC, so my quesiton is: how to use multi inputs or multi features for LSTM
time featureA featureB featureC target
1 2 5 6 1
2 4 1 7 3
3 6 2 1 5
4 2 4 0 7
5 7 6 1 5
6 9 3 2 8
7 1 2 3 5
8 2 9 5 10
9 1 10 7 6
10 3 2 2 11
For RNN/LSTM, it is more like this: [..., y_{t-2}(x_{t-2}), y_{t-1}(x_{t-1})] to predict [y_{t}(x_{t})]
Or more succinctly:
y_{t} = f(y_{t-1}, x_{t})
So in feed forward you still use your inputs x_{t} (i.e. your features) plus the outputs from previous timesteps to make the prediction at the current timestep.

Clustering unique datasets based on similarities (equality)

I just entered into the space of data mining, machine learning and clustering. I'm having special problem, and do not know which technique to use it for solving it.
I want to perform clustering of observations (objects or whatever) on specific data format. All variables in each observation is numeric. My data input looks like this:
1 2 3 4 5 6
1 3 5 7
2 9 10 11 12 13 14
45 1 22 23 24
Let's say that n represent row (observation, or 1D vector,..) and m represents column (variable index in each vector). n could be very large number, and 0 < m < 100. Also main point is that same observation (row) cannot have identical values (in 1st row, one value could appear only once).
So, I want to somehow perform clustering where I'll put observations in one cluster based on number of identical values which contain each row/observation.
If there are two rows like:
1
1 2 3 4 5
They should be clustered in same cluster, if there are no match than for sure not. Also number of each rows in one cluster should not go above 100.
Sick problem..? If not, just for info that I didn't mention time dimension. But let's skip that for now.
So, any directions from you guys,
Thanks and best regards,
JDK
Its hard to recommend anything since your problem is totally vague, and we have no information on the data. Data mining (and in particular explorative techniques like clustering) is all about understanding the data. So we cannot provide the ultimate answer.
Two things for you to consider:
1. if the data indicates presence of species or traits, Jaccard similarity (and other set based metrics) are worth a try.
2. if absence is less informative, maybe you should be mining association rules, not clusters
Either way, without understanding your data these numbers are as good as random numbers. You can easily cluster random numbers, and spend weeks to get the best useless result!
Can your problem be treated as a Bag-of-words model, where each article (observation row) has no more than 100 terms?
Anyway, I think your have to give more information and examples about "why" and "how" you want to cluster these data. For example, we have:
1 2 3
2 3 4
2 3 4 5
1 2 3 4
3 4 6
6 7 8
9 10
9 11
10 12 13 14
What is your expected clustering? How many clusters are there in this clustering? Only two clusters?
Before you give more information, according to you current description, I think you do not need a cluster algorithm, but a structure of connected components. The first round you process the dataset to get the information of connected components, and you need a second round to check each row belong to which connected components. Take the example above, first round:
1 2 3 : 1 <- 1, 1 <- 2, 1 <- 3 (all point linked to the smallest point to
represent they are belong to the same cluster of the smallest point)
2 3 4 : 2 <- 4 (2 and 3 have already linked to 1 which is <= 2, so they do
not need to change)
2 3 4 5 : 2 <- 5
1 2 3 4 : 1 <- 4 (in fact this change are not essential because we have
1 <- 2 <- 4, but change this can speed up the second round)
3 4 6 : 3 <- 6
6 7 8 : 6 <- 7, 6 <- 8
9 10 : 9 <- 9, 9 <- 10
9 11 : 9 <- 11
10 11 12 13 14 : 10 <- 12, 10 <- 13, 10 <- 14
Now we have a forest structure to represent the connected components of points. The second round you can easily pick up one point in each row (the smallest one is the best) and trace its root in the forest. The rows which have the same root are in the same, in your words, cluster. For example:
1 2 3 : 1 <- 1, cluster root 1
2 3 4 5 : 1 <- 1 <- 2, cluster root 1
6 7 8 : 1 <- 1 <- 3 <- 6, cluster root 1
9 10 : 9 <- 9, cluster root 9
10 11 12 13 14 : 9 <- 9 <- 10, cluster root 9
This process takes O(k) space where k is the number of points, and O(nm + nh) time, where r is the height of the forest structure, where r << m.
I am not sure if this is the result you want.

why the result of method mostSimilarItems in mahout is not order by the weight?

I have the following codes:
ItemSimilarity itemSimilarity = new UncenteredCosineSimilarity(dataModel);
recommender = new GenericItemBasedRecommender(dataModel,itemSimilarity);
List<RecommendedItem> items = recommender.mostSimilarItems(10, 5);
my datamodel is like this:
uid itemid socre
userid itemid score
1 6 5
1 10 3
1 11 5
1 12 4
1 13 5
2 2 3
2 6 5
2 10 3
2 12 5
when I run the code above,the result is just like this:
13
6
11
2
12
I debug the code,and find that the List items = recommender.mostSimilarItems(10, 5); return the items has the same score,that is one!
so,I have a problem.in my opinion,I think the mostsimilaritem should consider the item co-occurrence matrix:
2 6 10 11 12 13
2 0 1 1 0 1 0
6 1 0 2 1 2 1
10 1 2 0 1 2 1
11 0 1 1 0 1 1
12 1 2 2 1 0 1
13 0 1 1 1 1 0
in the matrix above ,the item 12's most similar should be [6,12,11,13,2],because the item 1 and item 12 is more similar than the other items,isn't it?
now,anyone who can explain this for me?thanks!
In your matrix you have much more data than in your input. In particular you seem to be imputing 0 values that are not in the data. That is why you are likely getting answers different from what you expect.
Mahout expects your IDs to be contiguous Integers starting from 0. This is true of your row and column ids. Your matrix looks like it has missing ids. Just having Integers is not enough.
Could this be the problem? Not sure what Mahout would do with the input above.
I always keep a dictionary to map Mahout IDs to/from my own.

Using COUNTIFS on 3 different columns and then need to SUM a 4th column?

I have written this formula below. I do not know the correct part of this formula that will add the numbers I have in Column AB2:AB552. As it is, this formula is counting the number of cells in that range that has numbers in it, but I need it to total those numbers as my final result. Any help would be great.
=COUNTIFS(Cases!B2:B552,"1",Cases!G2:G552,"c*",Cases!X2:X552,"No",**Cases!AB2:AB552,">0"**)
Assuming you don't actually need the intermediate counts, the sumifs function should give you the final result:
=SUMIFS(Cases!AB2:AB552,Cases!B2:B552,1,Cases!G2:G552,"c",Cases!X2:X552,"No",Cases!AB2:AB552,">0")
Testing this with some limited data:
Row B G X AB
2 2 a No 10
3 1 c No 24
4 2 c No 4
5 1 c No 0
6 1 a Yes 9
7 2 c No 12
8 2 c No 6
9 2 b No 0
10 1 b No 0
11 1 a No 10
12 2 c No 6
13 1 c No 20
14 1 c No 4
15 1 b Yes 22
16 1 b Yes 22
the formula above returned 48, the sum of AB3, AB13, and AB14, which were the only rows matching all 4 criteria

Resources