Join DataFrames on a partially matching index

I'm trying to find a more elegant way to join two DataFrames where the index levels of one DataFrame are a subset of the index levels of the other. This is a very common operation in SQL, and I'm surprised to find it's so difficult to do with pandas.
Here's an example:
import pandas as pd
df = pd.DataFrame(
    {
        2012: [4, 5, 8, 9],
        2013: [1, 2, 4, 7],
        2014: [6, 5, 4, 3],
    },
    index=pd.MultiIndex.from_tuples(
        [('apples', False), ('bananas', False), ('oranges', True), ('lemons', True)],
        names=('fruit', 'citrus'))
)
=>
2012 2013 2014
fruit citrus
apples False 4 1 6
bananas False 5 2 5
oranges True 8 4 4
lemons True 9 7 3
[4 rows x 3 columns]
Now I want to know the highest number of each fruit sold in any single year, and the same maximum for each citrus category:
fruit_max_by_date = df.max(axis=1).to_frame()
# max(level=...) was removed in pandas 2.0; groupby(level=...) is the equivalent
citrus_max_by_date = fruit_max_by_date.groupby(level='citrus').max()
citrus_max_by_date.columns = [1]
=>
fruit_max_by_date =
0
fruit citrus
apples False 6
bananas False 5
oranges True 8
lemons True 9
[4 rows x 1 columns]
citrus_max_by_date =
1
citrus
False 6
True 9
[2 rows x 1 columns]
So far so good. But now I try to join the latter two together:
fruit_max_by_date.join(citrus_max_by_date) =>
0 1
fruit citrus
apples False 6 NaN
bananas False 5 NaN
oranges True 8 NaN
lemons True 9 NaN
[4 rows x 2 columns]
Argh!! Because the index of the second table doesn't exactly match the index of the first table, the join fails. This seems totally contrary to the intuitive behavior of an SQL-like inner join.
All the workarounds below (especially the second) are butt-ugly and basically involve either throwing the index out the window, or manually broadcasting the index of one table. Is there a simpler way to do this?
Workaround: Expand the index of the smaller table through broadcasting
This is the least-ugly workaround I could come up with, but it's still quite bad in that it requires expanding the size of the second array for no good reason.
fruit_max_by_date.join(
citrus_max_by_date.reindex(fruit_max_by_date.index, level='citrus') ) =>
0 1
fruit citrus
apples False 6 6
bananas False 5 6
oranges True 8 9
lemons True 9 9
[4 rows x 2 columns]
Workaround: Truncate the index of the first table
This is horribly ugly, especially having to reassemble the index afterwards, but it works.
fruit_max_by_date \
    .reset_index(level='fruit') \
    .join(citrus_max_by_date) \
    .set_index('fruit', append=True) \
    .reorder_levels((1,0)) =>
0 1
citrus fruit
False apples 6 6
bananas 5 6
True oranges 8 9
lemons 9 9
[4 rows x 2 columns]
Drop all pretense of using an index, and join without index
Okay, this is relatively straightforward, but what exactly is the point of having an index if you can't use it?
With join (but not merge, FML!!) there is another bizarre side effect: the joined-on column is duplicated in the output:
fruit_max_by_date.reset_index().join(
citrus_max_by_date.reset_index(),
on='citrus', rsuffix='_' ) =>
fruit citrus 0 citrus_ 1
0 apples False 6 False 6
1 bananas False 5 False 6
2 oranges True 8 True 9
3 lemons True 9 True 9
[4 rows x 5 columns]
fruit_max_by_date.reset_index().merge(
citrus_max_by_date.reset_index(),
on='citrus' ) =>
fruit citrus 0 1
0 apples False 6 6
1 bananas False 5 6
2 oranges True 8 9
3 lemons True 9 9
[4 rows x 4 columns]
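If the goal is only to place each group's maximum next to every row, one option that avoids the join entirely is groupby(...).transform, which returns a result aligned to the original index. A minimal sketch, assuming the same df as in the question; the column names fruit_max and citrus_max are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {2012: [4, 5, 8, 9], 2013: [1, 2, 4, 7], 2014: [6, 5, 4, 3]},
    index=pd.MultiIndex.from_tuples(
        [("apples", False), ("bananas", False), ("oranges", True), ("lemons", True)],
        names=("fruit", "citrus"),
    ),
)

# Row-wise maximum across the year columns: one value per (fruit, citrus) pair.
fruit_max = df.max(axis=1)

# transform("max") computes the maximum per citrus group and broadcasts it
# back onto the full (fruit, citrus) index, so no join is needed.
citrus_max = fruit_max.groupby(level="citrus").transform("max")

# Hypothetical column names, chosen only for this example.
result = pd.DataFrame({"fruit_max": fruit_max, "citrus_max": citrus_max})
print(result)
```

This keeps the original MultiIndex intact, at the cost of never materializing the smaller per-citrus table.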

Related

Intercalate columns when they are in pairs

Using this table:
A   B   C   D
1   2   3   4
5   6   7   8
9   10  11  12
In Google Sheets, if I enter this in column E:
={A1:B3;C1:D3}
I get:
E   F
1   2
5   6
9   10
3   4
7   8
11  12
But the result I want is this:
E   F
1   2
3   4
5   6
7   8
9   10
11  12
I tried multiple options with FLATTEN, but none of them returned what I wanted.
Well you can try:
=WRAPROWS(TOCOL(A1:D3),2)
You could try with MAKEARRAY
=MAKEARRAY(ROWS(A1:D3)*2,2,LAMBDA(r,c,INDEX(FLATTEN(A1:D3),c+(r-1)*2)))
GENERAL ANSWER
For you or anyone else: to do something similar with a variable number of source or destination columns, you can use this formula, changing the range and the number of columns in the arguments at the end of the LAMBDA call:
=LAMBDA(range,cols,MAKEARRAY(ROWS(range)*ROUNDUP(COLUMNS(range)/cols),cols,LAMBDA(r,c,IFERROR(INDEX(FLATTEN(range),c+(r-1)*cols)))))(A1:D3,2)
You can do:
={FLATTEN({A1:A3, C1:C3}), FLATTEN({B1:B3, D1:D3})}
For more columns, it could be automated with MOD.
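All of the formulas above rely on the same idea: read the range row by row into a single list, then cut that list into rows of two. The logic can be sketched outside of Sheets in plain Python; wrap_rows is an illustrative helper name, not a Sheets or library function:

```python
def wrap_rows(table, cols):
    """Flatten a table row by row, then re-wrap it into rows of `cols` values."""
    flat = [value for row in table for value in row]
    return [flat[i:i + cols] for i in range(0, len(flat), cols)]


# Same values as the A1:D3 range in the question.
table = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]

pairs = wrap_rows(table, 2)
for row in pairs:
    print(row)
```

Unlike WRAPROWS, this sketch does not pad the last row when the number of values is not a multiple of cols; it simply leaves that row shorter.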

Calculate Positional Difference based on row for string values for two tables

Table 1:
Position  Team
1         MCI
2         LIV
3         MAN
4         CHE
5         LEI
6         AST
7         BOU
8         BRI
9         NEW
10        TOT
Table 2:
Position  Team
1         LIV
2         MAN
3         MCI
4         CHE
5         AST
6         LEI
7         BOU
8         TOT
9         BRI
10        NEW
The output I'm looking for is a positional difference of 10, which is the total of the per-team positional differences. How can I do this in Excel/Google Sheets? The positional difference for a team is always positive, whether the team moved up or down. Think of it as a league table.
Table 2 New (using a formula to find the positional difference):
Position  Team  Positional Difference
1         LIV   1
2         MAN   1
3         MCI   2
4         CHE   0
5         AST   1
6         LEI   1
7         BOU   0
8         TOT   2
9         BRI   1
10        NEW   1
Assuming that Table 1 is in columns A:B and Table 2 is in columns D:E (as the cell references in the formula imply), try this:
=IFNA(ABS(INDEX(A:B,MATCH(E2,B:B,0),1)-D2),"-")
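The lookup the formula performs (find each team's old position, subtract the new one, take the absolute value) can be checked against the expected total of 10 with a short Python sketch. The variable names are illustrative, and the two lists simply hold the teams of each table in position order:

```python
# Teams in position order (position 1 first), copied from Table 1 and Table 2.
table1 = ["MCI", "LIV", "MAN", "CHE", "LEI", "AST", "BOU", "BRI", "NEW", "TOT"]
table2 = ["LIV", "MAN", "MCI", "CHE", "AST", "LEI", "BOU", "TOT", "BRI", "NEW"]

# Equivalent of MATCH: map each team to its position in Table 1.
old_position = {team: pos for pos, team in enumerate(table1, start=1)}

# Equivalent of ABS(old - new) for every row of Table 2.
differences = {
    team: abs(old_position[team] - new_pos)
    for new_pos, team in enumerate(table2, start=1)
}

total = sum(differences.values())
print(differences)
print(total)
```

The sketch assumes every team in Table 2 also appears in Table 1; the IFNA in the formula covers the case where it does not.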

Join two pandas dataframes based on line order

I have two dataframes df1 and df2 I want to join. Their indexes are not the same and they don't have any common columns. What I want is to join them based on the order of the rows, i.e. join the first row of df1 with the first row of df2, the second row of df1 with the second row of df2, etc.
Example:
df1:
'A' 'B'
0 1 2
1 3 4
2 5 6
df2:
'C' 'D'
0 7 8
3 9 10
5 11 12
Should give
'A' 'B' 'C' 'D'
0 1 2 7 8
3 3 4 9 10
5 5 6 11 12
I don't care about the indexes in the final dataframe. I tried reindexing df1 with the indexes of df2 but could not make it work.
You could assign the index of df2 to df1 and then use join:
df1.index = df2.index
res = df1.join(df2)
In [86]: res
Out[86]:
'A' 'B' 'C' 'D'
0 1 2 7 8
3 3 4 9 10
5 5 6 11 12
Or you could do it in one line with set_index:
In [91]: df1.set_index(df2.index).join(df2)
Out[91]:
'A' 'B' 'C' 'D'
0 1 2 7 8
3 3 4 9 10
5 5 6 11 12
Try concat:
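pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
The reset_index(drop=True) calls give both frames the same default integer index (and discard the old ones, which would otherwise show up as extra columns), so concat with axis=1 simply joins them side by side.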
You can also join them directly. The join is performed on the index, which is the same for both DataFrames once df2's index has been reset:
In [18]: df1.join(df2.reset_index(drop=True))
Out[18]:
'A' 'B' 'C' 'D'
0 1 2 7 8
1 3 4 9 10
2 5 6 11 12
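For reference, here is a self-contained version of the approaches above, rebuilt from the example frames in the question. It assumes both frames have the same number of rows; the names by_position and keep_df2_index are only for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 3, 5], "B": [2, 4, 6]})
df2 = pd.DataFrame({"C": [7, 9, 11], "D": [8, 10, 12]}, index=[0, 3, 5])

# Option 1: discard both indexes and glue the frames together side by side.
by_position = pd.concat(
    [df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1
)

# Option 2: give df1 the index of df2, then join on that shared index.
keep_df2_index = df1.set_index(df2.index).join(df2)

print(by_position)
print(keep_df2_index)
```

Option 1 ends up with a fresh 0..n-1 index, while option 2 keeps the index of df2.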

Why is the result of the mostSimilarItems method in Mahout not ordered by weight?

I have the following code:
ItemSimilarity itemSimilarity = new UncenteredCosineSimilarity(dataModel);
recommender = new GenericItemBasedRecommender(dataModel,itemSimilarity);
List<RecommendedItem> items = recommender.mostSimilarItems(10, 5);
My data model looks like this:
userid itemid score
1 6 5
1 10 3
1 11 5
1 12 4
1 13 5
2 2 3
2 6 5
2 10 3
2 12 5
When I run the code above, the result looks like this:
13
6
11
2
12
I debugged the code and found that the items returned by recommender.mostSimilarItems(10, 5) all have the same score, namely 1!
So here is my problem. In my opinion, mostSimilarItems should take the item co-occurrence matrix into account:
     2   6  10  11  12  13
 2   0   1   1   0   1   0
 6   1   0   2   1   2   1
10   1   2   0   1   2   1
11   0   1   1   0   1   1
12   1   2   2   1   0   1
13   0   1   1   1   1   0
According to the matrix above, the items most similar to item 10 should be [6, 12, 11, 13, 2], because items 6 and 12 co-occur with item 10 more often than the other items do, shouldn't they?
Can anyone explain this to me? Thanks!
In your matrix you have much more data than in your input. In particular you seem to be imputing 0 values that are not in the data. That is why you are likely getting answers different from what you expect.
Mahout expects your IDs to be contiguous Integers starting from 0. This is true of your row and column ids. Your matrix looks like it has missing ids. Just having Integers is not enough.
Could this be the problem? Not sure what Mahout would do with the input above.
I always keep a dictionary to map Mahout IDs to/from my own.
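The scores can also be checked by hand. The sketch below is not Mahout code: it is plain Python that assumes the uncentered cosine is computed only over users who rated both items, which is an assumption about the library's behavior that this thread does not confirm. The names ratings and uncentered_cosine are illustrative.

```python
from math import sqrt

# The data model from the question: user id -> {item id: score}.
ratings = {
    1: {6: 5, 10: 3, 11: 5, 12: 4, 13: 5},
    2: {2: 3, 6: 5, 10: 3, 12: 5},
}


def uncentered_cosine(item_a, item_b):
    """Cosine of the two items' rating vectors, restricted to users who
    rated both items (assumed behavior, see the note above)."""
    pairs = [
        (prefs[item_a], prefs[item_b])
        for prefs in ratings.values()
        if item_a in prefs and item_b in prefs
    ]
    if not pairs:
        return None  # no user rated both items
    dot = sum(a * b for a, b in pairs)
    norm_a = sqrt(sum(a * a for a, _ in pairs))
    norm_b = sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_a * norm_b)


all_items = {item for prefs in ratings.values() for item in prefs}
scores = {item: uncentered_cosine(10, item) for item in sorted(all_items - {10})}
for item, score in scores.items():
    print(item, round(score, 4))
```

Under that assumption, items 2, 6, 11 and 13 all score approximately 1.0 against item 10, while item 12 scores slightly lower (about 0.994). With only two users, an item rated by a single shared user always has a cosine of 1, so co-occurrence counts and cosine scores can rank items quite differently.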

Using COUNTIFS on 3 different columns and then need to SUM a 4th column?

I have written the formula below. I do not know which part of it to change so that it adds up the numbers I have in AB2:AB552. As it is, the formula counts the cells in that range that contain numbers greater than 0, but I need it to total those numbers as my final result. Any help would be great.
=COUNTIFS(Cases!B2:B552,"1",Cases!G2:G552,"c*",Cases!X2:X552,"No",Cases!AB2:AB552,">0")
Assuming you don't actually need the intermediate counts, the SUMIFS function should give you the final result:
=SUMIFS(Cases!AB2:AB552,Cases!B2:B552,1,Cases!G2:G552,"c*",Cases!X2:X552,"No",Cases!AB2:AB552,">0")
Testing this with some limited data:
Row B G X AB
2 2 a No 10
3 1 c No 24
4 2 c No 4
5 1 c No 0
6 1 a Yes 9
7 2 c No 12
8 2 c No 6
9 2 b No 0
10 1 b No 0
11 1 a No 10
12 2 c No 6
13 1 c No 20
14 1 c No 4
15 1 b Yes 22
16 1 b Yes 22
The formula above returned 48, the sum of AB3, AB13, and AB14, which were the only rows matching all four criteria.
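The same check can be reproduced outside the spreadsheet. This is a sketch in pandas using the test rows above; the column names mirror the sheet's column letters, and the frame name cases is only illustrative:

```python
import pandas as pd

# Test data from rows 2-16 above; columns named after the sheet columns.
cases = pd.DataFrame({
    "B": [2, 1, 2, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1],
    "G": ["a", "c", "c", "c", "a", "c", "c", "b", "b", "a", "c", "c", "c", "b", "b"],
    "X": ["No", "No", "No", "No", "Yes", "No", "No", "No", "No", "No",
          "No", "No", "No", "Yes", "Yes"],
    "AB": [10, 24, 4, 0, 9, 12, 6, 0, 0, 10, 6, 20, 4, 22, 22],
}, index=range(2, 17))  # index = sheet row numbers

# One boolean condition per SUMIFS criterion; "c*" becomes "starts with c".
mask = (
    (cases["B"] == 1)
    & cases["G"].str.lower().str.startswith("c")
    & (cases["X"] == "No")
    & (cases["AB"] > 0)
)

count = int(mask.sum())                   # what COUNTIFS would return
total = int(cases.loc[mask, "AB"].sum())  # what SUMIFS would return
print(count, total)
```

The mask plays the role of the criteria pairs: summing the mask gives the COUNTIFS result, and summing AB under the mask gives the SUMIFS result.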
