Performing a one-to-many join in R dplyr

How do I do a one-to-many join without any keys in R using dplyr?
I have two tables:
origin<-tribble(~"o",
1,2)
destination<-tribble(~"d",
5,
6,
7)
I want to merge both of them without any keys like the following:
od<- tribble(~"o",~"d",
1,5,
1,6,
1,7,
2,5,
2,6,
2,7)
Can anyone help me out with this?

You can use slice and rep to repeat each row of origin once per row of destination. Then, inside bind_cols, repeat the values of destination once per row of origin, unlist them into a vector, and bind that vector on as column d.
library(tidyverse)
origin %>%
slice(rep(1:n(), each = nrow(destination[, 1]))) %>%
bind_cols(., d = unlist(rep(
c(destination[, 1]), times = nrow(origin)
)))
Output
# A tibble: 6 × 2
o d
<dbl> <dbl>
1 1 5
2 1 6
3 1 7
4 2 5
5 2 6
6 2 7

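tidyr::crossing and tidyr::expand_grid both give you a cross join of two data frames. crossing() additionally de-duplicates and sorts its inputs, while expand_grid() keeps them as given.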
tidyr::crossing(origin, destination)
#tidyr::expand_grid(origin, destination)
# o d
# <dbl> <dbl>
#1 1 5
#2 1 6
#3 1 7
#4 2 5
#5 2 6
#6 2 7


Clustering to achieve heterogeneous groups

I want to group 100 users based on a categorical variable (which can be low, medium, or high). The group size should be 3. I want to get the maximal heterogeneity within groups, assuming that users are distributed equally. I wonder if I can use some clustering algorithm to group based on the dissimilarity? Any suggestions?
I don't believe you need a clustering algorithm to group the data based upon a categorical variable.
Based on your question, I think this should work.
# Code
from sklearn.model_selection import train_test_split
group1, group23 = train_test_split(data, test_size=2/3., stratify=data['lab'])
group2, group3 = train_test_split(group23, test_size=1/2., stratify=group23['lab'])
The stratify argument keeps the proportions of the categorical variable the same in every split, so each group gets the same mix of low, medium, and high.
# Sample output
print(data)
val1 val2 lab
0 1 1 L
1 2 2 L
2 3 3 L
3 4 4 M
4 5 5 M
5 6 6 M
6 7 7 H
7 8 8 H
8 9 9 H
print(group1)
val1 val2 lab
4 5 5 M
1 2 2 L
6 7 7 H
print(group2)
val1 val2 lab
8 9 9 H
2 3 3 L
3 4 4 M
print(group3)
val1 val2 lab
0 1 1 L
7 8 8 H
5 6 6 M
train_test_split() Documentation
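The stratified split above yields three large groups. If many groups of three users each are wanted instead, one possible alternative (not the stratified split itself, just a simple round-robin draw) is to bucket users by level, shuffle each bucket, and repeatedly take one user from each level. A minimal sketch using only the standard library; the function name heterogeneous_groups, the user ids, and the 33/33/34 distribution are all made up for illustration:

```python
import random
from collections import defaultdict


def heterogeneous_groups(users, group_size=3, seed=0):
    """Split (user_id, level) pairs into groups that mix levels as much as possible."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for user_id, level in users:
        buckets[level].append(user_id)
    for members in buckets.values():
        rng.shuffle(members)  # random order inside each level

    groups = []
    remaining = len(users)
    while remaining > 0:
        # Draw from the fullest levels first so that no level runs out early.
        fullest = sorted(buckets, key=lambda lvl: len(buckets[lvl]), reverse=True)
        group = [(buckets[lvl].pop(), lvl) for lvl in fullest[:group_size] if buckets[lvl]]
        groups.append(group)
        remaining -= len(group)
    return groups


# Hypothetical data: 100 users, roughly evenly distributed over three levels.
levels = ["low"] * 33 + ["medium"] * 33 + ["high"] * 34
users = list(enumerate(levels))
groups = heterogeneous_groups(users)
print(len(groups))  # 34: 33 mixed groups of three plus one leftover user
```

With 100 users one user is always left over, so the last group is smaller; how to handle that remainder is a design choice the question leaves open.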

Octave Conditional Merging of matrices

I have searched for an Octave function that facilitates conditional merging of matrices but haven't found one so far. My goal is to do this using vectors without looping. Here is an example of what I am trying to do.
A= [1 1
2 2
3 1
5 2];
B= [1 9
2 10];
I would like to get C as
C= [1 1 9
2 2 10
3 1 9
5 2 10];
Is there a function that takes A, B and the list of column(s) to join on and then produce C?
You can use the second output of ismember to find the occurrences of the second column of A in the first column of B and then use that to grab specific entries from the second column of B to construct C.
[~, inds] = ismember(A(:,2), B(:,1));
C = [A, B(inds,2)];
%// 1 1 9
%// 2 2 10
%// 3 1 9
%// 5 2 10

Join two pandas dataframes based on line order

I have two dataframes df1 and df2 I want to join. Their indexes are not the same and they don't have any common columns. What I want is to join them based on the order of the rows, i.e. join the first row of df1 with the first row of df2, the second row of df1 with the second row of df2, etc.
Example:
df1:
'A' 'B'
0 1 2
1 3 4
2 5 6
df2:
'C' 'D'
0 7 8
3 9 10
5 11 12
Should give
'A' 'B' 'C' 'D'
0 1 2 7 8
3 3 4 9 10
5 5 6 11 12
I don't care about the indexes in the final dataframe. I tried reindexing df1 with the indexes of df2 but could not make it work.
You could assign the index of df2 to df1 and then use join:
df1.index = df2.index
res = df1.join(df2)
In [86]: res
Out[86]:
'A' 'B' 'C' 'D'
0 1 2 7 8
3 3 4 9 10
5 5 6 11 12
Or you could do it in one line with set_index:
In [91]: df1.set_index(df2.index).join(df2)
Out[91]:
'A' 'B' 'C' 'D'
0 1 2 7 8
3 3 4 9 10
5 5 6 11 12
Try concat:
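pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
The reset_index(drop=True) calls give both frames the same default 0..n-1 index (drop=True discards the old index instead of keeping it as an extra column); concat with axis=1 then simply joins them horizontally.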
You can also join them directly; join aligns on the index, which is the same for both DataFrames once the index of df2 is reset:
In [18]: df1.join(df2.reset_index(drop=True))
Out[18]:
'A' 'B' 'C' 'D'
0 1 2 7 8
1 3 4 9 10
2 5 6 11 12
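For reference, here is a self-contained sketch that rebuilds the two frames from the question and runs two of the approaches above side by side. The variable names by_join and by_concat and the length check are additions for illustration only; drop=True discards the old index rather than keeping it as a column.

```python
import pandas as pd

# Frames from the question: different indexes, no common columns.
df1 = pd.DataFrame({"A": [1, 3, 5], "B": [2, 4, 6]})
df2 = pd.DataFrame({"C": [7, 9, 11], "D": [8, 10, 12]}, index=[0, 3, 5])

# Positional joins only make sense when both frames have the same number of rows.
if len(df1) != len(df2):
    raise ValueError("df1 and df2 must have the same number of rows")

# Approach 1: give df1 the index of df2, then join on the index.
by_join = df1.set_index(df2.index).join(df2)

# Approach 2: reset both indexes to 0..n-1, then concatenate column-wise.
by_concat = pd.concat(
    [df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1
)

print(by_join)
print(by_concat)
```

Both results hold the same values; they differ only in the index they keep (that of df2 for the join, a fresh 0..n-1 index for the concat).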

Clustering unique datasets based on similarities (equality)

I have just entered the field of data mining, machine learning and clustering. I have a specific problem and do not know which technique to use to solve it.
I want to cluster observations (objects or whatever) stored in a specific data format. All variables in each observation are numeric. My data input looks like this:
1 2 3 4 5 6
1 3 5 7
2 9 10 11 12 13 14
45 1 22 23 24
Let's say that n is the number of rows (observations, or 1D vectors) and m is the number of columns (variables in each vector). n could be very large, and 0 < m < 100. The main point is that an observation (row) cannot contain duplicate values (in the 1st row, each value can appear only once).
So, I want to cluster observations based on the number of identical values that the rows share.
If there are two rows like:
1
1 2 3 4 5
They should be placed in the same cluster; if there is no match, then certainly not. Also, the number of rows in one cluster should not exceed 100.
Sick problem..? If not, just for info, I haven't mentioned the time dimension. But let's skip that for now.
So, any directions from you guys,
Thanks and best regards,
JDK
It's hard to recommend anything since your problem is totally vague, and we have no information on the data. Data mining (and in particular exploratory techniques like clustering) is all about understanding the data. So we cannot provide the ultimate answer.
Two things for you to consider:
1. if the data indicates presence of species or traits, Jaccard similarity (and other set based metrics) are worth a try.
2. if absence is less informative, maybe you should be mining association rules, not clusters
Either way, without understanding your data these numbers are as good as random numbers. You can easily cluster random numbers, and spend weeks to get the best useless result!
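To make the Jaccard suggestion concrete, here is a minimal sketch that treats each row from the question as a set and computes the Jaccard similarity (shared values divided by all distinct values). The function name jaccard and the choice to return 0.0 for two empty sets are assumptions made for this illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty rows count as dissimilar
    return len(a & b) / len(union)


# Rows from the question.
rows = [
    [1, 2, 3, 4, 5, 6],
    [1, 3, 5, 7],
    [2, 9, 10, 11, 12, 13, 14],
    [45, 1, 22, 23, 24],
]

# Rows 0 and 1 share 1, 3, 5 out of 7 distinct values, i.e. 3/7.
print(jaccard(rows[0], rows[1]))
# Rows 1 and 2 share nothing, so the similarity is 0.
print(jaccard(rows[1], rows[2]))
```

A pairwise similarity like this could feed a distance-based clustering method, but whether that is meaningful still depends on what the numbers represent.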
Can your problem be treated as a Bag-of-words model, where each article (observation row) has no more than 100 terms?
Anyway, I think you have to give more information and examples about "why" and "how" you want to cluster this data. For example, we have:
1 2 3
2 3 4
2 3 4 5
1 2 3 4
3 4 6
6 7 8
9 10
9 11
10 11 12 13 14
What is your expected clustering? How many clusters are there in this clustering? Only two clusters?
Until you give more information, based on your current description, I think you do not need a clustering algorithm but a connected-components structure. In the first round you process the dataset to build the connected components, and in a second round you check which connected component each row belongs to. Taking the example above, the first round looks like this:
1 2 3 : 1 <- 1, 1 <- 2, 1 <- 3 (all points are linked to the smallest point to
show that they belong to the same cluster as the smallest point)
2 3 4 : 2 <- 4 (2 and 3 are already linked to 1, which is <= 2, so they do
not need to change)
2 3 4 5 : 2 <- 5
1 2 3 4 : 1 <- 4 (this change is not essential because we already have
1 <- 2 <- 4, but making it speeds up the second round)
3 4 6 : 3 <- 6
6 7 8 : 6 <- 7, 6 <- 8
9 10 : 9 <- 9, 9 <- 10
9 11 : 9 <- 11
10 11 12 13 14 : 10 <- 12, 10 <- 13, 10 <- 14
Now we have a forest structure that represents the connected components of the points. In the second round you can pick one point in each row (the smallest one is best) and trace its root in the forest. Rows that have the same root are in the same, in your words, cluster. For example:
1 2 3 : 1 <- 1, cluster root 1
2 3 4 5 : 1 <- 1 <- 2, cluster root 1
6 7 8 : 1 <- 1 <- 3 <- 6, cluster root 1
9 10 : 9 <- 9, cluster root 9
10 11 12 13 14 : 9 <- 9 <- 10, cluster root 9
This process takes O(k) space, where k is the number of points, and O(nm + nh) time, where h is the height of the forest structure and h << m.
I am not sure if this is the result you want.
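The two rounds described above can be sketched with a standard union-find (disjoint-set) structure. This is only an illustration: the names find and cluster_rows are made up here, the rows are the ones from the trace, and unlike the trace it links roots rather than individual points, which is the usual union-find formulation. It does not enforce the limit of 100 rows per cluster from the question.

```python
def find(parent, x):
    """Return the root of x and compress the path walked on the way."""
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def cluster_rows(rows):
    """Group rows that are connected through shared values."""
    parent = {}
    # Round 1: merge all values of a row under the smallest root among them.
    for row in rows:
        for value in row:
            parent.setdefault(value, value)
        roots = {find(parent, value) for value in row}
        smallest = min(roots)
        for root in roots:
            parent[root] = smallest
    # Round 2: the cluster of a row is the root of any one of its values.
    clusters = {}
    for row in rows:
        clusters.setdefault(find(parent, row[0]), []).append(row)
    return clusters


rows = [
    [1, 2, 3], [2, 3, 4], [2, 3, 4, 5], [1, 2, 3, 4], [3, 4, 6],
    [6, 7, 8], [9, 10], [9, 11], [10, 11, 12, 13, 14],
]
clusters = cluster_rows(rows)
for root, members in sorted(clusters.items()):
    print(root, members)
```

On these rows the sketch produces two clusters, rooted at 1 and 9, matching the roots found in the trace.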

Using COUNTIFS on 3 different columns and then need to SUM a 4th column?

I have written the formula below. I do not know how to change it so that it adds up the numbers I have in column AB2:AB552. As it is, the formula counts the cells in that range that contain numbers (the last criterion, Cases!AB2:AB552,">0"), but I need it to total those numbers as my final result. Any help would be great.
=COUNTIFS(Cases!B2:B552,"1",Cases!G2:G552,"c*",Cases!X2:X552,"No",Cases!AB2:AB552,">0")
Assuming you don't actually need the intermediate counts, the SUMIFS function should give you the final result:
=SUMIFS(Cases!AB2:AB552,Cases!B2:B552,1,Cases!G2:G552,"c*",Cases!X2:X552,"No",Cases!AB2:AB552,">0")
Testing this with some limited data:
Row B G X AB
2 2 a No 10
3 1 c No 24
4 2 c No 4
5 1 c No 0
6 1 a Yes 9
7 2 c No 12
8 2 c No 6
9 2 b No 0
10 1 b No 0
11 1 a No 10
12 2 c No 6
13 1 c No 20
14 1 c No 4
15 1 b Yes 22
16 1 b Yes 22
The formula above returned 48, the sum of AB3, AB13, and AB14, which were the only rows matching all 4 criteria.
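As a cross-check of that result outside the spreadsheet, the same four criteria can be applied to the test rows in a few lines of Python. This is only a sketch: the tuple layout and variable names are made up, and the "starts with c" test stands in for the "c*" wildcard in the question (for this sample it is equivalent to an exact "c").

```python
# Test rows from the table above: (row, B, G, X, AB).
rows = [
    (2, 2, "a", "No", 10), (3, 1, "c", "No", 24), (4, 2, "c", "No", 4),
    (5, 1, "c", "No", 0), (6, 1, "a", "Yes", 9), (7, 2, "c", "No", 12),
    (8, 2, "c", "No", 6), (9, 2, "b", "No", 0), (10, 1, "b", "No", 0),
    (11, 1, "a", "No", 10), (12, 2, "c", "No", 6), (13, 1, "c", "No", 20),
    (14, 1, "c", "No", 4), (15, 1, "b", "Yes", 22), (16, 1, "b", "Yes", 22),
]

# Keep rows where B = 1, G starts with "c", X = "No" and AB > 0.
matching = [
    (row, ab)
    for row, b, g, x, ab in rows
    if b == 1 and g.lower().startswith("c") and x == "No" and ab > 0
]

total = sum(ab for _, ab in matching)
print([row for row, _ in matching])  # [3, 13, 14]
print(total)  # 48
```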
