Creating a hybrid clustering method with K-Means & Agglomerative Clustering

I'm trying to create a clustering method that combines K-Means and Agglomerative Clustering.
The first step is to apply the K-Means algorithm to group the data into 50 clusters, keeping the centroids and labels obtained for each cluster.
Second, I'll display a dendrogram in order to choose an adequate number of clusters (>2).
Then I'll apply a hierarchical agglomerative clustering algorithm (AgglomerativeClustering) to the centroids obtained in step 1, using the number of clusters chosen in step 2.
Then I'll calculate the centroids of each new cluster.
Finally, I'll use those centroids to consolidate the clusters with the K-Means algorithm (via the init argument of KMeans, which lets you specify the centroids the algorithm starts from).
To do this, I tried the following code:
# Step 1: K-Means with 50 clusters
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, dendrogram

clf = KMeans(n_clusters=50)
clf.fit(df)
labels = clf.labels_
centroids = clf.cluster_centers_

# Step 2: dendrogram to choose the final number of clusters
Z = linkage(df, method='ward', metric='euclidean')
dendrogram(Z, labels=labels, leaf_rotation=90., color_threshold=0)
On the basis of the dendrogram, I found that the optimal number of clusters was 3.
# Step 3: agglomerative clustering on the 50 K-Means centroids
from sklearn.cluster import AgglomerativeClustering

avg = AgglomerativeClustering(n_clusters=3)
avg.fit(centroids)
labels_1 = avg.labels_

Z = linkage(centroids, method='ward', metric='euclidean')
dendrogram(Z, labels=labels_1, leaf_rotation=90., color_threshold=0)
But after this I'm lost: I don't know how to calculate the new centroids or how to set up the final KMeans step.
What do you think of my steps, did I do something wrong? What should I do to make this combination work? Thanks!
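For reference, here is a minimal sketch of what steps 4 and 5 could look like, assuming df is the feature matrix used above and that labels_1 maps each of the 50 K-Means centroids to one of the 3 merged clusters; the grouping logic below is only one possible way to finish the pipeline, not a definitive implementation:
import numpy as np
from sklearn.cluster import KMeans

# Step 4 (sketch): map every data point to its merged cluster, then average.
# labels (length n_samples) assigns each point to one of the 50 K-Means clusters,
# labels_1 (length 50) assigns each of those clusters to one of the 3 groups.
merged_labels = labels_1[labels]
new_centroids = np.array([df[merged_labels == k].mean(axis=0) for k in range(3)])

# Step 5 (sketch): consolidate with K-Means, starting from the computed centroids.
clf_final = KMeans(n_clusters=3, init=new_centroids, n_init=1)
clf_final.fit(df)
final_labels = clf_final.labels_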

Related

While applying MinMax scaling does each column needs to be treated independently for train and test?

So, I have 8 columns in my dataframe: 5 are features and the other 3 are targets. After following the process below, the results are not good. Can anyone provide feedback on the steps I followed?
Here I define two MinMax scalers, one for the feature columns and one for the target columns. Once the model predicts the values, I apply the inverse scaling to the predicted targets to get the results back in the original units.
# smoothing and MinMax scaling
import tsmoothie
from sklearn.preprocessing import MinMaxScaler

smoother = tsmoothie.KalmanSmoother(component='level_trend',
                                    component_noise={'level': 0.1, 'trend': 0.1})
scaler_features = MinMaxScaler(feature_range=(0, 1))
scaler_targets = MinMaxScaler(feature_range=(0, 1))

# set up features and targets from the df
df_norm_feature = scaler_features.fit_transform(raw_df.iloc[:, :5])
df_norm_target = scaler_targets.fit_transform(raw_df.iloc[:, 5:])

# smooth features and targets
smoother.smooth(df_norm_feature)
smoothed_features = smoother.smooth_data
smoother.smooth(df_norm_target)
smoothed_targets = smoother.smooth_data

# split into train/test, train the model on the train part;
# for the reverse transformation I use the following code
test_resultsForAll = model.predict(test_data)
transformed_test_resultsForAll = scaler_targets.inverse_transform(test_resultsForAll)
But the results obtained via this method are not good. Are there any mistakes in the order of these steps, or do I need to perform the MinMax scaling and smoothing on the whole dataset at once?
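As a point of comparison, here is a minimal sketch of the ordering usually recommended to avoid leakage, where the split happens first and the scalers are fitted on the training portion only. The 5-feature / 3-target layout is taken from the question; the placeholder dataframe and the 80/20 chronological split are assumptions for illustration:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Placeholder frame with the same 5-feature / 3-target layout as in the question.
raw_df = pd.DataFrame(np.random.rand(1000, 8))

# Split first (chronologically here, as one possible choice), then scale.
split = int(len(raw_df) * 0.8)
train_df, test_df = raw_df.iloc[:split], raw_df.iloc[split:]

scaler_features = MinMaxScaler(feature_range=(0, 1))
scaler_targets = MinMaxScaler(feature_range=(0, 1))

# Fit the scalers on the training rows only, then reuse them on the test rows.
X_train = scaler_features.fit_transform(train_df.iloc[:, :5])
y_train = scaler_targets.fit_transform(train_df.iloc[:, 5:])
X_test = scaler_features.transform(test_df.iloc[:, :5])
y_test = scaler_targets.transform(test_df.iloc[:, 5:])

# Train the model on (X_train, y_train) as before; at prediction time only the
# target scaler is inverted, e.g.:
# predictions = scaler_targets.inverse_transform(model.predict(X_test))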

MLJ: selecting rows and columns for training in evaluate

I want to implement a kernel ridge regression that also works within MLJ. Moreover, I want to have the option to use either feature vectors or a predefined kernel matrix as in Python sklearn.
When I run this code
using MLJ, MLJBase, MLJModelInterface, LinearAlgebra
const MMI = MLJModelInterface

MMI.@mlj_model mutable struct KRRModel <: MLJModelInterface.Deterministic
    mu::Float64 = 1::(_ > 0)
    kernel::String = "linear"
end

function MMI.fit(m::KRRModel, verbosity::Int, K, y)
    K = MLJBase.matrix(K)
    fitresult = inv(K + m.mu*I) * y
    cache = nothing
    report = nothing
    return (fitresult, cache, report)
end
N = 10
K = randn(N, N)
K = K*K'   # symmetrise so K is a valid (positive semi-definite) kernel matrix
a = randn(N)
y = K*a + 0.2*randn(N)
m = KRRModel()
kregressor = machine(m,K,y)
cv = CV(; nfolds=6, shuffle=nothing, rng=nothing)
evaluate!(kregressor, resampling=cv, measure=rms, verbosity=1)
the evaluate! function evaluates the machine on different subsets of rows of K. Due to the Representer Theorem, a kernel ridge regression has a number of nonzero coefficients equal to the number of samples. Hence, a reduced size matrix K[train_rows,train_rows] can be used instead of K[train_rows,:].
To indicate that I'm passing a precomputed kernel matrix, I'd set m.kernel = "". How do I make evaluate! select the columns as well as the rows to form a smaller matrix when m.kernel == ""?
This is my first time using MLJ and I'd like to make as few modifications as possible.
Quoting the answer I got on the Julia Discourse from @ablaom:
The intended use of evaluate! is to estimate the generalisation error associated with some supervised learning model, by subsampling observations, as in cross-validation, a common use-case. I'm afraid there is no natural way for evaluate! to do feature subsampling.
https://alan-turing-institute.github.io/MLJ.jl/dev/evaluating_model_performance/
FYI: There is a version of kernel regression implementing the MLJ model interface, namely kernel partial least squares regression from the package lalvim/PartialLeastSquaresRegressor.jl (https://github.com/lalvim/PartialLeastSquaresRegressor.jl).
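For comparison with the scikit-learn behaviour mentioned in the question: with a precomputed kernel, sklearn's cross-validation utilities treat the estimator as pairwise and slice both the rows and the columns of the kernel matrix, so each fold fits on K[train, train] and predicts on K[test, train]. A minimal sketch with toy data (not the MLJ model above):
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score, KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X @ rng.normal(size=5) + 0.2 * rng.normal(size=60)

K = X @ X.T  # precomputed linear kernel, shape (n_samples, n_samples)

model = KernelRidge(alpha=1.0, kernel="precomputed")
cv = KFold(n_splits=6, shuffle=True, random_state=0)

# Because the estimator is pairwise, each CV split receives the submatrix
# K[train][:, train] for fitting and K[test][:, train] for prediction.
scores = cross_val_score(model, K, y, cv=cv, scoring="neg_root_mean_squared_error")
print(scores)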

How to do link prediction with node embeddings?

I am currently working on an item embedding task in a recommendation system and I want to compare the performance of a new embedding algorithm with the old ones. I have read some papers about graph embedding, and almost every paper mentions link prediction as a standard way to evaluate the embeddings. But none of these papers describes exactly how to do it. So my question is: how do I evaluate the embeddings using link prediction?
The algorithm I am trying to apply is:
First, a directed graph is built from user click sequences. Each node represents an item; if a user clicked item A and then item B, the graph gets nodes A and B and an edge A-B with weight 1, and each additional user who clicks A and then B increases the weight of A-B by 1.
Then a new sequence dataset is generated by random walks on the graph, using the outbound edge weights as the transition probabilities.
Finally, SkipGram is run on the new sequences to produce the node embeddings.
As many papers mention, I removed a certain proportion of the edges in the graph (e.g. 0.25) as the positive samples of the test set and randomly generated some fake edges as the negative ones. So what's next? Should I simply generate fake edges for the real edges in the training set as well, concatenate the embeddings of the two nodes of each edge, train a common classifier such as logistic regression, and test it on the test set? Or should I compute the AUC on the test set using the cosine similarity of the two nodes against a 0/1 label indicating whether the two nodes are really connected? Or should I compute the AUC using the sigmoid of the dot product of the two node embeddings against the same 0/1 label, since this is how the probability is computed in the last layer?
# examples describing the three methods above
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

item_emb = np.random.random(400).reshape(100, 4)  # assume we have 100 items embedded in a 4-dimensional space
test_node = np.random.randint(0, 100, size=200).reshape(100, 2)  # assume we have 100 pairs of nodes
test_label = np.random.randint(0, 2, size=100).reshape(100, 1)  # label indicating whether each pair is really connected

def test_A():
    # use logistic regression on concatenated embeddings
    train_node = ...   # generate true and fake node pairs in a similar way
    train_label = ...  # generate true and fake node pairs in a similar way
    train_feat = np.hstack((item_emb[train_node[:, 0]],
                            item_emb[train_node[:, 1]]))  # concatenate the embeddings
    test_feat = np.hstack((item_emb[test_node[:, 0]],
                           item_emb[test_node[:, 1]]))    # concatenate the embeddings
    lr = LogisticRegression().fit(train_feat, train_label)
    auc = roc_auc_score(test_label, lr.predict_proba(test_feat)[:, 1])
    return auc

def test_B():
    # use cosine similarity
    emb1 = item_emb[test_node[:, 0]]
    emb2 = item_emb[test_node[:, 1]]
    cosine_sim = np.sum(emb1 * emb2, axis=1) / (np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1))
    auc = roc_auc_score(test_label, cosine_sim)
    return auc

def test_C():
    # use the dot product with the softmax weights and biases from the training network
    softmax_weights = ...  # same shape as item_emb
    softmax_biases = ...   # shape of (item_emb.shape[0], 1)
    embedded_item = item_emb[test_node[:, 0]]             # target item embedding
    softmaxed_context = softmax_weights[test_node[:, 1]]  # context item output weights
    dot_prod = np.sum(embedded_item * softmaxed_context, axis=1) + softmax_biases[test_node[:, 1]].ravel()
    auc = roc_auc_score(test_label, dot_prod)
    return auc
I have tried the three methods in several tests, and they do not always tell the same story: some parameter combinations perform well with test_A() but badly on the other metrics, some the opposite, and so on. Sadly there is no parameter combination that outperforms the others on all three metrics. The question is: which metric should I use?
You should investigate some implementations:
StellarGraph: Link prediction with node2vec+Logistic regression
AmpliGraph: Link prediction with ComplEx
Briefly, one should sample edges (not nodes!) from the original graph, remove them, and learn embeddings on that truncated graph. The evaluation is then performed on the removed edges.
Also, there are two possible cases:
All possible edges between any pair of nodes are labeled. In this case the evaluation metric is ROC AUC, and we learn a classifier to distinguish positive and negative edges.
Only positive (real) edges are observed; we don't know whether the remaining pairs are connected in the real world. Here we generate negative (fake) edges for every positive one, e.g. by corrupting one of its nodes. The task is then treated as entity ranking, with the following evaluation metrics:
Rank
Mean Rank (MR)
Mean Reciprocal Rank (MRR)
Hits@N
An example can be found in the paper, sections 5.1-5.3.
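To make the ranking metrics above more concrete, here is a minimal sketch of computing MR, MRR, and Hits@N, assuming a score_fn(head, tail) function (for example the dot product of the two embeddings) and that each positive test edge is compared against sampled negative tails; all names are placeholders, not from any specific library:
import numpy as np

def rank_of_positive(score_fn, head, true_tail, candidate_tails):
    # Rank of the true tail among the candidates (1 = best), scored by score_fn.
    scores = np.array([score_fn(head, t) for t in candidate_tails])
    true_score = score_fn(head, true_tail)
    return int(np.sum(scores > true_score)) + 1

def ranking_metrics(score_fn, test_edges, all_items, n_negatives=100, hits_n=10, seed=0):
    rng = np.random.default_rng(seed)
    ranks = []
    for head, true_tail in test_edges:
        negatives = rng.choice(all_items, size=n_negatives, replace=False)
        candidates = [t for t in negatives if t != true_tail]
        ranks.append(rank_of_positive(score_fn, head, true_tail, candidates))
    ranks = np.array(ranks)
    return {
        "MR": ranks.mean(),              # mean rank
        "MRR": (1.0 / ranks).mean(),     # mean reciprocal rank
        f"Hits@{hits_n}": (ranks <= hits_n).mean(),
    }

# usage with the item_emb from the question as the scorer:
# score_fn = lambda a, b: item_emb[a] @ item_emb[b]
# metrics = ranking_metrics(score_fn, test_edges=[(3, 17), (42, 5)], all_items=np.arange(100))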

Combining Neural Networks Pytorch

I have 2 images as input, x1 and x2, and I try to use convolution as a similarity measure. The idea is that the learned weights substitute for more traditional measures of similarity (cross-correlation, NN, ...). I define my forward function as follows:
def forward(self, x1, x2):
    # branch a
    out_conv1a = self.conv1(x1)
    out_conv2a = self.conv2(out_conv1a)
    out_conv3a = self.conv3(out_conv2a)
    # branch b (same shared weights)
    out_conv1b = self.conv1(x2)
    out_conv2b = self.conv2(out_conv1b)
    out_conv3b = self.conv3(out_conv2b)
Now for the similarity measure:
out_cat = torch.cat([out_conv3a, out_conv3b],dim=1)
further_conv = nn.Conv2d(out_cat)
My question is as follows:
1) Would depthwise/separable convolutions, as in the Google paper, yield any advantage over a 2D convolution of the concatenated input? For that matter, can convolution be a similarity measure at all? Cross-correlation and convolution are very similar.
2) It is my understanding that the groups=2 option in Conv2d would provide 2 separate inputs to train weights with, in this case each of the previous networks' weights. How are these combined afterwards?
For a basic concept see here.
Using an nn.Conv2d layer, you assume the weights are trainable parameters. However, if you want to filter one feature map with another, you can go a level deeper and use torch.nn.functional.conv2d to explicitly provide both the input and the filter yourself:
out = torch.nn.functional.conv2d(out_conv3a, out_conv3b)
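A minimal sketch of that idea with dummy tensors, assuming a batch size of 1 and 16 channels out of conv3 (both numbers are placeholders): treating out_conv3b as a single filter of shape (1, C, h, w) turns F.conv2d into a cross-correlation between the two feature stacks.
import torch
import torch.nn.functional as F

# Placeholder feature maps: batch of 1, 16 channels each.
out_conv3a = torch.randn(1, 16, 32, 32)  # features of x1 (the "image")
out_conv3b = torch.randn(1, 16, 8, 8)    # features of x2, reused as the filter

# F.conv2d expects a weight of shape (out_channels, in_channels, kH, kW);
# out_conv3b already has that layout with out_channels = 1 and in_channels = 16.
similarity_map = F.conv2d(out_conv3a, out_conv3b)
print(similarity_map.shape)  # torch.Size([1, 1, 25, 25])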

How to check deep embedded clustering on new data?

I'm using DEC from mxnet (https://github.com/apache/incubator-mxnet/tree/master/example/deep-embedded-clustering)
While it defaults to running on MNIST, I have changed the data source to several hundred documents (which should be perfectly fine, given that mxnet can work with the Reuters dataset).
The question: after training with MXNet, how can I use the model on new, unseen data? It shows me a new prediction each time!
Here is the code for collecting the dataset:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(dtype=np.float64, stop_words='english', max_features=2000,
                             norm='l2', sublinear_tf=True).fit(training)
X = vectorizer.transform(training)
X = np.asarray(X.todense())  # * np.sqrt(X.shape[1])
Y = np.asarray(labels)
Here is the code for prediction:
def predict(self, TrainX, X, update_interval=None):
    N = TrainX.shape[0]
    if not update_interval:
        update_interval = N
    batch_size = 256

    # embed the training data and re-run KMeans to get cluster centres
    test_iter = mx.io.NDArrayIter({'data': TrainX}, batch_size=batch_size, shuffle=False,
                                  last_batch_handle='pad')
    args = {k: mx.nd.array(v.asnumpy(), ctx=self.xpu) for k, v in self.args.items()}
    z = list(model.extract_feature(self.feature, args, None, test_iter, N, self.xpu).values())[0]
    kmeans = KMeans(self.num_centers, n_init=20)
    kmeans.fit(z)
    args['dec_mu'][:] = kmeans.cluster_centers_
    print(args)

    # embed the new samples and assign them to the re-computed centres
    sample_iter = mx.io.NDArrayIter({'data': X})
    z = list(model.extract_feature(self.feature, args, None, sample_iter, N, self.xpu).values())[0]
    p = np.zeros((z.shape[0], self.num_centers))
    self.dec_op.forward([z, args['dec_mu'].asnumpy()], [p])
    print(p)
    y_pred = p.argmax(axis=1)
    self.y_pred = y_pred
    return y_pred
Explanation: I thought I also needed to pass a sample of the data I trained the system with, which is why you see both TrainX and X there.
Any help is greatly appreciated.
Clustering methods (by themselves) don't provide a method for labelling samples that weren't included in the calculation for deriving the clusters. You could re-run the clustering algorithm with the new samples, but the clusters are likely to change and be given different cluster labels due to different random initializations. So this is probably why you're seeing different predictions each time.
One option is to use the cluster labels from the clustering method in a supervised way, to predict the cluster labels for new samples. You could find the closest cluster center to your new sample (in the feature space) and use that as the cluster label, but this ignores the shape of the clusters. A better solution would be to train a classification model to predict the cluster labels for new samples given the previously clustered data. Success of these methods will depend on the quality of your clustering (i.e. the feature space used, separability of clusters, etc).
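For the second suggestion, here is a minimal sketch with placeholder data; in practice z_train would be the DEC feature vectors of the previously clustered documents and cluster_labels their cluster assignments, and the choice of classifier is only an example:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
z_train = rng.normal(size=(500, 10))           # placeholder for the embedded training data
cluster_labels = rng.integers(0, 4, size=500)  # placeholder for the cluster assignments

# Fit a supervised model on the cluster labels so new samples get stable labels.
clf = KNeighborsClassifier(n_neighbors=15).fit(z_train, cluster_labels)

z_new = rng.normal(size=(20, 10))              # placeholder for newly embedded documents
new_labels = clf.predict(z_new)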
