I am trying to apply a clustering method to my datasets (with numerical dimensions), but I'm convinced that the features have different weights for different clusters. I read that there is an approach called soft subspace clustering that tries to identify the clusters and the per-cluster feature weights simultaneously. However, the algorithms that I have found are applied only to categorical data.
I am looking for a soft subspace clustering algorithm for numerical data. Do you know if there is one, or how I can adapt methods originally designed for categorical data to numerical data? (I think it would be necessary to propose some way of measuring the relevance of each numerical feature in each cluster.)
Yes, there are dozens of subspace clustering algorithms.
You'll need to do a proper literature search; this is too broad to cover in a Q&A format like Stack Overflow. Look for (surprise) "subspace clustering", but also include "biclustering", for example.
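To make the question's idea of "measuring the relevance of each numerical feature in each cluster" concrete, here is a minimal sketch. It assumes one simple heuristic, not taken from any specific published algorithm: a feature is more relevant to a cluster when the cluster's points vary little along it. The function name, the exponential weighting, and the bandwidth parameter h are all illustrative choices; a real soft subspace method would alternate this step with reassigning points.

```python
import math

def cluster_feature_weights(points, labels, h=1.0):
    """Return {cluster: [weight per feature]}; weights sum to 1 within each cluster.

    Assumed heuristic: weight decays exponentially with the within-cluster
    variance of the feature. h (hypothetical parameter) controls how sharply.
    """
    by_cluster = {}
    for p, c in zip(points, labels):
        by_cluster.setdefault(c, []).append(p)

    weights = {}
    for c, members in by_cluster.items():
        dispersions = []
        for col in zip(*members):  # iterate over features
            mean = sum(col) / len(col)
            dispersions.append(sum((v - mean) ** 2 for v in col) / len(col))
        # Shift by the smallest dispersion so at least one term equals 1.0
        # and the normalising sum can never underflow to zero.
        d_min = min(dispersions)
        raw = [math.exp(-(d - d_min) / h) for d in dispersions]
        total = sum(raw)
        weights[c] = [r / total for r in raw]
    return weights

# Toy data: cluster 0 is tight along feature 0, cluster 1 along feature 1.
points = [(1.0, 10.0), (1.1, 20.0), (0.9, 30.0),
          (10.0, 5.0), (20.0, 5.1), (30.0, 4.9)]
labels = [0, 0, 0, 1, 1, 1]
weights = cluster_feature_weights(points, labels)
```

On this toy data each cluster puts almost all of its weight on the feature along which it is compact. Features should be put on comparable scales first, otherwise the variances are not comparable.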
I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome. I have 5 algorithms:
Neural Networks
Logistic Regression
Naive Bayes
Random Forest
Adaboost
I have read a lot about the Information Gain technique, and it seems to be independent of the machine learning algorithm used. It works like a preprocessing technique.
My question is: is it best practice to compute feature importance separately for each algorithm, or to just use Information Gain? If the former, which techniques are used for each?
First of all, it's worth stressing that you have to perform the feature selection based on the training data only, even if it is a separate algorithm. During testing, you then select the same features from the test dataset.
Some approaches that spring to mind:
Mutual information based feature selection (eg here), independent of the classifier.
Backward or forward selection (see stackexchange question), applicable to any classifier but potentially costly since you need to train/test many models.
Regularisation techniques that are part of the classifier optimisation, eg Lasso or elastic net. The latter can be better in datasets with high collinearity.
Principal components analysis or any other dimensionality reduction technique that groups your features (example).
Some models compute latent variables which you can use for interpretation instead of the original features (e.g. Partial Least Squares or Canonical Correlation Analysis).
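As a rough illustration of the first approach combined with the "training data only" rule, here is a minimal sketch of mutual-information-based selection. It assumes the features are already discrete (numerical features would need binning first), and the helper names select_top_k and project are made up for this example. The scores are computed on the training rows only, and the same column indices are then applied to the test rows.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

def select_top_k(train_rows, train_labels, k):
    """Score every column against the labels; return the k best column indices."""
    n_features = len(train_rows[0])
    scores = [mutual_information([r[j] for r in train_rows], train_labels)
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores

def project(rows, columns):
    """Keep only the selected columns."""
    return [[r[j] for j in columns] for r in rows]

# Toy data: feature 0 equals the label, feature 1 is unrelated, feature 2 is constant.
train_rows = [[0, 1, 7], [1, 1, 7], [0, 0, 7], [1, 0, 7]]
train_labels = [0, 1, 0, 1]
test_rows = [[1, 0, 7], [0, 1, 7]]

selected, scores = select_top_k(train_rows, train_labels, k=1)  # training data only
train_sel = project(train_rows, selected)
test_sel = project(test_rows, selected)  # same columns, no re-scoring on test data
```

Because the score never looks at a classifier, the same selected columns can be fed to any of the five models.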
Specific classifiers can aid interpretability by providing extra information about the features/predictors, off the top of my head:
Logistic regression: you can obtain a p-value for every feature. In your interpretation you can focus on those that are 'significant' (e.g. p-value < 0.05). The same applies to two-class Linear Discriminant Analysis.
Random Forest: can return a variable importance index that ranks the variables from most to least important.
I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome.
This will depend on the algorithm. If you have 5 algorithms, you will likely get 5 slightly different answers, unless you perform the feature selection prior to classification (eg using mutual information). One reason is that Random Forests and neural networks would pick up nonlinear relationships while logistic regression wouldn't. Furthermore, Naive Bayes is blind to interactions.
So unless your research is explicitly about these 5 models, I would rather select one model and proceed with it.
Since your purpose is to get some intuition on what's going on, here is what you can do:
Let's start with Random Forest for simplicity, but you can do this with other algorithms too. First, you need to build a good model: good in the sense that you are satisfied with its performance, and robust, meaning that you have checked it on a validation and/or test set. These points are very important because we will analyse how the model makes its decisions, so if the model is bad you will get bad intuitions.
After building the model, you can analyse it at two levels: for the whole dataset (understanding your process), or for a given prediction. For this task I suggest looking at the SHAP library, which computes feature contributions (i.e. how much a feature influences the classifier's prediction) that can be used for both purposes.
For detailed instructions on this process and more tools, you can look at fast.ai's excellent machine learning course series, where lessons 2/3/4/5 cover this subject.
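To give a feel for what a "feature contribution" is, here is a small sketch that computes exact Shapley values by brute force for a toy model. This is not the SHAP library's API, and it is only feasible for a handful of features. It assumes a single baseline input that stands in for "feature absent"; toy_model and the baseline are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline input.

    Cost grows as 2**n in the number of features n, so this is only a
    teaching aid for very small n.
    """
    n = len(x)

    def value(subset):
        # Features in `subset` take their actual value; the rest take the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n-|S|-1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_model(z):
    # Hypothetical model: two linear terms plus an interaction of features 1 and 2.
    return 2.0 * z[0] + 1.0 * z[1] + z[1] * z[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The contributions add up to the difference between the prediction at x and the prediction at the baseline, and the interaction term is split evenly between the two features involved. That additivity is what lets the same numbers explain a single prediction or be averaged over the whole dataset.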
Hope it helps!
I have a set of data with 3 possible events. There are 24 features that affect which of the three events will happen.
I have training data with all the 24 features and which events happened.
What I want to do is use this data to predict which of the three events will happen next, given that all 24 feature values are known.
Could you suggest a machine learning algorithm that I should use to solve this problem?
This sounds like a typical classification problem in supervised learning. However, you haven't given us enough information to suggest a particular algorithm.
We would need statistics about the "shape" of your data: relative clustering and range, correlations among the features, etc. The critical points so far are that you have few classes (3) and many more features than classes. What have you considered so far? Backing up a little, what unsupervised classification algorithms have you researched well enough to use?
My personal approach is to hit such a generic problem with Naive Bayes or multi-class SVM, and use the resulting classification parameters as input for feature reduction. I might also try a CNN with one hidden layer (or none, just a single FC connection) and then examine the weights to eliminate extraneous features.
Given the large dimensionality, you might also try hitting it with k-means clustering to see whether the classification is already cohesive in 24-D space. Try k=6; in most runs, this will give you 3 good clusters and 3 tiny outliers.
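A minimal sketch of that k-means check, in plain Python: cluster with k=6 and measure how pure each cluster is with respect to the known event labels. The toy data is 2-D rather than 24-D, and the farthest-first initialisation is my own assumption, chosen to keep the example deterministic rather than because it is the usual choice.

```python
from collections import Counter

def sqdist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def farthest_first(points, k):
    """Deterministic init: repeatedly pick the point farthest from the chosen centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sqdist(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm; returns the cluster index of every point."""
    centroids = [list(c) for c in farthest_first(points, k)]
    assign = None
    for _ in range(iterations):
        new_assign = [min(range(k), key=lambda c: sqdist(p, centroids[c])) for p in points]
        if new_assign == assign:
            break  # converged
        assign = new_assign
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def purity(assign, labels):
    """Fraction of points that carry the majority label of their cluster."""
    by_cluster = {}
    for a, y in zip(assign, labels):
        by_cluster.setdefault(a, Counter())[y] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(labels)

# Toy data: three well-separated groups standing in for the three events.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (0, 10), (0, 11), (1, 10), (1, 11)]
labels = [0] * 4 + [1] * 4 + [2] * 4

assign = kmeans(points, k=6)
score = purity(assign, labels)
```

A purity close to 1 suggests the classes are already cohesive in feature space; a purity close to the share of the largest class suggests the clusters tell you little about the events.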
Does that get you moving toward a solution?
I have a large set of scanned documents that I need to index; however, the documents of interest, which my classifier needs to identify, are a small proportion of the entire package. To get an idea of the optimum number of classes and how best to merge documents into a class, I wanted to run an unsupervised clustering analysis.
Which distance measure would work better to capture the structural information? Also, would agglomerative hierarchical clustering be the best clustering approach for the given task? Thanks
An unsupervised clustering technique fails on scanned documents since it fails to grasp the underlying structure and ends up giving nonsensical clusters, so the approach is fundamentally flawed. However, classification using deep convolutional neural networks, with sufficient data and carefully chosen distinct classes, can outperform OCR techniques if the documents have a distinct structure.
I was looking for an automatic way to decide how many layers my network should have, depending on the data and the computer configuration. I searched the web but could not find anything. Maybe my keywords or my way of searching are wrong.
Do you have any idea?
The number of layers, or depth, of a neural network is one of its hyperparameters.
This means that it is a quantity that can not be learned from the data, but you should choose it before trying to fit your dataset. According to Bengio,
"We define a hyper-parameter for a learning algorithm A as a variable to be set prior to the actual application of A to the data, one that is not directly selected by the learning algorithm itself."
There are three main approaches to finding the optimal value of a hyperparameter. The first two are well explained in the paper I linked.
Manual search. Using well-known black magic, the researcher chooses the optimal value through trial and error.
Automatic search. The researcher relies on an automated routine in order to speed up the search.
Bayesian optimization.
More specifically, adding more layers to a deep neural network is likely to improve performance (reduce generalization error) up to a certain depth, beyond which it starts to overfit the training data.
So, in practice, you should train your ConvNet with, say, 4 layers, then try adding one hidden layer and train again, until you see some overfitting. Of course, strong regularization techniques (such as dropout) are required.
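The "add a layer and retrain until it overfits" procedure can be sketched as a simple loop. The sketch assumes you supply a function that builds and trains a network of a given depth and returns its validation error; train_and_validate, the patience parameter, and the fake error curve below are all hypothetical stand-ins, not part of any library.

```python
def choose_depth(train_and_validate, start_depth=4, max_depth=12, patience=1):
    """Grow the network one layer at a time; stop once validation error
    has failed to improve for more than `patience` consecutive depths."""
    best_depth, best_error = None, float("inf")
    bad_rounds = 0
    for depth in range(start_depth, max_depth + 1):
        error = train_and_validate(depth)
        if error < best_error:
            best_depth, best_error = depth, error
            bad_rounds = 0
        else:
            bad_rounds += 1
            if bad_rounds > patience:
                break  # deeper models are no longer helping
    return best_depth, best_error

# Stand-in for real training: a made-up validation curve that bottoms out at depth 6.
fake_errors = {4: 0.30, 5: 0.24, 6: 0.21, 7: 0.22, 8: 0.25}
tried = []

def fake_train_and_validate(depth):
    tried.append(depth)
    return fake_errors.get(depth, 1.0)

best_depth, best_error = choose_depth(fake_train_and_validate)
```

The important detail is that the stopping decision uses validation error, not training error, since training error usually keeps falling as layers are added.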
I have two dependent continuous variables and I want to use their combined values to predict the value of a third, binary variable. How do I go about discretizing/categorizing the values? I am not looking for clustering algorithms; I'm specifically interested in obtaining 'meaningful' discrete categories that I can subsequently use in a Bayesian classifier.
Pointers to papers, books, online courses, all very much appreciated!
That is the essence of machine learning and one of its most studied problems.
Least-squares regression, logistic regression, SVM, and random forest are widely used for this type of problem, which is called binary classification.
If your goal is to pragmatically classify your data, several libraries are available, like scikit-learn in Python and Weka in Java. Both have great documentation.
But if you want to understand the internals of machine learning, just search (here or on Google) for machine learning resources.
If you wanted to be a real nerd, generate a bunch of different possible discretizations and then train a classifier on it, and then characterize the discretizations by features and then run a classifier on that, and see what sort of discretizations are best!?
In general, discretizing is more of an art, and it relies on having a good understanding of what the input variable ranges mean.
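As a small-scale version of the "generate several discretizations and compare them" idea, here is a sketch that bins one continuous variable in two standard ways and scores each binning by its mutual information with the binary target. The scoring criterion and the toy data are my own assumptions; for two variables you could bin each one and use the pair of bin indices as the category.

```python
import math
from collections import Counter

def equal_width_bins(values, n_bins):
    """Split the range [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid division by zero for constant input
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign bins by rank so each bin holds roughly the same number of points.
    Note: tied values may land in different bins with this simple version."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

# Toy data: a skewed variable with one large outlier, and a binary target.
values = [1, 2, 3, 4, 5, 6, 7, 100]
target = [0, 0, 0, 0, 1, 1, 1, 1]

candidates = {
    "equal_width": equal_width_bins(values, 2),
    "equal_frequency": equal_frequency_bins(values, 2),
}
scores = {name: mutual_information(bins, target) for name, bins in candidates.items()}
best = max(scores, key=scores.get)
```

Here the outlier pushes almost every point into one equal-width bin, so the equal-frequency binning carries more information about the target. To avoid fooling yourself, score the candidate discretizations on training data and confirm the winner on held-out data.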