I am quite confused about the following two problems:
I have a 15-dimensional dataset that should be used to work out, via clustering, how many types of attacks it contains.
1. I have already clustered my dataset into 5 clusters (5 attacks). Does anyone know how I can tell which cluster corresponds to which attack? (That is, how do I label the clusters with something more meaningful than "cluster 1", "cluster 2", ...?)
2. In supervised classification, we have a training dataset and a testing dataset, and testing is done with the classifier built from the training dataset. My question is: can the same approach be used for clustering? That is, can I build a model with a clustering algorithm and then automatically classify new instances into a specific cluster? Is this achievable?
How should an unsupervised method be able to identify named attacks?
The human-assigned name is not in the data!
For some clustering algorithms you can assign new instances automatically, but in general you cannot (not without knowing the model used by the clustering). In the worst case, a new observation might even merge two clusters into one. What are you going to do then?
If you want classification, use classification, not clustering.
Clustering has a very different mindset. If you approach it from a classification point of view, you will not really understand it. You use clustering to find something unknown in your data, and classification to generalize something known to new data.
If necessary, you can also train a classifier on your clusters. But don't do this blindly. First make sure that the clusters are actually something useful. It's much easier to come up with a completely meaningless clustering result than with a good clustering, and training a classifier on worthless clusters won't produce meaningful output.
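For illustration, here is a minimal sketch of that two-step idea in scikit-learn: cluster first, inspect the clusters, and only then fit a classifier on the cluster labels. The variable names (X, X_new), the choice of K-Means with 5 clusters, and the random forest are assumptions made for the example, not anything given in the question.

```
# Minimal sketch: cluster, then train a classifier on the cluster labels.
# X and X_new are placeholder arrays (e.g. the 15-dimensional attack features).
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

kmeans = KMeans(n_clusters=5, random_state=0).fit(X)
pseudo_labels = kmeans.labels_                      # cluster id per training point

# Only do this after checking that the clusters are actually meaningful!
clf = RandomForestClassifier(random_state=0).fit(X, pseudo_labels)
new_cluster = clf.predict(X_new)                    # assign new instances to a cluster
```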
I am classifying a client's clients. However, the data is fluid and the clusters can change every day.
Running new clusters daily to update user clusters is difficult because K-means is inconsistent in how it labels the clusters from run to run.
If we cluster, then train a model on the cluster labels (say a neural network or XGBoost), and from then on simply predict the clusters, does this make sense? Is it a good way to do things?
Yeah, it does make sense; it's just a regular classification task at that point. You should have enough data assigned to clusters, though, before moving on to a neural network.
On the other hand, why don't you predict clusters for new points instead of re-clustering (you can see the separate fit and predict methods in sklearn's docs, though it depends on the technology you are using)? Remember that the neural network will only be as good as its input (the K-Means clusters), and its predictions will probably be similar to K-Means.
Furthermore, NNs are more complicated and harder to train; maybe they shouldn't be your first choice.
You could also look into fuzzy clustering; since the data is fluid, it might be a better fit for your case. Autoencoders, as a method of obtaining latent variables, might be of use as well.
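As a minimal sketch of that fit/predict split in scikit-learn (the names X_history and X_today and the cluster count are placeholders):

```
# Fit K-Means once on the existing data, then assign new daily points to the
# same centroids instead of re-clustering, so cluster ids stay consistent.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, random_state=0)
kmeans.fit(X_history)                     # learn centroids once

today_clusters = kmeans.predict(X_today)  # nearest-centroid assignment for new points
```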
I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome. I have 5 algorithms:
Neural Networks
Logistic Regression
Naive Bayes
Random Forest
AdaBoost
I read a lot about the Information Gain technique, and it seems to be independent of the machine learning algorithm used. It is like a preprocessing technique.
My question is: is it best practice to perform feature importance for each algorithm separately, or to just use Information Gain? If the former, what are the techniques used for each?
First of all, it's worth stressing that you have to perform the feature selection based on the training data only, even if it is a separate algorithm. During testing, you then select the same features from the test dataset.
Some approaches that spring to mind:
Mutual information based feature selection (eg here), independent of the classifier (a short sketch follows after these lists).
Backward or forward selection (see stackexchange question), applicable to any classifier but potentially costly since you need to train/test many models.
Regularisation techniques that are part of the classifier optimisation, eg Lasso or elastic net. The latter can be better in datasets with high collinearity.
Principal components analysis or any other dimensionality reduction technique that groups your features (example).
Some models compute latent variables which you can use for interpretation instead of the original features (e.g. Partial Least Squares or Canonical Correlation Analysis).
Specific classifiers can aid interpretability by providing extra information about the features/predictors; off the top of my head:
Logistic regression: you can obtain a p-value for every feature. In your interpretation you can focus on those that are 'significant' (eg p-value <0.05). (The same goes for two-class Linear Discriminant Analysis.)
Random Forest: can return a variable importance index that ranks the variables from most to least important.
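As a rough sketch of the classifier-independent route mentioned above, using scikit-learn's mutual information scorer; the variable names and the choice of keeping 10 of the ~30 features are illustrative only:

```
# Select features by mutual information on the training split only,
# then reuse exactly the same features on the test split.
from sklearn.feature_selection import SelectKBest, mutual_info_classif

selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_train_sel = selector.fit_transform(X_train, y_train)   # fit on training data only
X_test_sel = selector.transform(X_test)                   # same columns at test time

print(selector.get_support(indices=True))                 # indices of the kept features
```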
I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome.
This will depend on the algorithm. If you have 5 algorithms, you will likely get 5 slightly different answers, unless you perform the feature selection prior to classification (eg using mutual information). One reason is that Random Forests and neural networks would pick up nonlinear relationships while logistic regression wouldn't. Furthermore, Naive Bayes is blind to interactions.
So unless your research is explicitly about these 5 models, I would rather select one model and proceed with it.
Since your purpose is to get some intuition on what's going on, here is what you can do:
Let's start with Random Forest for simplicity, but you can do this with other algorithms too. First, you need to build a good model: good in the sense that you are satisfied with its performance, and robust, meaning that you should use a validation and/or a test set. These points are very important, because we will analyse how the model takes its decisions, so if the model is bad you will get bad intuitions.
After having built the model, you can analyse it at two levels: for the whole dataset (understanding your process), or for a given prediction. For this task I suggest you look at the SHAP library, which computes feature contributions (i.e. how much a feature influences the prediction of the classifier) and can be used for both purposes.
For detailed instructions about this process and more tools, you can look at fast.ai's excellent machine learning course series, where lessons 2/3/4/5 cover this subject.
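As a minimal sketch, assuming a Random Forest already fitted as model on a pandas DataFrame X_train (both names are placeholders); note that for multi-class models SHAP typically returns one array of contributions per class:

```
import shap

explainer = shap.TreeExplainer(model)         # exact explanations for tree ensembles
shap_values = explainer.shap_values(X_train)  # per-feature contribution for each sample

# Global view: which features drive the model's predictions overall
shap.summary_plot(shap_values, X_train)
```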
Hope it helps!
Why does the behaviour of different classifiers differ for different data?
Based on what parameters can we decide on a good classifier for a particular dataset?
For some datasets Naive Bayes gives better accuracy than an SVM classifier, and for other datasets the SVM performs better than Naive Bayes. Why is that? What is the reason?
Those are completely different classifiers. If one classifier were always better than the other, why would you need the "bad one" at all?
First google hit about when SVM's are not the best choice:
https://www.quora.com/For-what-kind-of-classification-problems-is-SVM-a-bad-approach
There is no general answer to this question. To understand which classifier to use when, you will need to understand the algorithm behind the classification procedure.
For instance, logistic regression models the log-odds of the outcome as a linear combination of the features. It is generally useful when no single parameter is a uniquely deciding factor but the combined weighting of the factors makes a difference, for instance in text classification.
Decision trees, on the other hand, split on the basis of the parameter that gives the most information. So if you have a set of parameters that is highly correlated with the label, then it makes more sense to use decision-tree-based classifiers.
SVMs work by identifying adequate hyperplanes. They are generally useful when the data cannot be separated in the original space, but projecting it into a higher-dimensional space makes it easy to separate. This is a nice tutorial on SVMs: https://blog.statsbot.co/support-vector-machines-tutorial-c1618e635e93
In short, the only way to learn which classifier will be better in which situation is to understand how they work, and then figure out whether they suit your situation.
Another, cruder way is to try every classifier and pick the best one, but I don't think you are interested in that.
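If you do want the crude route, a minimal sketch would be to cross-validate a couple of candidates on the same data and compare scores (X and y are placeholders for your features and labels):

```
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

for name, clf in [("Naive Bayes", GaussianNB()), ("SVM (RBF)", SVC())]:
    scores = cross_val_score(clf, X, y, cv=5)          # 5-fold cross-validation
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```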
I have a set of data in which 3 possible events can occur. There are 24 features that affect which of the three events will happen.
I have training data with all 24 features and which event happened.
What I want to do is use this data to predict which of the three events will happen next, given that all 24 feature values are known.
Could you suggest some machine learning algorithms that I should use to solve this problem?
This sounds like a typical classification problem in supervised learning. However, you haven't given us enough information to suggest a particular algorithm.
We would need statistics about the "shape" of your data: relative clustering and range, correlations among the features, etc. The critical points so far are that you have few classes (3) and many more features than classes. What have you considered so far? Backing up a little, what unsupervised classification algorithms have you researched well enough to use?
My personal approach is to hit such a generic problem with Naive Bayes or multi-class SVM, and use the resulting classification parameters as input for feature reduction. I might also try a CNN with one hidden layer (or none, just a single FC connection) and then examine the weights to eliminate extraneous features.
Given the large dimensionality, you might also try hitting it with k-means clustering to see whether the classification is already cohesive in 24-D space. Try k=6; in most runs, this will give you 3 good clusters and 3 tiny outliers.
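A minimal sketch of that k-means sanity check, assuming the 24 features are in X and the event labels are integer-coded 0/1/2 in y (both names are placeholders):

```
import numpy as np
from sklearn.cluster import KMeans

labels = KMeans(n_clusters=6, random_state=0).fit_predict(X)

# Cross-tabulate clusters against the 3 known events to see how cohesive they are
for c in range(6):
    counts = np.bincount(y[labels == c], minlength=3)
    print(f"cluster {c}: size={counts.sum()}, event counts={counts}")
```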
Does that get you moving toward a solution?
I was looking for an automatic way to decide how many layers to use in my network, depending on the data and the computer configuration. I searched the web but could not find anything. Maybe my keywords or the way I'm searching are wrong.
Do you have any idea?
The number of layers, or depth, of a neural network is one of its hyperparameters.
This means that it is a quantity that cannot be learned from the data; you have to choose it before trying to fit your dataset. According to Bengio,
"We define a hyperparameter for a learning algorithm A as a variable to be set prior to the actual application of A to the data, one that is not directly selected by the learning algorithm itself."
There are three main approaches to finding the optimal value of a hyperparameter. The first two are well explained in the paper I linked.
Manual search. Using well-known black magic, the researcher chooses the optimal value through trial and error.
Automatic search. The researcher relies on an automated routine in order to speed up the search.
Bayesian optimization.
More specifically, adding more layers to a deep neural network is likely to improve performance (reduce generalization error), up to a certain depth beyond which it starts to overfit the training data.
So, in practice, you can train your ConvNet with, say, 4 layers, then try adding one more hidden layer and train again, and repeat until you see some overfitting. Of course, strong regularization techniques (such as dropout) are required.
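As a minimal sketch of that add-a-layer-and-retrain loop (using Keras with a plain dense network for brevity; the layer sizes, input dimension, and the data names X_train, y_train, X_val, y_val are all placeholders):

```
from tensorflow import keras

def build_model(n_layers, n_units=64, input_dim=20, n_classes=10):
    model = keras.Sequential()
    model.add(keras.Input(shape=(input_dim,)))
    for _ in range(n_layers):
        model.add(keras.layers.Dense(n_units, activation="relu"))
        model.add(keras.layers.Dropout(0.5))   # strong regularisation, as suggested above
    model.add(keras.layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Start at 4 layers and add one at a time; a growing gap between training and
# validation loss is the sign of overfitting that tells you to stop.
for depth in range(4, 9):
    model = build_model(depth)
    history = model.fit(X_train, y_train,
                        validation_data=(X_val, y_val),
                        epochs=30, verbose=0)
    gap = history.history["val_loss"][-1] - history.history["loss"][-1]
    print(f"{depth} layers: train/val loss gap = {gap:.3f}")
```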