How to identify the features that have the highest impact on the KNN model's predictions?
I have used permutation importance from the eli5 module as a proxy. What other ways are there to get feature importance?
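For reference, the permutation-importance approach mentioned above can be sketched with scikit-learn's own `permutation_importance` instead of eli5. This is only an illustration under assumptions: the synthetic dataset, the scaling step and `n_neighbors=5` are made up for the example and are not taken from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): with shuffle=False the first 2 columns are the
# informative features and the remaining 4 are pure noise.
X, y = make_classification(n_samples=500, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# KNN is distance based, so features are standardised before fitting.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

# Shuffle one column at a time on held-out data and record the accuracy drop.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=20, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
for idx in ranking:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f}")
```

Because the scores are computed on held-out data, they reflect what the fitted KNN model actually relies on rather than a property of the data alone.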
Related
I know that support vector machines, random forests and logistic regression are well-known machine learning (ML) algorithms for classification.
I'm confused about the terminology: feature extraction, feature selection and classification.
Are the above ML algorithms used for extracting features rather than for selecting them?
Do the ML algorithms include both the feature extraction and the classification steps?
Do the results of training the ML algorithm (accuracy, specificity, sensitivity, ...) tell us how well a disease is classified after the feature extraction?
Regarding your confusion about the three terms:
Feature extraction: creating new features out of raw data. For example, you have a transaction_day column but are only interested in the month, so you create a new column "transaction_month" out of "transaction_day".
Feature selection: you have many features but want to keep only the important ones (how many to keep is a separate topic). This can speed up learning and, with the right strategy, you would not sacrifice accuracy in many applications.
Classification: a family of supervised (labeled-data) machine learning tasks in which the goal is to assign observations to known classes (for example, emails to the spam or normal class).
Note: some machine learning algorithms such as Lasso have built-in feature selection. For others, a coefficient that is large in magnitude after training usually indicates that the feature is important, provided the features are on comparable scales (read more about recursive feature elimination (RFE)).
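The three terms can be put side by side in a small sketch. It assumes pandas and scikit-learn; the dates, the synthetic dataset and the choice of keeping 3 features are invented for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Feature extraction: derive a new column from raw data.
df = pd.DataFrame({"transaction_day": pd.to_datetime(
    ["2021-01-15", "2021-03-02", "2021-12-31"])})
df["transaction_month"] = df["transaction_day"].dt.month

# Synthetic classification data (assumption): 10 features, 3 informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# Feature selection: recursive feature elimination repeatedly drops the
# feature with the smallest coefficient until 3 remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)
selected = np.flatnonzero(selector.support_)

# Classification: fit a classifier on the selected features only.
clf = LogisticRegression(max_iter=1000).fit(X[:, selected], y)
print("selected columns:", selected)
print("training accuracy:", clf.score(X[:, selected], y))
```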
You may also find a good discussion in this post.
For an e-commerce company, how should features be chosen when doing click-through rate (CTR) prediction using logistic regression, SVM or other machine learning models?
I tried gender and statistical features derived from goods tags, with an SVM and a neural network, but the results were very bad.
Are there any suggestions or best practices regarding the important factors for CTR prediction in e-commerce? Thanks!
When you use a library like scikit-learn, you can use GridSearchCV to find the best parameters for the model you're building. You can specify the evaluation metric that you want to optimize, so in your case you first need to understand what the evaluation metric is.
Read about it here:
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
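A minimal sketch of such a search follows. The imbalanced synthetic data, the parameter grid and the choice of `scoring="roc_auc"` are assumptions made for the example; the right metric depends on your own problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced data (assumption): roughly 10% positives,
# loosely mimicking rare clicks.
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.9, 0.1], random_state=0)

# Candidate values for the regularisation strength C (illustrative only).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# scoring selects the metric that the search optimises via cross-validation.
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring="roc_auc", cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated ROC AUC:", round(search.best_score_, 3))
```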
So far, I have read some highly cited metric learning papers. The general idea of such papers is to learn a mapping such that mapped data points with the same label lie close to each other and far from samples of other classes. To evaluate such techniques, they report the accuracy of a KNN classifier on the generated embedding. So my question is: if we have a labelled dataset and we are interested in increasing the accuracy of a classification task, why don't we learn a classifier on the original data points? I mean, instead of finding a new embedding which suits a KNN classifier, we can learn a classifier that fits the (not embedded) data points. Based on what I have read so far, the classification accuracy of such classifiers is much better than that of metric learning approaches. Is there a study that shows metric learning + KNN performs better than fitting a (good) classifier, at least on some datasets?
Metric learning models CAN BE classifiers. So I will answer the question of why we need metric learning for classification.
Let me give you an example. Suppose you have a dataset with millions of classes, and some classes have only a few examples, say fewer than 5. If you use classifiers such as SVMs or normal CNNs, you will find them impossible to train, because those classifiers (discriminative models) will effectively ignore the classes with few examples.
But for metric learning models this is not a problem, since they are based on generative models.
By the way, a large number of classes is itself a challenge for discriminative models.
Real-life challenges like these inspire us to explore better models.
As #Tengerye mentioned, you can use models trained using metric learning for classification. KNN is the simplest approach but you can take the embeddings of your data and train another classifier, be it KNN, SVM, Neural Network, etc. The use of metric learning, in this case, would be to change the original input space to another one which would be easier for a classifier to handle.
Apart from discriminative models being hard to train when the data is unbalanced or, even worse, has very few examples per class, they cannot be easily extended to new classes.
Take facial recognition, for example. If facial recognition models are trained as classification models, they only work for the faces they have seen and won't work for any new face. Of course, you could add images of the faces you wish to add and retrain or fine-tune the model if possible, but this is highly impractical. On the other hand, facial recognition models trained using metric learning can generate embeddings for new faces, which can easily be added to the KNN, and your system can then identify the new person given his/her image.
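The enrollment step described above can be sketched as follows. Everything here is a toy assumption: `embed` is a hypothetical stand-in for a trained metric-learning model (it only L2-normalises its input), and `fake_images` produces made-up vectors instead of real face images.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def embed(x):
    """Hypothetical stand-in for a trained embedding model: L2-normalise."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def fake_images(identity, n):
    """Toy 'images': noisy copies of one fixed vector per identity."""
    center = np.eye(8)[identity]
    return center + 0.05 * rng.normal(size=(n, 8))

# Gallery of embeddings for two known people.
gallery = embed(np.vstack([fake_images(0, 5), fake_images(1, 5)]))
labels = ["alice"] * 5 + ["bob"] * 5
knn = KNeighborsClassifier(n_neighbors=1).fit(gallery, labels)

# Enrolling a new person: embed a couple of images and append them.
# The embedding model itself is not retrained; KNN just stores the points.
gallery = np.vstack([gallery, embed(fake_images(2, 2))])
labels = labels + ["carol"] * 2
knn = KNeighborsClassifier(n_neighbors=1).fit(gallery, labels)

prediction = knn.predict(embed(fake_images(2, 1)))[0]
print(prediction)
```

Refitting the KNN here only rebuilds the lookup structure over the stored embeddings, which is cheap compared with retraining a classification network with an extra output class.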
I used logistic regression as a classifier. I have six features, and I want to know which features influence the classifier's result more than the others. I used information gain, but it seems that it doesn't depend on the classifier used. Is there any method to rank the features according to their importance for a specific classifier (like logistic regression)?
Any help would be highly appreciated.
You could use Random Forest Classifier to give you a ranking of your features. You could then select the top x features from this and use it for logistic regression, although Random Forest would work perfectly fine as well.
Check out variable importance at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
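A minimal sketch of this suggestion, assuming scikit-learn and a synthetic six-feature dataset (the data, the number of trees and the choice of keeping the top 2 features are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (assumption): with shuffle=False the first 2 of the
# 6 columns are informative and the rest are noise.
X, y = make_classification(n_samples=500, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)

# Rank features by the forest's impurity-based importances.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]  # best first
print("ranking:", ranking)

# Keep the top x features (x = 2 here) and use them for logistic regression.
top = ranking[:2]
score = cross_val_score(LogisticRegression(max_iter=1000),
                        X[:, top], y, cv=5).mean()
print("cross-validated accuracy with top features:", round(score, 3))
```

Note that this ranking reflects what the random forest finds useful, which is not necessarily the same as what logistic regression relies on.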
One way to do this is by null hypothesis significance testing. Basically, for each feature, you test for evidence that the coefficient of that feature is nonzero. Most statistical software reports the results of these tests by default in the model summary (Scikit-learn and other machine-learning oriented tools tend to not do so). With a small number of features, you can use this information and stepwise regression to rank the importance of the features.
I'm using Weka to perform classification, clustering, and some regression on a few large data sets. I'm currently trying out all the classifiers (decision tree, SVM, naive Bayes, etc.).
Is there a way (in Weka or another machine learning toolkit) to sweep through all the available classifier algorithms to find the one that produces the best cross-validated accuracy or other metric?
I'd like to find the best clustering algorithm, too, for my other clustering problem; perhaps finding the lowest sum-of-squared-error?
Isn't that some kind of overfitting, too? Trying tons of classifiers, and choosing the best?
Also note that preprocessing is usually very important, and different classifiers may need different preprocessing; and each classifier has in turn a dozen or so parameters...
The same goes for clustering: don't choose a clustering algorithm by some metric. If you choose, e.g., "lowest sum-of-squares", k-means will win, not because it is better but because it is overfit to your evaluation method: k-means optimizes the sum-of-squares. The results may be bad on other metrics, but on SSQ they are by design a local optimum.
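The sum-of-squares point can be checked with a small sketch. It assumes scikit-learn; the uniform random data, `k = 4` and complete-linkage clustering as the competitor are arbitrary choices. To keep the comparison deterministic, k-means is started from the other algorithm's centroids, so its iterations can only lower the SSQ or leave it unchanged.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

def sse(X, labels):
    """Sum of squared distances of points to their own cluster mean."""
    return sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in np.unique(labels))

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 2))  # data without any real cluster structure
k = 4

# A competitor that does not optimise the sum-of-squares directly.
agg_labels = AgglomerativeClustering(n_clusters=k,
                                     linkage="complete").fit_predict(X)
agg_sse = sse(X, agg_labels)

# Start k-means from the competitor's centroids and let it iterate.
centroids = np.vstack([X[agg_labels == j].mean(axis=0) for j in range(k)])
km = KMeans(n_clusters=k, init=centroids, n_init=1).fit(X)

print("complete linkage SSQ:", round(agg_sse, 3))
print("k-means SSQ:        ", round(km.inertia_, 3))
```

A lower SSQ here says nothing about which partition is more meaningful; it only shows that the metric favours the algorithm that optimises it.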
Data mining is not something you can automate to a push-button level.
It's a skill that requires experience in how to preprocess, choose algorithms, adjust parameters and evaluate the actual outcome. Otherwise, there would be software on the market where you just feed in your data and get the optimal classifier out.