How to choose the right normalization method for the right dataset? - machine-learning

There are several normalization methods to choose from: L1/L2 norm, z-score, min-max. Can anyone give some insight into how to choose the proper normalization method for a dataset?
I didn't pay much attention to normalization before, but I just got a small project whose performance was heavily affected not by the parameters or choice of ML algorithm but by the way I normalized the data. That was kind of a surprise to me, but it may be a common problem in practice. So, could anyone provide some good advice? Thanks a lot!
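For reference, a minimal sketch of the three methods named in the question, using scikit-learn on a toy NumPy matrix (the data here is only a placeholder, not from the project above):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer

    X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 600.0]])

    # Min-max: rescales each column to the [0, 1] range.
    X_minmax = MinMaxScaler().fit_transform(X)

    # Z-score: centers each column to mean 0 and unit variance.
    X_zscore = StandardScaler().fit_transform(X)

    # L2 norm: scales each ROW to unit Euclidean length (use norm="l1" for L1).
    X_l2 = Normalizer(norm="l2").fit_transform(X)

Note that min-max and z-score work per feature (column), while the L1/L2 normalizer works per sample (row), which is one reason they can affect a model so differently.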

Related

Is there something like an ad-hoc approach in machine learning?

When we use a machine learning approach, we divide the data set into training and test data and, in effect, we always use a post hoc approach: we use all the data and then calculate the y-value for a new query.
But is there such a thing as an ad hoc approach where we can go through feature by feature for a new query and see how our prediction changes?
The advantage of this would be that we know exactly which feature has changed the predictions and how.
I would be grateful for any advice, including literature references, as I don't really know how to google it. It is also possible that the term ad-hoc approach is not chosen correctly.
This is a very vague question. Also, why would you want to know how the prediction changes? You usually want to know which feature contributes most towards finding the 'best' prediction/correct classification. That is approached by looking at feature importance, which comes in different flavors for different algorithms.
In case that is roughly what you were looking for, take a look at Permutation Feature Importance, the Boruta Algorithm, SHAP Feature Importance, feature importance for tree-based algorithms, ...
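A minimal sketch of the permutation feature importance idea with scikit-learn; the dataset and model below are placeholders, not from the original question:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # Shuffle each feature in turn and measure how much the test score drops.
    result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
    for i in result.importances_mean.argsort()[::-1][:5]:
        print(i, result.importances_mean[i])

A large drop for a feature means the model relies on it heavily, which is close to the "which feature changed the prediction and how much" question, albeit at the model level rather than per query (SHAP values cover the per-query case).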

How to give a logical reason for choosing a model

I used machine learning to classify depression-related sentences. LinearSVC performed best; in addition to LinearSVC, I experimented with MultinomialNB and LogisticRegression, and I chose the model with the highest accuracy among the three. What I want is to be able to judge in advance which model will fit, like the ml_map provided by scikit-learn. Where can I get this information? I searched a few papers, but couldn't find anything more detailed than the observation that SVMs are suitable for text classification. How do I study to get prior knowledge like this ml_map?
How do I study to get prior knowledge like this ml_map?
Try working with different example datasets of different data types, using different algorithms. There are hundreds to be explored. Once you get a good grasp of how they work, it will become clearer. And do not forget to try googling something like "advantages of algorithm X"; it helps a lot.
And here are my thoughts; I used to ask such questions myself, and I hope this helps if you are struggling: the more you work with different machine learning models on a specific problem, the sooner you will realize that data and feature engineering play a more important part than the algorithms themselves. The road map provided by scikit-learn gives you a good view of which group of algorithms to use for certain types of data, and that is a good start. The boundaries between them, however, are rather subtle. In other words, one problem can be solved by different approaches depending on how you organize and engineer your data.
To sum it up, in order to achieve good out-of-sample performance (i.e., good generalization) while solving a problem, it is mandatory to examine the training/testing process with different combinations of settings and to be mindful of your data (for example, answer this question: does it cover most samples in terms of the distribution in the wild, or just a portion of it?).
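To make the "try different settings" advice concrete, here is a minimal sketch that compares the three models from the question by cross-validation on TF-IDF features; the toy sentences below are placeholders for your own corpus:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Placeholder data: replace with your own sentences and labels.
    texts = ["i feel hopeless", "life is pointless", "what a great day",
             "i love this sunny morning", "nothing matters anymore",
             "so happy to see my friends"] * 10
    labels = [1, 1, 0, 0, 1, 0] * 10

    for clf in (LinearSVC(), MultinomialNB(), LogisticRegression(max_iter=1000)):
        pipe = make_pipeline(TfidfVectorizer(), clf)
        scores = cross_val_score(pipe, texts, labels, cv=5)
        print(type(clf).__name__, scores.mean())

Running such a comparison on your own data is usually a more reliable "prior" than any general map of which algorithm fits which problem.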

Is there any model/classifier that works best for NLP-based projects like this?

I've written a program to analyze a given piece of text from a website and make conclusory classifications as to its validity. The code basically vectorizes the description (taken from the HTML of a given webpage in real-time) and takes in a few inputs from that as features to make its decisions. There are some more features like the domain of the website and some keywords I've explicitly counted.
The highest accuracy I've been able to achieve is with a RandomForestClassifier (>90%). I'm not sure what I can do to improve this accuracy other than incorporating a more sophisticated model. I tried using an MLP, but for no set of hyperparameters does it seem to exceed the previous accuracy. I have around 2000 data points available for training.
Is there any classifier that works best for such projects? Does anyone have any suggestions as to how I can bring about improvements? (If anything needs to be elaborated, I'll do so.)
Any suggestions on how I can improve this project in general? Should I include the text on a webpage as well? How should I do so? I tried going through a few sites, but the text doesn't seem to be contained in any specific element, whereas the description is easy to obtain from the HTML. Any help?
What else can I take as features? If anyone could suggest any creative ideas, I'd really appreciate it.
You can search with the keyword NLP. The task you are facing is a hot topic among those who study deep learning, and is called natural language processing.
RandomForest is a machine learning algorithm and probably works quite well here. Using other classical machine learning algorithms might improve your accuracy, or it might not; if you want to try out other lightweight algorithms, that's fine.
Deep learning will most likely outperform your current model. Starting from the keyword NLP, you'll find many models, such as Word2Vec, BERT, and so on. You can find the code for all of them on GitHub.
One tip: think carefully about whether you can actually train the model. Training BERT from scratch is a crazy thing to attempt for a beginner, and even for an expert. Instead, take a pretrained model and fine-tune it, or just use the pretrained word vectors.
I hope this works out.
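As an illustration of the "use a pretrained model" tip, a minimal sketch that replaces hand-built features with frozen sentence embeddings and feeds them to the classifier already in use. This assumes the sentence-transformers package and the all-MiniLM-L6-v2 checkpoint, neither of which is part of the original setup:

    from sentence_transformers import SentenceTransformer
    from sklearn.ensemble import RandomForestClassifier

    # Placeholder descriptions and labels: replace with your scraped data.
    texts = ["some page description", "another page description"]
    labels = [0, 1]

    # Encode each description with a frozen pretrained model (no training needed).
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    X = encoder.encode(texts)

    # Train the same kind of classifier on the embeddings.
    clf = RandomForestClassifier().fit(X, labels)
    print(clf.predict(encoder.encode(["a new page description"])))

With only ~2000 data points, keeping the pretrained encoder frozen like this is usually safer than fine-tuning it end to end.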

How to perform linear/logistic regression on the predictions of different models (say random forest, gbm, svm, etc.)?

Basically this is done to improve predictions by creating an ensemble, but how do we do that? Could somebody please explain using sample code in R? I am just a learner; any help would be greatly appreciated.
Thank you.
Prediction aggregation in ensembles can be done in a large variety of ways. The simplest approach is majority voting (classification) or averaging the predictions of all base models (regression).
Often, complex aggregation schemes are not much better than the basic ones (and are very sensitive to overfitting). This is why specialized packages such as EnsembleSVM only permit very basic aggregation (a linear combination at best).
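The question asks for R, where packages such as caret and caretEnsemble cover this kind of stacking; as an illustration of the idea, here is a minimal sketch in Python with scikit-learn's StackingClassifier, which fits a logistic-regression meta-model on the out-of-fold predictions of a random forest, a GBM and an SVM (the dataset is only a stand-in):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                                  StackingClassifier)
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)

    # Base models; their out-of-fold predictions become features for the meta-model.
    base = [("rf", RandomForestClassifier(random_state=0)),
            ("gbm", GradientBoostingClassifier(random_state=0)),
            ("svm", SVC(probability=True, random_state=0))]

    # Logistic regression learns how to weight the base-model predictions.
    stack = StackingClassifier(estimators=base,
                               final_estimator=LogisticRegression(max_iter=1000))
    print(cross_val_score(stack, X, y, cv=5).mean())

The key point, regardless of language, is that the meta-model must be trained on out-of-fold predictions rather than on predictions the base models made for their own training data, otherwise the stack overfits badly.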

What are the common algorithms to calculate similarity between zones of images?

I have already tried mean squared error and cross-correlation, but they don't give me very good results. I'm doing this for brain MRI. Thank you.
I have seen principal component analysis used to compare separate brain scan images.
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5874201
This might be useful, but I am not entirely sure what you are trying to do with similarity between zones.
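To make the PCA suggestion concrete, a minimal sketch that projects image patches ("zones") onto a few principal components and compares them in that reduced space; the random patches below are placeholders for real MRI data:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.metrics.pairwise import cosine_similarity

    # Placeholder: 100 patches of 16x16 pixels, flattened to vectors.
    rng = np.random.default_rng(0)
    patches = rng.random((100, 16 * 16))

    # Project onto the first few principal components to discard pixel-level noise.
    pca = PCA(n_components=10)
    reduced = pca.fit_transform(patches)

    # Compare two zones in the reduced space instead of raw pixel space.
    sim = cosine_similarity(reduced[0:1], reduced[1:2])[0, 0]
    print(sim)

Comparing in the PCA subspace is often more robust than pixel-wise MSE or cross-correlation, because small misalignments and intensity noise are largely absorbed by the discarded components.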
