Is it always standard practice to normalize features? - machine-learning

It seems like every single machine learning method (perceptron, SVM, etc.) warns you about the need to normalize all the features during preprocessing.
Is this always true for all common machine learning methods, or am I just running into the few that require normalized features?

In general it is a good idea to normalize, since many ML methods need it and others do not care at all (thus you do no harm to the process). The only exception is methods crafted for very specific types of data, especially if you have features that represent completely different classes of objects and a specialized method which is aware of that (for example, a kernel that treats dates differently from 'regular' numbers).
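As a concrete illustration, here is a minimal sketch using scikit-learn's StandardScaler (the array X is a made-up placeholder for your feature matrix):
import numpy as np
from sklearn.preprocessing import StandardScaler
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # placeholder feature matrix
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and unit variance
Note that the scaler should be fitted on the training set only and then reused to transform the test set, so that no test-set information leaks into preprocessing.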

Related

In a regression task, how do I find which independent variables are to be ignored or are not important?

In the regression problem I'm working with, there are five independent columns and one dependent column. I cannot share the data set details directly due to privacy, but one of the independent variables is an ID field which is unique for each example.
I feel like I should not be using the ID field when estimating the dependent variable, but this is just a gut feeling; I have no strong reason for it.
What shall I do? Is there any way to decide which variables to use and which to ignore?
Well, I agree with #desertnaut: the ID attribute does not seem relevant when creating a model and provides no help in prediction.
The term you are looking for is feature selection. Since it's a broad topic, I'll just mention the methods most commonly used by data scientists.
For regression problems, you can try a correlation heatmap to find the features that are highly correlated with the target:
import seaborn as sns  # assumes df is a pandas DataFrame of your features plus target
sns.heatmap(df.corr())
There are several other ways too, like PCA, or using trees' built-in feature selection (feature importances) to find the right features for your model.
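For instance, a minimal sketch of the tree-based route with a random forest (df, 'id' and 'target' are hypothetical names for your DataFrame and columns):
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
X = df.drop(columns=['id', 'target'])  # candidate features, ID excluded
y = df['target']
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))  # higher values suggest more useful features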
You can also try James Phillips' method. This approach is limited since training time grows linearly with the number of features, but in your case, where you have only four features to compare, you can try it out. Compare the regression model trained with all four features against models trained with only three, dropping each of the four features in turn; this means training four extra regression models and comparing them.
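A minimal sketch of that drop-one-feature comparison, again assuming a DataFrame df with a hypothetical 'target' column:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
features = [c for c in df.columns if c != 'target']
baseline = cross_val_score(LinearRegression(), df[features], df['target'], cv=5).mean()
print('all features:', baseline)
for f in features:
    subset = [c for c in features if c != f]
    score = cross_val_score(LinearRegression(), df[subset], df['target'], cv=5).mean()
    print('without', f, ':', score)  # if the score barely drops, f may be expendable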
According to you, the ID variable is unique for each example, so the model won't be able to learn anything from it: every example brings a new ID, and since each ID occurs only once there are no general patterns to learn.
Regarding feature elimination, it depends. If you have domain knowledge, you can engineer or remove features based on that alone. If you don't know much about the domain, you can try basic techniques like backward selection or forward selection via cross-validation to find the model with the best value of the metric you're working with.
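If you would rather not code that by hand, scikit-learn ships a ready-made version of forward/backward selection; a sketch, with X and y as placeholders for your features and target, and the number of features to keep chosen arbitrarily:
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
selector = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                     direction='backward', cv=5)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the features that were kept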

TML (Tractable Markov Logic) is a wonderful model! Why haven't I seen it used across a wide range of AI application scenarios?

I have been reading papers about Markov models, and then a great extension, TML (Tractable Markov Logic), came out.
It is a subset of Markov logic, and uses probabilistic class and part hierarchies to control complexity.
This model has both complex logical structure and uncertainty.
It can represent objects, classes, and relations between objects, subject to certain restrictions which ensure that inference in any model built in TML can be queried efficiently.
I am just wondering: why is such a good idea not spreading widely across application areas like activity analysis?
My understanding is that inference in TML is polynomial in the size of the model, but the model has to be compiled for a given problem, and the compiled form may become exponentially large. So, in the end, it's still not really tractable.
However, it may be advantageous to use it when the compiled form will be used multiple times, because then the compilation is done only once for many queries. Also, once you obtain the compiled form, you know what to expect in terms of run-time.
However, I think the main reason you don't see TML being used more broadly is that it is just an academic idea. There is no robust, general-purpose system based on it. If you try to work on a real problem with it, you will probably find out that it lacks certain practical features. For example, there is no way to represent a normal distribution with it, and lots of problems involve normal distributions. In such cases, one may still use the ideas behind the TML paper but would have to create their own implementation that includes further features needed for the problem at hand. This is a general problem that applies to lots and lots of academic ideas. Only a few become really useful and the basis of practical systems. Most of them exert influence at the level of ideas only.

When true positives are rare

Suppose you're trying to use machine learning for a classification task like, let's say, looking at photographs of animals and distinguishing horses from zebras. This task would seem to be within the state of the art.
But if you take a bunch of labelled photographs and throw them at something like a neural network or support vector machine, what happens in practice is that zebras are so much rarer than horses that the system just ends up learning to say 'always a horse' because this is actually the way to minimize its error.
Minimal error that may be, but it's also not a very useful result. What is the recommended way to tell the system 'I want the best guess at which photographs are zebras, even if this creates some false positives'? There doesn't seem to be a lot of discussion of this problem.
One of the things I usually do with imbalanced classes (or skewed data sets) is simply generate more data; I think this is the best approach. You could go out into the real world and gather more data for the rare class (e.g. find more pictures of zebras), or you could generate more data by simply making copies or duplicating it with transformations (e.g. flipping horizontally).
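For the horizontal-flip idea, a minimal sketch with Pillow (the file name is hypothetical):
from PIL import Image
img = Image.open('zebra_001.jpg')  # a minority-class (zebra) photo
flipped = img.transpose(Image.Transpose.FLIP_LEFT_RIGHT)  # mirror it horizontally
flipped.save('zebra_001_flipped.jpg')  # one extra zebra example for training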
You could also pick a classifier that uses an alternate evaluation (performance) metric over the one usually used, accuracy: look at precision, recall, and the F1 score (a short sketch follows the links below).
Week 6 of Andrew Ng's ML course talks about this topic: link
Here is another good web page I found on handling imbalanced classes: link
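For the metrics point, a minimal sketch with scikit-learn, using made-up labels (0 = horse, 1 = zebra):
from sklearn.metrics import classification_report
y_true = [0, 0, 0, 0, 1, 1]  # toy ground truth
y_pred = [0, 0, 0, 0, 1, 0]  # toy predictions from a horse-biased model
print(classification_report(y_true, y_pred, target_names=['horse', 'zebra']))
The precision, recall, and F1 rows for the zebra class expose exactly the failure that overall accuracy hides.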
With this type of unbalanced data problem, it is a good approach to learn the patterns associated with each class, as opposed to simply comparing classes; this can be done via unsupervised learning first (such as with autoencoders). A good article on this is available at https://www.r-bloggers.com/autoencoders-and-anomaly-detection-with-machine-learning-in-fraud-analytics/amp/. Another suggestion: after running the classifier, the confusion matrix can be used to determine where additional data should be pursued (i.e. many zebra errors).
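A minimal sketch of that confusion-matrix check, with the same toy labels as above:
from sklearn.metrics import confusion_matrix
y_true = [0, 0, 0, 0, 1, 1]  # 0 = horse, 1 = zebra
y_pred = [0, 0, 0, 0, 1, 0]
print(confusion_matrix(y_true, y_pred, labels=[0, 1]))  # rows: true class, columns: predicted
A large count in the zebra row / horse column means zebras are being missed, which is where extra data would help most.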

How to build target variable for supervised machine learning project

I am quite new to machine learning, with little experience, though I have done some projects.
Now I have a project related to insurance. I have databases about clients, which I will merge to get all the available information about them, and one database for the claims. I need to build a model that identifies how risky a client is, based on ranks.
My question: I need to build a target variable that ranks the clients by how risky they are, based on the claims. I could take different strategies to do that, but I am confused about how I will deal with the following:
- Shall I do a specific type of analysis before building the ranks, such as clustering, or do I need a strong theoretical assumption matching the project provider's vision?
- If I use some variables in the claims database to build the ranks, how shall I deal with them later? In other words, shall I remove them from the final training set to avoid correlation with the target variable, or can I treat them differently and keep them?
- If I keep them, is there a special treatment for them depending on whether they are categorical or continuous variables?
Every machine learning project's starting point is EDA. First create some features, like how often a client files bad claims or how many they file, then do some EDA to find which features are more useful. Secondly, the problem looks like classification; clustering is usually harder to evaluate.
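A minimal pandas sketch of that feature/target-building step, assuming a hypothetical claims table with client_id and claim_amount columns:
import pandas as pd
per_client = claims.groupby('client_id').agg(
    n_claims=('claim_amount', 'size'),      # how many claims each client filed
    total_claimed=('claim_amount', 'sum'),  # how much they claimed in total
)
# bin clients into four risk ranks by claim count; this becomes the target variable
per_client['risk_rank'] = pd.qcut(per_client['n_claims'], q=4, labels=False, duplicates='drop')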
In data science, when you build a business model, EDA (exploratory data analysis) plays a major role; it includes data cleaning, feature engineering, and filtering the data. As for how to build the target variable, it all depends on the attributes you have and which model you want to apply, say linear regression, logistic regression, or a decision tree. Most importantly, you need to find the impactful variables, that is, the core relation between the output and the given inputs, and give priority accordingly. Also, attributes which add no value must be removed, as they would contribute to overfitting.
You can do clustering too, and interestingly, any unsupervised learning problem can be converted into a form of supervised learning. You can try logistic regression or linear regression, etc., and find out which model fits your project best.

Improving K Means on some data sets

Does anyone have an idea how a simple k-means algorithm could be tuned to handle data sets of this form?
The most direct way to handle data of that form while still using k-means is to use a kernelized version of k-means. Two implementations of it exist in the JSAT library (see here: https://github.com/EdwardRaff/JSAT/blob/67fe66db3955da9f4192bb8f7823d2aa6662fc6f/JSAT/src/jsat/clustering/kmeans/ElkanKernelKMeans.java).
As Nicholas said, another option is to create a new feature space in which you run k-means. However, this takes some prior knowledge of what kind of data you will be clustering.
After that, you really just need to move to a different algorithm. k-means is a simple algorithm that makes simple assumptions about the world, and when those assumptions are too strongly violated (non-linearly-separable clusters being one of them), you just have to accept that and pick a more appropriate algorithm.
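As one concrete 'more appropriate algorithm', scikit-learn's SpectralClustering behaves much like kernelized k-means; a sketch on synthetic ring-shaped data:
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
labels = SpectralClustering(n_clusters=2, affinity='rbf', gamma=10.0).fit_predict(X)
# unlike plain k-means, this separates the inner ring from the outer one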
One possible solution to this problem is to add another dimension to your data set along which there is a split between the two classes.
Obviously this is not applicable in many cases, but if you have applied some sort of dimensionality reduction to your data, then it may be something worth investigating.
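A minimal sketch of that extra-dimension idea on synthetic concentric-ring data: the distance from the origin becomes a third feature along which the rings split, so plain k-means can separate them.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
r = np.linalg.norm(X, axis=1, keepdims=True)  # radius: the added dimension
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(np.hstack([X, r]))
# with the radius appended, the two rings fall into separate clusters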
