I know that feature scaling is always a requirement for clustering algorithms. Currently I am implementing hierarchal clustering on this dataset, I will use only the annual income and the spending rate features. Now I am confused of whether to use feature scaling here, this is because both features approximately have the same scale, where the scale for annual income is [15-137] and the scale for spending rate is [1-100]. The two features have approximately same scale so do I still need to feature scale them ?
Related
I'm currently using PCA to do handwritten digits recognition for MNIST database (each digit has about 1000 observations and 784 features). One thing I have found confusing is that the accuracy is the highest when it has 40 PCs. If the number of PCs grows from this point, the accuracy starts to drop continuously.
From my understanding of PCA, I thought the more components I have, the better I can describe a dataset. Why does the accuracy becomes less if I have too many PCs?
In order to identify the optimum number of components, you need to plot the elbow curve
https://en.wikipedia.org/wiki/Elbow_method_(clustering)
The idea behind PCA is to reduce the dimensionality of the data by finding the principal components.
Lastly, I do not think that PCA can overfit the data as it is not a learning/ fitting algorithm.
You are just trying to project the data based on eigen-vectors to capture most of the variance along an axis.
This video should help: https://www.youtube.com/watch?v=_UVHneBUBW0
I am very much new to Machine Learning.
And I am trying to apply ML on data containing nearly 50 features. Some features have range from 0 to 1000000 and some have range from 0 to 100 or even less than that. Now when I use feature scaling by using MinMaxScaler for range (0,1) I think features having large range scales down to very small values and this might affect me to give good predictions.
I would like to know if there is some efficient way to do scaling so that all the features are scaled appropriately.
I also tried standared scaler but accuracy did not improve.
Also Can I use different scaling function for some features and another for remaining features.
Thanks in advance!
Feature scaling, or data normalization, is an important part of training a machine learning model. It is generally recommended that the same scaling approach is used for all features. If the scales for different features are wildly different, this can have a knock-on effect on your ability to learn (depending on what methods you're using to do it). By ensuring standardized feature values, all features are implicitly weighted equally in their representation.
Two common methods of normalization are:
Rescaling (also known as min-max normalization):
where x is an original value, and x' is the normalized value. For example, suppose that we have the students' weight data, and the students' weights span [160 pounds, 200 pounds]. To rescale this data, we first subtract 160 from each student's weight and divide the result by 40 (the difference between the maximum and minimum weights).
Mean normalization
where x is an original value, and x' is the normalized value.
I know that feature selection helps me remove features that may have low contribution. I know that PCA helps reduce possibly correlated features into one, reducing the dimensions. I know that normalization transforms features to the same scale.
But is there a recommended order to do these three steps? Logically I would think that I should weed out bad features by feature selection first, followed by normalizing them, and finally use PCA to reduce dimensions and make the features as independent from each other as possible.
Is this logic correct?
Bonus question - are there any more things to do (preprocess or transform)
to the features before feeding them into the estimator?
If I were doing a classifier of some sort I would personally use this order
Normalization
PCA
Feature Selection
Normalization: You would do normalization first to get data into reasonable bounds. If you have data (x,y) and the range of x is from -1000 to +1000 and y is from -1 to +1 You can see any distance metric would automatically say a change in y is less significant than a change in X. we don't know that is the case yet. So we want to normalize our data.
PCA: Uses the eigenvalue decomposition of data to find an orthogonal basis set that describes the variance in data points. If you have 4 characteristics, PCA can show you that only 2 characteristics really differentiate data points which brings us to the last step
Feature Selection: once you have a coordinate space that better describes your data you can select which features are salient.Typically you'd use the largest eigenvalues(EVs) and their corresponding eigenvectors from PCA for your representation. Since larger EVs mean there is more variance in that data direction, you can get more granularity in isolating features. This is a good method to reduce number of dimensions of your problem.
of course this could change from problem to problem, but that is simply a generic guide.
Generally speaking, Normalization is needed before PCA.
The key to the problem is the order of feature selection, and it's depends on the method of feature selection.
A simple feature selection is to see whether the variance or standard deviation of the feature is small. If these values are relatively small, this feature may not help the classifier. But if you do normalization before you do this, the standard deviation and variance will become smaller (generally less than 1), which will result in very small differences in std or var between the different features.If you use zero-mean normalization, the mean of all the features will equal 0 and std equals 1.At this point, it might be bad to do normalization before feature selection
Feature selection is flexible, and there are many ways to select features. The order of feature selection should be chosen according to the actual situation
Good answers here. One point needs to be highlighted. PCA is a form of dimensionality reduction. It will find a lower dimensional linear subspace that approximates the data well. When the axes of this subspace align with the features that one started with, it will lead to interpretable feature selection as well. Otherwise, feature selection after PCA, will lead to features that are linear combinations of the original set of features and they are difficult to interpret based on the original set of features.
I am using Word2Vec with a dataset of roughly 11,000,000 tokens looking to do both word similarity (as part of synonym extraction for a downstream task) but I don't have a good sense of how many dimensions I should use with Word2Vec. Does anyone have a good heuristic for the range of dimensions to consider based on the number of tokens/sentences?
Typical interval is between 100-300. I would say you need at least 50D to achieve lowest accuracy. If you pick lesser number of dimensions, you will start to lose properties of high dimensional spaces. If training time is not a big deal for your application, i would stick with 200D dimensions as it gives nice features. Extreme accuracy can be obtained with 300D. After 300D word features won't improve dramatically, and training will be extremely slow.
I do not know theoretical explanation and strict bounds of dimension selection in high dimensional spaces (and there might not a application-independent explanation for that), but I would refer you to Pennington et. al, Figure2a where x axis shows vector dimension and y axis shows the accuracy obtained. That should provide empirical justification to above argument.
I think that the number of dimensions from word2vec depends on your application. The most empirical value is about 100. Then it can perform well.
The number of dimensions reflects the over/under fitting. 100-300 dimensions is the common knowledge. Start with one number and check the accuracy of your testing set versus training set. The bigger the dimension size the easier it will be overfit on the training set and had bad performance on the test. Tuning this parameter is required in case you have high accuracy on training set and low accuracy on the testing set, this means that the dimension size is too big and reducing it might solve the overfitting problem of your model.
Say in the document classification domain, if I'm having a dataset of 1000 instances but the instances (documents) are rather of small content; and I'm having another dataset of say 200 instances but each individual instance with richer content. If IDF is out of my concern, will the number of instances really matter in training? Do classification algorithms sort of take that into account?
Thanks.
sam
You could pose this as a general machine learning problem. The simplest problem that can help you understand how the size of training data matters is curve fitting.
The uncertainty and bias of a classifier or a fitted model are functions of the sample size. Small sample size is a well known problem which we often try to avoid by collecting more training samples. This is because the uncertainty estimation of non-linear classifiers is estimated by a linear approximation of the model. And this estimation is accurate only if a large number samples are available as the main condition of the central limit theorem.
The proportion of outliers is also an important factor you should consider when deciding on the training sample size. If a larger sample size means a greater proportion of outliers then should limit the sample size.
The document size is actually is an indirect indicator of feature space size. If for example from each document you have got only 10 features then you're trying to separate/classify the documents in a 10-dimensional space. If you have got 100 features in each document then the same is happening in a 100-dimensional space. I guess it's easy for you to see drawing lines that separate the documents in a higher dimension is easier.
For both document size and sample size the rule of thumb is go to as high as possible but in practice this is not possible. And for example, if you estimate the uncertainty function of the classifier then you find a threshold that sample sizes higher than that lead to virtually no reduction of uncertainty and bias. Empirically you can also find this threshold for some problems by Monte Carlo simulation.
Most engineers don't bother to estimate uncertainty and that often leads to sub-optimal behavior of the methods they implement. This is fine for toy problems but in real-world problems considering uncertainty of estimations and computation is vital for most systems. I hope that answers your questions to some degree.