I use logistic regression. We know that it is a supervised method and needs calculated feature values in both the training and test data. There are six features. Although the functions that produce these features’ values are different and each can reach a maximum of 1, four of the features (in both training and test data) have very low values: they range between 0 and 0.1 and never reach 1, or even exceed 0.1. Thus these features’ values are very close to each other. The other features are distributed normally (they range between 0 and 0.9). So the difference between these two kinds of features is large, and I think this causes trouble in the learning process for logistic regression. Am I right? Do these features need any transforming/normalizing? Any help would be highly appreciated.
In short: you should normalize your features prior to training. Typically, each feature is either rescaled to some fixed range (like [0,1]) or whitened (mean 0 and std 1).
Why is it important? To make "small" features matter, LR needs very large weights in those dimensions. However, you will probably use regularized LR (typically L2 regularized), in which case it is very hard to assign large values to these weights, because the regularization penalty pushes the model toward more evenly distributed weights instead; hence the normalization. However, if you fit LR without any regularization, there is no point in scaling (up to numerical errors), as LR does not depend on the choice of scaling (the predictions should be exactly the same, with the weights rescaled accordingly).
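To make the regularization argument concrete, here is a minimal sketch in plain Python. The weight value, the 0.1 scale factor, and the helper name `l2_penalty` are illustrative assumptions, not values taken from the question.

```python
import math

def l2_penalty(weights, lam=1.0):
    """L2 regularization term: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)

# Hypothetical weight the model would like for a feature spanning [0, 1].
w_wide = 2.0
# The same feature squeezed into [0, 0.1] (assumed scale factor).
scale = 0.1
# To contribute equally to the logit, the weight must grow by 1 / scale.
w_narrow = w_wide / scale

x = 0.5  # a feature value on the wide scale
same_contribution = math.isclose(w_wide * x, w_narrow * (x * scale))

# The penalty grows with the square of the weight, i.e. by 1 / scale**2.
penalty_ratio = l2_penalty([w_narrow]) / l2_penalty([w_wide])
print(same_contribution, penalty_ratio)
```

So a feature that is ten times smaller needs a weight ten times larger to have the same effect, and that weight costs roughly a hundred times more under an L2 penalty. Without a penalty, nothing stops the optimizer from simply choosing the larger weight.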
Semi-supervised learning for a regression task with tabular data? I have about 20K labelled data points and 200K unlabelled data points.
Models often get better with semi-supervised learning, e.g. self-learning. On top of that, if you have just 20K labelled points and make predictions on 200K, you assume that the variance and distribution of the features remain the same, which is not always true, so you can make predictions with a sizeable error even if your model scored a mean R2 of 90 in K-fold CV. Hope that explains why I am searching for a semi-supervised approach for regression!
A key assumption for most semi-supervised learning (SSL) is that nearby points (e.g. between an unlabelled and labelled point) are likely to share the same label. It seems you're expecting the variance and distribution of your unlabelled set to be different to your labelled set which may violate the above assumption.
If you still wish to go ahead with SSL, then there are no conventional techniques for regression. But...
The literature on less conventional methods has been reviewed: https://www.researchgate.net/publication/325798856_Semi-supervised_regression_A_recent_review
And...
You could always convert to a classification approach (helps if the labels are relatively uniformly distributed). For instance, just choose the median prediction value in your labelled set, then everything less than that is 0 and everything greater is 1. You can then apply a conventional classification SSL technique, such as label propagation. Then, you get the predicted probabilities on a test set (e.g. use predict_proba()) and scale these back to your regression range. For example, if the minimum value in your labelled set was 0 and the max was 10, then just multiply the predicted probabilities by 10.
Hope the above at least provides food for thought.
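The binarize-then-propagate idea above can be sketched with scikit-learn's `LabelPropagation`. This is only an illustration under stated assumptions: the synthetic one-feature data, the `gamma` value, and all variable names are made up for the example, and mapping probabilities back onto the target range is a crude approximation rather than a calibrated regression.

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

rng = np.random.default_rng(0)

# Synthetic stand-in data: one feature, target roughly in [0, 10].
X_lab = rng.uniform(0, 1, size=(40, 1))
y_lab = 10 * X_lab[:, 0]
X_unl = rng.uniform(0, 1, size=(200, 1))

# Step 1: binarize the regression target at the median of the labelled set.
threshold = np.median(y_lab)
y_bin = (y_lab > threshold).astype(int)

# Step 2: stack labelled and unlabelled rows; -1 marks "no label".
X_all = np.vstack([X_lab, X_unl])
y_all = np.concatenate([y_bin, -np.ones(len(X_unl), dtype=int)])

model = LabelPropagation(kernel="rbf", gamma=20, max_iter=1000)
model.fit(X_all, y_all)

# Step 3: probability of the "above median" class on new points.
X_test = np.array([[0.05], [0.95]])
proba = model.predict_proba(X_test)[:, 1]

# Step 4: scale probabilities back onto the labelled target range.
y_min, y_max = y_lab.min(), y_lab.max()
y_pred = y_min + proba * (y_max - y_min)
print(y_pred)
```

When the labelled minimum is 0 and the maximum is 10, the last step reduces to multiplying the probabilities by 10, as described above.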
I am a beginner in machine learning, so any help or suggestion would be appreciated.
I have read that putting weights on features and then predicting is a very bad idea. But what if a few features need to be weighted?
In a classification problem, let's say it is commonly accepted that age is the most influential feature. How do I give weight to this feature? I was thinking of normalizing it to a variance of 1.5 or 2 (with the other features at variance 1); I believe this would give the feature more weight. Is this fundamentally wrong? If so, is there another method?
Does this affect classification and regression problems differently?
If we are talking specifically about random forests (as you tagged), then you can use the Weighted Subspace Random Forest algorithm (the wsrf package in R). The algorithm determines a weight for each variable and then uses these weights during model building.
The informativeness of a variable with respect to the class is
measured by an information gain ratio. The measure is used as the
probability of that variable being selected for inclusion in the
variable subspace when splitting a specific node during the tree
building process. Therefore, variables with higher values by the
measure are more likely to be chosen as candidates during variable
selection and a stronger tree can be built.
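As a rough illustration of the quoted idea, the sketch below computes an information gain ratio per categorical variable and normalizes the ratios into selection probabilities. It is not the wsrf implementation; the function names and the toy data are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, divided by split info."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0

def selection_probabilities(features, labels):
    """Normalize gain ratios so they can be used as sampling probabilities."""
    ratios = {name: gain_ratio(col, labels) for name, col in features.items()}
    total = sum(ratios.values())
    if total == 0:  # no variable is informative: fall back to uniform
        return {name: 1 / len(ratios) for name in ratios}
    return {name: r / total for name, r in ratios.items()}

# Toy data: one variable determines the class, the other is unrelated to it.
labels = [0, 0, 1, 1]
features = {
    "informative": ["a", "a", "b", "b"],
    "noise": ["x", "y", "x", "y"],
}
probs = selection_probabilities(features, labels)
print(probs)
```

A tree builder could then draw the candidate variables for each split according to `probs` instead of uniformly at random.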
Generally, if a feature is more important than the other features, and the model is dense enough with enough training samples, your model will automatically give it more importance by optimizing its weight matrices to account for that: the partial derivatives in backpropagation calculate the change for each connection, so the model learns by itself to give more importance to that feature. If you don't normalize the feature but instead scale it up, you might overstate its importance.
In practice a neural network works best if the inputs are centered and white. That means that their covariance is diagonal (ideally the identity matrix) and the mean is the zero vector. This improves optimization of the neural net, since the hidden activation functions don't saturate that fast and thus do not give you near-zero gradients early on in learning.
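A minimal NumPy sketch of centering and whitening is given below. It uses PCA whitening, which is one of several possible choices; the `whiten` helper, the `eps` value, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def whiten(X, eps=1e-8):
    """Center X and rotate/rescale it so its covariance is close to identity."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Divide each eigenvector (column) by the square root of its eigenvalue.
    W = eigvecs / np.sqrt(eigvals + eps)
    return X_centered @ W

# Synthetic correlated inputs with a non-zero mean.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0], [1.5, 0.5]])
raw = rng.normal(size=(500, 2)) @ mixing + np.array([10.0, -3.0])

white = whiten(raw)
print(white.mean(axis=0))
print(np.cov(white, rowvar=False))
```

In a real pipeline the mean and the whitening matrix would be estimated on the training set only and then reused for validation and test inputs.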
If you do scale just one feature up by a small factor, it may or may not have the desired effect, but saturated gradients are the more likely outcome, so we avoid it.
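The saturation effect can be seen with a single sigmoid unit. In this sketch the weight and the two input values are arbitrary, chosen only to contrast a normalized input with one that has been scaled up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid with respect to its pre-activation z."""
    s = sigmoid(z)
    return s * (1.0 - s)

weight = 1.0          # hypothetical initial weight
x_normalized = 0.5    # input on a normalized scale
x_scaled_up = 20.0    # the same kind of input left on a much larger scale

grad_normalized = sigmoid_grad(weight * x_normalized)
grad_scaled_up = sigmoid_grad(weight * x_scaled_up)
print(grad_normalized, grad_scaled_up)
```

The gradient for the large input is many orders of magnitude smaller, so the weights feeding that unit barely move during early training.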
I have a binary logistic regression model (0/1), built over binary features. The feature coefficients are usually in the range (-1, 1). After training, can I use the feature coefficients as a proxy for the 'importance' of a feature? If the coefficient is < 0, does that mean the presence of the feature is a negative for the class (i.e., reduces the probability of the output being 1)?
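Right; a negative coefficient means that the feature contra-indicates that class. The magnitude is, indeed, a measure of relative importance, at least when the features share a scale, as binary features do. Note that the coefficients are changes in log-odds and are not bounded, so -1 and +1 are not hard limits: a coefficient of +1 multiplies the odds of the class by e (about 2.72) when the feature is present, and a coefficient of -1 divides them by e.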
You absolutely can. In fact, this idea of importance or 'blame' is a central concept behind how many machine learning algorithms are trained. The coefficients change many times during the training process through gradient descent. How much each weight is updated is determined by that weight's contribution to the cost.
That is, the more a weight is to be blamed for a high cost the more extreme the update will be. Therefore, more extreme values (high positives or low negatives) are an indication of how impactful the respective feature is when the model is making its decision.
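To see how the sign and size of a coefficient translate into probabilities, here is a small plain-Python sketch. The intercept and coefficients are hypothetical values, not taken from any fitted model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(intercept, coefs, x):
    """P(y = 1 | x) for a logistic regression with the given parameters."""
    z = intercept + sum(w * xi for w, xi in zip(coefs, x))
    return sigmoid(z)

intercept = 0.0
coefs = [0.8, -0.8]  # hypothetical: feature 0 supports class 1, feature 1 opposes it

p_base = predict_proba(intercept, coefs, [0, 0])  # neither feature present
p_pos = predict_proba(intercept, coefs, [1, 0])   # only the positive feature
p_neg = predict_proba(intercept, coefs, [0, 1])   # only the negative feature

# exp(coefficient) is the factor by which the odds change when the feature is on.
odds_ratios = [math.exp(w) for w in coefs]
print(p_base, p_pos, p_neg, odds_ratios)
```

Comparing magnitudes this way is only meaningful when the features are on the same scale, which holds for binary features.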
Currently I am doing English alphabet classification using the SVM classifier in OpenCV.
I have the following questions about this:
How does the length of the feature vector affect the classification?
(What will happen if the feature length increases? My current feature length is 125.)
Does the time taken for prediction depend on the amount of data used for training?
Why do we need to normalize the feature vector (will this improve the accuracy of prediction and the time required to predict the class)?
How do I determine the best method for normalizing the feature vector?
1) Length of features does not matter per se, what matters is predictive quality of features
2) No, it does not depend on number of samples, but it depends on number of features (prediction is generally very fast)
3) Normalization is required if features are in very different ranges of values
4) There are basically standardization (mean, stdev) and scaling (xmax -> +1, xmin -> -1 or 0) - you could do both and see which one is better
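The two options from point 4 can be sketched in plain Python as follows. The helper names and the sample values are assumptions for illustration.

```python
from statistics import mean, pstdev

def standardize(values):
    """Shift to mean 0 and rescale to standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # constant feature: nothing to rescale
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def min_max_scale(values, low=-1.0, high=1.0):
    """Map the observed minimum to `low` and the observed maximum to `high`."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [low for _ in values]
    return [low + (v - vmin) * (high - low) / (vmax - vmin) for v in values]

feature = [0.0, 0.02, 0.05, 0.1]  # a feature with a narrow raw range

z_scores = standardize(feature)
scaled = min_max_scale(feature)            # to [-1, 1]
scaled_01 = min_max_scale(feature, 0.0, 1.0)  # to [0, 1]
print(z_scores, scaled, scaled_01)
```

Whichever variant is used, the statistics (mean, stdev, min, max) should be computed on the training data and then applied unchanged to new samples at prediction time.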
When talking about classification, the data consists of feature vectors with a number of features. In image processing there are also image features, which are mapped to classification feature vectors. So your "feature length" is actually the number of features, or the feature vector size.
1) The number of features matters. In principle, more features allow better classification but also lead to overtraining (overfitting). To avoid the latter you can add more samples (more feature vectors).
2) Yes, as the prediction time depends on the number of support vectors and the size of the support vectors. But as prediction is very fast, this is not an issue unless you have real-time requirements.
3) While SVM as a maximum margin classifier is quite robust against different feature value ranges, a feature with a bigger value range would have more weight than one with a smaller range. This especially applies to the penalty calculation if the classes are not completely separable.
4) As SVM is quite robust against different value ranges (compared to cluster-oriented algorithms), this is not the biggest issue. Typically the absolute min/max are scaled to -1/+1. If you know the expected range of your data, you could scale by that range instead, so that measurement errors in your data would not influence the scaling. A fixed range is also preferable when adding training data in an iterative process.
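Point 2 follows from the form of the kernel SVM decision function, which sums one kernel evaluation per support vector. The sketch below uses an RBF kernel with made-up support vectors, coefficients, and `gamma`; it is not OpenCV's implementation.

```python
import math

def rbf_kernel(a, b, gamma):
    """exp(-gamma * squared Euclidean distance); cost grows with vector length."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def svm_decision(x, support_vectors, dual_coefs, bias, gamma=0.5):
    """One kernel evaluation per support vector: cost ~ n_sv * n_features."""
    total = sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    )
    return total + bias

# Hypothetical trained model: one support vector per class.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
dual_coefs = [-1.0, 1.0]  # sign encodes the class of each support vector
bias = 0.0

score_near_positive = svm_decision([0.9, 0.9], support_vectors, dual_coefs, bias)
score_near_negative = svm_decision([0.1, 0.1], support_vectors, dual_coefs, bias)
print(score_near_positive, score_near_negative)
```

More training data only slows prediction down to the extent that it produces more support vectors. The squared distance inside the kernel also shows why a feature with a much larger range would dominate if left unscaled.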
This is a question about linear regression with ngrams, using Tf-IDF (term frequency - inverse document frequency). To do this, I am using numpy sparse matrices and sklearn for linear regression.
I have 53 cases and over 6000 features when using unigrams. The predictions are based on cross validation using LeaveOneOut.
When I create a tf-idf sparse matrix of only unigram scores, I get slightly better predictions than when I create a tf-idf sparse matrix of unigram+bigram scores. The more columns I add to the matrix (columns for trigram, quadgram, quintgrams, etc.), the less accurate the regression prediction.
Is this common? How is this possible? I would have thought that the more features, the better.
It's not common for bigrams to perform worse than unigrams, but there are situations where it may happen. In particular, adding extra features may lead to overfitting. Tf-idf is unlikely to alleviate this, as longer n-grams will be rarer, leading to higher idf values.
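The point about idf can be checked on a toy corpus. The sketch uses the textbook definition idf = log(N / df), which differs from the smoothed variant some libraries apply; the documents and helper names are assumptions for illustration.

```python
import math

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def idf(term, docs):
    """Textbook inverse document frequency: log(N / document frequency)."""
    n = len(term.split())
    df = sum(1 for doc in docs if term in ngrams(doc.lower().split(), n))
    if df == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(docs) / df)

docs = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the log",
]

idf_unigram = idf("the", docs)            # occurs in every document
idf_common_bigram = idf("the cat", docs)  # occurs in two documents
idf_rare_bigram = idf("cat sat", docs)    # occurs in one document
print(idf_unigram, idf_common_bigram, idf_rare_bigram)
```

Longer n-grams occur in fewer documents, so they receive larger idf weights, which amplifies rather than dampens rare and possibly spurious features.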
I'm not sure what kind of variable you're trying to predict, and I've never done regression on text, but here are some comparable results from the literature to get you thinking:
In random text generation with small (but non-trivial) training sets, 7-grams tend to reconstruct the input text almost verbatim, i.e. cause complete overfit, while trigrams are more likely to generate "new" but still somewhat grammatical/recognizable text (see Jurafsky & Martin; can't remember which chapter and I don't have my copy handy).
In classification-style NLP tasks performed with kernel machines, quadratic kernels tend to fare better than cubic ones because the latter often overfit on the training set. Note that unigram+bigram features can be thought of as a subset of the quadratic kernel's feature space, and {1,2,3}-grams of that of the cubic kernel.
Exactly what is happening depends on your training set; it might simply be too small.
As larsmans said, adding more variables / features makes it easier for the model to overfit and hence lose test accuracy. In the master branch of scikit-learn there is now a min_df parameter to cut off any feature that appears in fewer than that number of documents. Hence setting min_df somewhere between 2 and 5 might help you get rid of spurious bi-grams.
Alternatively you can use L1 or L1 + L2 penalized linear regression (or classification) using one of the following classes:
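What a minimum document frequency cut-off does can be illustrated without any library. This sketch mimics the effect only; it is not scikit-learn's implementation, and the helper names and toy documents are assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def document_frequencies(docs, max_n=2):
    """Count, for each n-gram up to max_n, how many documents contain it."""
    df = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        seen = set()
        for n in range(1, max_n + 1):
            seen.update(ngrams(tokens, n))
        df.update(seen)  # each document counts at most once per term
    return df

def prune_vocabulary(docs, min_df=2, max_n=2):
    """Keep only the n-grams that occur in at least min_df documents."""
    df = document_frequencies(docs, max_n)
    return sorted(term for term, count in df.items() if count >= min_df)

docs = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the log",
]

full_vocab = prune_vocabulary(docs, min_df=1)
pruned_vocab = prune_vocabulary(docs, min_df=2)
print(len(full_vocab), len(pruned_vocab), pruned_vocab)
```

Even on three short documents, most bigrams occur only once and disappear, which is the kind of pruning that helps with 53 cases and thousands of columns.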
sklearn.linear_model.Lasso (regression)
sklearn.linear_model.ElasticNet (regression)
sklearn.linear_model.SGDRegressor (regression) with penalty='elasticnet' or 'l1'
sklearn.linear_model.SGDClassifier (classification) with penalty='elasticnet' or 'l1'
This will make it possible to ignore spurious features and lead to a sparse model with many zero weights for noisy features. Grid Searching the regularization parameters will be very important though.
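A minimal sketch of L1-penalized regression on unigram+bigram tf-idf features, with a grid search over the regularization strength, might look like this. The documents, target values, and alpha grid are made up for illustration, and the scoring choice (negative mean squared error, since R2 is undefined on a single held-out sample) is an assumption.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.pipeline import make_pipeline

# Toy corpus with a made-up numeric target per document.
docs = [
    "good great film", "great acting good plot",
    "bad boring film", "boring plot bad acting",
    "good film great plot", "bad film boring acting",
    "great good acting", "boring bad plot",
]
y = np.array([4.5, 4.0, 1.0, 1.5, 5.0, 1.0, 4.0, 0.5])

# Vectorizer inside the pipeline so it is refit on each training fold.
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    Lasso(max_iter=10000),
)

alphas = [0.001, 0.01, 0.1]
search = GridSearchCV(
    pipeline,
    param_grid={"lasso__alpha": alphas},
    cv=LeaveOneOut(),
    scoring="neg_mean_squared_error",
)
search.fit(docs, y)

coef = search.best_estimator_.named_steps["lasso"].coef_
n_nonzero = int(np.count_nonzero(coef))
print(search.best_params_, n_nonzero, coef.size)
```

Inspecting how many coefficients remain non-zero for the selected alpha gives a quick sense of how many n-gram columns the model actually relies on.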
You can also try univariate feature selection, as done in the text classification example of scikit-learn (check the SelectKBest and chi2 utilities).