Randomly feature mapping - machine-learning

I'm taking a machine learning course. In the second part of exercise 2, we are supposed to use a feature map: new features are added by mapping the features into all polynomial terms of x1 and x2 up to the sixth power. However, my instructor told me I shouldn't use this algorithm and should instead add features randomly. But we add new features in order to classify better, so wouldn't adding features randomly make things more complicated? Can we add features randomly, or should we follow some rule?

Adding new features (e.g. polynomial terms of the existing features) helps reduce the training error by allowing a more complex hypothesis. But this may lead to overfitting the training data and to poor results on the test set.
So, when adding new features, the following should be considered:
1) Manually select which features to keep by analyzing the results.
2) Alternatively, use all the features together with regularization, which automatically gives less weight to features that contribute little to the target variable and more weight to those that contribute more.
3) Randomly selecting features may or may not help. You want the features that contribute most towards the target variable, so random selection may not be the appropriate solution.
Important Note
Always use a validation set to check the error during training.
While working with polynomial features, always check the learning curve to make sure the model is not overfitting the training data. If it is, try increasing the regularization parameter (lambda). Regularization helps reduce overfitting.
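For reference, the polynomial mapping described in the question can be sketched as follows. This is a minimal illustration, not the course's own code: the function name `map_feature` and the sample inputs are assumptions. With degree 6 it produces 28 terms, including the constant term.

```python
def map_feature(x1, x2, degree=6):
    """Return every term x1^(i-j) * x2^j for 0 <= j <= i <= degree."""
    terms = []
    for i in range(degree + 1):
        for j in range(i + 1):
            terms.append((x1 ** (i - j)) * (x2 ** j))
    return terms


# Hypothetical sample point, used only to show the shape of the output.
features = map_feature(2.0, 3.0)
print(len(features))  # number of polynomial terms
```

Because 2 raw features turn into 28, a model fitted on the mapped features has much more freedom to overfit, which is why the regularization advice above matters.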

Related

Is it a good idea to exclude noisy data from the dataset to train the model?

Would it be a good idea to exclude noisy data (which may reduce model accuracy or cause unexpected output on the testing dataset) from a dataset before generating the training and validation datasets?
Assumption: the noisy data is known to us in advance.
Any suggestion is deeply appreciated!
It depends on your application. If the noisy data is valid, then definitely include it to find the best model.
However, if the noisy data is invalid, then it should be cleaned out before fitting your model.
Noise is a broad term; it is better to think in terms of inliers and outliers instead.
Most outlier detection algorithms specify a threshold and sort the outlier candidates according to some score. In this case, you can choose to remove the most extreme values, for example those more than 3 standard deviations from the mean (assuming, of course, that your data set is roughly Gaussian-distributed).
So my suggestion is to base your judgement on two things:
1) Your business concept and logic about validity vs invalidity. For example: a house size, area or price cannot be a negative number.
2) Your mathematical / algorithmic logic. For example: detect extreme values based on some threshold to decide (together with or independently of point 1) whether an observation is valid.
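The two checks can be combined in a few lines. This is only a sketch under stated assumptions: the function names, the made-up price list, and the default of 3 standard deviations are illustrative, and the threshold only makes sense for roughly Gaussian data with enough observations.

```python
from statistics import mean, pstdev


def is_valid_price(price):
    # Business rule: a price cannot be negative.
    return price >= 0


def extreme_values(values, k=3.0):
    """Return the values lying more than k standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing can be extreme
    return [v for v in values if abs(v - mu) > k * sigma]


# Made-up prices: 20 ordinary values, one extreme value, one invalid value.
prices = [100 + (i % 5) for i in range(20)] + [1000, -50]

valid = [p for p in prices if is_valid_price(p)]  # drops -50
flagged = extreme_values(valid)                   # flags 1000
print(flagged)
```

Note that the invalid value is removed first, so it does not distort the mean and standard deviation used for the threshold check.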
Noisy data doesn't cause a huge problem by itself. The extremely noisy data (i.e. extreme values / outliers) is what you should really be concerned about!
Such points pull the hypothesis of your model towards themselves while fitting the data. Hence, results might be drastically shifted or incorrect.
Finally, you can look at PyOD, an open-source Python toolbox that implements many different outlier detection algorithms off the shelf. (You can choose more than one algorithm and create a voting pool to decide how extreme the observations are.)
You can use a multivariate Gaussian distribution for outlier detection in Python. It works well when the features are roughly Gaussian-distributed.

Beginners guide to troubleshooting badly performing models

I'm creating my first predictive model and its results are absolutely awful.
I need some help working out how to troubleshoot this.
I'm doing linear regression and logistic regression classification to predict whether a student will pass a course: 1 for yes, 0 for no.
The dataset is tiny, as we only have complete data for one class: 16 features and just under 60 rows, 35 passed and 25 failed.
I'm wondering if my dataset is simply too small.
I don't want to share the dataset just yet, but I will clean it up so it's completely anonymous.
The ROC curve is very, very jagged (for logistic regression), and the model predicts more false positives than anything else.
I'd appreciate some general troubleshooting advice for a beginner that I can try before we hire a professional.
Thanks for any help provided.
I'd suggest some tips:
In Azure ML there's a module called "Filter Based Feature Selection". You can use it to score your features and check whether they really have predictive power, or even to select just the ones with the highest scores.
If you haven't already, split the data into training and cross-validation sets, evaluate your model on both, and use the results to diagnose underfitting (high bias) or overfitting (high variance). Depending on the diagnosis, take actions like:
For overfitting: get more data, use fewer features, use a less complex model, add or increase regularization.
For underfitting: add more features, use a more complex model, decrease regularization.
And don't forget, before you start training, to explore and evaluate your data: use scatter plots to see whether it is indeed separable, and perform feature engineering and preprocessing. For this, ask yourself: given these features, would a human expert be able to make predictions? If the answer is no, transform or drop features until the answer is yes.
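The train/cross-validation diagnosis from the tips above can be written down as a simple rule of thumb. This is a rough sketch only: the function name and both thresholds (`acceptable_error`, `max_gap`) are hypothetical values chosen for illustration and would need to be set for the problem at hand.

```python
def diagnose(train_error, val_error, acceptable_error=0.15, max_gap=0.10):
    """Classify a model as underfitting, overfitting, or neither.

    acceptable_error and max_gap are illustrative thresholds, not standards.
    """
    if train_error > acceptable_error:
        # The model cannot even fit the data it was trained on.
        return "underfitting (high bias)"
    if val_error - train_error > max_gap:
        # Fits the training set well but does not generalize.
        return "overfitting (high variance)"
    return "no obvious problem"


print(diagnose(0.40, 0.42))
print(diagnose(0.02, 0.35))
print(diagnose(0.08, 0.12))
```

With fewer than 60 rows the validation error itself will be noisy, so a single split can be misleading; averaging over several splits gives a steadier picture.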

Why is dropping predictors based on low correlation between predictors and the dependent variable before performing cross validation incorrect?

Say I have predictors X1, X2, ..., Xn and a dependent variable Y.
I check correlation between the predictors and Y and drop predictors that have a low correlation with Y. Now I use cross validation between Y and the remaining predictors to train a logistic regression model.
What is wrong with this method?
There are many possible problems with doing so, and covering them all would make for a very lengthy answer. I'll just point out the two that I think are most important, with "buzzwords" you can use to look up anything that is still unclear:
Dropping features based on their feature-to-target correlation is essentially a form of feature filtering. It is important to understand that feature filtering does not necessarily improve predictive performance. Think e.g. of two features in an XOR configuration with the target variable, where only the two together allow for correct prediction of the target. The correlation of each feature with the target will be low, but dropping them might very well decrease your predictive performance. Besides feature filters there are feature wrappers, which fit a model on a subset of features and evaluate its predictive performance. So, in contrast to feature filters, which look only at the features and the target, feature wrappers look at the actual model performance. BTW: if you end up using feature filters based on correlation, you might want to discard not only features with low feature-target correlation but also features with high inter-feature correlation (because such features don't add much new information).
If you want to tune your feature selection (e.g. the amount of information/variance you want to preserve in your data, the number of features you want to keep, the amount of correlation you allow, etc.) and you do this outside of your cross-validation and resampling approach, you will likely end up with overly optimistic error estimates for your final model. This is because, by not including the selection in the CV process, you end up picking one "best" configuration whose performance was not estimated correctly (= independently) and which might have looked good just by chance. So, if you want to estimate the error correctly, you should include your feature selection in the CV process too.
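A standard illustration of features that are only informative jointly is XOR: each feature on its own has zero correlation with the target, yet the pair determines it exactly. The sketch below checks this on the four possible input combinations; the helper name `pearson` is just an illustrative choice.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# All four combinations of two binary features; the target is their XOR.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
x1 = [a for a, _ in rows]
x2 = [b for _, b in rows]
y = [a ^ b for a, b in rows]

corr_x1 = pearson(x1, y)
corr_x2 = pearson(x2, y)
print(corr_x1, corr_x2)
```

A correlation filter would drop both features here, even though a model that sees them together could predict the target perfectly.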

Predictive features with high presence in one class

I am doing a logistic regression to predict the outcome of a binary variable, say whether a journal paper gets accepted or not. The independent variables or predictors are all the phrases used in these papers (unigrams, bigrams, trigrams). One of these phrases has a skewed presence in the 'accepted' class. Including this phrase gives me a classifier with a very high accuracy (more than 90%), while removing it makes the accuracy drop to about 70%.
My more general (naive) machine learning question is:
Is it advisable to remove such skewed features when doing classification?
Is there a method to check skewed presence for every feature and then decide whether to keep it in the model or not?
If I understand correctly, you are asking whether a feature should be removed because it is a good predictor (it makes your classifier work better). The answer is short and simple: do not remove it. In fact, the whole point is to find exactly such features.
The only reason to remove such a feature would be that this phenomenon occurs only in the training set and not in real data. But in that case you have the wrong data, which does not represent the underlying data density, and you should gather better data or "clean" the current data so that it has characteristics analogous to the "real" data.
Based on your comments, it sounds like the feature in your documents that's highly predictive of the class is a near-tautology: "paper accepted on" correlates with accepted papers because at least some of the papers in your database were scraped from already-accepted papers and have been annotated by the authors as such.
To me, this sounds like a useless feature for trying to predict whether a paper will be accepted, because (I'd imagine) you're trying to predict paper acceptance before the actual acceptance has been issued! In such a case, none of the papers you'd like to test your algorithm with will be annotated with "paper accepted on." So, I'd remove it.
You also asked about how to determine whether a feature correlates strongly with one class. There are three things that come to mind for this problem.
First, you could just compute a basic frequency count for each feature in your dataset and compare those values across classes. This is probably not super informative, but it's easy.
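As a rough sketch of that frequency comparison, the snippet below computes, for each class, the fraction of documents containing each phrase. The function name and the tiny toy corpus are made up for illustration; real documents would first need to be split into n-grams.

```python
from collections import Counter


def phrase_rates_by_class(documents, labels):
    """Fraction of documents in each class that contain each phrase."""
    class_sizes = Counter(labels)
    counts = {label: Counter() for label in class_sizes}
    for phrases, label in zip(documents, labels):
        counts[label].update(set(phrases))  # count each phrase once per document
    return {
        label: {p: c / class_sizes[label] for p, c in counts[label].items()}
        for label in class_sizes
    }


# Toy corpus (made up): each document is a list of phrases.
documents = [
    ["paper accepted on", "we propose"],
    ["we propose"],
    ["we propose", "paper accepted on"],
    ["future work"],
]
labels = ["accepted", "rejected", "accepted", "rejected"]

rates = phrase_rates_by_class(documents, labels)
print(rates["accepted"]["paper accepted on"])
print(rates["rejected"].get("paper accepted on", 0.0))
```

A phrase whose rate is close to 1 in one class and close to 0 in the other is the kind of skewed feature the question describes.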
Second, since you're using a log-linear model, you can train your model on your training dataset, and then rank each feature in your model by its weight in the logistic regression parameter vector. Features with high positive weight are indicative of one class, while features with large negative weight are strongly indicative of the other.
Finally, just for the sake of completeness, I'll point out that you might also want to look into feature selection. There are many ways of selecting relevant features for a machine learning algorithm, but I think one of the most intuitive from your perspective might be greedy feature elimination. In such an approach, you train a classifier using all N features in your model, and measure the accuracy on some held-out validation set. Then, train N new models, each with N-1 features, such that each model eliminates one of the N features, and measure the resulting drop in accuracy. The feature with the biggest drop was probably strongly predictive of the class, while features that have no measurable difference can probably be omitted from your final model. As larsmans points out correctly in the comments below, this doesn't scale well at all, but it can be a useful method sometimes.
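The elimination step described above can be sketched like this. The `evaluate` callable stands in for "train a classifier on these features and return its accuracy on the held-out validation set"; the toy version used here, with fixed per-feature contributions, is purely hypothetical so that the example runs on its own.

```python
def ablation_drops(features, evaluate):
    """Accuracy drop caused by removing each feature in turn."""
    baseline = evaluate(features)
    drops = {}
    for feature in features:
        reduced = [f for f in features if f != feature]
        drops[feature] = baseline - evaluate(reduced)
    return drops


# Hypothetical stand-in for real training and validation: each feature
# adds a fixed amount of accuracy on top of a 0.5 baseline.
CONTRIBUTION = {"strong": 0.25, "medium": 0.125, "weak": 0.0}


def toy_evaluate(features):
    return 0.5 + sum(CONTRIBUTION[f] for f in features)


drops = ablation_drops(["strong", "medium", "weak"], toy_evaluate)
ranked = sorted(drops, key=drops.get, reverse=True)
print(ranked)
```

Each call to `evaluate` means a full training run in practice, so this costs N + 1 trainings for N features, which is the scaling problem mentioned above.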

Classification Algorithm which can take predefined weights for attributes as input

I have 20 attributes and one target feature. All the attributes are binary (present or not present) and the target feature is multinomial (5 classes).
But for each instance, apart from the presence of some attributes, I also know how much effect (on a scale of 1-5) each present attribute had on the target feature.
How do I make use of this extra information to build a classification model that predicts the classes of the test instances better?
Why not just use the weights as the features, instead of binary presence indicator? You can code the lack of presence as a 0 on the continuous scale.
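A minimal sketch of that encoding, assuming each instance is given as a mapping from attribute index to its effect score (the function name and the input format are illustrative assumptions, not something the question specifies):

```python
def encode_instance(effects, n_attributes=20):
    """Build a feature row: the effect score (1-5) if present, 0 if absent.

    effects maps attribute index -> effect score for the present attributes.
    """
    row = [0] * n_attributes
    for index, effect in effects.items():
        if not 1 <= effect <= 5:
            raise ValueError(f"effect score out of range: {effect}")
        row[index] = effect
    return row


# Hypothetical instance with 6 attributes: attributes 0 and 4 are present.
row = encode_instance({0: 3, 4: 5}, n_attributes=6)
print(row)
```

The resulting rows can be fed to any classifier that accepts numeric features in place of the 0/1 presence indicators.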
EDIT:
The classifier you choose will learn optimal weights on the features during training to separate the classes, so I don't believe there's anything better you can do if you do not have access to the weights for the test instances. Essentially, a linear classifier learns a rule of the form:
c_i = sgn(w . x_i)
You're saying you have access to weights, but without an example of what the data look like, and an explanation of where the weights come from, I'd have to say I don't see how you'd use them (or even why you'd want to---is standard classification with binary features not working well enough?)
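For concreteness, the rule c_i = sgn(w . x_i) amounts to the following. This is only a sketch: the weight vector is made up rather than learned, and mapping a score of exactly zero to +1 is an arbitrary convention chosen here.

```python
def predict(w, x):
    """Return +1 or -1 according to the sign of the dot product w . x."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score >= 0 else -1  # a score of 0 is mapped to +1 by convention


# Made-up weights; in practice the classifier learns w from the training data.
w = [2.0, -1.0, 0.5]
print(predict(w, [1, 0, 1]))
print(predict(w, [0, 1, 0]))
```

Since w is fitted to the training data, rescaling an input feature by a constant is largely absorbed by the corresponding learned weight in an unregularized linear model.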
This clearly depends on the actual algorithms that you are using.
For decision trees, the information is useless. They are meant to learn which attributes have how much effect.
Similarly, support vector machines will learn the best linear split, so any kind of weight will disappear since the SVM already learns this automatically.
However, if you are doing NN classification, just scale the attributes as desired, to emphasize differences in the influential attributes.
Sorry, you need to look at other algorithms yourself. There are just too many.
Use the knowledge as a prior over the feature weights. You can then compute the posterior estimate from the data to obtain the final model.
