I have data that resembles students' grades, which follow a normal distribution from 50 to 100 with a mean of 80. I would like to do regression to predict this. Since my data is imbalanced (it follows a normal distribution), will this be a problem for regression analysis, or does it not matter? Thanks!
I have a semester project where I have to detect phishing websites using ML. I have been using a binary SVM classifier, trained on an existing dataset, to predict whether a website is legitimate or not. The problem is that SVMs are computationally expensive to train and sensitive to noisy data, so there is a high chance of overfitting. Is there any other classification model that would help optimize my model?
I did a similar project in my engineering days; I used a Naive Bayes (NB) classifier.
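If it helps, here is a minimal sketch of swapping the SVM for a Naive Bayes classifier in scikit-learn; the synthetic X and y are just stand-ins for whatever phishing dataset you use:

```python
# Minimal sketch: swapping an SVM for a Naive Bayes classifier in scikit-learn.
# The synthetic data below stands in for a real phishing dataset
# (binary labels: 1 = phishing, 0 = legitimate).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = GaussianNB()         # assumes roughly Gaussian, continuous features
clf.fit(X_train, y_train)  # training is fast compared to an SVM

print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```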
I have been reading about probability distributions lately and am confused: what actually is the difference between a probability distribution and a data distribution, or are they the same? Also, what is the importance of probability distributions in machine learning?
Thanks
A data distribution is a function or a listing that shows all the possible values (or intervals) of the data. This can help you decide whether the set of data that you have is good enough to apply your techniques to. You want to avoid skewed data.
A probability distribution is a statistical function that describes all the possible values, and their likelihoods, that a random variable can take within a given range. This helps you decide what type of statistical methods you can apply to your data. Example: if your data follows a Gaussian distribution, then you already know what values 1 standard deviation away from the mean look like, and what the probability is of observing a value more than 1 standard deviation away.
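As a quick illustration with scipy (the mean of 80 and standard deviation of 5 are made-up numbers):

```python
# Sketch: what a Gaussian probability distribution tells you about your data.
from scipy.stats import norm

mu, sigma = 80, 5  # hypothetical mean and standard deviation

# Probability of a value falling within 1 standard deviation of the mean (~0.683)
p_within_1sd = norm.cdf(mu + sigma, mu, sigma) - norm.cdf(mu - sigma, mu, sigma)

# Probability of observing a value more than 1 standard deviation away (~0.317)
p_beyond_1sd = 1 - p_within_1sd

print(p_within_1sd, p_beyond_1sd)
```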
NOTE: You may want to learn about how hypothesis testing is done for ML models.
In a classification problem, we care about the distribution of the labels in the train and validation sets. In sklearn, there is a stratify option in train_test_split to ensure that the distribution of the labels in the train and validation sets is similar.
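For example, something like this (the synthetic dataset is just a stand-in):

```python
# Sketch: stratified split for a classification problem, so the label
# distribution is similar in the train and validation sets.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# stratify=y keeps the label proportions similar in both splits
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```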
In a regression problem, let's say we want to predict the housing price based on a bunch of features. Do we need to care about the distribution of the housing price in the train and validation sets?
If yes, how do we achieve this in sklearn?
Forcing features to have similar distributions in your training and validation sets assumes a high degree of trust that the data you have is representative of the data you will encounter in real life (i.e., in a production environment), which is often not the case.
Also, doing so may artificially inflate your validation score compared to your test score.
Instead of adjusting feature distributions in the train and validation sets, I would suggest performing cross-validation (available in sklearn), which may be more representative of a testing situation.
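A minimal sketch of what I mean (the linear model and synthetic data are placeholders):

```python
# Sketch: K-fold cross-validation instead of a single fixed validation split.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Stand-in regression dataset
X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=0)

# 5-fold CV: each fold serves once as the validation set,
# averaging out the luck of any single split
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores.mean(), scores.std())
```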
This book (A. Géron, Hands-On Machine Learning with Scikit-Learn & TensorFlow, O'Reilly, 2017) provides an excellent introductory discussion of this in chapter 2. To paraphrase:
Generally, for large datasets you don't need to perform stratified sampling: your training set should be a fair representation of the range of observed instances (there are, of course, exceptions to this). For smaller datasets, random sampling could introduce sampling bias (i.e., disproportionately recording data from only a particular region of the expected range of target attributes), and stratified sampling is probably required.
Practically, you will need to create a new categorical feature by binning the continuous variable. You can then perform stratified sampling on this categorical feature. Make sure to remove the new categorical feature before training your model!
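A minimal sketch of that procedure, using a made-up housing dataset and binning the continuous target (price) into five bins:

```python
# Sketch: stratified train/validation split on a binned continuous variable.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.normal(50, 10, 1000),          # made-up feature
    "price": rng.normal(300_000, 75_000, 1000),  # made-up continuous target
})

# Bin the continuous variable into a temporary categorical column
df["price_cat"] = pd.cut(df["price"], bins=5, labels=False)

train, val = train_test_split(
    df, test_size=0.2, stratify=df["price_cat"], random_state=0
)

# Remove the helper column before training
train = train.drop(columns="price_cat")
val = val.drop(columns="price_cat")
```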
However, to do this you will need a good understanding of your features. I doubt there is much point in performing stratified sampling on features with weak predictive power; it could even do harm if you introduce some unintentional bias into the data by performing non-random sampling.
Take home message:
My instinct is that stratified sampling of a continuous variable should always be led by information and understanding. I.e., if you know a feature is a strong predictor of the target variable, and you also know that the sampling across its values is not uniform, you probably want to perform stratified sampling to make sure the range of values is properly represented in both the training and validation sets.
I used Logistic Regression as a classifier. I have six features, and I want to know which features influence the result more than the others. I used Information Gain, but it seems that it doesn't depend on the classifier used. Is there any method to rank the features according to their importance based on a specific classifier (like Logistic Regression)?
Any help would be highly appreciated.
You could use a Random Forest classifier to give you a ranking of your features. You could then select the top x features from this ranking and use them for logistic regression, although Random Forest would work perfectly fine on its own as well.
Check out variable importance at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
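A minimal sketch of that approach (the synthetic six-feature dataset is a stand-in for your data):

```python
# Sketch: ranking features with a Random Forest's impurity-based importances.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X, y)

# Higher value = feature contributed more to reducing impurity across trees
for i, imp in sorted(enumerate(rf.feature_importances_), key=lambda t: -t[1]):
    print(f"feature_{i}: {imp:.3f}")
```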
One way to do this is by null hypothesis significance testing. Basically, for each feature, you test for evidence that the coefficient of that feature is nonzero. Most statistical software reports the results of these tests by default in the model summary (Scikit-learn and other machine-learning oriented tools tend to not do so). With a small number of features, you can use this information and stepwise regression to rank the importance of the features.
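For illustration, a minimal sketch using statsmodels, which reports these tests in its model summary (the six-feature dataset is made up):

```python
# Sketch: coefficient significance tests for logistic regression via statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))  # made-up six-feature matrix
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(model.summary())  # coefficients with z-scores and p-values per feature
print(model.pvalues)    # smaller p-value = stronger evidence the coefficient != 0
```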
I am using a Naive Bayes classifier for sentiment analysis on customer support. Unfortunately, I don't have a huge annotated dataset in the customer support domain, but I do have a small amount of annotated data in that domain (around 100 positive and 100 negative examples). I have the Amazon product review dataset as well.
Is there any way I can implement a weighted Naive Bayes classifier using Mahout, so that I can give more weight to the small set of customer support data and less weight to the Amazon product review data? I would guess that training on such a weighted dataset would drastically improve accuracy. Kindly help me with the same.
One really simple approach is oversampling, i.e., just repeating the customer support examples in your training data multiple times.
Though it's not the same problem, you might get some further ideas by looking into the approaches used for class imbalance, in particular oversampling (as mentioned) and undersampling.
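A minimal sketch of the oversampling idea with pandas (the dataset contents and the repetition factor are assumptions to tune yourself):

```python
# Sketch: upweighting a small in-domain set by repeating it in the training data.
import pandas as pd

# Placeholders for the two labeled datasets
support_df = pd.DataFrame({"text": ["..."], "label": [1]})  # small, in-domain
amazon_df = pd.DataFrame({"text": ["..."], "label": [0]})   # large, out-of-domain

REPEAT = 10  # assumed repetition factor; tune on held-out customer support data

train_df = pd.concat([amazon_df] + [support_df] * REPEAT, ignore_index=True)
train_df = train_df.sample(frac=1, random_state=0)  # shuffle before training
```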