When are features independent enough to use a NB classifier? - machine-learning

I am working with classification models and as I am new to it I have a question. It is said that Naive Bayes performs well when features are independent of each other. How do I know if features in my feature set are independent? Any example? Thanks!!

Independence of Features
In most cases people want to check whether one feature is highly correlated with another (or even repeated), so that one of them can be omitted. A correlation of 1 means that you don't lose any information if one of the correlated features is omitted. There are multiple ways to check correlation, e.g. in Python np.corrcoef, pd.DataFrame.corr and scipy.stats.pearsonr.
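For instance, a quick pairwise check might look like this (a toy sketch; the data and column names are made up for illustration):
import numpy as np
import pandas as pd
# toy features: x3 is an exact copy of x1, so they are perfectly correlated
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x1})
print(df.corr())                 # pairwise Pearson correlations
print(np.corrcoef(df.values.T))  # same thing via NumPy (rows = variables, hence the transpose)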
But things can be more complicated.
Features are independent of each other if you can't use features x_1, ..., x_n to predict feature x_{n+1}. In most cases one might check whether the features are linearly dependent, meaning:
x_{n+1} = a_1 * x_1 + ... + a_n * x_n + error
If this is the case (and the error contribution is small), one might drop the dependent feature. Note that you can then omit any one of the n+1 features, since the equation can be rearranged to put any x_i on the left-hand side.
To check this, one might compute the singular values of the data matrix (or the eigenvalues of the covariance matrix) and look for values close to zero.
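A rough sketch of that check, assuming X is a NumPy array of shape (n_samples, n_features) and picking an arbitrary column to predict from the others:
import numpy as np
# try to predict column j from all the other columns via least squares
j = 0
others = np.delete(X, j, axis=1)
coef, residuals, rank, _ = np.linalg.lstsq(others, X[:, j], rcond=None)
# a residual sum of squares near zero (relative to the column's total sum of squares)
# flags a (nearly) linearly dependent feature
print(residuals, X.shape[0] * np.var(X[:, j]))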
Removing dependent Features
from sklearn import datasets
import numpy as np
from sklearn import decomposition
from sklearn import naive_bayes
from sklearn import model_selection
X, y = datasets.make_classification(n_samples=10000, n_features=10, n_repeated=0, n_informative=6, n_redundant=4, n_classes=2)
u, s, vh = np.linalg.svd(X, full_matrices=False)  # full_matrices=False avoids building a huge 10000x10000 u
#display s
s
array([8.06415389e+02, 6.69591201e+02, 4.31329281e+02, 4.02622029e+02,
2.85447317e+02, 2.53360358e+02, 4.07459972e-13, 2.55851809e-13,
1.72445591e-13, 6.68493846e-14])
So basically, 4 of the features are redundant. We can now use a feature-reduction technique such as Principal Component Analysis or Linear Discriminant Analysis to reduce the data to only 6 features.
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
gnb = naive_bayes.GaussianNB()
gnb.fit(X_train, y_train)
gnb.score(X_test, y_test) #results in 0.7216
Now we reduce the features to 6.
pca = decomposition.PCA(n_components=6)
X_trafo = pca.fit_transform(X)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X_trafo, y)
gnb.fit(X_train, y_train)
gnb.score(X_test, y_test) #results in 0.7216
Note that the values don't need to be exactly the same.
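One caveat about the snippet above: the PCA is fit on the full X before the train/test split, so information from the test rows leaks into the transformation. A Pipeline keeps the reduction inside the training data (a minimal sketch reusing the imports above):
from sklearn import pipeline
model = pipeline.make_pipeline(decomposition.PCA(n_components=6), naive_bayes.GaussianNB())
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
model.fit(X_train, y_train)   # PCA is now fit on the training split only
print(model.score(X_test, y_test))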

Related

How to compare baseline and GridSearchCV results fair?

I am a bit confused about comparing the best GridSearchCV model with the baseline.
For example, suppose we have a classification problem.
As a baseline, we'll fit a model with default settings (let it be logistic regression):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
baseline = LogisticRegression()
baseline.fit(X_train, y_train)
pred = baseline.predict(X_train)
print(accuracy_score(y_train, pred))
So, the baseline gives us accuracy using the whole train sample.
Next, GridSearchCV:
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold, train_test_split
X_val, X_test_val, y_val, y_test_val = train_test_split(X_train, y_train, test_size=0.3, random_state=42)
cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
parameters = [ ... ]
best_model = GridSearchCV(LogisticRegression(), parameters, scoring='accuracy', cv=cv)
best_model.fit(X_val, y_val)
print(best_model.best_score_)
Here, we have accuracy based on validation sample.
My questions are:
Are those accuracy scores comparable? Generally, is it fair to compare GridSearchCV and model without any cross validation?
For the baseline, isn't it better to use Validation sample too (instead of the whole Train sample)?
No, they aren't comparable.
Your baseline model was fit on X_train, and then you score that same fitted model on X_train. This is essentially cheating: the model will look better than it really is because it is being evaluated on data it has already seen.
The grid searched model is at a disadvantage because:
It's working with less data since you have split the X_train sample.
Compound that with the fact that it's getting trained with even less data due to the 5 folds (it's training with only 4/5 of X_val per fold).
So your score for the grid search is going to be worse than your baseline.
Now you might ask, "so what's the point of best_model.best_score_?" Well, that score is used to compare all the models tried while searching for the optimal hyperparameters in your search space, but it should in no way be used to compare against a model that was trained outside of the grid-search context.
So how should one go about conducting a fair comparison?
Split your training data for both models.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Fit your models using X_train.
# fit baseline
baseline.fit(X_train, y_train)
# fit using grid search
best_model.fit(X_train, y_train)
Evaluate models against X_test.
# baseline
baseline_pred = baseline.predict(X_test)
print(accuracy_score(y_test, baseline_pred))
# grid search
grid_pred = best_model.predict(X_test)
print(accuracy_score(y_test, grid_pred))
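Putting the pieces together might look like this (a minimal sketch; the parameter grid is a placeholder you would replace with your own):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
parameters = {'C': [0.01, 0.1, 1, 10]}  # placeholder grid
baseline = LogisticRegression()
best_model = GridSearchCV(LogisticRegression(), parameters, scoring='accuracy', cv=cv)
baseline.fit(X_train, y_train)
best_model.fit(X_train, y_train)
# both models are evaluated on data neither has seen during fitting
print(accuracy_score(y_test, baseline.predict(X_test)))
print(accuracy_score(y_test, best_model.predict(X_test)))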

Inspection of trees in a Quantile Random Forest Regression model

I am interested in training a random forest to learn some conditional quantile on some data {X, y} sampled independently from some distribution.
That is, for some $$\alpha \in (0, 1)$$, a mapping $$\hat{q}_{\alpha}(x) \in [0, 1]$$ such that, for each $$x$$, $$\hat{q}_{\alpha}(x) = \operatorname{argmin}_{q} \{\, q : P(y < q \mid X = x) > \alpha \,\}.$$
Is there any clear way to build a random forest effectively in python that could yield such a model?
Additionally, I have one added requirement that may be possible with current libraries, though I am unsure. Requirement: I would like to select a subset of points, A, from my training set and, when making predictions, exclude from my random forest those trees that were trained on points in A.
There is a Python-based, scikit-learn compatible/compliant Quantile Regression Forest implementation that can be used to estimate conditional quantiles here: https://github.com/zillow/quantile-forest
Your additional requirement of making predictions on training samples by excluding trees that included those samples during training is called out-of-bag (OOB) estimation, and can also be done with the above package.
Setup should be as easy as:
pip install quantile-forest
Then, here's an example of how to fit a quantile random forest model and use it to predict quantiles with OOB estimation for a subset (here the first 100 rows) of the training data:
import numpy as np
from quantile_forest import RandomForestQuantileRegressor
from sklearn import datasets
from sklearn.model_selection import train_test_split
X, y = datasets.fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
qrf = RandomForestQuantileRegressor()
qrf.fit(X_train, y_train)
# Predict OOB quantiles for first 100 training samples.
y_pred_oob = qrf.predict(
    X_train[:100, :],
    quantiles=[0.025, 0.5, 0.975],
    oob_score=True,
    indices=np.arange(100),
)
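Assuming the package returns one column per requested quantile (in the order given), the lower, median, and upper estimates for those 100 samples can then be unpacked as:
# columns correspond to the quantiles [0.025, 0.5, 0.975], in order
y_lower, y_median, y_upper = y_pred_oob[:, 0], y_pred_oob[:, 1], y_pred_oob[:, 2]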

sklearn cross valid / cross predict

I understand that cross_val_predict / cross_val_score trains n out-of-fold models and then aggregates them to produce the final prediction. This happens during the training phase. Now I want to use the fitted models to predict the test data. I could use a for loop to collect predictions on the test data and aggregate them, but first I want to ask whether there is a built-in sklearn method for this?
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_predict, train_test_split
diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
lasso = linear_model.Lasso()
y_train_hat = cross_val_predict(lasso, X_train, y_train, cv=3)
y_test_hat = do_something(lasso, X_test)  # <-- is there a built-in for this step?
Thanks
The 3 models from your cross_val_predict are not saved anywhere, so you can't make predictions with them. You can instead use cross_validate with return_estimator=True. You'll still be left with three models that you'll have to manually use to make and aggregate predictions. (You could in principle put those models into an ensemble model like VotingClassifier, or VotingRegressor for a regressor like Lasso, but at least for now there is no prefit argument to prevent refitting your estimators. There is some discussion in Issue 7382 and links from there.)
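A minimal sketch of that approach, averaging the three fold models' predictions on the test set by hand (simple averaging is just one aggregation choice):
import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_validate, train_test_split
diabetes = datasets.load_diabetes()
X, y = diabetes.data[:150], diabetes.target[:150]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# keep the three fitted fold models via return_estimator=True
cv_results = cross_validate(linear_model.Lasso(), X_train, y_train, cv=3, return_estimator=True)
# aggregate their predictions on the held-out test data
y_test_hat = np.mean([est.predict(X_test) for est in cv_results['estimator']], axis=0)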

Linear regression gives worse results after normalization or standardization

I'm performing linear regression on this dataset:
archive.ics.uci.edu/ml/datasets/online+news+popularity
It contains various types of features - rates, binary, numbers etc.
I've tried using scikit-learn's Normalizer, StandardScaler and PowerTransformer, but they've all resulted in worse results than without using them.
I'm using them like this:
from sklearn.preprocessing import StandardScaler
X = df.drop(columns=['url', 'shares'])
Y = df['shares']
transformer = StandardScaler().fit(X)
X_scaled = transformer.transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
perform_linear_and_ridge_regression(X=X_scaled, Y=Y)
The function on the last line, perform_linear_and_ridge_regression(), is definitely correct and uses GridSearchCV to determine the best hyperparameters.
Just to make sure I include the function as well:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import median_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
def perform_linear_and_ridge_regression(X, Y):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=10)
    lin_reg_parameters = { 'fit_intercept': [True, False] }
    lin_reg = GridSearchCV(LinearRegression(), lin_reg_parameters, cv=5)
    lin_reg.fit(X=X_train, y=Y_train)
    Y_pred = lin_reg.predict(X_test)
    print('Linear regression MAE =', median_absolute_error(Y_test, Y_pred))
The results are surprising as all of them provide worse results:
Linear reg. on original data: MAE = 1620.510555135375
Linear reg. after using Normalizer: MAE = 1979.8525218964242
Linear reg. after using StandardScaler: MAE = 2915.024521207241
Linear reg. after using PowerTransformer: MAE = 1663.7148884463259
Is this just a special case, where Standardization doesn't help, or am I doing something wrong?
EDIT: Even when I leave the binary features out, most of the transformers give worse results.
Your dataset has many categorical and ordinal features. You should handle those separately first. Also, it seems you are applying normalization to the categorical variables too, which is completely wrong.
Here is a nice link which explains how to handle categorical features for a regression problem.
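One way to scale only the numeric columns and leave the binary/categorical ones untouched is a ColumnTransformer; this is only a sketch, and numeric_cols is a placeholder for your own list of continuous columns:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
numeric_cols = [...]  # placeholder: the names of the continuous columns in X
preprocess = ColumnTransformer(
    [('num', StandardScaler(), numeric_cols)],
    remainder='passthrough',  # binary/categorical columns are passed through unchanged
)
X_scaled = preprocess.fit_transform(X)
perform_linear_and_ridge_regression(X=X_scaled, Y=Y)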

sklearn multiclass svm function

I have multi class labels and want to compute the accuracy of my model.
I am kind of confused on which sklearn function I need to use.
As far as I understand, the code below is only suited for binary classification.
# dividing X, y into train and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# training a linear SVM classifier
from sklearn.svm import SVC
svm_model_linear = SVC(kernel = 'linear', C = 1).fit(X_train, y_train)
svm_predictions = svm_model_linear.predict(X_test)
# model accuracy for X_test
accuracy = svm_model_linear.score(X_test, y_test)
print(accuracy)
and as I understood from the link:
Which decision_function_shape for sklearn.svm.SVC when using OneVsRestClassifier?
for multiclass classification I should use OneVsRestClassifier with decision_function_shape (with ovr or ovo and check which one works better)
svm_model_linear = OneVsRestClassifier(SVC(kernel = 'linear',C = 1, decision_function_shape = 'ovr')).fit(X_train, y_train)
The main problem is that prediction time matters to me, but it takes about 1 minute to run the classifier and predict the data (on top of the time for feature reduction such as PCA, which also takes some time). Any suggestions to reduce the time for the multiclass SVM?
There are multiple things to consider here:
1) You see, OneVsRestClassifier will separate out all labels and train one SVM object per label on the given data, so each individual SVM object only ever sees binary data.
2) SVC internally uses libsvm (liblinear is used by LinearSVC), which follows an 'OvO' strategy for multi-class output. But this point is moot because of point 1: libsvm will only get binary data.
Even if it did get multi-class data, libsvm does not take decision_function_shape into account when fitting, so it does not matter whether you provide decision_function_shape = 'ovr' or decision_function_shape = 'ovo'.
So it seems that you are looking at the problem wrong. decision_function_shape should not affect the speed. Try standardizing your data before fitting. SVMs work well with standardized data.
When wrapping models with the OvR or OvO classifiers, you can set the n_jobs parameter to make them run faster, e.g. sklearn.multiclass.OneVsOneClassifier(estimator, n_jobs=-1) or sklearn.multiclass.OneVsRestClassifier(estimator, n_jobs=-1).
Although each single SVM classifier in sklearn can only use one CPU core at a time, the ensemble multiclass classifier can fit multiple models in parallel by setting n_jobs.
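A minimal sketch that combines both suggestions (standardize first, then parallelize the one-vs-rest wrapper); it is illustrative rather than tuned:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
# scale the features, then fit one linear SVC per class in parallel
clf = OneVsRestClassifier(make_pipeline(StandardScaler(), SVC(kernel='linear', C=1)), n_jobs=-1)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))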
