I want to try all regression algorithms on my dataset and choose the best one. I decided to start with linear regression, but I get an error.
I tried scaling the data, but that gives another error.
Here is my code:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
train_df = pd.read_csv('train.csv', index_col='ID')
train_df.head()
target = 'Result'
X = train_df.drop(target, axis=1)
y = train_df[target]
# Trying to scale instead gives an even worse error
#ss = StandardScaler()
#df_scaled = pd.DataFrame(ss.fit_transform(train_df),columns = train_df.columns)
#X = df_scaled.drop(target, axis=1)
#y = df_scaled[target]
model = LogisticRegression()
model.fit(X, y)
# Output echoed by the notebook for model.fit(X, y):
# LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
#                    intercept_scaling=1, l1_ratio=None, max_iter=10000,
#                    multi_class='auto', n_jobs=None, penalty='l2',
#                    random_state=None, solver='lbfgs', tol=0.0001, verbose=10,
#                    warm_start=False)
print(X.iloc[10])
print(model.predict([X.iloc[10]]))
print(y[10])
Here is the error:
ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
A 0
B -19
C -19
D -19
E 0
F -19
Name: 10, dtype: int64
[0]
-19
And here is a sample of the dataset:
ID,A,B,C,D,E,F,Result
0,-18,18,18,-2,-12,-3,-19
1,-19,-8,0,18,18,1,0
2,0,-11,18,0,-19,18,18
3,18,-15,-12,18,-11,-4,-17
4,-17,18,-11,-17,-18,-19,18
5,18,-14,-19,-14,-15,-19,18
6,18,-17,18,18,18,-2,-1
7,-1,-11,0,18,18,18,18
8,18,-19,-18,-19,-19,18,18
9,18,18,0,0,18,18,0
10,0,-19,-19,-19,0,-19,-19
11,-19,0,-19,18,-19,-19,-6
12,-6,18,0,0,0,18,-15
13,-15,-19,-6,-19,-19,0,0
14,0,-15,0,18,18,-19,18
15,18,-19,18,-8,18,-2,-4
16,-4,-4,18,-19,18,18,18
17,18,0,18,-4,-10,0,18
18,18,0,18,18,18,18,-19
What am I doing wrong?
You're using LogisticRegression, which is a linear model for classification, i.e. for categorical dependent variables.
This is not necessarily wrong, as you might intend to do so, but it means you need sufficient data per category and enough iterations for the model to converge (which, as the warning points out, it hasn't).
I suspect, however, that what you intended to use is LinearRegression (used for continuous dependent variables) from scikit-learn.
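A minimal sketch of that switch, reusing the X and y already built in the question (whether a regressor is really the right choice depends on what Result represents):
from sklearn.linear_model import LinearRegression

# Ordinary least-squares fit on the same features and target
reg = LinearRegression()
reg.fit(X, y)

# Predict for a single row; double brackets keep the input two-dimensional
print(reg.predict(X.iloc[[10]]))
print(y.iloc[10])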
I am using LightGBM for a machine learning task. I found that when I use the sklearn API cross_val_score with cv=10, the time cost is less than a single-fold fit: I split the dataset with train_test_split and then fit an LGBMClassifier on the training set, and the latter costs much more time than the former. Why?
Sorry for my bad English.
Environment: Python 3.5, scikit-learn 0.20.3, lightgbm 2.2.3
Intel Xeon CPU E5-2650 v4
Memory 128GB
import numpy as np
from sklearn.model_selection import train_test_split

# train_df and test_df are loaded earlier (not shown)
X = train_df.drop(['uId', 'age'], axis=1)
Y = train_df.loc[:, 'age']
X_test = test_df.drop(['uId'], axis=1)
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.1,
                                                  stratify=Y)
# (1809000, 12) (1809000,) (201000, 12) (201000,) (502500, 12)
print(X_train.shape, Y_train.shape, X_val.shape, Y_val.shape, X_test.shape)
from lightgbm import LGBMClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
import time
lgb = LGBMClassifier(n_jobs=-1)
tic = time.time()
scores = cross_val_score(lgb, X, Y,
scoring='accuracy', cv=10, n_jobs=-1)
toc = time.time()
# 0.3738402985074627 0.0009231167322574765 300.1487271785736
print(np.mean(scores), np.std(scores), toc-tic)
tic = time.time()
lgb.fit(X_train, Y_train)
toc = time.time()
# 0.3751492537313433 472.1763586997986 (is much more than 300)
print(accuracy_score(Y_val, lgb.predict(X_val)), toc-tic)
Sorry, I found the answer. The LightGBM documentation says: 'for the best speed, set this to the number of real CPU cores, not the number of threads'. On a CPU with hyper-threading, n_jobs=-1 uses all logical threads, so setting n_jobs=-1 is not the best choice.
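A small sketch of that change, reusing X_train/Y_train from above; the core count is illustrative (the E5-2650 v4 in the question has 12 physical cores per socket, so adjust to your machine):
import time
from lightgbm import LGBMClassifier

# Use the number of physical CPU cores instead of all logical threads
lgb = LGBMClassifier(n_jobs=12)

tic = time.time()
lgb.fit(X_train, Y_train)
print('single fit time:', time.time() - tic)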
I'm currently using RandomForestRegressor for the Titanic competition (Kaggle).
%%timeit
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# X, y, X_test and passengerId are prepared earlier (not shown)
model = RandomForestRegressor(n_estimators=200, oob_score=False, n_jobs=1, random_state=42)
model.fit(X, y)
#y_oob = model.oob_prediction_
#print("c-stat:", roc_auc_score(y,model.oob_prediction_))
prediction_regression = model.predict(X_test)
# dataframe with predictions
kaggle = pd.DataFrame({'PassengerId': passengerId, 'Survived': prediction_regression})
# save to csv
kaggle.to_csv('./csvToday/prediction_regression.csv', index=False)
but it does not return 0 or 1; it gives decimal values:
892: 0.3163
893: 0.07 and so on
How do I make RandomForestRegressor return 0 or 1?
Regression is the machine learning problem of predicting a quantity/amount/price (such as market stock prediction, home price prediction, etc.). As far as I remember, the goal of the Titanic competition is to predict whether a passenger survived. That sounds like a binary classification problem, and for a classification problem you should use RandomForestClassifier (docs).
So your code would look like:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(
#some parameters
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
submit_df = pd.DataFrame({'PassengerId': passengerId, 'Survived': y_pred})
submit_df.to_csv('./csvToday/submission.csv', index=False)
This kernel can provide you with some more insights.
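If you still want scores between 0 and 1 rather than hard labels, the classifier can also output class probabilities; a small sketch on top of the model fitted above (the second column corresponds to the Survived == 1 class):
# Probability of survival per passenger, as floats in [0, 1]
survival_proba = model.predict_proba(X_test)[:, 1]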
I am trying to compare multiple classifiers on a dataset that I have. To get accurate accuracy scores for the classifiers, I am now performing 10-fold cross-validation for each classifier. This goes well for all of them except SVM (both linear and rbf kernels). The data is loaded like this:
import pandas as pd

dataset = pd.read_csv("data/distance_annotated_indels.txt", delimiter="\t", header=None)
X = dataset.iloc[:, [5,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]].values
y = dataset.iloc[:, 4].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Cross-validation for, for example, a random forest works fine:
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.metrics import classification_report

start = time.time()
classifier = RandomForestClassifier(n_estimators = 100, criterion = 'entropy')
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
cv = ShuffleSplit(n_splits=10, test_size=0.2)
scores = cross_val_score(classifier, X, y, cv=10)
print(classification_report(y_test, y_pred))
print("Random Forest accuracy after 10 fold CV: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2) + ", " + str(round(time.time() - start, 3)) + "s")
Output:
precision recall f1-score support
0 0.97 0.95 0.96 3427
1 0.95 0.97 0.96 3417
avg / total 0.96 0.96 0.96 6844
Random Forest accuracy after 10 fold CV: 0.92 (+/- 0.06), 90.842s
However, for SVM this process takes ages (I waited for 2 hours, still nothing). The sklearn website does not make me any wiser. Is there something I should be doing differently for SVM classifiers? The SVM code is as follows:
from sklearn.svm import SVC

start = time.time()
classifier = SVC(kernel = 'linear')
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
scores = cross_val_score(classifier, X, y, cv=10)
print(classification_report(y_test, y_pred))
print("Linear SVM accuracy after 10 fold CV: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2) + ", " + str(round(time.time() - start, 3)) + "s")
If you have a lot of samples, the computational complexity of the problem gets in the way; see Training complexity of Linear SVM.
Consider playing with the verbose flag of cross_val_score to see more logs about progress. Also, with n_jobs set to a value > 1 (or even using all CPUs with n_jobs set to -1, if memory allows) you could speed up computation via parallelization. http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html can be useful to evaluate these options.
If performance is poor, I'd consider reducing the value of cv (see https://stats.stackexchange.com/questions/27730/choice-of-k-in-k-fold-cross-validation for a discussion on this).
You can also control the running time by changing max_iter. If it is set to -1 (the default) the solver can run indefinitely, depending on the solution space; set some integer value, say 10000, as a stopping criterion.
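Putting those suggestions together, a hedged sketch for the cross-validation call from the question (the max_iter and cv values here are illustrative, not tuned):
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Cap the solver iterations, run the folds in parallel, and log progress
classifier = SVC(kernel='linear', max_iter=10000)
scores = cross_val_score(classifier, X, y, cv=5, n_jobs=-1, verbose=2)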
Alternatively, you can try an optimized SVM implementation, for example scikit-learn-intelex - https://github.com/intel/scikit-learn-intelex
First install the package:
pip install scikit-learn-intelex
and then add to your Python script:
from sklearnex import patch_sklearn
patch_sklearn()
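Note that, per the scikit-learn-intelex documentation, the patch only affects estimators imported after patch_sklearn() is called; a minimal ordering sketch, reusing X and y from the question:
from sklearnex import patch_sklearn
patch_sklearn()  # patch first

from sklearn.svm import SVC  # imported after patching, so it uses the accelerated implementation
from sklearn.model_selection import cross_val_score

scores = cross_val_score(SVC(kernel='linear'), X, y, cv=10, n_jobs=-1)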
So elastic net is supposed to be a hybrid between ridge regression (L2 regularization) and lasso (L1 regularization). However, it seems that even if l1_ratio is 0 I don't get the same result as ridge. I know ridge uses gradient descent and elastic net uses coordinate descent, but the optima should be the same, no? Moreover, I've found that elastic net often throws ConvergenceWarnings for no obvious reason, while lasso and ridge don't. Here's a snippet:
from sklearn.datasets import load_boston
from sklearn.utils import shuffle
from sklearn.linear_model import ElasticNet, Ridge, Lasso
from sklearn.model_selection import train_test_split
data = load_boston()
X, y = shuffle(data.data, data.target, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=43)
alpha = 1
en = ElasticNet(alpha=alpha, l1_ratio=0)
en.fit(X_train, y_train)
print('en train score: ', en.score(X_train, y_train))
rr = Ridge(alpha=alpha)
rr.fit(X_train, y_train)
print('rr train score: ', rr.score(X_train, y_train))
lr = Lasso(alpha=alpha)
lr.fit(X_train, y_train)
print('lr train score: ', lr.score(X_train, y_train))
print('---')
print('en test score: ', en.score(X_test, y_test))
print('rr test score: ', rr.score(X_test, y_test))
print('lr test score: ', lr.score(X_test, y_test))
print('---')
print('en coef: ', en.coef_)
print('rr coef: ', rr.coef_)
print('lr coef: ', lr.coef_)
Even though l1_ratio is 0, the train and test scores of elastic net are close to the lasso scores (and not to ridge, as you would expect). Moreover, elastic net seems to throw a ConvergenceWarning even if I increase max_iter (even up to 1000000 there seems to be no effect) or tol (0.1 still throws the warning, but 0.2 doesn't). Increasing alpha (as the warning suggests) also has no effect.
Based on the answer by @sascha, one can match the results between the two models:
import sklearn
print(sklearn.__version__)
from sklearn.linear_model import Ridge, ElasticNet
from sklearn.datasets import load_boston
dataset = load_boston()
X = dataset.data
y = dataset.target
f = Ridge(alpha=1,
fit_intercept=True, normalize=False,
copy_X=True, max_iter=1000, tol=1e-4, random_state=42,
solver='auto')
g = ElasticNet(alpha=1/X.shape[0], l1_ratio=1e-16,
fit_intercept=True, normalize=False,
copy_X=True, max_iter=1000, tol=1e-4, random_state=42,
precompute=False, warm_start=False,
positive=False, selection='cyclic')
f.fit(X, y)
g.fit(X, y)
print(abs(f.coef_ - g.coef_) / abs(f.coef_))
Output:
0.19.2
[1.19195623e-14 1.17076625e-15 3.25973465e-13 1.61694280e-14
4.77274767e-15 4.15332538e-15 6.15640568e-14 1.61772832e-15
4.56125088e-14 5.44320605e-14 8.99189018e-15 2.31213025e-15
3.74181954e-15]
Just read the docs; then you will find out that neither of these uses gradient descent and, more importantly, that they minimize different objectives:
Ridge: ||y - Xw||^2_2 + alpha * ||w||^2_2
Elastic Net: 1 / (2 * n_samples) * ||y - Xw||^2_2 + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||^2_2
which shows, when substituting alpha=1 and l1_ratio=0, that:
ElasticNet has one more sample-dependent factor (1 / (2 * n_samples)) on top of the loss, which is not found in Ridge
ElasticNet has one more 1/2 factor in the l2 term
This is also why the other answer rescales alpha to 1/X.shape[0] to make the two solutions coincide.
Why the different formulations? Probably because sklearn follows the canonical/original R-based implementation, glmnet.
Furthermore, I would not be surprised to see numerical issues when doing mixed-norm optimization while forcing a non-mixed norm like l1_ratio=0, especially when there are specialized solvers for both non-mixed optimization problems.
Luckily, sklearn even has something to say about it:
Currently, l1_ratio <= 0.01 is not reliable, unless you supply your own sequence of alpha.