I'm playing around with a new data set using XGBoost. The following is my code:
import xgboost as xgb
import pandas as pd
import numpy as np
train = pd.read_csv("train_users_processed_onehot.csv")
labels = train["Buy"].map({"Y":1, "N":0})
features = train.drop("Buy", axis=1)
data_dmat = xgb.DMatrix(data=features, label=labels)
params={"max_depth":5, "min_child_weight":2, "eta": 0.1, "subsamples":0.9, "colsample_bytree":0.8, "objective" : "binary:logistic", "eval_metric": "logloss", "seed": 2333}
rounds = 6000
result = xgb.cv(params=params, dtrain=data_dmat, num_boost_round=rounds, early_stopping_rounds=50, as_pandas=True, seed=2333)
print(result)
The result is (intermediate results omitted):
test-logloss-mean test-logloss-std train-logloss-mean
0 0.683354 0.000058 0.683206
165 0.622318 0.000661 0.607680
But when I tried to do parameter tuning with GridSearchCV, I found the results to be quite different. To be more specific, this is my code:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from xgboost.sklearn import XGBClassifier
import numpy as np
import pandas as pd
train_dataframe = pd.read_csv("train_users_processed_onehot.csv")
train_labels = train_dataframe["Buy"].map({"Y":1, "N":0})
train_features = train_dataframe.drop("Buy", axis=1)
params = {"max_depth": [5], "min_child_weight": [2]}
estimator = XGBClassifier(learning_rate=0.1, n_estimators=170, max_depth=2, min_child_weight=4, objective="binary:logistic", subsample=0.9, colsample_bytree=0.8, seed=2333)
gsearch1 = GridSearchCV(estimator, param_grid=params, n_jobs=4, iid=False, verbose=1, scoring="neg_log_loss")
gsearch1.fit(train_features.values, train_labels.values)
print(pd.DataFrame(gsearch1.cv_results_))
print(gsearch1.best_params_)
print(-gsearch1.best_score_)
and I got:
mean_fit_time mean_score_time mean_test_score mean_train_score
0 87.71497 0.209772 -3.134132 -0.567306
It is clear that 3.134132 is very different from 0.622318. What is the reason for this?
Thank you!
You are passing different parameters to the two runs:
max_depth: 5 vs 2
eta: 0.1 vs 0.3 (default)
min_child_weight: 2 vs 4
The parameters you pass to sklearn are more conservative (it is less likely you will overfit the model), so the algorithm does not try as hard to fit the model to the data. As a result, the second run gives a worse score, which is exactly what should be expected.
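One way to verify which parameter values each run actually uses is to print the fitted best estimator's parameters and compare them with the dict passed to xgb.cv; a minimal sketch, reusing gsearch1 from the question above:
best_params = gsearch1.best_estimator_.get_params()
for name in ("max_depth", "min_child_weight", "learning_rate",
             "n_estimators", "subsample", "colsample_bytree"):
    # Compare these against the params dict used with xgb.cv above.
    print(name, "=", best_params[name])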
Related
I'm trying to build a machine learning approach but I'm having some problems. This is my code:
import sys
import scipy
import numpy
import matplotlib
import pandas
import sklearn
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
dataset = pandas.read_csv('Libro111.csv')
array = numpy.asarray(dataset,dtype=numpy.float64) #all values are float64
X = array[:,1:49]
Y = array[:,0]
validation_size = 0.2
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
scoring = 'accuracy'
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
And then I get two different errors.
For Logistic Regression:
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py", line 172, in check_classification_targets
raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'
I found someone who had the same problem, but I haven't been able to sort it out yet.
And (most important):
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py", line 97, in unique_labels
raise ValueError("Unknown label type: %s" % repr(ys))
ValueError: Unknown label type: (array([ 0.5, 0. , 1. , 1. , 0.5, 0.5, 1. , 0.5, 0. , 0.5, 1. ,
0. , 0. , 0. , 1. , 1......
In both cases the error comes when the "cv_results" line is executed... So, I hope you can help me...
"ValueError: Unknown label type: 'continuous'" means Your "Y" values are not class type of data (multiple rows share a same integer value. each integer represent a class). Therefore, you cannot use "DecisionTreeClassifier", "KNeighborsClassifier", "LogisticRegression"(do not be fooled by its name, LogisticRegression is a boolean classification method) or any other classification machine learning methods. In reality, your "Y" values are all different or 'continuous' (probably are float numbers), so you can only use the regression machine learning (i.e. "RandomForestRegressor").
Here are two solutions:
a) Group the Y values into bins (classes), then apply classification modeling to your data (see the sketch below).
b) If you prefer your predictions to have continuous values (float numbers), you need to use regression methods to predict the Y values.
By the way, the "scoring = 'accuracy'" evaluation method is for classification modeling.
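For example, option (a) could be done with numpy.digitize; a minimal sketch, where the bin edges are arbitrary and only for illustration (they assume Y holds floats between 0 and 1, as in the error message above):
# Hypothetical bin edges; adjust them to whatever class boundaries make sense for your data.
Y_binned = numpy.digitize(Y, bins=[0.25, 0.75])  # 3 classes: Y < 0.25, 0.25 <= Y < 0.75, Y >= 0.75
# Y_binned now holds integer class labels (0, 1, 2) and can be used with the
# classifiers above (LogisticRegression, DecisionTreeClassifier, ...) and scoring = 'accuracy'.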
I have succeeded in building a binary classification model for images in a CNN using Keras, and made predictions using model.predict_classes(). Here is my code:
import numpy as np
import os,sys
from keras.models import load_model
import PIL
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
model = load_model('./potholes16_2.h5')
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
path= os.path.abspath("./potholes14/test/positive")
extensions = 'JPG'
if __name__ == "__main__":
    for f in os.listdir(path):
        if os.path.isfile(os.path.join(path, f)):
            f_text, f_ext = os.path.splitext(f)
            f_ext = f_ext[1:].upper()
            if f_ext in extensions:
                print(f)
                img = Image.open(os.path.join(path, f))
                new_width = 200
                new_height = 200
                img = img.resize((new_width, new_height), Image.ANTIALIAS)
                # width, height = image.size
                img = np.reshape(img, [1, new_width, new_height, 3])
                classes = model.predict_classes(img)
                print(classes)
Now I want to count the total number of images that were correctly predicted, for example, how many belong to class 0 or class 1?
You need to invoke the model.evaluate function; supposing you want to evaluate the data in x_test with the ground truth labels in y_test, then:
score = model.evaluate(x_test, y_test, verbose=0)
score[0] will give you the loss (binary cross entropy in your case), while score[1] contains the required binary accuracy.
See the docs for more details (scroll down looking for evaluate).
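For example, you could build x_test and y_test while looping over the test directory, much like the prediction code above does. This is a minimal sketch: names such as images_list and labels_list are hypothetical, and the positive folder is assumed to contain only class-1 images.
import os
import numpy as np
from PIL import Image

images_list, labels_list = [], []
for f in os.listdir(path):
    if os.path.isfile(os.path.join(path, f)) and f.upper().endswith('JPG'):
        img = Image.open(os.path.join(path, f)).resize((200, 200), Image.ANTIALIAS)
        images_list.append(np.asarray(img))
        labels_list.append(1)              # this folder is assumed to hold only positive images

x_test = np.array(images_list)             # shape (num_images, 200, 200, 3)
y_test = np.array(labels_list)             # shape (num_images,)
loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
print("loss:", loss, "binary accuracy:", accuracy)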
You must have a sample array of the data you are predicting on, correct? Well, you could load that data as well. Keep the code you have,
classes = model.predict_classes(img)
yields
array([[ 0.94981687],[ 0.57888238],[ 0.58651019],[ 0.30058956],[ 0.21879381]])
and your class data looks like this
class_validation = np.array([[1],[0],[0],[0],[1]])
Then just find where they are equal after rounding classes:
np.where(np.round(classes,0)==class_validation)[0].shape[0]
Note: there are many ways to write the last line; this assumes your numpy array has shape (number_of_samples, 1).
Another way to check:
totalCorrect = class_validation[((np.round(classes,0) - class_validation)==0)]
print('Correct in Class 1 = ',np.count_nonzero(totalCorrect),'Correct in Class 0 = ',abs(len(totalCorrect)-np.count_nonzero(totalCorrect)))
I am attempting to use the anneal.arff dataset with Python scikit-learn's semi-supervised algorithm LabelPropagation. The anneal dataset is categorical data, so I preprocessed it so that the output class for each instance
looks like [0. 0. 1. 0. 0.]. This is a numeric list that encodes the output class
as 5 possible values with 0's everywhere, and 1. in the position of the corresponding class. This is what I would expect.
For semi-supervised learning, most of the training data must be unlabeled, so
I modified the training set so that the unlabeled data has output [-1, -1, -1, -1, -1]. I previously tried just using -1, but the code emits the same error as shown below.
I train the classifier as follows, where Y_train includes both labeled and "unlabeled" data:
lp_model = LabelSpreading(gamma=0.25, max_iter=5)
lp_model.fit(X, Y_train)
I receive the error shown below after calling the fit method:
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\semi_supervised\label_propagation.py", line 221, in fit
X, y = check_X_y(X, y)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 526, in check_X_y
y = column_or_1d(y, warn=True)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 562, in column_or_1d
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (538, 5)
This suggests that something is wrong with the shape of my Y_train list,
but this is the correct shape. What am I doing wrong?
Can LabelPropagation take training data in this form, or does it only accept unlabeled data as a scalar -1?
--- edit ---
Here is the code that generates the error. I'm sorry about the confusion over algorithms--I want to use both LabelSpreading and LabelPropagation, and choosing one or the other doesn't fix this error.
from scipy.io import arff
import pandas as pd
import numpy as np
import math
from pandas.tools.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from copy import deepcopy
from sklearn.semi_supervised import LabelPropagation
from sklearn.semi_supervised import LabelSpreading
f = "../../Documents/UCI/anneal.arff"
dataAsRecArray, meta = arff.loadarff(f)
dataset_raw = pd.DataFrame.from_records(dataAsRecArray)
dataset = pd.get_dummies(dataset_raw)
class_names = [col for col in dataset.columns if 'class_' in col]
print (dataset.shape)
number_of_output_columns = len(class_names)
print (number_of_output_columns)
def run(name, model, dataset, percent):
    # Split-out validation dataset
    array = dataset.values
    X = array[:, 0:-number_of_output_columns]
    Y = array[:, -number_of_output_columns:]
    validation_size = 0.40
    seed = 7
    X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
    num_samples = len(Y_train)
    num_labeled_points = math.floor(percent*num_samples)
    indices = np.arange(num_samples)
    unlabeled_set = indices[num_labeled_points:]
    Y_train[unlabeled_set] = [-1, -1, -1, -1, -1]
    lp_model = LabelSpreading(gamma=0.25, max_iter=5)
    lp_model.fit(X_train, Y_train)
    """
    predicted_labels = lp_model.transduction_[unlabeled_set]
    print(predicted_labels[:10])
    """

if __name__ == "__main__":
    #percentages = [0.1, 0.2, 0.3, 0.4]
    percentages = [0.1]
    models = []
    models.append(('LS', LabelSpreading()))
    #models.append(('CART', DecisionTreeClassifier()))
    #models.append(('NB', GaussianNB()))
    #models.append(('SVM', SVC()))
    # evaluate each model in turn
    results = []
    names = []
    for name, model in models:
        for percent in percentages:
            run(name, model, dataset, percent)
    print("bye")
Your Y_train has shape (538, 5) but should be 1d. LabelPropagation doesn't support multi-label or multi-output multi-class right now.
The error message could be more informative, though :-/
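In practice that means passing a single integer label per sample, with the scalar -1 marking the unlabeled points; a minimal sketch, assuming X_train, Y_train and unlabeled_set are the variables from the run() function above:
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Collapse each one-hot row back to a single class index.
y_train_1d = np.argmax(Y_train, axis=1)
# Use the scalar -1 that LabelSpreading/LabelPropagation expect for unlabeled samples.
y_train_1d[unlabeled_set] = -1

lp_model = LabelSpreading(gamma=0.25, max_iter=5)
lp_model.fit(X_train, y_train_1d)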
So I have been trying to create a confusion matrix for my autoencoder:
from __future__ import division, print_function, absolute_import
import numpy as np
#import matplotlib.pyplot as plt
import tflearn
import tensorflow as tf
from random import randint
from tensorflow.contrib import metrics as ms
# Data loading and preprocessing
import tflearn.datasets.mnist as mnist
Images, Lables, testImages, testLables = mnist.load_data(one_hot=True)
f = randint(0,20)
x = tf.placeholder("float",[None, 784])
y = tf.placeholder("float",[None, 10])
# Building the encoder
encoder = tflearn.input_data(shape=[None, 784])
encoder = tflearn.fully_connected(encoder, 256)
encoder = tflearn.fully_connected(encoder, 64)
encoder = tflearn.fully_connected(encoder, 10)
acc= tflearn.metrics.Accuracy()
# Regression, with mean square error
net = tflearn.regression(encoder, optimizer='adam', learning_rate=0.001,
loss='mean_square', metric=acc, shuffle_batches=True, validation_monitors = ?)
model = tflearn.DNN(net, tensorboard_verbose=0)
model.fit(Images, Lables, n_epoch=20, validation_set=(testImages, testLables),
run_id="auto_encoder", batch_size=256,show_metric=True)
#Applying the above model on test Images and evaluating as well as prediction of the labels
evali= model.evaluate(testImages,testLables)
print("Accuracy of the model is :", evali)
lables = model.predict_label(testImages)
print("The predicted labels are :",lables[f])
prediction = model.predict(testImages)
print("The predicted probabilities are :", prediction[f])
I have gone through the documentation but it was not very useful to me.
How would I configure this to get the confusion matrix?
validation_monitors ={?}
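One way to get a confusion matrix here, independent of the validation_monitors argument, is to compute it from the model's predictions after training; a minimal sketch using scikit-learn, assuming the model, testImages and testLables from the code above:
import numpy as np
from sklearn.metrics import confusion_matrix

pred_probs = np.array(model.predict(testImages))   # per-class scores for each test image
pred_classes = np.argmax(pred_probs, axis=1)        # predicted label per image
true_classes = np.argmax(testLables, axis=1)        # labels are one-hot, so argmax recovers them
print(confusion_matrix(true_classes, pred_classes))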
Attempting to create a decision tree with cross-validation using sklearn and pandas.
My question is about the code below: the cross-validation splits the data, which I then use for both training and testing. I will be attempting to find the best depth of the tree by recreating it n times with different max depths set. When using cross-validation, should I instead be using k-fold CV, and if so, how would I use that within the code I have?
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn import cross_validation
features = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]
df = pd.read_csv('magic04.data',header=None,names=features)
df['class'] = df['class'].map({'g':0,'h':1})
x = df[features[:-1]]
y = df['class']
x_train,x_test,y_train,y_test = cross_validation.train_test_split(x,y,test_size=0.4,random_state=0)
depth = []
for i in range(3,20):
    clf = tree.DecisionTreeClassifier(max_depth=i)
    clf = clf.fit(x_train,y_train)
    depth.append((i,clf.score(x_test,y_test)))
print(depth)
Here is a link to the data that I am using, in case that helps anyone: https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope
In your code you are creating a static training-test split. If you want to select the best depth by cross-validation you can use sklearn.cross_validation.cross_val_score inside the for loop.
You can read sklearn's documentation for more information.
Here is an update of your code with CV:
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.cross_validation import cross_val_score
from pprint import pprint
features = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]
df = pd.read_csv('magic04.data',header=None,names=features)
df['class'] = df['class'].map({'g':0,'h':1})
x = df[features[:-1]]
y = df['class']
# x_train,x_test,y_train,y_test = cross_validation.train_test_split(x,y,test_size=0.4,random_state=0)
depth = []
for i in range(3,20):
    clf = tree.DecisionTreeClassifier(max_depth=i)
    # Perform 7-fold cross validation
    scores = cross_val_score(estimator=clf, X=x, y=y, cv=7, n_jobs=4)
    depth.append((i,scores.mean()))
print(depth)
Alternatively, you can use sklearn.model_selection.GridSearchCV and not write the for loop yourself, especially if you want to optimize over more than one hyper-parameter.
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.model_selection import GridSearchCV
features = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]
df = pd.read_csv('magic04.data',header=None,names=features)
df['class'] = df['class'].map({'g':0,'h':1})
x = df[features[:-1]]
y = df['class']
parameters = {'max_depth':range(3,20)}
clf = GridSearchCV(tree.DecisionTreeClassifier(), parameters, n_jobs=4)
clf.fit(X=x, y=y)
tree_model = clf.best_estimator_
print (clf.best_score_, clf.best_params_)
Edit: changed how GridSearchCV is imported to accommodate learn2day's comment.
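If you also want the mean score for every depth (the equivalent of the depth list from the original loop), the fitted GridSearchCV object exposes them in cv_results_; a minimal sketch, assuming clf has been fitted as above:
import pandas as pd

cv_results = pd.DataFrame(clf.cv_results_)
# One row per max_depth value, with the mean cross-validated score for each.
print(cv_results[["param_max_depth", "mean_test_score"]])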