Uniformly distributed random variables in the RandomizedSearchCV algorithm - machine-learning

I would like to clarify one thing. I know that the following command generates a uniformly distributed random variable in the interval [loc, loc + scale]:
from scipy.stats import uniform
C =uniform.rvs(loc=0,scale=4)
Suppose I want to use this distribution in logistic regression with the RandomizedSearchCV algorithm, as shown below:
parameters =dict(C =uniform(loc=0,scale=4),penalty=['l2', 'l1'])
from sklearn.model_selection import RandomizedSearchCV
clf = RandomizedSearchCV(logreg, parameters, random_state=0)
search = clf.fit(iris.data, iris.target)
But there is one thing I do not understand. RandomizedSearchCV is like a grid search, except that it tries a given number (n_iter) of randomly selected combinations. Here, however, C is an object, not an array or anything similar, and I cannot even print its value. So how should I understand this code? How does it generate random numbers without an explicit call to rvs?

According to the documentation for the param_distributions argument (here parameters):
Dictionary with parameters names (str) as keys and distributions or lists of parameters to try. Distributions must provide a rvs method for sampling (such as those from scipy.stats.distributions). If a list is given, it is sampled uniformly.
So, what is happening at each iteration is:
Sample a value for C according to a uniform distribution in [0, 4]
Sample a value for penalty, uniformly between l1 and l2 (i.e. with 50% probability for each)
Use these sampled values for running a CV and store the results
Using the example from the documentation (practically identical with the parameters in your question):
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
iris = load_iris()
logistic = LogisticRegression(solver='saga', tol=1e-2, max_iter=200,
                              random_state=0)
distributions = dict(C=uniform(loc=0, scale=4),
                     penalty=['l2', 'l1'])
clf = RandomizedSearchCV(logistic, distributions, random_state=0)
search = clf.fit(iris.data, iris.target)
inspecting search.best_params_, we get
# {'C': 2.195254015709299, 'penalty': 'l1'}
We can go a step further, and see all the (10) combinations used, along with their performance:
import pandas as pd
df = pd.DataFrame(search.cv_results_)
print(df[['params', 'mean_test_score']])
# result:
params mean_test_score
0 {'C': 2.195254015709299, 'penalty': 'l1'} 0.980000
1 {'C': 3.3770629943240693, 'penalty': 'l1'} 0.980000
2 {'C': 2.1795327319875875, 'penalty': 'l1'} 0.980000
3 {'C': 2.4942547871438894, 'penalty': 'l2'} 0.980000
4 {'C': 1.75034884505077, 'penalty': 'l2'} 0.980000
5 {'C': 0.22685190926977272, 'penalty': 'l2'} 0.966667
6 {'C': 1.5337660753031108, 'penalty': 'l2'} 0.980000
7 {'C': 3.2486749151019727, 'penalty': 'l2'} 0.980000
8 {'C': 2.2721782443757292, 'penalty': 'l1'} 0.980000
9 {'C': 3.34431505414951, 'penalty': 'l2'} 0.980000
From this it is apparent that all values of C tried were indeed in [0, 4], as requested. Also, since more than one combination achieved the best score of 0.98, scikit-learn uses the first one as returned in cv_results_.
Looking closely, we see that only 4 trials were run with the l1 penalty (and not 50% of the 10, i.e. 5, as we might expect), but this is to be expected with small random samples (here only 10).
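The sampling steps described above can be reproduced without fitting any model. The sketch below is illustrative only: it uses scikit-learn's ParameterSampler, which, as far as I know, is what RandomizedSearchCV uses internally to draw its candidates. The names one_draw and candidates are mine, not from the question.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# Same search space as in the question: a frozen scipy distribution for C
# and a plain list for penalty.
distributions = dict(C=uniform(loc=0, scale=4), penalty=['l2', 'l1'])

# The frozen distribution is an object, not a number, but it produces
# numbers on demand through its rvs method.
one_draw = distributions['C'].rvs(random_state=0)
print(one_draw)

# ParameterSampler calls rvs on distributions and picks uniformly from lists,
# yielding one dict of parameter values per iteration.
candidates = list(ParameterSampler(distributions, n_iter=10, random_state=0))
for params in candidates:
    print(params)
```

This is why no explicit rvs call appears in the question's code: the search object makes that call for each of the n_iter candidates.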

You want RandomizedSearchCV to explore more than one value of C. With refit=True, clf can then be used directly as the best fitted model, and return_train_score=True adds the training scores to cv_results_.
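import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
# X, y, logreg and parameter_grid are assumed to be defined as in the question
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.4,random_state=42)
clf = RandomizedSearchCV(logreg, parameter_grid,
                         n_iter = 10,
                         refit = True,
                         return_train_score = True)
search = clf.fit(X_train,y_train)
# With refit=True, predict() uses the best model refitted on all of X_train
predictions = clf.predict(X_test)
print("Model accuracy {}%".format(accuracy_score(y_test,predictions)*100))
cv_results_df = pd.DataFrame(clf.cv_results_)
column = cv_results_df.loc[:, ['params']]
# Extract and print the row that had the best mean test score
best_row = cv_results_df[cv_results_df['rank_test_score'] == 1 ]
print(best_row)
#print(clf.best_index_) you can use with iloc to slice the best row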


Restricting prediction range of sklearn regressor

Let's say I have the dataframe below, which describes the course of two cases.
import pandas as pd
data = {
df = pd.DataFrame(data)
Imagine I want to predict the duration of case 2 based on the duration of case 1. For this I could set up the following code.
train = df[df['case'] == 1]
test = df[df['case'] == 2]
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
X = ['duration','stage']
y = ['total_duration']
train_X, train_y = train[X], train[y]
test_X, test_y = test[X], test[y]
model.fit(train_X, train_y.values.ravel())
model.predict(test_X)
output: array([10., 10., 10., 10., 10.])
Because the dataset is so small, the model naively predicts the total duration of case 2 to be the same as that of case 1. However, the prediction is not feasible for one data point, where the current duration of the case is already 13, which exceeds the predicted total duration of 10.
Is there a way to restrict the model so that it never predicts a total duration lower than the current duration? That would give the following output:
output: array([10., 10., 10., 10., 13.])
This may not be an ideal way to predict such a feature, and an alternative may be to predict duration_left. But that would add a trend to my target variable which is what I want to prevent.
Is there a way I can achieve the goal mentioned above in sklearn?
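One possible approach, offered only as a sketch and not as something built into the regressor, is to leave the model unchanged and clip its predictions afterwards so that each one is at least the current duration. The arrays below are hypothetical stand-ins for model.predict(test_X) and test['duration'], since the question's data is not shown; only the final value of 13 comes from the question.

```python
import numpy as np

# Hypothetical stand-ins for the question's elided data.
predicted_total = np.array([10., 10., 10., 10., 10.])  # e.g. model.predict(test_X)
current_duration = np.array([2., 4., 7., 9., 13.])     # e.g. test['duration'].values

# A total duration can never be shorter than the time already elapsed,
# so take the element-wise maximum of the two arrays.
restricted = np.maximum(predicted_total, current_duration)
print(restricted)
```

This only enforces the constraint at prediction time; the model itself is still trained without knowledge of it.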

What do the 'normalize' parameters mean in sklearn's confusion_matrix?

I am using sklearn's confusion_matrix to plot the results, together with the accuracy, recall and precision scores etc., and the graph renders as it should. However, I am slightly confused about what the different values of the normalize parameter mean. Why do we normalize, and what are the differences between the 3 options? Quoting from the documentation:
normalize{‘true’, ‘pred’, ‘all’}, default=None
Normalizes confusion matrix over the true (rows), predicted (columns) conditions or all the population.
If None, confusion matrix will not be normalized.
Does it convert the counts to percentages to make them easier to read when datasets are large? Or am I missing the point altogether here? I have searched, but the questions I found all explain how to do it rather than what it means.
A normalized version makes it easier to visually interpret how the labels are being predicted and also to perform comparisons. You can also pass values_format='.0%' to display the values as percentages. The normalize parameter specifies what the denominator should be:
'true': sum of rows (True label)
'pred': sum of columns (Predicted label)
'all': sum of all
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_moons
# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is its replacement
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
# Generate some example data
X, y = make_moons(noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=10)
# Train the classifier on the training split only
clf = LogisticRegression()
clf.fit(X_train, y_train)
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test); plt.title("Not normalized");
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, values_format='.0%', normalize='true'); plt.title("normalize='true'");
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, values_format='.0%', normalize='pred'); plt.title("normalize='pred'");
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, values_format='.0%', normalize='all'); plt.title("normalize='all'");
plt.show()
Yes, you can think of it as a percentage. The default is to just show the absolute count in each cell of the confusion matrix, i.e. how often each combination of true and predicted category levels occurs.
But if you choose e.g. normalize='all', every count value will be divided by the sum of all count values, so that you have relative frequencies whose sum over the whole matrix is 1. Similarly, if you pick normalize='true', you will have relative frequencies per row.
If you repeat an experiment with different sample sizes, you may want to compare confusion matrices across experiments. To do so, you wouldn't want to see the total counts for each matrix. Instead, you would want to see normalized counts, but you need to decide whether to normalize by the total number of samples ("all"), the predicted class counts ("pred"), or the true class counts ("true"). For example:
In [30]: yt
Out[30]: array([1, 0, 0, 0, 0, 1, 1, 0, 0, 0])
In [31]: yp
Out[31]: array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
In [32]: confusion_matrix(yt, yp)
array([[4, 3],
[3, 0]])
In [33]: confusion_matrix(yt, yp, normalize='pred')
array([[0.57142857, 1. ],
[0.42857143, 0. ]])
In [34]: confusion_matrix(yt, yp, normalize='true')
array([[0.57142857, 0.42857143],
[1. , 0. ]])
In [35]: confusion_matrix(yt, yp, normalize='all')
array([[0.4, 0.3],
[0.3, 0. ]])
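To make the denominators explicit, here is a small sketch that recomputes the three normalized matrices by hand from the raw counts, using the same yt and yp as above. The names by_true, by_pred and by_all are mine.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

yt = np.array([1, 0, 0, 0, 0, 1, 1, 0, 0, 0])
yp = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])

# Raw counts: rows are true labels, columns are predicted labels.
cm = confusion_matrix(yt, yp)

# normalize='true': divide each row by its row sum.
by_true = cm / cm.sum(axis=1, keepdims=True)
# normalize='pred': divide each column by its column sum.
by_pred = cm / cm.sum(axis=0, keepdims=True)
# normalize='all': divide every cell by the total number of samples.
by_all = cm / cm.sum()

print(by_true)
print(by_pred)
print(by_all)
```

Each row of by_true sums to 1, each column of by_pred sums to 1, and all cells of by_all together sum to 1.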

high variance with Randomforest learner

I'm using a Random Forest Regressor to fit a 10-dimensional regression problem with around 300 thousand samples. Although it is not necessary with Random Forest, I started by putting the data on the same scale (using sklearn's preprocessing module), and then I did a randomised search over the following parameter space:
n_estimators=[int(x) for x in linspace (start=100, stop= 2000, num=11)]
max_features= auto, sqrt
max_depth= from 1- to 150 with step =11
Bootstrap true or false
Moreover, after getting the best parameters I did a second narrower search.
Though I am using a 10-Fold cross validation scheme with the random search I'm still getting a serious overfitting problem!
Moreover, I have also tried using DBSCAN algorithm to check for outliers. After excluding some parts of the dataset I got even worse results!
Should I include other parameters of the Random Forest in the randomised search? Or should I apply more preprocessing to the data set before fitting?
For convenience, this is my implementation I wrote:
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
n_estimators = [int(x) for x in np.linspace(start = 1, stop = 15, num = 15)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
min_samples_split = [2, 5, 10,12]
min_samples_leaf = [1, 2, 4,6]
bootstrap = [True, False]
cv = ShuffleSplit(n_splits=10, test_size=0.01, random_state=0)
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid,
                               n_iter = 50, cv = cv, verbose=2, random_state=42,
                               n_jobs = 32)
rf_random.fit(x_train, y_train)
The best parameters returned by the randomized search:
bootstrap: False. min_samples_leaf=2. n_estimators=1647. max_features: sqrt. min_samples_split=3. max_depth: None.
The range of the target is from 0 to 10000 [unit]. This model gives an RMSE of 6.98 [unit] on the training set and an average RMSE of 67.54 [unit] on the test sets.
The problem is this line:
max_depth= from 1- to 150 with step =11
For a 10-feature problem, the optimum depth is under 10. You are overfitting heavily because of that. Consider searching max_depth from 1 to 15 with step 1.
This should help reduce the variance. With a step of 11, the search skips almost all of the shallow depths where the optimum is likely to be.
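A minimal sketch of the suggested narrower search follows. It is not the asker's setup: the data is synthetic (make_regression), the forest is kept small so the example runs quickly, and the fold count, n_iter and scoring are my assumptions. Setting return_train_score=True makes the gap between training and validation RMSE visible.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic 10-feature data standing in for the real dataset (assumption).
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

param_distributions = {
    'max_depth': list(range(1, 16)),   # 1 to 15 with step 1, as suggested
    'min_samples_leaf': [1, 2, 4, 6],
}

search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    param_distributions,
    n_iter=8,
    cv=3,
    scoring='neg_root_mean_squared_error',
    return_train_score=True,
    random_state=42,
)
search.fit(X, y)

# Scores are negated RMSE values, so flip the sign for reporting.
best = search.best_index_
train_rmse = -search.cv_results_['mean_train_score'][best]
valid_rmse = -search.cv_results_['mean_test_score'][best]
print(search.best_params_)
print('train RMSE: %.2f, validation RMSE: %.2f' % (train_rmse, valid_rmse))
```

A large gap between the two RMSE values for the selected parameters is the overfitting symptom described in the question.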

Neural Network for Regression with tflearn

My question is about coding a neural network which does regression (and NOT classification) using tflearn.
fixed acidity volatile acidity citric acid ... alcohol quality
7.4 0.700 0.00 ... 9.4 5
7.8 0.880 0.00 ... 9.8 5
7.8 0.760 0.04 ... 9.8 5
11.2 0.280 0.56 ... 9.8 6
7.4 0.700 0.00 ... 9.4 5
I want to build a neural network which takes in 11 features (chemical values in wine) and outputs or predicts a score, i.e. quality (out of 10). I DON'T want to classify the wine like quality_1, quality_2,... I want the model to perform a regression on my features and predict a value out of 10 (it could even be a float).
The quality column in my data only has values = [3, 4, 5, 6, 7, 8, 9].
It does not contain 1, 2, and 10.
Due to my lack of experience, I could only code a neural network that CLASSIFIES the wine into classes like [score_3, score_4,...], and I used one-hot encoding to do so.
Processed Data:
[[ 7.5999999 0.23 0.25999999 ..., 3.02999997 0.44
[ 6.9000001 0.23 0.34999999 ..., 2.79999995 0.54000002
11. ]
[ 6.69999981 0.17 0.37 ..., 3.25999999 0.60000002
[ 6.30000019 0.28 0.47 ..., 3.11999989 0.50999999
9.5 ]
[ 5.19999981 0.64499998 0. ..., 3.77999997 0.61000001
12.5 ]
[ 8. 0.23999999 0.47999999 ..., 3.23000002 0.69999999
10. ]]
[[ 0. 1. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 1. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 1. ..., 0. 0. 0.]]
Code for a neural network which CLASSIFIES into different classes:
import pandas as pd
import numpy as np
import tflearn
from tflearn.layers.core import input_data, fully_connected
from tflearn.layers.estimator import regression
from sklearn.model_selection import train_test_split
def preprocess():
    data_source_red = r'F:\Gautam\...\Datasets\winequality-red.csv'
    data_red = pd.read_csv(data_source_red, index_col=False, sep=';')
    # One-hot encode the quality column into score_* columns
    data = pd.get_dummies(data_red, columns=['quality'], prefix=['score'])
    x = data[data.columns[0:11]].values
    y = data[data.columns[11:18]].values
    x = np.float32(x)
    y = np.float32(y)
    return (x, y)
x, y = preprocess()
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size = 0.2)
network = input_data(shape=[None, 11], name='Input_layer')
network = fully_connected(network, 10, activation='relu', name='Hidden_layer_1')
network = fully_connected(network, 10, activation='relu', name='Hidden_layer_2')
network = fully_connected(network, 7, activation='softmax', name='Output_layer')
network = regression(network, batch_size=2, optimizer='adam', learning_rate=0.01)
model = tflearn.DNN(network)
model.fit(train_x, train_y, show_metric=True, run_id='wine_regression',
          validation_set=0.1, n_epoch=1000)
The neural network above is a poor one (accuracy=0.40). Moreover, it classifies the data into different classes. I would like to know how to code a regression neural network which gives a score out of 10 for the input features (and NOT a CLASSIFICATION). I would also prefer tflearn, as I'm quite comfortable with it.
This is the line in your code which makes your network a classifier with seven categories, instead of a regressor:
network = fully_connected(network, 7, activation='softmax', name='Output_layer')
I don't use TFLearn any more, I have switched over to Keras (which is similar, and has better support). However, I will suggest that you want the following output layer instead:
network = fully_connected(network, 1, activation='linear', name='Output_layer')
Also, your training data will need to change. If you want to perform a regression, you want a one-dimensional scalar label instead. I assume that you still have the original data, which you say that you altered? If not, the UC Irvine Machine Learning Data Repository has the wine quality data with a single, numerical Quality column.
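As a rough sketch of the data change described above, the preprocessing could skip get_dummies and keep quality as a single float column, shaped (n, 1) to match a one-unit output layer. The small DataFrame below is a stand-in built from the rows quoted in the question, showing only three of the 11 feature columns. Separately, and this is my assumption rather than something the answer states, the tflearn regression layer would presumably also need a squared-error loss such as loss='mean_square', since its default loss is meant for classification.

```python
import numpy as np
import pandas as pd

# A few rows in the shape of the wine data; only three of the 11 feature
# columns are shown, with values copied from the question's table.
data = pd.DataFrame({
    'fixed acidity': [7.4, 7.8, 7.8, 11.2],
    'volatile acidity': [0.70, 0.88, 0.76, 0.28],
    'alcohol': [9.4, 9.8, 9.8, 9.8],
    'quality': [5, 5, 5, 6],
})

# No one-hot encoding here: quality stays a single numeric target column.
x = data.drop(columns=['quality']).values.astype(np.float32)
y = data['quality'].values.astype(np.float32).reshape(-1, 1)

print(x.shape, y.shape)
```

With the full dataset, x would have 11 columns and y would still have exactly one.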

keras stuck during optimization

After trying the Keras example on CIFAR10, I decided to go for something bigger : a VGG-like net on the Tiny Imagenet dataset. This is a subset of the ImageNet dataset with 200 classes (instead of 1000) and 100K images downscaled to 64x64.
I got the VGG-like model from the file vgg_like_convnet.py here. Unfortunately, things are going pretty much like here, except that this time changing the learning rate or swapping TH for TF does not help. Neither does changing the optimizer (see code below).
Accuracy is basically stuck at 0.005 which, as was pointed out, is what you would expect from completely random answers with 200 classes (1/200 = 0.005). Worse, if by a fluke of weight initialization it starts at, say, 0.007, it quickly converges to 0.005 and stays there firmly for every subsequent epoch.
The Keras code (TH version) is below :
from __future__ import print_function
from keras.datasets import cifar10
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D, ZeroPadding2D
from keras.regularizers import l2, activity_l2, l1, activity_l1
from keras.optimizers import SGD, Adam, Adagrad, Adadelta
from keras.utils import np_utils
import numpy as np
import cPickle as pickle
# seed = 7
# np.random.seed(seed)
batch_size = 64
nb_classes = 200
nb_epoch = 30
# input image dimensions
img_rows, img_cols = 64, 64
# the tiny image net images are RGB
img_channels = 3
# Load the train dataset for TH
print('Load training data')
X_train=pickle.load(open('xtrain_shu_th.p','rb')) # np.zeros((100000,3,64,64)).astype('uint8')
y_train=pickle.load(open('ytrain_shu_th.p','rb')) # np.zeros((100000,1)).astype('uint8')
# Load the test dataset for TH
print('Load validation data')
X_test=pickle.load(open('xtest_th.p','rb')) # np.zeros((10000,3,64,64)).astype('uint8')
y_test=pickle.load(open('ytest_th.p','rb')) # np.zeros((10000,1)).astype('uint8')
# the data, shuffled and split between train and test sets
# (X_train, y_train), (X_test, y_test) = cifar10.load_data()
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')
# convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)
model = Sequential()
model.add(Convolution2D(64, 3, 3, activation='relu'))
model.add(Convolution2D(64, 3, 3, activation='relu',))
model.add(MaxPooling2D((2,2), strides=(2,2)))
model.add(Convolution2D(128, 3, 3, activation='relu'))#,weights=pretrained_weights['layer_6'].values()))
model.add(Convolution2D(128, 3, 3, activation='relu'))#,weights=pretrained_weights['layer_8'].values()))
model.add(MaxPooling2D((2,2), strides=(2,2)))
model.add(Convolution2D(256, 3, 3, activation='relu'))#,weights=pretrained_weights['layer_11'].values()))
model.add(Convolution2D(256, 3, 3, activation='relu'))#,weights=pretrained_weights['layer_13'].values()))
model.add(Convolution2D(256, 3, 3, activation='relu'))#,weights=pretrained_weights['layer_15'].values()))
model.add(MaxPooling2D((2,2), strides=(2,2)))
model.add(Convolution2D(512, 3, 3, activation='relu'))#,weights=pretrained_weights['layer_18'].values()))
model.add(Convolution2D(512, 3, 3, activation='relu'))#,weights=pretrained_weights['layer_20'].values()))
model.add(Convolution2D(512, 3, 3, activation='relu'))#,weights=pretrained_weights['layer_22'].values()))
model.add(MaxPooling2D((2,2), strides=(2,2)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))
model.add(Dense(200, activation='softmax'))
# let's train the model using SGD + momentum (how original).
opt = SGD(lr=0.0001, decay=1e-6, momentum=0.7, nesterov=True)
# opt= Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
# opt = Adadelta(lr=1.0, rho=0.95, epsilon=1e-08, decay=0.0)
# opt = Adagrad(lr=0.01, epsilon=1e-08, decay=0.0)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
model.fit(X_train, Y_train,
validation_data=(X_test, Y_test),
# Save the resulting model
The Tiny Imagenet dataset consists of JPEG images that I converted to PPM with djpeg. I then created a large binary file containing, for each image, the class label (1 byte) followed by (64x64x3 bytes).
Reading this file from Keras was excruciatingly slow. So (I'm very new to Python, it might sound dumb to you), I decided to init a 4D Numpy array (100000,3,64,64) (for TH, (100000,64,64,3) for TF) with the dataset and pickle it. It now takes ~40s to load the dataset in the array when I run the code above.
I even checked that the pickled array contained the data in the right order with the code below:
import numpy as np
import cPickle as pickle
print("Reading data")
f.write('P6\n64 64\n255\n')
for y in range(0,64):
for x in range(0,64):
This extracts PPM images back from the dataset.
Finally, I noticed that the training dataset was too ordered (i.e. the first 500 images all belonged to class 0, the second 500 to class 1, etc. etc.)
So I shuffled them with the code below:
# Dataset preparation for Theano backend
import cPickle as pickle
import numpy as np
import random as rnd
print('Load training data')
X_train=pickle.load(open('xtrain_th.p','rb')) # np.zeros((100000,3,64,64)).astype('uint8')
y_train=pickle.load(open('ytrain_th.p','rb')) # np.zeros((100000,1)).astype('uint8')
# Shuffle the data
print('Shuffling training data')
for _ in range(0,n):
print 'Pickle dump'
Nothing helped. I wasn't expecting 99% accuracy at the first attempt, but at least some movement and then plateau.
I wanted to try TFLearn, but it had a pending bug when I looked a few days ago.
Any ideas ? Thanks in advance
You can use the built-in shuffle of the Keras model API (https://keras.io/models/model/#fit). Just set the shuffle parameter to True. You can do both batch shuffle and global shuffle. The default is global shuffle.
One thing to note though is that the validation split in fit is done before the shuffling takes place. Therefore in case you want to shuffle your validation data too I would advise you to use: sklearn.utils.shuffle. (http://scikit-learn.org/stable/modules/generated/sklearn.utils.shuffle.html)
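For reference, a minimal sketch of shuffling images and labels together with sklearn.utils.shuffle. The arrays are tiny stand-ins I made up (each fake image is filled with its own label value) so that the pairing can be checked after shuffling; they are not the Tiny ImageNet data.

```python
import numpy as np
from sklearn.utils import shuffle

# Stand-in data (assumption): 6 "images" in TH layout (n, channels, rows, cols),
# each filled with its label value so the pairing is easy to verify.
y_train = np.array([[0], [0], [1], [1], [2], [2]], dtype='uint8')
X_train = np.ones((6, 3, 4, 4), dtype='uint8') * y_train.reshape(-1, 1, 1, 1)

# Both arrays are permuted with the same random order along the first axis.
X_shuf, y_shuf = shuffle(X_train, y_train, random_state=0)

print(y_shuf.ravel())
```

Because both arrays receive the same permutation, every image keeps its original label.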
From github:
if shuffle == 'batch':
index_array = batch_shuffle(index_array, batch_size)
elif shuffle:
