I've been testing the dask_ml.xgboost regressor on a synthetic 10 GB dataset. During training, the workers' memory usage exceeds what is available on my local laptop. I am aware that I could run on a hosted dask cluster with more memory, or that I could sample the data (and ignore the rest) before training. But is there a different solution? I tried limiting the number and depth of the trees generated, subsampling the rows and columns, and changing the tree construction algorithm, but the workers still run out of memory.
Given a fixed memory allocation, is there a way to reduce the memory consumption of each worker when training dask_ml.xgboost?
Here is a code snippet:
import dask.dataframe as dd
from dask.distributed import Client
from dask_ml.xgboost import XGBRegressor
client = Client(memory_limit='7GB')
ddf = dd.read_csv('10GB_float.csv')
X = ddf[ddf.columns.difference(['float_1'])].persist()
y = ddf['float_1'].persist()
reg = XGBRegressor(
    objective='reg:squarederror', n_estimators=10, max_depth=2, tree_method='hist',
    subsample=0.001, colsample_bytree=0.5, colsample_bylevel=0.5,
    colsample_bynode=0.5, n_jobs=-1)
reg.fit(X, y)
The synthetic dataset 10GB_float.csv has 50 columns and 26758707 rows containing random floats (float64) ranging from 0 to 1. Below are the cluster details:
Cluster
Workers: 4
Cores: 12
Memory: 28.00 GB
And some information about my local laptop:
Memory: 31.1 GiB
Processor: Intel® Core™ i7-8750H CPU @ 2.20GHz × 12
Additionally, here are the parameters of XGBRegressor (using .get_params()):
{'base_score': 0.5,
'booster': 'gbtree',
'colsample_bylevel': 0.5,
'colsample_bynode': 0.5,
'colsample_bytree': 0.5,
'gamma': 0,
'importance_type': 'gain',
'learning_rate': 0.1,
'max_delta_step': 0,
'max_depth': 2,
'min_child_weight': 1,
'missing': None,
'n_estimators': 10,
'n_jobs': -1,
'nthread': None,
'objective': 'reg:squarederror',
'random_state': 0,
'reg_alpha': 0,
'reg_lambda': 1,
'scale_pos_weight': 1,
'seed': None,
'silent': None,
'subsample': 0.001,
'verbosity': 1,
'tree_method': 'hist'}
Thank you very much for your time!
Related
I'm using Random Forest Regressor to fit a 10-dimensional regression problem with around 300 thousand samples. Although it is not necessary for Random Forests, I started by putting the data on the same scale (using sklearn's preprocessing module) and then ran a randomised search over the following parameter space:
n_estimators = [int(x) for x in np.linspace(start=100, stop=2000, num=11)]
max_features = ['auto', 'sqrt']
max_depth = from 1 to 150 with step 11
min_samples_split = [2, 5, 10, 12]
min_samples_leaf = [1, 2, 4, 6]
bootstrap = [True, False]
Moreover, after getting the best parameters I did a second, narrower search.
Although I am using a 10-fold cross-validation scheme with the random search, I am still getting a serious overfitting problem!
I have also tried using the DBSCAN algorithm to check for outliers; after excluding some parts of the dataset I got even worse results!
Should I include other Random Forest parameters in the randomised search, or should I apply some more preprocessing techniques to the dataset before fitting?
For convenience, here is the implementation I wrote:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ShuffleSplit, RandomizedSearchCV

n_estimators = [int(x) for x in np.linspace(start=1, stop=15, num=15)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
min_samples_split = [2, 5, 10, 12]
min_samples_leaf = [1, 2, 4, 6]
bootstrap = [True, False]
cv = ShuffleSplit(n_splits=10, test_size=0.01, random_state=0)
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=50, cv=cv, verbose=2, random_state=42,
                               n_jobs=32)
rf_random.fit(x_train, y_train)  # x_train, y_train are defined elsewhere
The best parameters returned by RandomizedSearchCV:
bootstrap: False, min_samples_leaf: 2, n_estimators: 1647, max_features: 'sqrt', min_samples_split: 3, max_depth: None.
The range of the target is from 0 to 10000 [unit]. This model achieves an RMSE of 6.98 [unit] on the training set and an average RMSE of 67.54 [unit] on the test sets.
That line:
max_depth = from 1 to 150 with step 11
For a 10-feature problem, the optimum depth is under 10. You are overfitting like crazy because of that. Consider setting max_depth from 1 to 15 with step 1.
min_samples_split = [2, 5, 10, 12]
min_samples_leaf = [1, 2, 4, 6]
These should help reduce the variance; however, the step of 11 for max_depth is killing all the effort you could possibly make.
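As a minimal sketch, here is what that tighter grid could look like, reusing the estimator and cross-validation setup from the question; apart from the max_depth range suggested above, the value lists are simply copied from the question rather than being recommendations:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ShuffleSplit, RandomizedSearchCV

random_grid = {
    'n_estimators': [int(x) for x in np.linspace(start=100, stop=2000, num=11)],
    'max_features': ['auto', 'sqrt'],
    'max_depth': list(range(1, 16)),  # 1 to 15 with step 1, as suggested above
    'min_samples_split': [2, 5, 10, 12],
    'min_samples_leaf': [1, 2, 4, 6],
    'bootstrap': [True, False],
}
cv = ShuffleSplit(n_splits=10, test_size=0.01, random_state=0)
rf_random = RandomizedSearchCV(estimator=RandomForestRegressor(),
                               param_distributions=random_grid,
                               n_iter=50, cv=cv, verbose=2,
                               random_state=42, n_jobs=32)
# rf_random.fit(x_train, y_train)  # x_train, y_train as in the question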
I'm trying to train a CNN on the CIFAR-10 Dataset in Keras, but I'm only getting around 10% accuracy, essentially random. I'm training over 50 epochs, with a batch size of 32 and learning rate of 0.01. Is there anything in particular that I am doing wrong?
import os
import numpy as np
import pandas as pd
from PIL import Image
from keras.models import Model
from keras.layers import Input, Dense, Conv2D, MaxPool2D, Dropout, Flatten
from keras.optimizers import SGD
from keras.utils import np_utils
# trainingData = np.array([np.array(Image.open("train/" + f)) for f in os.listdir("train")]) #shape: 50k, 32, 32, 3
# testingData = np.array([np.array(Image.open("test/" + f)) for f in os.listdir("test")]) #shape: same as training
#
# trainingLabels = np.array(pd.read_csv("trainLabels.csv"))[:,1] #categorical labels ["dog", "cat", "etc"....]
# listOfLabels = sorted(list(set(trainingLabels)))
# trainingOutput = np.array([np.array([1.0 if label == ind else 0.0 for ind in listOfLabels]) for label in trainingLabels]) #converted to output
# #for example: training output for dog =
# #[1.0, 0.0, 0.0, ...]
# np.save("trainingInput.np", trainingData)
# np.save("testingInput.np", testingData)
# np.save("trainingOutput.np", trainingOutput)
trainingInput = np.load("trainingInput.npy") #shape = 50k, 32, 32, 3
testingInput = np.load("testingInput.npy") #shape = 10k, 32, 32, 3
listOfLabels = sorted(list(set(np.array(pd.read_csv("trainLabels.csv"))[:,1]))) #categorical list of labels as strings
trainingOutput = np.load("trainingOutput.npy") #shape = 50k, 10
#looks like [0.0, 1.0, 0.0 ... 0.0, 0.0]
print(listOfLabels)
print("Data loaded\n______________\n")
inp = Input(shape=(32, 32, 3))
conva1 = Conv2D(64, (3, 3), padding='same', activation='relu')(inp)
conva2 = Conv2D(64, (3, 3), padding='same', activation='relu')(conva1)
poola = MaxPool2D(pool_size=(3, 3))(conva2)
dropa = Dropout(0.1)(poola)
convb1 = Conv2D(128, (5, 5), padding='same', activation='relu')(dropa)
convb2 = Conv2D(128, (5, 5), padding='same', activation='relu')(convb1)
poolb = MaxPool2D(pool_size=(3, 3))(convb2)
dropb = Dropout(0.1)(poolb)
flat = Flatten()(dropb)
dropc = Dropout(0.5)(flat)
out = Dense(len(listOfLabels), activation='softmax')(dropc)
print(out.shape)
model = Model(inputs=inp, outputs=out)
lrSet = SGD(lr=0.01, clipvalue=0.5)
model.compile(loss='categorical_crossentropy', optimizer=lrSet, metrics=['accuracy'])
model.fit(trainingInput, trainingOutput, batch_size=32, epochs=50, verbose=1, validation_split=0.1)
print(model.predict(testingInput))
Is there anything in particular that I am doing wrong?
Not necessarily "wrong", but some pointers I can suggest are:
It is important that you rescale your data, if you are not already doing so. Instead of handling values ranging over [0, 255], it is better to divide everything by 255 and work with values in [0, 1]. This helps your model's weights converge faster, as each gradient update will be more significant compared to its unscaled version.
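For example, with the arrays loaded in your code, the rescaling could look like this (a sketch only; it assumes the saved .npy files contain raw pixel values in [0, 255]):
trainingInput = trainingInput.astype('float32') / 255.0
testingInput = testingInput.astype('float32') / 255.0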
I think that your dropout may be affecting your performance, even more so since you are using CNNs and a strong (0.5) dropout right before your output. Quoting this great answer:
In the original paper that proposed dropout layers, by Hinton (2012), dropout (with p=0.5) was used on each of the fully connected (dense) layers before the output; it was not used on the convolutional layers. This became the most commonly used configuration.
More recent research has shown some value in applying dropout also to convolutional layers, although at much lower levels: p=0.1 or 0.2.
So perhaps reducing your dropout, or experimenting with it a bit, will yield better results. Also notice that you are applying consecutive dropouts to your data, which doesn't seem particularly helpful in my opinion and could also be causing problems, so consider redesigning that part:
dropb = Dropout(0.1)(poolb) #drop
flat = Flatten()(dropb) #flatten
dropc = Dropout(0.5)(flat) #then drop again?
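One possible rearrangement, as a rough sketch only (the layer names reuse those from your code, and the 0.2 rate is merely illustrative of the lower conv-level dropout mentioned in the quote):
dropb = Dropout(0.2)(poolb)  # a single, light dropout after the convolutional block
flat = Flatten()(dropb)      # no second, back-to-back dropout on the flattened features
out = Dense(len(listOfLabels), activation='softmax')(flat)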
Your learning rate may be higher than what is normally used. Although 0.01 is SGD's default learning rate, with higher learning rates you may be "rushing" your training and failing to find better minima that could yield better performance. Consider using a lower learning rate (0.001 or lower, adjusting epochs as needed) or adding weight decay to your SGD instance. This can help keep your model from getting stuck in local minima that give sub-optimal results.
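For instance, a sketch of that optimizer change with your model (the specific values are illustrative, not tuned):
from keras.optimizers import SGD
lrSet = SGD(lr=0.001, momentum=0.9, decay=1e-6, clipvalue=0.5)  # lower learning rate plus weight decay
model.compile(loss='categorical_crossentropy', optimizer=lrSet, metrics=['accuracy'])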
I am doing a convolution in Theano:
theano.tensor.nnet.conv.conv2d(x,h, border_mode='full')
and it runs out of memory; I get the following message:
RuntimeError: GpuCorrMM failed to allocate working memory of 3591 x 319086
Apply node that caused the error: GpuCorrMM_gradInputs{valid, (1, 1)}(GpuContiguous.0, GpuContiguous.0)
Inputs types: [CudaNdarrayType(float32, (True, False, True, False)), CudaNdarrayType(float32, (False, True, False, False))]
Inputs shapes: [(1, 513, 1, 7), (1, 1, 513, 622)]
Inputs strides: [(0, 7, 0, 1), (0, 0, 622, 1)]
Inputs values: ['not shown', 'not shown']
I have tried setting the Theano flag 'optimizer_excluding=conv_dnn', but it still didn't work. Is there any way around this?
You are trying to allocate a matrix that needs something like 9 TB of memory. An individual neuron needs 2.5 GB of memory. The only optimization I know of for such issues is to either decrease the number of units or buy more RAM. Loads of RAM :)
For me, I disabled g++ at runtime by simply removing the (MinGW) bin directory from the PATH variable. Processing is slow, but it completes.
My program execution environment: OS Windows Vista 32-bit, CPU Intel 2.16 GHz, RAM 4.00 GB, and no GPU.
I have a CNN trained upon the images (cropped faces) of Mark Ruffalo. For my positive class I have around 200 images and for the negative datapoints I have sampled 200 random faces.
The model has a high recall but a very low precision. How could I increase the precision? I am also constrained by the number of positive images that I have, and I am ready to compromise recall in this tradeoff.
I have tried increasing the number of negative samples, but that introduces a form of bias and the model starts classifying everything as negative to attain a local optimum.
I have based my CNN upon OverFeat:
local features = nn.Sequential()
features:add(nn.SpatialConvolutionMM(3, 96, 11, 11))
features:add(nn.ReLU())
features:add(nn.SpatialMaxPooling(2, 2, 2, 2))
features:add(nn.SpatialConvolutionMM(96, 256, 5, 5))
features:add(nn.ReLU())
features:add(nn.SpatialMaxPooling(2, 2, 2, 2))
features:add(nn.SpatialConvolutionMM(256, 512, 3, 3))
features:add(nn.ReLU())
features:add(nn.SpatialMaxPooling(2, 2, 2, 2))
-- 24x24x512
features:add(nn.SpatialConvolutionMM(512, 1024, 3, 3))
features:add(nn.ReLU())
features:add(nn.SpatialMaxPooling(2, 2, 2, 2))
--11x11x1024
features:add(nn.SpatialConvolutionMM(1024, 1024, 3, 3))
features:add(nn.ReLU())
features:add(nn.SpatialMaxPooling(2, 2, 2, 2))
-- 1.3. Create Classifier (fully connected layers)
local classifier = nn.Sequential()
classifier:add(nn.View(1024*4*4))
classifier:add(nn.Dropout(0.5))
classifier:add(nn.Linear(1024*4*4, 3072))
classifier:add(nn.Threshold(0, 1e-6))
classifier:add(nn.Dropout(0.5))
classifier:add(nn.Linear(3072, 4096))
classifier:add(nn.Threshold(0, 1e-6))
classifier:add(nn.Linear(4096, noutputs))
model = nn.Sequential():add(features):add(classifier)
Kindly Help
Try playing with the raw output of the CNN instead of taking the sign() of the output node (since there is one positive and one negative class, I assume there is only one output in the range [-1, 1]).
For instance, for one sample the output could be [0.9], indicating that the positive class should be picked. But if you play with these values, you can hopefully find a specific threshold value that gives you the precision you need. In other words, if you find that anything greater than [-0.35] should actually be chosen as the positive class because it gives you better precision, then -0.35 should be your threshold value.
This is where ROC analysis comes in handy.
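As a concrete sketch of that threshold search, here is how it could be done in Python with scikit-learn rather than Torch; labels and scores are hypothetical stand-ins for your validation labels (0/1) and the raw network outputs in [-1, 1]:
import numpy as np
from sklearn.metrics import precision_recall_curve

labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])                     # hypothetical true classes
scores = np.array([0.9, 0.4, -0.2, 0.7, -0.8, 0.1, 0.5, -0.6])  # hypothetical raw outputs
precision, recall, thresholds = precision_recall_curve(labels, scores)
# pick the smallest threshold whose precision meets the target, to keep recall as high as possible
target_precision = 0.9
candidates = thresholds[precision[:-1] >= target_precision]
threshold = candidates[0] if candidates.size else thresholds[-1]
predictions = (scores >= threshold).astype(int)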
Let me know if this helps.
I know that a Gaussian Process model is best suited for regression rather than classification. However, I would still like to apply a Gaussian Process to a classification task but I am not sure what is the best way to bin the predictions generated by the model. I have reviewed the Gaussian Process classification example that is available on the scikit-learn website at:
http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gp_probabilistic_classification_after_regression.html
But I found this example confusing (I have listed the things I found confusing about it at the end of the question). To try to get a better understanding, I have created a very basic Python code example using scikit-learn that generates classifications by applying a decision boundary to the predictions made by a Gaussian Process:
#A minimum example illustrating how to use a
#Gaussian Processes for binary classification
import numpy as np
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.gaussian_process import GaussianProcess
if __name__ == "__main__":
    #defines some basic training and test data
    #If the descriptive features have large values
    #(i.e., 8s and 9s) the target is 1
    #If the descriptive features have small values
    #(i.e., 2s and 3s) the target is 0
    TRAININPUTS = np.array([[8, 9, 9, 9, 9],
                            [9, 8, 9, 9, 9],
                            [9, 9, 8, 9, 9],
                            [9, 9, 9, 8, 9],
                            [9, 9, 9, 9, 8],
                            [2, 3, 3, 3, 3],
                            [3, 2, 3, 3, 3],
                            [3, 3, 2, 3, 3],
                            [3, 3, 3, 2, 3],
                            [3, 3, 3, 3, 2]])
    TRAINTARGETS = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
    TESTINPUTS = np.array([[8, 8, 9, 9, 9],
                           [9, 9, 8, 8, 9],
                           [3, 3, 3, 3, 3],
                           [3, 2, 3, 2, 3],
                           [3, 2, 2, 3, 2],
                           [2, 2, 2, 2, 2]])
    TESTTARGETS = np.array([1, 1, 0, 0, 0, 0])
    DECISIONBOUNDARY = 0.5
    #Fit a Gaussian Process model to the data
    gp = GaussianProcess(theta0=10e-1, random_start=100)
    gp.fit(TRAININPUTS, TRAINTARGETS)
    #Generate a set of predictions for the test data
    y_pred = gp.predict(TESTINPUTS)
    print "Predicted Values:"
    print y_pred
    print "----------------"
    #Convert the continuous predictions into classes
    #by splitting on a decision boundary of 0.5
    predictions = []
    for y in y_pred:
        if y > DECISIONBOUNDARY:
            predictions.append(1)
        else:
            predictions.append(0)
    print "Binned Predictions (decision boundary = 0.5):"
    print predictions
    print "----------------"
    #print out the confusion matrix, specifying 1 as the positive class
    cm = confusion_matrix(TESTTARGETS, predictions, [1, 0])
    print "Confusion Matrix (1 as positive class):"
    print cm
    print "----------------"
    print "Classification Report:"
    print metrics.classification_report(TESTTARGETS, predictions)
When I run this code I get the following output:
Predicted Values:
[ 0.96914832 0.96914832 -0.03172673 0.03085167 0.06066993 0.11677634]
----------------
Binned Predictions (decision boundary = 0.5):
[1, 1, 0, 0, 0, 0]
----------------
Confusion Matrix (1 as positive class):
[[2 0]
[0 4]]
----------------
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 4
1 1.00 1.00 1.00 2
avg / total 1.00 1.00 1.00 6
The approach used in this basic example seems to work fine with this simple dataset. But this approach is very different from the classification example given on the scikit-learn website that I mentioned above (URL repeated here):
http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gp_probabilistic_classification_after_regression.html
So I'm wondering if I am missing something here. I would appreciate it if anyone could:
With respect to the classification example given on the scikit-learn website:
1.1 explain what the probabilities being generated in this example are probabilities of. Are they the probability of the query instance belonging to the class >0?
1.2 explain why the example uses a cumulative distribution function instead of a probability density function?
1.3 explain why the example divides the predictions made by the model by the square root of the mean squared error before they are passed into the cumulative distribution function?
With respect to the basic code example I have listed here, clarify whether or not applying a simple decision boundary to the predictions generated by a Gaussian Process model is an appropriate way to do binary classification.
Sorry for such a long question and thanks for any help.
In the GP classifier, a standard GP distribution over functions is "squashed," usually using the standard normal CDF (also called the probit function), to map it to a distribution over binary categories.
Another interpretation of this process is through a hierarchical model (this paper has the derivation), with a hidden variable drawn from a Gaussian Process.
In sklearn's gp library, it looks like the outputs of y_pred, MSE = gp.predict(xx, eval_MSE=True) are the (approximate) posterior means (y_pred) and posterior variances (MSE) evaluated at the points in xx, before any squashing occurs.
To obtain the probability that a point from the test set belongs to the positive class, you can convert the normal distribution over y_pred into a binary distribution by applying the normal CDF (see the paper linked above for details).
The hierarchical model with the probit squashing function corresponds to a decision boundary at 0 (the standard normal distribution is symmetric around 0, so Φ(0) = 0.5). So you should set DECISIONBOUNDARY = 0.
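As a minimal sketch of that conversion, assuming the variables from the question's code (gp, TESTINPUTS) are in scope and using the same older sklearn GaussianProcess API:
import numpy as np
from scipy.stats import norm

# posterior mean and variance of the latent GP at the test points
y_pred, MSE = gp.predict(TESTINPUTS, eval_MSE=True)
# probit squashing: divide by the posterior standard deviation, then apply the standard normal CDF
prob_positive = norm.cdf(y_pred / np.sqrt(MSE))
# thresholding the probability at 0.5 is equivalent to a decision boundary of 0 on y_pred
predictions = (prob_positive > 0.5).astype(int)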