I created a function to perform a grid search for the optimal parameters of my XGBoost classifier. Because my training set is large, I want to limit the grid search to a sample of about 5000 observations.
This is the function:
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from xgboost import XGBClassifier

def xgboost_search(X, y, search_verbose=1):
    params = {
        "gamma": [0.5, 1, 1.5, 2, 5],
        "max_depth": [3, 4, 5, 6],
        "min_child_weight": [100],
        "subsample": [0.6, 0.8, 1.0],
        "colsample_bytree": [0.6, 0.8, 1.0],
        "learning_rate": [0.1, 0.01, 0.001]
    }
    xgb = XGBClassifier(objective="binary:logistic", eval_metric="auc", use_label_encoder=False)
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=1234)
    grid_search = GridSearchCV(estimator=xgb, param_grid=params, scoring="roc_auc",
                               n_jobs=1, cv=skf.split(X, y), verbose=search_verbose)
    grid_search.fit(X, y)
    print("Best estimator: ")
    print(grid_search.best_estimator_)
    print("Parameters: ", grid_search.best_params_)
    print("Highest AUC: %.2f" % grid_search.best_score_)
    return grid_search.best_params_
This is what I tried to get the 5000 observations:
rows = random.sample(list(X_res), 5000)
model_params = xgboost_search(X_res[rows], Y_res[rows])
I got this error:
IndexError Traceback (most recent call last)
/var/folders/cf/yh2vvpdn0klby68k9zrttfv00000gp/T/ipykernel_80963/3533706692.py in <module>
1 rows = random.sample(list(X_res), 5000)
----> 2 model_params = xgboost_search(X_res[rows], Y_res[rows])
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
I think this is because my X_res and Y_res are arrays and rows is a list. Can someone help?
Arrays can in fact be indexed by a list; the issue here is the type of the indices in your list, which are not integers. The
only integers ... are valid indices
part of the error message points to what went wrong.
This is because you sampled 5000 elements of X_res with random.sample(list(X_res), 5000), not 5000 indices between 0 and len(X_res) as you probably meant to.
Try:
rows = random.sample(range(len(X_res)), 5000)
model_params = xgboost_search(X_res[rows], Y_res[rows])
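As an aside, since X_res is a numpy array, you could draw the sample with numpy's own index sampling instead. A minimal sketch, assuming X_res and Y_res are numpy arrays as in the question:
import numpy as np

# Draw 5000 distinct row indices, then index both arrays with them.
rows = np.random.choice(len(X_res), size=5000, replace=False)
model_params = xgboost_search(X_res[rows], Y_res[rows])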
I am trying to use bincount over a 2D array. Specifically, I have this code:
import numpy as np
import dask.array as da

def dask_bincount(weights, x):
    da.bincount(x, weights)

idx = da.random.random_integers(0, 1024, 1000)
weight = da.random.random((1000, 2))

bin_count = da.apply_along_axis(dask_bincount, 1, weight, idx)
The idea is that the bincount can be computed with the same idx array on each one of the weight columns. If I am correct, that would return an array of shape (np.amax(idx) + 1, 2).
However, when doing this I get this error message:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-17-5b8eed89ad32> in <module>
----> 1 bin_count = da.apply_along_axis(dask_bincount, 1, weight, idx)
~/.local/lib/python3.9/site-packages/dask/array/routines.py in apply_along_axis(func1d, axis, arr, dtype, shape, *args, **kwargs)
454 if shape is None or dtype is None:
455 test_data = np.ones((1,), dtype=arr.dtype)
--> 456 test_result = np.array(func1d(test_data, *args, **kwargs))
457 if shape is None:
458 shape = test_result.shape
<ipython-input-14-34fd0eb9b775> in dask_bincount(weights, x)
1 def dask_bincount(weights, x):
----> 2 da.bincount(x, weights)
~/.local/lib/python3.9/site-packages/dask/array/routines.py in bincount(x, weights, minlength, split_every)
670 raise ValueError("Input array must be one dimensional. Try using x.ravel()")
671 if weights is not None:
--> 672 if weights.chunks != x.chunks:
673 raise ValueError("Chunks of input array x and weights must match.")
674
AttributeError: 'numpy.ndarray' object has no attribute 'chunks'
I thought that when dask arrays are created, the library automatically assigns them chunks, so the error does not say much to me. How can I fix this?
I made a script that does it in numpy with map.
idx_np = np.random.randint(0, 1024, 1000)
weight_np = np.random.random((1000,2))
f = lambda y: np.bincount(idx_np, weight_np[:,y])
result = map(f, [i for i in range(2)])
np.array(list(result))
array([[0.9885341 , 0.9977873 , 0.24937023, ..., 0.31024526, 1.40754883,
0.87609759],
[1.77406303, 0.84787723, 0.14591474, ..., 0.54584068, 0.38357015,
0.85202672]])
I would like to do the same, but with dask.
There are multiple problems at play.
Weights should be (2, 1000)
You can discover this by trying to write the same function in numpy using apply_along_axis.
idx_np = np.random.random_integers(0, 1024, 1000)
weight_np = np.random.random((2, 1000)) # <- transposed
# This gives the same result as the code you provided
np.apply_along_axis(lambda weight, idx: np.bincount(idx, weight), 1, weight_np, idx_np)
da.apply_along_axis applies the function to numpy arrays
You're getting the error
AttributeError: 'numpy.ndarray' object has no attribute 'chunks'
This suggests that what makes it into the da.bincount method is actually a numpy array. In fact, da.apply_along_axis takes each row of weight and sends it to the function as a numpy array.
Your function should therefore actually be a numpy function:
def bincount(weights, x):
    return np.bincount(x, weights)
However, if you try this, you will still get the same error. I believe that happens for an entirely different reason, though:
Dask doesn't know what the output shape will be and tries to infer it
In the code and/or documentation for apply_along_axis, we can see that Dask tries to infer the output shape and dtype by passing in the array [1] (related question). This is a problem, since bincount cannot just accept such an argument.
What we can do instead is provide shape and dtype to the method so that Dask doesn't have to infer it.
The problem here is that bincount's output shape depends on the maximum value of the input array. Unless you know it beforehand, you will sadly need to compute it. The whole operation therefore won't be fully lazy.
This is the full answer:
import numpy as np
import dask.array as da

idx = da.random.random_integers(0, 1024, 1000)
weight = da.random.random((2, 1000))

def bincount(weights, x):
    return np.bincount(x, weights)

m = idx.max().compute()
# np.bincount's output has length max + 1, so the declared shape is (m + 1,)
da.apply_along_axis(bincount, 1, weight, idx, shape=(m + 1,), dtype=weight.dtype)
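To actually materialize the result, a minimal usage sketch building on the code above:
result = da.apply_along_axis(bincount, 1, weight, idx,
                             shape=(m + 1,), dtype=weight.dtype)
print(result.compute().shape)  # expected: (2, m + 1)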
Appendix: randint vs random_integers
Be careful, because these are subtly different:
randint draws integers from low (inclusive) to high (exclusive)
random_integers draws integers from low (inclusive) to high (inclusive)
Thus you have to call randint with high + 1 to get the same range of values.
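A small sketch of the equivalence (my own illustration, not part of the original answer):
import numpy as np

# Both draw 1000 integers from the same range, 0..1024 inclusive.
a = np.random.random_integers(0, 1024, 1000)  # deprecated in newer numpy
b = np.random.randint(0, 1024 + 1, 1000)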
I created a ResNet18 to detect whether 2 individuals are siblings or not, by giving it an image of each one (the model has input_size = 2).
I need to create my dataset, in which I will specify which pair are siblings or not.
I tried:
training_set = train_datagen.flow_from_directory('training',
                                                 target_size=(28, 28),
                                                 batch_size=32,
                                                 class_mode='binary')
And I got training_set.classes: array([0, 0, 0, 0, 1, 1, 1, 1])
For training_set.filenames:
'false\\false1\\_DSC5763.jpg',
'false\\false2\\_DSC5751.jpg',
'false\\false2\\_DSC5760.jpg',
'siblings\\siblings1\\_DSC5751.jpg',
'siblings\\siblings1\\_DSC5755_1.jpg',
'siblings\\siblings2\\_DSC5760.jpg',
'siblings\\siblings2\\_DSC5763.jpg'
The training_set.classes should be array([0, 0, 1, 1]), for my purposes.
How can I do this?
I finished my project and thought I would come back to post the answer I found. I am trying to classify whether 2 individuals are siblings or not.
import pandas as pd

# Lists used for creating the dataset
categories = []
first_img = []
second_img = []

# Parsing through the images and building the two lists
for filename in filenames:
    category = filename.split('.')[0]
    # Each pair is named <sibling/false>+<nr_of_pair>+0/1
    if 'sibling' in category:
        if filename.split('_')[1][0] == '0':
            first_img.append(filename)
            categories.append(1)
        else:
            second_img.append(filename)
    else:
        if filename.split('_')[1][0] == '0':
            first_img.append(filename)
            categories.append(0)
        else:
            second_img.append(filename)

# Dataset of the first individual of each pair and its label
df1 = pd.DataFrame({
    'filename': first_img,
    'category': categories
}).astype('str')

# Dataset of the second individual of each pair and its label
df2 = pd.DataFrame({
    'filename': second_img,
    'category': categories
}).astype('str')
And for fit_generator I used this function:
def generate_generator_multiple(datagen):
    train_generator1 = datagen.flow_from_dataframe(df1,
                                                   "../train/input/",
                                                   x_col='filename',
                                                   y_col='category',
                                                   class_mode='binary',
                                                   target_size=(image_size1, image_size2),
                                                   batch_size=batch_size)
    train_generator2 = datagen.flow_from_dataframe(df2,
                                                   "../train/input/",
                                                   x_col='filename',
                                                   y_col='category',
                                                   class_mode='binary',
                                                   target_size=(image_size1, image_size2),
                                                   batch_size=batch_size)
    while True:
        X1i = train_generator1.next()
        X2i = train_generator2.next()
        yield [X1i[0], X2i[0]], X2i[1]  # Yield both images and their mutual label
datagen is an ImageDataGenerator object.
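For completeness, a hedged sketch of how this generator could be wired into training; the model variable, the rescaling, and the epoch count are my assumptions, not part of the original answer:
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1. / 255)
train_gen = generate_generator_multiple(datagen)

# model is assumed to be the two-input sibling network described above
model.fit_generator(train_gen,
                    steps_per_epoch=len(df1) // batch_size,
                    epochs=10)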
I've been trying to use TensorFlow's tf.estimator, but I'm getting the following errors regarding the shape of the input/output data.
ValueError: Dimension size must be evenly divisible by 9 but is 12 for
'linear/linear_model/x/Reshape' (op: 'Reshape') with input shapes:
[4,3], [2] and with input tensors computed as partial shapes: input[1]
= [?,9].
Here is the code:
import numpy as np
import tensorflow as tf

data_size = 3
iterations = 10
learn_rate = 0.005

# generate test data
input = np.random.rand(data_size, 3)
output = np.dot(input, [2, 3, 7]) + 4
output = np.transpose([output])

feature_columns = [tf.feature_column.numeric_column("x", shape=(data_size, 3))]
estimator = tf.estimator.LinearRegressor(feature_columns=feature_columns)

input_fn = tf.estimator.inputs.numpy_input_fn({"x": input}, output, batch_size=4, num_epochs=None, shuffle=True)
estimator.train(input_fn=input_fn, steps=iterations)
The input data has shape (3, 3):
[[ 0.06525168 0.3171153 0.61675511]
[ 0.35166298 0.71816544 0.62770994]
[ 0.77846666 0.20930611 0.1710842 ]]
The output data has shape (3, 1):
[[ 9.399135 ]
[ 11.25179188]
[ 7.38244104]]
I sense it is related to the input data, output data and batch_size, because when the input data is changed to 1 row it works. When the input data row count equals batch_size (data_size = 10 and batch_size = 10), it throws another error:
ValueError: Shapes (1, 1) and (10, 1) are incompatible
Any help with the errors would be much appreciated.
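For what it's worth, a hedged sketch of one likely culprit: the shape of a numeric_column describes a single example, not the whole dataset, so shape=(data_size, 3) declares that every example carries data_size * 3 = 9 values, while each row of input only has 3. Under that assumption, the column would be declared per-row:
# Each example is one row of 3 features, so the per-example shape is (3,).
feature_columns = [tf.feature_column.numeric_column("x", shape=(3,))]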
I'm trying to get a basic LSTM working in TensorFlow. I'm receiving the following error:
TypeError: 'Tensor' object is not iterable.
The offending line is:
rnn_outputs, final_state = tf.nn.dynamic_rnn(cell, x, sequence_length=seqlen,
                                             initial_state=init_state)
I'm using version 1.0.1 on Windows 7. My inputs and labels have the following shapes:
x_shape = (50, 40, 18), y_shape = (50, 40)
Where:
batch size = 50
sequence length = 40
input vector length at each step = 18
I'm building my graph as follows:
def build_graph(learn_rate, seq_len, state_size=32, batch_size=5):
    # use a fixed sequence length
    seqlen = tf.constant(seq_len, shape=[batch_size], dtype=tf.int32)

    # Placeholders
    x = tf.placeholder(tf.float32, [batch_size, None, 18])
    y = tf.placeholder(tf.float32, [batch_size, None])
    keep_prob = tf.constant(1.0)

    # RNN
    cell = tf.contrib.rnn.LSTMCell(state_size)
    init_state = tf.get_variable('init_state', [1, state_size],
                                 initializer=tf.constant_initializer(0.0))
    init_state = tf.tile(init_state, [batch_size, 1])
    rnn_outputs, final_state = tf.nn.dynamic_rnn(cell, x, sequence_length=seqlen,
                                                 initial_state=init_state)

    # Add dropout, as the model otherwise quickly overfits
    rnn_outputs = tf.nn.dropout(rnn_outputs, keep_prob)

    # Prediction layer
    with tf.variable_scope('prediction'):
        W = tf.get_variable('W', [state_size, num_classes])
        b = tf.get_variable('b', [num_classes], initializer=tf.constant_initializer(0.0))
        preds = tf.tanh(tf.matmul(rnn_outputs, W) + b)

    # MSE
    loss = tf.square(tf.subtract(y, preds))
    # loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits, y))
    train_step = tf.train.AdamOptimizer(learn_rate).minimize(loss)
Can anyone tell me what I am missing?
Sequence length should be an iterable, e.g. a list or tensor, not a scalar. In your case specifically, you need to replace sequence length = 40 with a list of the lengths of each input. For instance, if your first sequence has 10 steps, the second 13 and the third 18, you would pass in [10, 13, 18]. This lets TensorFlow's dynamic RNN know how many steps to unroll for (I believe it uses a while loop internally).
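A minimal sketch of that suggestion, reusing the names from the question and assuming every sequence in the batch has the same fixed length:
# One length per sequence in the batch; with a fixed length this is just
# the same value repeated batch_size times.
seq_lengths = [seq_len] * batch_size
rnn_outputs, final_state = tf.nn.dynamic_rnn(cell, x,
                                             sequence_length=seq_lengths,
                                             initial_state=init_state)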
I am building a CNN to fit my own data, based on this example.
Basically, my data has 3640 features; I have a convolution layer followed by a pooling layer that pools every other feature, so I end up with dimensions (?, 1, 1819, 1), because 3638 features after the conv layer / 2 == 1819.
When I try to reshape my data after pooling to get it in the form [n_samples, n_features]:
print("pool_shape", pool_shape) #pool (?, 1, 1819, 10)
print("y_shape", y_shape) #y (?,)
pool.set_shape([pool_shape[0], pool_shape[2]*pool_shape[3]])
y.set_shape([y_shape[0], 1])
I get an error:
ValueError: Shapes (?, 1, 1819, 10) and (?, 18190) are not compatible
My code:
N_FEATURES = 140*26
N_FILTERS = 1
WINDOW_SIZE = 3

def my_conv_model(x, y):
    x = tf.cast(x, tf.float32)
    y = tf.cast(y, tf.float32)
    print("x ", x.get_shape())
    print("y ", y.get_shape())

    # to form a 4d tensor of shape batch_size x 1 x N_FEATURES x 1
    x = tf.reshape(x, [-1, 1, N_FEATURES, 1])

    # this will give you a sliding window of 1 x WINDOW_SIZE convolution.
    features = tf.contrib.layers.convolution2d(inputs=x,
                                               num_outputs=N_FILTERS,
                                               kernel_size=[1, WINDOW_SIZE],
                                               padding='VALID')
    print("features ", features.get_shape())  # features (?, 1, 3638, 10)

    # Max pooling across output of Convolution+Relu.
    pool = tf.nn.max_pool(features, ksize=[1, 1, 2, 1],
                          strides=[1, 1, 2, 1], padding='SAME')

    pool_shape = pool.get_shape()
    y_shape = y.get_shape()
    print("pool_shape", pool_shape)  # pool (?, 1, 1819, 10)
    print("y_shape", y_shape)        # y (?,)

    ### here comes the error ###
    pool.set_shape([pool_shape[0], pool_shape[2]*pool_shape[3]])
    y.set_shape([y_shape[0], 1])

    pool_shape = pool.get_shape()
    y_shape = y.get_shape()
    print("pool_shape", pool_shape)
    print("y_shape", y_shape)

    prediction, loss = learn.models.logistic_regression(pool, y)
    return prediction, loss
How do I reshape the data to get a meaningful representation of it, and later pass it to a logistic regression layer?
This looks like a confusion between the Tensor.set_shape() method and the tf.reshape() operator. In this case, you should use tf.reshape() because you are changing the shape of the pool and y tensors:
The tf.reshape(tensor, shape) operator takes a tensor of any shape, and returns a tensor with the given shape, as long as they have the same number of elements. This operator should be used to change the shape of the input tensor.
The tensor.set_shape(shape) method takes a tensor that might have a partially known or unknown shape, and asserts to TensorFlow that it actually has the given shape. This method should be used to provide more information about the shape of a particular tensor.
It can be used, e.g., when you take the output of an operator that has a data-dependent output shape (such as tf.image.decode_jpeg()) and assert that it has a static shape (e.g. based on knowledge about the sizes of images in your dataset).
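For instance, a small sketch of that pattern (the file name is a placeholder):
# decode_jpeg has a data-dependent output shape: (?, ?, 3)
image = tf.image.decode_jpeg(tf.read_file("photo.jpg"), channels=3)
print(image.get_shape())   # (?, ?, 3)

# If we know every image in the dataset is 128x128, we can assert it:
image.set_shape([128, 128, 3])
print(image.get_shape())   # (128, 128, 3)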
In your program, you should replace the calls to set_shape() with something like the following:
pool_shape = tf.shape(pool)
pool = tf.reshape(pool, [pool_shape[0], pool_shape[2] * pool_shape[3]])
y_shape = tf.shape(y)
y = tf.reshape(y, [y_shape[0], 1])
# Or, more straightforwardly:
y = tf.expand_dims(y, 1)
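As a usage note, either form yields a 2-D tensor of shape [batch_size, 1] for y, but tf.expand_dims(y, 1) is the more idiomatic choice here because it adds the trailing dimension without computing the dynamic shape first.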