RandomOverSampler conditions to create equal distribution - machine-learning

I am currently working on an ML-based project. There is a slight imbalance in my data, so I need an over-sampling technique. The features (X_train) have shape (90664, 190) and the target (Y_binary_train_trans) has shape (90664,). However, the code runs and still outputs the same, unequal distribution of the target.
Here is the code used for RandomOverSampler; I have tried SMOTE as well:
counter = Counter(Y_binary_train_trans)
ros = RandomOverSampler(random_state=42)
X_train, Y_binary_train_trans = ros.fit_resample(X_train, Y_binary_train_trans)
counter = Counter(Y_binary_test_trans)

counter = Counter(Y_binary_train_trans)
ros = RandomOverSampler(random_state=42)
X_train, Y_binary_train_trans = ros.fit_resample(X_train, Y_binary_train_trans)
counter = Counter(Y_binary_test_trans)
In this code, your second counter counts the test samples, not the training samples that you actually resampled!
Instead it should be:
counter = Counter(Y_binary_train_trans)
ros = RandomOverSampler(random_state=42)
X_train, Y_binary_train_trans = ros.fit_resample(X_train, Y_binary_train_trans)
counter = Counter(Y_binary_train_trans)
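For completeness, here is a minimal self-contained sketch of the corrected flow, with the imports the snippets above assume; the toy arrays are made up purely for illustration and stand in for the real X_train / Y_binary_train_trans:

from collections import Counter

import numpy as np
from imblearn.over_sampling import RandomOverSampler

# Toy imbalanced data standing in for the real training set
X_train = np.random.rand(1000, 190)
Y_binary_train_trans = np.array([0] * 900 + [1] * 100)

print("before:", Counter(Y_binary_train_trans))  # Counter({0: 900, 1: 100})

ros = RandomOverSampler(random_state=42)  # SMOTE(random_state=42) exposes the same interface
X_train, Y_binary_train_trans = ros.fit_resample(X_train, Y_binary_train_trans)

# Count the training target that was actually resampled, not the test target
print("after:", Counter(Y_binary_train_trans))  # Counter({0: 900, 1: 900})

If the printed counts are still unequal, the counter is almost certainly being taken on a different variable than the one returned by fit_resample.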

Related

LSTM sequence prediction overfits on one specific value only

Hello, I am new to machine learning. I am implementing federated learning with an LSTM to predict the next label in a sequence. My sequences look like this: [2,3,5,1,4,2,5,7]; for example, the intention is to predict the 7 in this sequence. So I tried simple federated learning with Keras. I used this approach for another model (not an LSTM) and it worked for me, but here it always overfits on 2: it predicts 2 for any input. I made the input data balanced, meaning there are almost equal numbers of each label in the last index (here 7). I tested this data with a simple deep learning model and it works great, so it seems to me this data may not be suitable for an LSTM, or there is some other issue. Please help me. This is the code for my federated learning. Please let me know if more information is needed; I really need it. Thanks.
def get_lstm(units):
    """LSTM (Long Short-Term Memory).

    Build LSTM model.
    # Arguments
        units: List(int), number of input, output and hidden units.
    # Returns
        model: Model, nn model.
    """
    inp = layers.Input((units[0], 1))
    x = layers.LSTM(units[1], return_sequences=True)(inp)
    x = layers.LSTM(units[2])(x)
    x = layers.Dropout(0.2)(x)
    out = layers.Dense(units[3], activation='softmax')(x)
    model = Model(inp, out)
    return model


optimizer = keras.optimizers.Adam(lr=0.01)
seqLen = 8 - 1
# 15 output units: we have 14 categories; arrays start from 0 but class 0 is never predicted
global_model = Mymodel.get_lstm([seqLen, 64, 64, 15])
global_model.compile(loss="sparse_categorical_crossentropy",
                     optimizer=optimizer,
                     metrics=[tf.keras.metrics.SparseTopKCategoricalAccuracy(k=1)])


def main(argv):
    for comm_round in range(comms_round):
        print("round_%d" % comm_round)
        scaled_local_weight_list = list()
        global_weights = global_model.get_weights()
        np.random.shuffle(train)
        temp_data = train[:]
        # data divided among ten users and shuffled
        for user in range(10):
            user_data = temp_data[user * userDataSize: (user + 1) * userDataSize]
            X_train = user_data[:, 0:seqLen]
            X_train = np.asarray(X_train).astype(np.float32)
            Y_train = user_data[:, seqLen]
            Y_train = np.asarray(Y_train).astype(np.float32)
            local_model = Mymodel.get_lstm([seqLen, 64, 64, 15])
            X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
            local_model.compile(loss="sparse_categorical_crossentropy",
                                optimizer=optimizer,
                                metrics=[tf.keras.metrics.SparseTopKCategoricalAccuracy(k=1)])
            local_model.set_weights(global_weights)
            local_model.fit(X_train, Y_train)
            scaling_factor = 1 / 10  # 10 is the number of users
            scaled_weights = scale_model_weights(local_model.get_weights(), scaling_factor)
            scaled_local_weight_list.append(scaled_weights)
            K.clear_session()
        average_weights = sum_scaled_weights(scaled_local_weight_list)
        global_model.set_weights(average_weights)

    predictions = global_model.predict(X_test)
    for i in range(len(X_test)):
        print('%d,%d' % (np.argmax(predictions[i]), Y_test[i]), file=f2)
I could find some reasons for my problem, so I thought I would share them with you:
1. The proportions of the different items in the sequences are not balanced. For example, I have 1000 of "2" and 100 of the other numbers, so after a few rounds the model fits to 2 because there is much more data for that specific number (see the sketch after this list for a quick check).
2. I changed my sequences so that no two items in a sequence have the same value. That let me remove some repetitive data from the sequences and make them more balanced. It may not be the whole representation of the activities, but in my case it makes sense.
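As a quick check of the first point, you can inspect the label distribution before training and, if you keep the imbalanced sequences, pass class weights to fit(). This is only a sketch of that idea and assumes `train` is the same 2-D array as in the code above (features in the first seqLen columns, label in the last):

from collections import Counter

# The label to predict is the last column of each row
labels = train[:, seqLen]
counts = Counter(labels)
print(counts)  # shows whether one label (e.g. 2) dominates the data

# Optional: derive per-class weights so the dominant label does not swamp training
total = sum(counts.values())
class_weight = {int(label): total / (len(counts) * count)
                for label, count in counts.items()}
# local_model.fit(X_train, Y_train, class_weight=class_weight)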

Issue with multilabel classification

I followed this tutorial: https://medium.com/@vijayabhaskar96/multi-label-image-classification-tutorial-with-keras-imagedatagenerator-cd541f8eaf24 and wrote some of my code for multilabel classification. I had it working with one-hot encoding on a small scale, but I had to move to option 2 mentioned in the article because I have 6000 classes and one-hot encoding was therefore not viable. I managed to train the network and it reported 99% accuracy and an 83% F1 score. However, when I try to test the network, for every image it outputs some combination of only 3 labels out of the 6000 possible labels. I wondered whether the code to test the model was incorrect. I tried using the code mentioned in the post and it doesn't work:
test_generator.reset()
pred = model.predict_generator(test_generator, steps=STEP_SIZE_TEST, verbose=1)
pred_bool = (pred > 0.5)
unorderable types: list() > float()
I've tried hard to fix this but haven't figured it out, and I can't find any examples online of anyone doing something similar. Does anyone have an idea of how to get this prediction part working with this code block (with the other two options I tried, I got the issue of only one or a few labels being printed), or why the model might be failing in training with this behavior?
EDIT: for more context on the training issue, here is all the training code:
import json

input_file = open('class_names_6000.json')
json_array = json.load(input_file)
# print(str(json_array))

args = parser.parse_args()
gpu_options = tf.GPUOptions(allow_growth=True)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

print('Loading Data...')
df = pd.read_csv('dataset_train.csv')
df["labels"] = df["labels"].apply(lambda x: x.split(","))
datagen = ImageDataGenerator(rescale=1./255.)
test_datagen = ImageDataGenerator(rescale=1./255.)
train_generator = datagen.flow_from_dataframe(
    dataframe=df,
    directory="",
    x_col="Filepaths",
    y_col="labels",
    batch_size=128,
    seed=42,
    shuffle=True,
    class_mode="categorical",
    classes=json_array,
    target_size=(100, 100))

df = pd.read_csv('dataset_test.csv')
df["labels"] = df["labels"].apply(lambda x: x.split(","))
test_generator = test_datagen.flow_from_dataframe(
    dataframe=df,
    directory="",
    x_col="Filepaths",
    y_col="labels",
    batch_size=128,
    seed=42,
    shuffle=True,
    class_mode="categorical",
    classes=json_array,
    target_size=(100, 100))

df = pd.read_csv('dataset_validation.csv')
df["labels"] = df["labels"].apply(lambda x: x.split(","))
valid_generator = test_datagen.flow_from_dataframe(
    dataframe=df,
    directory="",
    x_col="Filepaths",
    y_col="labels",
    batch_size=128,
    seed=42,
    shuffle=True,
    class_mode="categorical",
    classes=json_array,
    target_size=(100, 100))

print('Data Loaded.')
f1_score_callback = ComputeF1()
model = build_model('train', numclasses=len(json_array), model_name=args.model)
ImageFile.LOAD_TRUNCATED_IMAGES = True
Also, an important detail: during training it reports 99% accuracy and an F1 score of 84%, with a validation F1 score of 84% as well.

XGBoost plot_importance cannot show feature names

I used plot_importance to show me the important variables. But some variables are categorical, so I did some transformations. After I transformed the type of the variables, the plot of feature importances no longer shows the feature names. I attached my code and the plot.
dataset = data.values
X = dataset[1:100, 0:-2]
predictors = dataset[1:100, -1]
X = X.astype(str)
encoded_x = None
for i in range(0, X.shape[1]):
    label_encoder = LabelEncoder()
    feature = label_encoder.fit_transform(X[:, i])
    feature = feature.reshape(X.shape[0], 1)
    onehot_encoder = OneHotEncoder(sparse=False)
    feature = onehot_encoder.fit_transform(feature)
    if encoded_x is None:
        encoded_x = feature
    else:
        encoded_x = np.concatenate((encoded_x, feature), axis=1)
print("X shape: : ", encoded_x.shape)
response = 'Default'
# predictors = list(data.columns.values[:-1])
# Randomly split indexes
X_train, X_test, y_train, y_test = train_test_split(encoded_x, predictors, train_size=0.7, random_state=5)
model = XGBClassifier()
model.fit(X_train, y_train)
plot_importance(model)
plt.show()
Plot (feature importances without feature names): https://i.stack.imgur.com/M9qgY.png
This is the expected behaviour: sklearn.OneHotEncoder.transform() returns a 2-D numpy array instead of the input pd.DataFrame (I assume that is the type of your dataset). So it is not a bug, but a feature. It doesn't look like there is a way to pass feature names manually in the sklearn API (it is possible to set those in xgb.DMatrix creation in the native training API).
However, your problem is easily solvable with pd.get_dummies() instead of the LabelEncoder + OneHotEncoder combination you have implemented. I do not know why you chose that combination instead (it can be useful if you also need to handle a test set, but then you need to play some extra tricks), but I would advise in favour of pd.get_dummies().
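A minimal sketch of the pd.get_dummies() route, assuming `data` is the original DataFrame and 'Default' is the target column (as the response='Default' line in the question suggests); adjust the column selection to your actual data:

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, plot_importance

# Casting to str mirrors the X.astype(str) step in the question, so every
# column is dummy-encoded, with readable names of the form "<column>_<value>"
# instead of the anonymous f0, f1, ... labels
features = data.drop(columns=['Default'])
encoded = pd.get_dummies(features.astype(str))
target = data['Default']

X_train, X_test, y_train, y_test = train_test_split(
    encoded, target, train_size=0.7, random_state=5)

model = XGBClassifier()
model.fit(X_train, y_train)  # feature names travel with the DataFrame
plot_importance(model)       # bars are now labelled by feature name
plt.show()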

How to create feature columns for TensorFlow classifier

I have a very simple dataset for binary classification in a CSV file which looks like this:
"feature1","feature2","label"
1,0,1
0,1,0
...
where the "label" column indicates the class (1 is positive, 0 is negative). The number of features is actually pretty big, but it doesn't matter for this question.
Here is how I read the data:
train = pandas.read_csv(TRAINING_FILE)
y_train, X_train = train['label'], train[['feature1', 'feature2']].fillna(0)
test = pandas.read_csv(TEST_FILE)
y_test, X_test = test['label'], test[['feature1', 'feature2']].fillna(0)
I want to run tensorflow.contrib.learn.LinearClassifier and tensorflow.contrib.learn.DNNClassifier on that data. For instance, I initialize DNN like this:
classifier = DNNClassifier(hidden_units=[3, 5, 3],
                           n_classes=2,
                           feature_columns=feature_columns,  # ???
                           activation_fn=nn.relu,
                           enable_centered_bias=False,
                           model_dir=MODEL_DIR_DNN)
So how exactly should I create the feature_columns when all the features are also binary (0 or 1 are the only possible values)?
Here is the model training:
classifier.fit(X_train.values,
               y_train.values,
               batch_size=dnn_batch_size,
               steps=dnn_steps)
A solution that replaces the fit() parameters with an input function would also be great.
Thanks!
P.S. I'm using TensorFlow version 1.0.1
You can directly use tf.feature_column.numeric_column:
feature_columns = [tf.feature_column.numeric_column(key=key) for key in X_train.columns]
I've just found the solution and it's pretty simple:
feature_columns = tf.contrib.learn.infer_real_valued_columns_from_input(X_train)
Apparently infer_real_valued_columns_from_input() works well with categorical variables.
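For the input-function variant asked about at the end of the question, a rough sketch against the tf.contrib.learn API could look like the following. Treat the exact signatures as an assumption, since this is untested against TensorFlow 1.0.1 specifically:

import tensorflow as tf

def make_input_fn(features_df, labels_series):
    """Return an input_fn that feeds the whole pandas frame as constant tensors."""
    def input_fn():
        feature_tensors = {col: tf.constant(features_df[col].values, dtype=tf.float32)
                           for col in features_df.columns}
        label_tensor = tf.constant(labels_series.values)
        return feature_tensors, label_tensor
    return input_fn

# batch_size is dropped here: with input_fn, batching is up to the input_fn itself
classifier.fit(input_fn=make_input_fn(X_train, y_train), steps=dnn_steps)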

What is the meaning of the GridSearchCV best_score_ attribute? (the value is different from the mean of the cross validation array)

I'm confused by the results; probably I'm not getting the concept of cross-validation and GridSearchCV right. I followed the logic behind this post:
https://randomforests.wordpress.com/2014/02/02/basics-of-k-fold-cross-validation-and-gridsearchcv-in-scikit-learn/
argd = CommandLineParser(argv)
folder, fname = argd['dir'], argd['fname']
df = pd.read_csv('../../' + folder + '/Results/' + fname, sep=";")
explanatory_variable_columns = set(df.columns.values)
response_variable_column = df['A']
explanatory_variable_columns.remove('A')
y = np.array([1 if e else 0 for e in response_variable_column])
X = df[list(explanatory_variable_columns)].as_matrix()
kf_total = KFold(len(X), n_folds=5, indices=True, shuffle=True, random_state=4)
dt = DecisionTreeClassifier(criterion='entropy')
min_samples_split_range = [x for x in range(1, 20)]
dtgs = GridSearchCV(estimator=dt, param_grid=dict(min_samples_split=min_samples_split_range), n_jobs=1)
scores = [dtgs.fit(X[train], y[train]).score(X[test], y[test]) for train, test in kf_total]
# SAME AS DOING: cross_validation.cross_val_score(dtgs, X, y, cv=kf_total, n_jobs=1)
print scores
print np.mean(scores)
print dtgs.best_score_
RESULTS OBTAINED:
# score [0.81818181818181823, 0.78181818181818186, 0.7592592592592593, 0.7592592592592593, 0.72222222222222221]
# mean score 0.768
# .best_score_ 0.683486238532
ADDITIONAL NOTE:
I ran it using another combination of the explanatory variables (only some of them) and I got the opposite problem: now the .best_score_ is higher than all the values in the cross-validation array.
# score [0.74545454545454548, 0.70909090909090911, 0.79629629629629628, 0.7407407407407407, 0.64814814814814814]
# mean score 0.728
# .best_score_ 0.802752293578
The code is confusing several things.
dtgs.fit(X[train], y[train]) performs internal 3-fold cross-validation for every parameter combination from param_grid, producing one result per candidate min_samples_split value (19 here), which you can inspect by calling dtgs.grid_scores_.
[dtgs.fit(X[train], y[train]).score(X[test], y[test]) for train, test in kf_total] therefore fits the grid search five times, once per outer fold, and scores each fitted search on the corresponding held-out fold. The result is the array of 5-fold validation scores.
And when you call dtgs.best_score_, you get the best mean score from the internal 3-fold validation of the hyperparameters for the last of those five fits.
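To see where best_score_ comes from, you can inspect the internal grid after the last fit. A short sketch against the old scikit-learn API used in the question (grid_scores_ was later replaced by cv_results_):

# dtgs was last fitted on the final outer training fold, so these numbers
# describe the internal 3-fold search on that fold only
for params, mean_score, cv_scores in dtgs.grid_scores_:
    print(params, mean_score, cv_scores)

print(dtgs.best_params_)  # parameter setting that won the internal search
print(dtgs.best_score_)   # its mean 3-fold validation score, not the outer 5-fold score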
