Accuracy in logistic regression - machine-learning

Here is slightly modified code that I found here...
I am using the same logic as the original author and still not getting good accuracy. The Mean Reciprocal Rank is close (mine: 52.79, example: 48.04)
cv = CountVectorizer(binary=True, max_df=0.95)
feature_set = cv.fit_transform(df["short_description"])
X_train, X_test, y_train, y_test = train_test_split(
feature_set, df["category"].values, random_state=2000)
scikit_log_reg = LogisticRegression(
verbose=1, solver="liblinear", random_state=0, C=5, penalty="l2", max_iter=1000)
model = scikit_log_reg.fit(X_train, y_train)
target = to_categorical(y_test)
y_pred = model.predict_proba(X_test)
label_ranking_average_precision_score(target, y_pred)
>> 0.5279108613021547
model.score(X_test, y_test)
>> 0.38620071684587814
But the accuracy of notebook sample (59.80) does not match with my code (38.62)
Is the following function used in the sample notebook correctly returning accuracy?
def compute_accuracy(eval_items:list):
correct=0
total=0
for item in eval_items:
true_pred=item[0]
machine_pred=set(item[1])
for cat in true_pred:
if cat in machine_pred:
correct+=1
break
accuracy=correct/float(len(eval_items))
return accuracy

The notebook code is checking whether the actual category is in the top 3 returned from the model:
def get_top_k_predictions(model, X_test, k):
probs = model.predict_proba(X_test)
best_n = np.argsort(probs, axis=1)[:, -k:]
preds=[[model.classes_[predicted_cat] for predicted_cat in prediction] for prediction in best_n]
preds=[item[::-1] for item in preds]
return preds
If you replace the evaluation part of your code with the below, you'll see that your model returns a top-3 accuracy of 0.5980 as well:
...
model = scikit_log_reg.fit(X_train, y_train)
top_preds = get_top_k_predictions(model, X_test, 3)
pred_pairs = list(zip([[v] for v in y_test], top_preds))
print(compute_accuracy(pred_pairs))
# below is a simpler & more Pythonic version of compute_accuracy
print(np.mean([actual in pred for actual, pred in zip(y_test, top_preds)]))

Related

Input/Target size mismatch when training a downstream BERT for classification (huggingface pretrained)

I am training a BERT model with a downstream task to classify movie genres. I am using HuggingFace pretrained model (aleph-bert since data is in Hebrew)
When training, I get the following error:
ValueError: Expected input batch_size (3744) to match target batch_size (16).
This is my notebook:
https://colab.research.google.com/drive/1mqIUPnLOu_H-URn5tzE6gGySsW3oAcRY?usp=sharing
The error happens in the compute_loss functions, while performing the cross_entropy step.
My batch size is 16 but for some reason the bert output returns a different size.
The relevant code:
def data_prep_for_genre(genre):
X = movies_df['overview']
y = movies_df[genre].rename('labels', inplace=True).astype(float)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train = tokenizer(X_train.to_list(), truncation=True)
X_test = tokenizer(X_test.to_list(), truncation=True)
train_dataset = TextData(X_train, y_train.to_list())
test_dataset = TextData(X_test, y_test.to_list())
# define model:
model = BertForTokenClassification.from_pretrained("onlplab/alephbert-base", num_labels=2)
return model, train_dataset, test_dataset
class MyTrainer(Trainer):
def compute_metrics(pred):
labels = pred.label_ids
preds = pred.predictions.argmax(-1)
precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
acc = accuracy_score(labels, preds)
return {
'accuracy': acc,
'f1': f1,
'precision': precision,
'recall': recall
}
training_args = TrainingArguments(
output_dir='./results',
num_train_epochs=10,
per_device_train_batch_size=16,
per_device_eval_batch_size=32,
warmup_steps=50,
weight_decay=0.01,
logging_dir='./logs',
logging_steps=10
)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
for genre in GENRE_SET:
model, train_dataset, test_dataset = data_prep_for_genre(genre)
trainer = MyTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
# eval_dataset=test_dataset,
data_collator=data_collator
)
trainer.train()

making predictions using classification models with multiple independent variables in hand

I am trying to make a simple classification using Logistic Regression. I fit the model and scale the values using a standard scaler. how can I make a single prediction after that? I am getting the same result for different values. For every value, I am getting 0. the prediction I am getting from single inputs does not resemble with the result from the prediction made by the testing dataset. Can someone please give me a hand?
dataset = pd.read_csv("Social_Network_Ads.csv")
x = dataset.iloc[:, 2:4].values
y = dataset.iloc[:, 4].values
print(dataset)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
classifier = LogisticRegression()
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)
x_values = [36, 36000]
x_values = np.array(x_values).reshape(1, -1)
x_values = scaler.transform(x_values)
pred = classifier.predict(x_values)
print("single prediction: ", pred)

sklearn and weka kNN predictions exactly same for all except for one data point

I wrote a code for kNN using sklearn and then compared the predictions using the WEKA kNN. The comparison was done using the 10 test set predictions, out of which, only a single one is showing a high difference of >1.5 but all others are exactly the same. So, I am not sure about if my code is working fine or not. Here is my code:
df = pd.read_csv('xxxx.csv')
X = df.drop(['Name', 'activity'], axis=1)
y = df['activity']
Xstd = StandardScaler().fit_transform(X)
x_train, x_test, y_train, y_test = train_test_split(Xstd, y, test_size=0.2,
shuffle=False, random_state=None)
print(x_train.shape, x_test.shape)
X_train_trans = x_train
X_test_trans = x_test
for i in range(2, 3):
knn_regressor = KNeighborsRegressor(n_neighbors=i, algorithm='brute',
weights='uniform', metric='euclidean', n_jobs=1, p=2)
CV_pred_train = cross_val_predict(knn_regressor, X_train_trans, y_train,
n_jobs=-1, verbose=0, cv=LeaveOneOut())
print("LOO Q2: ", metrics.r2_score(y_train, CV_pred_train).round(2))
# Train Test predictions
knn_regressor.fit(X_train_trans, y_train)
train_r2 = knn_regressor.score(X_train_trans, y_train)
y_train_pred = knn_regressor.predict(X_train_trans).round(3)
train_r2_1 = metrics.r2_score(y_train, y_train_pred)
y_test_pred = knn_regressor.predict(X_test_trans).round(3)
train_r = stats.pearsonr(y_train, y_train_pred)
abs_error_train = (y_train - y_train_pred)
train_predictions = pd.DataFrame({'Actual': y_train, 'Predcited':
y_train_pred, "error": abs_error_train.round(3)})
MAE_train = metrics.mean_absolute_error(y_train, y_train_pred)
abs_error_test = (y_test_pred - y_test)
test_predictions = pd.DataFrame({'Actual': y_test, 'predcited':
y_test_pred, 'error': abs_error_test.round(3)})
test_r = stats.pearsonr(y_test, y_test_pred)
test_r2 = metrics.r2_score(y_test, y_test_pred)
MAE_test = metrics.mean_absolute_error(y_test, y_test_pred).round(3)
print(test_predictions)
The train set statistics are almost same in both sklearn and WEKA kNN.
the sklearn predictions are:
Actual predcited error
6.00 5.285 -0.715
5.44 5.135 -0.305
6.92 6.995 0.075
7.28 7.005 -0.275
5.96 6.440 0.480
7.96 7.150 -0.810
7.30 6.660 -0.640
6.68 7.200 0.520
***4.60 6.950 2.350***
and the weka predictions are:
actual predicted error
6 5.285 -0.715
5.44 5.135 -0.305
6.92 6.995 0.075
7.28 7.005 -0.275
5.96 6.44 0.48
7.96 7.15 -0.81
7.3 6.66 -0.64
6.68 7.2 0.52
***4.6 5.285 0.685***
parameters used in both algorithms are: k =2, brute force for distance calculation, metric: euclidean.
Any suggestions for the difference?

Size Mismatch using pytorch when trying to train data

I am really new to pytorch and just trying to use my own dataset to do a simple Linear Regression Model. I am only using the numbers values as inputs, too.
I have imported the data from the CSV
dataset = pd.read_csv('mlb_games_overview.csv')
I have split the data into four parts X_train, X_test, y_train, y_test
X = dataset.drop(['date', 'team', 'runs', 'win'], 1)
y = dataset['win']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=True)
I have converted the data to pytorch tensors
X_train = torch.from_numpy(np.array(X_train))
X_test = torch.from_numpy(np.array(X_test))
y_train = torch.from_numpy(np.array(y_train))
y_test = torch.from_numpy(np.array(y_test))
I have created a LinearRegressionModel
class LinearRegressionModel(torch.nn.Module):
def __init__(self):
super(LinearRegressionModel, self).__init__()
self.linear = torch.nn.Linear(1, 1)
def forward(self, x):
y_pred = self.linear(x)
return y_pred
I have initialized the optimizer and the loss function
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
Now when I start to train the data I get the runtime error mismatch
EPOCHS = 500
for epoch in range(EPOCHS):
pred_y = model(X_train) # RUNTIME ERROR HERE
loss = criterion(pred_y, y_train)
optimizer.zero_grad() # zero out gradients to update parameters correctly
loss.backward() # backpropagation
optimizer.step() # update weights
print('epoch {}, loss {}'. format(epoch, loss.data[0]))
Error Log:
RuntimeError Traceback (most recent call last)
<ipython-input-40-c0474231d515> in <module>
1 EPOCHS = 500
2 for epoch in range(EPOCHS):
----> 3 pred_y = model(X_train)
4 loss = criterion(pred_y, y_train)
5 optimizer.zero_grad() # zero out gradients to update parameters correctly
RuntimeError: size mismatch, m1: [3540 x 8], m2: [1 x 1] at
C:\w\1\s\windows\pytorch\aten\src\TH/generic/THTensorMath.cpp:752
In your Linear Regression model, you have:
self.linear = torch.nn.Linear(1, 1)
But your training data (X_train) shape is 3540 x 8 which means you have 8 features representing each input example. So, you should define the linear layer as follows.
self.linear = torch.nn.Linear(8, 1)
A linear layer in PyTorch has parameters, W and b. If you set the in_features to 8 and out_features to 1, then the shape of the W matrix will be 1 x 8 and the length of b vector will be 1.
Since your training data shape is 3540 x 8, you can perform the following operation.
linear_out = X_train W_T + b
I hope it clarifies your confusion.

How to split data on balanced training set and test set on sklearn

I am using sklearn for multi-classification task. I need to split alldata into train_set and test_set. I want to take randomly the same sample number from each class.
Actually, I amusing this function
X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0)
but it gives unbalanced dataset! Any suggestion.
Although Christian's suggestion is correct, technically train_test_split should give you stratified results by using the stratify param.
So you could do:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0, stratify=Target)
The trick here is that it starts from version 0.17 in sklearn.
From the documentation about the parameter stratify:
stratify : array-like or None (default is None)
If not None, data is split in a stratified fashion, using this as the labels array.
New in version 0.17: stratify splitting
You can use StratifiedShuffleSplit to create datasets featuring the same percentage of classes as the original one:
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
X = np.array([[1, 3], [3, 7], [2, 4], [4, 8]])
y = np.array([0, 1, 0, 1])
stratSplit = StratifiedShuffleSplit(y, n_iter=1, test_size=0.5, random_state=42)
for train_idx, test_idx in stratSplit:
X_train=X[train_idx]
y_train=y[train_idx]
print(X_train)
# [[3 7]
# [2 4]]
print(y_train)
# [1 0]
If the classes are not balanced but you want the split to be balanced, then stratifying isn't going to help. There doesn't seem to be a method for doing balanced sampling in sklearn but it's kind of easy using basic numpy, for example a function like this might help you:
def split_balanced(data, target, test_size=0.2):
classes = np.unique(target)
# can give test_size as fraction of input data size of number of samples
if test_size<1:
n_test = np.round(len(target)*test_size)
else:
n_test = test_size
n_train = max(0,len(target)-n_test)
n_train_per_class = max(1,int(np.floor(n_train/len(classes))))
n_test_per_class = max(1,int(np.floor(n_test/len(classes))))
ixs = []
for cl in classes:
if (n_train_per_class+n_test_per_class) > np.sum(target==cl):
# if data has too few samples for this class, do upsampling
# split the data to training and testing before sampling so data points won't be
# shared among training and test data
splitix = int(np.ceil(n_train_per_class/(n_train_per_class+n_test_per_class)*np.sum(target==cl)))
ixs.append(np.r_[np.random.choice(np.nonzero(target==cl)[0][:splitix], n_train_per_class),
np.random.choice(np.nonzero(target==cl)[0][splitix:], n_test_per_class)])
else:
ixs.append(np.random.choice(np.nonzero(target==cl)[0], n_train_per_class+n_test_per_class,
replace=False))
# take same num of samples from all classes
ix_train = np.concatenate([x[:n_train_per_class] for x in ixs])
ix_test = np.concatenate([x[n_train_per_class:(n_train_per_class+n_test_per_class)] for x in ixs])
X_train = data[ix_train,:]
X_test = data[ix_test,:]
y_train = target[ix_train]
y_test = target[ix_test]
return X_train, X_test, y_train, y_test
Note that if you use this and sample more points per class than in the input data, then those will be upsampled (sample with replacement). As a result, some data points will appear multiple times and this may have an effect on the accuracy measures etc. And if some class has only one data point, there will be an error. You can easily check the numbers of points per class for example with np.unique(target, return_counts=True)
Another approach is to over- or under- sample from your stratified test/train split. The imbalanced-learn library is quite handy for this, specially useful if you are doing online learning & want to guarantee balanced train data within your pipelines.
from imblearn.pipeline import Pipeline as ImbalancePipeline
model = ImbalancePipeline(steps=[
('data_balancer', RandomOverSampler()),
('classifier', SVC()),
])
This is my implementation that I use to get train/test data indexes
def get_safe_balanced_split(target, trainSize=0.8, getTestIndexes=True, shuffle=False, seed=None):
classes, counts = np.unique(target, return_counts=True)
nPerClass = float(len(target))*float(trainSize)/float(len(classes))
if nPerClass > np.min(counts):
print("Insufficient data to produce a balanced training data split.")
print("Classes found %s"%classes)
print("Classes count %s"%counts)
ts = float(trainSize*np.min(counts)*len(classes)) / float(len(target))
print("trainSize is reset from %s to %s"%(trainSize, ts))
trainSize = ts
nPerClass = float(len(target))*float(trainSize)/float(len(classes))
# get number of classes
nPerClass = int(nPerClass)
print("Data splitting on %i classes and returning %i per class"%(len(classes),nPerClass ))
# get indexes
trainIndexes = []
for c in classes:
if seed is not None:
np.random.seed(seed)
cIdxs = np.where(target==c)[0]
cIdxs = np.random.choice(cIdxs, nPerClass, replace=False)
trainIndexes.extend(cIdxs)
# get test indexes
testIndexes = None
if getTestIndexes:
testIndexes = list(set(range(len(target))) - set(trainIndexes))
# shuffle
if shuffle:
trainIndexes = random.shuffle(trainIndexes)
if testIndexes is not None:
testIndexes = random.shuffle(testIndexes)
# return indexes
return trainIndexes, testIndexes
This is the function I am using. You can adapt it and optimize it.
# Returns a Test dataset that contains an equal amounts of each class
# y should contain only two classes 0 and 1
def TrainSplitEqualBinary(X, y, samples_n): #samples_n per class
indicesClass1 = []
indicesClass2 = []
for i in range(0, len(y)):
if y[i] == 0 and len(indicesClass1) < samples_n:
indicesClass1.append(i)
elif y[i] == 1 and len(indicesClass2) < samples_n:
indicesClass2.append(i)
if len(indicesClass1) == samples_n and len(indicesClass2) == samples_n:
break
X_test_class1 = X[indicesClass1]
X_test_class2 = X[indicesClass2]
X_test = np.concatenate((X_test_class1,X_test_class2), axis=0)
#remove x_test from X
X_train = np.delete(X, indicesClass1 + indicesClass2, axis=0)
Y_test_class1 = y[indicesClass1]
Y_test_class2 = y[indicesClass2]
y_test = np.concatenate((Y_test_class1,Y_test_class2), axis=0)
#remove y_test from y
y_train = np.delete(y, indicesClass1 + indicesClass2, axis=0)
if (X_test.shape[0] != 2 * samples_n or y_test.shape[0] != 2 * samples_n):
raise Exception("Problem with split 1!")
if (X_train.shape[0] + X_test.shape[0] != X.shape[0] or y_train.shape[0] + y_test.shape[0] != y.shape[0]):
raise Exception("Problem with split 2!")
return X_train, X_test, y_train, y_test

Resources