Model validation with separate dataset - machine-learning

Apologies, quite new to sklearn. I'm trying to validate a model using an external dataset for binary classification of text strings. I've trained the model but want to use it against another dataset of a different size for prediction rather than include the data in the initial dataset split. Is this even possible?
Initial split
vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char", sublinear_tf=True, ngram_range=(3, 3))
Xprod = vectorizer.fit_transform(prod_good)
X = vectorizer.fit_transform(total_requests)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=21)
Test the model
linear_svm=LinearSVC(C=1)
linear_svm.fit(X_train, y_train)
y_pred = linear_svm.predict(X_test)
score_test = metrics.accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)
New prediction
newpred = linear_svm.predict(Xprod)
...
Error:
ValueError: X has 4553 features per sample; expecting 24422
Think I'm misunderstanding some basic concepts here

fit_transform does two things: it fits the vectorizer and then transforms the data. So this line fits your vectorizer on total_requests and transforms total_requests into X:
X = vectorizer.fit_transform(total_requests)
The vectorizer must be fitted only once, so that it produces the same feature columns every time you use it. To compute Xprod, call transform only:
Xprod = vectorizer.transform(prod_good)
Also, Xprod has to be computed after the vectorizer is fitted, so compute Xprod after X.
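A minimal end-to-end sketch of that order of operations. The request strings and labels here are made-up placeholders, not the asker's data; only the fit-once, transform-later pattern matters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for total_requests / y / prod_good.
train_texts = [
    "GET /index.html",
    "GET /about.html",
    "GET /login?user=admin' OR 1=1",
    "GET /search?q=<script>alert(1)</script>",
]
train_labels = [0, 0, 1, 1]
external_texts = ["GET /contact.html", "GET /item?id=1 OR 1=1", "GET /home"]

vectorizer = TfidfVectorizer(analyzer="char", sublinear_tf=True, ngram_range=(3, 3))

# Fit the vocabulary once, on the training text only.
X_train = vectorizer.fit_transform(train_texts)
clf = LinearSVC(C=1).fit(X_train, train_labels)

# Reuse the fitted vocabulary: transform, not fit_transform.
X_external = vectorizer.transform(external_texts)
external_pred = clf.predict(X_external)

print(X_train.shape[1], X_external.shape[1])  # same number of feature columns
```

The external set can have any number of rows; only the number of columns has to match what the model was trained on.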

Related

Creating a random forest function

I am trying to create a function that takes a 2-d numpy array (i.e. the data) and data_indices (a list of (train_indices, test_indices) tuples) as input. For each (train_indices, test_indices) tuple in data_indices, the function should:
Train a new RandomForestRegressor model on the portion of data indexed by train_indices
Evaluate the trained RandomForestRegressor model on the portion of data indexed by test_indices using the mean squared error.
After training and evaluating the RandomForestRegressor models, the function should return the RandomForestRegressor model that obtained the highest testing set mean_squared_error over its allocated data split across all trained models.
The RandomForestRegressor models should be trained with random_state equal to 42; all other parameters should be left as default.
This is how I tried to do it:
def best_k_model(data,data_indices):
model = RandomForestRegressor(random_state=42)
for train_indices, test_indices in data:
X_train, y_train = data[train_indices,0], data[train_indices,1]
X_test, y_test = data[test_indices,0], data[test_indices,1]
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
MSE = mean_squared_error(y_test, y_pred)
return MSE.max()
With these inputs:
best_model = best_k_model(data,data_indices)
best_model.predict([[1960]])
this is what I am supposed to get:
array([8.85170916e+08])
But I'm getting the error that says:
index 1960 is out of bounds for axis 0 with size 58
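A minimal sketch of the function as described, assuming the first column of data holds the single feature and the second column holds the target (as in the attempt). The toy data and the KFold-based data_indices are invented for illustration. Compared with the attempt, it loops over data_indices rather than over the rows of data (which is where an index like 1960 comes from), keeps X two-dimensional, creates a fresh model per split, and returns the model rather than a score. It follows the task statement literally in keeping the model with the highest test MSE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold


def best_k_model(data, data_indices):
    """Return the fitted model whose split gave the highest test MSE."""
    best_model, best_mse = None, -np.inf
    for train_indices, test_indices in data_indices:
        # Slice with :1 so X stays 2-D, shape (n_samples, 1).
        X_train, y_train = data[train_indices, :1], data[train_indices, 1]
        X_test, y_test = data[test_indices, :1], data[test_indices, 1]

        # A new model for every split, as the task requires.
        model = RandomForestRegressor(random_state=42)
        model.fit(X_train, y_train)

        mse = mean_squared_error(y_test, model.predict(X_test))
        if mse > best_mse:
            best_model, best_mse = model, mse
    return best_model


# Hypothetical data: one row per year, made-up target values.
years = np.arange(1960, 2018)
values = 1e8 + 5e6 * (years - 1960)
data = np.column_stack([years, values])

# Hypothetical splits; any list of (train_indices, test_indices) works.
data_indices = list(KFold(n_splits=5, shuffle=True, random_state=0).split(data))

best_model = best_k_model(data, data_indices)
prediction = best_model.predict([[1960]])
print(prediction)
```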

How to compare baseline and GridSearchCV results fair?

I am a bit confused about how to compare the best GridSearchCV model with a baseline.
For example, say we have a classification problem.
As a baseline, we'll fit a model with default settings (let it be logistic regression):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
baseline = LogisticRegression()
baseline.fit(X_train, y_train)
pred = baseline.predict(X_train)
print(accuracy_score(y_train, pred))
So, the baseline gives us accuracy using the whole train sample.
Next, GridSearchCV:
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold, train_test_split
X_val, X_test_val, y_val, y_test_val = train_test_split(X_train, y_train, test_size=0.3, random_state=42)
cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
parameters = [ ... ]
best_model = GridSearchCV(LogisticRegression(), parameters, scoring='accuracy', cv=cv)
best_model.fit(X_val, y_val)
print(best_model.best_score_)
Here, we have accuracy based on validation sample.
My questions are:
Are those accuracy scores comparable? Generally, is it fair to compare GridSearchCV and model without any cross validation?
For the baseline, isn't it better to use Validation sample too (instead of the whole Train sample)?
No, they aren't comparable.
Your baseline model was fitted on X_train, and then you scored the fitted model on that same X_train sample. This is like cheating: the score is optimistic because the model is evaluated on data it has already seen.
The grid searched model is at a disadvantage because:
It's working with less data since you have split the X_train sample.
Compound that with the fact that it's getting trained with even less data due to the 5 folds (it's training with only 4/5 of X_val per fold).
So your score for the grid search is going to be worse than your baseline.
Now you might ask, "so what's the point of best_model.best_score_?" Well, that score is used to compare all the candidate models while searching for the optimal hyperparameters in your search space, but it should in no way be used to compare against a model that was trained outside of the grid search context.
So how should one go about conducting a fair comparison?
Split your training data for both models.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Fit your models using X_train.
# fit baseline
baseline.fit(X_train, y_train)
# fit using grid search
best_model.fit(X_train, y_train)
Evaluate models against X_test.
# baseline
baseline_pred = baseline.predict(X_test)
print(accuracy_score(y_test, baseline_pred))
# grid search
grid_pred = best_model.predict(X_test)
print(accuracy_score(y_test, grid_pred))
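Put together as one self-contained script, the comparison could look like the sketch below. The synthetic dataset, the max_iter setting and the grid over C are assumptions made only so the example runs; the original parameter grid is not shown in the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Hypothetical data standing in for the real X, y.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# One split, shared by both models.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Baseline: default settings, fitted on the training part only.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Grid search: cross-validates inside X_train, then refits on all of X_train.
param_grid = {"C": [0.01, 0.1, 1, 10]}  # assumed grid, for illustration
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, scoring="accuracy", cv=cv
)
search.fit(X_train, y_train)

# Both models are scored on the same held-out data neither has seen.
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
search_acc = accuracy_score(y_test, search.predict(X_test))
print("baseline:", baseline_acc)
print("grid search:", search_acc, search.best_params_)
```

With the default refit=True, GridSearchCV retrains the best configuration on the whole of X_train, so both models end up trained on the same amount of data.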

How to deal with Support Vector machine and text data when there are multiple text features to input?

I am working on an NLP project where I have multiple features to provide to an SVM model. All the input features are text. If there were only one input feature, I could provide that feature as X and the corresponding label as Y for training the model, but how can I supply more than one feature as X input for the model?
Dataset Format
For now I am trying to pass the columns 'Questions' and 'WhWord' as input and 'CoarseType' as the label. As they are text data, I have to apply TfidfVectorizer before applying the algorithm. It looks like TfidfVectorizer doesn't support the idea of X = multiple features. How can I handle this? Here is what I was doing.
features=['Questions','WhWord']
X = df.loc[:,features].values
y = df.CoarseType
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state = 42)
model = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', OneVsRestClassifier(SVC(C=1, kernel='sigmoid'))),
])
model.fit(X_train, y_train)
Could you show your dataset so the problem is easier to understand?
As far as I understand your problem:
you can simply build X from all the input columns using the loc function and Y from the target column, and then pass them to the algorithm, like model.fit(X, Y) (I wrote model here only so you get the idea).
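Passing a two-column array straight into CountVectorizer will not work, because the vectorizer expects one string per sample. One common way around this, which is not taken from the answer above, is scikit-learn's ColumnTransformer with one TfidfVectorizer per text column, so that the resulting feature matrices are concatenated side by side. A minimal sketch, with an invented toy DataFrame that only reuses the column names from the question:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical rows; the real dataset is not shown in the question.
df = pd.DataFrame({
    "Questions": [
        "Who wrote Hamlet", "Who painted the Mona Lisa",
        "Where is the Eiffel Tower", "Where was Mozart born",
        "How many legs does a spider have", "How many moons does Mars have",
    ],
    "WhWord": ["who", "who", "where", "where", "how", "how"],
    "CoarseType": ["HUM", "HUM", "LOC", "LOC", "NUM", "NUM"],
})

# A column name given as a plain string (not a list) hands the vectorizer
# a 1-D sequence of strings, which is what it expects.
features = ColumnTransformer([
    ("questions", TfidfVectorizer(), "Questions"),
    ("whword", TfidfVectorizer(), "WhWord"),
])

model = Pipeline([
    ("features", features),
    ("clf", OneVsRestClassifier(SVC(C=1, kernel="sigmoid"))),
])

X = df[["Questions", "WhWord"]]
y = df["CoarseType"]
model.fit(X, y)
pred = model.predict(X)
print(pred)
```

TfidfVectorizer already combines CountVectorizer and TfidfTransformer, so the separate 'vect' and 'tfidf' steps are not needed in this layout.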

MLPRegressor gives very negative scores

I'm kind of new to machine learning and I am using MLPRegressor. I split my data with
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Then I build and fit the model, using 10-fold cross-validation on the test set.
nn = MLPRegressor(hidden_layer_sizes=(100, 100), activation='relu',
solver='lbfgs', max_iter=500)
nn.fit(X_train, y_train)
TrainScore = nn.score(X_train, y_train)
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
print("Cross-validation scores:\t{} ".format(cross_val_score(nn, X_test, y_test, cv=kfold)))
avg_cross_val_score = np.mean(cross_val_score(nn, X_test, y_test, cv=kfold))
print("The average cross validation score is: {}".format(avg_cross_val_score))
The problem is that the test scores I receive are very negative (-4256). What could possibly be wrong?
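To keep the convention the same everywhere, sklearn defines every scorer so that higher is better, whether the metric is classification accuracy or a regression error. Error metrics such as MSE are therefore negated (neg_mean_squared_error), and a less negative value is preferred. Note, though, that cross_val_score without a scoring argument uses the estimator's own score method, which for MLPRegressor is R², not MSE. R² is at most 1 and has no lower bound, so a value like -4256 means the model predicts far worse on those folds than simply predicting the mean of y.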
Moving on to why it may be so negative in your case, it could be broadly due to two things: overfitting or underfitting. There are tonnes of resources out there to help you from this point forward.
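A hedged sketch of how one might make the scores easier to interpret: name the scorers explicitly, cross-validate on the training data rather than on the small test set, and scale the inputs inside a pipeline. The synthetic data, the smaller network and the StandardScaler step are assumptions for illustration; unscaled inputs are only one possible cause of a badly fitting MLP, not a diagnosis of the asker's data.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_validate, train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical regression data.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is refitted inside every fold, so no fold leaks into another.
nn = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(50,), activation="relu",
                 solver="lbfgs", max_iter=500, random_state=0),
)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    nn, X_train, y_train, cv=kfold,
    scoring=("r2", "neg_mean_squared_error"),
)
r2_scores = scores["test_r2"]
neg_mse_scores = scores["test_neg_mean_squared_error"]
print("R^2 per fold:", r2_scores)
print("negated MSE per fold:", neg_mse_scores)

# The held-out test set is used once, at the end.
nn.fit(X_train, y_train)
test_r2 = nn.score(X_test, y_test)
print("test R^2:", test_r2)
```

Note that cross_val_score and cross_validate clone and refit the estimator on each fold, so the earlier nn.fit(X_train, y_train) call has no influence on the cross-validation scores.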

Properly declaring input_shape for neural network in Keras?

I am attempting to write code that identifies data types after loading data from CSV files. There are 5 possible labels, and the features are stored as a list of lists, where each inner list has the following layout:
[slash_count, dash_count, colon_count, letters, dot_count, digits]
I then split my feature and label vectors into training, testing, and validation sets. I found some code on Stackoverflow that someone wrote to do this and I have used the same:
X_train, X_test, y_train, y_test = train_test_split(ml_list, labels, test_size=0.3, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=1)
After doing this, I normalize the features to the range [0, 1], and then I create the categorical variables for the labels:
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_test_minmax = min_max_scaler.fit_transform(X_test)
X_val_minmax = min_max_scaler.fit_transform(X_val)
from keras.utils import to_categorical
y_train_minmax = to_categorical(y_train)
y_test_minmax = to_categorical(y_test)
y_val_minmax = to_categorical(y_val)
Next, I attempt to find the shape of the newly recoded variables:
print(y_train_minmax.shape) #(91366, 4)
print(X_train_minmax.shape) #(91366, 6)
print(X_test_minmax.shape) #(55939, 6)
print(X_val_minmax.shape) #(39157, 6)
print(y_train_minmax.shape) #(91366, 4)
print(y_test_minmax.shape) #(55939, 4)
print(y_val_minmax.shape) #(39157, 4)
Finally, I build the model and attempt to fit it:
model = models.Sequential()
model.add(layers.Dense(512, activation='relu', input_shape=(91366, 6)))
model.add(layers.Dense(3, activation='softmax'))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train_minmax, y_train_minmax, epochs=5, batch_size=128)
I get this message when I run the code:
ValueError: Error when checking input: expected dense_1_input to have 3 dimensions, but got array with shape (91366, 6)
I believe that the error is in how I specify the input shape when creating the neural network. I am having a hard time understanding where I went wrong. Any help would be great!
You should change this line:
model.add(layers.Dense(512, activation='relu', input_shape=(6,)))
In Keras you don't specify the number of examples in your dataset. As input_shape you provide only the shape of a single data point.
Another potential error which I spotted in your code snippet is that you should set:
model.add(layers.Dense(4, activation='softmax'))
This is because each one-hot encoded label has a shape of (4,), so the output layer needs 4 units. That is not consistent with what you've said about the number of possible labels, so I'd also advise rechecking your data.
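A small sketch of both shape fixes together, assuming TensorFlow's bundled Keras is installed. The random arrays are placeholders with the same per-sample shapes as in the question: 6 features in, 4 one-hot classes out.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder data: 256 samples, 6 features each, labels in {0, 1, 2, 3}.
rng = np.random.default_rng(0)
X = rng.random((256, 6)).astype("float32")
y = keras.utils.to_categorical(rng.integers(0, 4, size=256), num_classes=4)

model = keras.Sequential([
    keras.Input(shape=(6,)),                # shape of ONE sample, no batch size
    layers.Dense(512, activation="relu"),
    layers.Dense(4, activation="softmax"),  # one unit per one-hot column
])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=1, batch_size=128, verbose=0)

probs = model.predict(X[:5], verbose=0)
print(probs.shape)
```

keras.Input(shape=(6,)) plays the same role here as passing input_shape=(6,) to the first Dense layer.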
Another possible mistake which I spotted is that you fit a separate scaler on each of the train, test and validation sets. You should fit a single scaler on the train set only and then use that fitted scaler to transform the other sets.
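For the scaling step, a minimal sketch with random placeholder counts in place of the real feature lists: the scaler learns its minimum and maximum from the training split only and is then reused unchanged.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Placeholder features (6 counts per sample) and labels.
rng = np.random.default_rng(1)
X = rng.integers(0, 50, size=(1000, 6)).astype(float)
y = rng.integers(0, 4, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.3, random_state=1
)

scaler = MinMaxScaler()
X_train_minmax = scaler.fit_transform(X_train)  # learn min/max here only
X_test_minmax = scaler.transform(X_test)        # reuse the training min/max
X_val_minmax = scaler.transform(X_val)

print(X_train_minmax.min(), X_train_minmax.max())
```

Values in the test and validation sets can fall slightly outside [0, 1] when they exceed the range seen in training; that is expected.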
