Cross validation in classifying text documents using scikit-learn - machine-learning

When classifying text documents with scikit-learn, should you do cross-validation first and then feature extraction, or the other way around?
Here is my pipeline:
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer

union = FeatureUnion(
    transformer_list=[
        ('tfidf', TfidfVectorizer()),
        ('featureEx', FeatureExtractor()),      # custom transformer
        ('spell_chker', Spellingchecker()),     # custom transformer
    ], n_jobs=-1)
I am doing it in the following way, but I wonder if I should extract the features first and then do the cross-validation. In this example, X is a list of documents and y is the label.
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Split the raw documents, then fit the feature extractors on the training set only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train = union.fit_transform(X_train)
X_test = union.transform(X_test)

# Univariate feature selection on the extracted features
ch2 = SelectKBest(f_classif, k=7000)
X_train = ch2.fit_transform(X_train, y_train)
X_test = ch2.transform(X_test)

clf = SVC(C=1, gamma=0.001, kernel='linear', probability=True).fit(X_train, y_train)

print("classification report:")
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))
print()

Doing the feature selection first and then cross-validating on those selected features is fairly common with text data, but it is less desirable: it can lead to over-fitting, and the cross-validation procedure may over-estimate your true accuracy.
When you do the feature selection first, the feature-selection step gets to look at all of the data. The point of cross-validation is to hide the held-out fold from the folds used for training. By doing the feature selection up front, you leak knowledge of the held-out data into the training folds.
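For what it is worth, here is a minimal sketch of the leak-free alternative, assuming FeatureExtractor and Spellingchecker are scikit-learn-compatible transformers: put the feature extraction, the SelectKBest step and the classifier into a single Pipeline and cross-validate the whole pipeline, so that every step is fit only on the training folds.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer()),
        ('featureEx', FeatureExtractor()),      # custom transformer (assumed compatible)
        ('spell_chker', Spellingchecker()),     # custom transformer (assumed compatible)
    ], n_jobs=-1)),
    ('select', SelectKBest(f_classif, k=7000)),
    ('clf', SVC(C=1, gamma=0.001, kernel='linear', probability=True)),
])

# Each fold re-fits the vectorizer and the feature selector on its own training
# split, so no information from the held-out fold leaks into feature selection.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean(), scores.std())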

Related

I want to impute using my regressor model in PySpark

I am fairly new to PySpark. I am building my own ML function to impute missing values in a PySpark DataFrame. I just want to know the correct syntax, inside when(), for selecting all values of the respective row so that I can pass them as the argument to my regressor.predict().
from pyspark.sql.functions import when
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

for Y in arrEmptyColumns:
    #===============BUILDING MODEL===============
    X_train, X_test, Y_train, Y_test = train_test_split(tblTempX.toPandas(), tblTempDataSet.select(Y).toPandas(), test_size=0.20)
    regressor = RandomForestRegressor(n_estimators=int(len(tblTempX.columns) * .50), random_state=0)

    #===============TRAINING===============
    regressor.fit(X_train, Y_train)
    Y_pred = regressor.predict(X_test)
    print("==============================")
    print("Mean Squared Error for ", Y, ": ")
    print(np.sqrt(mean_squared_error(Y_test, Y_pred)))  # note: np.sqrt(MSE) is the RMSE
    print("R2_Score; Performance Accuracy: ")
    print(r2_score(Y_test, Y_pred))

    #===============TESTING===============
    tblFinalOutputDataSet = tblFinalOutputDataSet.withColumn(Y, when(tblFinalOutputDataSet[Y].isNull(), regressor.predict('''*code here to select all values of respective row to be used as argument for the model*''')))
    tblFinalOutputDataSet.display()
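A hedged sketch of one common workaround (not the poster's code, and row-at-a-time UDFs can be slow on large data): wrap the fitted scikit-learn regressor in a PySpark UDF so that each row's feature values can be passed to regressor.predict(). Here feat_cols and the .otherwise() clause are assumptions about the intended behaviour.
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

feat_cols = tblTempX.columns  # assumed: the feature columns the model was trained on

def predict_row(*vals):
    # scikit-learn expects a 2-D array: one row containing all feature values
    return float(regressor.predict([list(vals)])[0])

predict_udf = F.udf(predict_row, DoubleType())

tblFinalOutputDataSet = tblFinalOutputDataSet.withColumn(
    Y,
    F.when(F.col(Y).isNull(), predict_udf(*[F.col(c) for c in feat_cols]))
     .otherwise(F.col(Y))   # keep the existing value when it is not null
)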

How to create an ANN regression model with several vectors as input and several vectors as output?

There is one input variable and one output variable.
However, each data point of the input/output variable is a vector.
The size of each input vector is 141 x 1 and the size of each output vector is 400 x 1.
I have attached the input data file (Ivec.xls) and the output data file (Ovec.xls):
data link
For training:
Input vectors: Ivec(:,1:9) and output vectors: Ovec(:,1:9)
For testing:
Input vector: Ivec(:,10) and the predicted_Ovec10 can be compared with Ovec(:,10)
to know the performance of the model.
How to create a regression model from this?
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout   # assuming tf.keras; plain Keras works the same

dataset_Ivec = pd.read_excel(r'Ivec.xls', header=None)
dataset_Ovec = pd.read_excel(r'Ovec.xls', header=None)
dataset_Ivec_numpy = dataset_Ivec.to_numpy()
dataset_Ovec_numpy = dataset_Ovec.to_numpy()
X_train = dataset_Ivec_numpy[:,:-1]
y_train = dataset_Ovec_numpy[:,:-1]
X_test = dataset_Ivec_numpy[:,9]
y_test = dataset_Ovec_numpy[:,9]
model = Sequential()
model.add(Dense(activation="relu", input_dim=X_train.shape[0], units = X_train.shape[1], kernel_initializer="uniform"))
model.add(Dropout(0.285))
model.add(Dense(activation="linear", input_dim=y_train.shape[1], units = X_train.shape[1], kernel_initializer="uniform"))
model.compile(optimizer="adagrad", loss="mean_squared_error", metrics=["accuracy"])
# model = baseline_model()
result = model.fit(X_train, y_train, batch_size=2, epochs=20, validation_data=(X_test, y_test))
I tried to write this code; however, it is too confusing for me, and I have been stuck for many days now. Could someone please help?
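A minimal sketch of one way to set this up, under the assumption that each column of Ivec.xls / Ovec.xls is one sample (so the arrays are transposed to put samples on rows) and that a plain dense network mapping 141 inputs to 400 outputs is acceptable; the layer sizes and training settings are illustrative only.
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

Ivec = pd.read_excel(r'Ivec.xls', header=None).to_numpy().T   # shape (10, 141)
Ovec = pd.read_excel(r'Ovec.xls', header=None).to_numpy().T   # shape (10, 400)

X_train, y_train = Ivec[:9], Ovec[:9]    # first 9 samples for training
X_test,  y_test  = Ivec[9:], Ovec[9:]    # 10th sample held out for testing

model = Sequential([
    Dense(256, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(y_train.shape[1], activation='linear'),   # 400-dimensional output
])
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(X_train, y_train, epochs=200, batch_size=2, verbose=0)

predicted_Ovec10 = model.predict(X_test)   # compare with Ovec(:,10)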

How does the f_regression function handle features with missing values - are they simply dropped?

I am building a simple regression model with scikit-learn. My dataset has thousands of features and roughly 600 rows. As a simple feature-selection strategy to set my baseline, I am using the f_regression function with SelectKBest: SelectKBest(score_func=f_regression, k=50). My goal is to select the 50 best features. However, a large proportion of these columns have missing values. Without imputing or dropping them, how does this feature-selection strategy handle those features?
My function for selecting the features is below (note that I am building a number of models, which are denoted by the keys used in the function).
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_regression

def process_training_and_test_data_for_LG(df):
    training_datasets = {}
    test_datasets = {}
    for drug in df.keys():
        print("at " + drug)
        X = df[drug].drop(["cell_line_name", "ln_IC50", "putative_target"], axis=1)
        y = df[drug]["ln_IC50"]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
        # Select the 50 top-scoring features
        feature_selector = SelectKBest(score_func=f_regression, k=50)
        feature_selector.fit(X_train, y_train)
        X_train = feature_selector.transform(X_train)
        X_test = feature_selector.transform(X_test)
        # Add data to the relevant dictionaries
        training_datasets[drug] = [X_train, y_train]
        test_datasets[drug] = [X_test, y_test]
    return training_datasets, test_datasets
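As far as I know, f_regression does not handle NaN at all: SelectKBest.fit will raise an error on columns containing missing values rather than silently dropping them, so some imputation (or column dropping) has to happen before feature selection. A minimal sketch with an explicit imputation step, where median imputation, k=50 and LinearRegression are only illustrative choices and X_train/y_train are the per-drug splits from the function above:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),            # illustrative strategy
    ('select', SelectKBest(score_func=f_regression, k=50)),
    ('model', LinearRegression()),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))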

making predictions using classification models with multiple independent variables in hand

I am trying to do a simple classification using Logistic Regression. I fit the model and scale the values using a standard scaler. How can I make a single prediction after that? I am getting the same result for different inputs: for every value, I get 0. The predictions I get from single inputs do not resemble the results of the predictions made on the test dataset. Can someone please give me a hand?
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

dataset = pd.read_csv("Social_Network_Ads.csv")
x = dataset.iloc[:, 2:4].values
y = dataset.iloc[:, 4].values
print(dataset)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Scale features on the training set, reuse the fitted scaler for the test set
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

classifier = LogisticRegression()
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

# Single prediction: the new observation is scaled with the same fitted scaler
x_values = np.array([36, 36000]).reshape(1, -1)
x_values = scaler.transform(x_values)
pred = classifier.predict(x_values)
print("single prediction: ", pred)

sklearn and weka kNN predictions exactly same for all except for one data point

I wrote code for kNN using sklearn and then compared the predictions with WEKA's kNN. The comparison was done using the 10 test-set predictions, out of which only a single one shows a large difference of > 1.5, while all the others are exactly the same. So I am not sure whether my code is working correctly or not. Here is my code:
import pandas as pd
from scipy import stats
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_predict, LeaveOneOut
from sklearn.neighbors import KNeighborsRegressor

df = pd.read_csv('xxxx.csv')
X = df.drop(['Name', 'activity'], axis=1)
y = df['activity']
Xstd = StandardScaler().fit_transform(X)
x_train, x_test, y_train, y_test = train_test_split(Xstd, y, test_size=0.2,
                                                    shuffle=False, random_state=None)
print(x_train.shape, x_test.shape)

X_train_trans = x_train
X_test_trans = x_test

for i in range(2, 3):
    knn_regressor = KNeighborsRegressor(n_neighbors=i, algorithm='brute',
                                        weights='uniform', metric='euclidean', n_jobs=1, p=2)
    # Leave-one-out cross-validated predictions on the training set
    CV_pred_train = cross_val_predict(knn_regressor, X_train_trans, y_train,
                                      n_jobs=-1, verbose=0, cv=LeaveOneOut())
    print("LOO Q2: ", metrics.r2_score(y_train, CV_pred_train).round(2))

    # Train/test predictions
    knn_regressor.fit(X_train_trans, y_train)
    train_r2 = knn_regressor.score(X_train_trans, y_train)
    y_train_pred = knn_regressor.predict(X_train_trans).round(3)
    train_r2_1 = metrics.r2_score(y_train, y_train_pred)
    y_test_pred = knn_regressor.predict(X_test_trans).round(3)
    train_r = stats.pearsonr(y_train, y_train_pred)
    abs_error_train = (y_train - y_train_pred)
    train_predictions = pd.DataFrame({'Actual': y_train, 'Predicted': y_train_pred,
                                      'error': abs_error_train.round(3)})
    MAE_train = metrics.mean_absolute_error(y_train, y_train_pred)

    abs_error_test = (y_test_pred - y_test)
    test_predictions = pd.DataFrame({'Actual': y_test, 'Predicted': y_test_pred,
                                     'error': abs_error_test.round(3)})
    test_r = stats.pearsonr(y_test, y_test_pred)
    test_r2 = metrics.r2_score(y_test, y_test_pred)
    MAE_test = metrics.mean_absolute_error(y_test, y_test_pred).round(3)
    print(test_predictions)
The training-set statistics are almost the same for both the sklearn and WEKA kNN.
The sklearn predictions are:
Actual   Predicted   error
6.00 5.285 -0.715
5.44 5.135 -0.305
6.92 6.995 0.075
7.28 7.005 -0.275
5.96 6.440 0.480
7.96 7.150 -0.810
7.30 6.660 -0.640
6.68 7.200 0.520
***4.60 6.950 2.350***
And the WEKA predictions are:
Actual   Predicted   error
6 5.285 -0.715
5.44 5.135 -0.305
6.92 6.995 0.075
7.28 7.005 -0.275
5.96 6.44 0.48
7.96 7.15 -0.81
7.3 6.66 -0.64
6.68 7.2 0.52
***4.6 5.285 0.685***
The parameters used in both algorithms are: k = 2, brute force for distance calculation, metric: Euclidean.
Any suggestions for the difference?
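Not a definitive explanation, but one thing worth checking is whether the divergent test compound has a distance tie among its nearest neighbours: with k = 2, a tie between the 2nd and 3rd nearest neighbour can make different implementations pick different neighbours. A small hypothetical diagnostic using the fitted sklearn model:
import numpy as np

# Look at the 3 nearest training neighbours of each test sample, with distances
distances, indices = knn_regressor.kneighbors(X_test_trans, n_neighbors=3)
for row, (d, idx) in enumerate(zip(distances, indices)):
    print("test sample", row,
          "neighbour indices:", idx,
          "distances:", np.round(d, 4),
          "neighbour targets:", np.asarray(y_train)[idx])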
