When classifying text documents with scikit-learn, do you do cross-validation first and then feature extraction, or the other way around?
Here is my pipeline:
union = FeatureUnion(
    transformer_list=[
        ('tfidf', TfidfVectorizer()),
        ('featureEx', FeatureExtractor()),
        ('spell_chker', Spellingchecker()),
    ],
    n_jobs=-1)
I am doing it in the following way, but I wonder whether I should extract the features first and then do the cross-validation. In this example, X is a list of documents and y is the label.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train = union.fit_transform(X_train)
X_test = union.transform(X_test)
ch2 = SelectKBest(f_classif, k=7000)
X_train = ch2.fit_transform(X_train, y_train)
X_test = ch2.transform(X_test)
clf = SVC(C=1, gamma=0.001, kernel='linear', probability=True).fit(X_train, y_train)
print("classification report:")
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))
print()
Selecting features first and then cross-validating on those selected features is common with text data, but it is the less desirable order: it can lead to over-fitting, and the cross-validation procedure may over-estimate your true accuracy.
When you do the feature selection first, the feature-selection step gets to look at all of the data. The point of cross-validation is to hide each held-out fold from the folds used for training; by selecting features on the full dataset first, you leak knowledge of the held-out data into every fold.
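A minimal sketch of the leak-free version, reusing the union from the question (the cv=5 choice is arbitrary): wrap the feature union, the selector, and the classifier in a single Pipeline, so that cross-validation refits the feature selection on each training split only.
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Every step, including SelectKBest, is now fitted inside each CV
# training fold, so no information from the held-out fold leaks in.
pipe = Pipeline([
    ('union', union),
    ('select', SelectKBest(f_classif, k=7000)),  # k must not exceed the union's feature count
    ('clf', SVC(C=1, gamma=0.001, kernel='linear', probability=True)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))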
Related
I am fairly new to PySpark. I am building my own ML imputing function to fill missing values in a PySpark DataFrame. I just want to know the correct syntax for selecting all values of a given row inside when(), so that I can pass those values as the argument to my regressor.predict().
for Y in arrEmptyColumns:
    # =============== BUILDING MODEL ===============
    X_train, X_test, Y_train, Y_test = train_test_split(
        tblTempX.toPandas(), tblTempDataSet.select(Y).toPandas(), test_size=0.20)
    regressor = RandomForestRegressor(
        n_estimators=int(len(tblTempX.columns) * .50), random_state=0)

    # =============== TRAINING ===============
    regressor.fit(X_train, Y_train)
    Y_pred = regressor.predict(X_test)
    print("==============================")
    print("Root Mean Squared Error for ", Y, ": ")
    print(np.sqrt(mean_squared_error(Y_test, Y_pred)))
    print("R2_Score; Performance Accuracy: ")
    print(r2_score(Y_test, Y_pred))

    # =============== TESTING ===============
    tblFinalOutputDataSet = tblFinalOutputDataSet.withColumn(Y, when(
        tblFinalOutputDataSet[Y].isNull(),
        regressor.predict('''*code here to select all values of respective row to be used as argument for the model*''')))
    tblFinalOutputDataSet.display()
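As far as I know, when() builds a column expression and cannot invoke a Python model row by row, so one workaround is to predict the missing rows in bulk on the driver and join the results back. A rough sketch under stated assumptions: spark is the active SparkSession, and the table has a unique key column row_id (a hypothetical name; add one with F.monotonically_increasing_id() if the table has no key).
from pyspark.sql import functions as F

# Pull only the rows where the target column is null, plus the
# feature columns and the unique key.
missing_pdf = (tblFinalOutputDataSet
               .filter(F.col(Y).isNull())
               .select(['row_id'] + list(tblTempX.columns))
               .toPandas())
missing_pdf[Y + '_pred'] = regressor.predict(missing_pdf[list(tblTempX.columns)])

# Join the bulk predictions back and fill the nulls.
preds = spark.createDataFrame(missing_pdf[['row_id', Y + '_pred']])
tblFinalOutputDataSet = (tblFinalOutputDataSet
                         .join(preds, on='row_id', how='left')
                         .withColumn(Y, F.coalesce(F.col(Y), F.col(Y + '_pred')))
                         .drop(Y + '_pred'))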
There is one input variable and one output variable, but each data point of the input/output variable is a vector. Each input vector is 141×1 and each output vector is 400×1. I have attached the input data file (Ivec.xls) and the output data file (Ovec.xls):
data link
For training:
Input vectors: Ivec(:,1:9) and output vectors: Ovec(:,1:9)
For testing:
Input vector: Ivec(:,10) and the predicted_Ovec10 can be compared with Ovec(:,10)
to know the performance of the model.
How to create a regression model from this?
dataset_Ivec = pd.read_excel(r'Ivec.xls',header = None)
dataset_Ovec = pd.read_excel(r'Ovec.xls',header = None)
dataset_Ivec_numpy = dataset_Ivec.to_numpy()
dataset_Ovec_numpy = dataset_Ovec.to_numpy()
X_train = dataset_Ivec_numpy[:,:-1]
y_train = dataset_Ovec_numpy[:,:-1]
X_test = dataset_Ivec_numpy[:,9]
y_test = dataset_Ovec_numpy[:,9]
model = Sequential()
model.add(Dense(activation="relu", input_dim=X_train.shape[0], units = X_train.shape[1], kernel_initializer="uniform"))
model.add(Dropout(0.285))
model.add(Dense(activation="linear", input_dim=y_train.shape[1], units = X_train.shape[1], kernel_initializer="uniform"))
model.compile(optimizer="adagrad", loss="mean_squared_error", metrics=["accuracy"])
# model = baseline_model()
result = model.fit(X_train, y_train, batch_size=2, epochs=20, validation_data=(X_test, y_test))
I tried to write the code above, but it confuses me, and I have been stuck on it for days. Can someone please help?
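If I read the dimensions correctly (Ivec is 141×10 and Ovec is 400×10, with one sample per column), the main fix is to transpose so that samples become rows: the model then maps a 141-dimensional input to a 400-dimensional output. A minimal sketch, with the hidden-layer size, optimizer, and epoch count chosen purely for illustration (accuracy is not a meaningful metric for regression, so it is dropped here):
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

Ivec = pd.read_excel(r'Ivec.xls', header=None).to_numpy()  # shape (141, 10)
Ovec = pd.read_excel(r'Ovec.xls', header=None).to_numpy()  # shape (400, 10)

X, Y = Ivec.T, Ovec.T             # transpose: one sample per row
X_train, y_train = X[:9], Y[:9]   # Ivec(:,1:9) / Ovec(:,1:9)
X_test, y_test = X[9:], Y[9:]     # Ivec(:,10) / Ovec(:,10)

model = Sequential([
    Dense(256, activation='relu', input_dim=141),
    Dense(400, activation='linear'),  # one unit per element of the output vector
])
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(X_train, y_train, batch_size=2, epochs=200, verbose=0)

predicted_Ovec10 = model.predict(X_test)  # compare against Ovec(:,10)
With only nine training samples, a network of this size will mostly memorize, so a linear baseline such as Ridge regression may be a more honest starting point.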
I am building a simple regression model in with scikit-learn. My dataset has thousands of features and roughly 600 rows. As a simple feature selection strategy to set my baseline, I am using the f_regression function with SelectKBest - SelectKBest(score_func=f_regression, k=50). My goal is to select the 50 best features. However, a large proportion of these columns have missing values. Without imputing values/dropping them, how does this feature selection strategy handle these features?
My function for selecting the features is below (note that I am building a number of models, one per key of the dictionary passed to the function):
def process_training_and_test_data_for_LG(df):
    training_datasets = {}
    test_datasets = {}
    for drug in df.keys():
        print("at " + drug)
        X = df[drug].drop(["cell_line_name", "ln_IC50", "putative_target"], axis=1)
        y = df[drug]["ln_IC50"]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
        # Select the 50 top correlated features
        feature_selector = SelectKBest(score_func=f_regression, k=50)
        feature_selector.fit(X_train, y_train)
        X_train = feature_selector.transform(X_train)
        X_test = feature_selector.transform(X_test)
        # Add data to the relevant dictionaries
        training_datasets[drug] = [X_train, y_train]
        test_datasets[drug] = [X_test, y_test]
    return training_datasets, test_datasets
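To my knowledge, f_regression does not handle missing values at all: like most scikit-learn estimators, SelectKBest.fit raises an error on NaN input rather than silently skipping those columns, so the features are not "handled", the call simply fails. The usual fix is to impute inside the same pipeline; a minimal sketch of one option (median imputation is an illustrative choice, not a recommendation):
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression

# Impute and select in one object, so both are fitted on X_train only.
feature_selector = Pipeline([
    ('impute', SimpleImputer(strategy='median')),  # fill NaNs per feature
    ('select', SelectKBest(score_func=f_regression, k=50)),
])
feature_selector.fit(X_train, y_train)
X_train = feature_selector.transform(X_train)
X_test = feature_selector.transform(X_test)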
I am trying to do a simple classification with logistic regression. I scale the values with a StandardScaler and fit the model. How can I make a single prediction after that? I am getting the same result, 0, for different input values, and the predictions I get from single inputs do not match the predictions made on the test dataset. Can someone please give me a hand?
dataset = pd.read_csv("Social_Network_Ads.csv")
x = dataset.iloc[:, 2:4].values
y = dataset.iloc[:, 4].values
print(dataset)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
classifier = LogisticRegression()
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)
x_values = [36, 36000]
x_values = np.array(x_values).reshape(1, -1)
x_values = scaler.transform(x_values)
pred = classifier.predict(x_values)
print("single prediction: ", pred)
I wrote code for kNN using sklearn and then compared the predictions against WEKA's kNN. The comparison was done on 10 test-set predictions; all of them match exactly except a single one, which shows a large difference of more than 1.5. So I am not sure whether my code is working correctly. Here is my code:
df = pd.read_csv('xxxx.csv')
X = df.drop(['Name', 'activity'], axis=1)
y = df['activity']
Xstd = StandardScaler().fit_transform(X)
x_train, x_test, y_train, y_test = train_test_split(Xstd, y, test_size=0.2,
                                                    shuffle=False, random_state=None)
print(x_train.shape, x_test.shape)

X_train_trans = x_train
X_test_trans = x_test

for i in range(2, 3):
    knn_regressor = KNeighborsRegressor(n_neighbors=i, algorithm='brute',
                                        weights='uniform', metric='euclidean',
                                        n_jobs=1, p=2)
    CV_pred_train = cross_val_predict(knn_regressor, X_train_trans, y_train,
                                      n_jobs=-1, verbose=0, cv=LeaveOneOut())
    print("LOO Q2: ", metrics.r2_score(y_train, CV_pred_train).round(2))

    # Train and test predictions
    knn_regressor.fit(X_train_trans, y_train)
    train_r2 = knn_regressor.score(X_train_trans, y_train)
    y_train_pred = knn_regressor.predict(X_train_trans).round(3)
    train_r2_1 = metrics.r2_score(y_train, y_train_pred)
    y_test_pred = knn_regressor.predict(X_test_trans).round(3)
    train_r = stats.pearsonr(y_train, y_train_pred)
    abs_error_train = y_train - y_train_pred
    train_predictions = pd.DataFrame({'Actual': y_train, 'Predicted': y_train_pred,
                                      'error': abs_error_train.round(3)})
    MAE_train = metrics.mean_absolute_error(y_train, y_train_pred)
    abs_error_test = y_test_pred - y_test
    test_predictions = pd.DataFrame({'Actual': y_test, 'Predicted': y_test_pred,
                                     'error': abs_error_test.round(3)})
    test_r = stats.pearsonr(y_test, y_test_pred)
    test_r2 = metrics.r2_score(y_test, y_test_pred)
    MAE_test = metrics.mean_absolute_error(y_test, y_test_pred).round(3)
    print(test_predictions)
The train-set statistics are almost the same in both sklearn and WEKA kNN.
The sklearn predictions are:
Actual  Predicted  error
6.00 5.285 -0.715
5.44 5.135 -0.305
6.92 6.995 0.075
7.28 7.005 -0.275
5.96 6.440 0.480
7.96 7.150 -0.810
7.30 6.660 -0.640
6.68 7.200 0.520
***4.60 6.950 2.350***
and the WEKA predictions are:
Actual  Predicted  error
6 5.285 -0.715
5.44 5.135 -0.305
6.92 6.995 0.075
7.28 7.005 -0.275
5.96 6.44 0.48
7.96 7.15 -0.81
7.3 6.66 -0.64
6.68 7.2 0.52
***4.6 5.285 0.685***
The parameters used in both algorithms are: k = 2, brute-force neighbor search, and the Euclidean metric.
Any suggestions as to what causes the difference?
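A hedged guess: with k = 2 and uniform weights, the prediction is just the mean of the two nearest neighbors' targets, so the two implementations can only disagree when they pick different neighbors, typically because a distance tie is broken in a different order or because the features are standardized differently (note that the code above fits the StandardScaler on the full dataset before splitting, which WEKA will not reproduce). You can inspect the neighbors sklearn actually used for the divergent point (assumed here to be the last test row; adjust the index if needed):
# Ask sklearn for the two neighbors of the divergent test point.
dist, idx = knn_regressor.kneighbors(X_test_trans[-1].reshape(1, -1), n_neighbors=2)
print("neighbor distances:", dist[0])
print("neighbor targets:", y_train.iloc[idx[0]].values)
If the two distances are equal (or nearly so) and swapping one neighbor reproduces WEKA's 5.285, a tie-break difference would explain the whole gap.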