I want to impute missing values using my regressor model in PySpark - machine-learning

I am fairly new to PySpark. I am building my own ML imputation function to fill in missing values in a PySpark dataframe. I want to know the correct syntax for selecting all values of the current row inside when(), so that I can pass those values as the argument to regressor.predict().
for Y in arrEmptyColumns:
    #===============BUILDING MODEL===============
    X_train, X_test, Y_train, Y_test = train_test_split(tblTempX.toPandas(), tblTempDataSet.select(Y).toPandas(), test_size = 0.20)
    regressor = RandomForestRegressor(n_estimators = int(len(tblTempX.columns)*.50), random_state=0)
    #===============TRAINING===============
    regressor.fit(X_train,Y_train)
    Y_pred = regressor.predict(X_test)
    print("==============================")
    # np.sqrt of the MSE is the root mean squared error
    print("Root Mean Squared Error for ",Y,": ")
    print(np.sqrt(mean_squared_error(Y_test,Y_pred)))
    print("R2_Score; Performance Accuracy: ")
    print(r2_score(Y_test, Y_pred))
    #===============TESTING===============
    tblFinalOutputDataSet = tblFinalOutputDataSet.withColumn(Y, when(tblFinalOutputDataSet[Y].isNull(), regressor.predict('''*code here to select all values of respective row to be used as argument for the model*''')))
tblFinalOutputDataSet.display()
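A note on why the when() approach is awkward: when() builds a lazy Spark column expression, while regressor.predict() is an ordinary Python call that needs concrete values, so the two do not compose directly. Since the code above already calls toPandas() for training, one option is to fill the gaps on the pandas side as well. The following is only a minimal sketch under that assumption; the frame, the column names (f1, f2, target) and the data are made up for illustration and are not from the question.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical pandas copy of the Spark dataframe (e.g. the result of toPandas()).
rng = np.random.default_rng(0)
pdf = pd.DataFrame({"f1": rng.normal(size=50), "f2": rng.normal(size=50)})
pdf["target"] = 3.0 * pdf["f1"] - pdf["f2"]
pdf.loc[[3, 10, 27], "target"] = np.nan  # simulate missing values to impute

feature_cols = ["f1", "f2"]       # columns fed to the model
missing = pdf["target"].isna()    # rows that need imputing

# Train only on rows where the target is known.
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(pdf.loc[~missing, feature_cols], pdf.loc[~missing, "target"])

# Predict for the rows with a missing target and write the values back.
pdf.loc[missing, "target"] = regressor.predict(pdf.loc[missing, feature_cols])
```

If the result has to live in Spark again, the filled frame can be passed to spark.createDataFrame(pdf). For data too large to collect, a pandas UDF would be the usual direction, but that is beyond this sketch.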

Related

How does the f_regression function handle features with missing values - are they simply dropped?

I am building a simple regression model with scikit-learn. My dataset has thousands of features and roughly 600 rows. As a simple feature selection strategy to set my baseline, I am using the f_regression function with SelectKBest - SelectKBest(score_func=f_regression, k=50). My goal is to select the 50 best features. However, a large proportion of these columns have missing values. If I neither impute nor drop them, how does this feature selection strategy handle these features?
My function for selecting the features is below (note that I am building several models, one per key used in the function):
def process_training_and_test_data_for_LG(df):
    training_datasets = {}
    test_datasets = {}
    for drug in df.keys():
        print("at "+drug)
        X = df[drug].drop(["cell_line_name", "ln_IC50","putative_target"], axis=1)
        y = df[drug]["ln_IC50"]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
        #Select 50 top correlated features
        feature_selector = SelectKBest(score_func=f_regression, k=50)
        feature_selector.fit(X_train, y_train)
        X_train = feature_selector.transform(X_train)
        X_test = feature_selector.transform(X_test)
        #Add data to relevant dictionaries
        training_datasets[drug] = [X_train, y_train]
        test_datasets[drug] = [X_test, y_test]
    return training_datasets, test_datasets
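As far as I know, f_regression validates its input and rejects NaN values with an error rather than silently dropping the affected features, so something has to deal with the gaps before the selector sees them. A minimal sketch of one way to do that is to put an imputer in front of SelectKBest inside a Pipeline. The data here is synthetic, and the median strategy and k=5 are arbitrary choices for illustration, not a recommendation for the dataset in the question.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Synthetic data: 60 rows, 20 features, target driven by the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=60)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the entries

# Impute first, then score features; both steps are fit on the same data.
selector = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("select", SelectKBest(score_func=f_regression, k=5)),
])
X_selected = selector.fit_transform(X, y)
```

Fitting the pipeline on the training split only and calling selector.transform on the test split keeps the imputation statistics from leaking test information.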

SMOTE resampling produces nan values

I am using SMOTE to oversample the minority class of a dataset. My code is as follows:
from imblearn.over_sampling import SMOTE
X_train, X_test, y_train, y_test = train_test_split(features_coded, labels, test_size=0.2, random_state=42)
sm = SMOTE(random_state=42, sampling_strategy='all')
# also tried the following, same result
# sm = SMOTE(random_state=42, sampling_strategy=0.5)
X_train, y_train = sm.fit_resample(X_train, y_train)
I check features_coded, labels, X_train and y_train using statements like the following:
features_coded[features_coded.isnull().any(axis=1)]
I am pretty sure that they do not contain any NaN values before oversampling. However, after resampling, there are a lot of NaN values in the X_train dataframe.
Just in case you are wondering:
This is my dataframe (saved as csv file) before oversampling, nothing is missing.
This is my dataframe (saved as csv file) after oversampling, a lot of empty values!
Is anything wrong?
I had a similar issue. I converted my inputs X and Y to arrays with X_arr = numpy.array(X) and y_arr = numpy.array(Y), then passed them to train_test_split() as follows:
X_train, X_test, y_train, y_test = train_test_split(X_arr, y_arr, test_size = 0.2, random_state = 2)
smote = SMOTE(random_state=2)
X_train_balanced, Y_train_balanced = smote.fit_resample(X_train, y_train)
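One possible explanation for why arrays help (a guess, since the question does not show how the frames were put together before saving to CSV): pandas aligns on index labels, and after train_test_split the training frame keeps its shuffled original index, while resampled output starts again at 0. Combining the two by label then produces NaN wherever the labels do not match. The sketch below reproduces that effect with pandas only; the resampled array is a hand-made stand-in, not real SMOTE output.

```python
import numpy as np
import pandas as pd

# Hypothetical training frame with a shuffled index, as left by train_test_split.
X_train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]}, index=[7, 2, 5])

# Stand-in for resampled data: the original rows plus one synthetic row,
# with labels indexed 0..3.
X_resampled = np.vstack([X_train.to_numpy(), [[1.5, 4.5]]])
y_resampled = pd.Series([0, 1, 1, 1], name="label")

# Misaligned: concat joins on index labels, and 7, 2, 5 do not line up with 0..3.
misaligned = pd.concat([X_train, y_resampled], axis=1)

# Aligned: rebuild the frame from the array so both sides share a 0..n-1 index.
aligned = pd.concat(
    [pd.DataFrame(X_resampled, columns=X_train.columns), y_resampled], axis=1
)
```

NumPy arrays carry no index labels, which may be why converting to arrays before resampling avoids the problem.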

Making predictions using classification models with multiple independent variables in hand

I am trying to build a simple classifier using Logistic Regression. I scale the values with a standard scaler and fit the model. How can I make a single prediction after that? I am getting the same result, 0, for every value I try. The predictions I get from single inputs do not resemble the predictions made on the testing dataset. Can someone please give me a hand?
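import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
dataset = pd.read_csv("Social_Network_Ads.csv")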
x = dataset.iloc[:, 2:4].values
y = dataset.iloc[:, 4].values
print(dataset)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
classifier = LogisticRegression()
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)
x_values = [36, 36000]
x_values = np.array(x_values).reshape(1, -1)
x_values = scaler.transform(x_values)
pred = classifier.predict(x_values)
print("single prediction: ", pred)
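For single predictions, the input has to go through the same fitted scaler as the training data, as the code above does. One way to make that hard to get wrong is to bundle the scaler and the classifier into a pipeline, and to look at predict_proba to see how close a given input is to the decision boundary. This is only a sketch: the data below is synthetic (two numeric features and a made-up labelling rule), not the CSV from the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the two feature columns and the binary label.
rng = np.random.default_rng(0)
age = rng.integers(18, 60, size=200)
salary = rng.integers(15000, 150000, size=200)
x = np.column_stack([age, salary]).astype(float)
y = ((age > 40) | (salary > 90000)).astype(int)  # made-up rule, for illustration

# The pipeline applies the fitted scaler automatically before the classifier.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(x, y)

# A single sample must be 2-D: one row, two features, in raw (unscaled) units.
single = np.array([[36, 36000]], dtype=float)
pred = model.predict(single)
proba = model.predict_proba(single)  # class probabilities for that one row
```

If several different inputs all come back as 0, comparing their probabilities can show whether the model is genuinely confident or whether the inputs simply all fall on the same side of the boundary.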

Training and testing ML from two different sources

I am using sklearn for a classification task. I want to train my model on data from table "train" and test it on data from a different table, "test". Both tables have exactly the same features, but different numbers of rows. I have the code below, but I am getting the error:
(<class 'ValueError'>, ValueError('Found input variables with inconsistent numbers of samples: [123, 174]',), <traceback object at 0x0000016476E10C48>).
What am I doing wrong?
get_train_data = 'select * from train;'
get_test_data = 'select * from test;'
df_train = pd.read_sql_query(get_train_data, con=connection)
df_test = pd.read_sql_query(get_test_data, con=connection)
X = df_train[:, 2:30]
Y = df_test[:, :30]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
model.fit(X_train, Y_train)
predictions = model.predict(X_test)
split_mat=confusion_matrix(Y_test, predictions)
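If you want to train on dataframe df_train and test on dataframe df_test, why are you taking the features of df_train and the target column of df_test and passing them to train_test_split? That function splits a single dataset, so it expects X and Y to have the same number of rows, which is what the error is complaining about.
You can simply do the following: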
get_train_data = 'select * from train;'
get_test_data = 'select * from test;'
df_train = pd.read_sql_query(get_train_data, con=connection)
df_test = pd.read_sql_query(get_test_data, con=connection)
X_train = df_train.iloc[:, 2:30] # positional slicing on a DataFrame needs .iloc
y_train = df_train.y # assuming y is the name of your target variable in df_train
X_test = df_test.iloc[:, i:j] # replace i and j with the positions that select the same columns as X_train
y_test = df_test.y # assuming y is the name of your target variable in df_test
model.fit(X_train, y_train)
predictions = model.predict(X_test)
# Do something with predictions, e.g. compute the accuracy
(predictions == y_test).mean()
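Here is a small self-contained version of the same idea, selecting the feature columns by name so that both tables are guaranteed to use the same ones. The two frames, the column names (f1, f2, y) and the choice of LogisticRegression are hypothetical stand-ins for the SQL tables and the model in the question.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Hypothetical stand-ins for the "train" and "test" tables (different row counts).
df_train = pd.DataFrame({
    "f1": [0.1, 0.9, 0.2, 0.8, 0.15, 0.85],
    "f2": [1.0, 0.0, 0.9, 0.1, 0.95, 0.05],
    "y": [0, 1, 0, 1, 0, 1],
})
df_test = pd.DataFrame({
    "f1": [0.05, 0.95, 0.3, 0.7],
    "f2": [0.9, 0.2, 0.8, 0.1],
    "y": [0, 1, 0, 1],
})

feature_cols = ["f1", "f2"]  # same columns, same order, for both tables

# Fit on the training table only; no train_test_split is needed.
model = LogisticRegression()
model.fit(df_train[feature_cols], df_train["y"])

# Evaluate on the separate test table.
predictions = model.predict(df_test[feature_cols])
cm = confusion_matrix(df_test["y"], predictions)
accuracy = (predictions == df_test["y"]).mean()
```

The row counts of the two tables never have to match; only X and y within each table do.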

Cross validation in classifying text documents using scikit-learn

When classifying text documents with scikit-learn, should cross validation come first, followed by feature extraction, or the other way around?
Here is my pipeline:
union = FeatureUnion(
    transformer_list = [
        ('tfidf', TfidfVectorizer()),
        ('featureEx', FeatureExtractor()),
        ('spell_chker', Spellingchecker()),
    ], n_jobs = -1)
I am doing it in the following way, but I wonder if I should extract the features first and then do the cross validation. In this example X is a list of documents and y is the list of labels.
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size= 0.2)
X_train = union.fit_transform(X_train)
X_test = union.transform(X_test)
ch2 = SelectKBest(f_classif, k = 7000)
X_train = ch2.fit_transform(X_train, y_train)
X_test = ch2.transform(X_test)
clf = SVC(C=1, gamma=0.001, kernel = 'linear', probability=True).fit(
    X_train , y_train)
print("classification report:")
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))
print()
Doing the feature selection first and then cross validating on the selected features is fairly common on text data, but it is less desirable. It can lead to over-fitting, and the cross-validation procedure may over-estimate your true accuracy.
When you do the feature selection first, the feature selection process gets to look at all the data. The point of cross validation is to hide one fold from the others. By doing the feature selection first, you leak some knowledge of that held-out fold into the other folds.
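A common way to avoid the leak is to put feature extraction and feature selection inside a Pipeline and cross validate the whole pipeline, so that both steps are refit on the training part of every fold. Below is a minimal sketch; the toy corpus, k=3 and cv=2 are placeholders chosen only to keep the example small, and the custom FeatureExtractor and Spellingchecker from the question are left out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy corpus standing in for the real documents and labels.
docs = [
    "good great film", "great fun movie", "good fun plot", "great good acting",
    "bad awful film", "awful boring movie", "bad boring plot", "awful bad acting",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Extraction, selection and the classifier are all refit inside each fold.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("select", SelectKBest(f_classif, k=3)),
    ("clf", SVC(kernel="linear", C=1)),
])

# One accuracy score per fold; the held-out fold is never seen during fitting.
scores = cross_val_score(pipe, docs, labels, cv=2)
```

A FeatureUnion like the one in the question can take the place of the single TfidfVectorizer step without changing the rest.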
