ValueError: Found input variables with inconsistent numbers of samples: [2, 44]

I have a piece of code where I clean the text in the 'Description' column and store the result in a new column called 'cleaned'.
I then build an ML model that uses this column as one of my features.
X = data[['originalname', 'cleaned']]
Y = data['Total score']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1, 2), stop_words="english", sublinear_tf=True)),
    ('chi', SelectKBest(chi2, k='all')),
    ('clf', LinearSVC(C=1.0, penalty='l1', max_iter=300, dual=False))])
X_train.shape--->(44, 2)
y_train.shape--->(44,)
Trying to train the model gives me the above error:
model = pipeline.fit(X_train, y_train)
How do I use both the original description and 'cleaned' as features to predict 'Total score' without getting this error?
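The [2, 44] mismatch comes from passing a two-column DataFrame to TfidfVectorizer, which expects a one-dimensional sequence of documents; iterating over a DataFrame yields its column names, so the vectorizer produces 2 "samples" while y_train has 44. One common way around this (a sketch, not from the original post) is to vectorize each text column separately with a ColumnTransformer and feed the concatenated output into the rest of the pipeline:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC

# One TF-IDF vectorizer per text column; the outputs are concatenated.
features = ColumnTransformer([
    ('tfidf_orig', TfidfVectorizer(ngram_range=(1, 2), stop_words="english", sublinear_tf=True), 'originalname'),
    ('tfidf_clean', TfidfVectorizer(ngram_range=(1, 2), stop_words="english", sublinear_tf=True), 'cleaned'),
])

pipeline = Pipeline([
    ('features', features),
    ('chi', SelectKBest(chi2, k='all')),
    ('clf', LinearSVC(C=1.0, penalty='l1', max_iter=300, dual=False))])

model = pipeline.fit(X_train, y_train)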

Related

I want to impute using my regressor model in PySpark

I am fairly new to PySpark. I am building my own ML-based imputation function to fill in missing values in a PySpark dataframe. I just want to know the correct syntax for selecting all values of the current row inside when(), so that I can pass those values as the argument to regressor.predict().
for Y in arrEmptyColumns:
    # =============== BUILDING MODEL ===============
    X_train, X_test, Y_train, Y_test = train_test_split(
        tblTempX.toPandas(), tblTempDataSet.select(Y).toPandas(), test_size=0.20)
    regressor = RandomForestRegressor(
        n_estimators=int(len(tblTempX.columns) * 0.50), random_state=0)

    # =============== TRAINING ===============
    regressor.fit(X_train, Y_train)
    Y_pred = regressor.predict(X_test)
    print("==============================")
    print("Root Mean Squared Error for ", Y, ": ")
    print(np.sqrt(mean_squared_error(Y_test, Y_pred)))
    print("R2_Score; Performance Accuracy: ")
    print(r2_score(Y_test, Y_pred))

    # =============== TESTING ===============
    tblFinalOutputDataSet = tblFinalOutputDataSet.withColumn(
        Y,
        when(tblFinalOutputDataSet[Y].isNull(),
             regressor.predict('''*code here to select all values of respective row to be used as argument for the model*''')))
    tblFinalOutputDataSet.display()
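One way to do that last step (a sketch, not from the original post; it assumes the feature columns are exactly those in tblTempX) is to wrap the fitted regressor in a UDF, since a scikit-learn model cannot be called directly inside when(). The current row's feature values are packed into a struct and handed to the UDF:

from pyspark.sql.functions import col, struct, udf, when
from pyspark.sql.types import DoubleType

# Score one row at a time with the fitted scikit-learn regressor.
def predict_row(row):
    return float(regressor.predict([list(row)])[0])

predict_udf = udf(predict_row, DoubleType())

# Pack the current row's feature values into a struct and impute only
# where the target column Y is null.
feature_struct = struct(*[col(c) for c in tblTempX.columns])
tblFinalOutputDataSet = tblFinalOutputDataSet.withColumn(
    Y,
    when(col(Y).isNull(), predict_udf(feature_struct)).otherwise(col(Y)))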

Making predictions using classification models with multiple independent variables in hand

I am trying to do a simple classification using logistic regression. I fit the model and scale the values with a StandardScaler. How can I make a single prediction after that? I am getting the same result for different inputs: every input is predicted as 0, and the single-input predictions do not match the results I get on the test dataset. Can someone please give me a hand?
dataset = pd.read_csv("Social_Network_Ads.csv")
x = dataset.iloc[:, 2:4].values
y = dataset.iloc[:, 4].values
print(dataset)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
classifier = LogisticRegression()
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)
x_values = [36, 36000]
x_values = np.array(x_values).reshape(1, -1)
x_values = scaler.transform(x_values)
pred = classifier.predict(x_values)
print("single prediction: ", pred)

Using Ray-Tune with sklearn's RandomForestClassifier

Putting together different base and documentation examples, I have managed to come up with this:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

def objective(config, reporter):
    for i in range(config['iterations']):
        model = RandomForestClassifier(
            random_state=0, n_jobs=-1, max_depth=None,
            n_estimators=int(config['n_estimators']),
            min_samples_split=int(config['min_samples_split']),
            min_samples_leaf=int(config['min_samples_leaf']))
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        # Feed the score back to tune?
        reporter(precision=precision_score(y_test, y_pred, average='macro'))

space = {'n_estimators': (100, 200),
         'min_samples_split': (2, 10),
         'min_samples_leaf': (1, 5)}

algo = BayesOptSearch(
    space,
    metric="precision",
    mode="max",
    utility_kwargs={
        "kind": "ucb",
        "kappa": 2.5,
        "xi": 0.0
    },
    verbose=3
)

scheduler = AsyncHyperBandScheduler(metric="precision", mode="max")

config = {
    "num_samples": 1000,
    "config": {
        "iterations": 10,
    }
}

results = run(objective,
              name="my_exp",
              search_alg=algo,
              scheduler=scheduler,
              stop={"training_iteration": 400, "precision": 0.80},
              resources_per_trial={"cpu": 2, "gpu": 0.5},
              **config)

print(results.dataframe())
print("Best config: ", results.get_best_config(metric="precision"))
It runs and I get a best configuration at the end. However, my doubt mainly lies in the objective function: do I have it written properly? I could not find any examples to compare against.
Follow up question:
What is num_samples in the config object? Is it the number of samples it will extract from the overall training data for each trial?
Tune now has native sklearn bindings: https://github.com/ray-project/tune-sklearn
Can you give that a shot instead?
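For example, roughly along these lines (a sketch based on the tune-sklearn README; exact argument names may differ between versions):

from tune_sklearn import TuneSearchCV
from sklearn.ensemble import RandomForestClassifier

param_dists = {
    "n_estimators": (100, 200),
    "min_samples_split": (2, 10),
    "min_samples_leaf": (1, 5),
}

# Drop-in replacement for RandomizedSearchCV that runs the trials on Ray Tune.
search = TuneSearchCV(
    RandomForestClassifier(random_state=0, n_jobs=-1),
    param_distributions=param_dists,
    n_trials=20,
    search_optimization="bayesian",   # Bayesian optimization instead of random search
    scoring="precision_macro",
)
search.fit(X_train, y_train)
print(search.best_params_)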
To answer your original question: the objective function looks good, and num_samples is the total number of hyperparameter configurations you want to try.
Also, you'll want to remove the for loop from your training function:
def objective(config, reporter):
    model = RandomForestClassifier(
        random_state=0, n_jobs=-1, max_depth=None,
        n_estimators=int(config['n_estimators']),
        min_samples_split=int(config['min_samples_split']),
        min_samples_leaf=int(config['min_samples_leaf']))
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Feed the score back to Tune
    reporter(precision=precision_score(y_test, y_pred, average='macro'))

Training and testing ML from two different sources

I am using sklearn for a classification task. I want to train my model on data from the table "train" and test it on data from a different table "test". Both tables have exactly the same features but different numbers of rows. I have the code below, but I am getting the error:
(<class 'ValueError'>, ValueError('Found input variables with inconsistent numbers of samples: [123, 174]',), <traceback object at 0x0000016476E10C48>).
What am I doing wrong?
get_train_data = 'select * from train;'
get_test_data = 'select * from test;'
df_train = pd.read_sql_query(get_train_data, con=connection)
df_test = pd.read_sql_query(get_test_data, con=connection)
X = df_train[:, 2:30]
Y = df_test[:, :30]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
model.fit(X_train, Y_train)
predictions = model.predict(X_test)
split_mat=confusion_matrix(Y_test, predictions)
If you want to train on dataframe df_train and test on dataframe df_test, why are you taking the features from df_train and the target from df_test and passing them to the train_test_split function?
You can simply do the following:
get_train_data = 'select * from train;'
get_test_data = 'select * from test;'

df_train = pd.read_sql_query(get_train_data, con=connection)
df_test = pd.read_sql_query(get_test_data, con=connection)

X_train = df_train.iloc[:, 2:30]
y_train = df_train.y           # assuming y is the name of your target variable in df_train
X_test = df_test.iloc[:, i:j]  # choose i and j so that you select the same columns as in X_train
y_test = df_test.y             # assuming y is the name of your target variable in df_test

model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Do something with predictions, e.g.
(predictions == y_test).mean()

Cross validation in classifying text documents using scikit-learn

When classifying text documents with scikit-learn, do you do cross-validation first and then feature extraction, or the other way around?
Here is my pipeline:
union = FeatureUnion(
    transformer_list=[
        ('tfidf', TfidfVectorizer()),
        ('featureEx', FeatureExtractor()),
        ('spell_chker', Spellingchecker()),
    ], n_jobs=-1)
I am doing it the following way, but I wonder whether I should extract the features first and then do the cross-validation. In this example, X is the list of documents and y are the labels.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train = union.fit_transform(X_train)
X_test = union.transform(X_test)

ch2 = SelectKBest(f_classif, k=7000)
X_train = ch2.fit_transform(X_train, y_train)
X_test = ch2.transform(X_test)

clf = SVC(C=1, gamma=0.001, kernel='linear', probability=True).fit(X_train, y_train)

print("classification report:")
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))
print()
Doing feature selection first and then cross-validating on the selected features is sometimes done with text data, but it is less desirable: it can lead to over-fitting, and the cross-validation procedure may over-estimate your true accuracy.
When you do the feature selection first, the feature selection process gets to look at all of the data. The point of cross-validation is to hide one fold from the others; by doing the feature selection first, you leak knowledge about the held-out fold into the folds used for training.
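A minimal leak-free sketch (assuming the union, X and y from the question): put the feature extraction, the SelectKBest step and the classifier into one Pipeline, so every learned step is re-fit on each training fold only:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Everything that learns from the data lives inside the pipeline,
# so each CV split fits it on the training fold only.
pipe = Pipeline([
    ('union', union),
    ('chi2', SelectKBest(f_classif, k=7000)),
    ('clf', SVC(C=1, gamma=0.001, kernel='linear', probability=True)),
])

scores = cross_val_score(pipe, X, y, cv=5, scoring='f1_macro')
print(scores.mean(), scores.std())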
