I have some questions about Normalization:
Suppose you extract features and want to normalize them before classification.
How do you normalize the features (e.g. for the two classes you have)?
1- Do you normalize each class separately, or do you normalize the two classes together?
2- Do you normalize the whole data before splitting it into training and testing sets? Or do you normalize the training set first, then normalize each new testing sample separately?
3- Any reference? A book or paper?
*Do you normalize the whole data before splitting it into training and testing sets?*
There is no need to split the data for training and testing
Code :
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
x = sc_X.fit_transform(X)
y = sc_y.fit_transform(y)
thanks
I have developed a Random Forest model with two inputs as X and one output as Y. I normalized both the X and Y values for training.
After the model was trained, I selected a dataset from another source as unseen input data. I normalized the X values, fed them to the trained model, and got a normalized Y value as output. I wonder how the denormalizing process works. That is, which value do I have to multiply the output by to get the denormalized value?
I'd appreciate it if someone can help me in this regard.
You need to invert the preprocessing. To do that, you need the mean and sd (standard deviation) values that were used for normalization.
For example with scikit learn you can do it easily. You can do it with 1 line of code.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data= ...
scaled_data = scaler.fit_transform(data)
inverse = scaler.inverse_transform(scaled_data)
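For the Random Forest case in the question, a minimal sketch could look like the one below. It assumes a regression target and uses synthetic data; the names sc_X, sc_y, X_new and the data itself are made up for illustration. The key point is that the scaler fitted on the training targets is kept and reused, and that for StandardScaler the inverse amounts to y = y_scaled * sd + mean.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic training data (assumption): two inputs, one output.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(100, 2))
y_train = 3.0 * X_train[:, 0] + 2.0 * X_train[:, 1] + 50.0

# Fit one scaler for the inputs and one for the output, on training data only.
sc_X = StandardScaler().fit(X_train)
sc_y = StandardScaler().fit(y_train.reshape(-1, 1))

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(sc_X.transform(X_train), sc_y.transform(y_train.reshape(-1, 1)).ravel())

# Unseen data: reuse the SAME fitted scalers, do not refit them.
X_new = np.array([[2.0, 3.0], [8.0, 1.0]])
y_new_scaled = model.predict(sc_X.transform(X_new))

# Denormalize with the library call ...
y_new = sc_y.inverse_transform(y_new_scaled.reshape(-1, 1)).ravel()

# ... which is the same as multiplying by the sd and adding the mean.
y_manual = y_new_scaled * sc_y.scale_[0] + sc_y.mean_[0]
```

If the scalers were created in a different script, they would need to be saved alongside the model so the same mean and sd are available at prediction time.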
Assume that I have 3 datasets in an ML problem.
train dataset: used to estimate ML model parameters (training)
test dataset: used to evaluate the trained model and calculate its accuracy
prediction dataset: used only for prediction after model deployment
I don't have an evaluation dataset, and I use Grid Search with k-fold cross validation to find the best model.
Also, I have two python scripts as follows:
train.py: used to train and test ML model, load train and test dataset, save the trained model, best model is found by Grid Search.
predict.py: used to load pre-trained model & load prediction dataset, predict model output and calculate accuracy.
Before starting training process in train.py, I use MinMaxScaler as follows:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(x_train) # fit only on train dataset
x_train_norm = scaler.transform(x_train)
x_test_norm = scaler.transform(x_test)
In predict.py, after loading the prediction dataset, I need to use the same data pre-processing as below:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(x_predict)
x_predict_norm = scaler.transform(x_predict)
As you can see above, both fit and transform are done on prediction dataset. However, in train.py, fit is done on train dataset, and the same MinMaxScaler is applied to transform test dataset.
My understanding is that test dataset is a simulation of real data that model is supposed to predict after deployment. Therefore, data pre-processing of test and prediction dataset should be the same.
I think separate MinMaxScaler should be used in train.py for train and test dataset as follows:
from sklearn.preprocessing import MinMaxScaler
scaler_train = MinMaxScaler()
scaler_test = MinMaxScaler()
scaler_train.fit(x_train) # fit only on train dataset
x_train_norm = scaler_train.transform(x_train)
scaler_test.fit(x_test) # fit only on test dataset
x_test_norm = scaler_test.transform(x_test)
What is the difference?
The values of x_test_norm will be different if I use separate MinMaxScalers as explained above. In this case, the values of x_test_norm are in the range [0, 1] (the default feature_range of MinMaxScaler). However, if I transform the test dataset with a MinMaxScaler that was fit on the train dataset, the values of x_test_norm can fall outside the range [0, 1].
Please let me know your idea about it.
When you run .transform(), MinMax scaling computes something like (value - min) / (max - min). The values of min and max are determined when you run .fit(). So the answer is yes: you should fit MinMaxScaler on the training dataset and then use it to transform the test dataset.
Just imagine a situation where some feature has max=100 and min=10 in the training dataset, while in the test dataset it has max=10 and min=1. If you fit a separate MinMaxScaler on the test subset, it will indeed scale the feature to the range [0, 1], but compared to the training dataset the scaled values should be lower.
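The numbers from this example can be checked with a small sketch. The arrays are made up to match the max=100/min=10 and max=10/min=1 scenario, and the variable names are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One feature; train spans 10..100, test spans 1..10 (illustrative values).
x_train = np.array([[10.0], [55.0], [100.0]])
x_test = np.array([[1.0], [5.5], [10.0]])

# Scaler fitted on train and reused for test: (value - 10) / (100 - 10).
shared = MinMaxScaler().fit(x_train)
x_train_norm = shared.transform(x_train)
x_test_shared = shared.transform(x_test)

# Separate scaler fitted on test: (value - 1) / (10 - 1).
separate = MinMaxScaler().fit(x_test)
x_test_separate = separate.transform(x_test)

print(x_train_norm.ravel())     # training values mapped onto [0, 1]
print(x_test_shared.ravel())    # at or below 0: lower than anything seen in training
print(x_test_separate.ravel())  # stretched onto [0, 1], hiding that difference
```

With the shared scaler a test value of 10 maps to the same number as a training value of 10. With the separate scaler it maps to the top of the range, so the model would treat it like a training value of 100.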
Also, regarding Grid Search with k-fold cross-validation, you should use a Pipeline. In this case, Grid Search will automatically fit MinMaxScaler on the k-1 training folds only. Here is a good example of how to organize a pipeline with mixed types.
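A rough sketch of how train.py and predict.py could share one fitted pipeline is shown below. The synthetic data, the LogisticRegression classifier, the parameter grid and the file name model.joblib are all assumptions for illustration, not part of the original scripts.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# --- train.py (sketch) ---
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler is a pipeline step, so each CV split refits it on its training folds.
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(x_train, y_train)

# Save the whole fitted pipeline (scaler + model) in one file.
model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(search.best_estimator_, model_path)

# --- predict.py (sketch) ---
loaded = joblib.load(model_path)
# No fit here: the stored scaler still holds the training min and max.
y_pred = loaded.predict(x_test)
```

Because the scaler travels with the model, predict.py never needs to call fit on the prediction dataset.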
I'm currently using sklearn for a school project and I have some questions about how GridSearchCV applies preprocessing algorithms such as PCA or Factor Analysis. Let's suppose I perform hold out:
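from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size = 0.1, stratify = y)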
Then, I declare some hyperparameters and perform a GridSearchCV (it would be the same with RandomizedSearchCV but whatever):
# make_pipeline names the SVC step 'svc', so its parameters use the 'svc__' prefix
params = {
    'svc__C' : [...],
    'svc__tol' : [...],
    'svc__degree' : [...]
}
clf = make_pipeline(PCA(), SVC(kernel='linear'))
model = GridSearchCV(clf, params, cv = 5, verbose = 2, n_jobs = -1)
model.fit(X_tr, y_tr)
My issue is: my teacher told me that you should never fit the preprocessing algorithm (here PCA) on the validation set in the case of k-fold CV, but only on the train split (here both the train split and the validation split are subsets of X_tr, and of course they change at every fold). So if I have PCA() here, it should be fit on the part of the fold used for training the model, and when I eventually test the resulting model against the validation split, that split should be preprocessed using the PCA model obtained by fitting on the training part. This ensures no leaks whatsoever.
Does sklearn account for this?
And if it does: suppose that now I want to use imblearn to perform oversampling on an unbalanced set:
clf = make_pipeline(SMOTE(), SVC(kernel='linear'))
Still according to my teacher, you shouldn't perform oversampling on the validation split as well, as this could lead to misleading accuracy estimates. So the statement above that held for PCA, about transforming the validation set at a later point, does not apply here.
Does sklearn/imblearn account for this as well?
Many thanks in advance
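One way to check the PCA part empirically is sketched below. As far as I understand, GridSearchCV clones the whole pipeline and refits it on the training part of every fold, so any transformer step only sees that part. RecordingTransformer and fit_sizes are hypothetical names introduced for this check, the data is synthetic, and the search runs with the default single process so that the shared list is visible.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

fit_sizes = []  # number of samples each fit() call received


class RecordingTransformer(TransformerMixin, BaseEstimator):
    """Stand-in for PCA: passes data through and records how much it was fit on."""

    def fit(self, X, y=None):
        fit_sizes.append(X.shape[0])
        return self

    def transform(self, X):
        return X


X_tr, y_tr = make_classification(n_samples=100, n_features=6, random_state=0)

clf = make_pipeline(RecordingTransformer(), SVC(kernel="linear"))
model = GridSearchCV(clf, {"svc__C": [0.1, 1.0]}, cv=5)
model.fit(X_tr, y_tr)

# 2 parameter values x 5 folds, each fit on the training part of a fold only,
# followed by one final refit on all of X_tr.
print(fit_sizes)
```

For oversampling, imbalanced-learn ships its own imblearn.pipeline.make_pipeline. Its documentation describes samplers such as SMOTE as being applied only during fit and skipped when predicting or scoring, so that pipeline (rather than sklearn's) seems to be the one to pass to GridSearchCV.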
So I was given Xtrain, ytrain, Xtest, ytest, Xvalid, yvalid data for a HW assignment. This assignment is for a Random Forest but I think my question can apply to any/most models.
So my understanding is that you use Xtrain and ytrain to fit the model such as (clf.fit(Xtrain, ytrain)) and this creates the model which can provide you a score and predictions for your training data
So when I move on to Test and Valid data sets, I only use ytest and yvalid to see how they predict and score. My professor provided us with three X dataset (Xtrain, Xtest, Xvalid), but to me I only need the Xtrain to train the model initially and then test the model on the different y data sets.
If I did .fit() for each pair of X, y, I would create/fit three different models from completely different data, so the models would not be comparable from my perspective.
Am I wrong?
Training step :
Assuming you are using sklearn, the clf.fit(Xtrain, ytrain) method trains your model (clf) to best fit the training data Xtrain and labels ytrain. At this stage, you can compute a score to evaluate your model on the training data, as you said.
#train step
clf = your_classifier
clf.fit(Xtrain, ytrain)
Test step :
Then, you have to feed the test data Xtest to the previously trained model in order to generate new labels ypred.
#test step
ypred = clf.predict(Xtest)
Finally, you have to compare these generated labels ypred with the true labels ytest to provide a robust evaluation of the model performance on unknown data (data not used during training) with tools like confusion matrix, metrics...
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
test_cm = confusion_matrix(ytest,ypred)
test_report = classification_report(ytest,ypred)
test_accuracy = accuracy_score(ytest, ypred)
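Putting the steps together for the three datasets in the question, a minimal sketch might look like this. The data is synthetic and split here only to have something to run; in the assignment Xtrain, Xvalid and Xtest are already given. RandomForestClassifier and its settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the provided train / valid / test sets.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
Xtrain, Xrest, ytrain, yrest = train_test_split(X, y, test_size=0.4, random_state=0)
Xvalid, Xtest, yvalid, ytest = train_test_split(Xrest, yrest, test_size=0.5, random_state=0)

# fit() is called once, on the training data only.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(Xtrain, ytrain)

# Xvalid and Xtest are still needed: they are the inputs the model predicts from.
# yvalid and ytest are only used to score those predictions.
valid_accuracy = accuracy_score(yvalid, clf.predict(Xvalid))
test_accuracy = accuracy_score(ytest, clf.predict(Xtest))
print(valid_accuracy, test_accuracy)
```

So the same fitted model is evaluated on all sets, which keeps the scores comparable.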
Just a brief idea of my situation:
I have 4 columns of input: id, text, category, label.
I used TfidfVectorizer on the text, which gives me a matrix of instances with a TF-IDF score per word token.
Now I'd like to include the category (no need to pass it through TF-IDF) as another feature in the data output by the vectorizer.
Also note that prior to the vectorization, the data have passed train_test_split.
How could I achieve this?
Initial code:
#initialization
import pandas as pd
path = r'data\data.csv'  # raw string so the backslash is not read as an escape
rappler= pd.read_csv(path)
X = rappler.text
y = rappler.label
#rappler.category - contains category for each instance
#split train test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
#feature extraction
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
#after or even prior to perform fit_transform, how can I properly add category as a feature?
X_test_dtm = vect.transform(X_test)
#actual classification
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)
#display result
from sklearn import metrics
print(metrics.accuracy_score(y_test,y_pred_class))
I would suggest doing your train test split after feature extraction.
Once you have the TF-IDF feature lists just add the other feature for each sample.
You will have to encode the category feature; a good choice would be sklearn's LabelEncoder. Then you should have two sets of numpy arrays that can be joined.
Here is a toy example:
import numpy as np
X_tfidf = np.array([[0.1, 0.4, 0.2], [0.5, 0.4, 0.6]])
X_category = np.array([[1], [2]])
X = np.concatenate((X_tfidf, X_category), axis=1)
At this point you would continue as you were, starting with the train test split.
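One caveat with the toy example: sklearn's text vectorizers return scipy sparse matrices rather than dense numpy arrays, so with real vectorizer output the join is usually done with scipy.sparse.hstack. The sketch below shows that variant on made-up texts and categories. It also swaps in OneHotEncoder for the category column, which is my substitution (LabelEncoder is documented as being meant for target labels).

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows standing in for the text and category columns.
texts = [
    "election results announced today",
    "team wins the final match",
    "senate passes new bill",
]
categories = np.array([["politics"], ["sports"], ["politics"]])

X_tfidf = TfidfVectorizer().fit_transform(texts)         # sparse, one column per token
X_category = OneHotEncoder().fit_transform(categories)   # sparse, one column per category

# Join the two sparse blocks side by side, keeping the result sparse.
X = hstack([X_tfidf, X_category]).tocsr()
print(X.shape)
```

The combined matrix has one row per sample and can be passed to a classifier such as MultinomialNB as before.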
You should use FeatureUnion, as explained in the documentation:
FeatureUnion combines several transformer objects into a new
transformer that combines their output. A FeatureUnion takes a list of
transformer objects. During fitting, each of these is fit to the data
independently. For transforming data, the transformers are applied in
parallel, and the sample vectors they output are concatenated
end-to-end into larger vectors.
Another good example on how to use FeatureUnions can be found here: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html
Just concatenating the different matrices like @AlexG suggests is probably an easier option, but FeatureUnion is the scikit-learn way to do these things.
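A possible FeatureUnion setup for the text plus category case is sketched below. It assumes the input is a pandas DataFrame with columns named text and category as in the question. The helper functions select_text and select_category, the toy rows and the choice of OneHotEncoder are illustrative assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder


def select_text(df):
    # 1-D column of strings, as the vectorizer expects
    return df["text"]


def select_category(df):
    # 2-D single-column frame, as the encoder expects
    return df[["category"]]


features = FeatureUnion([
    ("text", Pipeline([
        ("select", FunctionTransformer(select_text)),
        ("tfidf", TfidfVectorizer()),
    ])),
    ("category", Pipeline([
        ("select", FunctionTransformer(select_category)),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])),
])

model = Pipeline([("features", features), ("nb", MultinomialNB())])

# Toy rows standing in for the real data.
df = pd.DataFrame({
    "text": ["election results announced", "team wins final match",
             "senate passes new bill", "coach praises young striker"],
    "category": ["politics", "sports", "politics", "sports"],
})
labels = [0, 1, 0, 1]

model.fit(df, labels)
y_pred_class = model.predict(df)
X_combined = model.named_steps["features"].transform(df)
```

Because everything lives in one Pipeline, the vectorizer and encoder are fitted only on whatever data is passed to fit, which also fits well with a train_test_split done beforehand.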