Normalize and de-normalizing data in prediction model - random-forest

I have developed a Random Forest model which is including two inputs as X and one output as Y. I have normalized both X and Y values for the training process.
After the model get trained, I selected the dataset as an unseen data for an input for the model. The data is coming from another resource. I normalized the X values and imported them to the trained model and get the Y-normalized value as an output. I wonder how the de normalizing process would be. I mean I have to multiply the output by which value to get the denormalized value?
I'd appreciate it if someone can help me in this regard.

You need to do the prepossessing inversely. But, you the mean and sd (standard deviation) values that used for normalization.
For example with scikit learn you can do it easily. You can do it with 1 line of code.
enter code here
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data= ...
scaled_data = scaler.fit_transform(data)
inverse = scaler.inverse_transform(scaled_data)

Related

Should I use MinMaxScaler which was fit on train dataset to transform test dataset, or use a separate MinMaxScaler to fit and transform test dataset?

Assume that I have 3 dataset in a ML problem.
train dataset: used to estimate ML model parameters (training)
test dataset: used to evaulate trained model, calculate accuracy of trained model
prediction dataset: used only for prediction after model deployment
I don't have evaluation dataset, and I use Grid Search with k-fold cross validation to find the best model.
Also, I have two python scripts as follows:
train.py: used to train and test ML model, load train and test dataset, save the trained model, best model is found by Grid Search.
predict.py: used to load pre-trained model & load prediction dataset, predict model output and calculate accuracy.
Before starting training process in train.py, I use MinMaxScaler as follows:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(x_train) # fit only on train dataset
x_train_norm = scaler.transform(x_train)
x_test_norm = scaler.transform(x_test)
In predict.py, after loding prediction dataset, I need to use the same data pre-processing as below:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(x_predict)
x_predict_norm = scaler.transform(x_predict)
As you can see above, both fit and transform are done on prediction dataset. However, in train.py, fit is done on train dataset, and the same MinMaxScaler is applied to transform test dataset.
My understanding is that test dataset is a simulation of real data that model is supposed to predict after deployment. Therefore, data pre-processing of test and prediction dataset should be the same.
I think separate MinMaxScaler should be used in train.py for train and test dataset as follows:
from sklearn.preprocessing import MinMaxScaler
scaler_train = MinMaxScaler()
scaler_test = MinMaxScaler()
scaler_train.fit(x_train) # fit only on train dataset
x_train_norm = scaler_train.transform(x_train)
scaler_test.fit(x_test) # fit only on test dataset
x_test_norm = scaler_test.transform(x_test)
What is the difference?
Value of x_test_norm will be different if I use separate MinMaxScaler as explained above. In this case, value of x_test_norm is in the range of [-1, 1]. However, If I transform test dataset by a MinMaxScaler which was fit by train dataset, value of x_test_norm can be outside the range of [-1, 1].
Please let me know your idea about it.
When you run .transform() MinMax scaling does something like: (value - min) / (Max - min) The value of min and Max are defined when you run .fit(). So the answer - yes, you should fit MinMaxScaller on the training dataset and then use it on the test dataset.
Just imagine the situation when in the training dataset you have some feature with Max=100 and min=10, while in the test dataset Max=10 and min=1. If you will train separate MinMaxScaller for test subset, yes, it will scale the feature in the range [-1, 1], but in comparison to the training dataset, the called values should be lower.
Also, regarding Grid Search with k-fold cross-validation, you should use the Pipeline. In this case, Grid Search will automatically fit MinMaxScaller on the k-1 folds. Here is a good example of how to organize pipeline with Mixed Types.

Train Test Valid data sets... General question about fitting the models

So I was given Xtrain, ytrain, Xtest, ytest, Xvalid, yvalid data for a HW assignment. This assignment is for a Random Forest but I think my question can apply to any/most models.
So my understanding is that you use Xtrain and ytrain to fit the model such as (clf.fit(Xtrain, ytrain)) and this creates the model which can provide you a score and predictions for your training data
So when I move on to Test and Valid data sets, I only use ytest and yvalid to see how they predict and score. My professor provided us with three X dataset (Xtrain, Xtest, Xvalid), but to me I only need the Xtrain to train the model initially and then test the model on the different y data sets.
If i did .fit() for each pair of X,y I would create/fit three different models from completely different data so the models are not comparable from my perspective.
Am I wrong?
Training step :
Assuming your are using sklearn, the clf.fit(Xtrain, ytrain) method enables you to train your model (clf) to best fit the training data Xtrain and labels ytrain. At this stage, you can compute a score to evaluate your model on training data, as you said.
#train step
clf = your_classifier
clf.fit(Xtrain, ytrain)
Test step :
Then, you have to use the test data Xtest to feed the prior trained model in order to generate new labels ypred.
#test step
ypred = clf.predict(Xtest)
Finally, you have to compare these generated labels ypred with the true labels ytest to provide a robust evaluation of the model performance on unknown data (data not used during training) with tools like confusion matrix, metrics...
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
test_cm = confusion_matrix(ytest,ypred)
test_report = classification_report(ytest,ypred)
test_accuracy = accuracy_score(ytest, ypred)

How to load unlabelled data for sentiment classification after training SVM model?

I am trying to do sentiment classification and I used sklearn SVM model. I used the labeled data to train the model and got 89% accuracy. Now I want to use the model to predict the sentiment of unlabeled data. How can I do that? and after classification of unlabeled data, how to see whether it is classified as positive or negative?
I used python 3.7. Below is the code.
import random
import pandas as pd
data = pd.read_csv("label data for testing .csv", header=0)
sentiment_data = list(zip(data['Articles'], data['Sentiment']))
random.shuffle(sentiment_data)
train_x, train_y = zip(*sentiment_data[:350])
test_x, test_y = zip(*sentiment_data[350:])
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn import metrics
clf = Pipeline([
('vectorizer', CountVectorizer(analyzer="word",
tokenizer=word_tokenize,
preprocessor=lambda text: text.replace("<br />", " "),
max_features=None)),
('classifier', LinearSVC())
])
clf.fit(train_x, train_y)
pred_y = clf.predict(test_x)
print("Accuracy : ", metrics.accuracy_score(test_y, pred_y))
print("Precision : ", metrics.precision_score(test_y, pred_y))
print("Recall : ", metrics.recall_score(test_y, pred_y))
When I run this code, I get the output:
ConvergenceWarning: Liblinear failed to converge, increase the number of iterations. "the number of iterations.", ConvergenceWarning)
Accuracy : 0.8977272727272727
Precision : 0.8604651162790697
Recall : 0.925
What is the meaning of ConvergenceWarning?
Thanks in Advance!
What is the meaning of ConvergenceWarning?
As Pavel already mention, ConvergenceWArning means that the max_iteris hitted, you can supress the warning here: How to disable ConvergenceWarning using sklearn?
Now I want to use the model to predict the sentiment of unlabeled
data. How can I do that?
You will do it with the command: pred_y = clf.predict(test_x), the only thing you will adjust is :pred_y (this is your free choice), and test_x, this should be your new unseen data, it has to have the same number of features as your data test_x and train_x.
In your case as you are doing:
sentiment_data = list(zip(data['Articles'], data['Sentiment']))
You are forming a tuple: Check this out
then you are shuffling it and unzip the first 350 rows:
train_x, train_y = zip(*sentiment_data[:350])
Here you train_x is the column: data['Articles'], so all you have to do if you have new data:
new_ data = pd.read_csv("new_data.csv", header=0)
new_y = clf.predict(new_data['Articles'])
how to see whether it is classified as positive or negative?
You can run then: pred_yand there will be either a 1 or a 0 in your outcome. Normally 0 should be negativ, but it depends on your dataset-up
Check out this site about model's persistence. Then you just load it and call predict method. Model will return predicted label. If you used any encoder (LabelEncoder, OneHotEncoder), you need to dump and load it separately.
If I were you, I'd rather do full data-driven approach and use some pretrained embedder. It'll also work for dozens of languages out-of-the-box with is quite neat.
There's LASER from facebook. There's also pypi package, though unofficial. It works just fine.
Nowadays there's a lot of pretrained models, so it shouldn't be that hard to reach near-seminal scores.
Now I want to use the model to predict the sentiment of unlabeled data. How can I do that? and after classification of unlabeled data, how to see whether it is classified as positive or negative?
Basically, you aggregate unlabeled data in same way as train_x or test_x is generated. Probably, it's 2D matrix of shape n_samples x 1, which you would then use in clf.predict to obtain predictions. clf.predict outputs most probable class. In your case 0 is negative and 1 is positive, but it's hard to tell without the dataset.
What is the meaning of ConvergenceWarning?
LinearSVC model is optimized using iterative algorithm. There is an argument max_iter (1000 by default) that controls maximum amount of iterations. If stopping criteria wasn't met during this process, you will get ConvergenceWarning. It shouldn't bother you much, as long as you have acceptable performance in terms of accuracy, or other metrics.

Error while predicting a single value using a linear regression model

I'm a beginner and making a linear regression model, when I make predictions on the basis of test sets, it works fine. But when I try to predict something for a specific value. It gives an error. The tutorial I'm watching, they don't have any errors.
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Visualising the Linear Regression results
plt.scatter(X, y, color = 'red')
plt.plot(X, lin_reg.predict(X), color = 'blue')
plt.title('Truth or Bluff (Linear Regression)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
# Predicting a new result with Linear Regression
lin_reg.predict(6.5)
ValueError: Expected 2D array, got scalar array instead:
array=6.5.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
According to the Scikit-learn documentation, the input array should have shape (n_samples, n_features). As such, if you want a single example with a single value, you should expect the shape of your input to be (1,1).
This can be done by doing:
import numpy as np
test_X = np.array(6.5).reshape(-1, 1)
lin_reg.predict(test_X)
You can check the shape by doing:
test_X.shape
The reason for this is because the input can have many samples (i.e. you want to predict for multiple data points at once), or/and each sample can have many features.
Note: Numpy is a Python library to support large arrays and matrices. When scikit-learn is installed, Numpy should be installed as well.

Text Classification with scikit-learn: how to get a new document's representation from a pickle model

I have a document binomial classifier that uses a tf-idf representation of a training set of documents and applies Logistic Regression to it:
lr_tfidf = Pipeline([('vect', tfidf),('clf', LogisticRegression(random_state=0))])
lr_tfidf.fit(X_train, y_train)
I save the model in pickle and used it to classify new documents:
text_model = pickle.load(open('text_model.pkl', 'rb'))
results = text_model.predict_proba(new_document)
How can I get the representation (features + frequencies) used by the model for this new document without explicitly computing it?
EDIT: I am trying to explain better what I want to get.
Wen I use predict_proba, I guess that the new document is represented as a vector of term frequencies (according to the rules used in the model stored) and those frequencies are multiplied by the coefficients learnt by the logistic regression model to predict the class. Am I right? If yes, how can I get the terms and term frequencies of this new document, as used by predict_proba?
I am using sklearn v 0.19
As I understand from the comments, you need to access the tfidfVectorizer from inside the pipeline. This can be done easily by:
tfidfVect = text_model.named_steps['vect']
Now you can use the transform() method of the vectorizer to get the tfidf values.
tfidf_vals = tfidfVect.transform(new_document)
The tfidf_vals will be a sparse matrix of single row containing the tfidf of terms found in the new_document. To check what terms are present in this matrix, you need to use tfidfVect.get_feature_names().

Resources