I have a large dataset of events reported during device sessions. Each session contains many events, each defined by an event_id, an event_type, and a class label. How can I model such data into a dataset for a classification problem?
TL;DR: How would I convert a list of attributes (timestamp, device_id, event_type, event_id, label) into a dataset for classification?
Use LabelEncoder to convert the categorical columns into numeric codes.
ex:
# converting a categorical column into numeric codes
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
df["column"] = lb.fit_transform(df["column"])
I have developed a Random Forest model which takes two inputs as X and one output as Y. I normalized both the X and Y values for the training process.
After the model was trained, I selected a dataset of unseen data as input for the model; the data comes from another source. I normalized its X values, fed them to the trained model, and got the normalized Y value as output. I wonder how the denormalizing process works. By which value do I have to multiply the output to get the denormalized value?
I'd appreciate it if someone could help me in this regard.
You need to apply the preprocessing in reverse. For that, you need the mean and sd (standard deviation) values that were used for normalization.
With scikit-learn you can do this easily, in one line of code.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data = ...  # your original data
scaled_data = scaler.fit_transform(data)  # stores mean_ and scale_, then standardizes
inverse = scaler.inverse_transform(scaled_data)  # recovers the original values
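To answer the "multiply by which value" question directly: for StandardScaler the inverse relation is x_original = x_scaled * scale_ + mean_. A short check, continuing the snippet above:
import numpy as np
# Manual inversion: multiply by the stored standard deviation, add back the mean
manual_inverse = scaled_data * scaler.scale_ + scaler.mean_
assert np.allclose(manual_inverse, inverse)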
I am trying to normalize my data (with shape (23687, 7)), and I save the mean and std of the original dataset to "normalized_param.pkl".
After feeding the normalized data to my LSTM model, I get an output array (with shape (23687, 1)).
Now what I do is:
test_sc_path = os.path.join('normalized_standard', 'normalized_param.pkl')
test_scaler = load(test_sc_path)
test_denorm_value = test_scaler.inverse_transform(test_normalized_data)
ValueError: non-broadcastable output operand with shape (23687,1) doesn't match the broadcast shape (23687,7)
I think that's because the test_scaler object has parameters for 7 dimensions inside, so if I want to de-normalize only 1-dimensional data, I should use
test_scaler.mean_[-1] and test_scaler.scale_[-1] to get the last parameters and compute it by hand.
However, that seems quite complicated. Is there any sklearn method, like scaler.inverse_transform(), that I can easily use to solve this problem?
Thanks.
Yes, there is a method for it. See the scikit-learn documentation for StandardScaler.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data)  # Fits the data: stores the means & standard deviations
scaler.transform(data)  # Standardizes (normalizes) the data with the scaler parameters
scaler.fit_transform(data)  # Fits & transforms in one call
scaler.inverse_transform(data)  # Applies the inverse transformation to the input data
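Note that inverse_transform expects the same 7 columns the scaler was fitted on. For the single-column case above, a minimal sketch (assuming the target was the last of the 7 original columns, as the question's mean_[-1] suggests) is to invert that column by hand:
# test_normalized_data has shape (23687, 1); the scaler was fitted on 7 columns.
# Undo the standardization of the last column only: x = x_scaled * scale + mean
test_denorm_value = test_normalized_data * test_scaler.scale_[-1] + test_scaler.mean_[-1]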
So I was given Xtrain, ytrain, Xtest, ytest, Xvalid, yvalid data for a homework assignment. This assignment is for a Random Forest, but I think my question can apply to any/most models.
My understanding is that you use Xtrain and ytrain to fit the model, as in clf.fit(Xtrain, ytrain), and this creates the model, which can provide you a score and predictions for your training data.
So when I move on to the test and validation datasets, I only use ytest and yvalid to see how the model predicts and scores. My professor provided us with three X datasets (Xtrain, Xtest, Xvalid), but to me it seems I only need Xtrain to train the model initially and can then test the model on the different y datasets.
If I did .fit() for each pair of X, y, I would create/fit three different models from completely different data, so the models would not be comparable from my perspective.
Am I wrong?
Training step:
Assuming you are using sklearn, the clf.fit(Xtrain, ytrain) method lets you train your model (clf) to best fit the training data Xtrain and labels ytrain. At this stage you can compute a score to evaluate your model on the training data, as you said.
# train step
clf = your_classifier
clf.fit(Xtrain, ytrain)
Test step:
Then, you have to feed the test data Xtest to the previously trained model in order to generate predicted labels ypred.
# test step
ypred = clf.predict(Xtest)
Finally, you compare these predicted labels ypred with the true labels ytest to get a robust evaluation of the model's performance on unknown data (data not used during training), with tools like a confusion matrix and other metrics:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
test_cm = confusion_matrix(ytest, ypred)
test_report = classification_report(ytest, ypred)
test_accuracy = accuracy_score(ytest, ypred)
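As for Xvalid and yvalid: a common pattern (an assumption about the assignment's intent, not something it states) is to use the validation set to choose between variants of the model, each still trained only on Xtrain, and keep the test set for the final score:
# Sketch: pick hyperparameters on the validation set, report on the test set
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
best_model, best_score = None, -1.0
for n in [50, 100, 200]:  # candidate n_estimators values
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    clf.fit(Xtrain, ytrain)
    score = accuracy_score(yvalid, clf.predict(Xvalid))
    if score > best_score:
        best_model, best_score = clf, score
test_accuracy = accuracy_score(ytest, best_model.predict(Xtest))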
I want to know: is there any way we can partially save a Scikit-Learn machine learning model and reload it again to train it from the point where it was saved?
For scikit-learn models applied to sentiment analysis, I suspect you need to save two important things: 1) your model, and 2) your vectorizer.
Remember that after training, your words are represented by a vector of length N, where N is defined by your total number of words.
Below is a piece from my test model and test vectorizer, saved in order to be used later.
SAVING THE MODEL
import pickle
# Persist both the fitted vectorizer and the fitted classifier
pickle.dump(vectorizer, open("model5vectorizer.pickle", "wb"))
pickle.dump(classifier_fitted, open("model5.pickle", "wb"))
LOADING THE MODEL IN A NEW SCRIPT (.py)
import pickle
model = pickle.load(open("model5.pickle", "rb"))
vectorizer = pickle.load(open("model5vectorizer.pickle", "rb"))
TEST YOUR MODEL
sentence_test = ["Results by Andutta et al (2013), were completely wrong and unrealistic."]
USING THE VECTORIZER (model5vectorizer.pickle) !!
sentence_test_data = vectorizer.transform(sentence_test)
print("### sentence_test ###")
print(sentence_test)
print("### sentence_test_data ###")
print(sentence_test_data)
# OBS-1: VECTOR HERE WILL HAVE SAME LENGTH AS BEFORE :)
# OBS-2: If you load the default vectorizer or a different one, then you may see the following problems
# sklearn.exceptions.NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.
# ValueError: X has 8 features per sample; expecting 11
result1 = model.predict(sentence_test_data) # using saved vectorizer from calibrated model
print("### RESULT ###")
print(result1)
Hope that helps.
Regards,
Andutta
When a dataset is fitted to a scikit-learn model, the model is trained and supposedly ready to be used for prediction. If you train a model on, let's say, 100 samples, use it, and then go back and fit another 50 samples to it, you will not make it better: you will rebuild it from scratch on the new data.
If your purpose is to build a model that gets more powerful as it sees more samples, you are thinking of an online, real-time setting, such as a mobile robot mapping an environment with a Kalman filter.
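That said, if incremental training is what you're after, some scikit-learn estimators do support it via partial_fit; this is a different mechanism from the plain fit() discussed above, and SGDClassifier below is just one example of such an estimator (the batch names are placeholders):
import pickle
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier()
clf.partial_fit(X_batch1, y_batch1, classes=all_classes)  # first batch; classes is required on the first call
pickle.dump(clf, open("partial_model.pickle", "wb"))  # save mid-training
clf = pickle.load(open("partial_model.pickle", "rb"))  # reload later
clf.partial_fit(X_batch2, y_batch2)  # continue training where you left off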
I have a handwritten-character dataset for classification purposes where the classes are from a-z. If I want to use MLPClassifier, I think I cannot use such categorical classes directly, because the MLP implementation in scikit-learn only handles numerical classes. So what is the appropriate thing to do here? How about converting these classes to be numbered from 1-28; does that make sense? If not, does scikit-learn provide a special encoding mechanism for class labels to handle this case (I guess one-hot encoding is not an option here)?
Thank you
You may need to preprocess the data, as scikit-learn only handles numeric values. In my case I wanted to predict the currency of a transaction. The currency is expressed as an ISO code, so LabelEncoder was used to transform it into numeric categories (i.e. 0, 1, 2, ...):
# Import LabelEncoder and numpy
import numpy as np
from sklearn.preprocessing import LabelEncoder
# Encode the class column
my_encoder = LabelEncoder()
my_class_currency = np.array(my_encoder.fit_transform(my_data['currency'])).reshape(-1, 1)
# Create a "dictionary" to translate the numeric categories back into the
# actual values once you have the output
my_class_decoder = list(np.unique(my_data['currency']))
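A follow-up note: LabelEncoder can also decode predictions directly with its inverse_transform method, so the manual decoder list is optional:
# Decode numeric predictions back into the original currency codes
predicted_codes = [0, 2, 1]  # hypothetical model output
predicted_labels = my_encoder.inverse_transform(predicted_codes)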