Handle categorical class labels for scikit-learn MLPClassifier

I have a handwritten-character dataset for a classification task where the classes are the letters a-z. If I want to use MLPClassifier, I don't think I can use such categorical classes directly, because the MLP implementation in scikit-learn only handles numerical classes. So what is the appropriate thing to do here? Does it make sense to convert these classes to numbers from 1-26? If not, does scikit-learn provide a special encoding mechanism for class labels to handle this case (I guess one-hot encoding is not an option here)?
Thank you

You need to preprocess the labels, since scikit-learn only handles numeric values. In my case I wanted to predict the currency of a transaction. The currency is expressed as an ISO code, so LabelEncoder was used to transform it into numeric categories (i.e. 0, 1, 2, ...):
import numpy as np
# Import the LabelEncoder class
from sklearn.preprocessing import LabelEncoder

# Encode the class column as integers
my_encoder = LabelEncoder()
my_class_currency = np.array(my_encoder.fit_transform(my_data['currency'])).reshape(-1, 1)

# Build a lookup list to translate the numeric categories back into the actual values once you have the output
my_class_decoder = list(np.unique(my_data['currency']))
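
For the a-z labels in the original question, a minimal sketch along the same lines (with made-up data, so X and letters are placeholder names) would encode the letters, train the MLPClassifier on the integer labels, and use inverse_transform to map predictions back to letters:
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder

# Placeholder data: X is the feature matrix, letters are the a-z class labels
X = np.random.rand(100, 20)
letters = np.random.choice(list('abcdefghijklmnopqrstuvwxyz'), size=100)

encoder = LabelEncoder()
y = encoder.fit_transform(letters)                             # 'a'..'z' -> 0..25

clf = MLPClassifier(max_iter=500).fit(X, y)
predicted_letters = encoder.inverse_transform(clf.predict(X))  # back to 'a'..'z'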

Related

Specifying class or sample weights in Keras for one-hot encoded labels in a TF Dataset

I am trying to train an image classifier on an unbalanced training set. To cope with the class imbalance, I want either to weight the classes or the individual samples. Weighting the classes does not seem to work, and somehow for my setup I was not able to find a way to specify the sample weights. Below you can read how I load and encode the training data and the two approaches that I tried.
Training data loading and encoding
My training data is stored in a directory structure where each image is placed in the subfolder corresponding to its class (I have 32 classes in total). Since the training data is too big to load into memory all at once, I make use of image_dataset_from_directory and thereby describe the data as a TF Dataset:
train_ds = keras.preprocessing.image_dataset_from_directory(training_data_dir,
                                                             batch_size=batch_size,
                                                             image_size=img_size,
                                                             label_mode='categorical')
I use label_mode 'categorical', so that the labels are described as a one-hot encoded vector.
I then prefetch the data:
train_ds = train_ds.prefetch(buffer_size=buffer_size)
Approach 1: specifying class weights
In this approach I try to specify the class weights of the classes via the class_weight argument of fit:
model.fit(
    train_ds, epochs=epochs, callbacks=callbacks, validation_data=val_ds,
    class_weight=class_weights
)
For each class we compute a weight that is inversely proportional to the number of training samples for that class. This is done as follows (before the train_ds.prefetch() call described above):
class_num_training_samples = {}
for f in train_ds.file_paths:
    class_name = f.split('/')[-2]
    if class_name in class_num_training_samples:
        class_num_training_samples[class_name] += 1
    else:
        class_num_training_samples[class_name] = 1
max_class_samples = max(class_num_training_samples.values())
class_weights = {}
for i in range(0, len(train_ds.class_names)):
    class_weights[i] = max_class_samples / class_num_training_samples[train_ds.class_names[i]]
What I am not sure about is whether this solution works, because the Keras documentation does not specify the keys of the class_weight dictionary in case the labels are one-hot encoded.
I tried training the network this way but found that the weights did not have a real influence on the resulting network: when I looked at the distribution of predicted classes for each individual class, I could recognize the distribution of the overall training set, where for every class the dominant classes are predicted most often.
Running the same training without any class weight specified led to similar results.
So I suspect that the weights don't seem to have an influence in my case.
Is this because specifying class weights does not work for one-hot encoded labels, or is this because I am probably doing something else wrong (in the code I did not show here)?
Approach 2: specifying sample weight
In an attempt to come up with a different (in my opinion less elegant) solution, I wanted to specify the individual sample weights via the sample_weight argument of the fit method. However, in the documentation I find:
[...] This argument is not supported when x is a dataset, generator, or keras.utils.Sequence instance, instead provide the sample_weights as the third element of x.
This is indeed the case in my setup, where train_ds is a dataset. Now I am really having trouble finding documentation from which I can derive how to modify train_ds so that it has a third element with the weight. I thought the map method of a dataset could be useful, but the solution I came up with is apparently not valid:
train_ds = train_ds.map(lambda img, label: (img, label, class_weights[np.argmax(label)]))
Does anyone have a solution that may work in combination with a dataset loaded by image_dataset_from_directory?
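
One direction that may work, sketched here as an assumption rather than a tested answer: keep the per-class weights computed in Approach 1, but do the lookup with TensorFlow ops inside map, so it operates on the symbolic one-hot label tensors of the batched dataset and yields the sample weights as the third element:
import tensorflow as tf

# Hypothetical sketch: class_weights is the integer-keyed dict computed in Approach 1
weight_tensor = tf.constant([class_weights[i] for i in range(len(class_weights))],
                            dtype=tf.float32)

def add_sample_weights(img, label):
    # label is one-hot with shape (batch, num_classes); argmax recovers the class index per sample
    return img, label, tf.gather(weight_tensor, tf.argmax(label, axis=-1))

train_ds = train_ds.map(add_sample_weights)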

Use an array feature in RandomForest without flattening

How can I use array features in RandomForest without flattening the input?
import numpy as np
from sklearn.ensemble import RandomForestClassifier
array_feature = np.array([0,0,1])
train_x = np.matrix([[1, 2, array_feature], [3, 4, array_feature], [1, 1, array_feature]])
train_y = np.array([1,0,1])
clf_rf = RandomForestClassifier(n_estimators=2)
clf_rf.fit(train_x, train_y)
ValueError: setting an array element with a sequence.
You can't.
In sklearn, most models can only use numerical data, and preprocessing is done separately. Tree models (in sklearn) in particular can only make splits on whether a given feature is less than or greater than a given value. You can either flatten the arrays, or provide some encoding for them, depending on what those arrays represent.
*(Tree models in other packages, and perhaps soon in sklearn, can treat categorical variables directly. Ordinal variables get treated just like continuous ones, and unordered categorical variables can be split into arbitrary bipartitions in CART or cause multiple-arity splits in Quinlan-family trees. But even then you would still need to inform the model whether your arrays should be treated as ordinal, unordered categorical, or ...)
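
For the example in the question, "flattening" simply means spreading the array feature into extra scalar columns. A minimal sketch of that, reusing the question's variables, could look like:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

array_feature = np.array([0, 0, 1])
# Flatten: concatenate the scalar features and the array feature into one flat row per sample
train_x = np.array([np.concatenate(([1, 2], array_feature)),
                    np.concatenate(([3, 4], array_feature)),
                    np.concatenate(([1, 1], array_feature))])
train_y = np.array([1, 0, 1])

clf_rf = RandomForestClassifier(n_estimators=2)
clf_rf.fit(train_x, train_y)  # each sample is now a flat numeric vector of length 5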

Is there any method like scaler.inverse_transform() to get partial scaler params to de-normalize the answer?

I am trying to normalize my data (with shape (23687, 7)), and I save the mean and std of the original dataset to "normalized_param.pkl".
After fitting the normalized data to my LSTM model, I get an output array (with shape (23687, 1)).
Now what I want to do is:
test_sc_path = os.path.join('normalized_standard', 'normalized_param.pkl')
test_scaler = load(test_sc_path)
test_denorm_value = test_scaler.inverse_transform(test_normalized_data)
ValueError: non-broadcastable output operand with shape (23687,1) doesn't match the broadcast shape (23687,7)
I think that's because the test_scaler object holds parameters for all 7 dimensions, so if I want to de-normalize only one dimension of the data, I would have to use
test_scaler.mean_[-1] and test_scaler.scale_[-1] to get the last parameters and compute it myself.
However, I find that quite cumbersome. Is there any sklearn method, just like scaler.inverse_transform(), that I can easily use to solve this problem?
Thanks
Yes, there is a method for it. See the StandardScaler documentation.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(data)                # Fit the scaler: stores the mean & standard deviation of each column
scaler.transform(data)          # Standardize (normalize) the data with the stored parameters
scaler.fit_transform(data)      # Fit & transform in one step
scaler.inverse_transform(data)  # Apply the inverse transformation to the input data
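
If only one of the 7 columns needs to be de-normalized (as in the question, where the model output has shape (23687, 1)), there is, as far as I know, no built-in partial inverse_transform. A small sketch of the manual route the question already hints at, assuming the target was the last column when the scaler was fitted:
# mean_ and scale_ hold one value per original column; undo the standardization of the last column only
test_denorm_value = test_normalized_data * test_scaler.scale_[-1] + test_scaler.mean_[-1]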

Do multi-class classification labels need to be converted from float values into one-hot encoding?

I have the iris example, where the label column in the sample data has three float values:
0.0, 1.0, 2.0.
My guess is that the ordering relationship between these values might mislead the training model.
Am I right? Should the labels be converted into three vectors using one-hot encoding or some other way?
from keras.utils import np_utils
trainY = np_utils.to_categorical(trainY)
In the iris dataset, there are three possible labels.
When converted to a number you get 0, 1, 2 discrete integers. So, you have three classes for the classification problem.
If you convert them to one-hot, then use categorical_crossentropy; otherwise use sparse_categorical_crossentropy.
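
As a rough sketch of the two options (model is assumed to be a Keras model with a 3-unit softmax output, and trainX is a placeholder name for the features):
# Option 1: one-hot labels (shape (n_samples, 3)) with categorical_crossentropy
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(trainX, np_utils.to_categorical(trainY))

# Option 2: integer labels 0, 1, 2 (shape (n_samples,)) with sparse_categorical_crossentropy
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(trainX, trainY)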

Text Classification with scikit-learn: how to get a new document's representation from a pickle model

I have a binary document classifier that uses a tf-idf representation of a training set of documents and applies Logistic Regression to it:
lr_tfidf = Pipeline([('vect', tfidf),('clf', LogisticRegression(random_state=0))])
lr_tfidf.fit(X_train, y_train)
I save the model with pickle and use it to classify new documents:
text_model = pickle.load(open('text_model.pkl', 'rb'))
results = text_model.predict_proba(new_document)
How can I get the representation (features + frequencies) used by the model for this new document without explicitly computing it?
EDIT: I am trying to explain better what I want to get.
When I use predict_proba, I guess that the new document is represented as a vector of term frequencies (according to the rules stored in the model), and those frequencies are multiplied by the coefficients learnt by the logistic regression model to predict the class. Am I right? If so, how can I get the terms and term frequencies of this new document, as used by predict_proba?
I am using sklearn v 0.19
As I understand from the comments, you need to access the TfidfVectorizer from inside the pipeline. This can be done easily by:
tfidfVect = text_model.named_steps['vect']
Now you can use the transform() method of the vectorizer to get the tfidf values.
tfidf_vals = tfidfVect.transform(new_document)
tfidf_vals will be a sparse matrix with a single row, containing the tf-idf values of the terms found in new_document. To check which terms those columns correspond to, use tfidfVect.get_feature_names().
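
Putting the two together, a small sketch (using the names from this answer) that maps the non-zero columns of that single row back to the actual terms and their tf-idf values:
# Pair each non-zero column of the single-row sparse matrix with its term
feature_names = tfidfVect.get_feature_names()
row = tfidf_vals.tocoo()
doc_representation = {feature_names[col]: val for col, val in zip(row.col, row.data)}
print(doc_representation)  # e.g. {'some_term': 0.12, ...}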
