The expanding mean is a way to prevent overfitting when performing target encoding. What I do not understand is how to use this technique to fit on the train set and transform the test set, because the encoding is computed dynamically: the encoding value for a given feature level varies from row to row, since it depends on a cumulative sum.
cumulative_sum = training.groupby(column)["target"].cumsum() - training["target"]  # sum of the previous targets for this level
cumulative_count = training.groupby(column).cumcount()                             # number of previous rows with this level
# the first occurrence of a level gives 0/0 = NaN, so fall back to the global mean
train_new[column + "_mean_target"] = (cumulative_sum / cumulative_count).fillna(training["target"].mean())
Shouldn't you simply map the mean values of the target variable calculated for different categories to the corresponding categories in your test set? The cumulative means are needed only for the training part for regularization purposes.
I would also be interested to know how to compute the mean encoding for the test set.
For now, I am recomputing the mean over the train set and assigning the values to the test set.
test.merge(training.groupby(column)["target"].mean().reset_index(), on=column, how="left")
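Putting the two parts together, a minimal sketch of the full fit/transform flow (the DataFrame and column names are placeholders, not anyone's actual code):
import pandas as pd

def expanding_mean_encode(training, test, column, target="target"):
    # "fit" on the train set: expanding mean, each row only sees the previous rows of its level
    cum_sum = training.groupby(column)[target].cumsum() - training[target]
    cum_cnt = training.groupby(column).cumcount()
    train_encoded = (cum_sum / cum_cnt).fillna(training[target].mean())
    # "transform" the test set: plain per-level mean computed on the full train set
    level_means = training.groupby(column)[target].mean()
    test_encoded = test[column].map(level_means).fillna(training[target].mean())
    return train_encoded, test_encoded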
I have a classification problem with 20 classes. I have designed a neural network and am using categorical_crossentropy as the loss.
When dealing with categorical cross entropy, the output labels must be one-hot encoded.
When I one-hot encoded the output labels, each label became a one-hot encoded row of a matrix, while with the label encoder I got the same information as a 1-D array.
from sklearn.preprocessing import OneHotEncoder
import numpy as np
oht = OneHotEncoder()
y_train_oht = oht.fit_transform(np.array(y_train).reshape(-1, 1))
Below is the snippet for label encoding:
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical
le = LabelEncoder()
y_train_le = le.fit_transform(y_train)
y_train_le_cat = to_categorical(y_train_le)
[sample output screenshots of the one-hot encoding and the label encoding omitted]
I find that one-hot encoding gives a matrix while label encoding gives an array. If one-hot encoding does the same job, why do we have a label encoder at all? What kind of optimization does the label encoder bring?
If the label encoder happens to be more efficient, why do we not use it to encode categorical input data instead of one-hot encoding?
Label encoding imposes an artificial order: if you label-encode your pet target as 'Dog': 0, 'Cat': 1, 'Turtle': 2, 'Golden Fish': 3, then you get the awkward situation where 'Dog' < 'Cat' and 'Turtle' is the average of 'Cat' and 'Golden Fish'.
In the case of predictor features (not the target), this is a problem, since your Random Forest can end up learning something like "if it is less than 'Turtle', then...".
Also, you may have categories in the testing set (or even worse, new data during deployment) that were not present in the training, and the transformer doesn't know what to do, so it throws an error. This may be the case or not depending on the particular problem and particular feature you are encoding, obviously not for the target variable.
When one-hot encoding, if a category that was absent from the training set shows up at prediction time, it just gets encoded as 0 in each of the encoded features (the new columns representing each category), so you don't get an error. Your model still has the other features to make a reasonable guess.
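To illustrate with scikit-learn (toy pet data; note that OneHotEncoder only behaves this way when constructed with handle_unknown='ignore', otherwise it raises an error on unseen categories too):
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

train = np.array([['Dog'], ['Cat'], ['Turtle']])
test = np.array([['Golden Fish']])                  # level never seen during fit

ohe = OneHotEncoder(handle_unknown='ignore').fit(train)
print(ohe.transform(test).toarray())                # [[0. 0. 0.]] -- all zeros, no error

le = LabelEncoder().fit(train.ravel())
# le.transform(test.ravel())                        # would raise ValueError: unseen labels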
As a general rule, you want to use label encoding for target variables and OHE for predictor features. Note that in general you don't care about artificial order in the target, since the prediction is usually categorical also (A forest will choose a number, not a range of numbers; a network will have one activation unit per category...)
I don't think optimization should be part of the discussion here since they are used for different scenarios demanding different outputs: surely it's more efficient to use the OHE transformer than trying to hack it by performing label encoding and then some pandas trickery to create the same result as with one hot encoding.
Here there are useful comments about the different scenarios (type of model, type of data) and some issues related to efficiency.
Here there's an example on why label encoding is a bad practice for input features.
And let's not forget that the goal of the model is to make predictions, so in the end what matters is not just the output of <transformer>.fit_transform, but also the fitted transformer itself, which is going to be applied to new observations. OHE deals with new cases differently than a label encoder (e.g. when the value of the feature in the observation was not present in the training set). That is, in my opinion, enough reason to have different methods, even when they act similarly enough that, for some inputs, you may be able to force them to give similar outputs.
I have a set of inputs with 5000ish features whose values vary from 0.005 to 9000000. Each individual feature has values of a similar magnitude (a feature with values around 10 will not also have values around 0.1).
I am trying to apply linear regression to this data set, however, the wide range of input values is inhibiting effective gradient descent.
What is the best way to handle this variance? If normalization is best, please include details on the best way to implement this normalization.
Thanks!
Simply perform it as a pre-processing step. You can do it as follows:
1) Calculate the mean of each feature over the training set and store it. Be careful not to confuse the feature mean with the sample mean: you want a vector of size [number_of_features] (5000ish).
2) Calculate the standard deviation of each feature over the training set and store it; also a vector of size [number_of_features].
3) Update each training and testing entry as:
updated = (original_vector - mean_vector)/ std_vector
That's it!
The code will look like:
import numpy as np

# train_data shape: [train_length, 5000]
# test_data shape:  [test_length, 5000]
mean = np.mean(train_data, axis=0)  # per-feature mean, shape [5000]
std = np.std(train_data, axis=0)    # per-feature std, shape [5000]
normalized_train_data = (train_data - mean) / std
normalized_test_data = (test_data - mean) / std
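For what it's worth, scikit-learn's StandardScaler implements exactly this fit-on-train / transform-both pattern, so you don't have to do the bookkeeping yourself (assuming train_data and test_data are the arrays from above):
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
normalized_train_data = scaler.fit_transform(train_data)  # learns per-feature mean and std from the train set
normalized_test_data = scaler.transform(test_data)        # reuses the train statistics, so no leakage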
I have one dataset and need to do cross-validation, for example a 10-fold cross-validation, on the entire dataset. I would like to use a radial basis function (RBF) kernel with parameter selection (there are two parameters for an RBF kernel: C and gamma). Usually, people select the hyperparameters of an SVM using a dev set, and then apply the best hyperparameters found on the dev set to the test set for evaluation. However, in my case the original dataset is partitioned into 10 subsets, and sequentially one subset is tested using the classifier trained on the remaining 9 subsets. So obviously we do not have fixed training and test data. How should I do hyperparameter selection in this case?
Is your data partitioned into exactly those 10 partitions for a specific reason? If not, you could shuffle and concatenate them again and then do regular (repeated) cross-validation to perform a parameter grid search. For example, using 10 partitions and 10 repeats gives a total of 100 training and evaluation sets. These are used to train and evaluate every parameter set, so you get 100 results per parameter set you try, and the average performance per parameter set can then be computed from those 100 results.
This process is already built into most ML tools, as in this short example in R using the caret library:
library(caret)
library(lattice)
library(doMC)
registerDoMC(3)
model <- train(x = iris[, 1:4],
               y = iris[, 5],
               method = 'svmRadial',
               preProcess = c('center', 'scale'),
               tuneGrid = expand.grid(C = 3**(-3:3), sigma = 3**(-3:3)),  # all permutations of these parameters get evaluated
               trControl = trainControl(method = 'repeatedcv',
                                        number = 10,
                                        repeats = 10,
                                        returnResamp = 'all',  # store results of all parameter sets on all partitions and repeats
                                        allowParallel = T))
# performance of different parameter set (e.g. average and standard deviation of performance)
print(model$results)
# visualization of the above
levelplot(x = Accuracy~C*sigma, data = model$results, col.regions=gray(100:0/100), scales=list(log=3))
# results of all parameter sets over all partitions and repeats. From this the metrics above get calculated
str(model$resample)
Once you have evaluated a grid of hyperparameters you can choose a reasonable parameter set ("model selection"), e.g. by picking a model that performs well while still being reasonably simple.
BTW: I would recommend repeated cross-validation over plain cross-validation if possible (possibly with more than 10 repeats, but the details depend on your problem); and as #christian-cerri already recommended, it is a good idea to have an additional, unseen test set that is used to estimate the performance of your final model on new data.
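If you work in Python rather than R, a roughly equivalent repeated-CV grid search can be sketched with scikit-learn (X and y are placeholders for your features and labels):
import numpy as np
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

param_grid = {"svc__C": 3.0 ** np.arange(-3, 4),
              "svc__gamma": 3.0 ** np.arange(-3, 4)}

# 10 folds x 10 repeats = 100 train/evaluation splits per parameter set
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
search = GridSearchCV(make_pipeline(StandardScaler(), SVC(kernel="rbf")),
                      param_grid, cv=cv, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)  # per-parameter-set averages are in search.cv_results_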
I was using Vowpal Wabbit and was generating the classifier trained as a readable model.
My dataset had 22 features and the readable model gave as output:
Version 7.2.1
Min label:-50.000000
Max label:50.000000
bits:18
0 pairs:
0 triples:
rank:0
lda:0
0 ngram:
0 skip:
options:
:0
101143:0.035237
101144:0.033885
101145:0.013357
101146:-0.007537
101147:-0.039093
101148:-0.013357
101149:0.001748
116060:0.499471
157941:-0.037318
157942:0.008038
157943:-0.011337
196772:0.138384
196773:0.109454
196774:0.118985
196775:-0.022981
196776:-0.301487
196777:-0.118985
197006:-0.000514
197007:-0.000373
197008:-0.000288
197009:-0.004444
197010:-0.006072
197011:0.000270
Can somebody please explain to me how to interpret the last portion of the file (after options:)? I was using logistic regression and I need to check how iterating over the training data updates my classifier, so that I can understand when I reach convergence...
Thanks in advance :)
The values you see are the hash-values and weights of all your 22 features and one additional "Constant" feature (its hash value is 116060) in the resulting trained model.
The format is:
hash_value:weight
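For illustration, a minimal Python sketch (the file name readable.model is hypothetical) that pulls these hash/weight pairs out of such a readable model:
import re

weights = {}
with open("readable.model") as f:
    for line in f:
        m = re.match(r"^(\d+):(-?\d+\.?\d*)$", line.strip())
        if m:  # weight lines look like "116060:0.499471"; header lines don't match
            weights[int(m.group(1))] = float(m.group(2))

print(len(weights), "features with non-default weights")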
In order to see your original feature names instead of the hash value, you may use one of two methods:
Use the utl/vw-varinfo (in the source tree) utility on your training set with the same options you used for training. Try utl/vw-varinfo for a help/usage message
Use the relatively new --invert_hash readable.model option
BTW: inverting the hash values back to the original feature names is not the default due to the large performance penalty. By default, vw applies the one way hash on each feature string it sees. It doesn't maintain a hash-map between feature-names and their hash-values at all.
Edit:
Another little tidbit that may be of interest is the first entry after options: which reads:
:0
It essentially means that any "other" feature (all those which are not in the training set, and thus not hashed into the weight vector) defaults to a weight of 0. This means that it is redundant in vowpal-wabbit to train on features with values of zero, which is the default anyway. Explicit :0 value features simply won't contribute anything to the model. When you leave out a value in your training set, as in feature_name without a trailing :<value>, vowpal wabbit implicitly assumes that it is a binary feature with a TRUE value. IOW: it defaults all value-less features to a value of one (:1) rather than a value of zero (:0). HTH.
Vowpal Wabbit also now has an --invert_hash option, which will give you a readable model with the actual variables, as well as just the hashes.
It consumes a LOT more memory, but since your model seems to be pretty small it will probably work.
For standard Q-Learning combined with a neural network things are more or less easy.
One stores (s, a, r, s') during interaction with the environment and uses
target = Qnew(s,a) = (1 - alpha) * Qold(s,a) + alpha * (r + gamma * max_{a'} Qold(s', a'))
as the target value for the neural network approximating the Q-function. So the input of the ANN is (s, a) and the output is the scalar Qnew(s,a). Deep Q-Learning papers/tutorials change the structure of the Q-function: instead of providing a single Q-value for the pair (s, a), it now provides the Q-values of all possible actions for a state s, so it is Q(s) instead of Q(s,a).
Here comes my problem. The database filled with (s, a, r, s') tuples does not, for a specific state s, contain the reward for all actions, but only for some, maybe just one. So how do I set up the target values for the network Q(s) = [Q(a_1), ..., Q(a_n)] without having all rewards for the state s in the database? I have seen different loss functions/target values, but all of them contain the reward.
As you see, I am puzzled. Can someone help me? There are a lot of tutorials on the web, but this step is in general poorly described and even less motivated when looking at the theory...
You only build a target value for the action that actually appears in the observation (s, a, r, s'). To build it, you evaluate the target network for all actions in the next state and take the maximum, as you wrote yourself: max_{a'} Qold(s', a'). That maximum is added to r(s,a), and the result is the target value. For example, assume you have 10 actions and the observation is (s_0, a=5, r(s_0,a=5)=123, s_1). Then the target value is r(s_0,a=5) + \gamma * \max_{a'} Q_target(s_1,a'). With tensorflow it could be something like:
Q_Action = tf.reduce_sum(tf.multiply(Q_values,tf.one_hot(action,output_dim)), axis = 1) # dim: [batchSize , ]
in which Q_values has shape [batchSize, output_dim]. So the output is a vector of size batchSize, and there is a vector of the same size holding the target values; the loss is the squared difference between the two.
When you calculate the loss, the backward pass only runs through the output of the action that was actually taken; the gradient coming from the other actions' outputs is simply zero.
So you only need the reward of the action that was taken.
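For completeness, a hypothetical end-to-end sketch of that target/loss computation in TensorFlow (q_net and q_target_net are assumed Keras models mapping states to Q-values of shape [batch, n_actions]; the tensors come from a replay buffer):
import tensorflow as tf

def dqn_loss(q_net, q_target_net, states, actions, rewards, next_states, gamma=0.99):
    q_values = q_net(states)                                        # [batch, n_actions]
    mask = tf.one_hot(actions, q_values.shape[-1])
    q_taken = tf.reduce_sum(q_values * mask, axis=1)                # Q(s, a) of the action actually taken
    max_next_q = tf.reduce_max(q_target_net(next_states), axis=1)   # max_{a'} Q_target(s', a')
    targets = rewards + gamma * max_next_q                          # only needs the reward of the taken action
    # gradients flow only through q_taken; the targets are treated as constants
    return tf.reduce_mean(tf.square(q_taken - tf.stop_gradient(targets)))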