How to "remember" categorical encodings for actual predictions after training? - machine-learning

Suppose I wanted to train a machine learning algorithm on a dataset that includes some categorical parameters. (I'm new to machine learning, but my thinking is...) Even if I converted all the categorical data to one-hot-encoded vectors, how will this encoding map be "remembered" after training?
E.g., converting the initial dataset to use one-hot encoding before training: say the universe of categories for some column c is {"good","bad","ok"}, so rows are converted as
[1, 2, "good"] ---> [1, 2, [1, 0, 0]],
[3, 4, "bad"] ---> [3, 4, [0, 1, 0]],
...
Then, after training the model, all future prediction inputs would need to use the same encoding scheme for column c.
How then, during future predictions, will data inputs remember that mapping (where "good" maps to index 0, etc.), specifically when planning on using a Keras RNN or LSTM model? Do I need to save it somewhere (e.g. a Python pickle)? If so, how do I get the explicit mapping? Or is there a way to have the model handle categorical inputs internally, so I can just feed in the original label data during training and future use?
If anything in this question shows any serious confusion on my part about something, please let me know (again, I'm very new to ML).
** I wasn't sure if this belongs on https://stats.stackexchange.com/, but I posted it here since I specifically wanted to know how to deal with the actual code implementation of this problem.

What I've been doing is the following:
After you use StringIndexer.fit(), you can save its metadata (which includes the actual encoder mapping, e.g. "good" being mapped to the first index).
This is the code I use (in Java, but it can be adapted to Python):
StringIndexerModel sim = new StringIndexer()
        .setInputCol(field)
        .setOutputCol(field + "_INDEX")
        .setHandleInvalid("skip")
        .fit(dataset);
sim.write().overwrite().save("IndexMappingModels/" + field + "_INDEX");
and later, when trying to make predictions on a new dataset, you can load the stored metadata:
StringIndexerModel sim = StringIndexerModel.load("IndexMappingModels/" + field + "_INDEX");
dataset = sim.transform(dataset);
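A rough PySpark equivalent of the above (a sketch; field and dataset are assumed to be defined as in the Java version, and the fitted model's labels attribute exposes the explicit mapping, with the label at position 0 mapping to index 0.0):

from pyspark.ml.feature import StringIndexer, StringIndexerModel

# fit and persist the index mapping
sim = StringIndexer(inputCol=field, outputCol=field + "_INDEX", handleInvalid="skip").fit(dataset)
sim.write().overwrite().save("IndexMappingModels/" + field + "_INDEX")
print(sim.labels)  # e.g. ['good', 'bad', 'ok'] -> indices 0.0, 1.0, 2.0

# later, at prediction time, reload and apply the same mapping
sim = StringIndexerModel.load("IndexMappingModels/" + field + "_INDEX")
dataset = sim.transform(dataset)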
I imagine you have already solved this issue, since it was posted in 2018, but I've not found this solution anywhere else, so I believe it's worth sharing.

My thought would be to do something like the following on the training/testing dataset D (using a mix of Python and plain pseudo-code):
Do something like
# Before: D.schema == {num_col_1: int, cat_col_1: str, cat_col_2: str, ...}
# assign a unique index to each distinct label of a categorical column and store it in a new column
# http://spark.apache.org/docs/latest/ml-features.html#stringindexer
label_indexer = StringIndexer(inputCol="cat_col_i", outputCol="cat_col_i_index").fit(D)
D = label_indexer.transform(D)
# After: D.schema == {num_col_1: int, cat_col_1: str, cat_col_2: str, ..., cat_col_1_index: int, cat_col_2_index: int, ...}
for all the categorical columns
Then, for all of these categorical name and index columns in D, build a map of the form

cat_map = {}
for colname in categorical_colnames:
    cat_map[colname] = []
    # collect a (label, index) pair for every distinct categorical value
    # see https://spark.apache.org/docs/latest/sql-programming-guide.html#untyped-dataset-operations-aka-dataframe-operations
    for r in D.select(colname, '%s_index' % colname).drop_duplicates().collect():
        enc_from = r[colname]
        enc_to = r['%s_index' % colname]
        cat_map[colname].append((enc_from, enc_to))
    # reserve an extra index for categories that may appear later but have yet to be seen
    # (IDK if this is best practice, may be another way, see https://medium.com/@vaibhavshukla182/how-to-solve-mismatch-in-train-and-test-set-after-categorical-encoding-8320ed03552f)
    cat_map[colname].append(('NOVEL_CAT', len(cat_map[colname])))
    # sort by index encoding
    cat_map[colname].sort(key=lambda pair: pair[1])
to end up with something like

{
    'cat_col_1': [('orig_label_11', 0), ('orig_label_12', 1), ...],
    'cat_col_2': [(...), (...), ...],
    ...
    'cat_col_n': [('orig_label_n1', 0), ...]
}
which can then be used to generate one-hot-encoded vectors for each categorical column in any later data sample row ds. E.g.:

for colname in categorical_colnames:
    enc_from = ds[colname]
    # zero vector for the one-hot encoding of this category
    col_onehot = [0] * len(cat_map[colname])
    for label, index in cat_map[colname]:
        if label == enc_from:
            col_onehot[index] = 1
            break
    # store the one-hot vector in a new column of the sample
    ds['%s_onehot' % colname] = col_onehot
This structure can then be saved with pickle, e.g. pickle.dump(cat_map, open("cats_map.pkl", "wb")), and loaded later to encode categorical column values when making actual predictions.
** There may be a better way, but I think I would need to better understand this article first (https://medium.com/@satnalikamayank12/on-learning-embeddings-for-categorical-data-using-keras-165ff2773fc9). Will update this answer if I learn anything.
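For reference, a minimal Keras sketch of the embedding idea from that article, assuming the categorical labels have already been converted to integer indices with a saved mapping like the one above (all layer sizes here are illustrative):

from tensorflow.keras import layers, models

num_categories = 3  # e.g. {"good", "bad", "ok"}
model = models.Sequential([
    # one extra index is reserved for novel categories seen only at prediction time
    layers.Embedding(input_dim=num_categories + 1, output_dim=8, input_length=1),
    layers.Flatten(),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')

Note the persisted label-to-index mapping is still needed at prediction time; the embedding only replaces the one-hot step, not the bookkeeping.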

Related

How to apply custom encoders to multiple clients at once? How to use custom encoders in run_one_round?

So my goal is basically to implement global top-k subsampling. Gradient sparsification is quite simple and I have already done this building on the stateful clients example, but now I would like to use encoders as you have recommended here at page 28. Additionally, I would like to average only the non-zero gradients: say we have 10 clients, but only 4 have non-zero gradients at a given position for a communication round; then I would like to divide the sum of these gradients by 4, not 10. I am hoping to achieve this by summing gradients in the numerator and masks (1s and 0s) in the denominator. Moving forward I will also add randomness to the gradient selection, so it is imperative that I create those masks concurrently with the gradient selection. The code I have right now is:
import tensorflow as tf
from tensorflow_model_optimization.python.core.internal import tensor_encoding as te


@te.core.tf_style_adaptive_encoding_stage
class GradientSparsificationEncodingStage(te.core.AdaptiveEncodingStageInterface):
  """An example custom implementation of an `EncodingStageInterface`.

  Note: This is likely not what one would want to use in practice. Rather, this
  serves as an illustration of how a custom compression algorithm can be
  provided to `tff`.

  This encoding stage is expected to be run in an iterative manner, and
  alternatively zeroes out values corresponding to odd and even indices. Given
  the determinism of the non-zero indices selection, the encoded structure does
  not need to be represented as a sparse vector, but only the non-zero values
  are necessary. In the decode method, the state (i.e., params derived from the
  state) is used to reconstruct the corresponding indices.

  Thus, this example encoding stage can realize representation saving of 2x.
  """

  ENCODED_VALUES_KEY = 'stateful_topk_values'
  INDICES_KEY = 'indices'
  SHAPES_KEY = 'shapes'
  ERROR_COMPENSATION_KEY = 'error_compensation'

  def encode(self, x, encode_params):
    shapes_list = [tf.shape(y) for y in x]
    flattened = tf.nest.map_structure(lambda y: tf.reshape(y, [-1]), x)
    gradients = tf.concat(flattened, axis=0)
    error_compensation = encode_params[self.ERROR_COMPENSATION_KEY]
    gradients_and_error_compensation = tf.math.add(gradients, error_compensation)
    percentage = tf.constant(0.1, dtype=tf.float32)
    k_float = tf.multiply(percentage, tf.cast(tf.size(gradients_and_error_compensation), tf.float32))
    k_int = tf.cast(tf.math.round(k_float), dtype=tf.int32)
    values, indices = tf.math.top_k(tf.math.abs(gradients_and_error_compensation), k=k_int, sorted=False)
    indices = tf.expand_dims(indices, 1)
    sparse_gradients_and_error_compensation = tf.scatter_nd(indices, values, tf.shape(gradients_and_error_compensation))
    new_error_compensation = tf.math.subtract(gradients_and_error_compensation, sparse_gradients_and_error_compensation)
    state_update_tensors = {self.ERROR_COMPENSATION_KEY: new_error_compensation}
    encoded_x = {self.ENCODED_VALUES_KEY: values,
                 self.INDICES_KEY: indices,
                 self.SHAPES_KEY: shapes_list}
    return encoded_x, state_update_tensors

  def decode(self,
             encoded_tensors,
             decode_params,
             num_summands=None,
             shape=None):
    del num_summands, decode_params, shape  # Unused.
    flat_shape = tf.math.reduce_sum([tf.math.reduce_prod(shape) for shape in encoded_tensors[self.SHAPES_KEY]])
    sizes_list = [tf.math.reduce_prod(shape) for shape in encoded_tensors[self.SHAPES_KEY]]
    scatter_tensor = tf.scatter_nd(
        indices=encoded_tensors[self.INDICES_KEY],
        updates=encoded_tensors[self.ENCODED_VALUES_KEY],
        shape=[flat_shape])
    nonzero_locations = tf.nest.map_structure(lambda x: tf.cast(tf.where(tf.math.greater(x, 0), 1, 0), tf.float32), scatter_tensor)
    reshaped_tensor = [tf.reshape(flat_tensor, shape=shape) for flat_tensor, shape in
                       zip(tf.split(scatter_tensor, sizes_list), encoded_tensors[self.SHAPES_KEY])]
    reshaped_nonzero = [tf.reshape(flat_tensor, shape=shape) for flat_tensor, shape in
                        zip(tf.split(nonzero_locations, sizes_list), encoded_tensors[self.SHAPES_KEY])]
    return reshaped_tensor, reshaped_nonzero

  def initial_state(self):
    return {self.ERROR_COMPENSATION_KEY: tf.constant(0, dtype=tf.float32)}

  def update_state(self, state, state_update_tensors):
    return {self.ERROR_COMPENSATION_KEY: state_update_tensors[self.ERROR_COMPENSATION_KEY]}

  def get_params(self, state):
    encode_params = {self.ERROR_COMPENSATION_KEY: state[self.ERROR_COMPENSATION_KEY]}
    decode_params = {}
    return encode_params, decode_params

  @property
  def name(self):
    return 'gradient_sparsification_encoding_stage'

  @property
  def compressible_tensors_keys(self):
    return False

  @property
  def commutes_with_sum(self):
    return False

  @property
  def decode_needs_input_shape(self):
    return False

  @property
  def state_update_aggregation_modes(self):
    return {}
I have run some simple tests manually, following the steps you outlined here at page 45. It works, but I have some questions/problems.

1. When I use a list of tensors of the same shape (e.g. two 2x25 tensors) as the input x of encode, it works without any issues, but when I try to use a list of tensors of different shapes (2x20 and 6x10) it gives an error saying:

InvalidArgumentError: Shapes of all inputs must match: values[0].shape = [2,20] != values[1].shape = [6,10] [Op:Pack] name: packed

How can I resolve this issue? As I said, I want to use global top-k, so it is essential that I encode all of the trainable model weights at once. Take the CNN model used here: all the tensors have different shapes.

2. How can I do the averaging I described at the beginning? For example, here you have done:

mean_factory = tff.aggregators.MeanFactory(
    tff.aggregators.EncodedSumFactory(mean_encoder_fn),  # numerator
    tff.aggregators.EncodedSumFactory(mean_encoder_fn),  # denominator
)

Is there a way to repeat this with one output of decode going to the numerator and the other going to the denominator? How can I handle dividing 0 by 0? TensorFlow has a divide_no_nan function; can I use it somehow, or do I need to add an eps to each denominator?

3. How is partitioning handled when I use encoders? Does each client get a unique encoder holding a unique state for it? As you have discussed here at page 6, client states are used in cross-silo settings, yet what happens if the client ordering changes?

4. Here you have recommended using the stateful clients example. Can you explain this a bit further? I mean, in run_one_round, where exactly do the encoders go, and how are they used/combined with the client update and aggregation?

5. I have some additional information, such as sparsity, that I want to pass to encode. What is the suggested method for doing that?
Here are some answers; hope they help:

1. If you want to treat the whole aggregated structure just as a single tensor, use concat_factory as the outermost aggregator. That will concatenate the entire structure to a rank-1 Tensor at clients, and then unpack it back to the original structure at the end. Example use: tff.aggregators.concat_factory(tff.aggregators.MeanFactory(...))
Note that the encoding stage objects are meant to work with a single tensor, so what you describe with identical tensors probably works only accidentally.

2. There are two options.
a. Modify the client training code such that the weights being passed to the weighted aggregator are already what you want them to be (the zero/one mask). In the stateful clients example you link, that would be here. You will then get what you need by default (by summing the numerator).
b. Modify UnweightedMeanFactory to do exactly the variant of averaging you describe and use that. A start would be modifying this.

3. (and 4.) I think that is what you would need to implement. The same way existing client states are initialized in the example here, you would need to extend it to contain the aggregator states and make sure those are sampled together with the clients, as done here. Then, to integrate the aggregators in the example, you would need to replace this hard-coded tff.federated_mean. An example of such integration is in the implementation of tff.learning.build_federated_averaging_process, primarily here.

5. I am not sure what the question is. Perhaps get the previous parts working (that seems like a prerequisite to me), and then clarify and ask in a new post?
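On the 0/0 concern in question 2, a minimal sketch of the guarded division with tf.math.divide_no_nan, which returns 0 wherever the denominator is 0 (all values here are illustrative):

import tensorflow as tf

summed_gradients = tf.constant([4.0, 0.0, 6.0])  # numerator: sum of sparse client gradients
summed_masks = tf.constant([2.0, 0.0, 3.0])      # denominator: count of clients with a non-zero entry
averaged = tf.math.divide_no_nan(summed_gradients, summed_masks)  # -> [2.0, 0.0, 2.0]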

Doc2vec: Only 10 docvecs in gensim doc2vec model?

I used gensim to fit a doc2vec model, with tagged documents (length > 10) as training data. The goal is to get the doc vectors of all training docs, but only 10 vectors can be found in model.docvecs.
An example of the training data (length > 10):
docs = ['This is a sentence', 'This is another sentence', ....]
with some pre-processing:
doc_ = [d.strip().split(" ") for d in doc]
doc_tagged = []
for i in range(len(doc_)):
    tagd = TaggedDocument(doc[i], str(i))
    doc_tagged.append(tagd)
tagged docs
TaggedDocument(words=array(['a', 'b', 'c', ..., ],
dtype='<U32'), tags='117')
fit a doc2vec model
model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(doc_tagged)
model.train(doc_tagged, total_examples= model.corpus_count, epochs= model.iter)
then I get the final model:
len(model.docvecs)
the result is 10...
I tried other datasets (length > 100, 1000) and got the same result for len(model.docvecs).
So, my question is:
How can I use model.docvecs to get all of the training doc vectors (without using model.infer_vector)?
Is model.docvecs designed to provide all training docvecs?
The bug is in this line:
tagd = TaggedDocument(doc[i],str(i))
Gensim's TaggedDocument accepts a sequence of tags as its second argument. When you pass a string like '123', it's treated as a sequence and turned into ['1', '2', '3']. As a result, all of the documents are tagged with just 10 distinct tags, ['0', ..., '9'], in various combinations.
Another issue: you're defining doc_ but never actually using it, so your documents will be split incorrectly as well.
Here's the proper solution:
docs = [doc.strip().split(' ') for doc in docs]
tagged_docs = [doc2vec.TaggedDocument(doc, [str(i)]) for i, doc in enumerate(docs)]
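A minimal end-to-end sketch of the corrected pipeline (a sketch using newer gensim parameter names, vector_size and epochs, where older versions used size and iter):

from gensim.models import doc2vec

docs = ['This is a sentence', 'This is another sentence']
docs = [doc.strip().split(' ') for doc in docs]
tagged_docs = [doc2vec.TaggedDocument(doc, [str(i)]) for i, doc in enumerate(docs)]

model = doc2vec.Doc2Vec(min_count=1, window=10, vector_size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(tagged_docs)
model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.epochs)
print(len(model.docvecs))  # now equals the number of documents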

How to convert a single word into vector for neural network input

I am new to ML and thinking about a simple "hello world" problem as practice for using an ANN. Here is the problem: say I have a training data set with English names and their corresponding genders:
ALEX,M
BONNIE,F
CARLO,F
DAVID,M
EDDY,M
...
I would like to build a model to predict the gender of a name. Since the input to an ANN must be a vector, I am thinking of converting a name into a vector with the number of features equal to the length of the longest name in the data set (i.e. 10), and then putting A=1, B=2, ..., Z=26, and null=-1 into the vector.
For example:
ALEX will be [1, 12, 5, 24, -1, -1, -1, -1, -1, -1]
The output layer will be {0, 1} which represent either male or female.
It sounds quite strange to me. Is this a good way to feed a single word into an ANN?
Use one-hot encoding. This means you have a larger input size, but it has proven to work. Using A=1, B=2, ..., Z=26 gives the network the impression that B is closer to A than Z is, and it requires a lot of hidden nodes to map the function.
With one-hot encoding:
A: [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
B: [0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
Z: [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1]
none: [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
So it requires 26 inputs per letter.
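A minimal Python sketch of this per-letter one-hot scheme (the fixed length of 10 follows the question; padding positions keep the all-zero "none" vector):

import numpy as np

ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
MAX_LEN = 10

def encode_name(name):
    # one 26-dimensional one-hot block per letter position; padding stays all-zero
    vec = np.zeros((MAX_LEN, len(ALPHABET)), dtype=np.float32)
    for i, ch in enumerate(name.upper()[:MAX_LEN]):
        vec[i, ALPHABET.index(ch)] = 1.0
    return vec.flatten()  # 10 * 26 = 260 inputs

print(encode_name('ALEX').shape)  # (260,)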

How can one add a batch mechanism to the input function in the TensorFlow tutorial, overcoming tf.SparseTensor objects?

How can one add a batching mechanism to the input_fn in the TensorFlow Wide & Deep Learning Tutorial, given that some features are represented as tf.SparseTensor objects?
I have made many attempts at adding tf.train.slice_input_producer and tf.train.batch to the original code (see below), but have failed miserably so far.
I would like to keep the overall workings of input_fn intact, as it is handy while training and evaluating the model.
Can someone help, please?
Can someone help, please?
def input_fn(df):
    # Creates a dictionary mapping from each continuous feature column name (k) to
    # the values of that column stored in a constant Tensor.
    continuous_cols = {k: tf.constant(df[k].values)
                       for k in CONTINUOUS_COLUMNS}
    # Creates a dictionary mapping from each categorical feature column name (k)
    # to the values of that column stored in a tf.SparseTensor.
    categorical_cols = {k: tf.SparseTensor(indices=[[i, 0] for i in range(df[k].size)],
                                           values=df[k].values,
                                           shape=[df[k].size, 1]) for k in CATEGORICAL_COLUMNS}
    # Merges the two dictionaries into one.
    feature_cols = dict(continuous_cols.items() + categorical_cols.items())
    # Converts the label column into a constant Tensor.
    labels = tf.constant(df[LABEL_COLUMN].values)
    '''
    Changes from here:
    '''
    features_slices, labels_slices = tf.train.slice_input_producer([feature_cols, labels], ...)
    features_batches, labels_batches = tf.train.batch([features_slices, labels_slices], ...)
    # Returns the feature and label batches.
    return features_batches, labels_batches
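One possible direction (a sketch, not a verified fix): do the batching at the pandas level and reuse the original, unbatched input_fn on each slice, sidestepping the queue-based ops entirely. batched_input_fn is a hypothetical helper, not part of the tutorial:

def batched_input_fn(df, batch_size):
    # yield (features, labels) for successive slices of the DataFrame
    for start in range(0, len(df), batch_size):
        yield input_fn(df.iloc[start:start + batch_size])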

Weka normalizing columns

I have an ARFF file containing 14 numerical columns. I want to perform a normalization on each column separately, that is, modifying the values of each column to (actual_value - min(this_column)) / (max(this_column) - min(this_column)). Hence, all values of a column will be in the range [0, 1]. The min and max values of a column might differ from those of another column.
How can I do this with Weka filters?
Thanks
This can be done using
weka.filters.unsupervised.attribute.Normalize
After applying this filter, all values in each column will be in the range [0, 1].
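For intuition, a minimal NumPy sketch of the min-max scaling this filter performs per column (illustrative data, not Weka code):

import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [4.0, 50.0]])
# (value - column min) / (column max - column min), per column
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled)  # every column now lies in [0, 1]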
That's right. Just wanted to point out the difference between "normalization" and "standardization": what's described in the question is min-max normalization, while standardization rescales each attribute by its mean and standard deviation (assuming a roughly Gaussian distribution). If you have an outlier in your data, the Normalize filter might hurt your data distribution, as the min or max might be much farther out than the other instances.
In this case, we can use the weka.filters.unsupervised.attribute.Normalize filter to normalize everything, but if we want to normalize only some columns, the following will be the best approach.
To apply normalize on selected columns
The unsupervised.attribute.PartitionedMultiFilter can be used for this task.
Thereby you have to configure the filters and ranges sections as per your need.
For example, if I want to normalize only the humidity attribute:
Step 01:
After adding the PartitionedMultiFilter, click on the filter text box, choose Normalize from weka.filters.unsupervised.attribute.Normalize, and edit the Normalize filter to your needs (by giving the scale and translation values).
Step 02:
Click on the ranges text box, delete the default range (which is first-last), then add the column number you want to filter, click OK, and click Apply.
Now the filter will be applied only to the selected (humidity) column.
Here is a working normalization example with K-Means in Java:
final SimpleKMeans kmeans = new SimpleKMeans();
final String[] options = weka.core.Utils
        .splitOptions("-init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25 -t2 -1.0 -N 10 -A \"weka.core.EuclideanDistance -R first-last\" -I 500 -num-slots 1 -S 50");
kmeans.setOptions(options);
kmeans.setSeed(10);
kmeans.setPreserveInstancesOrder(true);
kmeans.setNumClusters(25);
kmeans.setMaxIterations(1000);

final BufferedReader datafile = new BufferedReader(new FileReader("/Users/data.arff"));
Instances data = new Instances(datafile);

// normalize all numeric attributes to [0, 1]
final Normalize normalizeFilter = new Normalize();
normalizeFilter.setInputFormat(data);
data = Filter.useFilter(data, normalizeFilter);

// remove the class column [0] from clustering
data.setClassIndex(0);
final Remove removeFilter = new Remove();
removeFilter.setAttributeIndices("" + (data.classIndex() + 1));
removeFilter.setInputFormat(data);
data = Filter.useFilter(data, removeFilter);

kmeans.buildClusterer(data);
System.out.println(kmeans.toString());

// evaluate the clusterer
final ClusterEvaluation eval = new ClusterEvaluation();
eval.setClusterer(kmeans);
eval.evaluateClusterer(data);
System.out.println(eval.clusterResultsToString());
If you have a CSV file, replace the BufferedReader lines above with the following DataSource:
final DataSource source = new DataSource("/Users/data.csv");
final Instances data = source.getDataSet();
