Constructing discrete table-based CPDs in tensorflow-probability? - probability-distribution

I'm trying to construct the simplest example of a Bayesian network with several discrete random variables and conditional probabilities (the "Student Network" from Koller's book, see 1).
Although a bit unwieldy, I managed to build this network using pymc3. Creating the CPDs in particular is not that straightforward in pymc3; see the snippet below:
import pymc3 as pm
...
with pm.Model() as basic_model:
    # parameters for categorical are indexed as [0, 1, 2, ...]
    difficulty = pm.Categorical(name='difficulty', p=[0.6, 0.4])
    intelligence = pm.Categorical(name='intelligence', p=[0.7, 0.3])
    grade = pm.Categorical(name='grade',
        p=pm.math.switch(
            theano.tensor.eq(intelligence, 0),
            pm.math.switch(
                theano.tensor.eq(difficulty, 0),
                [0.3, 0.4, 0.3],  # I=0, D=0
                [0.05, 0.25, 0.7]  # I=0, D=1
            ),
            pm.math.switch(
                theano.tensor.eq(difficulty, 0),
                [0.9, 0.08, 0.02],  # I=1, D=0
                [0.5, 0.3, 0.2]  # I=1, D=1
            )
        )
    )
    letter = pm.Categorical(name='letter', p=pm.math.switch(
...
But I have no idea how to build this network using tensorflow-probability (versions: tfp-nightly==0.7.0.dev20190517, tf-nightly-2.0-preview==2.0.0.dev20190517).
For the unconditioned binary variables, one can use a categorical distribution, such as
from tensorflow_probability import distributions as tfd
from tensorflow_probability import edward2 as ed
difficulty = ed.RandomVariable(
    tfd.Categorical(
        probs=[0.6, 0.4],
        name='difficulty'
    )
)
But how to construct the CPDs?
There are a few classes/methods in tensorflow-probability that might be relevant (in tensorflow_probability/python/distributions/deterministic.py or the deprecated ConditionalDistribution), but the documentation is rather sparse (one needs a deep understanding of tfp).
--- Updated question ---
Chris' answer is a good starting point. However, things are still a bit unclear even for a very simple two-variable model.
This works nicely:
jdn = tfd.JointDistributionNamed(dict(
    dist_x=tfd.Categorical([0.2, 0.8], validate_args=True),
    dist_y=lambda dist_x: tfd.Bernoulli(probs=tf.gather([0.1, 0.9], indices=dist_x), validate_args=True)
))
print(jdn.sample(10))
but this one fails
jdn = tfd.JointDistributionNamed(dict(
    dist_x=tfd.Categorical([0.2, 0.8], validate_args=True),
    dist_y=lambda dist_x: tfd.Categorical(probs=tf.gather_nd([[0.1, 0.9], [0.5, 0.5]], indices=[dist_x]))
))
print(jdn.sample(10))
(I'm trying to model categorical explicitly in the second example just for learning purposes)
--- Update: solved ---
Obviously, the last example wrongly used tf.gather_nd instead of tf.gather, as we only wanted to select the first or the second row based on the dist_x outcome. This code works now:
jdn = tfd.JointDistributionNamed(dict(
    dist_x=tfd.Categorical([0.2, 0.8], validate_args=True),
    dist_y=lambda dist_x: tfd.Categorical(probs=tf.gather([[0.1, 0.9], [0.5, 0.5]], indices=[dist_x]))
))
print(jdn.sample(10))

The tricky thing about this, and presumably the reason it's subtler than expected in PyMC, is -- as with almost everything in vectorized programming -- handling shapes.
In TF/TFP, the (IMO) nicest way to solve this is with one of the new TFP JointDistribution{Sequential,Named,Coroutine} classes. These let you naturally represent hierarchical PGM models, and then sample from them, evaluate log probs, etc.
I whipped up a colab notebook demoing all 3 approaches, for the full student network: https://colab.research.google.com/drive/1D2VZ3OE6tp5pHTsnOAf_7nZZZ74GTeex
Note the crucial use of tf.gather and tf.gather_nd to manage the vectorization of the various binary and categorical switching.
Have a look and let me know if you have any questions!
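To make the table lookup itself concrete, here is a minimal sketch in plain Python rather than TFP. It uses the priors and the grade CPD from the question and stores the CPD as a nested list indexed by the parent values, which is the per-sample equivalent of what tf.gather does over a whole batch. The names P_DIFFICULTY, P_INTELLIGENCE, GRADE_CPD, sample_once and joint_prob are made up for this illustration and are not part of any library.

```python
import itertools
import random

# Priors from the question: P(difficulty) and P(intelligence).
P_DIFFICULTY = [0.6, 0.4]
P_INTELLIGENCE = [0.7, 0.3]

# P(grade | intelligence, difficulty), indexed as GRADE_CPD[i][d][g].
GRADE_CPD = [
    [[0.3, 0.4, 0.3], [0.05, 0.25, 0.7]],   # I=0: D=0, D=1
    [[0.9, 0.08, 0.02], [0.5, 0.3, 0.2]],   # I=1: D=0, D=1
]


def sample_once(rng):
    """Ancestral sampling: draw the parents first, then look up the child's row."""
    d = rng.choices(range(2), weights=P_DIFFICULTY)[0]
    i = rng.choices(range(2), weights=P_INTELLIGENCE)[0]
    # The "gather" step: the parent values select one row of the table.
    g = rng.choices(range(3), weights=GRADE_CPD[i][d])[0]
    return d, i, g


def joint_prob(d, i, g):
    """P(D=d, I=i, G=g) as the product of the prior and conditional entries."""
    return P_DIFFICULTY[d] * P_INTELLIGENCE[i] * GRADE_CPD[i][d][g]


rng = random.Random(0)
samples = [sample_once(rng) for _ in range(5)]

# Sanity check: the joint distribution sums to 1 over all assignments.
total = sum(
    joint_prob(d, i, g)
    for d, i, g in itertools.product(range(2), range(2), range(3))
)
```

In the vectorized TFP version, the parent samples are tensors rather than Python ints, so the row selection has to be done with tf.gather or tf.gather_nd instead of list indexing.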

Related

How to "remember" categorical encodings for actual predictions after training?

Suppose I wanted to train a machine learning algorithm on some dataset that includes categorical parameters. (I'm new to machine learning, but my thinking is...) Even if I converted all the categorical data to 1-hot-encoded vectors, how would this encoding map be "remembered" after training?
E.g. when converting the initial dataset to 1-hot encoding before training, say the
universe of categories for some column c is {"good","bad","ok"}, so the rows are converted to
[1, 2, "good"] ---> [1, 2, [1, 0, 0]],
[3, 4, "bad"] ---> [3, 4, [0, 1, 0]],
...
After training the model, all future prediction inputs would need to use the same encoding scheme for column c.
How, then, will data inputs remember that mapping (where "good" maps to index 0, etc.) during future predictions (specifically, when planning to use a keras RNN or LSTM model)? Do I need to save it somewhere (e.g. a python pickle), and if so, how do I get the explicit mapping? Or is there a way to have the model handle categorical inputs internally, so I can just input the original label data during training and future use?
If anything in this question shows any serious confusion on my part about something, please let me know (again, very new to ML).
** Wasn't sure if this belongs in https://stats.stackexchange.com/, but posted here since specifically wanted to know how to deal with the actual code implementation of this problem.
What I've been doing is the following:
After you use StringIndexer.fit(), you can save its metadata (includes the actual encoder mapping, like "good" being the first column)
This is the following code I use (using java, but can be adjusted to python):
StringIndexerModel sim = new StringIndexer()
.setInputCol(field)
.setOutputCol(field + "_INDEX")
.setHandleInvalid("skip")
.fit(dataset);
sim.write().overwrite().save("IndexMappingModels/" + field + "_INDEX");
and later, when trying to make predictions on a new dataset, you can load the stored metadata:
StringIndexerModel sim = StringIndexerModel.load("IndexMappingModels/" + field + "_INDEX");
dataset = sim.transform(dataset);
I imagine you have already solved this issue, since it was posted in 2018, but I've not found this solution anywhere else, so I believe it's worth sharing.
My thought would be to do something like this on the training/testing dataset D (using a mix of Python and plain pseudo-code):
Do something like
    # Before: D.schema == {num_col_1: int, cat_col_1: str, cat_col_2: str, ...}
    # assign a unique index to each distinct label of a categorical column and store it in a new column
    # http://spark.apache.org/docs/latest/ml-features.html#stringindexer
    label_indexer = StringIndexer(inputCol="cat_col_i", outputCol="cat_col_i_index").fit(D)
    D = label_indexer.transform(D)
    # After: D.schema == {num_col_1: int, cat_col_1: str, cat_col_2: str, ..., cat_col_1_index: int, cat_col_2_index: int, ...}
for all the categorical columns.
Then for all of these categorical name and index columns in D, make a map of the form
    map = {}
    for all categorical column names colname in D:
        map[colname] = []
        # create mapping dict for all categorical values
        # see https://spark.apache.org/docs/latest/sql-programming-guide.html#untyped-dataset-operations-aka-dataframe-operations
        for all rows r in D.select(colname, '%s_index' % colname).drop_duplicates():
            enc_from = r['%s' % colname]
            enc_to = r['%s_index' % colname]
            map[colname].append((enc_from, enc_to))
        # for cats that may appear later that have yet to be seen
        # (IDK if this is best practice, may be another way, see https://medium.com/#vaibhavshukla182/how-to-solve-mismatch-in-train-and-test-set-after-categorical-encoding-8320ed03552f)
        map[colname].append(('NOVEL_CAT', len(map[colname])))
        # sort by index encoding
        map[colname].sort(key=lambda pair: pair[1])
to end up with something like
{
'cat_col_1': [('orig_label_11', 0), ('orig_label_12', 1), ...],
'cat_col_2': [(), (), ...],
...
'cat_col_n': [(orig_label_n1, 0), ...]
}
which can then be used to generate 1-hot-encoded vectors for each categorical column in any later data sample row ds. E.g.
    for all categorical column names colname in ds:
        enc_from = ds[colname]
        # make zero vector for 1-hot for category
        col_onehot = zeros(size=len(map[colname]))
        for label, index in map[colname]:
            if label == enc_from:
                col_onehot[index] = 1
                # make new column in sample for 1-hot vector
                ds['%s_onehot' % colname] = col_onehot
                break
You can then save this structure as a pickle, pickle.dump(map, open("cats_map.pkl", "wb")), to compare against categorical column values when making actual predictions later.
** There may be a better way, but I think would need to better understand this article (https://medium.com/#satnalikamayank12/on-learning-embeddings-for-categorical-data-using-keras-165ff2773fc9). Will update answer if anything.
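The idea in the pseudo-code above can be sketched end to end in plain Python, without Spark: build the label-to-index map from the training column, save it to disk, and reload it at prediction time. This is only an illustration under stated assumptions: fit_categories, one_hot and the UNKNOWN placeholder are hypothetical names, JSON is used instead of pickle purely for readability, and the index order (here alphabetical) is arbitrary as long as the same saved map is reused.

```python
import json
import os
import tempfile

UNKNOWN = "NOVEL_CAT"  # hypothetical placeholder for labels unseen in training


def fit_categories(values):
    """Build a label -> index map from training values, plus an unknown slot."""
    labels = sorted(set(values))
    mapping = {label: idx for idx, label in enumerate(labels)}
    mapping[UNKNOWN] = len(mapping)
    return mapping


def one_hot(label, mapping):
    """Encode one label; unseen labels fall into the UNKNOWN slot."""
    vec = [0] * len(mapping)
    vec[mapping.get(label, mapping[UNKNOWN])] = 1
    return vec


# Training time: learn the mapping and persist it next to the model.
train_column = ["good", "bad", "ok", "good"]
mapping = fit_categories(train_column)

path = os.path.join(tempfile.mkdtemp(), "cats_map.json")
with open(path, "w") as f:
    json.dump(mapping, f)

# Prediction time: reload the exact same mapping before encoding new rows.
with open(path) as f:
    restored = json.load(f)

encoded_known = one_hot("good", restored)
encoded_new = one_hot("terrible", restored)
```

The important design point is that the mapping is learned once, from the training data only, and treated as part of the model artifact.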

DL4J Prediction Formatting

I have two questions on deeplearning4j that are somewhat related.
When I execute “INDArray predicted = model.output(features,false);” to generate a prediction, I get the label predicted by the model; it is either 0 or 1. I tried to search for a way to have a probability (value between 0 and 1) instead of strictly 0 or 1. This is useful when you need to set a threshold for what your model should consider as a 0 and what it should consider as a 1. For example, you may want your model to output '1' for any prediction that is higher than or equal to 0.9 and output '0' otherwise.
My second question is that I am not sure why the output is represented as a two-dimensional array (shown after the code below) even though there are only two possibilities, so it would be better to represent it with one value - especially if we want it as a probability (question #1) which is one value.
PS: in case relevant to the question, in the Schema the output column is defined using ".addColumnInteger". Below are snippets of the code used.
Part of the code:
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .seed(seed)
        .iterations(1)
        .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
        .learningRate(learningRate)
        .updater(org.deeplearning4j.nn.conf.Updater.NESTEROVS).momentum(0.9)
        .list()
        .layer(0, new DenseLayer.Builder()
                .nIn(numInputs)
                .nOut(numHiddenNodes)
                .weightInit(WeightInit.XAVIER)
                .activation("relu")
                .build())
        .layer(1, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                .weightInit(WeightInit.XAVIER)
                .activation("softmax")
                .weightInit(WeightInit.XAVIER)
                .nIn(numHiddenNodes)
                .nOut(numOutputs)
                .build()
        )
        .pretrain(false).backprop(true).build();
MultiLayerNetwork model = new MultiLayerNetwork(conf);
model.init();
model.setListeners(new ScoreIterationListener(10));
for (int n=0; n<nEpochs; n++) {
    model.fit(trainIter);
}
Evaluation eval = new Evaluation(numOutputs);
while (testIter.hasNext()){
    DataSet t = testIter.next();
    INDArray features = t.getFeatureMatrix();
    System.out.println("Input features: " + features);
    INDArray labels = t.getLabels();
    INDArray predicted = model.output(features,false);
    System.out.println("Predicted output: "+ predicted);
    System.out.println("Desired output: "+ labels);
    eval.eval(labels, predicted);
    System.out.println();
}
System.out.println(eval.stats());
Output from running the code above:
Input features: [0.10, 0.34, 1.00, 0.00, 1.00]
Predicted output: [1.00, 0.00]
Desired output: [1.00, 0.00]
What I want the output to look like (i.e. a one-value probability):
Input features: [0.10, 0.34, 1.00, 0.00, 1.00]
Predicted output: 0.14
Desired output: 0.0
I will answer your questions inline but I just want to note:
I would suggest taking a look at our docs and examples:
https://github.com/deeplearning4j/dl4j-examples
http://deeplearning4j.org/quickstart
A 100% 0 or 1 is just a badly tuned neural net. That's not at all how things work. A softmax by default returns probabilities. Your neural net is just badly tuned. Look at updating dl4j too. I'm not sure what version you're on but we haven't used strings in activations for at least a year now? You seem to have skipped a lot of steps when starting with us. I'll reiterate again, at least take a look above for a starting point rather than using year old code.
What you're seeing there is just standard deep learning 101. So the advice I'm about to give you can be found on the internet and is applicable to any deep learning software. A two-label softmax makes each row sum to 1. If you want 1 label, use sigmoid with 1 output and a different loss function. We use softmax because it works for any number of outputs: all you have to do is change the number of outputs, rather than having to change the loss function and activation function on top of that.
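The relationship between the two-output softmax and a single sigmoid output is plain arithmetic, so it can be checked outside of DL4J. The sketch below is in Python and uses made-up logits; softmax, sigmoid and THRESHOLD are names chosen for this example only. It shows that the second softmax column already is the single probability asked for in the question, and how a 0.9 threshold would then be applied to it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


logits = [0.5, 2.0]  # made-up raw scores for class 0 and class 1
probs = softmax(logits)

# The second column of a two-class softmax is P(class 1).
p_class1 = probs[1]

# For two classes, softmax reduces to a sigmoid of the logit difference.
p_sigmoid = sigmoid(logits[1] - logits[0])

# Apply a custom decision threshold instead of the default argmax.
THRESHOLD = 0.9
label = 1 if p_class1 >= THRESHOLD else 0
```

With the existing network, this means reading one column of the predicted array and thresholding it; switching to a single sigmoid output is the alternative described in the answer above.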

Why does sklearn Imputer need to fit?

I'm really new in this whole machine learning thing and I'm taking an online course on this subject. In this course, the instructors showed the following piece of code:
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
I don't really get why this imputer object needs to fit. I mean, I'm just trying to get rid of missing values in my columns by replacing them with the column mean. From the little I know about programming, this is a pretty simple, iterative procedure, and wouldn't require a model that has to train on data to be accomplished.
Can someone please explain how this imputer thing works and why it requires training to replace some missing values by the column mean?
I have read scikit-learn's documentation, but it just shows how to use the methods, and not why they're required.
Thank you.
The Imputer fills missing values with some statistics (e.g. mean, median, ...) of the data.
To avoid data leakage during cross-validation, it computes the statistic on the train data during the fit, stores it and uses it on the test data, during the transform.
import numpy as np
from sklearn.preprocessing import Imputer
obj = Imputer(strategy='mean')
obj.fit([[1, 2, 3], [2, 3, 4]])
print(obj.statistics_)
# array([ 1.5, 2.5, 3.5])
X = obj.transform([[4, np.nan, 6], [5, 6, np.nan]])
print(X)
# array([[ 4. , 2.5, 6. ],
# [ 5. , 6. , 3.5]])
You can do both steps in one if your train and test data are identical, using fit_transform.
X = obj.fit_transform([[1, 2, np.nan], [2, 3, 4]])
print(X)
# array([[ 1. , 2. , 4. ],
# [ 2. , 3. , 4. ]])
This data leakage issue is important, since the data distribution may change from the training data to the testing data, and you don't want the information of the testing data to be already present during the fit.
See the doc for more information about cross-validation.
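To see why fit and transform are separate steps, it can help to write the mean imputer by hand. The sketch below is a hypothetical stand-in in plain Python (MeanImputer is not a scikit-learn class); it reuses the numbers from the example above, so the statistics are learned from the training rows and then applied unchanged to new rows. For reference, in newer scikit-learn releases the Imputer class has been replaced by sklearn.impute.SimpleImputer, which follows the same fit/transform pattern.

```python
import math


class MeanImputer:
    """Hypothetical stand-in that mimics the fit/transform split."""

    def fit(self, rows):
        # Learn one mean per column, ignoring missing values.
        # Assumes every column has at least one non-missing value.
        self.statistics_ = []
        for col in zip(*rows):
            present = [v for v in col if not math.isnan(v)]
            self.statistics_.append(sum(present) / len(present))
        return self

    def transform(self, rows):
        # Fill gaps with the stored means; nothing is recomputed here.
        return [
            [self.statistics_[j] if math.isnan(v) else v for j, v in enumerate(row)]
            for row in rows
        ]


nan = float("nan")

# "Training" data: the column means are computed here, once.
imputer = MeanImputer().fit([[1, 2, 3], [2, 3, 4]])

# "Test" data: missing entries are filled with the training means.
filled = imputer.transform([[4, nan, 6], [5, 6, nan]])
```

If the means were recomputed inside transform, the test rows would influence their own imputation, which is exactly the leakage the answer describes.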

How to correctly implement dropout for convolution in TensorFlow

According to the original paper on Dropout, said regularisation method can be applied to convolution layers, often improving their performance. The TensorFlow function tf.nn.dropout supports that by having a noise_shape parameter that lets the user choose which parts of the tensors will drop out independently. However, neither the paper nor the documentation gives a clear explanation of which dimensions should be kept independently, and the TensorFlow explanation of how noise_shape works is rather unclear:
only dimensions with noise_shape[i] == shape(x)[i] will make independent decisions.
I would assume that for a typical CNN layer output of the shape [batch_size, height, width, channels] we don't want individual rows or columns to drop out by themselves, but rather whole channels (which would be equivalent to a node in a fully connected NN) independently of the examples (i.e. different channels could be dropped for different examples in a batch). Am I correct in this assumption?
If so, how would one go about implementing dropout with such specificity using the noise_shape parameter? Would it be:
noise_shape=[batch_size, 1, 1, channels]
or:
noise_shape=[1, height, width, 1]
From the tf.nn.dropout documentation:
For example, if shape(x) = [k, l, m, n] and noise_shape = [k, 1, 1, n], each batch and channel component will be kept independently and each row and column will be kept or not kept together.
The code may help explain this.
noise_shape = noise_shape if noise_shape is not None else array_ops.shape(x)
# uniform [keep_prob, 1.0 + keep_prob)
random_tensor = keep_prob
random_tensor += random_ops.random_uniform(noise_shape,
                                           seed=seed,
                                           dtype=x.dtype)
# 0. if [keep_prob, 1.0) and 1. if [1.0, 1.0 + keep_prob)
binary_tensor = math_ops.floor(random_tensor)
ret = math_ops.div(x, keep_prob) * binary_tensor
ret.set_shape(x.get_shape())
return ret
The line random_tensor += ... relies on broadcasting. The random tensor, and therefore binary_tensor, has shape noise_shape; in the final multiplication it is broadcast against x. When noise_shape[i] is set to 1, all elements of x along that dimension share the same random value, drawn from [0, 1). So when noise_shape=[k, 1, 1, n], each row and column in a feature map is kept or dropped together. On the other hand, each example (batch) and each channel receives a different random value, so each of them is kept independently.
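The broadcasting behaviour can be checked with a small pure-Python sketch that does not depend on TensorFlow. The function dropout_mask below is hypothetical and only mimics the mask construction: it draws one keep/drop decision per entry of noise_shape and reuses it along every axis where noise_shape is 1. It leaves out the 1/keep_prob rescaling that the real op applies to the kept values.

```python
import itertools
import random


def dropout_mask(shape, noise_shape, keep_prob, seed=0):
    """Return {index: 0.0 or 1.0} for every position in `shape`.

    One keep/drop decision is drawn per entry of `noise_shape`; axes where
    noise_shape is 1 reuse that decision along the whole axis (broadcasting).
    """
    rng = random.Random(seed)
    draws = {
        idx: 1.0 if rng.random() < keep_prob else 0.0
        for idx in itertools.product(*(range(n) for n in noise_shape))
    }
    mask = {}
    for idx in itertools.product(*(range(n) for n in shape)):
        # Collapse broadcast axes to index 0 to find the shared decision.
        source = tuple(i if n > 1 else 0 for i, n in zip(idx, noise_shape))
        mask[idx] = draws[source]
    return mask


batch, height, width, channels = 2, 3, 3, 4
mask = dropout_mask(
    shape=(batch, height, width, channels),
    noise_shape=(batch, 1, 1, channels),
    keep_prob=0.5,
)

# Every spatial position within one (example, channel) pair shares a decision.
shared = all(
    mask[(b, h, w, c)] == mask[(b, 0, 0, c)]
    for b, h, w, c in mask
)
```

With noise_shape=(1, height, width, 1) the same check would instead show that each spatial position is dropped for all examples and all channels at once, which is usually not what is wanted for convolutional feature maps.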

TensorFlow Classification Using Dataset

I need to utilize TensorFlow for a project to classify items based on their attributes to a certain class (either 1, 2, or 3).
Only problem is almost every TF tutorial or example I find online is about image recognition or text classification. I can't find anything about classification based on numbers. I guess what I'm asking for is where to get started. If anyone knows of a relevant example, or if I'm just thinking about this completely wrong.
We are given the 13 attributes for each item, and need to use the TF neural network to classify each item correctly (or mark the margin of error). But nothing online is showing me even how to start with this kind of dataset.
Example of dataset: (first value is class, other values are attributes)
2, 11.84, 2.89, 2.23, 18, 112, 1.72, 1.32, 0.43, 0.95, 2.65, 0.96, 2.52, 500
3, 13.69, 3.26, 2.54, 20, 107, 1.83, 0.56, 0.5, 0.8, 5.88, 0.96, 1.82, 680
3, 13.84, 4.12, 2.38, 19.5, 89, 1.8, 0.83, 0.48, 1.56, 9.01, 0.57, 1.64, 480
2, 11.56, 2.05, 3.23, 28.5, 119, 3.18, 5.08, 0.47, 1.87, 6, 0.93, 3.69, 465
1, 14.06, 1.63, 2.28, 16, 126, 3, 3.17, 0.24, 2.1, 5.65, 1.09, 3.71, 780
Suppose you have the data in a file, data.txt. You can use Numpy to read this:
import numpy as np
xy = np.loadtxt('data.txt', delimiter=',', unpack=True, dtype='float32')
x_data = xy[1:]
y_data = xy[0]
More information: http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.loadtxt.html
You may need np.transpose, depending on the shape of your weights and operations.
x_data = np.transpose(xy[1:])
Then, use 'placeholders' and 'feed_dict' to train/test your model:
X = tf.placeholder("float", ...
Y = tf.placeholder("float", ...
....
with tf.Session() as sess:
    ....
    sess.run(optimizer, feed_dict={X: x_data, Y: y_data})
For this kind of problem, TensorFlow has an in-depth tutorial here,
or in Towards Data Science here.
If you're looking for videos to start with, I think sentdex's tutorials on the Titanic dataset
are what you're looking for, although he uses k-means to do the classification
(actually, I think his entire deep learning/machine learning playlist is a great place to start).
You can find it here.
Otherwise, if you're looking for the basics of how to start:
First, preprocessing:
try first separating the data into class labels and inputs (the pandas library should be able to help you with this)
make your class labels into a one-hot array
Then normalize the data:
it looks like your different data attributes have wildly different ranges; make sure to get them all into the same range, between 0 and 1
Build your model:
a simple fully connected net should do the trick
remember to make the output layer the same size as the number of classes you have
use an argmax function on the output of the final layer to decide which class the model thinks is the proper classification
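The preprocessing steps listed above can be sketched in plain Python on the five sample rows from the question. This is a minimal illustration, not a full pipeline: the variable names and the predict_class helper are made up for this example, the model itself is left out, and in practice the min/max values should be computed on the training split only and reused for new data.

```python
RAW = """\
2, 11.84, 2.89, 2.23, 18, 112, 1.72, 1.32, 0.43, 0.95, 2.65, 0.96, 2.52, 500
3, 13.69, 3.26, 2.54, 20, 107, 1.83, 0.56, 0.5, 0.8, 5.88, 0.96, 1.82, 680
3, 13.84, 4.12, 2.38, 19.5, 89, 1.8, 0.83, 0.48, 1.56, 9.01, 0.57, 1.64, 480
2, 11.56, 2.05, 3.23, 28.5, 119, 3.18, 5.08, 0.47, 1.87, 6, 0.93, 3.69, 465
1, 14.06, 1.63, 2.28, 16, 126, 3, 3.17, 0.24, 2.1, 5.65, 1.09, 3.71, 780
"""

# Step 1: separate the class label (first value) from the 13 attributes.
rows = [[float(v) for v in line.split(",")] for line in RAW.strip().splitlines()]
labels = [int(r[0]) for r in rows]
features = [r[1:] for r in rows]

# Step 2: min-max scale each attribute to [0, 1] using the column's own range.
columns = list(zip(*features))
lows = [min(c) for c in columns]
highs = [max(c) for c in columns]
scaled = [
    [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
    for row in features
]

# Step 3: one-hot encode the labels; classes are 1, 2, 3, so class k uses slot k - 1.
NUM_CLASSES = 3
one_hot = [[1 if label == k + 1 else 0 for k in range(NUM_CLASSES)] for label in labels]


def predict_class(output_scores):
    """Argmax over the output layer, mapped back to the 1-based class labels."""
    return max(range(len(output_scores)), key=output_scores.__getitem__) + 1
```

The scaled rows and one-hot labels are what would be fed to the network, and predict_class shows how the output layer's scores would be turned back into a class of 1, 2 or 3.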
