i'm still new in Julia and in machine learning in general, but I'm quite eager to learn. In the current project i'm working on I have a problem about dimensions mismatch, and can't figure what to do.
I have two arrays as follow:
x_array:
9-element Array{Array{Int64,N} where N,1}:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 72, 73]
[11, 12, 13, 14, 15, 16, 17, 72, 73]
[18, 12, 19, 20, 21, 22, 72, 74]
[23, 24, 12, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 72, 74]
[36, 37, 38, 39, 40, 38, 41, 42, 72, 73]
[43, 44, 45, 46, 47, 48, 72, 74]
[49, 50, 51, 52, 14, 53, 72, 74]
[54, 55, 41, 56, 57, 58, 59, 60, 61, 62, 63, 62, 64, 72, 74]
[65, 66, 67, 68, 32, 69, 70, 71, 72, 74]
y_array:
9-element Array{Int64,1}
75
76
77
78
79
80
81
82
83
and the next model using Flux:
model = Chain(
LSTM(10, 256),
LSTM(256, 128),
LSTM(128, 128),
Dense(128, 9),
softmax
)
I zip both arrays, and then feed them into the model using Flux.train!
data = zip(x_array, y_array)
Flux.train!(loss, Flux.params(model), data, opt)
and immediately throws the next error:
ERROR: DimensionMismatch("matrix A has dimensions (1024,10), vector B has length 9")
Now, I know that the first dimension of matrix A is the sum of the hidden layers (256 + 256 + 128 + 128 + 128 + 128) and the second dimension is the input layer, which is 10. The first thing I did was change the 10 for a 9, but then it only throws the error:
ERROR: DimensionMismatch("dimensions must match")
Can someone explain to me what dimensions are the ones that mismatch, and how to make them match?
Introduction
First off, you should know that from an architectural standpoint, you are asking something very difficult from your network; softmax re-normalizes outputs to be between 0 and 1 (weighted like a probability distribution), which means that asking your network to output values like 77 to match y will be impossible. That's not what is causing the dimension mismatch, but it's something to be aware of. I'm going to drop the softmax() at the end to give the network a fighting chance, especially since it's not what's causing the problem.
Debugging shape mismatches
Let's walk through what actually happens inside of Flux.train!(). The definition is actually surprisingly simple. Ignoring everything that doesn't matter to us, we are left with:
for d in data
gs = gradient(ps) do
loss(d...)
end
end
Therefore, let's start by pulling the first element out of your data, and splatting it into your loss function. You didn't specify your loss function or optimizer in the question. Although softmax usually means you should use crossentropy loss, your y values are very much not probabilities, and so if we drop the softmax we can just use the dead-simple mse() loss. For optimizer, we'll default to good old ADAM:
model = Chain(
LSTM(10, 256),
LSTM(256, 128),
LSTM(128, 128),
Dense(128, 9),
#softmax, # commented out for now
)
loss(x, y) = Flux.mse(model(x), y)
opt = ADAM(0.001)
data = zip(x_array, y_array)
Now, to simulate the first run of Flux.train!(), we take first(data) and splat that into loss():
loss(first(data)...)
This gives us the error message you've seen before; ERROR: DimensionMismatch("matrix A has dimensions (1024,10), vector B has length 12"). Looking at our data, we see that yes, indeed, the first element of our dataset has a length of 12. And so we will change our model to instead expect 12 values instead of 10:
model = Chain(
LSTM(12, 256),
LSTM(256, 128),
LSTM(128, 128),
Dense(128, 9),
)
And now we re-run:
julia> loss(first(data)...)
50595.52542674723 (tracked)
Huzzah! It worked! We can run this again:
julia> loss(first(data)...)
50578.01417593167 (tracked)
The value changes because the RNN holds memory within itself which gets updated each time we run the network, otherwise we would expect the network to give the same answer for the same inputs!
The problem comes, however, when we try to run the second training instance through our network:
julia> loss([d for d in data][2]...)
ERROR: DimensionMismatch("matrix A has dimensions (1024,12), vector B has length 9")
Understanding LSTMs
This is where we run into Machine Learning problems more than programming problems; the issue here is that we have promised to feed that first LSTM network a vector of length 10 (well, 12 now) and we are breaking that promise. This is a general rule of deep learning; you always have to obey the contracts you sign about the shape of the tensors that are flowing through your model.
Now, the reasons you're using LSTMs at all is probably because you want to feed in ragged data, chew it up, then do something with the result. Maybe you're processing sentences, which are all of variable length, and you want to do sentiment analysis, or somesuch. The beauty of recurrent architectures like LSTMs is that they are able to carry information from one execution to another, and they are therefore able to build up an internal representation of a sequence when applied upon one time point after another.
When building an LSTM layer in Flux, you are therefore declaring not the length of the sequence you will feed in, but rather the dimensionality of each time point; imagine if you had an accelerometer reading that was 1000 points long and gave you X, Y, Z values at each time point; to read that in, you would create an LSTM that takes in a dimensionality of 3, then feed it 1000 times.
Writing our own training loop
I find it very instructive to write our own training loop and model execution function so that we have full control over everything. When dealing with time series, it's often easy to get confused about how to call LSTMs and Dense layers and whatnot, so I offer these simple rules of thumb:
When mapping from one time series to another (E.g. constantly predict future motion from previous motion), you can use a single Chain and call it in a loop; for every input time point, you output another.
When mapping from a time series to a single "output" (E.g. reduce sentence to "happy sentiment" or "sad sentiment") you must first chomp all the data up and reduce it to a fixed size; you feed many things in, but at the end, only one comes out.
We're going to re-architect our model into two pieces; first the recurrent "pacman" section, where we chomp up a variable-length time sequence into an internal state vector of pre-determined length, then a feed-forward section that takes that internal state vector and reduces it down to a single output:
pacman = Chain(
LSTM(1, 128), # map from timepoint size 1 to 128
LSTM(128, 256), # blow it up even larger to 256
LSTM(256, 128), # bottleneck back down to 128
)
reducer = Chain(
Dense(128, 9),
#softmax, # keep this commented out for now
)
The reason we split it up into two pieces like this is because the problem statement wants us to reduce a variable-length input series to a single number; we're in the second bullet point above. So our code naturally must take this into account; we will write our loss(x, y) function to, instead of calling model(x), it will instead do the pacman dance, then call the reducer on the output. Note that we also must reset!() the RNN state so that the internal state is cleared for each independent training example:
function loss(x, y)
# Reset internal RNN state so that it doesn't "carry over" from
# the previous invocation of `loss()`.
Flux.reset!(pacman)
# Iterate over every timepoint in `x`
for x_t in x
y_hat = pacman(x_t)
end
# Take the very last output from the recurrent section, reduce it
y_hat = reducer(y_hat)
# Calculate reduced output difference against `y`
return Flux.mse(y_hat, y)
end
Feeding this into Flux.train!() actually trains, albeit not very well. ;)
Final observations
Although your data is all Int64's, it's pretty typical to use floating point numbers with everything except embeddings (an embedding is a way to take non-numeric data such as characters or words and assign numbers to them, kind of like ASCII); if you're dealing with text, you're almost certainly going to be working with some kind of embedding, and that embedding will dictate what the dimensionality of your first LSTM is, whereupon your inputs will all be "one-hot" encoded.
softmax is used when you want to predict probabilities; it's going to ensure that for each input, the outputs are all between [0...1] and moreover that they sum to 1.0, like a good little probability distribution should. This is most useful when doing classification, when you want to wrangle your wild network output values of [-2, 5, 0.101] into something where you can say "we have 99.1% certainty that the second class is correct, and 0.7% certainty it's the third class."
When training these networks, you're often going to want to batch multiple time series at once through your network for hardware efficiency reasons; this is both simple and complex, because on one hand it just means that instead of passing a single Sx1 vector through (where S is the size of your embedding) you're instead going to be passing through an SxN matrix, but it also means that the number of timesteps of everything within your batch must match (because the SxN must remain the same across all timesteps, so if one time series ends before any of the others in your batch you can't just drop it and thereby reduce N halfway through a batch). So what most people do is pad their timeseries all to the same length.
Good luck in your ML journey!
I'm trying to construct the simplest example of Bayesian network with several discrete random variables and conditional probabilities (the "Student Network" from Koller's book, see 1)
Although a bit unwieldy, I managed to build this network using pymc3. Especially, creating the CPDs is not that straightforward in pymc3, see the snippet below:
import pymc3 as pm
...
with pm.Model() as basic_model:
# parameters for categorical are indexed as [0, 1, 2, ...]
difficulty = pm.Categorical(name='difficulty', p=[0.6, 0.4])
intelligence = pm.Categorical(name='intelligence', p=[0.7, 0.3])
grade = pm.Categorical(name='grade',
p=pm.math.switch(
theano.tensor.eq(intelligence, 0),
pm.math.switch(
theano.tensor.eq(difficulty, 0),
[0.3, 0.4, 0.3], # I=0, D=0
[0.05, 0.25, 0.7] # I=0, D=1
),
pm.math.switch(
theano.tensor.eq(difficulty, 0),
[0.9, 0.08, 0.02], # I=1, D=0
[0.5, 0.3, 0.2] # I=1, D=1
)
)
)
letter = pm.Categorical(name='letter', p=pm.math.switch(
...
But I have no idea how to build this network using tensoflow-probability (versions: tfp-nightly==0.7.0.dev20190517, tf-nightly-2.0-preview==2.0.0.dev20190517)
For the unconditioned binary variables, one can use categorical distribution, such as
from tensorflow_probability import distributions as tfd
from tensorflow_probability import edward2 as ed
difficulty = ed.RandomVariable(
tfd.Categorical(
probs=[0.6, 0.4],
name='difficulty'
)
)
But how to construct the CPDs?
There are few classes/methods in tensorflow-probability that might be relevant (in tensorflow_probability/python/distributions/deterministic.py or the deprecated ConditionalDistribution) but the documentation is rather sparse (one needs deep understanding of tfp).
--- Updated question ---
Chris' answer is a good starting point. However, things are still a bit unclear even for a very simple two-variable model.
This works nicely:
jdn = tfd.JointDistributionNamed(dict(
dist_x=tfd.Categorical([0.2, 0.8], validate_args=True),
dist_y=lambda dist_x: tfd.Bernoulli(probs=tf.gather([0.1, 0.9], indices=dist_x), validate_args=True)
))
print(jdn.sample(10))
but this one fails
jdn = tfd.JointDistributionNamed(dict(
dist_x=tfd.Categorical([0.2, 0.8], validate_args=True),
dist_y=lambda dist_x: tfd.Categorical(probs=tf.gather_nd([[0.1, 0.9], [0.5, 0.5]], indices=[dist_x]))
))
print(jdn.sample(10))
(I'm trying to model categorical explicitly in the second example just for learning purposes)
-- Update: solved ---
Obviously, the last example wrongly used tf.gather_nd instead of tf.gather as we only wanted to select the first or the second row based on the dist_x outome. This code works now:
jdn = tfd.JointDistributionNamed(dict(
dist_x=tfd.Categorical([0.2, 0.8], validate_args=True),
dist_y=lambda dist_x: tfd.Categorical(probs=tf.gather([[0.1, 0.9], [0.5, 0.5]], indices=[dist_x]))
))
print(jdn.sample(10))
The tricky thing about this, and presumably the reason it's subtler than expected in PyMC, is -- as with almost everything in vectorized programming -- handling shapes.
In TF/TFP, the (IMO) nicest way to solve this is with one of the new TFP JointDistribution{Sequential,Named,Coroutine} classes. These let you naturally represent hierarchical PGM models, and then sample from them, evaluate log probs, etc.
I whipped up a colab notebook demoing all 3 approaches, for the full student network: https://colab.research.google.com/drive/1D2VZ3OE6tp5pHTsnOAf_7nZZZ74GTeex
Note the crucial use of tf.gather and tf.gather_nd to manage the vectorization of the various binary and categorical switching.
Have a look and let me know if you have any questions!
I am having a data set of 144 student feedback with 72 positive and 72 negative feedback respectively. The data set has two attributes namely data and target which contain the sentence and the sentiment(positive or negative) respectively.
Consider the following code:
import pandas as pd
feedback_data = pd.read_csv('output.csv')
print(feedback_data)
data target
0 facilitates good student teacher communication. positive
1 lectures are very lengthy. negative
2 the teacher is very good at interaction. positive
3 good at clearing the concepts. positive
4 good at clearing the concepts. positive
5 good at teaching. positive
6 does not shows test copies. negative
7 good subjective knowledge. positive
8 good communication skills. positive
9 good teaching methods. positive
10 posseses very good and thorough knowledge of t... positive
11 posseses superb ability to provide a lots of i... positive
12 good conceptual skills and knowledge for subject. positive
13 no commuication outside class. negative
14 rude behaviour. negative
15 very negetive attitude towards students. negative
16 good communication skills, lacks time punctual... positive
17 explains in a better way by giving practical e... positive
18 hardly comes on time. negative
19 good communication skills. positive
20 to make students comfortable with the subject,... negative
21 associated to original world. positive
22 lacks time punctuality. negative
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary = True)
cv.fit(feedback_data['data'].values)
X = feedback_data['data'].apply(lambda X : cv.transform([X])).values
X_test = cv.transform(feedback_data_test)
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
target = [1 if i<72 else 0 for i in range(144)]
print(target)
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.50)
clf = svm.SVC(kernel = 'linear', gamma = 0.001, C = 0.05)
#The below line gives the error
clf.fit(X , target)
I do not know what is wrong. Please help
The error comes from the way X as been done. You cannot use directly X in the Fit method. You need first to transform it a little bit more (i could not have told you that for the other problem as i did not have the info)
right now you have the following:
array([<1x23 sparse matrix of type '<class 'numpy.int64'>'
with 5 stored elements in Compressed Sparse Row format>,
...
<1x23 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>], dtype=object)
Which is enough to do a split.
We are just going to transform it you can understand and so will the fit method:
X = list([list(x.toarray()[0]) for x in X])
What we do is convert the sparse matrix to a numpy array, take the first element (it has only one element) and then convert it to a list to make sure it has the right dimension.
Now why are we doing this:
X is something like that
>>>X[0]
<1x23 sparse matrix of type '<class 'numpy.int64'>'
with 5 stored elements in Compressed Sparse Row format>
so we transform it to see what it realy is:
>>>X[0].toarray()
array([[0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
0]], dtype=int64)
and then as you see there is a slight issue with the dimension so we take the first element.
going back to a list does nothing, it's just for you to understand well what you see. (you can dump it for speed)
your code is now this:
cv = CountVectorizer(binary = True)
cv.fit(df['data'].values)
X = df['data'].apply(lambda X : cv.transform([X])).values
X = list([list(x.toarray()[0]) for x in X])
clf = svm.SVC(kernel = 'linear', gamma = 0.001, C = 0.05)
clf.fit(X, target)
According to the original paper on Dropout said regularisation method can be applied to convolution layers often improving their performance. TensorFlow function tf.nn.dropout supports that by having a noise_shape parameter to allow the user to choose which parts of the tensors will drop out independently. However, neither the paper nor the documentation give a clear explanation of which dimensions should be kept independently, and the TensorFlow explanation of how noise_shape works is rather unclear.
only dimensions with noise_shape[i] == shape(x)[i] will make independent decisions.
I would assume that for a typical CNN layer output of the shape [batch_size, height, width, channels] we don't want individual rows or columns to drop out by themselves, but rather whole channels (which would be equivalent to a node in a fully connected NN) independently of the examples (i.e. different channels could be dropped for different examples in a batch). Am I correct in this assumption?
If so, how would one go about implementing dropout with such specificity using the noise_shape parameter? Would it be:
noise_shape=[batch_size, 1, 1, channels]
or:
noise_shape=[1, height, width, 1]
from here,
For example, if shape(x) = [k, l, m, n] and noise_shape = [k, 1, 1, n], each batch and channel component will be kept independently and each row and column will be kept or not kept together.
The code may help explain this.
noise_shape = noise_shape if noise_shape is not None else array_ops.shape(x)
# uniform [keep_prob, 1.0 + keep_prob)
random_tensor = keep_prob
random_tensor += random_ops.random_uniform(noise_shape,
seed=seed,
dtype=x.dtype)
# 0. if [keep_prob, 1.0) and 1. if [1.0, 1.0 + keep_prob)
binary_tensor = math_ops.floor(random_tensor)
ret = math_ops.div(x, keep_prob) * binary_tensor
ret.set_shape(x.get_shape())
return ret
the line random_tensor += supports broadcast. When the noise_shape[i] is set to 1, that means all elements in this dimension will add the same random value ranged from 0 to 1. So when noise_shape=[k, 1, 1, n], each row and column in the feature map will be kept or not kept together. On the other hand, each example (batch) or each channel receives different random values and each of them will be kept independently.
I have trained a classifier and I now want to pass any single image through.
I'm using the keras library with Tensorflow as the backend.
I'm getting an error I can't seem to get past
img_path = '/path/to/my/image.jpg'
import numpy as np
from keras.preprocessing import image
x = image.load_img(img_path, target_size=(250, 250))
x = image.img_to_array(x)
x = np.expand_dims(x, axis=0)
preds = model.predict(x)
Do I need to reshape my data to have None as the first dimension? I'm confused why Tensorflow would expect None as the first dimension?
Error when checking : expected convolution2d_input_1 to have shape (None, 250, 250, 3) but got array with shape (1, 3, 250, 250)
I'm wondering if there has been an issue with the architecture of my trained model?
edit: if i call model.summary() give convolution2d_input_1 as...
Edit: I did play around with the suggestion below but used numpy to transpose instead of tf - still seem to be hitting the same issue!
None matches any number. Usually, when you pass some data to a model, it is expected that you pass tensor of dimensions: None x data_size, meaning the first dimension is any dimension and denotes batch size. In your case, the problem is that you pass 250 x 250 x 3, and it is expected 3 x 250 x 250. Try:
x = image.load_img(img_path, target_size=(250, 250))
x_trans = tf.transpose(x, perm=[2, 0, 1])
x_expanded = np.expand_dims(x_trans, axis=0)
preds = model.predict(x_expanded)
Ok so using feedback rom Sygi i think i have half solved it,
The error was actually telling me i needed to pass in my dimensions as [1, 250, 250, 3] so that was an easy fix; i must say im not sure why TF is expecting the dimensions in this order as looking at the docs it doesnt seem right so more research required here.
Moving ahead im not sure transpose is the way to go as if i use a different input image the dimensions may not be in the same order meaning the transpose doesnt work properly,
Instead of transpose I'm probably trying to t call x_reshape = img.reshape((1, 250, 250, 3)) depending on what i find out about dimension order in reshaping for TS
thanks for the hints Sygi :)