Classifying new instance with Bayesian net - machine-learning

Say I have the following Bayesian network:
And I want to classify a new instance on whether H=true or H=false.
The new instance looks, for example, like this: Fi=true, A=false, S=true, and Ti=false.
How can I classify the instance with respect to H?
I can compute the probability by multiplying the probabilities from the tables:
0.4 * 0.7 * 0.5 * 0.2 = 0.028
What does this say about whether the new instance is a positive instance (H=true) or not?
EDIT
I will try to compute the probability according to Bernhard Kausler's suggestion:
So this is Bayes' rule:
P(H|S,Ti,Fi,A) = P(H,S,Ti,Fi,A) / P(S,Ti,Fi,A)
To compute the denominator:
P(S,Ti,Fi,A) = P(H=T,S,Ti,Fi,A) + P(H=F,S,Ti,Fi,A) = (0.7 * 0.5 * 0.8 * 0.4 * 0.3) + (0.3 * 0.5 * 0.8 * 0.4 * 0.3) = 0.048
P(H=true,S,Ti,Fi,A) = 0.0336
so P(H=true|S,Ti,Fi,A) = 0.0336 / 0.048 = 0.7
Now I compute P(H=false|S,Ti,Fi,A) = P(H=false,S,Ti,Fi,A) / P(S,Ti,Fi,A).
We already have the value for P(S,Ti,Fi,A); it's 0.048.
P(H=false,S,Ti,Fi,A) = 0.0144
so P(H=false|S,Ti,Fi,A) = 0.0144 / 0.048 = 0.3
The probability P(H=true|S,Ti,Fi,A) is the higher of the two, so the new instance will be classified as H=true.
Is this correct?
Addition: We do not need to calculate P(H=false|S,Ti,Fi,A) because it is 1 - P(H=true|S,Ti,Fi,A).

So, you want to compute the conditional probability P(H|S,Ti,Fi,A). To do that, you have to use Bayes' rule:
P(H|S,Ti,Fi,A) = P(H,S,Ti,Fi,A) / P(S,Ti,Fi,A)
where
P(S,Ti,Fi,A) = P(H=T,S,Ti,Fi,A)+P(H=F,S,Ti,Fi,A)
You then calculate both conditional probabilities P(H=T|S,Ti,Fi,A) and P(H=F|S,Ti,Fi,A) and make a prediction according to which probability is higher.
Just multiplying up the numbers like you did won't help and doesn't even give you a proper probability since the product is not normalized.
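For concreteness, here is a minimal sketch of that computation in Python, with the joint-probability factors hard-coded from the numbers quoted in your edit:

joint_h_true = 0.7 * 0.5 * 0.8 * 0.4 * 0.3    # P(H=T,S,Ti,Fi,A) = 0.0336
joint_h_false = 0.3 * 0.5 * 0.8 * 0.4 * 0.3   # P(H=F,S,Ti,Fi,A) = 0.0144

# Bayes' rule: normalize the joints by the evidence
evidence = joint_h_true + joint_h_false       # P(S,Ti,Fi,A) = 0.048
p_h_true = joint_h_true / evidence            # 0.7
p_h_false = joint_h_false / evidence          # 0.3

prediction = "H=true" if p_h_true > p_h_false else "H=false"
print(prediction)                             # H=true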

Related

How does weighted_metric support the sample_weight in Keras

I read the official doc for creating a custom metric. It says:
Note that sample weighting is automatically supported for any such metric.
I wonder how sample weighting is supported for a more complicated metric, for example one that computes the weighted correlation between y_true and y_pred in Keras. Code below:
from keras import backend as K

def customized_correlation(y_true, y_pred, sample_weights):
    x = y_true
    y = y_pred
    mx = K.mean(x)
    my = K.mean(y)
    xm, ym = x - mx, y - my
    r_num = K.sum(xm * ym * sample_weights)
    r_den = K.sqrt(K.sum(K.square(xm) * sample_weights) * K.sum(K.square(ym) * sample_weights))
    r = r_num / r_den
    return r
If we remove the sample_weights variable from the code, how does Keras know where sample_weights should be inserted to calculate the weighted correlation?
It does not, and it will not work. Using sample_weights simply means that the resulting per-sample metric vector will be multiplied (element-wise) by the weight vector at the very end.
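As an illustration, here is a simplified sketch (not Keras's actual implementation) of that reduction; it is conceptually just an element-wise multiply followed by a weighted mean, which is why it only helps for metrics that return one value per sample:

import numpy as np

# per-sample metric values, e.g. the squared error of each of 3 samples
metric_per_sample = np.array([0.5, 2.0, 1.0])
sample_weights = np.array([1.0, 0.5, 2.0])

# Keras-style reduction: multiply element-wise, then normalize by total weight
weighted_metric = np.sum(metric_per_sample * sample_weights) / np.sum(sample_weights)
print(weighted_metric)  # 1.0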

Modifying the loss in PPO in stable-baselines3

I'm trying to implement an addition to the loss function of the PPO algorithm in stable-baselines3. For this I collected additional observations for the states s(t-10) and s(t+1), which I can access in the train function of the PPO class in ppo.py as part of the rollout_buffer.
I'm using a 3-layer MLP as my network architecture and need the outputs of the second layer for the triplet (s(t-α), s(t), s(t+1)) to calculate L = max(d(s(t+1), s(t)) − d(s(t+1), s(t−α)) + γ, 0), where d is the L2 distance.
Finally, I want to add this term to the old loss, so loss = loss + 0.3 * L
This is my implementation starting with the original loss in line 242:
loss = policy_loss + self.ent_coef * entropy_loss + self.vf_coef * value_loss
###############################
net1 = nn.Sequential(*list(self.policy.mlp_extractor.policy_net.children())[:-1])
L_losses = []
a = 0
obs = rollout_data.observations
obs_alpha = rollout_data.observations_alpha
obs_plusone = rollout_data.observations_plusone
inds = rollout_data.inds
for i in inds:
    if i > alpha:  # only use observations for which L can be calculated
        fs_t = net1(obs[a])
        fs_talpha = net1(obs_alpha[a])
        fs_tone = net1(obs_plusone[a])
        L = max(th.norm(th.subtract(fs_tone, fs_t)) - th.norm(th.subtract(fs_tone, fs_talpha)) + 1.0, 0.0)
        L_losses.append(L)
    else:
        L_losses.append(0)
    a += 1
L_loss = th.mean(th.FloatTensor(L_losses))
loss += 0.3 * L_loss
So with net1 I tried to get a clone of the original network with the outputs from the second layer. I am unsure if this is the right way to do this.
I have some questions about my approach, as the resulting performance is slightly worse than without the added term, although it should be slightly better:
Is my way of getting the outputs of the second layer of the MLP network working?
When loss.backward() is called, can the gradient be calculated correctly (with the new term included)?
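For reference, here is a standalone sketch (with a made-up 3-layer MLP standing in for self.policy.mlp_extractor.policy_net) of another way to get the second layer's outputs, via a forward hook, without rebuilding the net:

import torch as th
import torch.nn as nn

# made-up MLP standing in for the actual policy network
policy_net = nn.Sequential(
    nn.Linear(8, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
)

activations = {}

def save_output(module, inputs, output):
    # store the intermediate output; the autograd graph stays intact
    activations["second_layer"] = output

# index 3 is the Tanh after the second Linear in this layout
handle = policy_net[3].register_forward_hook(save_output)

obs = th.randn(5, 8)
_ = policy_net(obs)                  # full forward pass
fs_t = activations["second_layer"]   # shape (5, 64): second-layer features
handle.remove()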

Strange Loss function behaviour when training CNN

I'm trying to train my network on MNIST using a self-made CNN (C++).
It gives good enough results when I use a simple model, like:
Convolution (2 feature maps, 5x5) (Tanh) -> MaxPool (2x2) -> Flatten -> Fully-Connected (64) (Tanh) -> Fully-Connected (10) (Sigmoid).
After 4 epochs, it behaves like here [1].
After 16 epochs, it gives ~6.5% error on the test dataset.
But in the case of 4 feature maps in the Conv layer, the MSE value isn't improving, sometimes even increasing 2.5 times [2].
Online training mode is used, with the help of an Adam optimizer (alpha: 0.01, beta_1: 0.9, beta_2: 0.999, epsilon: 1.0e-8). The update is calculated as:
double AdamOptimizer::calc(int t, double& m_t, double& v_t, double g_t)
{
    // exponential moving averages of the gradient and squared gradient
    m_t = this->beta_1 * m_t + (1.0 - this->beta_1) * g_t;
    v_t = this->beta_2 * v_t + (1.0 - this->beta_2) * (g_t * g_t);
    // bias-corrected estimates
    double m_t_aver = m_t / (1.0 - std::pow(this->beta_1, t + 1));
    double v_t_aver = v_t / (1.0 - std::pow(this->beta_2, t + 1));
    return -(this->alpha * m_t_aver) / (std::sqrt(v_t_aver) + this->epsilon);
}
So, can this problem be caused by the lack of some additional learning techniques (dropout, batch normalization) or by wrongly set parameters? Or is it caused by some implementation issue?
P.S. I can provide a GitHub link if necessary.
Try to decrease the learning rate; alpha = 0.01 is high for Adam, whose commonly used default is 0.001.
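To see why, here is a quick Python port of the calc above (assuming the same semantics): because of the bias correction, the very first update has magnitude close to alpha no matter how small the gradient is, so alpha = 0.01 takes large steps on every sample in online training:

import math

def adam_step(t, m_t, v_t, g_t, alpha=0.01, beta_1=0.9, beta_2=0.999, eps=1e-8):
    # same update as AdamOptimizer::calc above
    m_t = beta_1 * m_t + (1.0 - beta_1) * g_t
    v_t = beta_2 * v_t + (1.0 - beta_2) * (g_t * g_t)
    m_aver = m_t / (1.0 - beta_1 ** (t + 1))
    v_aver = v_t / (1.0 - beta_2 ** (t + 1))
    return -(alpha * m_aver) / (math.sqrt(v_aver) + eps), m_t, v_t

step, m, v = adam_step(0, 0.0, 0.0, g_t=1e-4)
print(step)  # ~ -0.01: a full alpha-sized step even for a tiny gradient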

Is there an easy way to implement an Optimizer.Maximize() function in TensorFlow

There are several experiments that rely on gradient ascent rather than gradient descent. I have looked into some approaches to using "cost" and the minimize function to simulate "maximize", but I am still not certain how to properly implement a maximize() function. Also, in most of these cases, I would say they are closer to unsupervised learning. So given this code concept for a cost function:
cost = (Yexpected - Ycalculated)^2
train_step = tf.train.AdamOptimizer(0.5).minimize(cost)
I would like to write something where I am following the positive gradient and there may not be a Yexpected value:
maxMe = Function(Ycalculated)
train_step = tf.train.AdamOptimizer(0.5).maximize(maxMe)
A good example of this need is "http://cs229.stanford.edu/proj2009/LvDuZhai.pdf" with Recurrent Reinforcement Learning.
I have read a few papers and references stating that changing the sign will flip the direction of movement toward an increasing gradient, but given TensorFlow's internal calculation of the gradient, I am not sure whether this will work to maximize, as I don't know of a way to validate the results:
maxMe = Function(Ycalculated)
train_step = tf.train.AdamOptimizer(0.5).minimize( -1 * maxMe )
The intuition is simple: the minimize() function keeps pushing the given value down. For example, if you start with 5, then with every iteration (depending on the learning rate) the value will become, say, 4, then 3, then 2, 1, 0, and so on, if it is possible to bring it down further.
Now if you pass -5 at the beginning (which is in fact +5, but you changed the sign explicitly), the gradient will try to change the parameters to bring the number down further, to -6, -7, -8, etc. But in fact the function is increasing, because we changed the sign; the actual sign is (+). In other words, in the latter case the gradient is changing the parameters of the neural network in a way that maximizes the function, not minimizes it.
Toy example with arbitrary numbers:
The input x = 1.5, the weight parameter at time (t) w_t = 0.1,
the observed response y = 3.0, and the learning rate lr = 0.1.
x * w = 0.15 (this is y predicted for the current w)
loss function = (3.0 - 0.15)^2 = 8.1
Applying gradient descent:
w_(t+1) = w_t - lr * (derivative of loss function with respect to w)
w_(t+1) = 0.1 - (0.1 * [1.5 * 2(0.15 - 3.0)]) = 0.1 - (-0.855) = 0.955
If we use the new w_(t+1) we will have:
1.5 * 0.955 = 1.43 (which is closer to the correct answer 3.0)
and the new loss is: (3.0 - 1.43)^2 = 2.46 (smaller error).
If we keep iterating, we will adjust w to a value that gives us the minimum cost possible.
Now let's repeat the same experiment but with the sign flipped to negative:
loss function = - (3.0 - 0.15)^2 = -8.1
Applying gradient descent:
w_(t+1) = w_t - lr * (derivative of loss function with respect to w)
w_(t+1) = 0.1 - (0.1 * [1.5 * -2(0.15 - 3.0)]) = 0.1 - 0.855 = -0.755
If we apply the new w_(t+1) we will have:
1.5 * -0.755 = -1.1325, and the new loss is: (3.0 - (-1.1325))^2 = 17.08
(the loss function is maximizing!).
That is also applicable to any differentiable function, but this is just a simple naive example to demonstrate the idea.
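To double-check the arithmetic, a few lines of Python reproduce the toy example above:

x, w, y, lr = 1.5, 0.1, 3.0, 0.1

grad = 2 * (x * w - y) * x        # d/dw of (y - x*w)^2 = -8.55

# gradient descent on the loss: the error shrinks
w_desc = w - lr * grad            # 0.955
print((y - x * w_desc) ** 2)      # ~2.46

# the same step on the negated loss: the original error grows
w_asc = w - lr * (-grad)          # -0.755
print((y - x * w_asc) ** 2)       # ~17.08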
So, you can do, as you suggested already:
optimizer.minimize( -1 * value)
Or if you like, create a wrapper function (which in fact is needless, but just to mention it):
def maximize(optimizer, value, **kwargs):
    return optimizer.minimize(-value, **kwargs)
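A self-contained toy usage (TF1-style graph mode, matching the question's API), maximizing -(x - 3)^2, which peaks at x = 3:

import tensorflow as tf

def maximize(optimizer, value, **kwargs):
    return optimizer.minimize(-value, **kwargs)

x = tf.Variable(0.0)
objective = -(x - 3.0) ** 2       # maximal at x = 3
train_step = maximize(tf.train.AdamOptimizer(0.5), objective)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(200):
        sess.run(train_step)
    print(sess.run(x))            # ~3.0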

Weka IBk parameter details (distanceWeighting, meanSquared)

I am using the kNN algorithm to classify. In Weka they have provided various parameter settings for kNN. I am interested to know about distanceWeighting and meanSquared.
In distanceWeighting we have three values (no distance weighting, weight by 1/distance, and weight by 1-distance). What are these values and what is their impact?
Can someone please explain? :)
If one uses "no distance weighting", then the predicted value for your data points is the average of all k neighbors. For example
# if values_of_3_neighbors = 4, 5, 6
# then predicted_value = (4+5+6)/3 = 5
For 1/distance weighting, the weight of each neighbor is inversely proportional to the distance to it. The idea is: the closer the neighbor, the more it influences the predicted value. For example
# distance_to_3_neighbors = 1, 3, 5
# weights_of_neighbors = 1/1, 1/3, 1/5  # sum = 1 + 0.33 + 0.2 = 1.53
# normalized_weights_of_neighbors = 1/1.53, 0.33/1.53, 0.2/1.53 = 0.654, 0.216, 0.131
# then predicted_value = 4*0.654 + 5*0.216 + 6*0.131 = 4.48
For 1-distance weighting it is similar, except each neighbor's weight is 1 minus its distance; this is only applicable when all your distances are in the [0,1] range.
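A small Python sketch of both cases, using the numbers above:

import numpy as np

values = np.array([4.0, 5.0, 6.0])       # values of the 3 neighbors
distances = np.array([1.0, 3.0, 5.0])    # distances to them

# no distance weighting: plain average
print(values.mean())                     # 5.0

# 1/distance weighting: closer neighbors count more
weights = 1.0 / distances
print(np.sum(values * weights) / np.sum(weights))  # ~4.48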
Hope this helps
