How to Implement "XOR" in Bayesian Networks?

In Graphical Models and Bayesian Networks, how do you implement XOR problem?
I read bayesian network vs bayes classifier here:
A Naive Bayes classifier is a simple model that describes a particular class of Bayesian network, one in which all of the features are class-conditionally independent. Because of this, there are certain problems that Naive Bayes cannot solve (example below). However, its simplicity also makes it easier to apply, and in many cases it requires less data to get a good result.
Example: XOR You have a learning problem with binary features x_1, x_2 and a target variable y = x_1 XOR x_2.
In a Naive Bayes classifier, x_1 and x_2 must be treated independently - so you would compute things like "The probability that y = 1 given that x_1 = 1" - hopefully you can see that this isn't helpful, because x_1 = 1 doesn't make y = 1 any more or less likely. Since a Bayesian network does not assume independence, it would be able to solve such a problem.
I googled, but could not figure out how. Can someone give me a hint or good references? Thanks!

This is actually fairly simple.
The DAG of the model would look like
x1 -> XOR <- x2
The probability distribution for the XOR node can then be written as:
x1  x2 | P(XOR=1 | x1, x2)
 0   0 | 0
 0   1 | 1
 1   0 | 1
 1   1 | 0
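As a concrete illustration (not part of the original answer), here is a minimal sketch of this network using the pgmpy library; the uniform priors on x1 and x2 are assumptions for the example, and class names vary a bit across pgmpy versions (older releases call the model class BayesianModel):
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# DAG: x1 -> xor <- x2
model = BayesianNetwork([("x1", "xor"), ("x2", "xor")])

# Priors for the parents (assumed uniform, purely for illustration)
cpd_x1 = TabularCPD("x1", 2, [[0.5], [0.5]])
cpd_x2 = TabularCPD("x2", 2, [[0.5], [0.5]])

# Deterministic CPT from the table above; columns are (x1, x2) = (0,0), (0,1), (1,0), (1,1)
cpd_xor = TabularCPD(
    "xor", 2,
    [[1, 0, 0, 1],   # P(xor = 0 | x1, x2)
     [0, 1, 1, 0]],  # P(xor = 1 | x1, x2)
    evidence=["x1", "x2"], evidence_card=[2, 2],
)

model.add_cpds(cpd_x1, cpd_x2, cpd_xor)
model.check_model()

# Query P(xor | x1 = 1, x2 = 0): all probability mass lands on xor = 1
print(VariableElimination(model).query(["xor"], evidence={"x1": 1, "x2": 0}))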

Related

Can any machine learning algorithm find this pattern: x1 < x2 without generating a new feature (e.g. x1-x2) first?

If I had 2 features x1 and x2 where I know that the pattern is:
if x1 < x2 then
    class1
else
    class2
Can any machine learning algorithm find such a pattern? What algorithm would that be?
I know that I could create a third feature x3 = x1-x2. Then feature x3 can easily be used by some machine learning algorithms. For example a decision tree can solve the problem 100% using x3 and just 3 nodes (1 decision and 2 leaf nodes).
But, is it possible to solve this without creating new features? This seems like a problem that should be easily solved 100% if a machine learning algorithm could only find such a pattern.
I tried MLP and SVM with different kernels (including the RBF kernel) and the results are not great. As an example of what I tried, here is the scikit-learn code, where the SVM could only get a score of 0.992:
import numpy as np
from sklearn.svm import SVC
# Generate 1000 samples with 2 features with random values
X_train = np.random.rand(1000,2)
# Label each sample. If feature "x1" is less than feature "x2" then label as 1, otherwise label is 0.
y_train = X_train[:,0] < X_train[:,1]
y_train = y_train.astype(int) # convert boolean to 0 and 1
svc = SVC(kernel = "rbf", C = 0.9) # tried all kernels and C values from 0.1 to 1.0
svc.fit(X_train, y_train)
print("SVC score: %f" % svc.score(X_train, y_train))
Output running the code:
SVC score: 0.992000
This is an oversimplification of my problem. The real problem may have hundreds of features and different patterns, not just x1 < x2. However, to start with it would help a lot to know how to solve for this simple pattern.
To understand this, you need to look at the parameters sklearn's SVC exposes, C in particular, and at how the value of C influences the classifier's training procedure.
If you look at the optimization problem in the User Guide for SVC, it has two main parts: the first term tries to keep the weights small (i.e. find a large margin), and the second term tries to minimize the classification errors.
C is the penalty multiplier associated with misclassifications. If you decrease C, you reduce that penalty (typically lower training accuracy but better generalization to the test set), and vice versa.
Try setting C to 1e+6. You will see that you almost always get 100% accuracy: the classifier has learnt the pattern x1 < x2. With the default settings it simply decides that 99.2% accuracy is good enough. Another relevant parameter is tol, which controls how much optimization error is considered negligible; by default it is 1e-3, and reducing the tolerance tends to push the training accuracy up as well.
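As a quick check (reusing X_train and y_train from the question's snippet; the exact score depends on the random data, but it is typically 1.0):
from sklearn.svm import SVC

# Same data as above, but with a much larger misclassification penalty
svc_big_c = SVC(kernel="rbf", C=1e6)
svc_big_c.fit(X_train, y_train)
print("SVC score: %f" % svc_big_c.score(X_train, y_train))  # usually 1.000000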
In general, I would suggest using something like GridSearchCV (link) to find good values for hyperparameters like C, since it internally splits the dataset into training and validation folds. That helps ensure you are not just tweaking the hyperparameters to get a good training accuracy, but that the classifier will also do well in practice.
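A sketch of that suggestion, again reusing X_train and y_train from above (the parameter grid here is just an illustration, not a recommendation):
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [0.1, 1, 10, 100, 1e4, 1e6],   # misclassification penalty
    "gamma": ["scale", 0.1, 1, 10],      # RBF kernel width
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)             # internally does train/validation splits
print(search.best_params_, search.best_score_)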

Confused about sklearn’s implementation of OSVM

I have recently started experimenting with OneClassSVM (using sklearn) for unsupervised learning and I followed this example.
I apologize for the silly questions, but I'm a bit confused about two things:
Should I train my SVM on both the regular examples and the outliers, or on the regular examples only?
Which of the labels predicted by the OneClassSVM represents outliers: 1 or -1?
Once again, I apologize for these questions, but for some reason I cannot find this documented anywhere.
As the example you reference is about novelty detection, the docs say:
novelty detection:
The training data is not polluted by outliers, and we are interested in detecting anomalies in new observations.
Meaning: you should train on regular examples only.
The approach is based on:
Schölkopf, Bernhard, et al. "Estimating the support of a high-dimensional distribution." Neural computation 13.7 (2001): 1443-1471.
Extract:
Suppose you are given some data set drawn from an underlying probability distribution P and you want to estimate a "simple" subset S of input space such that the probability that a test point drawn from P lies outside of S equals some a priori specified value between 0 and 1.
We propose a method to approach this problem by trying to estimate a function f that is positive on S and negative on the complement.
The above docs also say:
Inliers are labeled 1, while outliers are labeled -1.
This can also be seen in your example code, extracted:
# Generate some regular novel observations
X = 0.3 * np.random.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
...
# all regular = inliers (defined above)
y_pred_test = clf.predict(X_test)
...
# -1 = outlier <-> error as assumed to be inlier
n_error_test = y_pred_test[y_pred_test == -1].size
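Putting both answers together, a minimal self-contained sketch (the data and the nu/gamma values here are made up for illustration, not taken from the referenced example):
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)

# Train on regular (inlier) observations only
X_train = 0.3 * rng.randn(100, 2)

# Test set: some new regular observations plus some obvious outliers
X_test = np.r_[0.3 * rng.randn(20, 2),
               rng.uniform(low=-4, high=4, size=(20, 2))]

clf = OneClassSVM(kernel="rbf", nu=0.1, gamma=0.1)
clf.fit(X_train)

pred = clf.predict(X_test)           # +1 = inlier, -1 = outlier
print("predicted outliers:", int((pred == -1).sum()))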

Laplace Smoothing for Bernoulli model for naive bayes classifier

I have to implement a naive Bayes classifier for classifying documents into classes. For the conditional probability of a term given a class, with Laplace smoothing, we have:
prob(t | c) = (Num(occurrences of the word in the docs of class c) + 1) / (Num(documents in class c) + |V|)
It's a Bernoulli model, so each feature is either 1 or 0, and the vocabulary is really large, perhaps 20000 words or so. So won't the Laplace smoothing give really small values due to the large size of the vocabulary, or am I doing something wrong?
According to the pseudocode from this link: http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html, for the Bernoulli model we just add 2 instead of |V|. Why so?
Consider the case of multinomial naive Bayes. The smoothing you defined above is such that you can never get a zero probability.
In the multivariate/Bernoulli case, there is an additional constraint: probabilities of exactly 1 are not allowed either. This is because when some t from the known vocabulary is not present in the document d, a probability of 1 - prob(t | c) is multiplied into the document probability. If prob(t | c) is 1, then once again this is going to produce a posterior probability of 0.
(Likewise, when using logs instead, log(1 - prob(t | c)) is undefined when the probability is 1)
So in the Bernoulli estimate (Nct + 1) / (Nc + 2), both cases are protected against: a term that never occurs in class c gets probability 1 / (Nc + 2) rather than 0, and a term that occurs in every document of class c (Nct == Nc) gets probability (Nc + 1) / (Nc + 2) rather than 1. In the degenerate case with no data at all (Nct == Nc == 0), the estimate is 1/2 regardless of whether t is present (P(t | c) == 1/2) or absent (1 - P(t | c) == 1/2).
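To make the two estimators concrete, a small sketch with made-up counts, comparing the multinomial-style +|V| smoothing from the question with the Bernoulli +2 smoothing:
def multinomial_smoothed(n_tc, n_c_tokens, vocab_size):
    # Multinomial NB: (count of t in class c + 1) / (total tokens in class c + |V|)
    return (n_tc + 1) / (n_c_tokens + vocab_size)

def bernoulli_smoothed(n_ct, n_c_docs):
    # Bernoulli NB: (docs of class c containing t + 1) / (docs of class c + 2)
    return (n_ct + 1) / (n_c_docs + 2)

# Made-up counts: 10 documents in class c
print(bernoulli_smoothed(0, 10))             # term never seen in c  -> 1/12, never exactly 0
print(bernoulli_smoothed(10, 10))            # term in every doc of c -> 11/12, never exactly 1
print(multinomial_smoothed(0, 5000, 20000))  # rare term with |V| = 20000 -> small but nonzero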

Naive Bayes density estimator

I am currently studying for a machine learning exam, and after a lot of googling and studying slides I'm still not entirely sure how a naive Bayes density estimator works. Could someone please explain this to me? This course is still pretty basic, so please keep it simple if possible :)
Here is a question from an old exam that I got stuck on:
What would a naive bayes density estimator trained on table 1 for the "Win" class predict for a case (x1 = I, x3 = C)?
Table 1:
The answer is apparently: (3/5) * (1/5) = 0.12. But where do that 3/5 and 1/5 come from?
Thanks for the help!
Naive bayes uses two assumptions:
features are independent given a class
each feature comes from some known, a priori chosen family of densities
What does this give us? First, let's use the first assumption:
P(x1=I, x3=C | y = Win) = P(x1=I | y=Win) P(x3=C | y=Win)
Now we have to calculate each of the "small" probabilities. We use the definition of conditional probability and a naive frequentist approach here, estimating
                P(x=A, y=B)   # samples having x=A and y=B
P(x=A | y=B) = ------------- = ----------------------------
                  P(y=B)          # samples having y=B
(the first equality is the definition of P(a|b); the second is the estimator for the assumed family)
thus
P(x1=I | y=Win) = 3/5
P(x3=C | y=Win) = 1/5
and therefore P(x1=I, x3=C | y=Win) = (3/5) * (1/5) = 0.12
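Table 1 itself is not reproduced above, so the rows below are hypothetical, chosen only to be consistent with the counts the answer uses (5 "Win" samples, 3 with x1 = I and 1 with x3 = C); the snippet just mechanizes the two fractions and their product:
# Hypothetical "Win" rows, consistent with the quoted counts (not the real Table 1)
win_rows = [
    {"x1": "I", "x3": "A"},
    {"x1": "I", "x3": "B"},
    {"x1": "I", "x3": "C"},
    {"x1": "J", "x3": "A"},
    {"x1": "J", "x3": "B"},
]

def cond_prob(rows, feature, value):
    # Frequentist estimate of P(feature = value | class) from that class's rows
    return sum(r[feature] == value for r in rows) / len(rows)

p_x1 = cond_prob(win_rows, "x1", "I")   # 3/5
p_x3 = cond_prob(win_rows, "x3", "C")   # 1/5
print(p_x1 * p_x3)                      # 0.12, the naive Bayes estimate for (x1=I, x3=C | Win)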

Neural Networks: Why does the perceptron rule only work for linearly separable data?

I previously asked for an explanation of linearly separable data. Still reading Mitchell's Machine Learning book, I have some trouble understanding why exactly the perceptron rule only works for linearly separable data.
Mitchell defines a perceptron as follows:
That is, o(x) = sgn(w · x): the output is 1 if the weighted sum of the inputs exceeds the threshold, and -1 otherwise.
Now, the problem is to determine a weight vector that causes the perceptron to produce the correct output (1 or -1) for each of the given training examples. One way of achieving this is through the perceptron rule:
One way to learn an acceptable weight vector is to begin with random weights, then iteratively apply the perceptron to each training example, modifying the perceptron weights whenever it misclassifies an example. This process is repeated, iterating through the training examples as many times as needed until the perceptron classifies all training examples correctly. Weights are modified at each step according to the perceptron training rule, which revises the weight wi associated with input xi according to the rule:
wi <- wi + eta * (t - o) * xi
(where t is the target output, o is the perceptron's output, and eta is the learning rate)
So, my question is: Why does this only work with linearly separable data? Thanks.
Because the dot product of w and x is a linear combination of the x's, the decision boundary is a hyperplane: you split your data into 2 classes according to whether a_1 x_1 + … + a_n x_n > 0.
Consider a 2D example: X = (x, y) and W = (a, b), so X * W = a*x + b*y. sgn returns 1 if its argument is greater than 0, that is, for class #1 you have a*x + b*y > 0, which is equivalent to y > -(a/b) x (assuming b > 0; the inequality flips for b < 0). That boundary is a straight line, and it divides the 2D plane into 2 parts.
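A short sketch of the training rule itself (pure NumPy, with the bias folded in as an extra weight); on a linearly separable labelling such as OR it converges, while on XOR it just keeps cycling until the epoch cap:
import numpy as np

def perceptron_train(X, y, eta=0.1, epochs=100):
    # Perceptron training rule: w <- w + eta * (t - o) * x, with a constant bias input of 1
    Xb = np.hstack([np.ones((len(X), 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x, t in zip(Xb, y):
            o = 1 if np.dot(w, x) > 0 else -1   # sgn of the weighted sum
            if o != t:
                w += eta * (t - o) * x          # update only on misclassification
                mistakes += 1
        if mistakes == 0:                       # every training example classified correctly
            return w, True
    return w, False

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_or = np.array([-1, 1, 1, 1])    # OR: linearly separable
y_xor = np.array([-1, 1, 1, -1])  # XOR: not linearly separable

print(perceptron_train(X, y_or))   # (weights, True)  -> the rule converges
print(perceptron_train(X, y_xor))  # (weights, False) -> never converges, stopped by the epoch cap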
