Naive Bayes density estimator - machine-learning

I am currently studying for a machine learning exam and, after a lot of googling and studying slides, I'm still not entirely sure how a Naive Bayes density estimator works. Could someone please explain it to me? This course is still pretty basic, so please keep it simple if possible :)
Here is a question from an old exam that I got stuck on:
What would a Naive Bayes density estimator trained on Table 1 for the "Win" class predict for a case (x1 = I, x3 = C)?
Table 1:
The answer is apparently (3/5) * (1/5) = 0.12, but where do the 3/5 and 1/5 come from?
Thanks for the help!

Naive Bayes uses two assumptions:
- the features are independent given the class,
- each feature comes from some known a priori family of densities.
What does this give us? First, let's use the independence assumption:
P(x1=I, x3=C | y=Win) = P(x1=I | y=Win) * P(x3=C | y=Win)
Now we have to calculate each of the "small" probabilities. For each one we use the definition of conditional probability together with a naive frequentist estimate:
               P(x=A, y=B)   # samples having x=A and y=B
P(x=A | y=B) = ----------- = ----------------------------
                 P(y=B)          # samples having y=B
The first equality is the definition of conditional probability P(a|b); the second is the frequentist estimator for the assumed family.
Thus, counting rows of Table 1 within the "Win" class (5 samples in total, 3 of which have x1 = I and 1 of which has x3 = C):
P(x1=I | y=Win) = 3/5
P(x3=C | y=Win) = 1/5
and multiplying them gives the stated answer, (3/5) * (1/5) = 0.12.
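Since Table 1 itself isn't reproduced above, here is a minimal Python sketch of the estimator with a hypothetical stand-in table, chosen only so that the counts match the 3/5 and 1/5 in the answer:
# Hypothetical stand-in for the "Win" rows of Table 1: 3 of 5 rows
# have x1 == "I" and 1 of 5 has x3 == "C" (other values invented).
win_rows = [
    {"x1": "I", "x3": "A"},
    {"x1": "I", "x3": "B"},
    {"x1": "I", "x3": "C"},
    {"x1": "J", "x3": "A"},
    {"x1": "J", "x3": "B"},
]

def cond_prob(rows, feature, value):
    # Frequentist estimate of P(feature=value | class): count and divide.
    return sum(r[feature] == value for r in rows) / len(rows)

p = cond_prob(win_rows, "x1", "I") * cond_prob(win_rows, "x3", "C")
print(p)  # 0.6 * 0.2 = 0.12 (up to float rounding)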

Related

FBProphet: Understanding Regressor Impact on Multivariate Forecast

Please see this example, as the project I am working on is quite similar, but with ~8 regressors instead of 2, and I need to understand how each regressor impacts the forecast model: https://towardsdatascience.com/forecast-model-tuning-with-additional-regressors-in-prophet-ffcbf1777dda
Given a scenario like the one above with 2 additional regressors: how can we understand the impact of each regressor on the 'yhat' forecast (e.g. 'temp' has a 30% impact on the yhat prediction and 'weathersit' has a 70% impact, or something similar)? I have tried using "from fbprophet.utilities import regressor_coefficients" to see the regressor coefficients, but I'm not sure if that's the right approach.
Additionally, how should the regressor columns in the 'forecast' dataframe returned by '.predict()' be interpreted?
Thanks for your help.
After running regressor_coefficients(model), you will get the center and coef of each additive regressor. For example:
regressor_coefficients(my_model)
|   | regressor   | regressor_mode | center    | coef_lower | coef       | coef_upper |
|---|-------------|----------------|-----------|------------|------------|------------|
| 0 | temperature | additive       | 6.346457  | -51.124462 | -51.124462 | -51.124462 |
| 1 | humidity    | additive       | 66.665910 | 7.736604   | 7.736604   | 7.736604   |
So the results from your prediction should be (for additive seasonal trends):
yhat = trend + yearly + extra_regressors_additive
where
extra_regressors_additive = (temperature_data - temperature_center) * temperature_coef
                          + (humidity_data - humidity_center) * humidity_coef
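As a sketch (my own reconstruction of the formula above, not an official Prophet recipe), you can compute each regressor's contribution and its relative share yourself; my_model and future below are assumed to be an already-fitted model and the dataframe passed to .predict():
# Assumes my_model was fitted with add_regressor("temperature") and
# add_regressor("humidity"), and future comes from
# my_model.make_future_dataframe(...) with the regressor columns filled in.
from fbprophet.utilities import regressor_coefficients  # prophet.utilities in newer releases

coefs = regressor_coefficients(my_model).set_index("regressor")

contributions = {}
for name in ("temperature", "humidity"):
    row = coefs.loc[name]
    contributions[name] = (future[name] - row["center"]) * row["coef"]

# Relative impact: each regressor's share of the total absolute contribution.
total = sum(c.abs() for c in contributions.values())
for name, c in contributions.items():
    print(name, (c.abs() / total).mean())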
You can find more details about the regressors in the "forecast" dataframe: look for the columns named after your regressors. If you feel that fbprophet is underestimating the impact of a regressor, you can declare your regressor input values as binary instead. You can also cluster your regressor input values if binary values are not appropriate. If you still feel that your regressor is underestimated, have a look at its historical data: does the y value increase the same day your regressor's behaviour changes? If not, then you need to fix that.
You can also refer to the section "Coefficients of additional regressors" of this website: https://facebook.github.io/prophet/docs/seasonality,_holiday_effects,_and_regressors.html#additional-regressors

Can any machine learning algorithm find this pattern: x1 < x2 without generating a new feature (e.g. x1-x2) first?

If I had 2 features x1 and x2 where I know that the pattern is:
if x1 < x2 then
class1
else
class2
Can any machine learning algorithm find such a pattern? What algorithm would that be?
I know that I could create a third feature x3 = x1 - x2. Then feature x3 can easily be used by some machine learning algorithms. For example, a decision tree can solve the problem 100% using x3 and just 3 nodes (1 decision node and 2 leaf nodes).
But, is it possible to solve this without creating new features? This seems like a problem that should be easily solved 100% if a machine learning algorithm could only find such a pattern.
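For reference, a quick sketch (my own illustration) of the engineered-feature route described above:
# With x3 = x1 - x2, a depth-1 decision tree separates the classes
# perfectly: one decision node, two leaves.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(1000, 2)
y = (X[:, 0] < X[:, 1]).astype(int)
x3 = (X[:, 0] - X[:, 1]).reshape(-1, 1)  # the engineered feature

tree = DecisionTreeClassifier(max_depth=1).fit(x3, y)
print(tree.score(x3, y))  # 1.0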
I tried MLP and SVM with different kernels, including the rbf kernel, and the results are not great. As an example of what I tried, here is the scikit-learn code where the SVM could only get a score of 0.992:
import numpy as np
from sklearn.svm import SVC
# Generate 1000 samples with 2 features with random values
X_train = np.random.rand(1000,2)
# Label each sample. If feature "x1" is less than feature "x2" then label as 1, otherwise label is 0.
y_train = X_train[:,0] < X_train[:,1]
y_train = y_train.astype(int) # convert boolean to 0 and 1
svc = SVC(kernel = "rbf", C = 0.9) # tried all kernels and C values from 0.1 to 1.0
svc.fit(X_train, y_train)
print("SVC score: %f" % svc.score(X_train, y_train))
Output running the code:
SVC score: 0.992000
This is an oversimplification of my problem. The real problem may have hundreds of features and different patterns, not just x1 < x2. However, to start with it would help a lot to know how to solve for this simple pattern.
To understand this, you must look at the parameters sklearn provides, and C in particular. It also helps to understand how the value of C influences the classifier's training procedure.
If you look at the SVC equation in the User Guide, it has two main parts: the first tries to find a small set of weights that solves the problem, and the second tries to minimize the classification errors.
C is the penalty multiplier associated with misclassifications. If you decrease C, then you reduce the penalty (lower training accuracy but better generalization to test) and vice versa.
Try setting C to 1e+6. You will see that you almost always get 100% accuracy: the classifier has learnt the pattern x1 < x2. With the original settings it decides that 99.2% accuracy is good enough because of another parameter, tol, which controls how much error is negligible to you; by default it is set to 1e-3. If you reduce the tolerance, you can expect similar results.
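A minimal check of that suggestion (same data-generating code as in the question, larger C):
import numpy as np
from sklearn.svm import SVC

X_train = np.random.rand(1000, 2)
y_train = (X_train[:, 0] < X_train[:, 1]).astype(int)

svc = SVC(kernel="rbf", C=1e6)  # much weaker regularization than C=0.9
svc.fit(X_train, y_train)
print("SVC score: %f" % svc.score(X_train, y_train))  # typically 1.000000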
In general, I would suggest using something like GridSearchCV (link) to find the optimal values of hyperparameters like C, as it internally splits the dataset into train and validation folds. This helps ensure that you are not just tweaking the hyperparameters to get good training accuracy, but that the classifier will also do well in practice.
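A sketch of that suggestion (the grid values here are illustrative, not recommended defaults):
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X = np.random.rand(1000, 2)
y = (X[:, 0] < X[:, 1]).astype(int)

search = GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 10, 100, 1e4, 1e6]}, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)  # cross-validated accuracy, not training accuracy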

Confused about sklearn’s implementation of OSVM

I have recently started experimenting with OneClassSVM (using sklearn) for unsupervised learning and I followed this example.
I apologize for the silly questions, but I'm a bit confused about two things:
Should I train my SVM on both the regular example cases and the outliers, or on regular examples only?
Which of the labels predicted by the OSVM represents outliers: 1 or -1?
Once again I apologize for these questions, but for some reason I cannot find this documented anywhere.
As the example you reference is about novelty detection, the docs say:
novelty detection:
The training data is not polluted by outliers, and we are interested in detecting anomalies in new observations.
Meaning: you should train on regular examples only.
The approach is based on:
Schölkopf, Bernhard, et al. "Estimating the support of a high-dimensional distribution." Neural computation 13.7 (2001): 1443-1471.
Extract:
Suppose you are given some data set drawn from an underlying probability distribution P and you want to estimate a “simple” subset S of input space such that the probability that a test point drawn from P lies outside of S equals some a priori specified value between 0 and 1.
We propose a method to approach this problem by trying to estimate a function f that is positive on S and negative on the complement.
The above docs also say:
Inliers are labeled 1, while outliers are labeled -1.
This can also be seen in your example code, extracted:
# Generate some regular novel observations
X = 0.3 * np.random.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
...
# all regular = inliers (defined above)
y_pred_test = clf.predict(X_test)
...
# -1 = outlier <-> error as assumed to be inlier
n_error_test = y_pred_test[y_pred_test == -1].size
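Putting both points together, a minimal end-to-end sketch (the data and hyperparameters are illustrative):
# Train only on "regular" data, then label new points: 1 = inlier, -1 = outlier.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(100, 2) + 2       # regular observations only
clf = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1).fit(X_train)

X_new = np.array([[2.0, 2.0], [10.0, 10.0]])
print(clf.predict(X_new))                   # e.g. [ 1 -1 ]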

Neural Network for 10 inputs and 10 outputs

I've got a physical problem: to construct a product, 10 output parameters (width, length, material, etc.) are determined based on 10 input parameters (performance, temperature, capacity, etc.). The output parameters obviously depend on the input parameters, but I don't know how. For example, output parameter O1 could depend on input parameters I1, I2 and I3.
I've got the data of, let's say, 30k products with their input/output parameters. The database looks like this:
| Product | I1  | I2  | I3  | ... | O1  | O2  | O3  |
|---------|-----|-----|-----|-----|-----|-----|-----|
| Prod A  | 1.2 | 2.3 | 4.2 | ... | 5.3 | 6.2 | 1.2 |
| Prod B  | 2.3 | 4.1 | 1.2 | ... | 8.2 | 5.2 | 5.0 |
| Prod C  | 6.3 | 3.7 | 9.1 | ... | 3.1 | 4.1 | 7.7 |
| ...     |     |     |     |     |     |     |     |
So what I need to do is find output parameters O1-O10 based on input parameters I1-I10.
First question: if I get it right, this is a regression problem; based on some input values I want to find some output values (somewhere in the data there is a function/formula that determines the correct values). Is this correct?
My idea is to use/train a neural network (using keras with tensorflow as backend).
How would such a neural network look? What is the best practice?
This is what I have so far:
An input layer with 10 inputs, two fully connected hidden layers with 100 neurons each, and a layer with 10 outputs. In keras this looks like this:
from keras.models import Sequential
from keras.layers import Dense

def baseline_model(callback):
    # input_train/output_train and input_val/output_val are the
    # train/validation splits of the product data described above.
    model = Sequential()
    model.add(Dense(100, input_dim=10, activation="relu"))
    model.add(Dense(100, activation="relu"))
    model.add(Dense(10))  # linear output layer for the 10 targets
    model.compile(loss='mean_squared_error', optimizer='adam', metrics=["accuracy"])
    model.fit(input_train, output_train, batch_size=5, epochs=2000, verbose=2,
              callbacks=[callback], shuffle=True,
              validation_data=(input_val, output_val))
    scores = model.evaluate(input_val, output_val, verbose=1)
    print("Scores:", scores)
Of course the model does not work as expected, that's why I'm asking for help... the training fails:
Epoch 1999/2000
7s - loss: 47634520366153.6016 - acc: 0.0000e+00 - val_loss: 9585392308285.4395 - val_acc: 0.0000e+00
Any suggestions what I should change? I thought about using "sigmoid" as activation and normalizing the data to [0, 1].
Thanks for any advice
If I get it right, this is a regression problem: based on some input values I want to find some output values.
Yes, I think you are right.
How would such a neural network look? What is the best practice?
That's a very broad question. I think you should split your data into a train and a validation set, start from the simplest network (maybe no hidden layer or only one hidden layer) and then make it more and more complicated (add more layers and hidden units) while your validation error decreases. When your net becomes quite deep, it's a good idea to add Batch Normalization layers between your dense layers. You can also look at residual connections, but I'm not sure you really need them here.
Any suggestions what I should change? I thought about using "sigmoid" as activation and normalizing the data to [0, 1].
The activation function type depends on your output type: for categorical outputs sigmoid/softmax is probably a good choice, while linear should be ok for floating-point numbers.
Also, if one of your inputs is categorical (material type, for example), it's probably better to split it into several binary inputs.
It's almost always a good idea to normalize your inputs and outputs; non-normalized data can really hurt the training process.
Plot the error and check how it changes over time. loss: 47634520366153.6016 is really big, but by itself it tells us little about the optimization: if it decreases, maybe you can increase the learning rate; if it grows, try decreasing the learning rate or another optimization algorithm.
Check your gradients; if they are too big, try gradient clipping.
Also try starting from a simple model, maybe from linear regression.
Strictly speaking, neural network debugging is a big and complicated field, and I am not sure it's appropriate for a stackoverflow discussion.
PS Sorry for my English
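A minimal sketch of the advice above (scale inputs and outputs, hold out a validation set, start simple); the random arrays are just stand-ins for the real product data:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense

X = np.random.rand(30000, 10)   # stand-in for the 10 input parameters
Y = np.random.rand(30000, 10)   # stand-in for the 10 output parameters
x_scaler, y_scaler = StandardScaler().fit(X), StandardScaler().fit(Y)
X_train, X_val, Y_train, Y_val = train_test_split(
    x_scaler.transform(X), y_scaler.transform(Y), test_size=0.2)

model = Sequential([
    Dense(32, input_dim=10, activation="relu"),  # one small hidden layer to start
    Dense(10),                                   # linear output for regression
])
model.compile(loss="mean_squared_error", optimizer="adam")
model.fit(X_train, Y_train, epochs=50, batch_size=32,
          validation_data=(X_val, Y_val), verbose=2)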
As @Dark_davier has already said, this is a field where you need some experience. It is not really possible to answer without actually running some tests. But as a guideline, be careful with the size of your network: it has roughly 10^4 parameters (a bit more), and you said you have "only" 30k observations, so there is a high probability of overfitting. You would need more sophisticated techniques to avoid it (first cross-validation to check, then possibly regularisation), but this requires some experience in NN optimisation...

How to Implement "XOR" in Bayesian Networks?

In graphical models and Bayesian networks, how do you implement the XOR problem?
I read about Bayesian networks vs Bayes classifiers here:
A Naive Bayes classifier is a simple model that describes particular class of Bayesian network - where all of the features are class-conditionally independent. Because of this, there are certain problems that Naive Bayes cannot solve (example below). However, its simplicity also makes it easier to apply, and it requires less data to get a good result in many cases.
Example (XOR): You have a learning problem with binary features x_1, x_2 and a target variable y = x_1 XOR x_2.
In a Naive Bayes classifier, x_1 and x_2 must be treated independently - so you would compute things like "The probability that y = 1 given that x_1 = 1" - hopefully you can see that this isn't helpful, because x_1 = 1 doesn't make y = 1 any more or less likely. Since a Bayesian network does not assume independence, it would be able to solve such a problem.
I googled, but could not figure out how. Can someone give me a hint or good references? Thanks!
This is actually fairly simple.
The DAG of the model would look like
x1 -> XOR <- x2
The probability distribution for the XOR node can then be written
x1 x2 | P(XOR=1|x1,x2)
0 0 | 0
0 1 | 1
1 0 | 1
1 1 | 0
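One way to encode exactly this CPT in code, as a sketch using the pgmpy library (a choice of mine; any discrete Bayesian-network library would do):
from pgmpy.models import BayesianNetwork  # BayesianModel in older pgmpy versions
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

model = BayesianNetwork([("x1", "y"), ("x2", "y")])
cpd_x1 = TabularCPD("x1", 2, [[0.5], [0.5]])  # uniform priors, an assumption
cpd_x2 = TabularCPD("x2", 2, [[0.5], [0.5]])
# Columns enumerate (x1, x2) = (0,0), (0,1), (1,0), (1,1); rows are y=0, y=1.
cpd_y = TabularCPD("y", 2,
                   [[1, 0, 0, 1],   # P(y=0 | x1, x2)
                    [0, 1, 1, 0]],  # P(y=1 | x1, x2): the XOR table above
                   evidence=["x1", "x2"], evidence_card=[2, 2])
model.add_cpds(cpd_x1, cpd_x2, cpd_y)

infer = VariableElimination(model)
print(infer.query(["y"], evidence={"x1": 1, "x2": 0}))  # P(y=1) = 1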
