Dataset for Bayesian Network Structure Learning - machine-learning

After a lot of searching, I couldn't find a repository with the material needed to test an algorithm that learns the structure of a Bayesian network. I need only two things:
the correct Bayesian Network
a Dataset related to the BN
My algorithm should learn the structure from the dataset, and then I could check how far it is from the correct BN. Do you have any links? I've already found some datasets without the original BN and vice versa, but I need both for my university project.
Thanks in advance
PS: if you are interested, I use Python for my project.

Try the bnlearn library. It contains structure learning, parameter learning, inference and various example datasets such as sprinkler, asia, alarm, and many more.
Various examples can be found here.
Blog about detecting causal relationships can be found here.
Example for structure learning and making inferences:
# Load library
import bnlearn as bn
# Load Asia DAG
DAG = bn.import_DAG('asia')
# plot ground truth
G = bn.plot(DAG)
# Sampling
df = bn.sampling(DAG, n=10000)
# Structure learning
model_sl = bn.structure_learning.fit(df, methodtype='hc', scoretype='bic')
# Plot based on structure learning of sampled data
bn.plot(model_sl, pos=G['pos'], interactive=True)
# Compare networks and make plot
bn.compare_networks(DAG, model_sl, pos=G['pos'])
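To put a number on "how far from the right BN" a learned structure is, one common choice is to count missing, extra, and reversed edges (a structural Hamming distance). The sketch below is a minimal pure-Python version. The function name `structural_differences` and the "learned" edge list are made up for illustration, and the edges are hard-coded rather than extracted from the bnlearn objects above.

```python
def structural_differences(true_edges, learned_edges):
    """Count missing, extra, and reversed directed edges between two DAGs."""
    true_set, learned_set = set(true_edges), set(learned_edges)
    missing = true_set - learned_set
    extra = learned_set - true_set
    # An edge is "reversed" if it is missing but its flipped version was learned.
    reversed_edges = {(a, b) for (a, b) in missing if (b, a) in extra}
    missing = missing - reversed_edges
    extra = extra - {(b, a) for (a, b) in reversed_edges}
    return {
        "missing": len(missing),
        "extra": len(extra),
        "reversed": len(reversed_edges),
        "shd": len(missing) + len(extra) + len(reversed_edges),
    }


# Ground-truth edges of the Asia network.
true_edges = [
    ("asia", "tub"), ("smoke", "lung"), ("smoke", "bronc"),
    ("tub", "either"), ("lung", "either"), ("either", "xray"),
    ("either", "dysp"), ("bronc", "dysp"),
]

# Hypothetical learned structure: one edge missing, one reversed, one extra.
learned_edges = [
    ("lung", "smoke"), ("smoke", "bronc"), ("tub", "either"),
    ("lung", "either"), ("either", "xray"), ("either", "dysp"),
    ("bronc", "dysp"), ("tub", "lung"),
]

result = structural_differences(true_edges, learned_edges)
print(result)
```

A distance of 0 means the learned graph matches the ground truth edge for edge. Counting a reversed edge once rather than as one missing plus one extra is a design choice, and some definitions differ on it.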


What is the metric used to gauge the significance of ranking using SPSS's MLP-NN? (Is there something like the Matthews coefficient in SPSS?)

I have used SPSS's Multi-layer perceptron model to rank some variables according to their importance in contribution to a specified target.
My question is... what is the metric used to gauge the performance of the model?
In non-SPSS NN models, one would use something like the Matthews correlation coefficient to gauge performance. Is there a comparable metric for the MLP-NN in SPSS?
The method used for computing the predictor importance in a predictive model is the same across many of the algorithms in IBM SPSS Modeler.
You can find the details in the chapter "Predictor Importance Algorithms" of the "IBM SPSS Modeler Algorithms Guide".
You can find a copy here.

Strategies to assign specific weights to training instances

I am working on a machine learning classification model in which the user can provide labeled instances that should help improve the model.
The latest instances provided by the user need to be given more relevance than the instances that were previously available for training.
In particular, I am developing my machine learning models in Python using the scikit-learn libraries.
So far I've only found the strategy of oversampling particular instances as a possible solution to the problem. With this strategy I would create multiple copies of the instances to which I want to give higher relevance.
Another strategy that I've found, but which does not seem to help under these conditions, is:
Strategies that focus on giving weights to each class. This strategy is widely used in multiple libraries, including scikit-learn. However, it generalizes the idea to the class level and doesn't help me focus on particular instances.
I've looked for multiple strategies that might help provide specific weights for individual instances, but most have focused on class-level instead of instance-level weights.
I read some suggestions to multiply the loss function by some factor for particular instances in TensorFlow models, but this seems to be mostly applicable to neural network models in TensorFlow.
I wonder if anyone has information about other approaches that might help with this problem.
"I've looked for multiple strategies that might help provide specific weights for individual instances but most have focused on class level instead of instance level weights."
This is not accurate; most scikit-learn classifiers provide a sample_weight argument in their fit methods, which does exactly that. For example, here is the documentation reference for Logistic Regression:
sample_weight : array-like, shape (n_samples,) optional
Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.
Similar arguments exist for most scikit-learn classifiers, e.g. decision trees, random forests etc, even for linear regression (not a classifier). Be sure to check the SVM: Weighted samples example in the docs.
The situation is roughly similar for other frameworks; see, for example, my own answer in Is there in PySpark a parameter equivalent to scikit-learn's sample_weight?
What's more, scikit-learn also provides a utility function to compute sample_weight in cases of imbalanced datasets: sklearn.utils.class_weight.compute_sample_weight
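As a minimal sketch of the sample_weight approach applied to the "recent instances matter more" case from the question: the data below is synthetic, and the weight of 5.0 for recent instances is an arbitrary illustrative choice, not a recommended value.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 older instances plus 20 recent, user-labelled ones.
X_old = rng.normal(size=(200, 2))
y_old = (X_old[:, 0] > 0).astype(int)
X_new = rng.normal(size=(20, 2))
y_new = (X_new[:, 1] > 0).astype(int)

X = np.vstack([X_old, X_new])
y = np.concatenate([y_old, y_new])

# Unit weight for older instances, a larger (arbitrary) weight for recent ones.
sample_weight = np.concatenate([
    np.ones(len(X_old)),
    np.full(len(X_new), 5.0),
])

clf = LogisticRegression()
clf.fit(X, y, sample_weight=sample_weight)
print(clf.coef_)
```

Compared with oversampling, this keeps the training set at its original size and lets the weights be fractional, for example decaying smoothly with the age of each instance.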

How to add a regression head after the fully connected layer in convolutional network using Tensorflow?

I am new to deep learning and TensorFlow and have to learn this topic for a project I am currently working on. I am using a convolutional network to detect and find the location of a single object in an image. I am using the method introduced in the Stanford CS231n class. The lecturer mentioned connecting a regression head after the fully connected layer in the network to find the location of the object. I know there is a DNNRegressor in TensorFlow. Should I use this as the regression head?
Previously, I modified TensorFlow's tutorial on using a ConvNet to recognize handwritten digits for my case. I am not sure how I can add the regression head to that program so that it can also find a bounding box for the object.
I only started working with machine learning and deep learning this week, so apologies if this is a really silly question, but I really need to find a solution to my problem. Thank you very much.
First of all, in order to train a neural network for an object localization task, you need a data set with localized objects. This answers your question of whether you can work with the MNIST data set: MNIST contains just a class label for each image, so you need to get another data set. Justin also talks about popular data sets at around 37:34.
Object localization works by learning to output 4 values per image instead of a class distribution. This four-valued vector is compared to the ground truth four-valued vector, and the loss function is usually the L1 or L2 norm of their difference. So in code, the regression head is an ordinary fully connected layer with four outputs, and the loss can be computed in TensorFlow with a simple tf.reduce_mean call over the squared (or absolute) differences.
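To make the loss concrete, here is a small NumPy sketch (not TensorFlow) of the L2 and L1 losses between a predicted and a ground-truth bounding box. The box values and the (x1, y1, x2, y2) layout are assumptions chosen for illustration.

```python
import numpy as np

# One predicted and one ground-truth box per image, assumed to be (x1, y1, x2, y2).
predicted_boxes = np.array([[10.0, 20.0, 50.0, 60.0]])
true_boxes = np.array([[12.0, 18.0, 50.0, 64.0]])

diff = predicted_boxes - true_boxes

# L2-style loss: mean of squared differences (what tf.reduce_mean(tf.square(...)) computes).
l2_loss = np.mean(np.square(diff))

# L1-style loss: mean of absolute differences.
l1_loss = np.mean(np.abs(diff))

print(l2_loss, l1_loss)
```

In a real network, `predicted_boxes` would be the output of the final four-unit layer, and the loss would be minimized with respect to the network weights.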
A small yet complete example that performs object localization can be found here. I also recommend taking a look at this question.
I was looking into this problem as well and found the following passage in the documentation.
Dense (fully connected) layers, which perform classification on the features extracted by the convolutional layers and downsampled by the pooling layers. In a dense layer, every node in the layer is connected to every node in the preceding layer.
Based on this quote, it seems you can only do classification, not regression.
EDIT: After some research, I found out a way to use a fully-connected layer in tensorflow.
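# Requires TensorFlow 1.x (tf.contrib was removed in TensorFlow 2).
import tensorflow as tf
import tensorflow.contrib.slim as slim
# Create your network `net` here; `y` holds the target values and
# `reuse` is the variable-reuse flag of the enclosing scope.
# As the last step, add a linear output layer (no activation) for regression:
y_prime = slim.fully_connected(net, 1, activation_fn=None, reuse=reuse)
loss = tf.reduce_mean(tf.square(y_prime - y))  # mean squared error (L2 loss)
lr = tf.placeholder(tf.float32)  # learning rate, fed at training time
opt = tf.train.AdamOptimizer(learning_rate=lr).minimize(loss)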
You can add more fully connected layers before the last step which can have more nodes.

Machine Learning Text Classification technique

I am new to machine learning. I am working on a project where machine learning needs to be applied.
Problem Statement:
I have a large number (say 3000) of keywords. These need to be classified into seven fixed categories. Each category has training data (sample keywords). I need to come up with an algorithm that, when given a new keyword, predicts which category it belongs to.
I am not aware of which text classification technique needs to be applied for this. Are there any tools that can be used?
Please help.
Thanks in advance.
This comes under linear classification. You can use a naive Bayes classifier for this. Most ML frameworks have an implementation of naive Bayes, e.g. Mahout.
Yes, I would also suggest using Naive Bayes, which is more or less the baseline classification algorithm here. On the other hand, there are obviously many other algorithms. Random forests and Support Vector Machines come to mind. See http://machinelearningmastery.com/use-random-forest-testing-179-classifiers-121-datasets/ If you use a standard toolkit, such as Weka, RapidMiner, etc., these algorithms should be available. There is also OpenNLP for Java, which comes with a maximum entropy classifier.
You could compute the Word2Vec cosine distance between a description of each of your categories and the keywords in the dataset, and then simply match each keyword to the category with the closest distance.
Alternatively, you could build a training dataset from keywords already matched to a category and use any ML classifier, for example one based on artificial neural networks, with the vector of a keyword's cosine distances to each category as the model input. But it may require a large amount of training data to reach good accuracy. For example, the MNIST dataset contains 70,000 samples, which allowed me to reach 99.62% cross-validation accuracy with a simple CNN, whereas for another dataset with only 2,000 samples I reached only about 90% accuracy.
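The nearest-category matching by cosine similarity can be sketched in plain Python as follows. The 3-dimensional vectors are toy values made up for illustration; in practice they would come from a trained embedding model such as Word2Vec, and the category names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy embedding vectors (made up); real ones would come from an embedding model.
category_vectors = {
    "travel": [0.9, 0.1, 0.0],
    "tech": [0.1, 0.9, 0.1],
    "food": [0.0, 0.1, 0.9],
}
keyword_vectors = {
    "flights": [0.8, 0.2, 0.1],
    "compiler": [0.2, 0.7, 0.0],
    "pizza": [0.1, 0.0, 0.8],
}


def nearest_category(vector):
    """Return the category whose vector is most similar to the given vector."""
    return max(
        category_vectors,
        key=lambda name: cosine_similarity(vector, category_vectors[name]),
    )


assignments = {kw: nearest_category(vec) for kw, vec in keyword_vectors.items()}
print(assignments)
```

Maximizing cosine similarity is equivalent to minimizing cosine distance, so this picks the "closest" category in the sense described above.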
There are many classification algorithms. Your example looks like a text classification problem; some good classifiers to try out would be SVM and naive Bayes. For SVM, the liblinear and libshorttext classifiers are good options (and have been used in many industrial applications):
liblinear: https://www.csie.ntu.edu.tw/~cjlin/liblinear/
libshorttext: https://www.csie.ntu.edu.tw/~cjlin/libshorttext/
They are also included with ML tools such as scikit-learn and WEKA.
Even with these classifiers available, it still takes some work to build and validate a practically useful classifier. One of the challenges is to mix
discrete (boolean and enumerable)
and continuous ('numbers')
predictive variables seamlessly. Some algorithmic preprocessing is generally necessary.
Neural networks do offer the possibility of using both types of variables. However, they require skilled data scientists to yield good results. A straight-forward option is to use an online classifier web service like Insight Classifiers to build and validate a classifier in one go. N-fold cross validation is being used there.
You can represent the presence or absence of each word in a separate column. The outcome variable is the desired category.
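A minimal scikit-learn sketch of this presence/absence representation, combined with the naive Bayes classifier suggested earlier in the thread. The keywords and the two categories are made up for illustration; the real task would use the seven categories and their sample keywords.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training keywords and categories, for illustration only.
train_keywords = [
    "cheap flights", "hotel booking", "flight tickets",
    "python tutorial", "java compiler", "learn programming",
]
train_labels = ["travel", "travel", "travel", "tech", "tech", "tech"]

# binary=True gives one 0/1 column per word (presence or absence).
model = make_pipeline(CountVectorizer(binary=True), MultinomialNB())
model.fit(train_keywords, train_labels)

predictions = model.predict(["cheap hotel", "python compiler"])
print(list(predictions))
```

A keyword made up entirely of words never seen in training carries no evidence, so the prediction then falls back on the class priors; a larger sample of keywords per category reduces how often that happens.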

How to test a Restricted Boltzmann Machine implementation?

I developed a simple binary Restricted Boltzmann Machine implementation and now I would like to test it. (Ultimately I'm gonna use it for a DBN, but I would like to test it independently.)
I saw that several people and papers talk about testing it on the MNIST dataset, but I didn't find details on how to do that.
Do I have to add a new classification layer connected to the hidden units and then use backpropagation to train it? Isn't there another way?
Some people also plot the weights (again on MNIST), but I don't understand how you can plot a weight or what that plot represents...
Thanks
The "Tracking Progress" section in the RBM tutorial at deeplearning.net (http://deeplearning.net/tutorial/rbm.html) gives very good guidance:
Check that samples from the RBM look like the training data
(For image data) Check that latent variable values maxima look sort of like smooth Gabor filter banks
Track the pseudolikelihood
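The pseudolikelihood in the last item can be estimated stochastically by flipping one randomly chosen visible unit per sample and comparing free energies. Below is a NumPy sketch for a binary RBM. The parameter names `W`, `b_vis`, and `b_hid` and the function names are assumptions for illustration and would need to be mapped onto your own implementation.

```python
import numpy as np


def free_energy(v, W, b_vis, b_hid):
    """Free energy of binary visible vectors v (shape: batch x n_visible)."""
    # log(1 + exp(x)) computed stably via logaddexp(0, x).
    hidden_term = np.logaddexp(0.0, v @ W + b_hid).sum(axis=1)
    return -(v @ b_vis) - hidden_term


def pseudo_log_likelihood(v, W, b_vis, b_hid, rng):
    """Stochastic estimate of the log pseudo-likelihood, averaged over the batch."""
    n_samples, n_visible = v.shape
    rows = np.arange(n_samples)
    flip_idx = rng.integers(n_visible, size=n_samples)

    # Flip one randomly chosen visible unit in each sample.
    v_flipped = v.copy()
    v_flipped[rows, flip_idx] = 1.0 - v_flipped[rows, flip_idx]

    fe = free_energy(v, W, b_vis, b_hid)
    fe_flipped = free_energy(v_flipped, W, b_vis, b_hid)

    # n_visible * log(sigmoid(fe_flipped - fe)), with log(sigmoid(x)) = -logaddexp(0, -x).
    return float(np.mean(n_visible * -np.logaddexp(0.0, -(fe_flipped - fe))))


rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
v = rng.integers(0, 2, size=(10, n_visible)).astype(float)

# An untrained RBM with all-zero parameters, used here only as a sanity check.
W = np.zeros((n_visible, n_hidden))
b_vis = np.zeros(n_visible)
b_hid = np.zeros(n_hidden)

pll = pseudo_log_likelihood(v, W, b_vis, b_hid, rng)
print(pll)
```

With all-zero parameters every configuration is equally likely, so the estimate equals n_visible * log(0.5); during training the value should rise toward zero from there. The estimate is noisy, so it is more useful as a trend across epochs than as an exact number.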
