VC Dimension and PAC Learning - machine-learning

Can someone explain the concept of VC dimension and PAC Learning with an example? I have seen some content online but couldn't actually relate to a good example.

They are different concepts that relate to each other. I will try to explain both terms and show their relation concisely:
PAC learning is a theoretical framework developed by Leslie Valiant in 1984 that seeks to bring ideas of Complexity Theory to learning problems. While in Complexity Theory you want to classify decision problems by bounds on the amount of computation they take (number of steps), in the PAC model you want to classify concept classes (tasks) by computational bounds and bounds on the number of samples required, given some tolerance to error, epsilon, and a confidence level, 1-delta. PAC learning offers guarantees to the absolute error, how different your hypothesis (the learned function) is from the concept (the target function, your task), given that you can only measure your empirical error, the one that you get from your training sample.
VC dimension (named after Vapnik and Chervonenkis) is a number that represents the complexity of a specific hypothesis class (learning algorithm). For instance, the perceptron hypothesis class is simpler than the multilayer perceptron; therefore, it has a smaller VC dimension.
In several PAC theorems, you will have the measure of the size of the hypothesis class, a measure of its capacity or complexity, as one of the terms. The VC dimension can be used as a number that represents this size.
I recommend taking a look at Mohri's "Foundations of ML" for an excellent example of PAC learning (the first chapter has a good one).
P.S. I guess this question is not very much in the scope of StackOverflow. For questions of this type, you may have a better response in CrossValidated.

Related

How does pre-training improve classification in neural networks?

Many of the papers I have read so far have this mentioned "pre-training network could improve computational efficiency in terms of back-propagating errors", and could be achieved using RBMs or Autoencoders.
If I have understood correctly, AutoEncoders work by learning the
identity function, and if it has hidden units less than the size of
input data, then it also does compression, BUT what does this even have
anything to do with improving computational efficiency in propagating
error signal backwards? Is it because the weights of the pre
trained hidden units does not diverge much from its initial values?
Assuming data scientists who are reading this would by theirselves
know already that AutoEncoders take inputs as target values since
they are learning identity function, which is regarded as
unsupervised learning, but can such method be applied to
Convolutional Neural Networks for which the first hidden layer is
feature map? Each feature map is created by convolving a learned
kernel with a receptive field in the image. This learned kernel, how
could this be obtained by pre-training (unsupervised fashion)?
One thing to note is that autoencoders try to learn the non-trivial identify function, not the identify function itself. Otherwise they wouldn't have been useful at all. Well the pre-training helps moving the weight vectors towards a good starting point on the error surface. Then the backpropagation algorithm, which is basically doing gradient descent, is used improve upon those weights. Note that gradient descent gets stuck in the closes local minima.
[Ignore the term Global Minima in the image posted and think of it as another, better, local minima]
Intuitively speaking, suppose you are looking for an optimal path to get from origin A to destination B. Having a map with no routes shown on it (the errors you obtain at the last layer of the neural network model) kind of tells you where to to go. But you may put yourself in a route which has a lot of obstacles, up hills and down hills. Then suppose someone tells you about a route a a direction he has gone through before (the pre-training) and hands you a new map (the pre=training phase's starting point).
This could be an intuitive reason on why starting with random weights and immediately start to optimize the model with backpropagation may not necessarily help you achieve the performance you obtain with a pre-trained model. However, note that many models achieving state-of-the-art results do not use pre-training necessarily and they may use the backpropagation in combination with other optimization methods (e.g. adagrad, RMSProp, Momentum and ...) to hopefully avoid getting stuck in a bad local minima.
Here's the source for the second image.
I don't know a lot about autoencoder theory, but I've done a bit of work with RBMs. What RBMs do is they predict what the probability is of seeing the specific type of data in order to get the weights initialized to the right ball park- it is considered an (unsupervised) probabilistic model, so you don't correct using the known labels. Basically, the idea here is that having a learning rate that is too big will never lead to convergence but having one that is too small will take forever to train. Thus, by "pretraining" in this way you find out the ball park of the weights and then can set the learning rate to be small in order to get them down to the optimal values.
As for the second question, no, you don't generally prelearn kernels, at least not in an unsupervised fashion. I suspect that what is meant by pretraining here is a bit different than in your first question- this is to say, that what is happening is that they are taking a pretrained model (say from model zoo) and fine tuning it with a new set of data.
Which model you use generally depends on the type of data you have and the task at hand. Convnets I've found to train faster and efficiently, but not all data has meaning when convolved, in which case dbns may be the way to go. Unless say, you have a small amount of data then I'd use something other than neural networks entirely.
Anyways, I hope this helps clear some of your questions.

Topic model as a dimension reduction method for text mining -- what to do next?

My understanding of the work flow is to run LDA -> Extract keywards (e.g. the top few words for each topics), and hence reduce dimension -> some subsequent analysis.
My question is, if my overall purpose is to give topic to articles in an unsupervised way, or clustering similar documents together, then a running of LDA will take you directly to the goal. Why do you reduce the dimension and then pass it to subsequent analysis? If you do, what sort of subsequent analysis can you do after LDA?
Also, a bit unrelated question -- is it better to ask this question here or at cross validated?
I think cross validated is a better place for these kinds of questions. Anyhow, there are simple explanations about why we need dimension reduction:
Without dimension reduction, vector operations are not computable. Imagine a dot product between two vector with dimension in size of your dictionary! really?
Each number carry more dense amount of information after reducing the dimension. Which it usually leads to less noise. Intuitively, you only kept useful information.
You should rethink your approach, since you are mixing probabilistic methods (LDA) with Linear Algebra (dimensional reduction). When you feel more comfortable with Linear Algebra, consider Non Negative Matrix Factorisation.
Also note that your topics already constitute the reduced dimensions, there is no need to jump back to the extracted top words in the topics.

Document Classification using Naive Bayes classifier

I am making a document classifier in mahout using the simple naive bayes algorithm. Currently, 98% of the data(documents) I have is of Class A and only 2% is of class B. My question is, since there is such a wide gap in the percentage of Class A docs vs Class B docs, would the classifier be able to train accurately still?
What I'm thinking of doing is ignoring a whole bunch of Class A documents and "manipulating" the dataset I have so that there isn't such a wide gap in the composition of the documents. Thus, the dataset I'll end up having will consist 30% of Class B and 70% of Class A. But, are there any repercussions of doing that I am not aware of?
A lot of this gets into how good "accuracy" is as a measure of performance, and that depends on your problem. If misclassifying "A" as "B" is just as bad/ok as misclassifying "B" as "A", then there is little reason to do anything other than just mark everything as "A", since you know it will reliably get you a 98% accuracy (so long as that unbalanced distribution is representative of the true distribution).
Without knowing your problem (and if accuracy is the measure you should use), the best answer I could give is "it depends on the data set". It is possible that you could get past 99% accuracy with standard naive bays, though it may be unlikely. For Naive Bayes in particular, one thing you could do is to disable the use of priors (the prior is essentially the proportion of each class). This has the effect of pretending that every class is equally likely to occur, though the model parameters will have been learned from uneven amounts of data.
Your proposed solution is a common practice, it sometimes works well. Another practice is to create fake data for the smaller class (how would depend on your data, for text documents I'm not aware of any particularly good way). Another practice is to increase the weights of the data points in the under-represented classes.
You can search for "imbalanced classification" and find a lot more information about these types of problems (they are one of the harder ones).
If accuracy is not actually a good measure for your problem, you can search for more information about "cost sensitive classification" which should be helpful.
You should not necessarily sample dataset A to reduce its instances. Several methods are available for efficient learning from imbalanced datasets, such as Majority Undersampling (exactly what you did), Minority Oversampling, SMOTE, and etc. Here is an empirical comparison of these methods: http://machinelearning.org/proceedings/icml2007/papers/62.pdf
Alternatively, you may define a custom cost matrix for the classifier. In other words, assuming B=Positive class, you may define cost(False Positive) < cost(False Negative). In this case, the classifier's output will bias towards the positive class. Here is a very helpful tutorial: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.164.4418&rep=rep1&type=pdf

Does prior distribution matter in classification?

Currently I get a classification problem with two classes. what I want to do is that given a bunch of candidates, find out who will more likely to be the class 1. The problem is that class 1 is very rare (around 1%), which I guess makes my prediction quite inaccurate.
For training the dataset, can I sample half class 1 and half class 0? This will change the prior distribution, but I don't know whether the prior distribution affects the classification results?
Indeed, a very imbalanced dataset can cause problems in classification. Because by defaulting to the majority class 0, you can get your error rate already very low.
There are some workarounds that may or may not work for your particular problem, such as giving equal weight to the two classes (thus weighting instances from the rare class stronger), oversampling the rare class (i.e. learning each instance multiple times), producing slight variations of the rare objects to restore balance etc. SMOTE and so on.
You really should to grab some classification or machine learning book, and check the index for "imbalanced classification" or "unbalanced classification". If the book is any good, it will discuss this problem. (I just assume you did not know the term that they use.)
If you're forced to pick exactly one from a group, then the prior distribution over classes won't matter because it will be constant for all members of that group. If you must look at each in turn and make an independent decision as to whether they're class one or class two, the prior will potentially change the decision, depending on which method you choose to do the classification. I would suggest you get hold of as many examples of the rare class as possible, but beware that feeding a 50-50 split to a classifier as training blindly may make it implicitly fit a model that assumes this is the distribution at test time.
Sampling your two classes evenly doesn't change assumed priors unless your classification algorithm computes (and uses) priors based on the training data. You stated that your problem is "given a bunch of candidates, find out who will more likely to be the class 1". I read this to mean that you want to determine which observation is most likely to belong to class 1. To do this, you want to pick the observation $x_i$ that maximizes $p(c_1|x_i)$. Using Bayes' theorem, this becomes:
$$
p(c_1|x_i)=\frac{p(x_i|c_1)p(c_1)}{p(x_i)}
$$
You can ignore $p(c_1)$ in the equation above since it is a constant. However, computing the denominator will still involve using prior probabilities. Since your problem is really more of a target detection problem than a classification problem, an alternate approach for detecting low probability targets is to take the likelihood ratio of the two classes:
$$
\Lambda=\frac{p(x_i|c_1)}{p(x_i|c_0)}
$$
To pick which of your candidates is most likely to belong to class 1, pick the one with the highest value of $\Lambda$. If your two classes are described by multivariate Gaussian distributions, you can replace $\Lambda$ with its natural logarithm, resulting in a simpler quadratic detector. If you further assume that the target and background have the same covariance matrices, this results in a linear discriminant (http://en.wikipedia.org/wiki/Linear_discriminant_analysis).
You may want to consider Bayesian utility theory to re-weight the costs of different kinds of error to get away from the problem of the priors dominating the decision.
Let A be the 99% prior probability class, B be the 1% class.
If we just say that all errors incur the same cost (negative utility), then
it's possible that the optimal decision approach is to always declare "A". Many
classification algorithms (implicitly) assume this.
If instead, we declare that the cost of declaring "B" when, in fact, the instance
was "A" is much bigger than the cost of the opposite error, then the decision logic
becomes, in a sense, more sensitive to slighter differences in the features.
This kind of situation frequently comes up in fault detection -- faults in the monitored
system will be rare, but you want to be sure that if we see any data that points to
an error condition, action needs to be taken (even if it is just reviewing the data).

Neural networks - why so many learning rules?

I'm starting neural networks, currently following mostly D. Kriesel's tutorial. Right off the beginning it introduces at least three (different?) learning rules (Hebbian, delta rule, backpropagation) concerning supervised learning.
I might be missing something, but if the goal is merely to minimize the error, why not just apply gradient descent over Error(entire_set_of_weights)?
Edit: I must admit the answers still confuse me. It would be helpful if one could point out the actual difference between those methods, and the difference between them and straight gradient descent.
To stress it, these learning rules seem to take the layered structure of the network into account. On the other hand, finding the minimum of Error(W) for the entire set of weights completely ignores it. How does that fit in?
One question is how to apportion the "blame" for an error. The classic Delta Rule or LMS rule is essentially gradient descent. When you apply Delta Rule to a multilayer network, you get backprop. Other rules have been created for various reasons, including the desire for faster convergence, non-supervised learning, temporal questions, models that are believed to be closer to biology, etc.
On your specific question of "why not just gradient descent?" Gradient descent may work for some problems, but many problems have local minima, which naive gradient descent will get stuck in. The initial response to that is to add a "momentum" term, so that you might "roll out" of a local minimum; that's pretty much the classic backprop algorithm.
First off, note that "backpropagation" simply means that you apply the delta rule on each layer from output back to input so it's not a separate rule.
As for why not a simple gradient descent, well, the delta rule is basically gradient descent. However, it tends to overfit the training data and doesn't generalize as efficiently as techniques which don't try to decay the error margin to zero. This makes sense because "error" here simply means the difference between our samples and the output - they are not guaranteed to accurately represent all possible inputs.
Backpropagation and naive gradient descent also differ in computational efficiency. Backprop is basically taking the networks structure into account and for each weight only calculates the actually needed parts.
The derivative of the error with respects to the weights is splitted via the chainrule into: ∂E/∂W = ∂E/∂A * ∂A/∂W. A is the activations of particular units. In most cases, the derivatives will be zero because W is sparse due to the networks topology. With backprop, you get the learning rules on how to ignore those parts of the gradient.
So, from a mathematical perspective, backprop is not that exciting.
there may be problems which for example make backprop run into local minima. Furthermore, just as an example, you can't adjust the topology with backprop. There are also cool learning methods using nature-inspired metaheuristics (for instance, evolutionary strategies) that enable adjusting weight AND topology (even recurrent ones) simultaneously. Probably, I will add one or more chapters to cover them, too.
There is also a discussion function right on the download page of the manuscript - if you find other hazzles that you don't like about the manuscript, feel free to add them to the page so I can change things in the next edition.
Greetz,
David (Kriesel ;-) )

Resources