How does Naive Bayes assumption make segmentation less computationally intensive? - machine-learning

I was going through a video of Udacity's Intro to AI Class and I can't seem to wrap one idea around my head.
It is stated that for a string of length n 2n-1 segmentations are possible. When we take the Naive Bayes assumption the best segmentation s* can be defined as the one that maximizes
product(P(wi))
It is possible to write the best same as:
s* = argmaxs P(first_word) * s*(rest_of_words)
I understand why the above is True. The instructor said that due to the above equation we do not have to enumerate all 2n-1 cases. I am not able to understand the reason for this.
I also understand that finding P(single_word) is simple than learning the same prob for n-grams and that would help computationally too.

Since we are working with single words, we have to choose one word per time and not all their combination, thus reducing the search space. Consider the string:
"Iliketennis"
The string has 11 chars, thus 2^11=2048 cases. If we start looking at the most probable first word it could be:
"I", "Il", "Ili", "Ilik" and so on. 11 possible cases. Now that we have all the possible first words, we look for the most probable:
P("I")=0.4,
P("Il")=0.0001,
P("Ili")=0.002,
P("Ilik")=0.00003
...
and so on.
Finding out that the most probable is "I", we take it as the first word and now we can focus on the remaining 10 chars/cases:
"liketennis"
Repeating the same process you will have now 10 possible cases for the word, with probability:
P("l")=0.05,
P("lI")=0.0001,
P("lik")=0.0002,
P("lik")=0.00003
P("like")=0.3
...
and so on.
So we pick "like". Now the search is repeated for the last 6 chars. Without writing again the process, "tennis" is picked up and no chars are left, so the segmentation is ended.
Since we have made an analysis word-wise, the possibilities we have considered are
11+10+6=27
much much less than spanning over the 2048 possible splits.

I suggest a video by Mathematicalmonk,
this video: https://youtu.be/qX7n53NWYI4?t=9m43s
He explains that without conditional independence assumption (Naive Bayes), you need much more samples to estimate probabilities when you learn from data. But if you assume (even if it's incorrect) the independence between features, with less training data you can estimate the probability distribution.
Why? let's make it simple, without naive assumption, the probability of a 2-dimensional feature vector for a prediction y would be:
By assuming only binary values for x_1 and x_2 features, you need to store these value per y, learnt from sample data:
P(x_1=0|y), P(x_1=1|y), P(x_2=0|x_1=0,y), P(x_2=0|x_1=1,y), P(x_2=1|x_1=0,y), P(x_2=1|x_1=1,y)
In another word, you need to store parameters. You can generalize it to d-dimensional binary feature vector:
If you take the naive assumption and assume these features are independent, you will have this formula:
which means you only need to store these parameters per y, in order to predict all possible X:
P(x_1=0|y), P(x_1=1|y), P(x_2=0|y), P(x_2=1|y)
Or generalize it to:

Related

Supervised learning linear regression

I am confused about how linear regression works in supervised learning. Now I want to generate a evaluation function for a board game using linear regression, so I need both the input data and output data. Input data is my board condition, and I need the corresponding value for this condition, right? But how can I get this expected value? Do I need to write an evaluation function first by myself? But I thought I need to generate an evluation function by using linear regression, so I'm a little confused about this.
It's supervised-learning after all, meaning: you will need input and output.
Now the question is: how to obtain these? And this is not trivial!
Candidates are:
historical-data (e.g. online-play history)
some form or self-play / reinforcement-learning (more complex)
But then a new problem arises: which output is available and what kind of input will you use.
If there would be some a-priori implemented AI, you could just take the scores of this one. But with historical-data for example you only got -1,0,1 (A wins, draw, B wins) which makes learning harder (and this touches the Credit Assignment problem: there might be one play which made someone lose; it's hard to understand which of 30 moves lead to the result of 1). This is also related to the input. Take chess for example and take a random position from some online game: there is the possibility that this position is unique over 10 million games (or at least not happening often) which conflicts with the expected performance of your approach. I assumed here, that the input is the full board-position. This changes for other inputs, e.g. chess-material, where the input is just a histogram of pieces (3 of these, 2 of these). Now there are much less unique inputs and learning will be easier.
Long story short: it's a complex task with a lot of different approaches and most of this is somewhat bound by your exact task! A linear evaluation-function is not super-uncommon in reinforcement-learning approaches. You might want to read some literature on these (this function is a core-component: e.g. table-lookup vs. linear-regression vs. neural-network to approximate the value- or policy-function).
I might add, that your task indicates the self-learning approach to AI, which is very hard and it's a topic which somewhat gained additional (there was success before: see Backgammon AI) popularity in the last years. But all of these approaches are highly complex and a good understanding of RL and the mathematical-basics like Markov-Decision-Processes are important then.
For more classic hand-made evaluation-function based AIs, a lot of people used an additional regressor for tuning / weighting already implemented components. Some overview at chessprogramming wiki. (the chess-material example from above might be a good one: assumption is: more pieces better than less; but it's hard to give them values)

Features of vector form of sentences for opinion finding.

I want to find the opinion of a sentence either positive or negative. For example talk about only one sentence.
The play was awesome
If change it to vector form
[0,0,0,0]
After searching through the Bag of words
bad
naughty
awesome
The vector form becomes
[0,0,0,1]
Same for other sentences. Now I want to pass it to the machine learning algorithm for training it. How can I train the network using these multiple vectors? (for finding the opinion of unseen sentences) Obviously not! Because the input is fix in neural network. Is there any way? The above procedure is just my thinking. Kindly correct me if I am wrong. Thanks in advance.
Since your intuitive input format is "Sentence". Which is, indeed, a string of tokens with arbitrary length. Abstracting sentences as token series is not a good choice for many existing algorithms only works on determined format of inputs.
Hence, I suggest try using tokenizer on your entire training set. This will give you vectors of length of the dictionary, which is fixed for given training set.
Because when the length of sentences vary drastically, then size of the dictionary always keeps stable.
Then you can apply Neural Networks(or other algorithms) to the tokenized vectors.
However, vectors generated by tokenizer is extremely sparse because you only work on sentences rather than articles.
You can try LDA (supervised, not PCA), to reduce the dimension as well as amplify the difference.
That will keep the essential information of your training data as well as express your data at fixed size, while this "size" is not too large.
By the way, you may not have to label each word by its attitude since the opinion of a sentence also depends on other kind of words.
Simple arithmetics on number of opinion-expressing words many leave your model highly biased. Better label the sentences and leave the rest job to classifiers.
For the confusions
PCA and LDA are Dimensional Reduction techniques.
difference
Let's assume each tuple of sample is denoted as x (1-by-p vector).
p is too large, we don't like that.
Let's find a matrix A(p-by-k) in which k is pretty small.
So we get reduced_x = x*A, and most importantly, reduced_x must
be able to represent x's characters.
Given labeled data, LDA can provide proper A that can maximize
distance between reduced_x of different classes, and also minimize
the distance within identical classes.
In simple words: compress data, keep information.
When you've got
reduced_x, you can define training data: (reduced_x|y) where y is
0 or 1.

Distance measure for categorical attributes for k-Nearest Neighbor

For my class project, I am working on the Kaggle competition - Don't get kicked
The project is to classify test data as good/bad buy for cars. There are 34 features and the data is highly skewed. I made the following choices:
Since the data is highly skewed, out of 73,000 instances, 64,000 instances are bad buy and only 9,000 instances are good buy. Since building a decision tree would overfit the data, I chose to use kNN - K nearest neighbors.
After trying out kNN, I plan to try out Perceptron and SVM techniques, if kNN doesn't yield good results. Is my understanding about overfitting correct?
Since some features are numeric, I can directly use the Euclid distance as a measure, but there are other attributes which are categorical. To aptly use these features, I need to come up with my own distance measure. I read about Hamming distance, but I am still unclear on how to merge 2 distance measures so that each feature gets equal weight.
Is there a way to find a good approximate for value of k? I understand that this depends a lot on the use-case and varies per problem. But, if I am taking a simple vote from each neighbor, how much should I set the value of k? I'm currently trying out various values, such as 2,3,10 etc.
I researched around and found these links, but these are not specifically helpful -
a) Metric for nearest neighbor, which says that finding out your own distance measure is equivalent to 'kernelizing', but couldn't make much sense from it.
b) Distance independent approximation of kNN talks about R-trees, M-trees etc. which I believe don't apply to my case.
c) Finding nearest neighbors using Jaccard coeff
Please let me know if you need more information.
Since the data is unbalanced, you should either sample an equal number of good/bad (losing lots of "bad" records), or use an algorithm that can account for this. I think there's an SVM implementation in RapidMiner that does this.
You should use Cross-Validation to avoid overfitting. You might be using the term overfitting incorrectly here though.
You should normalize distances so that they have the same weight. By normalize I mean force to be between 0 and 1. To normalize something, subtract the minimum and divide by the range.
The way to find the optimal value of K is to try all possible values of K (while cross-validating) and chose the value of K with the highest accuracy. If a "good" value of K is fine, then you can use a genetic algorithm or similar to find it. Or you could try K in steps of say 5 or 10, see which K leads to good accuracy (say it's 55), then try steps of 1 near that "good value" (ie 50,51,52...) but this may not be optimal.
I'm looking at the exact same problem.
Regarding the choice of k, it's recommended be an odd value to avoid getting "tie votes".
I hope to expand this answer in the future.

Text categorization using Naive Bayes

I am doing the text categorization machine learning problem using Naive Bayes. I have each word as a feature. I have been able to implement it and I am getting good accuracy.
Is it possible for me to use tuples of words as features?
For example, if there are two classes, Politics and sports. The word called government might appear in both of them. However, in politics I can have a tuple (government, democracy) whereas in the class sports I can have a tuple (government, sportsman). So, if a new text article comes in which is politics, the probability of the tuple (government, democracy) has more probability than the tuple (government, sportsman).
I am asking this is because by doing this am I violating the independence assumption of the Naive Bayes problem, because I am considering single words as features too.
Also, I am thinking of adding weights to features. For example, a 3-tuple feature will have less weight than a 4-tuple feature.
Theoretically, are these two approaches not changing the independence assumptions on the Naive Bayes classifier? Also, I have not started with the approach I mentioned yet but will this improve the accuracy? I think the accuracy might not improve but the amount of training data required to get the same accuracy would be less.
Even without adding bigrams, real documents already violate the independence assumption. Conditioned on having Obama in a document, President is much more likely to appear. Nonetheless, naive bayes still does a decent job at classification, even if the probability estimates it gives are hopelessly off. So I recommend that you go on and add more complex features to your classifier and see if they improve accuracy.
If you get the same accuracy with less data, that is basically equivalent to getting better accuracy with the same amount of data.
On the other hand, using simpler, more common features works better as you decrease the amount of data. If you try to fit too many parameters to too little data, you tend to overfit badly.
But the bottom line is to try it and see.
No, from a theoretical viewpoint, you are not changing the independence assumption. You are simply creating a modified (or new) sample space. In general, once you start using higher n-grams as events in your sample space, data sparsity becomes a problem. I think using tuples will lead to the same issue. You will probably need more training data, not less. You will probably also have to give a little more thought to the type of smoothing you use. Simple Laplace smoothing may not be ideal.
Most important point, I think, is this: whatever classifier you are using, the features are highly dependent on the domain (and sometimes even the dataset). For example, if you are classifying sentiment of texts based on movie reviews, using only unigrams may seem to be counterintuitive, but they perform better than using only adjectives. On the other hand, for twitter datasets, a combination of unigrams and bigrams were found to be good, but higher n-grams were not useful. Based on such reports (ref. Pang and Lee, Opinion mining and Sentiment Analysis), I think using longer tuples will show similar results, since, after all, tuples of words are simply points in a higher-dimensional space. The basic algorithm behaves the same way.

Binarization in Natural Language Processing

Binarization is the act of transforming colorful features of of an entity into vectors of numbers, most often binary vectors, to make good examples for classifier algorithms.
If we where to binarize the sentence "The cat ate the dog", we could start by assigning every word an ID (for example cat-1, ate-2, the-3, dog-4) and then simply replace the word by it's ID giving the vector <3,1,2,3,4>.
Given these IDs we could also create a binary vector by giving each word four possible slots, and setting the slot corresponding to a specific word with to one, giving the vector <0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1>. The latter method is, as far as I know, is commonly referred to as the bag-of-words-method.
Now for my question, what is the best binarization method when it comes to describe features for natural language processing in general, and transition-based dependency parsing (with Nivres algorithm) in particular?
In this context, we do not want to encode the whole sentence, but rather the current state of the parse, for example the top word on the stack en the first word in the input queue. Since order is highly relevant, this rules out the bag-of-words-method.
With best, I am referring to the method that makes the data the most intelligible for the classifier, without using up unnecessary memory. For example I don't want a word bigram to use 400 million features for 20000 unique words, if only 2% the bigrams actually exist.
Since the answer is also depending on the particular classifier, I am mostly interested in maximum entropy models (liblinear), support vector machines (libsvm) and perceptrons, but answers that apply to other models are also welcome.
This is actually a really complex question. The first decision you have to make is whether to lemmatize your input tokens (your words). If you do this, you dramatically decrease your type count, and your syntax parsing gets a lot less complicated. However, it takes a lot of work to lemmatize a token. Now, in a computer language, this task gets greatly reduced, as most languages separate keywords or variable names with a well defined set of symbols, like whitespace or a period or whatnot.
The second crucial decision is what you're going to do with the data post-facto. The "bag-of-words" method, in the binary form you've presented, ignores word order, which is completely fine if you're doing summarization of a text or maybe a Google-style search where you don't care where the words appear, as long as they appear. If, on the other hand, you're building something like a compiler or parser, order is very much important. You can use the token-vector approach (as in your second paragraph), or you can extend the bag-of-words approach such that each non-zero entry in the bag-of-words vector contains the linear index position of the token in the phrase.
Finally, if you're going to be building parse trees, there are obvious reasons why you'd want to go with the token-vector approach, as it's a big hassle to maintain sub-phrase ids for every word in the bag-of-words vector, but very easy to make "sub-vectors" in a token-vector. In fact, Eric Brill used a token-id sequence for his part-of-speech tagger, which is really neat.
Do you mind if I ask what specific task you're working on?
Binarization is the act of
transforming colorful features of
an entity into vectors of numbers,
most often binary vectors, to make
good examples for classifier
algorithms.
I have mostly come across numeric features that take values between 0 and 1 (not binary as you describe), representing the relevance of the particular feature in the vector (between 0% and 100%, where 1 represents 100%). A common example for this are tf-idf vectors: in the vector representing a document (or sentence), you have a value for each term in the entire vocabulary that indicates the relevance of that term for the represented document.
As Mike already said in his reply, this is a complex problem in a wide field. In addition to his pointers, you might find it useful to look into some information retrieval techniques like the vector space model, vector space classification and latent semantic indexing as starting points. Also, the field of word sense disambiguation deals a lot with feature representation issues in NLP.
[Not a direct answer] It all depends on what you are try to parse and then process, but for general short human phrase processing (e.g. IVT) another method is to use neural networks to learn the patterns. This can be very acurate for smallish vocubularies

Resources