I'm using Swift (even if my question is not about language) and Python to test my ML logic. I have training data:
("add a new balloon", "add-balloon")
("add a balloon", "add-balloon")
("get last balloon", "get-balloon")
("update balloon color to red", "update-balloon")
When I try to use Naive Bayes to classify a new sentence, I get results like:
classify("could you add a new balloon")
// Return add-balloon
classify("could you update the balloon color")
// Return add-balloon
classify("update the balloon color")
// Return add-balloon
My data set has a lot of observations about adding a balloon (about 50) but not a lot to update or get (about 5-6). Is Naive Bayes sensitive to the number of training observations? I don't understand why the classification is not performing well even if given a sentence it saw during training.
Naive Bayes is sensitive to class priors (the distribution of examples among classes). If you have far more add-balloon examples than examples of the other categories, it will be biased towards that class. This is normally helpful: if you know nothing else about an input (no informative features), your best bet is the most likely class.
However, if your distribution is heavily skewed, your data sets are small, or your documents are short or lack very informative words (or contain many ambiguous ones), this can cause undesired results such as the ones you are reporting.
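To make the effect of the priors concrete, here is a minimal sketch of a multinomial Naive Bayes in plain Python. The 50/5/5 class counts mirror the question. The Laplace smoothing, the skipping of unknown words, the `use_priors` switch and the function names are assumptions made for illustration, not the asker's implementation.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count documents per class and words per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_docs[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab

def classify(text, model, use_priors=True):
    """Return the class with the highest log score (Laplace smoothing, alpha = 1)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        # Log prior: share of training documents that belong to this class.
        score = math.log(class_docs[label] / total_docs) if use_priors else 0.0
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Skewed toy data: 50 add-balloon examples, 5 each for the other classes.
examples = (
    [("add a new balloon", "add-balloon")] * 25
    + [("add a balloon", "add-balloon")] * 25
    + [("get last balloon", "get-balloon")] * 5
    + [("update balloon color to red", "update-balloon")] * 5
)
model = train(examples)

clear_query = classify("could you update the balloon color", model)
ambiguous_with_priors = classify("add color", model)
ambiguous_uniform = classify("add color", model, use_priors=False)
print(clear_query, ambiguous_with_priors, ambiguous_uniform)
```

In this toy setup the skewed prior only tips the ambiguous query "add color" towards add-balloon. The clearly worded update sentence is still classified as update-balloon, so if an implementation gets that one wrong, the smoothing and counting code is worth checking as well as the class balance.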
Naive Bayes does depend on the amount of training data at first, but after a certain point its performance plateaus, and adding more training data no longer improves the classifier.
But coming to your case, the data is too small for the model to accurately learn about "update-balloon" and is predicting "add-balloon". Try adding more examples for classes which have less data and see if the accuracy improves.
In case your data is skewed and there's not much you can do about it, you can try other Classifiers or try some tricks as mentioned here and here.
I am completely new to Machine Learning algorithms and I have a quick question with respect to Classification of a dataset.
Currently there is a training data that consists of two columns Message and Identifier.
Message - Typical message extracted from Log containing timestamp and some text
Identifier - Should classify the category based on the message content.
The training data was prepared by extracting a particular category from the tool and labelling it accordingly.
Now the test data contains just the message and I am trying to obtain the Category accordingly.
Which approach is most helpful in this scenario? Is it supervised or unsupervised learning?
I have a labeled training dataset and I am trying to predict the Category for the test data.
Thanks in advance,
Adam
If your labels are exact, then you can classify using ANN, SVM, etc. But if the labels are not exact, you have to cluster the data with respect to the features you have. K-means or nearest neighbour can be a starting point for clustering.
It is supervised learning, and a classification problem.
However, you obviously do not have the label column (the to-be-predicted value) for your test set. Thus, you cannot calculate error measures (such as false positive rate, accuracy, etc.) for that test set.
You could, however, split the set of labeled training data that you do have into a smaller training set and a validation set. Split it 70%/30%, perhaps. Then build a prediction model from your smaller 70% training dataset and tune it on your 30% validation set. When accuracy is good enough, apply it to your test set to predict the missing values.
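A minimal sketch of that 70%/30% split in plain Python. The function name, the fixed seed and the toy log messages are made up for illustration.

```python
import random

def train_validation_split(rows, train_fraction=0.7, seed=0):
    """Shuffle labeled rows and split them into training and validation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical labeled data: (Message, Identifier) pairs.
labeled = [(f"message {i}", "error" if i % 3 == 0 else "info") for i in range(10)]
train_rows, validation_rows = train_validation_split(labeled)
print(len(train_rows), len(validation_rows))
```

Fit the model on `train_rows` only, and use `validation_rows` to compare settings before predicting on the unlabeled test data.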
Which techniques / algorithms to use is a different question. You do not give enough information to answer that. And even if you did you still need to tune the model yourself.
You have labels to predict, and training data.
So by definition it is a supervised problem.
Try any classifier for text, such as NB, kNN, SVM, ANN, RF, ...
It's hard to predict which will work best on your data. You will have to try and evaluate several.
Given any image, I want my classifier to tell whether it is a sunflower or not. How can I go about creating the second class? Keeping the set of all possible images minus {Sunflower} in the second class is overkill. Is there any research in this direction? Currently my classifier uses a neural network in the final layer. I have based it upon the following tutorial:
https://github.com/torch/tutorials/tree/master/2_supervised
I am taking 254x254 images as the input.
Would an SVM help in the final layer? I am also open to using any other classifier or features that might help.
The standard approach in ML is:
1) Build a model
2) Train it on data with positive/negative examples (start with a 50/50 split of positive/negative in the training set)
3) Validate it on a test set (again, try 50/50 positive/negative examples in the test set)
If the results are not good:
a) Try a different model
b) Get more data
For case b), when deciding which additional data you need, a rule of thumb that works nicely for me is:
1) If the classifier gives lots of false positives (says this is a sunflower when it is not a sunflower at all), get more negative examples
2) If the classifier gives lots of false negatives (says this is not a sunflower when it actually is one), get more positive examples
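The rule of thumb above can be sketched as a small helper that counts false positives and false negatives on a labeled set. The function names and the toy label vectors are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes for binary labels where a truthy value means 'sunflower'."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, predicted in zip(y_true, y_pred):
        if predicted and truth:
            counts["tp"] += 1
        elif predicted and not truth:
            counts["fp"] += 1  # said sunflower, but it is not one
        elif truth:
            counts["fn"] += 1  # missed a real sunflower
        else:
            counts["tn"] += 1
    return counts

def which_examples_to_collect(counts):
    """Apply the rule of thumb: many FPs -> negatives, many FNs -> positives."""
    if counts["fp"] > counts["fn"]:
        return "more negative examples"
    if counts["fn"] > counts["fp"]:
        return "more positive examples"
    return "no clear imbalance"

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 1, 0, 1, 1]
counts = confusion_counts(y_true, y_pred)
suggestion = which_examples_to_collect(counts)
print(counts, suggestion)
```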
Generally, start with a reasonable amount of data and check the results; if the results on the training set or test set are bad, get more data. Stop getting more data when the results are good enough.
Another thing to consider: if your results with the current data and classifier are not good, you need to understand whether the problem is high bias (bad results on both the training set and the test set) or high variance (good results on the training set but bad results on the test set). If you have a high bias problem, a more powerful classifier will help; more data alone usually will not. If you have a high variance problem, a more powerful classifier is not needed and you need to think about generalization: introduce regularization, or maybe remove a couple of layers from your ANN. Another way of fighting high variance is getting much, MUCH more data.
So to sum up, you need to use an iterative approach and increase the amount of data step by step until you get good results. There is no magic-wand classifier, and there is no simple answer to how much data you should use.
It is a good idea to use a CNN as the feature extractor: peel off the original fully connected layer that was used for classification and add a new classifier. This is known as transfer learning, a technique that has been widely used in the deep learning research community. For your problem, using a one-class SVM as the added classifier is a good choice.
Specifically,
a good CNN feature extractor can be trained on a large dataset, e.g. ImageNet,
the one-class SVM can then be trained using your 'sunflower' dataset.
The essential part of solving your problem is the implementation of the one-class SVM, a method used for anomaly detection or novelty detection. You may refer to http://scikit-learn.org/stable/modules/outlier_detection.html for some insights into the method.
I am considering using a random forest for a classification problem. The data comes in sequences. I plan to use the first N (500) samples to train the classifier, and then use the classifier to classify the data that comes after. It will make mistakes, and the mistakes can sometimes be recorded.
My question is: can I use those misclassified samples to retrain the original classifier, and how? If I simply add the misclassified ones to the original training set of size N, then the importance of the misclassified ones will be exaggerated, as the correctly classified ones are ignored. Do I have to retrain the classifier using all the data? What other classifiers can do this kind of learning?
What you describe is a basic version of the Boosting meta-algorithm.
It's better if your underlying learner has a natural way to handle sample weights. I have not tried boosting random forests (generally boosting is used on individual shallow decision trees with a depth limit between 1 and 3); it might work but will likely be very CPU intensive.
Alternatively, you can train several independent boosted decision stumps in parallel with different PRNG seed values and then aggregate the final decision function as you would with a random forest (e.g. voting or averaging class probability assignments).
If you are using Python, you should have a look at the scikit-learn documentation on the topic.
Disclaimer: I am a scikit-learn contributor.
Here is my understanding of your problem.
You have a dataset and create two subsets from it, say a training dataset and an evaluation dataset. How can you use the evaluation dataset to improve classification performance?
The point of this problem isn't to find a better classifier but to find a good way to evaluate, and then to have a good classifier in the production environment.
Evaluation purpose
As the evaluation dataset has been set aside for evaluation, there is no way to do this. You must use another approach for training and evaluation.
A common way to do this is cross-validation:
Randomize the samples in your dataset. Create ten partitions from your initial dataset. Then do ten iterations of the following:
Take all partitions but the n-th for training and do the evaluation with the n-th.
After this, take the median of the errors of the ten runs.
This will give you the error rate of your classifier.
The worst run gives you the worst case.
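The ten-partition procedure described above can be sketched in plain Python as follows. The majority-class "classifier" is only a stand-in so the example runs without any library, and `train_fn` and `error_fn` are hypothetical hooks where a real learner would be plugged in.

```python
import random
import statistics
from collections import Counter

def k_fold_errors(samples, train_fn, error_fn, k=10, seed=0):
    """Return one error value per fold, training on the other k - 1 folds."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)        # randomize the samples
    folds = [shuffled[i::k] for i in range(k)]   # k roughly equal partitions
    errors = []
    for n in range(k):
        training = [s for i, fold in enumerate(folds) if i != n for s in fold]
        model = train_fn(training)
        errors.append(error_fn(model, folds[n]))  # evaluate on the n-th fold only
    return errors

# Stand-in learner: always predict the most common training label.
def train_majority(training):
    return Counter(label for _, label in training).most_common(1)[0][0]

def error_rate(model, held_out):
    return sum(label != model for _, label in held_out) / len(held_out)

samples = [(i, "a") for i in range(70)] + [(i, "b") for i in range(70, 100)]
errors = k_fold_errors(samples, train_majority, error_rate)
median_error = statistics.median(errors)
worst_error = max(errors)
print(median_error, worst_error)
```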
Production purpose
(no more evaluation)
You don't care about evaluation anymore. So take all the samples in your dataset and give them to your classifier for training (re-run a complete, simple training). The result can be used in the production environment, but can't be evaluated any more with any of your data. The result should be at least as good as the worst case over the previous partitions.
Flow sample processing
(production or learning)
When you are in a flow where new samples are produced over time, you will face cases where some samples correct earlier errors. This is the desired behavior because we want the system to improve itself. If you just correct the erroneous leaves in place, after some time your classifier will have nothing in common with the original random forest. You will be doing a form of greedy learning, like meta tabu search. Clearly we don't want this.
If we try to reprocess the whole dataset plus the new sample every time a new sample is available, we will experience terribly high latency. The solution is like what humans do: from time to time a background process runs (when the service is under low usage) and all the data gets a complete re-learning; at the end, swap the old and new classifiers.
Sometimes the sleep time is too short for a complete re-learning. So you have to use cluster computing across several nodes, like that. It costs a lot of development because you probably need to rewrite the algorithms; but at that point you already have the biggest computer you could find.
Note: the swap process is very important to master. You should already have it in your production plan. What do you do if you want to change algorithms? Backup? Benchmark? Power cut? Etc.
I would simply add the new data and retrain the classifier periodically if it weren't too expensive.
A simple way to keep things in balance is to add weights.
If you weight all positive samples by 1/n_positive and all negative samples by 1/n_negative (including all the new negative samples you're getting), then you don't have to worry about the classifier getting out of balance.
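A minimal sketch of that weighting scheme. The helper name and the 50/5 toy labels are made up for illustration.

```python
from collections import Counter

def balanced_weights(labels):
    """Give each sample weight 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]

labels = ["pos"] * 50 + ["neg"] * 5
weights = balanced_weights(labels)

# Each class now contributes the same total weight, however many samples it has.
total_pos = sum(w for w, label in zip(weights, labels) if label == "pos")
total_neg = sum(w for w, label in zip(weights, labels) if label == "neg")
print(total_pos, total_neg)
```

Many scikit-learn estimators, random forests included, accept such a list through the `sample_weight` argument of `fit`.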
I am doing the text categorization machine learning problem using Naive Bayes. I have each word as a feature. I have been able to implement it and I am getting good accuracy.
Is it possible for me to use tuples of words as features?
For example, suppose there are two classes, Politics and Sports. The word government might appear in both of them. However, in Politics I can have a tuple (government, democracy), whereas in Sports I can have a tuple (government, sportsman). So, if a new text article comes in which is about politics, the tuple (government, democracy) is more probable than the tuple (government, sportsman).
I am asking because I wonder whether, by doing this, I am violating the independence assumption of Naive Bayes, since I am considering single words as features too.
Also, I am thinking of adding weights to features. For example, a 3-tuple feature will have less weight than a 4-tuple feature.
Theoretically, are these two approaches not changing the independence assumptions on the Naive Bayes classifier? Also, I have not started with the approach I mentioned yet but will this improve the accuracy? I think the accuracy might not improve but the amount of training data required to get the same accuracy would be less.
Even without adding bigrams, real documents already violate the independence assumption. Conditioned on having Obama in a document, President is much more likely to appear. Nonetheless, Naive Bayes still does a decent job at classification, even if the probability estimates it gives are hopelessly off. So I recommend that you go on and add more complex features to your classifier and see if they improve accuracy.
If you get the same accuracy with less data, that is basically equivalent to getting better accuracy with the same amount of data.
On the other hand, using simpler, more common features works better as you decrease the amount of data. If you try to fit too many parameters to too little data, you tend to overfit badly.
But the bottom line is to try it and see.
No, from a theoretical viewpoint, you are not changing the independence assumption. You are simply creating a modified (or new) sample space. In general, once you start using higher n-grams as events in your sample space, data sparsity becomes a problem. I think using tuples will lead to the same issue. You will probably need more training data, not less. You will probably also have to give a little more thought to the type of smoothing you use. Simple Laplace smoothing may not be ideal.
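As a rough illustration of "tuples as events in the sample space", the sketch below emits unigrams plus contiguous bigrams as features. Contiguous n-grams are a simplification assumed here; the question's tuples such as (government, democracy) need not be adjacent words.

```python
def ngram_features(text, max_n=2):
    """Return all contiguous word n-grams of length 1..max_n as feature strings."""
    tokens = text.lower().split()
    features = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features

features = ngram_features("the government supports democracy")
print(features)
```

Each bigram is counted like any other feature, so the Naive Bayes code itself is unchanged. The number of distinct features grows quickly with `max_n`, though, which is where the sparsity and smoothing concerns come from.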
Most important point, I think, is this: whatever classifier you are using, the features are highly dependent on the domain (and sometimes even the dataset). For example, if you are classifying sentiment of texts based on movie reviews, using only unigrams may seem to be counterintuitive, but they perform better than using only adjectives. On the other hand, for twitter datasets, a combination of unigrams and bigrams were found to be good, but higher n-grams were not useful. Based on such reports (ref. Pang and Lee, Opinion mining and Sentiment Analysis), I think using longer tuples will show similar results, since, after all, tuples of words are simply points in a higher-dimensional space. The basic algorithm behaves the same way.
I am using a Naive Bayes Classifier to categorize several thousand documents into 30 different categories. I have implemented a Naive Bayes Classifier, and with some feature selection (mostly filtering useless words), I've gotten about a 30% test accuracy, with 45% training accuracy. This is significantly better than random, but I want it to be better.
I've tried implementing AdaBoost with NB, but it does not appear to give appreciably better results (the literature seems split on this, some papers say AdaBoost with NB doesn't give better results, others do). Do you know of any other extensions to NB that may possibly give better accuracy?
In my experience, properly trained Naive Bayes classifiers are usually astonishingly accurate (and very fast to train, noticeably faster than any classifier-builder I have ever used).
So when you want to improve classifier prediction, you can look in several places:
tune your classifier (adjusting the classifier's tunable parameters);
apply some sort of classifier combination technique (e.g., ensembling, boosting, bagging); or
look at the data fed to the classifier: either add more data, improve your basic parsing, or refine the features you select from the data.
With respect to naive Bayesian classifiers, parameter tuning is limited; I recommend focusing on your data, i.e., the quality of your pre-processing and the feature selection.
I. Data Parsing (pre-processing)
I assume your raw data is something like a string of raw text for each data point, which you transform through a series of processing steps into a structured vector (1D array) such that each offset corresponds to one feature (usually a word) and the value at that offset corresponds to its frequency.
stemming: either manually or by using a stemming library; the popular open-source ones are Porter, Lancaster, and Snowball. So for instance, if you have the terms programmer, program, programming, programmed in a given data point, a stemmer will reduce them to a single stem (probably program), so your term vector for that data point will have a value of 4 for the feature program, which is probably what you want.
synonym finding: same idea as stemming, fold related words into a single word; so a synonym finder can identify developer, programmer, coder, and software engineer and roll them into a single term
neutral words: words with similar frequencies across classes make poor features
II. Feature Selection
consider a prototypical use case for NBCs: filtering spam; you can quickly see how it fails and just as quickly you can see how to improve it. For instance, above-average spam filters have nuanced features like: frequency of words in all caps, frequency of words in title, and the occurrence of exclamation point in the title. In addition, the best features are often not single words but e.g., pairs of words, or larger word groups.
III. Specific Classifier Optimizations
Instead of 30 classes use a 'one-against-many' scheme--in other words, you begin with a two-class classifier (Class A and 'all else') then the results in the 'all else' class are returned to the algorithm for classification into Class B and 'all else', etc.
The Fisher Method (probably the most common way to optimize a Naive Bayes classifier). I think of Fisher as normalizing (more correctly, standardizing) the input probabilities. An NBC uses the feature probabilities to construct a 'whole-document' probability. The Fisher Method calculates the probability of a category for each feature of the document, then combines these feature probabilities and compares that combined probability with the probability of a random set of features.
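One common formulation of this idea is Fisher's method for combining probabilities: sum the logs of the per-feature category probabilities, multiply by -2, and evaluate the result against a chi-square distribution with two degrees of freedom per feature. The sketch below assumes that formulation, which may differ in detail from what the answer has in mind. The function names and input probabilities are made up.

```python
import math

def chi_square_survival(chi, degrees_of_freedom):
    """P(X >= chi) for a chi-square variable with an even number of degrees of freedom."""
    m = chi / 2.0
    term = total = math.exp(-m)
    for i in range(1, degrees_of_freedom // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_score(feature_probs):
    """Combine per-feature P(category | feature) values into one score in [0, 1]."""
    chi = -2.0 * sum(math.log(p) for p in feature_probs)
    return chi_square_survival(chi, 2 * len(feature_probs))

strong = fisher_score([0.9, 0.9, 0.9])   # features that all point to the category
weak = fisher_score([0.1, 0.1, 0.1])     # features that all point away from it
single = fisher_score([0.5])             # one feature: the score equals its probability
print(strong, weak, single)
```

A document is then assigned to the category with the highest score, optionally only when that score clears a per-category threshold.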
I would suggest using a SGDClassifier as in this and tune it in terms of regularization strength.
Also try to tune the TF-IDF formula you're using by tuning the parameters of TfidfVectorizer.
I usually see that for text classification problems SVM or Logistic Regression, when trained one-versus-all, outperforms NB. As you can see in this nice article by Stanford people, for longer documents SVM outperforms NB. The code for the paper, which uses a combination of SVM and NB (NBSVM), is here.
Second, tune your TFIDF formula (e.g. sublinear tf, smooth_idf).
Normalize your samples with l2 or l1 normalization (l2 is the default in TfidfVectorizer) because it compensates for different document lengths.
A Multilayer Perceptron usually gets better results than NB or SVM because of the non-linearity it introduces, which is inherent to many text classification problems. I have implemented a highly parallel one using Theano/Lasagne which is easy to use and downloadable here.
Try to tune your l1/l2/elasticnet regularization. It makes a huge difference in SGDClassifier/SVM/Logistic Regression.
Try to use n-grams, which are configurable in TfidfVectorizer.
If your documents have structure (e.g. have titles) consider using different features for different parts. For example add title_word1 to your document if word1 happens in the title of the document.
Consider using the length of the document as a feature (e.g. number of words or characters).
Consider using meta information about the document (e.g. time of creation, author name, url of the document, etc.).
Recently Facebook published their FastText classification code which performs very well across many tasks, be sure to try it.
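The title-prefix and document-length suggestions above can be sketched as a small pre-processing step. The `title_` prefix follows the naming in the tip. The length buckets and the 50-word threshold are arbitrary assumptions for illustration.

```python
def structured_features(title, body, long_threshold=50):
    """Return body tokens plus title-prefixed tokens and a coarse length feature."""
    body_tokens = body.lower().split()
    # Prefix title words so "election" in the title differs from "election" in the body.
    title_tokens = ["title_" + word for word in title.lower().split()]
    length_feature = "len_short" if len(body_tokens) < long_threshold else "len_long"
    return body_tokens + title_tokens + [length_feature]

tokens = structured_features("Election results", "the government won the vote")
print(tokens)
```

The resulting token list can be fed to whatever vectorizer or counting code the classifier already uses.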
Using Laplacian Correction along with AdaBoost.
In AdaBoost, first a weight is assigned to each data tuple in the training dataset. The initial weights are set using the init_weights method, which initializes each weight to 1/d, where d is the size of the training data set.
Then, a generate_classifiers method is called, which runs k times, creating k instances of the Naïve Bayes classifier. These classifiers are then weighted, and the test data is run on each classifier. The sum of the weighted "votes" of the classifiers constitutes the final classification.
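A minimal sketch of the two ends of that procedure: the uniform initial weights and the final weighted vote. The answer does not say how classifier weights are computed, so the alpha formula below is the standard binary AdaBoost weight, used here as an assumption. The predictions and error rates are made up.

```python
import math
from collections import defaultdict

def init_weights(d):
    """Start with the same weight 1/d for each of the d training tuples."""
    return [1.0 / d] * d

def classifier_weight(error):
    """Standard binary AdaBoost weight (assumed): lower weighted error, larger vote."""
    return 0.5 * math.log((1.0 - error) / error)

def weighted_vote(predictions, alphas):
    """Sum the classifier weights per predicted label and return the heaviest label."""
    totals = defaultdict(float)
    for label, alpha in zip(predictions, alphas):
        totals[label] += alpha
    return max(totals, key=totals.get)

weights = init_weights(8)

# Hypothetical output of k = 3 Naive Bayes classifiers on one test document.
predictions = ["spam", "ham", "ham"]
errors = [0.10, 0.40, 0.45]
alphas = [classifier_weight(e) for e in errors]
final_label = weighted_vote(predictions, alphas)
print(final_label)
```

Here the single accurate classifier outvotes the two weak ones, which is the point of weighting the votes rather than counting them.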
Improving the Naive Bayes classifier for general cases
Take the logarithm of your probabilities as input features
We move from probability space to log-probability space because the probability is calculated by multiplying many probabilities, and the result becomes very small. Working with log probabilities avoids the underflow problem.
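A small demonstration of the underflow the paragraph above refers to. The probability values are made up.

```python
import math

# 100 features, each with a small conditional probability.
probs = [1e-5] * 100

# Multiplying directly: the true value is 1e-500, far below the smallest float.
product = 1.0
for p in probs:
    product *= p

# Summing logs instead keeps the score finite and comparable between classes.
log_sum = sum(math.log(p) for p in probs)

print(product)   # underflows to 0.0
print(log_sum)   # a large negative but finite number
```

Because the logarithm is monotonic, the class with the largest sum of logs is also the class with the largest product.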
Remove correlated features.
Naive Bayes relies on the assumption of independence. When features are correlated, meaning one feature depends on others, that assumption fails.
More about correlation can be found here
Work with enough data, not huge data
Naive Bayes requires less data than logistic regression since it only needs data to understand the probabilistic relationship of each attribute in isolation with the output variable, not the interactions.
Check zero frequency error
If the test data set has a zero-frequency issue, apply a smoothing technique such as the “Laplace Correction” to predict the class of the test data.
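A minimal sketch of the Laplace correction for a word-given-class probability. The counts and vocabulary size are made up, and `alpha` is the usual smoothing strength (1 for classic Laplace).

```python
def smoothed_probability(word_count, total_words_in_class, vocab_size, alpha=1.0):
    """P(word | class) with add-alpha smoothing; never zero when alpha > 0."""
    return (word_count + alpha) / (total_words_in_class + alpha * vocab_size)

# A word never seen with this class: 0 occurrences out of 15 words, vocabulary of 10.
unsmoothed = smoothed_probability(0, 15, 10, alpha=0.0)
smoothed = smoothed_probability(0, 15, 10)
print(unsmoothed, smoothed)
```

Without smoothing, a single unseen word drives the whole product for that class to zero. With it, the word only lowers the score.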
More on this is well described in the following posts:
machinelearningmastery site post
Analyticvidhya site post
Keeping the n size small also helps NB give high-accuracy results; at its core, as n increases, its accuracy degrades.
Select features which have less correlation between them. And try using different combination of features at a time.