scikit learn classifies stopwords - machine-learning

Here is an example with a step-by-step procedure for making a system learn and classify input data.
It classifies correctly for the 5 given dataset domains. However, it also classifies input that consists only of stop words.
e.g.
Input : docs_new = ['God is love', 'what is where']
Output :
'God is love' => soc.religion.christian
'what is where' => soc.religion.christian
Here 'what is where' should not be classified, as it contains only stop words. How does scikit-learn behave in this scenario?

I am not sure which classifier you are using, but let's assume it is a Naive Bayes classifier.
In this case, the sample is assigned to the class whose posterior probability is highest given the particular pattern of words.
The posterior probability is calculated as
posterior = likelihood x prior
(Note that the evidence term is dropped since it is the same for every class.) Additionally, additive smoothing is applied to avoid scenarios where the likelihood is zero.
Anyway, if you have only stop words in your input text, the likelihood is constant for all classes and the posterior probability is entirely determined by the prior probability. So what basically happens is that a Naive Bayes classifier (if the priors were estimated from the training data) will assign the class label that occurs most often in the training data.
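To make the argument above concrete, here is a minimal sketch of a multinomial Naive Bayes with additive smoothing, written in plain Python rather than with scikit-learn. The tiny stop-word list, the training documents and the function names are all made up for illustration; the point is only that a document with no usable tokens gets a score equal to the log prior of each class.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, tiny stop-word list (not scikit-learn's built-in list).
STOP_WORDS = {"what", "is", "where", "and"}

def train(docs, labels):
    """Count documents per class and non-stop-word tokens per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = [t for t in doc.lower().split() if t not in STOP_WORDS]
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def log_posteriors(doc, model, alpha=1.0):
    """Unnormalised log posterior: log prior + sum of smoothed log likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    tokens = [t for t in doc.lower().split() if t in vocab]
    scores = {}
    for c, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)          # log prior
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        for t in tokens:                               # empty for stop-word-only input
            score += math.log((word_counts[c][t] + alpha) / denom)
        scores[c] = score
    return scores

def predict(doc, model):
    scores = log_posteriors(doc, model)
    return max(scores, key=scores.get)

docs = ["god is love", "faith and prayer", "church service sunday", "gpu rendering pipeline"]
labels = ["religion", "religion", "religion", "graphics"]
model = train(docs, labels)

print(predict("what is where", model))   # falls back to the majority class
print(predict("gpu pipeline", model))
```

With this toy data, 'what is where' contributes no tokens, so the scores are just log(3/4) and log(1/4) and the majority class wins.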

A classifier always predicts one of the classes that it saw during its training phase, by definition. I don't know what you did to produce the classifier, but most likely it's just predicting the majority class for any sample without interesting features; that's what naive Bayes, linear SVMs and other typical text classifiers do.

Standard text classification uses TfidfVectorizer to turn text into tokens and then into feature vectors that serve as input to the classifier.
One of its init parameters is stop_words; with stop_words='english' the vectorizer will produce no features for the sentence 'what is where'.
Stop words are matched lexically against every input token using a built-in English stop word list, which you can examine here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/stop_words.py

Related

Balanced corpus for Naive Bayes Classifier

I'm working on sentiment analysis using an NB classifier. I've found some information (blogs, tutorials, etc.) saying that the training corpus should be balanced:
33.3% Positive
33.3% Neutral
33.3% Negative
My question is:
Why should the corpus be balanced? Bayes' theorem is based on the probability of a cause/case. So for training purposes, isn't it important that in the real world, for example, negative tweets are only 10% and not 33.3%?
You are correct: balancing data is important for many discriminative models, but not really for NB.
However, it might still be beneficial to bias the P(y) estimates to get better predictive performance (because of the various simplifications models make, the probability assigned to the minority class can be heavily underestimated). For NB it is not about balancing the data, but literally about modifying the estimated P(y) so that accuracy on the validation set is maximised.
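As a small worked example of how much the estimated P(y) matters, the sketch below applies Bayes' rule to one tweet under two different priors. The likelihood values and the "realistic" class distribution are invented numbers, chosen only to show that the predicted class can flip when the prior changes.

```python
def posterior(likelihoods, priors):
    """Bayes' rule: P(y|x) is proportional to P(x|y) * P(y), then normalised."""
    joint = {y: likelihoods[y] * priors[y] for y in priors}
    evidence = sum(joint.values())
    return {y: joint[y] / evidence for y in joint}

# Hypothetical likelihoods P(tweet | class) for a single tweet.
likelihoods = {"positive": 0.02, "neutral": 0.03, "negative": 0.04}

# Priors from a balanced corpus vs. a corpus where negatives are only 10%.
balanced = {"positive": 1 / 3, "neutral": 1 / 3, "negative": 1 / 3}
realistic = {"positive": 0.45, "neutral": 0.45, "negative": 0.10}

post_balanced = posterior(likelihoods, balanced)
post_realistic = posterior(likelihoods, realistic)

print(max(post_balanced, key=post_balanced.get))    # negative
print(max(post_realistic, key=post_realistic.get))  # neutral
```

Tuning P(y) on a validation set, as suggested above, amounts to searching over the prior dictionary instead of fixing it to either of these two choices.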
In my opinion, the best dataset for training purposes is a sample of the real-world data that your classifier will be used with.
This is true for all classifiers (although some of them are indeed not suited to unbalanced training sets, in which case you don't really have a choice but to skew the distribution), and particularly for probabilistic classifiers such as Naive Bayes. So the best sample should reflect the natural class distribution.
Note that this is important not only for the class prior estimates. Naive Bayes will calculate for each feature the likelihood of predicting the class given the feature. If your Bayesian classifier is built specifically to classify texts, it will use global document frequency measures (the number of times a given word occurs in the dataset, across all categories). If the number of documents per category in the training set doesn't reflect their natural distribution, the global frequency of terms usually seen in infrequent categories will be overestimated, and that of terms seen in frequent categories underestimated. Thus not only the prior class probability will be incorrect, but also all the P(category=c|term=t) estimates.

Machine Learning Experiment Design with Small Positive Sample Set in Sci-kit Learn

I am interested in any tips on how to train a classifier with a very limited positive set and a large negative set.
I have about 40 positive examples (quite lengthy articles about a particular topic), and about 19,000 negative samples (most drawn from the scikit-learn newsgroups dataset). I also have about 1,000,000 tweets that I could work with, all negative with respect to the topic I am trying to train on. Is the size of the negative set versus the positive set going to negatively influence training a classifier?
I would like to use cross-validation in scikit-learn. Do I need to break this into train / test-dev / test sets? I know there are some pre-built libraries in scikit-learn. Any implementation examples that you recommend or have used previously would be helpful.
Thanks!
The answer to your first question is yes; how much it affects your results depends on the algorithm. My advice would be to keep an eye on class-based statistics such as recall and precision (found in classification_report).
For RandomForest() you can look at this thread, which discusses the sample weight parameter. In general, sample_weight is what you're looking for in scikit-learn.
For SVMs, have a look at either this example or this example.
For NB classifiers, this should be handled implicitly by Bayes' rule; in practice, however, you may see some poor performance.
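To illustrate why per-class recall and precision are worth watching on imbalanced data, here is a plain-Python sketch (it does not use classification_report itself). The labels are made up: a classifier that always predicts the negative class looks good on accuracy but has zero recall on the positive class.

```python
def precision_recall(y_true, y_pred, target):
    """Precision and recall for one class, treating it as the 'positive' label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up data: 4 positives among 100 samples, classifier always says "neg".
y_true = ["pos"] * 4 + ["neg"] * 96
y_pred = ["neg"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                                   # 0.96
print(precision_recall(y_true, y_pred, "pos"))    # (0.0, 0.0)
print(precision_recall(y_true, y_pred, "neg"))    # (0.96, 1.0)
```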
For your second question, it's up for discussion. Personally, I break my data into a training and a test split, perform cross-validation on the training set for parameter estimation, retrain on all the training data and then test on my test set. However, the amount of data you have may influence the way you split your data (more data means more options).
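The split described above can be sketched without any library. The helper below is a hypothetical stratified fold assignment (the name stratified_folds and the scaled-down class sizes are assumptions for illustration); stratifying matters here because with only 40 positives a purely random split could leave a fold with almost none.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds so each fold keeps similar class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, i in enumerate(indices):
            folds[j % k].append(i)      # deal indices round-robin into folds
    return folds

# Scaled-down stand-in for 40 positives against a large negative set.
labels = [1] * 40 + [0] * 1000
folds = stratified_folds(labels, k=5)

test_idx = folds[0]     # held out until the very end
cv_folds = folds[1:]    # cross-validate on these for parameter estimation

print([sum(labels[i] for i in fold) for fold in folds])   # positives per fold
```

After choosing parameters on cv_folds, one would retrain on all of their indices together and evaluate once on test_idx.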
You could probably use Random Forest for your classification problem. There are basically 3 parameters for dealing with data imbalance: Class Weight, Samplesize and Cutoff.
Class Weight - The higher the weight a class is given, the more its error rate is decreased.
Samplesize - Oversample the minority class to reduce class imbalance while sampling the data for each tree (not sure if scikit-learn supports this; it used to be a parameter in R).
Cutoff - If more than x% of the trees vote for the minority class, classify the sample as the minority class. By default x is 1/2 in Random Forest for a 2-class problem. You can set it to a lower value for the minority class.
Check out "balancing prediction error" at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
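The cutoff idea can be sketched independently of any particular Random Forest implementation. The function name and the vote list below are hypothetical; the sketch only shows how lowering the threshold changes the decision for the same set of tree votes.

```python
def classify_with_cutoff(tree_votes, minority_label, majority_label, cutoff=0.5):
    """Label as minority when the share of trees voting for it exceeds the cutoff."""
    share = sum(1 for v in tree_votes if v == minority_label) / len(tree_votes)
    return minority_label if share > cutoff else majority_label

# Made-up votes from 100 trees: 30 for the minority class, 70 for the majority.
votes = ["pos"] * 30 + ["neg"] * 70

print(classify_with_cutoff(votes, "pos", "neg"))                # neg
print(classify_with_cutoff(votes, "pos", "neg", cutoff=0.25))   # pos
```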
For the 2nd question: if you are using Random Forest, you do not need to keep a separate train/validation/test set. Random Forest does not choose any parameters based on a validation set, so a validation set is unnecessary.
Also, during the training of Random Forest, the data for training each individual tree is obtained by sampling with replacement from the training data, so each training sample is left out of roughly 1/3 of the trees. We can use the votes of those trees to get the out-of-bag prediction of the Random Forest for that sample. Thus, with OOB accuracy you need only a training set, not validation or test data, to estimate performance on unseen data. See the out-of-bag error section at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm for further study.
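The "roughly 1/3" figure can be checked with a short simulation. Drawing n samples with replacement leaves a given sample out with probability (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows. The function name and the sample size below are arbitrary choices for illustration.

```python
import random

def out_of_bag_fraction(n_samples, seed=0):
    """Fraction of samples never drawn in one bootstrap sample of size n_samples."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(in_bag) / n_samples

n = 10_000
simulated = out_of_bag_fraction(n)
expected = (1 - 1 / n) ** n     # close to 1/e

print(round(simulated, 3), round(expected, 3))
```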

How to do text classification with label probabilities?

I'm trying to solve a text classification problem for academic purposes. I need to classify tweets into labels like "cloud", "cold", "dry", "hot", "humid", "hurricane", "ice", "rain", "snow", "storms", "wind" and "other". Each tweet in the training data has probabilities for all the labels. Say the message "Can already tell it's going to be a tough scoring day. It's as windy right now as it was yesterday afternoon." has a 21% chance of being hot and a 79% chance of being wind. I have worked on classification problems that predict whether it's wind or hot or something else. But in this problem, each training instance has probabilities for all the labels. I have previously used the Mahout naive Bayes classifier, which takes a specific label for a given text to build a model. How can I convert these per-label probabilities into input for a classifier?
In a probabilistic setting, these probabilities reflect uncertainty about the class label of your training instance. This affects parameter learning in your classifier.
There's a natural way to incorporate this: in Naive Bayes, for instance, when estimating the parameters of your model, instead of each word getting a count of one for the class to which the document belongs, it gets a count equal to the document's probability of belonging to that class. Thus documents with a high probability of belonging to a class contribute more to that class's parameters. The situation is exactly equivalent to learning a mixture of multinomials model using EM, where the probabilities you have are identical to the membership/indicator variables for your instances.
Alternatively, if your classifier were a neural net with softmax output, instead of the target output being a vector with a single [1] and lots of zeros, the target output becomes the probability vector you're supplied with.
I don't, unfortunately, know of any standard implementations that would allow you to incorporate these ideas.
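Although no standard implementation is named above, the fractional-count idea for Naive Bayes is easy to sketch by hand. The function name soft_counts, the whitespace tokenisation and the two toy documents are assumptions for illustration; each word adds P(class | document) to that class's count instead of 1.

```python
from collections import defaultdict

def soft_counts(docs, label_probs):
    """Accumulate per-class word counts, weighting each word by P(class | doc)."""
    word_counts = defaultdict(lambda: defaultdict(float))
    class_mass = defaultdict(float)     # soft replacement for documents-per-class
    for doc, probs in zip(docs, label_probs):
        for label, p in probs.items():
            class_mass[label] += p
            for token in doc.lower().split():
                word_counts[label][token] += p
    return word_counts, class_mass

docs = ["windy right now", "hot and windy"]
label_probs = [{"wind": 0.79, "hot": 0.21}, {"wind": 0.4, "hot": 0.6}]

word_counts, class_mass = soft_counts(docs, label_probs)
print(word_counts["wind"]["windy"], class_mass["hot"])
```

These soft counts would then be plugged into the usual smoothed estimates of the priors and the per-class word probabilities.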
If you want an off-the-shelf solution, you could use a learner that supports multiclass classification and instance weights. Let's say you have k classes with probabilities p_1, ..., p_k. For each input instance, create k new training instances with identical features and with labels 1, ..., k, and assign them weights p_1, ..., p_k respectively.
Vowpal Wabbit is one such learner that supports multiclass classification with instance weights.
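The expansion step described above can be sketched as a small helper. The name expand_soft_labels and the feature dictionary are hypothetical, and dropping zero-weight copies is a convenience choice of this sketch, not part of the recipe; the resulting triples would be passed to whichever learner accepts instance weights.

```python
def expand_soft_labels(features, label_probs):
    """Turn one soft-labelled instance into (features, label, weight) triples."""
    # Zero-probability labels are skipped here, since a weight of 0 adds nothing.
    return [(features, label, p) for label, p in label_probs.items() if p > 0]

rows = expand_soft_labels(
    {"windy": 2, "tough": 1},
    {"hot": 0.21, "wind": 0.79, "rain": 0.0},
)
for features, label, weight in rows:
    print(label, weight)
```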

Model in Naive Bayes

When we train a decision tree classifier on a training set, we get a tree model. This model can be converted to rules and incorporated into Java code.
Now, if I train on the training set using Naive Bayes, what form does the model take? And how can I incorporate the model into my Java code?
If no model results from the training, then what is the difference between Naive Bayes and a lazy learner (e.g. kNN)?
Thanks in advance.
Naive Bayes constructs estimates of the conditional probabilities P(f_1,...,f_n|C_j), where f_i are features and C_j are classes. Using Bayes' rule and estimates of the priors (P(C_j)) and the evidence (P(f_i)), these can be translated into x=P(C_j|f_1,...,f_n), which can roughly be read as "Given features f_i, I think that they describe an object of class C_j, and my certainty is x". In fact, NB assumes that the features are independent, so it actually uses simple probabilities of the form x=P(f_i|C_j), i.e. "given class C_j, I expect to see f_i with probability x".
So the form of the model is set of probabilities:
Conditional probabilities P(f_i|C_j) for each feature f_i and each class C_j
priors P(C_j) for each class
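Because the model is just these two tables, it can be serialised and re-implemented in any language, Java included. The sketch below uses Python with made-up class names and probabilities; it scores only the features that are present, which is a simplification, and the predict function is an illustration rather than any library's API.

```python
import json
import math

# Hypothetical trained model: one table of priors, one of conditionals.
model = {
    "priors": {"spam": 0.4, "ham": 0.6},
    "conditionals": {
        "spam": {"offer": 0.30, "meeting": 0.05},
        "ham": {"offer": 0.05, "meeting": 0.25},
    },
}

# The tables can be exported, e.g. as JSON, and loaded from other code.
serialized = json.dumps(model)

def predict(model, present_features):
    """Pick the class maximising log P(C_j) + sum of log P(f_i | C_j)."""
    scores = {}
    for c, prior in model["priors"].items():
        score = math.log(prior)
        for f in present_features:
            score += math.log(model["conditionals"][c][f])
        scores[c] = score
    return max(scores, key=scores.get)

loaded = json.loads(serialized)
print(predict(loaded, ["offer"]))     # spam
print(predict(loaded, ["meeting"]))   # ham
```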
KNN on the other hand is something completely different. It actually is not a "learned model" in a strict sense, as you don't tune any parameters. It is rather a classification algorithm, which given training set and number k simply answers question "For given point x, what is the major class of k nearest points in the training set?".
The main difference is in the input data. Naive Bayes works on objects that are "observations", so you simply need some features which are either present in the classified object or absent. It does not matter if it is a color, an object in a photo, a word in a sentence or an abstract concept in a highly complex topological object. KNN, by contrast, is a distance-based classifier, which requires the objects you classify to be ones between which you can measure a distance. So in order to classify abstract objects you first have to come up with some metric, a distance measure that describes their similarity, and the result will be highly dependent on those definitions. Naive Bayes, on the other hand, is a simple probabilistic model which does not use the concept of distance at all. It treats all features in the same way: they are there or they aren't, end of story (of course it can be generalised to continuous variables with a given density function, but that is not the point here).
The Naive Bayes will construct/estimate the probability distribution from which your training samples have been generated.
Now, given this probability distribution for all your output classes, you take a test sample, and depending on which class has the highest probability of generating this sample, you assign the test sample to that class.
In short, you take the test sample and run it through all the probability distributions (one for each class) and calculate the probability of generating this test sample for that particular distribution.

Ways to improve the accuracy of a Naive Bayes Classifier?

I am using a Naive Bayes Classifier to categorize several thousand documents into 30 different categories. I have implemented a Naive Bayes Classifier, and with some feature selection (mostly filtering useless words), I've gotten about a 30% test accuracy, with 45% training accuracy. This is significantly better than random, but I want it to be better.
I've tried implementing AdaBoost with NB, but it does not appear to give appreciably better results (the literature seems split on this, some papers say AdaBoost with NB doesn't give better results, others do). Do you know of any other extensions to NB that may possibly give better accuracy?
In my experience, properly trained Naive Bayes classifiers are usually astonishingly accurate (and very fast to train--noticeably faster than any classifier-builder i have ever used).
so when you want to improve classifier prediction, you can look in several places:
tune your classifier (adjusting the classifier's tunable parameters);
apply some sort of classifier combination technique (eg, ensembling, boosting, bagging); or you can
look at the data fed to the classifier--either add more data, improve your basic parsing, or refine the features you select from the data.
w/r/t naive Bayesian classifiers, parameter tuning is limited; i recommend focusing on your data--ie, the quality of your pre-processing and the feature selection.
I. Data Parsing (pre-processing)
i assume your raw data is something like a string of raw text for each data point, which a series of processing steps transforms into a structured vector (1D array) for each data point, such that each offset corresponds to one feature (usually a word) and the value at that offset corresponds to frequency.
stemming: either manually or by using a stemming library? the popular open-source ones are Porter, Lancaster, and Snowball. So for instance, if you have the terms programmer, program, programming, programmed in a given data point, a stemmer will reduce them to a single stem (probably program) so your term vector for that data point will have a value of 4 for the feature program, which is probably what you want.
synonym finding: same idea as stemming--fold related words into a single word; so a synonym finder can identify developer, programmer, coder, and software engineer and roll them into a single term
neutral words: words with similar frequencies across classes make poor features
II. Feature Selection
consider a prototypical use case for NBCs: filtering spam; you can quickly see how it fails and just as quickly you can see how to improve it. For instance, above-average spam filters have nuanced features like: frequency of words in all caps, frequency of words in title, and the occurrence of exclamation point in the title. In addition, the best features are often not single words but e.g., pairs of words, or larger word groups.
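As a rough illustration of such "nuanced" features, the sketch below computes a few of them for one message. The feature names, the whitespace tokenisation and the rule that a word counts as all-caps only if it has more than one character are choices made for this example, not part of any particular spam filter.

```python
def caps_ratio(text):
    """Share of whitespace-separated words written entirely in capitals."""
    words = text.split()
    caps = [w for w in words if len(w) > 1 and w.isupper()]
    return len(caps) / len(words) if words else 0.0

def spam_features(title, body):
    """A few hand-crafted features beyond single-word counts."""
    return {
        "body_caps_ratio": caps_ratio(body),
        "title_caps_ratio": caps_ratio(title),
        "title_has_exclamation": "!" in title,
    }

features = spam_features("WIN a prize!", "Click NOW to claim your FREE prize")
print(features)
```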
III. Specific Classifier Optimizations
Instead of 30 classes use a 'one-against-many' scheme--in other words, you begin with a two-class classifier (Class A and 'all else') then the results in the 'all else' class are returned to the algorithm for classification into Class B and 'all else', etc.
The Fisher Method (probably the most common way to optimize a Naive Bayes classifier). i think of Fisher as normalizing (more correctly, standardizing) the input probabilities. An NBC uses the feature probabilities to construct a 'whole-document' probability. The Fisher Method calculates the probability of a category for each feature of the document, then combines these feature probabilities and compares that combined probability with the probability of a random set of features.
I would suggest using an SGDClassifier as in this and tuning it in terms of regularization strength.
Also try to tune the TFIDF formula you're using by tuning the parameters of TfidfVectorizer.
I usually see that for text classification problems SVM or Logistic Regression, when trained one-versus-all, outperforms NB. As you can see in this nice article by Stanford people, for longer documents SVM outperforms NB. The code for the paper, which uses a combination of SVM and NB (NBSVM), is here.
Second, tune your TFIDF formula (e.g. sublinear tf, smooth_idf).
Normalize your samples with l2 or l1 normalization (l2 is the default in TfidfVectorizer) because it compensates for different document lengths.
A Multilayer Perceptron usually gets better results than NB or SVM because of the non-linearity it introduces, which is inherent to many text classification problems. I have implemented a highly parallel one using Theano/Lasagne which is easy to use and downloadable here.
Try to tune your l1/l2/elasticnet regularization. It makes a huge difference in SGDClassifier/SVM/Logistic Regression.
Try to use n-grams, which are configurable in TfidfVectorizer.
If your documents have structure (e.g. have titles) consider using different features for different parts. For example add title_word1 to your document if word1 happens in the title of the document.
Consider using the length of the document as a feature (e.g. number of words or characters).
Consider using meta information about the document (e.g. time of creation, author name, url of the document, etc.).
Recently Facebook published their FastText classification code which performs very well across many tasks, be sure to try it.
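Several of the feature ideas in this answer (n-grams, title-specific tokens, document length) can be sketched in plain Python before wiring them into a vectorizer. The function names, the "title_" prefix and the whitespace tokenisation are illustrative assumptions.

```python
def ngrams(tokens, n):
    """All contiguous n-token sequences, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def document_features(title, body):
    """Unigrams and bigrams from the body, prefixed title tokens, and length."""
    body_tokens = body.lower().split()
    title_tokens = title.lower().split()
    tokens = body_tokens + ngrams(body_tokens, 2)
    tokens += ["title_" + t for t in title_tokens]   # keeps title words distinct
    return {"tokens": tokens, "n_words": len(body_tokens)}

features = document_features("Naive Bayes", "not very good")
print(features["tokens"])
print(features["n_words"])
```

The bigram "not very" shows why word pairs can carry information that the single words "not" and "very" do not.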
Using Laplacian Correction along with AdaBoost.
In AdaBoost, a weight is first assigned to each data tuple in the training dataset. The initial weights are set using the init_weights method, which initializes each weight to 1/d, where d is the size of the training data set.
Then, a generate_classifiers method is called, which runs k times, creating k instances of the Naïve Bayes classifier. These classifiers are then weighted, and the test data is run on each classifier. The sum of the weighted "votes" of the classifiers constitutes the final classification.
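The two ends of that procedure, the uniform initial weights and the final weighted vote, can be sketched as follows. Only the name init_weights comes from the description above; weighted_vote, the example predictions and the classifier weights are hypothetical, and the reweighting loop inside generate_classifiers is not shown.

```python
from collections import defaultdict

def init_weights(d):
    """Give each of the d training tuples the same initial weight 1/d."""
    return [1.0 / d] * d

def weighted_vote(predictions, classifier_weights):
    """Sum each classifier's weight behind its predicted label; return the winner."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, classifier_weights):
        totals[label] += weight
    return max(totals, key=totals.get)

weights = init_weights(4)
print(weights)

# Three hypothetical classifiers vote on one test sample.
print(weighted_vote(["a", "b", "b"], [0.9, 0.3, 0.4]))   # a
```

Here a single strongly weighted classifier outvotes two weaker ones, which is the difference from a simple majority vote.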
Improving a Naive Bayes classifier in general cases
Take the logarithm of your probabilities as input features
We change from probability space to log-probability space because we compute the result by multiplying probabilities, and the product becomes very small. Working with log probabilities avoids the underflow problem.
Remove correlated features.
Naive Bayes relies on the assumption that features are independent. When features are correlated, meaning one feature depends on others, that assumption fails.
More about correlation can be found here
Work with enough data, not huge data
Naive Bayes requires less data than logistic regression, since it only needs data to estimate the probabilistic relationship of each attribute, in isolation, with the output variable, not the interactions.
Check for zero-frequency errors
If the test data set has the zero-frequency issue, apply a smoothing technique such as “Laplace Correction” to predict the class of the test data set.
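The underflow problem that motivates log probabilities is easy to reproduce. In the sketch below, the 400 identical probabilities of 0.01 are an arbitrary stand-in for the per-word likelihoods of a long document.

```python
import math

# Stand-in for the per-feature likelihoods of one long document.
probabilities = [0.01] * 400

product = 1.0
for p in probabilities:
    product *= p        # 1e-800 is far below the smallest float, so this ends at 0.0

log_sum = sum(math.log(p) for p in probabilities)

print(product)          # 0.0
print(log_sum)          # a large negative number, but still usable for comparison
```

Since the logarithm is monotonic, comparing log_sum values across classes picks the same class as comparing the products would, without losing everything to 0.0.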
More on this is described in the following posts:
machinelearningmastery site post
Analyticvidhya site post
Keeping the n size small also helps NB give high-accuracy results; at its core, as the n size increases, its accuracy degrades.
Select features which have less correlation between them. And try using different combination of features at a time.

Resources