Calculating a score from multiple classifiers - machine-learning

I'm trying to determine the similarity between pairs of items taken among a large collection. The items have several attributes and I'm able to calculate a discrete similarity score for each attribute, between 0 and 1. I use various classifiers depending on the attribute: TF-IDF cosine similarity, Naive Bayes Classifier, etc.
I'm stuck when it comes to compiling all that information into a final similarity score for all the items. I can't just take an unweighted average because 1) what's a high score depends on the classifier and 2) some classifiers are more important than others. In addition, some classifiers should be considered only for their high scores, i.e. a high score points to a higher similarity but lower scores have no meaning.
So far I've calculated the final score with guesswork but the increasing number of classifiers makes this a very poor solution. What techniques are there to determine an optimal formula that will take my various scores and return just one? It's important to note that the system does receive human feedback, which is how some of the classifiers work to begin with.
Ultimately I'm only interested in ranking, for each item, the ones that are most similar. The absolute scores themselves are meaningless, only their ordering is important.

There is a great book on the topic of ensemble classifiers, available online: Combining Pattern Classifiers.
There are two chapters in this book (ch. 4 & ch. 5) on fusion of label outputs and how to get a single decision value.
A set of methods is defined in those chapters, including (a small sketch of the first appears below):
1- Weighted Majority Vote
2- Naive Bayes Combination
3- ...
I hope that this is what you were looking for.
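As a small illustration of the first of those methods (not taken from the book; the labels and weights below are made up), a weighted majority vote can be implemented in a few lines of Python:

```python
from collections import defaultdict

def weighted_majority_vote(labels, weights):
    """Combine per-classifier label predictions into one decision.

    labels  -- list of predicted labels, one per classifier
    weights -- list of non-negative classifier weights (same order)
    """
    scores = defaultdict(float)
    for label, weight in zip(labels, weights):
        scores[label] += weight
    # The label with the largest total weight wins.
    return max(scores, key=scores.get)

# Hypothetical example: the two lighter classifiers outvote the heavier one.
print(weighted_majority_vote(["similar", "not_similar", "similar"], [0.5, 0.9, 0.6]))
# -> "similar" (0.5 + 0.6 = 1.1 > 0.9)
```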

Get a book on ensemble classification. There has been a lot of work on how to learn a good combination of classifiers, and there are numerous choices. You can of course learn weights and do a weighted average (a sketch of that idea follows below). Or you can use error correcting codes, and so on.
Anyway, read up on "ensemble classification", that is the keyword you need.
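For the original ranking problem, one hedged sketch of "learn weights and do a weighted average" is to treat each pair's per-classifier scores as a feature vector and fit a logistic regression on the available human feedback; the predicted probability then serves as the single combined score, used only for ordering. The data below is made up and scikit-learn is assumed:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: one row per item pair,
# columns = scores from the individual classifiers (TF-IDF cosine, Naive Bayes, ...).
X = np.array([
    [0.9, 0.2, 0.8],
    [0.1, 0.4, 0.3],
    [0.7, 0.9, 0.6],
    [0.2, 0.1, 0.1],
])
# Human feedback: 1 = the pair was judged similar, 0 = not similar.
y = np.array([1, 0, 1, 0])

model = LogisticRegression()
model.fit(X, y)

# The learned coefficients act as per-classifier weights; the predicted
# probability is a single combined score, and ordering by it gives the ranking.
new_pairs = np.array([[0.8, 0.3, 0.7], [0.3, 0.8, 0.2]])
combined = model.predict_proba(new_pairs)[:, 1]
ranking = np.argsort(-combined)          # most similar pairs first
print(combined, ranking)
```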

Related

Creating word embeddings from BERT and feeding them to a random forest for classification

I have used the BERT base pretrained model with 512 dimensions to generate contextual features. Feeding those vectors to a random forest classifier gives 83 percent accuracy, but in various papers I have seen that BERT gives at least 90 percent.
I have some other features too, like word2vec, lexicon, TF-IDF, and punctuation features.
Even when I merged all the features I got 83 percent accuracy. The research paper I am using as a base paper reports an accuracy score of 92 percent, but they used an ensemble-based approach in which they classified through BERT and trained a random forest on the weights.
But I wanted to try something different, so I didn't follow that approach.
My dataset is biased toward positive reviews, so my guess is that the accuracy is lower because the model is also biased toward positive labels, but I am still looking for expert advice.
Code implementation of bert
https://github.com/Awais-mohammad/Sentiment-Analysis/blob/main/Bert_Features.ipynb
Random forest on all features independently
https://github.com/Awais-mohammad/Sentiment-Analysis/blob/main/RandomForestClassifier.ipynb
Random forest on all features jointly
https://github.com/Awais-mohammad/Sentiment-Analysis/blob/main/Merging_Feature.ipynb
Regarding the "no improvements despite adding more features" - some researchers believe that the BERT word embeddings already contain all the available information presented in text, so then it doesn't matter how fancy a classification head you add to it, doesn't matter if it is a linear model that uses the embeddings, or a complicated ML algorithm with a number of other features, they will not provide significant improvements in many tasks. They argue, that since BERT is a context-aware, bidirectional language model - that is trained extensively on MLM and NSP tasks, it already grasps most of the things that additional features for punctuation, word2vec and tfidf could convey. The lexicon could probably help a little in the sentiment task, if it is relevant, but the one or two extra variables, that you likely use to represent it, probably get drowned in all the other features.
Other than that, the accuracy of BERT-based models depends on the dataset used; sometimes the data is simply too diverse to obtain a perfect score, e.g. if there are instances of observations that are very similar but have different class labels. You can see in the BERT papers that the accuracy varies widely by task: for some tasks it is indeed 90+%, but for others, e.g. masked language modeling, where the model needs to choose a particular word from a vocabulary of over 30K words, an accuracy of 20% could be impressive. So in order to obtain a reliable comparison with the BERT papers, you'd need to pick a dataset that they've used and then compare.
Regarding the dataset balance: for deep learning models in general, the rule of thumb is that the training set should be more or less balanced w.r.t. the fraction of data covered by each class label. So if you have 2 labels, it should be roughly 50-50; if 5 labels, each should be at around 20% of the training dataset, etc.
That is because most NNs work in batches, updating the model weights based on the feedback from each batch. So if you have too many examples of one class, the batch updates will be dominated by that one class, effectively worsening the quality of your training.
So, if you want to improve the accuracy of your model, balancing the dataset could be an easy fix. And if you have e.g. 5 ordered classes with differing sizes, you may consider merging some of them (e.g. reviews from 1-2 as bad, 3 as neutral, 4-5 as good) and then rebalancing, if still necessary.
(Unless it's a situation where e.g. 1 class has 80% of data, and 4 classes share the remaining 20%. In such a case you should probably consider some more advanced options, such as partitioning the algo to two parts, one predicting whether or not an instance is in class 1 (so a binary classifier), the other to distinguish between the 4 underrepresented classes. )
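A minimal sketch of the merging-and-rebalancing idea above, assuming pandas >= 1.1 and a hypothetical DataFrame with a 1-5 rating column (column names and values are made up):

```python
import pandas as pd

# Hypothetical review data with a 1-5 star rating column.
df = pd.DataFrame({
    "text": ["awful", "bad", "ok", "good", "great", "great again"],
    "rating": [1, 2, 3, 4, 5, 5],
})

# Merge the five ordered classes into three coarser ones: 1-2 bad, 3 neutral, 4-5 good.
df["label"] = pd.cut(df["rating"], bins=[0, 2, 3, 5], labels=["bad", "neutral", "good"])

# Downsample every class to the size of the smallest one so the training set is balanced.
smallest = df["label"].value_counts().min()
balanced = df.groupby("label", observed=True).sample(n=smallest, random_state=0)
print(balanced["label"].value_counts())
```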

How to estimate the accuracy on a large dataset?

I have a deep learning model (handed over from a former colleague). For some reason, the train/dev set is missing.
In my situation, I want to classify my dataset into 100 categories. The dataset is extremely imbalanced, and its size is in the tens of millions of records.
First of all, I ran the model and got the predictions for the whole dataset.
Then, I sampled 100 records per category (according to the prediction) and got a 10,000-record test set.
Next, I labeled the ground truth of each record in the test set, calculated the precision, recall, and F1 for each category, and got the F1-micro and F1-macro.
How do I estimate the accuracy or other metrics on the whole dataset? Is it correct to use the weighted sum of each category's precision (the weight being the proportion of predictions on the whole) as the estimate?
Since the distribution of the predicted categories is not the same as the distribution of the real categories, I suspect the weighted approach does not work. Can anyone explain this?
The issue with taking a weighted average is that if your classifier performs well on the majority class but poorly on minority classes (the typical scenario), this will not be reflected in the score.
One of the recommended approaches is rather to use the balanced accuracy score (see scikit-learn's balanced_accuracy_score for an implementation). Basically, it is an average of all recall scores: for each class, it looks at how many of its observations were correctly classified, and averages this across all classes. This will give you a sensible overall score to report.
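A minimal sketch of that computation with scikit-learn's balanced_accuracy_score (the label arrays below are placeholders standing in for the hand-labeled 10,000-record sample):

```python
from sklearn.metrics import balanced_accuracy_score, recall_score

# y_true: the hand-labeled ground truth of the sampled test set,
# y_pred: the model's predictions for the same records (placeholders here).
y_true = ["cat_a", "cat_a", "cat_b", "cat_c", "cat_c", "cat_c"]
y_pred = ["cat_a", "cat_b", "cat_b", "cat_c", "cat_c", "cat_a"]

# Balanced accuracy = unweighted mean of per-class recall, so every
# category counts equally regardless of how many records it has.
print(balanced_accuracy_score(y_true, y_pred))
print(recall_score(y_true, y_pred, average="macro"))  # same value, computed explicitly
```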

Balanced corpus for Naive Bayes Classifier

I'm working on sentiment analysis using an NB classifier. I've found some information (blogs, tutorials, etc.) saying that the training corpus should be balanced:
33.3% Positive;
33.3% Neutral;
33.3% Negative.
My question is:
Why should the corpus be balanced? Bayes' theorem is based on the probability of each case. So for training purposes, isn't it important that in the real world, for example, negative tweets are only 10%, not 33.3%?
You are correct, balancing data is important for many discriminative models, but not really for NB.
However, it might still be beneficial to bias the P(y) estimators to get better predictive performance (since, due to the various simplifications models use, the probability assigned to the minority class can be heavily underfitted). For NB it is not about balancing the data but literally modifying the estimated P(y) so that accuracy on the validation set is maximised (see the sketch below).
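A hedged sketch of that idea with scikit-learn's MultinomialNB, whose class_prior argument overrides the priors estimated from the data; the candidate priors and the tiny random data set are made up:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# X_train, y_train, X_val, y_val are assumed to exist (e.g. bag-of-words counts
# and 0/1 sentiment labels); tiny random arrays here just to make it runnable.
rng = np.random.default_rng(0)
X_train, y_train = rng.integers(0, 5, (200, 30)), rng.integers(0, 2, 200)
X_val, y_val = rng.integers(0, 5, (50, 30)), rng.integers(0, 2, 50)

best_prior, best_acc = None, -1.0
for p_neg in [0.1, 0.3, 0.5, 0.7, 0.9]:            # candidate values for P(y=0)
    clf = MultinomialNB(class_prior=[p_neg, 1 - p_neg])
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_val, clf.predict(X_val))
    if acc > best_acc:
        best_prior, best_acc = [p_neg, 1 - p_neg], acc

# Keep the prior that maximises validation accuracy, as described above.
print(best_prior, best_acc)
```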
In my opinion the best dataset for training purposes is a sample of the real-world data that your classifier will be used with.
This is true for all classifiers (though some of them are indeed not suitable for unbalanced training sets, in which case you have no choice but to skew the distribution), but particularly for probabilistic classifiers such as Naive Bayes. So the best sample should reflect the natural class distribution.
Note that this is important not only for the class prior estimates. Naive Bayes will calculate, for each feature, the likelihood of the class given the feature. If your Bayesian classifier is built specifically to classify texts, it will use global document frequency measures (the number of times a given word occurs in the dataset, across all categories). If the number of documents per category in the training set doesn't reflect their natural distribution, the global term frequency of terms usually seen in infrequent categories will be overestimated, and that of frequent categories underestimated. Thus not only will the prior class probabilities be incorrect, but so will all the P(category=c|term=t) estimates.

Machine learning, emphasize certain observations?

I have a multi-class machine learning problem for which I will try different methods on such as logistic regression, decision trees, multilayer perceptron etc.
The observations in the data set have an attribute which is an index from 1-5 which defines how important it is that a certain observation gets correctly classified (index 1 very important, 5 not important at all). My questions are:
Question 1: How should I emphasize to the models that the lower-index observations have greater importance? I am thinking of duplicating these observations so the models fit the lower-index observations better; what other approaches are possible?
Question 2: What performance evaluation criteria can I use to find the models that predict these low-index observations well (apart from calculating the distribution of indexes among the correctly predicted instances)?
Answer 1: Presenting the important patterns of the training set more often is the standard approach for this. If your training algorithm has something like a learning rate (for example if you use backpropagation), you could also increase this parameter for the high-priority patterns.
Answer 2: I would use a weighted mean square error and give the errors of the high priority patterns a larger weight.
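A minimal sketch of both answers in scikit-learn terms: the 1-5 importance index is mapped to a sample weight, passed to fit() so that important observations count more during training, and reused in the evaluation metric. The mapping and data are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy data: X features, y class labels, importance index 1 (critical) .. 5 (unimportant).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = rng.integers(0, 3, 300)
importance = rng.integers(1, 6, 300)

# Hypothetical mapping: index 1 counts five times as much as index 5.
weight = {1: 5.0, 2: 4.0, 3: 3.0, 4: 2.0, 5: 1.0}
w = np.array([weight[i] for i in importance])

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y, sample_weight=w)                    # Answer 1: emphasize important observations

pred = clf.predict(X)
print(accuracy_score(y, pred, sample_weight=w))   # Answer 2: importance-weighted accuracy
```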

Ways to improve the accuracy of a Naive Bayes Classifier?

I am using a Naive Bayes Classifier to categorize several thousand documents into 30 different categories. I have implemented a Naive Bayes Classifier, and with some feature selection (mostly filtering useless words), I've gotten about a 30% test accuracy, with 45% training accuracy. This is significantly better than random, but I want it to be better.
I've tried implementing AdaBoost with NB, but it does not appear to give appreciably better results (the literature seems split on this: some papers say AdaBoost with NB doesn't give better results, others report that it does). Do you know of any other extensions to NB that may give better accuracy?
In my experience, properly trained Naive Bayes classifiers are usually astonishingly accurate (and very fast to train--noticeably faster than any classifier-builder I have ever used).
So when you want to improve classifier prediction, you can look in several places:
tune your classifier (adjust the classifier's tunable parameters);
apply some sort of classifier combination technique (e.g., ensembling, boosting, bagging); or
look at the data fed to the classifier--either add more data, improve your basic parsing, or refine the features you select from the data.
W/r/t naive Bayesian classifiers, parameter tuning is limited; I recommend focusing on your data--i.e., the quality of your pre-processing and the feature selection.
I. Data Parsing (pre-processing)
I assume your raw data is something like a string of raw text for each data point, which you transform, through a series of processing steps, into a structured vector (1D array) such that each offset corresponds to one feature (usually a word) and the value at that offset corresponds to its frequency.
stemming: either manually or by using a stemming library? The popular open-source ones are Porter, Lancaster, and Snowball. So for instance, if you have the terms programmer, program, programming, programmed in a given data point, a stemmer will reduce them to a single stem (probably program), so your term vector for that data point will have a value of 4 for the feature program, which is probably what you want (see the sketch at the end of this section).
synonym finding: same idea as stemming--fold related words into a single word; a synonym finder can identify developer, programmer, coder, and software engineer and roll them into a single term.
neutral words: words with similar frequencies across classes make poor features
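A minimal sketch of the stemming step using NLTK's Porter stemmer (NLTK is assumed to be installed; the token list is made up, and the exact stems depend on the stemmer):

```python
from collections import Counter
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
tokens = ["programmer", "program", "programming", "programmed"]

# The Porter stemmer folds inflected forms toward a common stem, so these
# tokens mostly collapse onto the same feature in the term vector.
stems = [stemmer.stem(t) for t in tokens]
print(stems)

# Counting stems instead of raw tokens gives fewer, denser features.
print(Counter(stems))
```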
II. Feature Selection
consider a prototypical use case for NBCs: filtering spam; you can quickly see how it fails and just as quickly you can see how to improve it. For instance, above-average spam filters have nuanced features like: frequency of words in all caps, frequency of words in title, and the occurrence of exclamation point in the title. In addition, the best features are often not single words but e.g., pairs of words, or larger word groups.
III. Specific Classifier Optimizations
Instead of 30 classes, use a 'one-against-many' scheme--in other words, you begin with a two-class classifier (Class A and 'all else'), then the results in the 'all else' class are returned to the algorithm for classification into Class B and 'all else', and so on (a rough sketch follows at the end of this answer).
The Fisher Method (probably the most common way to optimize a Naive Bayes classifier). I think of Fisher as normalizing (more correctly, standardizing) the input probabilities. An NBC uses the feature probabilities to construct a 'whole-document' probability. The Fisher Method calculates the probability of a category for each feature of the document, then combines these feature probabilities and compares that combined probability with the probability of a random set of features.
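The cascading one-against-many scheme described above would need to be coded by hand; as a rough approximation, scikit-learn's OneVsRestClassifier wrapped around a Naive Bayes model shows the general shape (the count data is a placeholder):

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB

# Placeholder document-term counts for a 30-category problem.
rng = np.random.default_rng(0)
X = rng.integers(0, 4, (600, 100))
y = rng.integers(0, 30, 600)

# One binary NB per category ("this category" vs. "all else"); at prediction
# time the wrapper picks the category whose binary classifier is most confident.
ovr = OneVsRestClassifier(MultinomialNB(alpha=1.0))
ovr.fit(X, y)
print(ovr.predict(X[:5]))
```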
I would suggest using an SGDClassifier as in this and tuning it in terms of regularization strength.
Also try to tune the formula in TF-IDF you're using by tuning the parameters of TfidfVectorizer.
I usually see that for text classification problems SVM or Logistic Regression trained one-versus-all outperforms NB. As you can see in this nice article by Stanford people, for longer documents SVM outperforms NB. The code for the paper, which uses a combination of SVM and NB (NBSVM), is here.
Second, tune your TFIDF formula (e.g. sublinear tf, smooth_idf).
Normalize your samples with l2 or l1 normalization (the default in TfidfVectorizer), because it compensates for different document lengths.
A Multilayer Perceptron usually gets better results than NB or SVM because of the non-linearity it introduces, which is inherent to many text classification problems. I have implemented a highly parallel one using Theano/Lasagne which is easy to use and downloadable here.
Try to tune your l1/l2/elasticnet regularization. It makes a huge difference in SGDClassifier/SVM/Logistic Regression.
Try to use n-grams, which are configurable in TfidfVectorizer (see the pipeline sketch after this list).
If your documents have structure (e.g. have titles) consider using different features for different parts. For example add title_word1 to your document if word1 happens in the title of the document.
Consider using the length of the document as a feature (e.g. number of words or characters).
Consider using meta information about the document (e.g. time of creation, author name, url of the document, etc.).
Recently Facebook published their FastText classification code, which performs very well across many tasks; be sure to try it.
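A minimal sketch tying several of the tips above together (sublinear TF, n-grams, l2 normalization, and an SGDClassifier whose elasticnet regularization is tuned by grid search); the tiny corpus and parameter grid are placeholders:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

docs = ["cheap pills buy now", "meeting agenda attached",
        "win money fast", "quarterly report draft"]
labels = ["spam", "ham", "spam", "ham"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True,      # sublinear tf
                              ngram_range=(1, 2),     # unigrams + bigrams
                              norm="l2")),            # length normalization
    ("clf", SGDClassifier(penalty="elasticnet", random_state=0)),
])

# Tune regularization strength and the l1/l2 mix.
grid = GridSearchCV(
    pipeline,
    param_grid={"clf__alpha": [1e-5, 1e-4, 1e-3], "clf__l1_ratio": [0.05, 0.15, 0.5]},
    cv=2,
)
grid.fit(docs, labels)
print(grid.best_params_, grid.best_score_)
```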
Using Laplacian Correction along with AdaBoost.
In AdaBoost, a weight is first assigned to each data tuple in the training dataset. The initial weights are set using the init_weights method, which initializes each weight to 1/d, where d is the size of the training data set.
Then, a generate_classifiers method is called, which runs k times, creating k instances of the Naïve Bayes classifier. These classifiers are then weighted, and the test data is run on each classifier. The sum of the weighted "votes" of the classifiers constitutes the final classification.
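A hedged sketch of that setup in scikit-learn, with MultinomialNB as the boosted base estimator and alpha=1.0 as the Laplace correction; note the keyword is estimator in scikit-learn >= 1.2 (base_estimator in older versions), and the count data is a placeholder:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import MultinomialNB

# Placeholder non-negative count features (MultinomialNB expects counts/TF-IDF).
rng = np.random.default_rng(0)
X = rng.integers(0, 5, (400, 50))
y = rng.integers(0, 3, 400)

# k Naive Bayes instances are trained on reweighted data; their weighted votes
# form the final classification, as described above.
boosted = AdaBoostClassifier(
    estimator=MultinomialNB(alpha=1.0),   # alpha=1.0 is the Laplace correction
    n_estimators=25,
)
boosted.fit(X, y)
print(boosted.predict(X[:5]))
```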
Improving the Naive Bayes classifier for general cases
Take the logarithm of your probabilities as input features
We change the probability space to log-probability space: since we calculate the probability by multiplying probabilities, the result can become very small, and switching to log-probability features tackles this underflow problem.
Remove correlated features.
Naive Bayes works based on the assumption of independence; when there is correlation between features, meaning one feature depends on others, this assumption fails.
Work with enough data, not huge data
Naive Bayes requires less data than logistic regression, since it only needs data to understand the probabilistic relationship of each attribute in isolation with the output variable, not the interactions.
Check for the zero-frequency error
If the test data set has a zero-frequency issue, apply a smoothing technique such as Laplace correction to predict the class of the test data set (see the sketch at the end of this list).
More on these points is well described in posts on the machinelearningmastery and Analytics Vidhya sites.
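A small sketch of two of the points above in scikit-learn terms: alpha in MultinomialNB is the Laplace correction, and predict_log_proba works in log space, which sidesteps the underflow issue (the data is a placeholder):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Placeholder word-count features and binary labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, (100, 20))
y = rng.integers(0, 2, 100)

# alpha=1.0 is the Laplace correction: every (word, class) count starts at 1,
# so a word unseen for a class in training never yields a zero probability.
clf = MultinomialNB(alpha=1.0)
clf.fit(X, y)

# Working in log space avoids multiplying many tiny probabilities together
# (numerical underflow); the class ranking is unchanged.
print(clf.predict_log_proba(X[:3]))
print(clf.predict_proba(X[:3]))
```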
Keeping the n size small also helps NB give high-accuracy results; at its core, as the n size increases, its accuracy degrades.
Select features that have low correlation between them, and try using different combinations of features.
