Methods to ignore missing word features on test data - machine-learning

I'm working on a text classification problem, and I have problems with missing values on some features.
I'm calculating class probabilities of words from labeled training data.
For example:
Suppose the word foo belongs to class A 100 times and to class B 200 times. In this case, I compute the class probability vector [0.33, 0.67] and give it, along with the word itself, to the classifier.
The problem is that the test set contains some words that were not seen in the training data, so they have no probability vectors.
What can I do about this?
I've tried using the average class probability vector of all words for the missing values, but it did not improve accuracy.
Is there a way to make the classifier ignore some features during evaluation, just for the specific instances that have no value for a given feature?
Regards

There are many ways to achieve that.
Create and train classifiers for every subset of the features you have. You can train each subset classifier on the same data used to train the main classifier.
For each sample, just look at the features it has and use the classifier that fits it best. Don't try to do boosting with those classifiers.
Just create a special class for samples that can't be classified, or for which your experiments gave results that were too poor with so few features.
Sometimes humans can't successfully classify samples either. In many cases, samples that can't be classified should just be ignored. The problem is then not in the classifier but in the input, or it can be explained by the context.
From an NLP point of view, many words have a meaning or usage that is very similar across many applications. So you can use stemming/lemmatization to create classes of words.
You can also use syntactic corrections, synonyms, translations (does the word come from another part of the world?).
If this problem is important enough to you, you will end up with a combination of the three previous points.
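As a rough sketch of the stemming idea combined with the "cannot classify" fallback: the class probability vectors are built per stem instead of per word, so an unseen test word can still borrow the vector of a known stem. The `toy_stem` function is a hypothetical stand-in for a real stemmer, and all names here are made up for illustration.

```python
from collections import Counter, defaultdict

def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def class_vectors(labeled_words, classes):
    # Count how often each stem appears with each class, then normalize.
    counts = defaultdict(Counter)
    for word, label in labeled_words:
        counts[toy_stem(word)][label] += 1
    return {
        stem: [c[k] / sum(c.values()) for k in classes]
        for stem, c in counts.items()
    }

def lookup(word, vectors):
    # Returns None when even the stem is unseen, so the caller can route
    # the sample to a "cannot classify" bucket instead of guessing.
    return vectors.get(toy_stem(word))

train = [("foo", "A")] * 100 + [("foo", "B")] * 200 + [("jumped", "A")]
vectors = class_vectors(train, classes=["A", "B"])
```

Here `lookup("jumping", vectors)` finds the vector learned from "jumped", while a word like "bar" still comes back as `None` and has to be handled separately.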

Related

Is it possible to have a class feature with several values?

I have a dataset in which the class has several values. For example, a dataset of face recognition where the class could be a tuple (man, old, Chinese).
Is it possible to have such data, if yes what ML classifier should I use?
I believe this question should be moved to another platform, such as https://datascience.stackexchange.com/
What you are asking for is called multi-label classification.
In multiple label classification tasks, the model is trained to provide the probabilities or likelihood of more than one label for a given sample.
You can either use multi-label classification, or you can use multiple binary classifiers, one for each attribute: for example, one binary classifier for predicting Man or Woman, another for Old or Young, and so on. But you must be careful that your labels are semantically mutually exclusive. I mean that if you have labels like "sky" and "outdoor", the binary classification might be noisy if your labels are not carefully made; e.g. if a sample has the "sky" label but no "outdoor" label, that will introduce noise during training.

Is there a way to find the most representative set of samples of the entire dataset?

I'm working on text classification and I have a set of 200.000 tweets.
The idea is to manually label a short set of tweets and train classifiers to predict the labels of the rest. Supervised learning.
What I would like to know is whether there is a method for choosing which samples to include in the training set so that it is a good representation of the whole data set and, because of the high diversity it contains, the trained classifiers can be applied to the rest of the tweets with considerable confidence.
This sounds like a stratification question - do you have pre-existing labels or do you plan to design the labels based on the sample you're constructing?
If it's the first scenario, I think the steps in order of importance would be:
Stratify by target class proportions (so if you have three classes, and they are 50-30-20%, train/dev/test should follow the same proportions)
Stratify by features you plan to use
Stratify by tweet length/vocabulary etc.
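A minimal sketch of the first step (stratifying by target class proportions), assuming the labels already exist. The function name `stratified_sample` and the toy 50-30-20 data are made up for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(items, labels, fraction, seed=0):
    # Group items by label, then draw the same fraction from every group.
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)
    sample = []
    for label, group in by_label.items():
        k = max(1, round(fraction * len(group)))  # never drop a class entirely
        sample.extend((x, label) for x in rng.sample(group, k))
    return sample

labels = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
items = list(range(100))
picked = stratified_sample(items, labels, fraction=0.1)
```

With a 10% fraction the sample keeps the 50-30-20 proportions (5, 3 and 2 items). The same function can stratify by any other key, such as a tweet-length bucket, by passing that key in place of the labels.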
If it's the second scenario, and you don't have labels yet, you may want to look into using n-grams as a feature, coupled with a dimensionality reduction or clustering approach. For example:
Use something like PCA or t-SNE to maximize distance between tweets (or a large subset), then pick candidates from different regions of the projected space
Cluster them based on lexical items (unigrams or bigrams, possibly using log frequencies or TF-IDF and stop word filtering, if content words are what you're looking for) - then you can cut the tree at a height that gives you n bins, which you can then use as a source for samples (stratify by branch)
Use something like LDA to find n topics, then sample stratified by topic
Hope this helps!
It seems that before you know anything about the classes you are going to label, a simple uniform random sample will do almost as well as any stratified sample - because you don't know in advance what to stratify on.
After labelling this first sample and building the first classifier, you can start so-called active learning: make predictions for the unlabelled dataset, and sample some tweets for which your classifier is least confident. Label them, retrain the classifier, and repeat.
Using this approach, I managed to create a good training set after several (~5) iterations, with ~100 texts in each iteration.
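The selection step of that loop can be sketched as follows, assuming the classifier returns one list of class probabilities per unlabelled tweet. The function name `least_confident` is illustrative, and "least confident" is taken here to mean the lowest top-class probability.

```python
def least_confident(probabilities, k):
    # probabilities: one list of class probabilities per unlabelled sample.
    # Confidence is the probability of the most likely class.
    confidence = [max(p) for p in probabilities]
    order = sorted(range(len(probabilities)), key=lambda i: confidence[i])
    return order[:k]  # indices of the k samples to label next

predicted = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8], [0.5, 0.5]]
to_label = least_confident(predicted, k=2)
```

In this toy case the samples at indices 3 and 1 are picked, because the classifier is closest to a coin flip on them.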

Data augmentation for text classification

What is the current state-of-the-art data augmentation technique for text classification?
I did some research online on how I can extend my training set by applying data transformations, the same way we do for image classification.
I found some interesting ideas such as:
Synonym Replacement: Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
Random Insertion: Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random place in the sentence. Do this n times.
Random Swap: Randomly choose two words in the sentence and swap their positions. Do this n times.
Random Deletion: Randomly remove each word in the sentence with probability p.
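Two of these operations need no external resources, so they can be sketched with the standard library alone (the synonym-based ones would need a thesaurus, which is left out here). The function names are illustrative, and keeping at least one word in `random_deletion` is an assumption added so the sentence never becomes empty.

```python
import random

def random_swap(words, n, rng):
    # Swap the positions of two randomly chosen words, n times.
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p, rng):
    # Remove each word independently with probability p.
    kept = [w for w in words if rng.random() >= p]
    # Assumption: keep at least one word so the sentence never becomes empty.
    return kept if kept else [rng.choice(words)]

rng = random.Random(0)
sentence = "the quick brown fox jumps".split()
swapped = random_swap(sentence, n=2, rng=rng)
shortened = random_deletion(sentence, p=0.3, rng=rng)
```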
But I found nothing about using a pre-trained word vector representation model such as word2vec. Is there a reason?
Data augmentation using word2vec might help the model get more data based on external information. For instance, randomly replacing a token of a toxic comment with its closest token in a pre-trained vector space trained specifically on external online comments.
Is it a good method, or am I missing some important drawbacks of this technique?
Your idea of using word2vec embeddings usually helps. However, that is a context-free embedding. To go one step further, the state of the art (SOTA) as of today (2019-02) is to use a language model trained on a large corpus of text and fine-tune your own classifier with your own training data.
The two SOTA models are:
GPT-2 https://github.com/openai/gpt-2
BERT https://github.com/google-research/bert
The data augmentation methods you mentioned might also help (depending on your domain and the number of training examples you have). Some of them are actually used in language model training (for example, in BERT there is one task that randomly masks out words in a sentence at pre-training time). If I were you, I would first adopt a pre-trained model and fine-tune your own classifier with your current training data. Taking that as a baseline, you could try each of the data augmentation methods you like and see if they really help.

multi-label text classification with zero or more labels

I need to classify website text with zero or more categories/labels (5 labels such as finance, tech, etc). My problem is handling text that isn't one of these labels.
I tried ML libraries (maxent, naive bayes), but they match "other" text incorrectly with one of the labels. How do I train a model to handle the "other" text? The "other" label is so broad and it's not possible to pick a representative sample.
Since I have no ML background and don't have much time to build a good training set, I'd prefer a simpler approach like a term frequency count, using a predefined list of terms to match for each label. But with the counts, how do I determine a relevancy score, i.e. if the text is actually that label? I don't have a corpus and can't use tf-idf, etc.
Another idea is to use a neural network with a softmax output function. Softmax gives you a probability for every class: when the network is very confident about a class, it gives that class a high probability and the other classes lower probabilities, but when it is unsure, the differences between the probabilities are small and none of them is very high. You could define a threshold such as: if the probability of every class is less than 70%, predict "other".
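A minimal sketch of that thresholding rule, assuming the network exposes one raw score per class. The function names and the five label names are made up for illustration; the 0.7 cut-off is the 70% from the suggestion above.

```python
import math

def softmax(scores):
    # Subtract the maximum first for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_other(scores, labels, threshold=0.7):
    # Return the top label only if its probability reaches the threshold.
    probs = softmax(scores)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best] if probs[best] >= threshold else "other"

labels = ["finance", "tech", "sports", "health", "travel"]
```

With scores `[5, 0, 0, 0, 0]` the first class gets about 97% and wins; with five equal scores every class gets 20% and the result is "other".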
Whew! Classic ML algorithms don't combine both multi-classification and "in/out" at the same time. Perhaps what you could do would be to train five models, one for each class, with a one-against-the-world training. Then use an uber-model to look for any of those five claiming the input; if none claim it, it's "other".
Another possibility is to reverse the order of evaluation: train one model as a binary classifier on your entire data set. Train a second one as a 5-class SVM (for instance) within those five. The first model finds "other"; everything else gets passed to the second.
What about creating histograms? You could use a bag-of-words approach based on significant indicators for, e.g., Tech and Finance. You could try to identify such indicators by analyzing a given website's tags and articles, or just browse the web for such indicators:
http://finance.yahoo.com/news/most-common-words-tech-finance-205911943.html
Let's say your input vector X has n dimensions, where n is the number of indicators. For example, Xi then holds the count of occurrences of the word "asset" and Xi+k the count of the word "big data" in the current article.
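Building such a count vector could look like the sketch below. The indicator list and the article text are invented for illustration, and the indicators are assumed to be lowercase; matching is on whole words or phrases only, so "assets" would not count as "asset" unless you stem first.

```python
import re

def indicator_counts(text, indicators):
    # One dimension per indicator term; multi-word terms are allowed.
    lowered = text.lower()
    return [
        len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
        for term in indicators
    ]

indicators = ["asset", "big data", "dividend"]
article = "Big data helps asset managers; every asset class now uses big data."
x = indicator_counts(article, indicators)
```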
Instead of defining 5 labels, define 6. Your last category would be something like a "catch-all" category. That's actually your zero-match category.
If you must match zero or more categories, train a model that returns a probability score per label/class (such as a neural net, as Luis Leal suggested). You could then rate your output by that score and say that every class with a score higher than some threshold t is a matching category.
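The thresholding itself is tiny. In this sketch the scores are assumed to be independent per-label values in [0, 1] (for example one sigmoid output per label); with a single softmax they would compete and sum to 1, which makes "more than one label" harder to express. The function name and the example scores are made up.

```python
def matching_labels(scores, threshold):
    # scores: mapping from label to an independent score in [0, 1].
    # Returns every label above the threshold, highest score first.
    return sorted(
        (label for label, s in scores.items() if s > threshold),
        key=lambda label: -scores[label],
    )

scores = {"finance": 0.82, "tech": 0.64, "sports": 0.05}
```

An empty result is the zero-match case: `matching_labels(scores, 0.9)` returns no labels at all.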
Try this NBayes implementation.
For identifying the "Other" category, don't bother much. Just train on your required categories with data that clearly identifies them, and introduce a threshold in the classifier.
If none of the label scores crosses the threshold, the classifier assigns the "Other" label.
It's all in the training data.
AWS Elasticsearch percolate would be ideal, but we can't use it due to the HTTP overhead of percolating documents individually.
Classifier4J appears to be the best solution for our needs because the model looks easy to train and it doesn't require training on non-matches.
http://classifier4j.sourceforge.net/usage.html

Ways to improve the accuracy of a Naive Bayes Classifier?

I am using a Naive Bayes Classifier to categorize several thousand documents into 30 different categories. I have implemented a Naive Bayes Classifier, and with some feature selection (mostly filtering useless words), I've gotten about a 30% test accuracy, with 45% training accuracy. This is significantly better than random, but I want it to be better.
I've tried implementing AdaBoost with NB, but it does not appear to give appreciably better results (the literature seems split on this, some papers say AdaBoost with NB doesn't give better results, others do). Do you know of any other extensions to NB that may possibly give better accuracy?
In my experience, properly trained Naive Bayes classifiers are usually astonishingly accurate (and very fast to train, noticeably faster than any classifier-builder I have ever used).
So when you want to improve classifier prediction, you can look in several places:
tune your classifier (adjusting the classifier's tunable parameters);
apply some sort of classifier combination technique (e.g., ensembling, boosting, bagging); or
look at the data fed to the classifier: either add more data, improve your basic parsing, or refine the features you select from the data.
w/r/t naive Bayesian classifiers, parameter tuning is limited; I recommend focusing on your data, i.e., the quality of your pre-processing and the feature selection.
I. Data Parsing (pre-processing)
I assume your raw data is something like a string of raw text for each data point, which you transform through a series of processing steps into a structured vector (1D array), such that each offset corresponds to one feature (usually a word) and the value at that offset corresponds to frequency.
stemming: either manually or by using a stemming library. The popular open-source ones are Porter, Lancaster, and Snowball. So, for instance, if you have the terms programmer, program, programming, programmed in a given data point, a stemmer will reduce them to a single stem (probably program), so your term vector for that data point will have a value of 4 for the feature program, which is probably what you want.
synonym finding: same idea as stemming, fold related words into a single word; so a synonym finder can identify developer, programmer, coder, and software engineer and roll them into a single term.
neutral words: words with similar frequencies across classes make poor features.
II. Feature Selection
consider a prototypical use case for NBCs: filtering spam; you can quickly see how it fails and just as quickly you can see how to improve it. For instance, above-average spam filters have nuanced features like: frequency of words in all caps, frequency of words in title, and the occurrence of exclamation point in the title. In addition, the best features are often not single words but e.g., pairs of words, or larger word groups.
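To make those nuanced features concrete, here is a small sketch that extracts three of them from a message title: the share of all-caps words, whether an exclamation point occurs, and word pairs (bigrams). The function and key names are illustrative only.

```python
def title_features(title):
    words = title.split()
    # All-caps words of at least two characters (a lone "I" or "A" is ignored).
    caps = [w for w in words if len(w) > 1 and w.isupper()]
    lowered = [w.strip("!?.,").lower() for w in words]
    return {
        "caps_ratio": len(caps) / len(words) if words else 0.0,
        "has_exclamation": "!" in title,
        "bigrams": list(zip(lowered, lowered[1:])),
    }

features = title_features("WIN a FREE prize now!")
```

Each bigram such as `("free", "prize")` can then be counted as a feature in exactly the same way as a single word.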
III. Specific Classifier Optimizations
Instead of 30 classes use a 'one-against-many' scheme--in other words, you begin with a two-class classifier (Class A and 'all else') then the results in the 'all else' class are returned to the algorithm for classification into Class B and 'all else', etc.
The Fisher Method (probably the most common way to optimize a Naive Bayes classifier). I think of Fisher as normalizing (more correctly, standardizing) the input probabilities. An NBC uses the feature probabilities to construct a 'whole-document' probability. The Fisher Method calculates the probability of a category for each feature of the document, then combines these feature probabilities and compares that combined probability with the probability of a random set of features.
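One common way to write that combination step is Fisher's method for combining probabilities: take -2 times the sum of the log feature probabilities and look the result up in a chi-square distribution with twice as many degrees of freedom as there are features. The sketch below assumes this is the formulation meant here; the function names are made up, and every input probability must be greater than zero.

```python
import math

def inv_chi2(chi, df):
    # Survival function of the chi-square distribution, valid for even df.
    m = chi / 2.0
    term = total = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_score(feature_probs):
    # feature_probs: P(category | feature) for each feature in the document.
    chi = -2.0 * sum(math.log(p) for p in feature_probs)
    return inv_chi2(chi, 2 * len(feature_probs))
```

A document whose features all point strongly to the category scores close to 1, while a set of features with low category probabilities scores close to 0, which gives a natural place to put a per-category cut-off.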
I would suggest using a SGDClassifier as in this and tune it in terms of regularization strength.
Also try to tune the TF-IDF formula you're using by tuning the parameters of TfidfVectorizer.
I usually see that for text classification problems, SVM or Logistic Regression trained one-versus-all outperforms NB. As you can see in this nice article by Stanford people, for longer documents SVM outperforms NB. The code for the paper, which uses a combination of SVM and NB (NBSVM), is here.
Second, tune your TFIDF formula (e.g. sublinear tf, smooth_idf).
Normalize your samples with l2 or l1 normalization (l2 is the default in TfidfVectorizer) because it compensates for different document lengths.
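To show what those knobs change, here is a standard-library sketch of the weighting. It follows the formulas as I understand scikit-learn documents them (sublinear tf replaces tf with 1 + ln(tf), smooth idf is ln((1 + n) / (1 + df)) + 1, followed by l2 normalization), but it is an illustration, not the library's implementation. The toy documents are made up.

```python
import math
from collections import Counter

def tfidf(docs, sublinear_tf=True, smooth_idf=True):
    # docs: list of token lists. Returns one {term: weight} dict per document.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        weights = {}
        for term, tf in Counter(doc).items():
            tf_w = 1.0 + math.log(tf) if sublinear_tf else float(tf)
            if smooth_idf:
                idf = math.log((1 + n) / (1 + df[term])) + 1.0
            else:
                idf = math.log(n / df[term]) + 1.0
            weights[term] = tf_w * idf
        # l2 normalization compensates for different document lengths.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

docs = [["cheap", "cheap", "pills"], ["meeting", "notes", "pills"]]
vectors = tfidf(docs)
```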
A multilayer perceptron usually gets better results than NB or SVM because it introduces non-linearity, which is inherent to many text classification problems. I have implemented a highly parallel one using Theano/Lasagne which is easy to use and downloadable here.
Try to tune your l1/l2/elasticnet regularization. It makes a huge difference in SGDClassifier/SVM/Logistic Regression.
Try using n-grams, which are configurable in TfidfVectorizer.
If your documents have structure (e.g. have titles) consider using different features for different parts. For example add title_word1 to your document if word1 happens in the title of the document.
Consider using the length of the document as a feature (e.g. number of words or characters).
Consider using meta information about the document (e.g. time of creation, author name, url of the document, etc.).
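The title-prefix and length ideas can be sketched in a few lines. The `title_` prefix comes from the suggestion above; the function names and the sample text are made up.

```python
def document_tokens(title, body):
    # Title words get their own prefixed tokens, so a word in the title is
    # a different feature from the same word in the body.
    body_words = body.lower().split()
    title_words = ["title_" + w for w in title.lower().split()]
    return body_words + title_words

def length_features(body):
    return {"n_words": len(body.split()), "n_chars": len(body)}

tokens = document_tokens("Market update", "Stocks rose as the market rallied")
```

The resulting token list contains both `market` (from the body) and `title_market`, and can be fed to whatever vectorizer you already use.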
Recently Facebook published their FastText classification code which performs very well across many tasks, be sure to try it.
Using Laplacian Correction along with AdaBoost.
In AdaBoost, first a weight is assigned to each data tuple in the training dataset. The initial weights are set using the init_weights method, which initializes each weight to be 1/d, where d is the size of the training data set.
Then, a generate_classifiers method is called, which runs k times, creating k instances of the Naïve Bayes classifier. These classifiers are then weighted, and the test data is run on each classifier. The sum of the weighted "votes" of the classifiers constitutes the final classification.
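A rough sketch of the two ends of that procedure: the initial tuple weights and the final weighted vote. Only `init_weights` is named above; `classifier_weight` and `weighted_vote` are hypothetical helpers, and the vote weight log((1 - error) / error) is one textbook choice that I am assuming here, with each classifier's error strictly between 0 and 1.

```python
import math
from collections import defaultdict

def init_weights(d):
    # Every training tuple starts with the same weight, 1/d.
    return [1.0 / d] * d

def classifier_weight(error):
    # Assumed vote weight: the lower the error, the louder the vote.
    return math.log((1.0 - error) / error)

def weighted_vote(predictions, errors):
    # predictions[i] is the label chosen by classifier i for one test tuple.
    totals = defaultdict(float)
    for label, error in zip(predictions, errors):
        totals[label] += classifier_weight(error)
    return max(totals, key=totals.get)
```

Note that a single accurate classifier can outvote two weak ones: with errors 0.1, 0.4 and 0.4, the first classifier's label wins even though the other two agree with each other.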
Improving a Naive Bayes classifier in general cases
Take the logarithm of your probabilities as input features
We move from probability space to log-probability space because the class probability is calculated by multiplying many probabilities, and the result becomes very small. By working with log probabilities, we avoid the underflow problem.
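The underflow is easy to reproduce. In this small demonstration (the numbers are arbitrary), multiplying one hundred probabilities of 1e-5 collapses to zero in floating point, while the sum of their logarithms stays perfectly usable for comparing classes.

```python
import math

probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p  # the true value, 1e-500, is too small for a float

# Summing logs keeps the same information without underflow.
log_sum = sum(math.log(p) for p in probs)
```

Because the logarithm is monotonic, the class with the largest log-sum is also the class with the largest product.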
Remove correlated features.
Naive Bayes works on the assumption of independence. When features are correlated, meaning one feature depends on others, that assumption fails.
More about correlation can be found here
Work with enough data, not huge data
Naive Bayes requires less data than logistic regression, since it only needs data to understand the probabilistic relationship of each attribute in isolation with the output variable, not the interactions.
Check zero frequency error
If the test data set has a zero-frequency issue, apply a smoothing technique such as the "Laplace Correction" to predict the class of the test data set.
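A minimal sketch of the Laplace correction for word likelihoods: add a constant alpha to every count, so a word never seen with a class gets a small non-zero probability instead of zeroing out the whole product. The function name, the counts and the vocabulary size are made up for illustration.

```python
def smoothed_likelihood(word, class_counts, vocab_size, alpha=1.0):
    # P(word | class) with add-alpha smoothing over a fixed vocabulary.
    total = sum(class_counts.values())
    return (class_counts.get(word, 0) + alpha) / (total + alpha * vocab_size)

counts_a = {"foo": 100, "bar": 50}
vocab_size = 3  # "foo", "bar", plus one word never seen in class A
```

Here `smoothed_likelihood("baz", counts_a, vocab_size)` is 1/153 instead of 0, and the three smoothed likelihoods still sum to 1.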
More on this is described in the following posts:
machinelearningmastery site post
Analyticvidhya site post
Keeping the n size small also helps NB give highly accurate results; at its core, as the n size increases, its accuracy degrades.
Select features which have less correlation between them. And try using different combination of features at a time.

Resources