Classifier or heuristics?

I need to classify questions that ask the user to specify a brand.
I have a set of samples featuring the word "brand".
Positives like:
"What is your favorite cosmetic brand?",
"Which fragrance brand (if any) do you think this advert is for?"...
and negatives like:
"Is there any particular reason why you chose this brand?"
Of course, it's possible to train a 2-class classifier on concrete samples. However, precision and recall will be poor. Is there any way to construct something with good precision based on a variety of positive samples?

Precision and recall do not have to be poor. You should try to build a binary classifier (I would recommend an SVM or a decision tree for this purpose). I would recommend extracting features such as the number of occurrences of each word in a sample (or tf-idf), or the length of the words and sentences. I suspect that the question word in the sentence will have a major impact on the classification.
In addition, please note that a good precision value is very easy to get when you do not care about recall.
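For concreteness, here is a minimal sketch of such a pipeline, assuming scikit-learn is available; the example questions are taken from the question above, and the labels are only an assumption about how they would be tagged:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Tiny illustrative training set; 1 = asks the user to specify a brand, 0 = does not.
questions = [
    "What is your favorite cosmetic brand?",
    "Which fragrance brand (if any) do you think this advert is for?",
    "Is there any particular reason why you chose this brand?",
]
labels = [1, 1, 0]

# tf-idf over unigrams and bigrams, fed into a linear SVM
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(questions, labels)

print(model.predict(["Which shampoo brand do you usually buy?"]))
```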

Choosing a set of words as features using tf-idf and training a tree algorithm seems the easiest way to go, but I would also suggest trying k-means clustering in case one or more categories of answers considered "neutral" emerge. This may help you decide which of these you consider positive or negative, so you can refactor your feature vector and subsequently your algorithm.
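A rough sketch of that clustering step, again assuming scikit-learn; the number of clusters and the fourth sample question are arbitrary choices for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

questions = [
    "What is your favorite cosmetic brand?",
    "Which fragrance brand (if any) do you think this advert is for?",
    "Is there any particular reason why you chose this brand?",
    "How often do you switch brands?",  # hypothetical extra sample
]

X = TfidfVectorizer().fit_transform(questions)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Inspect the clusters for a possible "neutral" group of questions.
for cluster, text in zip(kmeans.labels_, questions):
    print(cluster, text)
```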
I am also a huge fan of HMM variants (I have used them to perform energy disaggregation) and I suggest you have a look at the following. It might give you some extra ideas:
http://www.merl.com/publications/docs/TR2004-085.pdf

Computing a similarity score for a set of sentences

My team does a lot of chatbot training, and I'm trying to come up with some tools to improve the quality of our work. In chatbot training, it is really important to train intents with diverse utterances that phrase the same intent in very different ways. Ideally, there would be very little similarity in the syntax of the utterances in the set.
Here's an example for an intent inquiring about medical insurance coverage
Bad set of utterances
Is my daughter covered by insurance?
Is my son covered by medical insurance?
Will my son be covered by insurance?
Decent set of utterances
How can I look up whether we have insurance coverage for the whole family?
Seeking details on eligibility for medical coverage
Is there a document that details who is protected under our medical insurance policy?
I want to be able to take all of the utterances associated to an intent and analyze them for similarity. I would expect my set of bad utterances to have a high similarity score and my set of decent utterances to have a low similarity score.
I've tried playing around with a few doc2vec tutorials, but I feel like I'm missing something. I keep seeing stuff like this:
Train a set of data and then measure the similarity of a new sentence to your set of data
Measure the similarity between two sentences
I need to have an array of sentences and understand how similar they are to each other.
Any advice on achieving this?
Answering some questions:
What makes the bad utterances bad? The utterances themselves are not bad; it is the lack of variety between them. If most of the training had been like the "bad" set, then real user utterances of greater variety would not be recognized correctly.
Are you trying to discover new intents? No, this is for prerelease training, trying to improve the effectiveness of it.
Why do bad utterances have high similarity scores and decent utterances have low similarity scores? This is a hypothesis. I know how varied real user utterances are, and I have found my trainers fall into ruts when training, asking things the same way, and not seeing good accuracy results. Improving the variety in the utterances tends to result in better accuracy.
What will I do with this info? I’ll use it to assess the training quality of an intent, to determine if more training is likely necessary. In the future we might build real time tools as utterances are being added to let trainers know if they’re being too repetitive.
Most applications of text vectors benefit from the vectors capturing the "essential meaning" of a text, *without* regard to variances in word choice.
That is, it's considered a feature, not a flaw, if two completely different wordings with similar meaning have nearly the same vector. (Or, if some similarity-measure indicates they are totally similar.)
To contrive an example similar to yours, consider the two phrasings:
"health coverage for brother"
"male sibling medical insurance"
There's no reuse of words, but the likely intended meaning is the same – so a good text-vectorization for typical purposes would create very similar vectors. And a similarity-measure using those vectors, or otherwise using the words/word-vectors as input, would indicate very high similarity.
But from your clarifying answers, it seems you actually want a more superficial "similarity" measure. You'd like a measure that reveals when certain phrasings show variety/contrast in their wording. (And specifically, you already know from other factors, like how they were hand-crafted, that groups of these phrasings are semantically related.)
What you want this similarity measure to show is actually a behavior that many projects using text-vectors would consider a failure of the vectors. So semantic methods like those in Word2Vec, Paragraph Vectors (aka "Doc2Vec"), etc are likely the wrong tool for your goal.
You could probably do well with a simpler measure based just on the words, or perhaps character-n-grams, of the texts.
For example, for two texts A and B, you could just tally the number of shared words (that appear in both A and B), and divide by the total number of unique words in both A and B, to get a 0.0 to 1.0 "word choice similarity" number.
And, when considering a new text against a set of prior texts, if its average similarity to the prior texts is low, it'd be "good" for your purposes.
Rather than just words, you could also use all n-character substrings ("n-grams") of your texts – which might help better highlight differences in word-forms, or common typos, which may also be useful variances for your purposes.
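Here is a minimal sketch of that word-overlap measure (Jaccard similarity over word sets, averaged over all pairs); the utterances are the ones from the question:

```python
import re
from itertools import combinations

def words(text):
    return set(re.findall(r"[a-z']+", text.lower()))

def word_choice_similarity(a, b):
    # shared words divided by total unique words, 0.0 to 1.0
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb)

def average_pairwise_similarity(texts):
    pairs = list(combinations(texts, 2))
    return sum(word_choice_similarity(a, b) for a, b in pairs) / len(pairs)

bad = [
    "Is my daughter covered by insurance?",
    "Is my son covered by medical insurance?",
    "Will my son be covered by insurance?",
]
decent = [
    "How can I look up whether we have insurance coverage for the whole family?",
    "Seeking details on eligibility for medical coverage",
    "Is there a document that details who is protected under our medical insurance policy?",
]

print(average_pairwise_similarity(bad))     # should come out relatively high
print(average_pairwise_similarity(decent))  # should come out relatively low
```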
In general, I'd look at the scikit-learn text-vectorization functionality for ideas:
https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
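As a sketch of what that might look like with character n-grams, assuming scikit-learn, where the average off-diagonal cosine similarity plays the role of the repetitiveness score:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

utterances = [
    "Is my daughter covered by insurance?",
    "Is my son covered by medical insurance?",
    "Will my son be covered by insurance?",
]

# character 3- to 5-grams, padded at word boundaries
X = CountVectorizer(analyzer="char_wb", ngram_range=(3, 5)).fit_transform(utterances)
sim = cosine_similarity(X)

# average similarity, excluding each utterance's similarity to itself
n = len(utterances)
print((sim.sum() - n) / (n * (n - 1)))
```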

Machine learning: Which algorithm is used to identify relevant features in a training set?

I've got a problem where I potentially have a huge number of features - essentially a mountain of data points (for discussion, let's say it's in the millions of features). I don't know which data points are useful and which are irrelevant to a given outcome (I guess 1% are relevant and 99% are irrelevant).
I do have the data points and the final outcome (a binary result). I'm interested in reducing the feature set so that I can identify the most useful set of data points to collect to train future classification algorithms.
My current data set is huge, and I can't generate as many training examples with this mountain of data as I could if I identified the relevant features, cut down how many data points I collect, and increased the number of training examples. I expect I would get better classifiers with more training examples and fewer feature data points (while keeping the relevant ones).
What machine learning algorithms should I focus on to, first, identify the features that are relevant to the outcome?
From some reading I've done it seems like SVM provides weighting per feature that I can use to identify the most highly scored features. Can anyone confirm this? Expand on the explanation? Or should I be thinking along another line?
Feature weights in a linear model (logistic regression, naive Bayes, etc) can be thought of as measures of importance, provided your features are all on the same scale.
Your model can be combined with a regularizer for learning that penalises certain kinds of feature vectors (essentially folding feature selection into the classification problem). L1 regularized logistic regression sounds like it would be perfect for what you want.
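A minimal sketch of that approach, assuming scikit-learn; the synthetic data (200 examples, 50 features, only two of them informative) just stands in for the real feature matrix:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))            # 200 examples, 50 candidate features
y = (X[:, 3] + X[:, 7] > 0).astype(int)   # only features 3 and 7 actually matter

# The L1 penalty drives the weights of irrelevant features to exactly zero.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print(np.flatnonzero(model.coef_[0]))     # indices of the surviving features
```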
Maybe you can use PCA or a maximum entropy algorithm to reduce the data set...
You can go for chi-square tests or entropy-based measures, depending on your data type. Supervised discretization greatly reduces the size of your data in a smart way (take a look at the Recursive Minimal Entropy Partitioning algorithm proposed by Fayyad & Irani).
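A rough sketch of the chi-square part with scikit-learn; note that chi2 requires non-negative features (e.g. counts), and the synthetic data here is only for illustration:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(200, 30))   # 200 examples, 30 count-valued features
y = (X[:, 5] > 4).astype(int)             # feature 5 drives the outcome

selector = SelectKBest(chi2, k=5).fit(X, y)
print(np.argsort(selector.scores_)[::-1][:5])  # indices of the top-scoring features
```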
If you work in R, the SIS package has a function that will do this for you.
If you want to do things the hard way, what you want to do is feature screening: a massive preliminary dimension reduction before you do feature selection and model selection from a sane-sized set of features. Figuring out what a sane size is can be tricky, and I don't have a magic answer for that, but you can prioritize the order in which to include the features:
1) for each feature, split the data into two groups by the binary response
2) find the Kolmogorov-Smirnov statistic comparing the two groups
The features with the highest KS statistic are most useful in modeling.
There's a paper "out there" titled "A selctive overview of feature screening for ultrahigh-dimensional data" by Liu, Zhong, and Li, I'm sure a free copy is floating around the web somewhere.
4 years later I'm now halfway through a PhD in this field and I want to add that the definition of a feature is not always simple. In the case that your features are a single column in your dataset, the answers here apply quite well.
However, take the case of an image being processed by a convolutional neural network: a feature is not one pixel of the input; rather, it's much more conceptual than that. Here's a nice discussion for the case of images:
https://medium.com/#ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721

Predictive features with high presence in one class

I am doing a logistic regression to predict the outcome of a binary variable, say whether a journal paper gets accepted or not. The predictors (independent variables) are all the phrases used in these papers (unigrams, bigrams, trigrams). One of these phrases has a skewed presence in the 'accepted' class. Including this phrase gives me a classifier with very high accuracy (more than 90%), while removing this phrase results in accuracy dropping to about 70%.
My more general (naive) machine learning question is:
Is it advisable to remove such skewed features when doing classification?
Is there a method to check skewed presence for every feature and then decide whether to keep it in the model or not?
If I understand correctly, you ask whether some feature should be removed because it is a good predictor (it makes your classifier work better). The answer is short and simple - do not remove it; in fact, the whole point is to find exactly such features.
The only reason to remove such a feature would be that this phenomenon occurs only in the training set and not in real data. But in that case you have bad data - data that does not represent the underlying data distribution - and you should gather better data or "clean" the current data so that it has characteristics analogous to the "real" data.
Based on your comments, it sounds like the feature in your documents that's highly predictive of the class is a near-tautology: "paper accepted on" correlates with accepted papers because at least some of the papers in your database were scraped from already-accepted papers and have been annotated by the authors as such.
To me, this sounds like a useless feature for trying to predict whether a paper will be accepted, because (I'd imagine) you're trying to predict paper acceptance before the actual acceptance has been issued! In such a case, none of the papers you'd like to test your algorithm on will be annotated with "paper accepted on." So, I'd remove it.
You also asked about how to determine whether a feature correlates strongly with one class. There are three things that come to mind for this problem.
First, you could just compute a basic frequency count for each feature in your dataset and compare those values across classes. This is probably not super informative, but it's easy.
Second, since you're using a log-linear model, you can train your model on your training dataset, and then rank each feature in your model by its weight in the logistic regression parameter vector. Features with high positive weight are indicative of one class, while features with large negative weight are strongly indicative of the other.
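A minimal sketch of that ranking, assuming scikit-learn; the two toy documents and their labels are made up:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "paper accepted on march revised camera ready version",   # accepted
    "submitted manuscript currently under review",            # not accepted
]
labels = [1, 0]

vectorizer = CountVectorizer(ngram_range=(1, 3))   # unigrams, bigrams, trigrams
X = vectorizer.fit_transform(docs)
model = LogisticRegression().fit(X, labels)

features = np.array(vectorizer.get_feature_names_out())
order = np.argsort(model.coef_[0])
print(features[order[-5:]])   # most indicative of the 'accepted' class
print(features[order[:5]])    # most indicative of the other class
```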
Finally, just for the sake of completeness, I'll point out that you might also want to look into feature selection. There are many ways of selecting relevant features for a machine learning algorithm, but I think one of the most intuitive from your perspective might be greedy feature elimination. In such an approach, you train a classifier using all N features in your model, and measure the accuracy on some held-out validation set. Then, train N new models, each with N-1 features, such that each model eliminates one of the N features, and measure the resulting drop in accuracy. The feature with the biggest drop was probably strongly predictive of the class, while features that have no measurable difference can probably be omitted from your final model. As larsmans points out correctly in the comments below, this doesn't scale well at all, but it can be a useful method sometimes.
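And a rough sketch of the greedy elimination idea (leave one feature out at a time and measure the accuracy drop), again with made-up synthetic data; as noted, this does not scale to large feature sets:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] - X[:, 4] > 0).astype(int)   # features 0 and 4 are the useful ones
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

baseline = LogisticRegression().fit(X_tr, y_tr).score(X_val, y_val)
for j in range(X.shape[1]):
    keep = [k for k in range(X.shape[1]) if k != j]
    acc = LogisticRegression().fit(X_tr[:, keep], y_tr).score(X_val[:, keep], y_val)
    print(f"dropping feature {j}: accuracy falls by {baseline - acc:.3f}")
```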

Machine Learning Algorithm selection

I am new to machine learning. My problem is to make a machine select a university for a student according to his location and area of interest, i.e. it should select a university in the same city as the student's address. I am confused about which algorithm to choose. Can I use the perceptron algorithm for this task?
There are no hard rules as to which machine learning algorithm is the best for which task. Your best bet is to try several and see which one achieves the best results. You can use the Weka toolkit, which implements a lot of different machine learning algorithms. And yes, you can use the perceptron algorithm for your problem -- but that is not to say that you would achieve good results with it.
From your description it sounds like the problem you're trying to solve doesn't really require machine learning. If all you want to do is match a student with the closest university that offers a course in the student's area of interest, you can do this without any learning.
I second the first remark that you probably don't need machine learning if the student has to live in the same area as the university. If you want to use an ML algorithm, it would be best to think about what data you would have to start with. The thing that comes to mind is a vector for each university that has certain subjects/areas as features. Then compute the distance from an ideal feature vector for the student and minimize this distance.
The first and foremost thing you need is a labeled dataset.
It sounds like the problem could be framed as an ML problem; however, you first need a set of positive and negative examples to train from.
How big is your dataset? What features do you have available? Once you answer these questions you can select an algorithm that bests fits the features of your data.
I would suggest using decision trees for this problem, since it resembles a set of if-else rules. You can just take the location and area of interest of the student as conditions of if and else-if statements and then suggest a university for him. Since it's a direct mapping of inputs to outputs, a rule-based solution would work, and there is no learning required here.
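For illustration, a sketch of that rule-based mapping; the university records and field names here are entirely made up:

```python
# Hypothetical records; in practice these would come from a real database.
universities = [
    {"name": "University A", "city": "CityA", "subjects": {"computer science", "physics"}},
    {"name": "University B", "city": "CityB", "subjects": {"medicine", "biology"}},
]

def suggest(city, interest):
    # Direct mapping of inputs to outputs, no learning involved.
    return [u["name"] for u in universities
            if u["city"] == city and interest in u["subjects"]]

print(suggest("CityA", "computer science"))   # ['University A']
```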
Maybe you can use a "recommender system"or a clustering approach , you can investigate more deeply the techniques like "collaborative filtering"(recommender system) or k-means(clustering) but again, as some people said, first you need data to learn from, and maybe your problem can be solved without ML.
Well, there is no straightforward, sure-shot answer to this question. The answer depends on many factors, such as the problem statement and the kind of output you want, the type and size of the data, the available computational time, and the number of features and observations in the data, to name a few.
Size of the training data
Accuracy and/or Interpretability of the output
Accuracy of a model means that the model predicts a response value for a given observation that is close to the true response value for that observation. A highly interpretable algorithm (a restrictive model like linear regression) means that one can easily understand how any individual predictor is associated with the response, while flexible models give higher accuracy at the cost of low interpretability.
Speed or Training time
Higher accuracy typically means higher training time. Also, algorithms require more time to train on large training data. In real-world applications, the choice of algorithm is driven by these two factors predominantly.
Algorithms like Naïve Bayes and linear and logistic regression are easy to implement and quick to run. Algorithms like SVMs (which involve parameter tuning), neural networks (which can take a long time to converge), and random forests need a lot of time to train.
Linearity
Many algorithms work on the assumption that classes can be separated by a straight line (or its higher-dimensional analog). Examples include logistic regression and support vector machines. Linear regression algorithms assume that data trends follow a straight line. If the data is linear, these algorithms perform quite well.
Number of features
The dataset may have a large number of features, not all of which are relevant and significant. For certain types of data, such as genetic or textual data, the number of features can be very large compared to the number of data points.

Text categorization using Naive Bayes

I am doing the text categorization machine learning problem using Naive Bayes. I have each word as a feature. I have been able to implement it and I am getting good accuracy.
Is it possible for me to use tuples of words as features?
For example, suppose there are two classes, politics and sports. The word "government" might appear in both of them. However, in politics I can have the tuple (government, democracy), whereas in sports I can have the tuple (government, sportsman). So, if a new article about politics comes in, the tuple (government, democracy) is more probable than the tuple (government, sportsman).
I am asking because, by doing this, am I violating the independence assumption of Naive Bayes, since I am considering single words as features too?
Also, I am thinking of adding weights to features. For example, a 3-tuple feature will have less weight than a 4-tuple feature.
Theoretically, are these two approaches not changing the independence assumptions of the Naive Bayes classifier? Also, I have not started with the approach I mentioned yet, but will this improve the accuracy? I think the accuracy might not improve, but the amount of training data required to reach the same accuracy would be less.
Even without adding bigrams, real documents already violate the independence assumption. Conditioned on having Obama in a document, President is much more likely to appear. Nonetheless, naive bayes still does a decent job at classification, even if the probability estimates it gives are hopelessly off. So I recommend that you go on and add more complex features to your classifier and see if they improve accuracy.
If you get the same accuracy with less data, that is basically equivalent to getting better accuracy with the same amount of data.
On the other hand, using simpler, more common features works better as you decrease the amount of data. If you try to fit too many parameters to too little data, you tend to overfit badly.
But the bottom line is to try it and see.
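For what it's worth, a minimal sketch of adding bigrams alongside unigrams, assuming scikit-learn; the two training texts and labels are invented:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = [
    "the government passed a law strengthening democracy",   # politics
    "the government congratulated the winning sportsman",    # sports
]
labels = ["politics", "sports"]

# Unigram and bigram counts as Naive Bayes features.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["a new law on democracy"]))
```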
No, from a theoretical viewpoint, you are not changing the independence assumption. You are simply creating a modified (or new) sample space. In general, once you start using higher n-grams as events in your sample space, data sparsity becomes a problem. I think using tuples will lead to the same issue. You will probably need more training data, not less. You will probably also have to give a little more thought to the type of smoothing you use. Simple Laplace smoothing may not be ideal.
The most important point, I think, is this: whatever classifier you are using, the features are highly dependent on the domain (and sometimes even the dataset). For example, if you are classifying the sentiment of texts based on movie reviews, using only unigrams may seem counterintuitive, but they perform better than using only adjectives. On the other hand, for Twitter datasets, a combination of unigrams and bigrams was found to be good, but higher n-grams were not useful. Based on such reports (ref. Pang and Lee, Opinion Mining and Sentiment Analysis), I think using longer tuples will show similar results, since, after all, tuples of words are simply points in a higher-dimensional space. The basic algorithm behaves the same way.
