Is there any benchmarks for keyword extraction?

Is there any benchmarks for keyword extraction? - machine-learning

I will do keyword extraction from documents. But I can't how to evaluate the accuracy(or precision, recall), because I don't know about ground truths of data.
I want to do evaluate the accuracy(or precision, recall) for my model. Is there any benchmarks?

For a high-quality "keyword extraction" you do not necessarily need a "ground truth". There are also algorithms, which work without a ground truth. Here is a comparison of different algorithms and their performance
https://monkeylearn.com/blog/keyword-extraction-tools/

Related

Classifier or heuristics?

I need to classify questions asking user to specify brand.
I has some set of samples featuring word "brand".
Positives like:
"What is your favorite cosmetic brand?",
"Which fragrance brand (if any) do you think this advert is for?"...
and negatives like:
"Is there any particular reason why you chose this brand?"
Of cause, it's possible to train 2-class classifier based on concrete samples. However precision and recall will be poor. Is there any way to construct something having good precision based on variety of positive samples?

Precision and recall does not have to be poor. You should try and build a binary classifier (I would recommend SVM or decision tree for this purpose). I would recommend extracting features like the number of occurrences of each word in a sample (or tf-idf) or the length of the words and sentences. I guess that the question word in the sentence will have a major impact on the classification.
In addition, please note that a good precision value is very easy to get when you do not care about recall.

Choosing a set of words as features using tf-idf and training a tree algorithm seems the easiest way to go but I would also suggest to also try k-means clustering in the case that noe or more categories of answers considered as "neutral" emerge. This will possible help you decide which of these you consider positive or negative in order to re-factor your feature vector and subsequently your algorithm.
I am also a huge fan of HMM variants (I have used them to perform energy disaggregation) and I suggest you have a look at the following. It might give you some extra ideas:
http://www.merl.com/publications/docs/TR2004-085.pdf

What are the metrics to evaluate a machine learning algorithm

I would like to know what are the various techniques and metrics used to evaluate how accurate/good an algorithm is and how to use a given metric to derive a conclusion about a ML model.
one way to do this is to use precision and recall, as defined here in wikipedia.
Another way is to use the accuracy metric as explained here. So, what I would like to know is whether there are other metrics for evaluating an ML model?

I've compiled, a while ago, a list of metrics used to evaluate classification and regression algorithms, under the form of a cheatsheet. Some metrics for classification: precision, recall, sensitivity, specificity, F-measure, Matthews correlation, etc. They are all based on the confusion matrix. Others exist for regression (continuous output variable).
The technique is mostly to run an algorithm on some data to get a model, and then apply that model on new, previously unseen data, and evaluate the metric on that data set, and repeat.
Some techniques (actually resampling techniques from statistics):
Jacknife
Crossvalidation
K-fold validation
bootstrap.

Talking about ML in general is a quite vast field, but I'll try to answer any way. The Wikipedia definition of ML is the following
Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.
In this context learning can be defined parameterization of an algorithm. The parameters of the algorithm are derived using input data with a known output. When the algorithm has "learned" the association between input and output, it can be tested with further input data for which the output is well known.
Let's suppose your problem is to obtain words from speech. Here the input is some kind of audio file containing one word (not necessarily, but I supposed this case to keep it quite simple). You'd record X words N times and then use (for example) N/2 of the repetitions to parameterize your algorithm, disregarding - at the moment - how your algorithm would look like.
Now on the one hand - depending on the algorithm - if you feed your algorithm with one of the remaining repetitions, it may give you some certainty estimate which may be used to characterize the recognition of just one of the repetitions. On the other hand you may use all of the remaining repetitions to test the learned algorithm. For each of the repetitions you pass it to the algorithm and compare the expected output with the actual output. After all you'll have an accuracy value for the learned algorithm calculated as the quotient of correct and total classifications.
Anyway, the actual accuracy will depend on the quality of your learning and test data.
A good start to read on would be Pattern Recognition and Machine Learning by Christopher M Bishop

There are various metrics for evaluating the performance of ML model and there is no rule that there are 20 or 30 metrics only. You can create your own metrics depending on your problem. There are various cases wherein when you are solving real - world problem where you would need to create your own custom metrics.
Coming to the existing ones, it is already listed in the first answer, I would just highlight each metrics merits and demerits to better have an understanding.
Accuracy is the simplest of the metric and it is commonly used. It is the number of points to class 1/ total number of points in your dataset. This is for 2 class problem where some points belong to class 1 and some to belong to class 2. It is not preferred when the dataset is imbalanced because it is biased to balanced one and it is not that much interpretable.
Log loss is a metric that helps to achieve probability scores that gives you better understanding why a specific point is belonging to class 1. The best part of this metric is that it is inbuild in logistic regression which is famous ML technique.
Confusion metric is best used for 2-class classification problem which gives four numbers and the diagonal numbers helps to get an idea of how good is your model.Through this metric there are others such as precision, recall and f1-score which are interpretable.

What evaluation classifiers? Precision & recall?

I have some labeled data which classifies datasets as positive or negative. Now i have an algorithm that does the same automatically and I want to compare the results.
I was said to use precision and recall, but I'm not sure whether those are appropriate because the true negatives don't even appear in the formulas. I'd rather tend to use a general "prediction rate" for both, positives and negatives.
How would be a good way to evaluate the algorithm? Thanks!!

There is no general "best" method of evaluation, everything depends on what is your aim, as each method captures different phenomena:
Accuracy is the simple measure, well suited for multi-label classification and rather well balanced data
F1-score captures precision/recall tradeoff
MCC is a good measure which is well suited for dataset with large dissproportion in the class sizes

unigrams & bigrams (tf-idf) less accurate than just unigrams (ff-idf)?

This is a question about linear regression with ngrams, using Tf-IDF (term frequency - inverse document frequency). To do this, I am using numpy sparse matrices and sklearn for linear regression.
I have 53 cases and over 6000 features when using unigrams. The predictions are based on cross validation using LeaveOneOut.
When I create a tf-idf sparse matrix of only unigram scores, I get slightly better predictions than when I create a tf-idf sparse matrix of unigram+bigram scores. The more columns I add to the matrix (columns for trigram, quadgram, quintgrams, etc.), the less accurate the regression prediction.
Is this common? How is this possible? I would have thought that the more features, the better.

It's not common for bigrams to perform worse than unigrams, but there are situations where it may happen. In particular, adding extra features may lead to overfitting. Tf-idf is unlikely to alleviate this, as longer n-grams will be rarer, leading to higher idf values.
I'm not sure what kind of variable you're trying to predict, and I've never done regression on text, but here's some comparable results from literature to get you thinking:
In random text generation with small (but non-trivial) training sets, 7-grams tend to reconstruct the input text almost verbatim, i.e. cause complete overfit, while trigrams are more likely to generate "new" but still somewhat grammatical/recognizable text (see Jurafsky & Martin; can't remember which chapter and I don't have my copy handy).
In classification-style NLP tasks performed with kernel machines, quadratic kernels tend to fare better than cubic ones because the latter often overfit on the training set. Note that unigram+bigram features can be thought of as a subset of the quadratic kernel's feature space, and {1,2,3}-grams of that of the cubic kernel.
Exactly what is happening depends on your training set; it might simply be too small.

As larsmans said, adding more variables / features makes it easier for the model to overfit hence lose in test accuracy. In the master branch of scikit-learn there is now a min_df parameter to cut-off any feature with less than that number of occurrences. Hence min_df==2 to min_df==5 might help you get rid of spurious bi-grams.
Alternatively you can use L1 or L1 + L2 penalized linear regression (or classification) using either the following classes:
sklearn.linear_model.Lasso (regression)
sklearn.linear_model.ElasticNet (regression)
sklearn.linear_model.SGDRegressor (regression) with penalty == 'elastic_net' or 'l1'
sklearn.linear_model.SGDClassifier (classification) with penalty == 'elastic_net' or 'l1'
This will make it possible to ignore spurious features and lead to a sparse model with many zero weights for noisy features. Grid Searching the regularization parameters will be very important though.
You can also try univariate feature selection such as done the text classification example of scikit-learn (check the SelectKBest and chi2 utilities.

When should I use support vector machines as opposed to artificial neural networks?

I know SVMs are supposedly 'ANN killers' in that they automatically select representation complexity and find a global optimum (see here for some SVM praising quotes).
But here is where I'm unclear -- do all of these claims of superiority hold for just the case of a 2 class decision problem or do they go further? (I assume they hold for non-linearly separable classes or else no-one would care)
So a sample of some of the cases I'd like to be cleared up:
Are SVMs better than ANNs with many classes?
in an online setting?
What about in a semi-supervised case like reinforcement learning?
Is there a better unsupervised version of SVMs?
I don't expect someone to answer all of these lil' subquestions, but rather to give some general bounds for when SVMs are better than the common ANN equivalents (e.g. FFBP, recurrent BP, Boltzmann machines, SOMs, etc.) in practice, and preferably, in theory as well.

Are SVMs better than ANN with many classes? You are probably referring to the fact that SVMs are in essence, either either one-class or two-class classifiers. Indeed they are and there's no way to modify a SVM algorithm to classify more than two classes.
The fundamental feature of a SVM is the separating maximum-margin hyperplane whose position is determined by maximizing its distance from the support vectors. And yet SVMs are routinely used for multi-class classification, which is accomplished with a processing wrapper around multiple SVM classifiers that work in a "one against many" pattern--i.e., the training data is shown to the first SVM which classifies those instances as "Class I" or "not Class I". The data in the second class, is then shown to a second SVM which classifies this data as "Class II" or "not Class II", and so on. In practice, this works quite well. So as you would expect, the superior resolution of SVMs compared to other classifiers is not limited to two-class data.
As far as i can tell, the studies reported in the literature confirm this, e.g., In the provocatively titled paper Sex with Support Vector Machines substantially better resolution for sex identification (Male/Female) in 12-square pixel images, was reported for SVM compared with that of a group of traditional linear classifiers; SVM also outperformed RBF NN, as well as large ensemble RBF NN). But there seem to be plenty of similar evidence for the superior performance of SVM in multi-class problems: e.g., SVM outperformed NN in protein-fold recognition, and in time-series forecasting.
My impression from reading this literature over the past decade or so, is that the majority of the carefully designed studies--by persons skilled at configuring and using both techniques, and using data sufficiently resistant to classification to provoke some meaningful difference in resolution--report the superior performance of SVM relative to NN. But as your Question suggests, that performance delta seems to be, to a degree, domain specific.
For instance, NN outperformed SVM in a comparative study of author identification from texts in Arabic script; In a study comparing credit rating prediction, there was no discernible difference in resolution by the two classifiers; a similar result was reported in a study of high-energy particle classification.
I have read, from more than one source in the academic literature, that SVM outperforms NN as the size of the training data decreases.
Finally, the extent to which one can generalize from the results of these comparative studies is probably quite limited. For instance, in one study comparing the accuracy of SVM and NN in time series forecasting, the investigators reported that SVM did indeed outperform a conventional (back-propagating over layered nodes) NN but performance of the SVM was about the same as that of an RBF (radial basis function) NN.
[Are SVMs better than ANN] In an Online setting? SVMs are not used in an online setting (i.e., incremental training). The essence of SVMs is the separating hyperplane whose position is determined by a small number of support vectors. So even a single additional data point could in principle significantly influence the position of this hyperplane.
What about in a semi-supervised case like reinforcement learning? Until the OP's comment to this answer, i was not aware of either Neural Networks or SVMs used in this way--but they are.
The most widely used- semi-supervised variant of SVM is named Transductive SVM (TSVM), first mentioned by Vladimir Vapnick (the same guy who discovered/invented conventional SVM). I know almost nothing about this technique other than what's it is called and that is follows the principles of transduction (roughly lateral reasoning--i.e., reasoning from training data to test data). Apparently TSV is a preferred technique in the field of text classification.
Is there a better unsupervised version of SVMs? I don't believe SVMs are suitable for unsupervised learning. Separation is based on the position of the maximum-margin hyperplane determined by support vectors. This could easily be my own limited understanding, but i don't see how that would happen if those support vectors were unlabeled (i.e., if you didn't know before-hand what you were trying to separate). One crucial use case of unsupervised algorithms is when you don't have labeled data or you do and it's badly unbalanced. E.g., online fraud; here you might have in your training data, only a few data points labeled as "fraudulent accounts" (and usually with questionable accuracy) versus the remaining >99% labeled "not fraud." In this scenario, a one-class classifier, a typical configuration for SVMs, is the a good option. In particular, the training data consists of instances labeled "not fraud" and "unk" (or some other label to indicate they are not in the class)--in other words, "inside the decision boundary" and "outside the decision boundary."
I wanted to conclude by mentioning that, 20 years after their "discovery", the SVM is a firmly entrenched member in the ML library. And indeed, the consistently superior resolution compared with other state-of-the-art classifiers is well documented.
Their pedigree is both a function of their superior performance documented in numerous rigorously controlled studies as well as their conceptual elegance. W/r/t the latter point, consider that multi-layer perceptrons (MLP), though they are often excellent classifiers, are driven by a numerical optimization routine, which in practice rarely finds the global minimum; moreover, that solution has no conceptual significance. On the other hand, the numerical optimization at the heart of building an SVM classifier does in fact find the global minimum. What's more that solution is the actual decision boundary.
Still, i think SVM reputation has declined a little during the past few years.
The primary reason i suspect is the NetFlix competition. NetFlix emphasized the resolving power of fundamental techniques of matrix decomposition and even more significantly t*he power of combining classifiers. People combined classifiers long before NetFlix, but more as a contingent technique than as an attribute of classifier design. Moreover, many of the techniques for combining classifiers are extraordinarily simple to understand and also to implement. By contrast, SVMs are not only very difficult to code (in my opinion, by far the most difficult ML algorithm to implement in code) but also difficult to configure and implement as a pre-compiled library--e.g., a kernel must be selected, the results are very sensitive to how the data is re-scaled/normalized, etc.

I loved Doug's answer. I would like to add two comments.
1) Vladimir Vapnick also co-invented the VC dimension which is important in learning theory.
2) I think that SVMs were the best overall classifiers from 2000 to 2009, but after 2009, I am not sure. I think that neural nets have improved very significantly recently due to the work in Deep Learning and Sparse Denoising Auto-Encoders. I thought I saw a number of benchmarks where they outperformed SVMs. See, for example, slide 31 of
http://deeplearningworkshopnips2010.files.wordpress.com/2010/09/nips10-workshop-tutorial-final.pdf
A few of my friends have been using the sparse auto encoder technique. The neural nets build with that technique significantly outperformed the older back propagation neural networks. I will try to post some experimental results at artent.net if I get some time.

I'd expect SVM's to be better when you have good features to start with. IE, your features succinctly capture all the necessary information. You can see if your features are good if instances of the same class "clump together" in the feature space. Then SVM with Euclidian kernel should do the trick. Essentially you can view SVM as a supercharged nearest neighbor classifier, so whenever NN does well, SVM should do even better, by adding automatic quality control over the examples in your set. On the converse -- if it's a dataset where nearest neighbor (in feature space) is expected to do badly, SVM will do badly as well.

- Is there a better unsupervised version of SVMs?
Just answering only this question here. Unsupervised learning can be done by so-called one-class support vector machines. Again, similar to normal SVMs, there is an element that promotes sparsity. In normal SVMs only a few points are considered important, the support vectors. In one-class SVMs again only a few points can be used to either:
"separate" a dataset as far from the origin as possible, or
define a radius as small as possible.
The advantages of normal SVMs carry over to this case. Compared to density estimation only a few points need to be considered. The disadvantages carry over as well.

Are SVMs better than ANNs with many classes?
SVMs have been designated for discrete classification. Before moving to ANNs, try ensemble methods like Random Forest , Gradient Boosting, Gaussian Probability Classification etc
What about in a semi-supervised case like reinforcement learning?
Deep Q learning provides better alternatives.
Is there a better unsupervised version of SVMs?
SVM is not suited for unsupervised learning. You have other alternatives for unsupervised learning : K-Means, Hierarchical clustering, TSNE clustering etc
From ANN perspective, you can try Autoencoder, General adversarial network
Few more useful links:
towardsdatascience
wikipedia

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart