Natural language generation evaluation - machine-learning

I am building a natural language generator using LSTM networks, but I am stuck on how to evaluate my output. Suppose I have an input training data set that consists of dialogue act representations and the correct output sentence for each dialogue act. Now suppose I generate an output sentence y from my LSTM network; how do I evaluate that sentence in comparison to the one in the data set? In other words, is there any way to compare the outputs so that I can use gradient descent to train my weights?

As soon as you find the answer, you'll be able to write a nice paper about it since that's kind of an open research question right now. :)
To the best of my knowledge, your evaluation has to combine the syntactic and semantic plausibility of the output with context coherence, personality consistency and dynamic discourse progression. There is no consensus on how to measure these optimally, but there are plenty of current papers on the topic.
A related introductory read is Liu et al.: https://arxiv.org/abs/1603.08023
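For a concrete feel for the word-overlap metrics that the Liu et al. paper studies, here is a minimal BLEU sketch with NLTK; the two sentences are made-up examples. Note that BLEU is an evaluation metric, not a differentiable loss: the LSTM itself is normally trained with cross-entropy against the reference tokens, and overlap scores like this are computed afterwards.

    # Minimal sketch: comparing a generated sentence to a reference with BLEU.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = "the restaurant serves cheap italian food".split()
    generated = "the place serves cheap italian food".split()

    # Smoothing avoids zero scores when some n-gram orders have no matches.
    score = sentence_bleu([reference], generated,
                          smoothing_function=SmoothingFunction().method1)
    print(score)  # 1.0 would mean an exact n-gram match with the reference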

Related

Recent methods for finding semantic similarity between two short sentences or articles (on a concept level)

I'm working on finding similarities between short sentences and articles. I have used many existing methods such as tf-idf and word2vec, but the results are only okay. The most relevant measure I found was Word Mover's Distance; however, its results are not much better than those of the other measures. I know it's a challenging problem, but I am wondering if there are any new methods that find approximate similarity at a higher, concept level rather than just matching words. In particular, are there any new methods, like Word Mover's Distance, that look at the slightly higher-level semantics of a sentence or article?
This is the most recent approach, based on a paper published four months ago.
Step 1:
Load a suitable model using gensim, calculate the word vectors for the words in the sentence, and store them as a word list.
Step 2: Computing the sentence vector
Calculating the semantic similarity between sentences used to be difficult, but recently a paper named "A Simple but Tough-to-Beat Baseline for Sentence Embeddings" proposed a simple approach: compute the weighted average of the word vectors in the sentence, and then remove the projections of the average vectors onto their first principal component. Here the weight of a word w is a/(a + p(w)), with a a parameter and p(w) the (estimated) word frequency; this weighting is called smooth inverse frequency (SIF). The method performs significantly better than a plain average.
Simple code to calculate the sentence vector using SIF, the method proposed in the paper, has been given here.
Step 3: Using sklearn's cosine_similarity, load the two sentence vectors and compute their similarity.
This is the simplest and most efficient method I know of to compute the semantic similarity of sentences.
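Since the code linked above is not reproduced here, here is a rough sketch of steps 1-3 under some assumptions: word_vectors is a gensim KeyedVectors model and word_freq maps words to estimated corpus frequencies; both are placeholders you would supply.

    # Rough SIF sketch: weighted-average word vectors, then remove the
    # projection onto the first principal component (Arora et al., 2017).
    import numpy as np
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    def sif_embeddings(sentences, word_vectors, word_freq, a=1e-3):
        rows = []
        for sent in sentences:
            words = [w for w in sent.lower().split() if w in word_vectors]
            weights = [a / (a + word_freq.get(w, 0.0)) for w in words]
            rows.append(np.average([word_vectors[w] for w in words],
                                   axis=0, weights=weights))
        X = np.vstack(rows)
        pc = TruncatedSVD(n_components=1).fit(X).components_  # 1st principal comp.
        return X - X @ pc.T @ pc  # remove its projection from every row

    # emb = sif_embeddings(["a small bird", "a tiny bird"], word_vectors, word_freq)
    # print(cosine_similarity(emb[:1], emb[1:]))  # step 3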
Obviously, this is a huge and busy research area, but I'd say there are two broad types of approaches you could look into:
First, there are some methods that learn sentence embeddings in an unsupervised manner, such as Le and Mikolov's (2014) Paragraph Vectors, which are implemented in gensim, or Kiros et al.'s (2015) SkipThought vectors, with an implementation on Github.
Then there also exist supervised methods that learn sentence embeddings from labelled data. The most recent one is Conneau et al.'s (2017), which trains sentence embeddings on the Stanford Natural Language Inference dataset, and shows these embeddings can be used successfully across a range of NLP tasks. The code is available on Github.
You might also find some inspiration in a blog post I wrote earlier this year on the topic of embeddings.
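To illustrate the first (unsupervised) family, here is a minimal gensim Doc2Vec sketch; the toy sentences are placeholders:

    # Minimal Paragraph Vectors (Doc2Vec) sketch with gensim.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    sentences = ["a small bird sits on the branch", "a tiny bird rests on a twig"]
    docs = [TaggedDocument(words=s.split(), tags=[i])
            for i, s in enumerate(sentences)]

    model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)
    vec = model.infer_vector("a little bird".split())  # embedding for unseen text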
To be honest, the best thing I know of for this at the moment is AMR (Abstract Meaning Representation):
About AMR here: https://amr.isi.edu/
Documentation here: https://github.com/amrisi/amr-guidelines/blob/master/amr.md
You can use a system like JAMR (see here: https://github.com/jflanigan/jamr) to generate AMRs for your sentence and then you can use Smatch (see here: https://amr.isi.edu/eval/smatch/tutorial.html) to compare the similarity of the two generated AMRs.
What you are trying to do is very difficult and is an active ongoing area of research.
You can use semantic similarity with WordNet for each pair of nouns.
For a quick look, you can enter bird-noun-1 and chair-noun-1 and select WordNet at http://labs.fc.ul.pt/dishin/; it gives you:
Resnik 0.315625756544
Lin 0.0574161071905
Jiang&Conrath 0.0964964414156
The Python code is at: https://github.com/lasigeBioTM/DiShIn
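If you would rather stay inside NLTK than install the DiShIn package, the same family of WordNet measures is available there; a small sketch (both corpora need a one-time nltk.download):

    # Sketch: WordNet similarity measures via NLTK, as an alternative to DiShIn.
    # Requires: nltk.download('wordnet'); nltk.download('wordnet_ic')
    from nltk.corpus import wordnet as wn
    from nltk.corpus import wordnet_ic

    bird, chair = wn.synset('bird.n.01'), wn.synset('chair.n.01')
    brown_ic = wordnet_ic.ic('ic-brown.dat')  # information content, Brown corpus

    print(bird.res_similarity(chair, brown_ic))  # Resnik
    print(bird.lin_similarity(chair, brown_ic))  # Lin
    print(bird.jcn_similarity(chair, brown_ic))  # Jiang & Conrath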

What are the metrics to evaluate a machine learning algorithm

I would like to know what are the various techniques and metrics used to evaluate how accurate/good an algorithm is and how to use a given metric to derive a conclusion about a ML model.
One way to do this is to use precision and recall, as defined here on Wikipedia.
Another way is to use the accuracy metric as explained here. So, what I would like to know is: are there other metrics for evaluating an ML model?
A while ago I compiled a list of metrics used to evaluate classification and regression algorithms, in the form of a cheat sheet. Some metrics for classification: precision, recall, sensitivity, specificity, F-measure, Matthews correlation, etc. They are all based on the confusion matrix. Others exist for regression (continuous output variable).
The procedure is mostly to run an algorithm on some data to get a model, then apply that model to new, previously unseen data, evaluate the metric on that data, and repeat.
Some techniques (actually resampling techniques from statistics); a short cross-validation sketch follows the list:
Jackknife
Cross-validation
k-fold cross-validation
Bootstrap
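For instance, a minimal k-fold cross-validation sketch with scikit-learn; the logistic regression estimator and the iris data are just placeholders:

    # Minimal k-fold cross-validation sketch with scikit-learn.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores.mean(), scores.std())  # mean accuracy over the 5 folds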
ML in general is quite a vast field, but I'll try to answer anyway. The Wikipedia definition of ML is the following:
Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.
In this context, learning can be defined as the parameterization of an algorithm. The parameters of the algorithm are derived using input data with a known output. Once the algorithm has "learned" the association between input and output, it can be tested with further input data for which the output is known.
Let's suppose your problem is to obtain words from speech. Here the input is some kind of audio file containing one word (not necessarily, but I assume this case to keep it simple). You'd record X words N times each and then use (for example) N/2 of the repetitions to parameterize your algorithm, disregarding for the moment what your algorithm looks like.
Now, on the one hand, depending on the algorithm, if you feed it one of the remaining repetitions, it may give you some certainty estimate which can be used to characterize the recognition of that single repetition. On the other hand, you may use all of the remaining repetitions to test the learned algorithm: for each repetition you pass it to the algorithm and compare the expected output with the actual output. In the end you'll have an accuracy value for the learned algorithm, calculated as the quotient of correct and total classifications.
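A toy version of that split-and-score procedure, sketched with scikit-learn; the digits dataset stands in for the recorded words:

    # Toy sketch: hold out half of the examples (the N/2 split above), learn on
    # the rest, and compute accuracy = correct / total classifications.
    from sklearn.datasets import load_digits
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                        random_state=0)
    clf = KNeighborsClassifier().fit(X_train, y_train)
    print(accuracy_score(y_test, clf.predict(X_test)))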
Anyway, the actual accuracy will depend on the quality of your learning and test data.
A good starting point for further reading would be Pattern Recognition and Machine Learning by Christopher M. Bishop.
There are various metrics for evaluating the performance of an ML model, and there is no rule that there are only 20 or 30 of them. You can create your own metrics depending on your problem; there are various cases, when solving real-world problems, where you will need to create custom metrics.
Coming to the existing ones, they are already listed in the first answer; I will just highlight the merits and demerits of each metric to give a better understanding.
Accuracy is the simplest metric and is commonly used. It is the number of correctly classified points divided by the total number of points in your dataset (for a 2-class problem, where some points belong to class 1 and some to class 2). It is not preferred when the dataset is imbalanced, because it is biased toward the majority class, and in that case it is not very interpretable.
Log loss is a metric based on probability scores, which give you a better understanding of why a specific point is assigned to class 1. A nice property of this metric is that it is built into logistic regression, a popular ML technique.
The confusion matrix is best used for 2-class classification problems; it gives four numbers, and the diagonal numbers help you get an idea of how good your model is. From this matrix other metrics can be derived, such as precision, recall and F1-score, which are more interpretable.
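A small sketch computing these metrics with scikit-learn, on made-up labels and predictions for a 2-class problem:

    # Sketch: accuracy, confusion matrix, log loss and precision/recall/F1.
    from sklearn.metrics import (accuracy_score, confusion_matrix, log_loss,
                                 precision_recall_fscore_support)

    y_true = [0, 0, 1, 1, 1, 0]
    y_pred = [0, 1, 1, 1, 0, 0]
    y_prob = [0.2, 0.7, 0.9, 0.6, 0.4, 0.1]  # predicted P(class 1)

    print(accuracy_score(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred))  # diagonal = correct predictions
    print(log_loss(y_true, y_prob))          # penalizes confident mistakes
    print(precision_recall_fscore_support(y_true, y_pred, average='binary'))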

How to find the relation between input parameters and an output parameter with machine learning?

I have 20 numeric input parameters (or more) and a single output parameter, and I have thousands of such records. I need to find the relation between the input parameters and the output parameter. Some input parameters might not relate to the output parameter, or perhaps none of them do. I want some magic system that can statistically calculate the output parameter when I provide all the input parameters, and it would be even better if the system also provided a confidence estimate with the result.
What technique (in machine learning) do I need to use to solve this problem? I think it should be a neural network, a genetic algorithm, or something related, but I'm not sure. Beyond that, I need to know the limitations of the technique.
Thanks.
Your question seems to simply describe the regression problem, which can be solved by numerous algorithms and models, not just neural networks:
Support Vector Regression
Neural Networks
Linear regression (and many modifications and generalizations) using for example OLS method
Nearest Neighbours Regression
Decision Tree Regression
many, many more!
Simply look for "regression methods", "regression models", etc.; in particular, the sklearn library implements many such methods.
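A minimal sketch of the idea with scikit-learn; the random data stands in for your 20 input parameters, and the feature importances hint at which inputs actually relate to the output:

    # Sketch: fit a regression model and inspect which inputs matter.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.random((1000, 20))                             # 20 numeric inputs
    y = 3 * X[:, 0] - X[:, 1] + rng.normal(0, 0.1, 1000)   # only two matter

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
    print(model.score(X_test, y_test))    # R^2 on held-out data
    print(model.feature_importances_)     # which inputs relate to the output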
I would recommend genetic programming (GP), a genetic-based machine learning approach in which the learnt model is a single mathematical expression/equation that best fits your data. Most GP packages out there come with a standard regression suite which you can run "as is" on your data, with minimal setup cost.

machine learning from words found in text

I would like to use a supervised machine learning algorithm to predict a binary function (true or false) for a set of sentences based on the presence or absence of words in the sentences.
Ideally, I would like to avoid having to hardcode the set of words used to decide on the output, so that the algorithm automatically learns which words (perhaps in combination) are most likely to trigger specific outputs.
http://shop.oreilly.com/product/9780596529321.do (Programming Collective Intelligence) has a nice section in chapter 4 titled "Learning From Clicks" which describes how to do this using one layer of hidden nodes in a neural network, with one new hidden node for each new combination of input words.
Similarly, it is possible to create a feature for each word in the training data set and train pretty much any classic machine learning algorithm on these features. Adding new training data will generate new features, which will require me to retrain the algorithm from scratch.
Which brings me to my questions:
Is it actually a problem if I have to retrain everything from scratch whenever the training data set is extended?
What kind of algorithm would more experienced machine learning users recommend for this kind of problem?
What criteria should I use to pick one algorithm over another (other than actually trying them all and seeing which performs better on precision/recall metrics)?
If you have worked on similar problems, what about extending the features with 2-grams (1 if a specific 2-gram is present, 0 if not)? 3-grams?
You could look into the general area of topic modelling if you want to find words which are generally found together.
The simplest approach would be to use latent semantic analysis (http://en.wikipedia.org/wiki/Latent_semantic_analysis), which is just applying SVD to a term-document matrix. You'd then need to do some additional post hoc analysis to fit this to your particular outcome.
A more involved, and much more complex, approach would be to use latent Dirichlet allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation).
In terms of just adding new features (words), that is fine as long as you are going to retrain. You can also use TF-IDF to give each word a value in the matrix, instead of just a 1 or 0.
I don't know what programming language you are working in, but I know there are libraries in Java and Python that do all of the above.
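A minimal LSA sketch of the kind described above, with scikit-learn; the toy corpus is a placeholder:

    # Sketch: TF-IDF term-document matrix + SVD = latent semantic analysis.
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the mat",
            "dogs chase cats around the yard",
            "stocks fell sharply in early trading"]
    X = TfidfVectorizer().fit_transform(docs)  # TF-IDF values instead of 0/1
    topics = TruncatedSVD(n_components=2).fit_transform(X)
    print(topics)  # each document as a point in a 2-dimensional "topic" space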

Unsupervised Sentiment Analysis

I've been reading a lot of articles that explain the need for an initial set of texts that are classified as either 'positive' or 'negative' before a sentiment analysis system will really work.
My question is: Has anyone attempted just doing a rudimentary check of 'positive' adjectives vs 'negative' adjectives, taking into account any simple negators to avoid classing 'not happy' as positive? If so, are there any articles that discuss just why this strategy isn't realistic?
A classic paper by Peter Turney (2002) explains a method to do unsupervised sentiment analysis (positive/negative classification) using only the words excellent and poor as a seed set. Turney uses the mutual information of other words with these two adjectives to achieve an accuracy of 74%.
I haven't tried doing untrained sentiment analysis such as you are describing, but off the top of my head I'd say you're oversimplifying the problem. Simply analyzing adjectives is not enough to get a good grasp of the sentiment of a text; for example, consider the word 'stupid.' Alone, you would classify that as negative, but if a product review were to have '... [x] product makes their competitors look stupid for not thinking of this feature first...' then the sentiment in there would definitely be positive. The greater context in which words appear definitely matters in something like this. This is why an untrained bag-of-words approach alone (let alone an even more limited bag-of-adjectives) is not enough to tackle this problem adequately.
The pre-classified data ('training data') helps in that the problem shifts from trying to determine whether a text is of positive or negative sentiment from scratch, to trying to determine if the text is more similar to positive texts or negative texts, and classify it that way. The other big point is that textual analyses such as sentiment analysis are often affected greatly by the differences of the characteristics of texts depending on domain. This is why having a good set of data to train on (that is, accurate data from within the domain in which you are working, and is hopefully representative of the texts you are going to have to classify) is as important as building a good system to classify with.
Not exactly an article, but hope that helps.
The Turney (2002) paper mentioned by larsmans is a good basic one. In newer research, Li and He [2009] introduce an approach using latent Dirichlet allocation (LDA) to train a model that can classify an article's overall sentiment and topic simultaneously, in a totally unsupervised manner. The accuracy they achieve is 84.6%.
I tried several methods of sentiment analysis for opinion mining in reviews.
What worked best for me is the method described in Liu's book: http://www.cs.uic.edu/~liub/WebMiningBook.html In this book, Liu and others compare many strategies and discuss different papers on sentiment analysis and opinion mining.
Although my main goal was to extract features from the opinions, I implemented a sentiment classifier to detect the positive or negative classification of these features.
I used NLTK for the pre-processing (word tokenization, POS tagging) and for creating the trigrams. Then I also used the Bayesian classifiers in this toolkit to compare with the other strategies Liu was pinpointing.
One of the methods relies on tagging every trigram expressing this information as pos/neg, and using some classifier on this data.
The other method I tried, which worked better (around 85% accuracy on my dataset), was calculating the sum of PMI (pointwise mutual information) scores for every word in the sentence with the words excellent/poor as seeds of the pos/neg classes.
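A toy sketch of that PMI scoring; in Turney's original method the co-occurrence statistics come from web search hit counts, whereas here a tiny in-memory corpus stands in for them:

    # Toy sketch of PMI-based sentiment scoring with "excellent"/"poor" seeds.
    import math
    from collections import Counter
    from itertools import combinations

    corpus = ["excellent great food", "poor terrible service",
              "great excellent staff", "terrible poor quality"]
    unigrams = Counter(w for s in corpus for w in s.split())
    pairs = Counter(frozenset(p)
                    for s in corpus for p in combinations(s.split(), 2))
    total = sum(unigrams.values())

    def pmi(w1, w2):
        joint = pairs[frozenset((w1, w2))]
        if not joint:
            return 0.0
        return math.log2(joint * total / (unigrams[w1] * unigrams[w2]))

    def orientation(sentence):
        # Positive score leans "excellent", negative leans "poor".
        return sum(pmi(w, "excellent") - pmi(w, "poor")
                   for w in sentence.split())

    print(orientation("great staff"))  # > 0, so classified as positive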
I tried spotting keywords using a dictionary of affect to predict the sentiment label at the sentence level. Given the generality of the vocabulary (not domain dependent), the results were only about 61%. The paper is available on my homepage.
In a somewhat improved version, negation adverbs were considered. The whole system, named EmoLib, is available for demo:
http://dtminredis.housing.salle.url.edu:8080/EmoLib/
I'm not sure if this helps, but you may want to look into Jacob Perkins's blog post on using NLTK for sentiment analysis.
There are no magic "shortcuts" in sentiment analysis, as with any other sort of text analysis that seeks to discover the underlying "aboutness" of a chunk of text. Attempting to shortcut proven text analysis methods with simplistic "adjective" checking or similar approaches leads to ambiguity, incorrect classification, etc., which at the end of the day gives you a poor accuracy read on sentiment. The more terse the source (e.g. Twitter), the more difficult the problem.
