Doing a hierarchical sentiment analysis with LingPipe - machine-learning

This is in the context of doing sentiment analysis with the LingPipe machine learning toolkit. I have to classify whether a sentence in a large paragraph carries positive or negative sentiment. I know of the following approach in LingPipe:
Classify the complete paragraph based on its polarity, negative or positive.
Here I don't yet know the polarity at the sentence level; we are still at the paragraph level. How do I determine the polarity at the sentence level of a paragraph, i.e. whether a given sentence in a paragraph is positive or negative? I know that LingPipe can classify whether a sentence is subjective or objective. Using this capability, should I:
First train LingPipe on a large set of sentences that are subjective/objective.
Use the trained model to extract all subjective sentences out of a test paragraph.
Train a LingPipe polarity classifier on the extracted subjective sentences by manually labeling them as positive/negative.
Now use the trained polarity model: feed it a test subjective sentence (obtained by passing a sentence through the trained subjective/objective model) and determine whether that sentence is positive or negative?
Does the above approach work? In the proposed approach, we know that LingPipe is capable of accepting large textual content (a paragraph) for polarity classification. Will it do a good job if we just pass a single subjective sentence for polarity classification? I am confused!

You might want to take a look at the multi-level analysis approaches in the literature, e.g.
Li, S., et al. (2010). "Exploiting Combined Multi-level Model for Document Sentiment Analysis," 2010 International Conference on Pattern Recognition.
Yessenalina, A., et al. (2010). "Multi-level Structured Models for Document-level Sentiment Classification," Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1046–1056, MIT, Massachusetts, USA, 9–11 October 2010.
Multi-level analysis approaches are quite common in information retrieval, as in content indexing for vector space similarity search.
Environments such as LingPipe are a good way to get started, but eventually you need to employ lower-level, finer-grained tools such as those yura suggested.

Most machine learning libraries, including LingPipe, are row based (each object has a flat set of features). So if you want to do hierarchical classification with them, you should denormalise your data: for example, you can have paragraph and sentence features in the same feature set. If you use word-based classification only, you can create features such as PARAGRAPH_WORDX=true and SENTENCE_WORDX=true.
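For illustration (not LingPipe itself, just to show the flattening), a minimal Python sketch of such a denormalised feature set, which any row-based learner can consume; the feature names mirror the PARAGRAPH_WORDX/SENTENCE_WORDX idea above:

from sklearn.feature_extraction import DictVectorizer

def flatten_features(paragraph_tokens, sentence_tokens):
    # one flat feature dict combining paragraph-level and sentence-level words
    feats = {"PARAGRAPH_WORD_" + w: True for w in paragraph_tokens}
    feats.update({"SENTENCE_WORD_" + w: True for w in sentence_tokens})
    return feats

paragraph = "the film was long but the acting was great".split()
sentence = "the acting was great".split()

row = flatten_features(paragraph, sentence)   # one training instance per sentence
print(sorted(row))

# DictVectorizer turns a list of such dicts into a matrix for any row-based classifier
X = DictVectorizer().fit_transform([row])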
Some other toolkits let you express your model without denormalisation; these are the so-called graphical models, examples being CRFs, ACRFs, Markov models, etc. Implementations of those can be found in MALLET and Factorie.

Related

Natural language generation evaluation

I was building a natural language generator using LSTM networks, but now I am stuck on how to evaluate my output. Suppose my training dataset consists of a dialogue act representation and the correct output sentence for that dialogue act. Now suppose I generate an output sentence y from my LSTM network; how do I evaluate that sentence against the one in the dataset? I mean, is there any way to compare the outputs so that I can use gradient descent to train my weights?
As soon as you find the answer, you'll be able to write a nice paper about it since that's kind of an open research question right now. :)
To the best of my knowledge, your evaluation has to combine syntactic and semantic plausibility of the output, context coherence, personality consistency, and discourse progression. There's no consensus on how to measure these optimally, but there are plenty of current papers on the topic.
Related introductory read by Liu et al: https://arxiv.org/abs/1603.08023
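As a concrete (if crude) starting point, word-overlap metrics such as BLEU are still widely used for this, even though the paper above shows they correlate poorly with human judgement. A minimal NLTK sketch with made-up sentences:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "what time would you like to book the table".split()
generated = "what time should i book the table for".split()

# sentence_bleu takes a list of reference token lists and one hypothesis token list
smooth = SmoothingFunction().method1
print("BLEU:", sentence_bleu([reference], generated, smoothing_function=smooth))

Note that BLEU is not differentiable, so it cannot serve directly as a gradient-descent loss; the usual training signal for an LSTM generator is token-level cross-entropy against the reference sentence (teacher forcing), with BLEU or human judgement used only for evaluation.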

Recent methods for finding semantic similarity between two short sentences or articles (on a concept level)

I'm working on finding similarities between short sentences and articles. I have used many existing methods such as tf-idf, word2vec, etc., but the results are just okay. The most relevant measure I found was Word Mover's Distance; however, its results are not much better than the other measures. I know it's a challenging problem, but I am wondering if there are any new methods that find approximate similarity at a higher or concept level rather than just matching words. In particular, are there any newer alternatives to Word Mover's Distance that look at the slightly higher-level semantics of a sentence or article?
This is the most recent approach I know of, based on a paper published four months ago.
Step 1:
Load a suitable model using gensim, calculate the word vectors for the words in the sentence, and store them as a list.
Step 2: Computing the sentence vector
Calculating semantic similarity between sentences used to be difficult, but recently a paper called "A Simple but Tough-to-Beat Baseline for Sentence Embeddings" proposed a simple approach: compute the weighted average of the word vectors in the sentence, then remove the projection of the average vectors onto their first principal component. Here the weight of a word w is a/(a + p(w)), with a a parameter and p(w) the (estimated) word frequency; this is called smooth inverse frequency (SIF). This method performs significantly better.
Simple code to calculate the sentence vector using SIF, the method proposed in the paper, has been given here.
Step 3: Using sklearn's cosine_similarity, load the two sentence vectors and compute the similarity.
This is the simplest and most efficient method for computing the semantic similarity of sentences.
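For illustration, a minimal Python sketch of steps 1-3, assuming gensim and scikit-learn; the model file name and the word-frequency estimates are placeholders, and note that the paper estimates the principal component over a whole corpus of sentences, not just two:

import numpy as np
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

# Step 1: load a pre-trained model (placeholder path) and look up word vectors
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def sif_embeddings(sentences, wv, word_freq, a=1e-3):
    # Step 2: SIF-weighted average of word vectors, then remove the projection
    # onto the first principal component
    vecs = []
    for sent in sentences:
        tokens = [t for t in sent.lower().split() if t in wv]
        weights = np.array([a / (a + word_freq.get(t, 1e-5)) for t in tokens])
        vecs.append(np.average([wv[t] for t in tokens], axis=0, weights=weights))
    vecs = np.array(vecs)
    pc = PCA(n_components=1).fit(vecs).components_[0]
    return vecs - np.outer(vecs @ pc, pc)

# placeholder unigram probabilities p(w); estimate these from a large corpus
freqs = {"the": 0.05, "movie": 0.001, "was": 0.01, "surprisingly": 1e-4, "good": 0.002,
         "i": 0.02, "really": 0.003, "enjoyed": 1e-4, "this": 0.01, "film": 0.001}

emb = sif_embeddings(["the movie was surprisingly good", "i really enjoyed this film"], wv, freqs)

# Step 3: cosine similarity between the two sentence vectors
print(cosine_similarity(emb[0:1], emb[1:2]))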
Obviously, this is a huge and busy research area, but I'd say there are two broad types of approaches you could look into:
First, there are some methods that learn sentence embeddings in an unsupervised manner, such as Le and Mikolov's (2014) Paragraph Vectors, which are implemented in gensim, or Kiros et al.'s (2015) SkipThought vectors, with an implementation on Github.
Then there also exist supervised methods that learn sentence embeddings from labelled data. The most recent one is Conneau et al.'s (2017), which trains sentence embeddings on the Stanford Natural Language Inference dataset, and shows these embeddings can be used successfully across a range of NLP tasks. The code is available on Github.
You might also find some inspiration in a blog post I wrote earlier this year on the topic of embeddings.
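If you want to try the Paragraph Vectors route mentioned above, here is a rough gensim Doc2Vec sketch; the toy corpus and hyperparameters are just placeholders:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    "the service was quick and the staff were friendly",
    "delivery took two weeks and nobody answered my emails",
    "fast shipping and great customer support",
]
tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

# train paragraph vectors on the (toy) corpus
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

# infer a vector for an unseen sentence and find the most similar training doc
vec = model.infer_vector("great support and fast delivery".split())
print(model.dv.most_similar([vec], topn=1))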
To be honest, the best thing I know of for this at the moment is AMR (Abstract Meaning Representation):
About AMR here: https://amr.isi.edu/
Documentation here: https://github.com/amrisi/amr-guidelines/blob/master/amr.md
You can use a system like JAMR (see here: https://github.com/jflanigan/jamr) to generate AMRs for your sentence and then you can use Smatch (see here: https://amr.isi.edu/eval/smatch/tutorial.html) to compare the similarity of the two generated AMRs.
What you are trying to do is very difficult and is an active ongoing area of research.
You can use semantic similarity with WordNet for each pair of nouns.
For a quick look, you can enter bird-noun-1 and chair-noun-1 and select WordNet at http://labs.fc.ul.pt/dishin/; it gives you:
Resnik 0.315625756544
Lin 0.0574161071905
Jiang&Conrath 0.0964964414156
The Python code is at: https://github.com/lasigeBioTM/DiShIn
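If you would rather compute these measures locally, NLTK exposes the same WordNet-based similarities; a small sketch (the exact numbers will differ from DiShIn's, since they depend on the information-content corpus used):

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# requires nltk.download('wordnet') and nltk.download('wordnet_ic')
brown_ic = wordnet_ic.ic('ic-brown.dat')

bird = wn.synset('bird.n.01')
chair = wn.synset('chair.n.01')

print("Resnik:", bird.res_similarity(chair, brown_ic))
print("Lin:", bird.lin_similarity(chair, brown_ic))
print("Jiang&Conrath:", bird.jcn_similarity(chair, brown_ic))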

Unsure about word embeddings, POS re: using a Neural Net for NLP classification.

I'm planning on using a NN for sarcasm detection on a number of tweets. I'm unsure how to prepare the word embeddings I will train the NN on. If I tokenize the strings and tag emoticons, capitalisation, user tags, hashtags, etc., how do I then combine the resulting strings with word embeddings? Do I train the word embeddings on the resulting corpus of tweets?
You can start by reading some papers on sarcasm detection in Twitter, e.g. Semi-Supervised Recognition of Sarcastic Sentences in Twitter and Amazon, which uses patterns of content words and high-frequency words, or, closer to your question, Sarcastic or Not: Word Embeddings to Predict the Literal or Sarcastic Meaning of Words, which uses word2vec. The latter views sarcasm detection as a disambiguation problem between the literal and sarcastic meanings of the same word. Perhaps you can employ this approach using the recently published sense2vec - A Fast and Accurate Method for Word Sense Disambiguation In Neural Word Embeddings.
Try to use the techniques used in the papers, and when you encounter a specific problem ask a question with a minimal working example.
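On the practical side of the question (whether to train the embeddings on the tokenised tweet corpus itself): yes, that is the usual route, and gensim keeps it short. A rough sketch, with a deliberately naive tokeniser standing in for your own preprocessing:

import re
from gensim.models import Word2Vec

def tokenize_tweet(text):
    # very rough normalisation: mask URLs and user mentions, lowercase, split on whitespace
    text = re.sub(r"https?://\S+", "<URL>", text)
    text = re.sub(r"@\w+", "<USER>", text)
    return text.lower().split()

tweets = [
    "Oh great, another Monday :( #sarcasm",
    "I just love being stuck in traffic @citytransit",
    "Wonderful weather today, really enjoying it",
]
corpus = [tokenize_tweet(t) for t in tweets]

# train embeddings on the normalised corpus; these vectors then feed the network
model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, epochs=50)
print(model.wv.most_similar("monday", topn=3))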

Features for sentiment analysis using Maxent model

I want to implement my own sentiment analysis using a maximum entropy model, without using any API. What would be the best features f(c,d) for my maximum entropy model? I have three classes: positive, negative and neutral.
Some of the most used and effective features in Sentiment Analysis are unigrams. Bigrams can also be employed, but it is quite controversial whether they are really useful or not.
Note that using frequency values of unigrams/bigrams does not significantly improve results in Sentiment Analysis; it is therefore generally sufficient to extract word types and use a boolean value to express their presence/absence in a text.
The important thing is how you preprocess text before you extract these features. For example, apart from lower-casing your tokens, handling negation scopes can improve your results when extracting unigram features.
In any case, Sentiment Analysis is a wide field. You will find that different feature extraction strategies could yield different results depending on the specific type of analysis you need to perform (e.g. feature-based analysis, subjectivity analysis, polarity analysis, etc.).
You can find almost everything you need to get started here:
http://sentiment.christopherpotts.net
Liu, Bing. "Sentiment analysis and opinion mining." Synthesis Lectures on Human Language Technologies 5.1 (2012): 1-167.
Pang, Bo, and Lillian Lee. "Opinion mining and sentiment analysis." Foundations and trends in information retrieval 2.1-2 (2008): 1-135.
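For illustration, here is a rough sketch of boolean unigram features with a simple negation-scope marker (a NEG_ prefix up to the next punctuation mark, in the spirit of the Potts notes linked above); the exact scheme is up to you, and the resulting dicts can be fed to any maxent/logistic-regression implementation:

NEGATION = {"not", "no", "never", "n't", "cannot"}
PUNCT = {".", ",", "!", "?", ";", ":"}

def unigram_features(tokens):
    # boolean presence features; tokens inside a negation scope get a NEG_ prefix
    features = {}
    negated = False
    for tok in tokens:
        tok = tok.lower()
        if tok in PUNCT:
            negated = False      # negation scope ends at punctuation
            continue
        if tok in NEGATION:
            negated = True
            continue
        features[("NEG_" if negated else "") + tok] = True
    return features

print(unigram_features("I did not like the plot , but the acting was great".split()))
# e.g. {'i': True, 'did': True, 'NEG_like': True, 'NEG_the': True, 'NEG_plot': True, 'but': True, ...}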

Machine Learning Text Classification technique

I am new to machine learning. I am working on a project where machine learning concepts need to be applied.
Problem Statement:
I have a large number (say 3000) of keywords. These need to be classified into seven fixed categories. Each category has training data (sample keywords). I need to come up with an algorithm such that, when a new keyword is passed to it, it predicts which category this keyword belongs to.
I am not aware of which text classification technique should be applied for this. Are there any tools that can be used?
Please help.
Thanks in advance.
This comes under linear classification. You can use a naive Bayes classifier for this. Most ML frameworks will have an implementation of naive Bayes, e.g. Mahout.
Yes, I would also suggest using Naive Bayes, which is more or less the baseline classification algorithm here. On the other hand, there are obviously many other algorithms; random forests and Support Vector Machines come to mind. See http://machinelearningmastery.com/use-random-forest-testing-179-classifiers-121-datasets/ If you use a standard toolkit, such as Weka, RapidMiner, etc., these algorithms should be available. There is also OpenNLP for Java, which comes with a maximum entropy classifier.
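For example, a minimal scikit-learn sketch of the Naive Bayes baseline on keyword data; the categories and keywords below are made up, and character n-grams are used because single keywords give very few word-level features:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# toy training data: keyword -> category (you would have seven categories)
keywords = ["mortgage rates", "credit card fees", "quarterly tax filing",
            "marathon training plan", "yoga for beginners", "interval running"]
categories = ["finance", "finance", "finance", "sports", "sports", "sports"]

clf = make_pipeline(CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                    MultinomialNB())
clf.fit(keywords, categories)

print(clf.predict(["tax deduction tips", "5k race schedule"]))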
You could use the Word2Vec cosine distance between a description of each of your categories and the keywords in the dataset, and then simply match each keyword to the category with the closest distance.
Alternatively, you could create a training dataset from keywords already matched to categories and use any ML classifier, for example one based on artificial neural networks, using the vectors of keyword cosine distances to each category as input to your model. But this could require a large quantity of training data to reach good accuracy. For example, the MNIST dataset contains 70,000 samples and allowed me to reach 99.62% cross-validation accuracy with a simple CNN, whereas for another dataset with only 2,000 samples I was only able to reach about 90% accuracy.
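A rough sketch of the first idea, matching each keyword to the closest category description by word-vector cosine similarity; the model file and the category descriptions are placeholders:

import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def avg_vector(text):
    # average the vectors of in-vocabulary words (None if nothing is in vocabulary)
    toks = [t for t in text.lower().split() if t in wv]
    return np.mean([wv[t] for t in toks], axis=0) if toks else None

category_descriptions = {
    "finance": "money banking loans credit investment",
    "sports": "running football training fitness race",
}
cat_vecs = {c: avg_vector(d) for c, d in category_descriptions.items()}

def classify(keyword):
    v = avg_vector(keyword)
    sims = {c: float(np.dot(v, cv) / (np.linalg.norm(v) * np.linalg.norm(cv)))
            for c, cv in cat_vecs.items()}
    return max(sims, key=sims.get)

print(classify("marathon schedule"))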
There are many classification algorithms. Your example looks like a text classification problem; some good classifiers to try would be SVM and naive Bayes. For SVM, the liblinear and libshorttext classifiers are good options (and have been used in many industrial applications):
liblinear: https://www.csie.ntu.edu.tw/~cjlin/liblinear/
libshorttext: https://www.csie.ntu.edu.tw/~cjlin/libshorttext/
They are also included with ML tools such as scikit-learn and WEKA.
With any classifier, it still takes some work to build and validate a practically useful model. One of the challenges is to mix discrete (boolean and enumerable) and continuous ('numeric') predictive variables seamlessly. Some algorithmic preprocessing is generally necessary.
Neural networks do offer the possibility of using both types of variables. However, they require skilled data scientists to yield good results. A straightforward option is to use an online classifier web service like Insight Classifiers to build and validate a classifier in one go; N-fold cross-validation is used there.
You can represent the presence or absence of each word in a separate column. The outcome variable is the desired category.
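Following that last point, a short sketch with binary presence/absence columns and a linear SVM (scikit-learn's LinearSVC is backed by liblinear); the data is made up:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

keywords = ["mortgage rates", "credit card fees", "marathon training plan", "yoga for beginners"]
categories = ["finance", "finance", "sports", "sports"]

# binary=True gives presence/absence columns rather than counts
vec = CountVectorizer(binary=True)
X = vec.fit_transform(keywords)

clf = LinearSVC().fit(X, categories)
print(clf.predict(vec.transform(["credit score check"])))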
