Pre-processing before running sentiment analysis - Twitter

Sentiment analysis helps us gauge the sentiment of tweets; however, many of the tweets we get from the API may not really be 'classifiable' into any sentiment.
Does anyone know of any API/literature that discusses pre-processing a tweet before running any kind of classifier over it (e.g. removing '#', removing #names, etc.)?
Also, what topics/APIs/literature can I look up if I want to determine whether it even makes sense to run sentiment analysis on a tweet (say, as a movie review), before I run a sentiment analyzer over it?

Maybe you should read:
The Role of Pre-processing in Twitter Sentiment Analysis by Yanwei Bao, Changqin Quan, Lijuan Wang and Fuji Ren
Preprocessing the Informal Text for efficient Sentiment Analysis by I. Hemalatha, G.P. Saradhi Varma and A. Govardhan
(Then, in Python, apply tweet = re.sub(pattern, replacement, tweet) for each modification you want to perform.)
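For example, a minimal clean-up function along those lines might look like this (the specific patterns for URLs, @mentions, hashtags and retweet markers are illustrative assumptions, not taken from those papers):

import re

def preprocess_tweet(tweet):
    """Rough regex clean-up before classification (illustrative patterns only)."""
    tweet = re.sub(r"http\S+|www\.\S+", " ", tweet)   # drop URLs
    tweet = re.sub(r"@\w+", " ", tweet)               # drop @mentions
    tweet = re.sub(r"#(\w+)", r"\1", tweet)           # keep the hashtag word, drop '#'
    tweet = re.sub(r"\bRT\b", " ", tweet)             # drop the retweet marker
    tweet = re.sub(r"\s+", " ", tweet).strip()        # collapse whitespace
    return tweet.lower()

print(preprocess_tweet("RT @user Loving the new #Batman movie! http://t.co/xyz"))
# -> "loving the new batman movie!"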

I am using the TextBlob library for classifying my dataset.
TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
Features:
-Noun phrase extraction
-Part-of-speech tagging
-Sentiment analysis
-Classification (Naive Bayes, Decision Tree)
-Language translation and detection powered by Google Translate
-Tokenization (splitting text into words and sentences)
-Word and phrase frequencies
-Parsing
-n-grams
-Word inflection (pluralization and singularization) and lemmatization
-Spelling correction
-Add new models or languages through extensions
-WordNet integration
Get it now:
$ pip install -U textblob
$ python -m textblob.download_corpora
Reference: https://textblob.readthedocs.org/en/dev/
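Basic usage looks roughly like this (the example sentence is my own; TextBlob's default analyzer returns a (polarity, subjectivity) pair):

from textblob import TextBlob

blob = TextBlob("The new update is surprisingly good, but the app still crashes.")
print(blob.sentiment)             # Sentiment(polarity=..., subjectivity=...)
print(blob.sentiment.polarity)    # float in [-1.0, 1.0]; values above 0 lean positive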
*** I cannot tell you the results because this is part of my thesis and I am still working on it.

Actually, you'd be better off doing the dirty work yourself. Regular expressions make it easy to remove '#', '@' or URLs. Punctuation marks and emoticons are quite important for sentiment analysis; I recommend using the part-of-speech tagger trained by the CMU NLP group (http://www.cs.cmu.edu/~ark/TweetNLP/) to represent these characters.
For basic features like bag of words and tf-idf scores, I'd use scikit-learn (http://scikit-learn.org/stable/).
For single-word sentiment, you can use the Stanford NLP sentiment analysis tools (http://nlp.stanford.edu/sentiment/).
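To illustrate the scikit-learn part, a minimal sketch of extracting bag-of-words and tf-idf features (the toy tweets are my own and assumed to be already cleaned):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

tweets = [
    "loving the new batman movie",
    "worst sequel ever do not watch",
    "not sure how i feel about this one",
]

bow = CountVectorizer(ngram_range=(1, 2))      # raw counts for unigrams + bigrams
tfidf = TfidfVectorizer(ngram_range=(1, 2))    # tf-idf weighted version

X_bow = bow.fit_transform(tweets)              # sparse document-term matrices
X_tfidf = tfidf.fit_transform(tweets)
print(X_bow.shape, X_tfidf.shape)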

Related

How can we use dependency parser output for text embeddings or feature extraction from text?

Knowing the dependencies between various parts of a sentence can add information beyond what we get from the raw text. The question is how we can use this to build a good feature representation that can be fed into a classifier such as logistic regression or an SVM, just as TfidfVectorizer gives us a vector representation of text documents. What different methods are there to obtain this kind of representation from the output of a dependency parser?
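One simple option is to turn each dependency arc into a token-like string and then vectorize those strings like ordinary words; a minimal sketch, assuming spaCy for the parsing and scikit-learn for the vectorization:

import spacy
from sklearn.feature_extraction.text import CountVectorizer

# assumes `python -m spacy download en_core_web_sm` has been run beforehand
nlp = spacy.load("en_core_web_sm")

def dependency_arc_features(text):
    """Represent each (head, relation, child) arc as a single pseudo-token."""
    doc = nlp(text)
    return " ".join(f"{tok.head.lemma_}_{tok.dep_}_{tok.lemma_}" for tok in doc)

docs = ["The plot was boring but the acting was great."]
arc_strings = [dependency_arc_features(d) for d in docs]

vec = CountVectorizer(analyzer=str.split)      # split on whitespace, keep arcs intact
X = vec.fit_transform(arc_strings)             # feature matrix for e.g. logistic regression
print(vec.get_feature_names_out())             # scikit-learn >= 1.0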

Unsure about word embeddings, POS re: using a Neural Net for NLP classification.

I'm planning on using a neural network for sarcasm detection on a number of tweets. I'm unsure how to prepare the word embeddings I will train the NN on. If I tokenize the strings and tag emoticons, capitalisation, user tags, hashtags, etc., how do I then combine the resulting strings with word embeddings? Do I train the word embeddings on the resulting corpus of tweets?
You can start by reading some papers on sarcasm detection in Twitter, e.g. "Semi-Supervised Recognition of Sarcastic Sentences in Twitter and Amazon", which uses patterns of content words and high-frequency words, or, closer to your question, "Sarcastic or Not: Word Embeddings to Predict the Literal or Sarcastic Meaning of Words", which uses word2vec. The latter views sarcasm detection as a disambiguation problem between the literal and sarcastic meanings of the same word. Perhaps you can employ this approach using the recently published "sense2vec - A Fast and Accurate Method for Word Sense Disambiguation In Neural Word Embeddings".
Try the techniques used in those papers, and when you encounter a specific problem, ask a question with a minimal working example.
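If you do decide to train embeddings on your own tweet corpus, here is a rough sketch with gensim (the toy tokenized tweets and hyperparameters are placeholders; gensim itself is an assumption, not something the papers above prescribe):

from gensim.models import Word2Vec

# already-tokenized tweets, with emoticons/hashtags kept as their own tokens
tokenized_tweets = [
    ["oh", "great", "another", "monday", ":)", "#sarcasm"],
    ["i", "just", "love", "waiting", "in", "line", "for", "hours"],
]

# gensim 4.x API; older versions use `size` instead of `vector_size`
model = Word2Vec(sentences=tokenized_tweets, vector_size=100,
                 window=5, min_count=1, workers=2)

vector = model.wv["#sarcasm"]    # 100-dimensional embedding for one token
print(vector.shape)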

Features for sentiment analysis using Maxent model

I want to implement my own sentiment analysis using a maximum entropy model, without using any API. What would be the best features f(c,d) for my maximum entropy model? I have three classes: positive, negative and neutral.
Some of the most used and effective features in Sentiment Analysis are unigrams. Bigrams can also be employed, but it is quite controversial whether they are really useful or not.
Note that using frequency values of unigrams/bigrams does not significantly improve results in Sentiment Analysis; it is therefore generally sufficient to extract word types and use a boolean value to express their presence/absence in a text.
The important thing is how you preprocess text before you extract these features. For example, apart from lower-casing your tokens, handling negation scopes can improve your results when extracting unigram features.
In any case, Sentiment Analysis is a wide field. You will find that different feature extraction strategies could yield different results depending on the specific type of analysis you need to perform (e.g. feature-based analysis, subjectivity analysis, polarity analysis, etc.).
You can find almost everything you need to get started here:
http://sentiment.christopherpotts.net
Liu, Bing. "Sentiment analysis and opinion mining." Synthesis Lectures on Human Language Technologies 5.1 (2012): 1-167.
Pang, Bo, and Lillian Lee. "Opinion mining and sentiment analysis." Foundations and trends in information retrieval 2.1-2 (2008): 1-135.
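As a minimal sketch of the boolean-unigram approach described above, using scikit-learn's LogisticRegression as the maximum entropy classifier (the toy data and the crude negation handling are illustrative assumptions only):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def mark_negation(text):
    """Prefix tokens after a negation word with NOT_ until the next punctuation mark."""
    out, negate = [], False
    for tok in text.lower().split():
        if tok in {"not", "no", "never"} or tok.endswith("n't"):
            negate = True
            out.append(tok)
        elif tok in {".", ",", "!", "?", ";"}:
            negate = False
            out.append(tok)
        else:
            out.append("NOT_" + tok if negate else tok)
    return " ".join(out)

texts = ["great movie , loved it !", "not a good film .", "it was ok ."]
labels = ["positive", "negative", "neutral"]

# binary=True gives presence/absence unigram features; logistic regression is a MaxEnt model
model = make_pipeline(CountVectorizer(binary=True), LogisticRegression(max_iter=1000))
model.fit([mark_negation(t) for t in texts], labels)
print(model.predict([mark_negation("not a great movie .")]))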

Comparison of binary vs tfidf Ngram features in sentiment analysis / classification tasks?

Simple question again: is it better to use n-grams (unigrams/bigrams, etc.) as simple binary features, or rather to use their tf-idf scores, in ML models such as support vector machines for NLP tasks such as sentiment analysis or text categorization/classification?
As Steve mentioned in the comment, the best answer (and the ML-style way) is to try!
That being said, I'd start with binary features. The goal of an ML model like an SVM is to determine the "weight" of these features, so if it is effective, you don't have to try to set this weight in advance (with tf-idf or anything else).
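In that spirit, a small sketch of how you might compare the two empirically (the toy data, the 3-fold cross-validation and the LinearSVC choice are assumptions for illustration):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# stand-in for a real labelled corpus
texts = ["great movie", "terrible plot", "loved it", "awful acting",
         "brilliant and funny", "boring and slow"]
labels = [1, 0, 1, 0, 1, 0]

models = {
    "binary": make_pipeline(CountVectorizer(binary=True, ngram_range=(1, 2)), LinearSVC()),
    "tfidf": make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC()),
}

for name, model in models.items():
    scores = cross_val_score(model, texts, labels, cv=3)   # just try both and compare
    print(name, scores.mean())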

Doing a hierarchical sentiment analysis with LingPipe

This is in the context of doing sentiment analysis using the LingPipe machine learning tool. I have to classify whether a sentence in a big paragraph has a positive or negative sentiment. I know of the following approach in LingPipe:
Classify the complete paragraph based on its polarity - negative or positive.
Here, I still don't know the polarity at the sentence level; we are only at the paragraph level. How do I determine polarity at the sentence level of a paragraph, i.e. whether a given sentence in a paragraph is positive or negative? I know that LingPipe is capable of classifying whether a sentence is subjective or objective. So, using this approach, should I:
First train LingPipe on a large set of sentences that are subjective/objective.
Use the trained model to extract all subjective sentences out of a test paragraph.
Train a LingPipe polarity classifier on the extracted subjective sentences by manually labeling them as positive/negative.
Now use the trained polarity model and feed it a test subjective sentence (obtained by passing a sentence through the trained subjective/objective model), and then determine whether the statement is positive/negative?
Does the above approach work? In the proposed approach, we know that LingPipe is capable of accepting large textual content (a paragraph) for polarity classification. Will it do a good job if we just pass it a single subjective sentence for polarity classification? I am confused!
You might want to take a look at the multi-level analysis approaches in the literature, e.g.
Li, S., et al. (2010). "Exploiting Combined Multi-level Model for Document Sentiment Analysis," 2010 International Conference on Pattern Recognition.
Yessenalina, A., et al. (2010). "Multi-level Structured Models for Document-level Sentiment Classification," Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1046-1056, MIT, Massachusetts, USA, 9-11 October 2010.
Multi-level analysis approaches are quite common in information retrieval, as in content indexing for vector space similarity search.
Environments such as LingPipe are a good way to get started, but eventually you need to employ lower-level, finer-grained tools such as those yura suggested.
Most machine learning libraries, including LingPipe, are row-based (each object has a flat set of features). So if you want to do hierarchical classification with them, you should denormalize your data; for example, you can have features of the paragraph and of the sentence in the same feature set. If you use word-only classification, you can create features such as PARAGRAPH_WORDX=true and SENTENCE_WORDX=true.
Some other toolkits let you express your model without denormalization; these are the so-called graphical models, examples of which are CRFs, ACRFs, Markov models, etc. You can find implementations of those in Mallet and FACTORIE.
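A tiny sketch of what that denormalized, row-based representation could look like (the PARAGRAPH_/SENTENCE_ feature naming follows the idea above; DictVectorizer and the toy paragraph are my own assumptions):

from sklearn.feature_extraction import DictVectorizer

def denormalized_features(paragraph, sentence):
    """One row per sentence, duplicating word features at both levels."""
    feats = {}
    for word in paragraph.lower().split():
        feats["PARAGRAPH_WORD_" + word] = True
    for word in sentence.lower().split():
        feats["SENTENCE_WORD_" + word] = True
    return feats

paragraph = "The movie started well. The ending was a complete letdown."
sentences = ["The movie started well.", "The ending was a complete letdown."]

rows = [denormalized_features(paragraph, s) for s in sentences]
X = DictVectorizer().fit_transform(rows)    # flat feature matrix for a row-based classifier
print(X.shape)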
