Problems in Unsupervised Aspect Based Sentiment Analysis - machine-learning

I'm working on unsupervised aspect-based sentiment analysis. I tried using VADER for it, which gave me good results, but the problem is that if the topic itself is negative, like 'food waste', then the sentiment always comes out negative, even when the content says 'and i really hate food waste'.
Can someone help me tackle this issue, or suggest a method better than VADER?
I've also tried 'Flair', but its results are not as promising as VADER's.

The rule-based model VADER uses is probably not a good approach in this case: the phrase 'hate food waste' contains words that will surely get a negative score. Remember that VADER is optimized for succinct social media data and cannot get a very good grasp of the "context" of a phrase.
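To see why a lexicon-based scorer behaves this way, here is a minimal sketch of the general idea. It is not VADER's actual algorithm, and the lexicon words and scores below are invented for illustration only. The scorer sums word valences without asking what each word is aimed at, so disliking a bad thing still comes out negative.

```python
# Toy valence lexicon. The entries and scores are made up for this example
# and are NOT taken from VADER's real lexicon.
TOY_LEXICON = {"hate": -3, "waste": -2, "love": 3, "good": 2}


def word_contributions(text):
    """Return (word, score) pairs for every lexicon word found in the text."""
    words = text.lower().split()
    return [(w, TOY_LEXICON[w]) for w in words if w in TOY_LEXICON]


def naive_score(text):
    """Sum the valences of all matched words, ignoring what they refer to."""
    return sum(score for _, score in word_contributions(text))


sentence = "and i really hate food waste"
print(word_contributions(sentence))  # both 'hate' and 'waste' count against the text
print(naive_score(sentence))         # negative, although the author opposes food waste
```

The aspect term ('waste') and the opinion word ('hate') both contribute to the same total, which is the behaviour described in the question.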
TextBlob uses a similar approach to VADER, and you can try it without much work: https://textblob.readthedocs.io/en/dev/
Usually the supervised route gives better results, but you need a good pretrained model and good data.
Naive Bayes classifier in scikit-learn:
https://www.datacamp.com/community/tutorials/simplifying-sentiment-analysis-python
Random Forest approach, again using scikit-learn:
https://stackabuse.com/python-for-nlp-sentiment-analysis-with-scikit-learn/
Here is a recap on various approaches to sentiment analysis:
https://towardsdatascience.com/fine-grained-sentiment-analysis-in-python-part-1-2697bb111ed4

Related

Different performance by different ML classifiers, what can I deduce?

I have used an ML approach in my research with Python's scikit-learn. I found that SVM and logistic regression classifiers work best (e.g. 85% accuracy), decision trees work markedly worse (65%), and Naive Bayes works worse still (40%).
I will write up the conclusion to illustrate the obvious: some ML classifiers worked better than others by a large margin. But what else can I say about my learning task or data structure based on these observations?
Edit:
The data set has 500,000 rows and 15 features, but some of the features are various combinations of substrings of certain text, so it naturally expands to tens of thousands of columns as a sparse matrix. I am using people's names to predict a binary class (e.g. gender), and I engineer a lot of features from the name, such as its length, its substrings, etc.
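For readers unfamiliar with this kind of feature expansion, the following is a rough sketch of how a name might be turned into length and substring features. The function name, the `^`/`$` boundary markers, and the choice of 2- and 3-character substrings are assumptions made for illustration, not the asker's actual pipeline.

```python
def name_features(name, n_values=(2, 3)):
    """Map a name to a sparse feature dict: its length plus substring counts."""
    # Mark the start and end so that prefixes and suffixes become distinct features.
    padded = f"^{name.lower()}$"
    features = {"length": len(name)}
    for n in n_values:
        for i in range(len(padded) - n + 1):
            key = f"sub={padded[i:i + n]}"
            features[key] = features.get(key, 0) + 1
    return features


print(name_features("Anna"))
```

Each distinct substring becomes its own column, which is why a handful of base features can grow into tens of thousands of mostly zero columns.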
I recommend visiting this awesome map on choosing the right estimator by the scikit-learn team: http://scikit-learn.org/stable/tutorial/machine_learning_map
As describing the specifics of your own case would be an enormous task (I totally understand you didn't do it!), I encourage you to ask yourself several questions, and I think the map on 'choosing the right estimator' is a good start.
Literally, go to the 'start' node in the map and follow the path:
is my number of samples > 50?
And so on. In the end you will arrive at some node, and you can see whether your results match the recommendations in the map (i.e. did I end up at an SVM, which gives me better results?). If so, go deeper into the documentation and ask yourself why that classifier performs better on text data, or whatever insight you get.
As I said, we don't know the specifics of your data, but you should be able to ask such questions: what type of data do I have (text, binary, ...), how many samples, how many classes to predict, and so on. Ideally your data will give you some hints about the context of your problem, and therefore about why some estimators perform better than others.
But yes, your question is really too broad to cover in a single answer (especially without knowing the type of problem you are dealing with). You could also check whether any of those approaches is more inclined to overfit, for example.
The list of recommendations could be endless, which is why I encourage you to start by defining the type of problem you are dealing with and your data (besides the number of samples: is it normalized? Is it sparse? Are you representing text as a sparse matrix? Are your inputs floats from 0.11 to 0.99?).
Anyway, if you want to share some specifics on your data we might be able to answer more precisely. Hope this helped a little bit, though ;)

General questions regarding text-classification

I'm new to topic models, classification, etc. I've been working on a project for a while now and have read a lot of research papers. My dataset consists of short messages that are human-labeled. This is what I have come up with so far:
Since my data is short, I read about Latent Dirichlet Allocation (and all its variants), which is useful for detecting latent topics in a document.
Based on this I found a Java implementation, JGibbLDA http://jgibblda.sourceforge.net, but since my data is labeled, there is an improvement on this called JGibbLabeledLDA https://github.com/myleott/JGibbLabeledLDA
In most of the research papers I read good reviews of Weka, so I experimented with it on my dataset.
However, again, my dataset is labeled, and therefore I found an extension of Weka called Meka http://sourceforge.net/projects/meka/ that has implementations for multi-label data.
Reading about multi-label data, I learned about the most used approaches, such as one-vs-all and chain classifiers...
Now the reason I'm here is that I hope to get answers to the following questions:
Is LDA a good approach for my problem?
Should LDA be used together with a classifier (NB, SVM, Binary Relevance, Logistic Regression, …) or is LDA 'enough' to function as a classifier/estimator for new, unseen data?
How should I interpret the output of JGibbLDA / JGibbLabeledLDA? How do I get from these files to something that tells me which words/labels are assigned to the WHOLE message (not just to each word)?
How can I use Weka/Meka to get what I want in the previous question (in case LDA is not what I'm looking for)?
I hope someone, or more than one person, can help me figure out how to do this. The general idea is not the issue here; I just don't know how to go from literature to practice. Most of the papers either don't describe their experiments in enough detail or are too technical for my background in these topics.
Thanks!

Machine Learning Algorithm selection

I am new to machine learning. My problem is to make a machine select a university for a student according to their location and area of interest, i.e. it should select a university in the same city as the student's address. I am confused about which algorithm to choose. Can I use the perceptron algorithm for this task?
There are no hard rules as to which machine learning algorithm is the best for which task. Your best bet is to try several and see which one achieves the best results. You can use the Weka toolkit, which implements a lot of different machine learning algorithms. And yes, you can use the perceptron algorithm for your problem -- but that is not to say that you would achieve good results with it.
From your description it sounds like the problem you're trying to solve doesn't really require machine learning. If all you want to do is match a student with the closest university that offers a course in the student's area of interest, you can do this without any learning.
I second the first remark that you probably don't need machine learning if the student has to live in the same area as the university. If you want to use an ML algorithm, it would be best to think about what data you would have to start with. What comes to mind is a vector for each university, with one feature per subject/area. Then compute the distance from each university vector to an ideal feature vector for the student, and pick the university that minimizes this distance.
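A minimal sketch of that distance idea might look like the following. The subject list, university names, and scores are all hypothetical, and Euclidean distance is just one reasonable choice of metric.

```python
import math

# Hypothetical data: one score per subject, in the order given by SUBJECTS.
SUBJECTS = ["math", "biology", "art"]
UNIVERSITIES = {
    "Uni A": [1.0, 0.2, 0.0],
    "Uni B": [0.1, 0.9, 0.3],
    "Uni C": [0.0, 0.1, 1.0],
}


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_university(student_vector, universities):
    """Return the university whose subject vector is nearest to the student's ideal."""
    return min(universities, key=lambda name: euclidean(universities[name], student_vector))


student = [0.0, 1.0, 0.2]  # mostly interested in biology
print(closest_university(student, UNIVERSITIES))
```

Nothing is learned here: the vectors are supplied by hand, which matches the point that the task may not need ML at all.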
The first and foremost thing you need is a labeled dataset.
It sounds like the problem could be framed as an ML problem; however, you first need a set of positive and negative examples to train from.
How big is your dataset? What features do you have available? Once you answer these questions you can select an algorithm that best fits the features of your data.
I would suggest using a decision tree for this problem, since it resembles a set of if/else rules. You can just take the student's location and area of interest as the conditions of if and else-if statements and then suggest a university. Since it's a direct mapping of inputs to outputs, a rule-based solution would work and no learning is required here.
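As a rough illustration of such a rule-based mapping, the sketch below simply filters a catalogue by city and area of interest. The catalogue entries, city names, and the `suggest` helper are hypothetical.

```python
# Hypothetical catalogue of universities.
CATALOGUE = [
    {"name": "Uni A", "city": "Springfield", "areas": {"physics", "math"}},
    {"name": "Uni B", "city": "Springfield", "areas": {"history"}},
    {"name": "Uni C", "city": "Riverton", "areas": {"physics"}},
]


def suggest(city, area, catalogue):
    """Return names of universities in the student's city that offer the area."""
    return [
        u["name"]
        for u in catalogue
        if u["city"].lower() == city.lower() and area.lower() in u["areas"]
    ]


print(suggest("Springfield", "physics", CATALOGUE))
print(suggest("Riverton", "history", CATALOGUE))  # empty: no match in that city
```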
Maybe you can use a "recommender system" or a clustering approach. You can look more deeply into techniques like "collaborative filtering" (recommender systems) or k-means (clustering), but again, as others have said, you first need data to learn from, and maybe your problem can be solved without ML.
Well, there is no straightforward and sure-shot answer to this question. The answer depends on many factors like the problem statement and the kind of output you want, type and size of the data, the available computational time, number of features, and observations in the data, to name a few.
Size of the training data
Accuracy and/or Interpretability of the output
Accuracy of a model means that the function predicts a response value for a given observation, which is close to the true response value for that observation. A highly interpretable algorithm (restrictive models like Linear Regression) means that one can easily understand how any individual predictor is associated with the response while the flexible models give higher accuracy at the cost of low interpretability.
Speed or Training time
Higher accuracy typically means higher training time. Also, algorithms require more time to train on large training data. In real-world applications, the choice of algorithm is driven by these two factors predominantly.
Algorithms like Naïve Bayes and linear and logistic regression are easy to implement and quick to run. Algorithms like SVMs, which involve parameter tuning, neural networks, which can take long to converge, and random forests need a lot of time to train.
Linearity
Many algorithms work on the assumption that classes can be separated by a straight line (or its higher-dimensional analog). Examples include logistic regression and support vector machines with a linear kernel. Linear regression algorithms assume that data trends follow a straight line. If the data is linear, then these algorithms perform quite well.
Number of features
The dataset may have a large number of features that may not all be relevant and significant. For a certain type of data, such as genetics or textual, the number of features can be very large compared to the number of data points.
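To make the linearity point above concrete, here is a small sketch of the perceptron mentioned in the question, written from scratch rather than taken from any library. The AND and XOR datasets, the epoch limit, and the learning rate are illustrative choices. AND can be separated by a straight line, so training reaches zero errors; XOR cannot, so some points are always misclassified.

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Train a perceptron; return weights, bias, and errors in the last epoch run."""
    w = [0.0] * len(samples[0])
    b = 0.0
    errors = 0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            pred = 1 if activation > 0 else 0
            if pred != y:
                # Standard perceptron update: move the boundary toward the mistake.
                errors += 1
                delta = lr * (y - pred)
                w = [wi + delta * xi for wi, xi in zip(w, x)]
                b += delta
        if errors == 0:
            break  # every point is classified correctly
    return w, b, errors


points = [(0, 0), (0, 1), (1, 0), (1, 1)]
_, _, and_errors = train_perceptron(points, [0, 0, 0, 1])  # linearly separable
_, _, xor_errors = train_perceptron(points, [0, 1, 1, 0])  # not linearly separable
print(and_errors, xor_errors)
```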

Text categorization using Naive Bayes

I am doing the text categorization machine learning problem using Naive Bayes. I have each word as a feature. I have been able to implement it and I am getting good accuracy.
Is it possible for me to use tuples of words as features?
For example, suppose there are two classes, Politics and Sports. The word 'government' might appear in both of them. However, in Politics I can have the tuple (government, democracy), whereas in Sports I can have the tuple (government, sportsman). So, if a new text article comes in that is about politics, the tuple (government, democracy) should have a higher probability than the tuple (government, sportsman).
I am asking because I wonder whether doing this violates the independence assumption of Naive Bayes, since I am considering single words as features too.
Also, I am thinking of adding weights to features. For example, a 3-tuple feature will have less weight than a 4-tuple feature.
Theoretically, are these two approaches not changing the independence assumptions on the Naive Bayes classifier? Also, I have not started with the approach I mentioned yet but will this improve the accuracy? I think the accuracy might not improve but the amount of training data required to get the same accuracy would be less.
Even without adding bigrams, real documents already violate the independence assumption. Conditioned on having Obama in a document, President is much more likely to appear. Nonetheless, Naive Bayes still does a decent job at classification, even if the probability estimates it gives are hopelessly off. So I recommend that you go ahead and add more complex features to your classifier and see if they improve accuracy.
If you get the same accuracy with less data, that is basically equivalent to getting better accuracy with the same amount of data.
On the other hand, using simpler, more common features works better as you decrease the amount of data. If you try to fit too many parameters to too little data, you tend to overfit badly.
But the bottom line is to try it and see.
No, from a theoretical viewpoint, you are not changing the independence assumption. You are simply creating a modified (or new) sample space. In general, once you start using higher n-grams as events in your sample space, data sparsity becomes a problem. I think using tuples will lead to the same issue. You will probably need more training data, not less. You will probably also have to give a little more thought to the type of smoothing you use. Simple Laplace smoothing may not be ideal.
The most important point, I think, is this: whatever classifier you are using, the features are highly dependent on the domain (and sometimes even the dataset). For example, if you are classifying the sentiment of movie reviews, using only unigrams may seem counterintuitive, but they perform better than using only adjectives. On the other hand, for Twitter datasets, a combination of unigrams and bigrams was found to be good, but higher n-grams were not useful. Based on such reports (ref. Pang and Lee, Opinion Mining and Sentiment Analysis), I think using longer tuples will show similar results, since, after all, tuples of words are simply points in a higher-dimensional space. The basic algorithm behaves the same way.
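For anyone who wants to try the "add tuples and see" advice from the answers above, here is a minimal from-scratch sketch of multinomial Naive Bayes with Laplace smoothing over unigrams plus adjacent-word bigrams. The class name, the tiny training set, and the choice to ignore unseen features at prediction time are assumptions made for illustration; on real data a library implementation and a better smoothing scheme would likely be preferable.

```python
import math
from collections import Counter, defaultdict


def features(text):
    """Unigrams plus adjacent-word tuples (bigrams) joined with an underscore."""
    words = text.lower().split()
    bigrams = [f"{a}_{b}" for a, b in zip(words, words[1:])]
    return words + bigrams


class TinyNaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant
        self.feature_counts = defaultdict(Counter)
        self.class_counts = Counter()
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for f in features(text):
                self.feature_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)  # log prior
            denom = sum(self.feature_counts[label].values()) + self.alpha * len(self.vocab)
            for f in features(text):
                if f in self.vocab:  # features never seen in training are skipped
                    score += math.log((self.feature_counts[label][f] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = TinyNaiveBayes().fit(
    ["government democracy election vote", "the government passed a law",
     "government sportsman award ceremony", "the team won the match"],
    ["politics", "politics", "sports", "sports"],
)
print(model.predict("government democracy"))
print(model.predict("government sportsman"))
```

The bigram features are treated exactly like extra words, which illustrates the point that the model itself does not change; only the sample space does.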

Unsupervised Sentiment Analysis

I've been reading a lot of articles that explain the need for an initial set of texts that are classified as either 'positive' or 'negative' before a sentiment analysis system will really work.
My question is: Has anyone attempted just doing a rudimentary check of 'positive' adjectives vs 'negative' adjectives, taking into account any simple negators to avoid classing 'not happy' as positive? If so, are there any articles that discuss just why this strategy isn't realistic?
A classic paper by Peter Turney (2002) explains a method to do unsupervised sentiment analysis (positive/negative classification) using only the words excellent and poor as a seed set. Turney uses the mutual information of other words with these two adjectives to achieve an accuracy of 74%.
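For reference, the quantities involved can be written as follows. This is the usual statement of pointwise mutual information and of the semantic orientation score built from the two seed words; the notation is chosen here for illustration.

```latex
\mathrm{PMI}(w_1, w_2) = \log_2 \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}
\qquad
\mathrm{SO}(\text{phrase}) = \mathrm{PMI}(\text{phrase}, \text{excellent}) - \mathrm{PMI}(\text{phrase}, \text{poor})
```

A phrase with positive SO co-occurs more strongly with 'excellent' than with 'poor', and a review is labeled by the average SO of its phrases.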
I haven't tried doing untrained sentiment analysis such as you are describing, but off the top of my head I'd say you're oversimplifying the problem. Simply analyzing adjectives is not enough to get a good grasp of the sentiment of a text; for example, consider the word 'stupid.' Alone, you would classify that as negative, but if a product review were to have '... [x] product makes their competitors look stupid for not thinking of this feature first...' then the sentiment in there would definitely be positive. The greater context in which words appear definitely matters in something like this. This is why an untrained bag-of-words approach alone (let alone an even more limited bag-of-adjectives) is not enough to tackle this problem adequately.
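A sketch of the rudimentary check described in the question shows the failure on this very example. The word lists are tiny and invented for illustration, and the negation handling (flip only the word immediately after a negator) is a deliberately simple assumption.

```python
# Tiny illustrative word lists, not a real sentiment lexicon.
POSITIVE = {"happy", "great", "good"}
NEGATIVE = {"stupid", "bad", "sad"}
NEGATORS = {"not", "never", "no"}


def adjective_polarity(text):
    """Count positive minus negative words, flipping a word that directly follows a negator."""
    score = 0
    negate = False
    for raw in text.lower().split():
        word = raw.strip(".,!?'\"")
        if word in NEGATORS:
            negate = True
            continue
        value = (word in POSITIVE) - (word in NEGATIVE)  # +1, -1, or 0
        if value:
            score += -value if negate else value
        negate = False  # a negator only affects the very next word
    return score


print(adjective_polarity("not happy"))
review = "this product makes their competitors look stupid for not thinking of this feature first"
print(adjective_polarity(review))  # negative score for a review that is actually positive
```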
The pre-classified data ('training data') helps in that the problem shifts from trying to determine whether a text is of positive or negative sentiment from scratch, to trying to determine if the text is more similar to positive texts or negative texts, and classify it that way. The other big point is that textual analyses such as sentiment analysis are often affected greatly by the differences of the characteristics of texts depending on domain. This is why having a good set of data to train on (that is, accurate data from within the domain in which you are working, and is hopefully representative of the texts you are going to have to classify) is as important as building a good system to classify with.
Not exactly an article, but hope that helps.
The paper by Turney (2002) mentioned by larsmans is a good basic one. In more recent research, Lin and He (2009) introduce an approach using Latent Dirichlet Allocation (LDA) to train a model that can classify an article's overall sentiment and topic simultaneously in a totally unsupervised manner. The accuracy they achieve is 84.6%.
I tried several methods of sentiment analysis for opinion mining in reviews.
What worked best for me is the method described in Liu's book: http://www.cs.uic.edu/~liub/WebMiningBook.html In this book, Liu and others compare many strategies and discuss different papers on sentiment analysis and opinion mining.
Although my main goal was to extract features from the opinions, I implemented a sentiment classifier to detect the positive or negative classification of these features.
I used NLTK for the pre-processing (word tokenization, POS tagging) and the creation of trigrams. I also used the Bayesian classifiers in this toolkit to compare against other strategies Liu points out.
One of the methods relies on tagging every trigram expressing this information as pos/neg and training a classifier on this data.
The other method I tried, which worked better (around 85% accuracy on my dataset), was summing the PMI (pointwise mutual information) scores of every word in the sentence with the words excellent/poor as seeds of the pos/neg classes.
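The seed-word scoring can be sketched roughly as below. This is not the code used in the answer: co-occurrence is approximated by "appears in the same document", the four-document corpus is invented, and the small `eps` added to the co-occurrence counts (to avoid taking the log of zero) is an assumption.

```python
import math


def semantic_orientation(word, docs, pos_seed="excellent", neg_seed="poor", eps=0.01):
    """PMI(word, pos_seed) - PMI(word, neg_seed), estimated from document co-occurrence."""
    token_sets = [set(d.lower().split()) for d in docs]

    def count(*terms):
        # Number of documents containing all the given terms.
        return sum(all(t in s for t in terms) for s in token_sets)

    near_pos = count(word, pos_seed) + eps
    near_neg = count(word, neg_seed) + eps
    # The word's own frequency cancels out when the two PMI values are subtracted.
    return math.log2((near_pos * count(neg_seed)) / (near_neg * count(pos_seed)))


def sentence_score(sentence, docs):
    """Sum the semantic orientation of every word in the sentence."""
    return sum(semantic_orientation(w, docs) for w in sentence.lower().split())


corpus = [
    "excellent camera with sharp pictures",
    "sharp screen and excellent battery",
    "poor battery and blurry pictures",
    "blurry screen poor value",
]
print(sentence_score("sharp pictures", corpus))  # positive
print(sentence_score("blurry screen", corpus))   # negative
```

The corpus must contain both seed words at least once, otherwise the ratio is undefined.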
I tried spotting keywords using a dictionary of affect to predict the sentiment label at sentence level. Given the generality of the vocabulary (not domain-dependent), the accuracy was only about 61%. The paper is available on my homepage.
In a somewhat improved version, negation adverbs were considered. The whole system, named EmoLib, is available for demo:
http://dtminredis.housing.salle.url.edu:8080/EmoLib/
Regards,
David,
I'm not sure if this helps, but you may want to look into Jacob Perkins's blog post on using NLTK for sentiment analysis.
There are no magic "shortcuts" in sentiment analysis, as with any other sort of text analysis that seeks to discover the underlying "aboutness" of a chunk of text. Attempting to shortcut proven text analysis methods through simplistic "adjective" checking or similar approaches leads to ambiguity, incorrect classification, etc., which at the end of the day gives you a poor accuracy read on sentiment. The more terse the source (e.g. Twitter), the more difficult the problem.

Resources