I am trying to implement sentiment analysis using a perceptron in Python to get better accuracy. I am lost in the maths that surrounds it and need an easy explanation of how to adapt it for sentiment analysis. There is already a paper published on this: http://aclweb.org/anthology/P/P11/P11-1015.pdf
Would anyone here be able to explain it in detail and with clarity? I have a training dataset and a test dataset of 5000 reviews each, and I am getting an accuracy of 78 percent with bag of words. I have been told a perceptron will give me an accuracy of 88 percent, and I am curious to implement it.
A perceptron is just a simple binary classifier that works on fixed-size vectors from R^n as input data. So in order to use it, you have to encode each of your documents as such a real-valued vector. It could be, for example, a bag-of-words representation (where each dimension corresponds to one word, and the value to its number of occurrences), or any "more complex" representation (one of which is described in the attached paper).
So in order to "port" the perceptron to sentiment analysis, you have to figure out some function f that, fed with a document, returns a real-valued vector, and then train your perceptron on pairs
(f(x),0) for negative reviews
(f(x),1) for positive reviews
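A minimal sketch of that recipe with scikit-learn (this only covers the plain bag-of-words f; the representation in the linked paper is more involved). The tiny lists below are placeholders for your 5000-review datasets:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score

# Placeholder data: replace with your own train/test reviews
train_texts = ["great movie, loved it", "terrible plot and bad acting"]
train_labels = [1, 0]  # 1 = positive review, 0 = negative review
test_texts = ["what a great film"]
test_labels = [1]

# f(x): encode each document as a fixed-size real-valued vector (word counts)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Train the perceptron on (f(x), label) pairs
clf = Perceptron(max_iter=1000)
clf.fit(X_train, train_labels)
print(accuracy_score(test_labels, clf.predict(X_test)))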
I am working on a classification problem with Twitter data. User-labeled tweets (relevant / not relevant) are used to train a machine learning classifier to predict whether an unseen tweet is relevant to the user or not.
I use simple preprocessing techniques like stopword removal and stemming, and sklearn's TfidfVectorizer to convert the words into numbers before feeding them into a classifier, e.g. SVM, kernel SVM, or Naïve Bayes.
I would like to determine which words (features) have the highest predictive power. What is the best way to do so?
I have tried a word cloud, but it just shows the words with the highest frequency in the sample.
UPDATE:
The following approach, along with sklearn's feature_selection, seems to provide the best answer so far to my problem: top features. Any other suggestions?
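For reference, a minimal sketch of the feature_selection route, with made-up tweets and labels standing in for the real data; chi2 is just one of several scoring functions sklearn offers:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Placeholder data: replace with your labeled tweets (1 = relevant, 0 = not relevant)
tweets = ["new phone review out now", "win a free prize click here", "reading about phones today"]
labels = [1, 0, 1]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(tweets)

# Score every word against the labels and rank the strongest ones
selector = SelectKBest(chi2, k=5).fit(X, labels)
ranked = sorted(zip(vectorizer.get_feature_names_out(), selector.scores_),
                key=lambda pair: pair[1], reverse=True)
print(ranked[:5])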
Have you tried using tf-idf? It creates a weighted matrix that gives greater weight to the more semantically meaningful words of each text. It compares the individual text (in this case a tweet) to all of the texts (all of the tweets). It is much more helpful than using raw term counts for classification and other tasks. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
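As a small illustration (the toy tweets below are made up): a word that appears in every text gets a lower idf weight than a word that is specific to one text.

from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["my phone is great", "my phone is broken", "my phone is fine"]
vectorizer = TfidfVectorizer()
vectorizer.fit(tweets)

# "phone" occurs in every tweet -> low idf; "broken" occurs in only one -> high idf
for word, idf in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    print(word, round(idf, 3))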
I am trying to do a Sentiment Analysis on Song Lyrics using Python.
After studying many simple classification problems with known labels (such as email classification, spam / not spam), I thought that lyrics sentiment analysis falls within the classification field.
While actually coding it, I discovered that I had to compute the sentiment for each song's lyrics, and probably add a column to the original dataset, marking it positive or negative, or using the actual sentiment score.
Couldn't this be done using a clustering approach? Since we don't know each song's class in the first place (positive sentiment / negative sentiment), wouldn't the algorithm cluster the data by sentiment?
Clustering usually won't produce sentiments.
It is more likely to produce, e.g., a cluster for rap and one for non-rap, or one cluster for songs with an even lyric length and one for odd length.
There is more in the data than sentiment. So why would clustering produce sentiment clusters?
If you want particular labels (positive sentiment, negative sentiment) then you need to provide training data and use a supervised approach.
You are thinking of clustering without supervision, i.e. unsupervised clustering, which might give low-accuracy results because you don't actually know the threshold value of the score that separates the positive and negative classes. So first try to find the threshold, which will be the parameter that separates your classes. Use supervised learning to find the threshold.
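If you do want to start from an automatic per-song score and then look for such a threshold, one option (not mentioned in the answers above, so treat it as an assumption) is a lexicon-based scorer like NLTK's VADER; the 0.0 cutoff below is only an illustrative starting point.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the sentiment lexicon
sia = SentimentIntensityAnalyzer()

# Placeholder lyric snippets; replace with your own dataset
lyrics = ["I feel the sunshine and I smile all day",
          "cold and broken, nothing left to say"]

for text in lyrics:
    score = sia.polarity_scores(text)["compound"]  # compound score in [-1, 1]
    label = "positive" if score >= 0.0 else "negative"  # tune this threshold
    print(label, round(score, 3), text)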
I am using scikit-learn's supervised learning methods for text classification. I have a training dataset with input text fields and the categories they belong to. I use a tf-idf + SVM classifier pipeline for creating the model. The solution works well for normal test cases. But if a new text is entered which contains synonyms of words in the training set, the solution fails to classify correctly.
For example: the word 'run' might be in the training data, but if I use the word 'sprint' to test, the solution fails to classify correctly.
What is the best approach here? Adding all synonyms for all words in the training dataset doesn't look like a scalable approach to me.
You should look into word vectors and dense document embeddings. Right now you are passing scikit-learn a matrix X, where each row is a numerical representation of a document in your dataset. You are getting this representation with tf-idf, but as you noticed this doesn't capture word similarities, and you are also having issues with out-of-vocabulary words.
A possible improvement is to represent each word with a dense vector of, let's say, dimension 300, in such a way that words with similar meanings are close in this 300-dimensional space. Fortunately you don't need to build these vectors from scratch (look up gensim word2vec and spaCy). Another good thing is that by using word embeddings pre-trained on a very large corpus like Wikipedia, you incorporate a lot of linguistic information about the world into your algorithm that you couldn't infer from your corpus otherwise (like the fact that sprint and run are near-synonyms).
Once you have a good, semantic numeric representation for words, you need a vector representation for each document. The simplest way is to average the word vectors of all words in the document.
Example pseudocode to get you started:
>>> import spacy
>>> nlp = spacy.load('en')  # with newer spaCy, load a model that ships word vectors, e.g. 'en_core_web_md'
>>> doc1 = nlp('I had a good run')
>>> doc1.vector
array([ 6.17495403e-02, 2.07064897e-02, -1.56451517e-03,
1.02607915e-02, -1.30429687e-02, 1.60102192e-02, ...
Now let's try a different document:
>>> doc2 = nlp('I had a great sprint')
>>> doc2.vector
array([ 0.02453461, -0.00261007, 0.01455955, -0.01595449, -0.01795897,
-0.02184369, -0.01654281, 0.01735667, 0.00054854, ...
>>> doc2.similarity(doc1)
0.8820845113100807
Note how the vectors are similar (in the sense of cosine similarity) even when the words are different. Because the vectors are similar, a scikit-learn classifier will learn to assign them to the same category. With a tf-idf representation this would not be the case.
This is how you can use these vectors in scikit-learn:
from sklearn.linear_model import LogisticRegression  # any scikit-learn classifier works here
X = [nlp(text).vector for text in corpus]  # one dense vector per document
clf = LogisticRegression()
clf.fit(X, y)
I am new to text mining. I am working on a spam filter. I did text cleaning and removed stop words. n-grams are my features, so I built a frequency matrix and a model using Naive Bayes. I have a very limited set of training data, so I am facing the following problem.
When a sentence comes in for classification and none of its features match the existing features from training, my frequency vector contains only zeros.
When I send this vector for classification, I obviously get a useless result.
What would be an ideal size of training data to expect better results?
Generally, the more data you have, the better, though you will hit diminishing returns at some point. It is often a good idea to check whether your training set size is a problem by plotting the cross-validation performance while varying the size of the training set. scikit-learn has an example of this type of "learning curve":
Scikit-learn Learning Curve Example
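A hedged sketch of such a learning curve, assuming X is your n-gram frequency matrix and y are the spam/ham labels (both are placeholders here, not defined in this snippet):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import learning_curve

# X: n-gram frequency matrix, y: spam/ham labels (assumed to exist already)
sizes, train_scores, test_scores = learning_curve(
    MultinomialNB(), X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))

plt.plot(sizes, train_scores.mean(axis=1), label="training score")
plt.plot(sizes, test_scores.mean(axis=1), label="cross-validation score")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()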
You may consider bringing in outside sample posts to increase the size of your training set.
As you grow your training set, you may want to try reducing the bias of your classifier. This could be done by adding n-gram features, or switching to a logistic regression or SVM model.
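For example, a lower-bias pipeline might look like this (train_texts / train_labels are placeholders for your labeled sentences):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Unigrams + bigrams instead of single words, and a logistic regression model
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)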
When a sentence comes in for classification and none of its features match the existing features from training, my frequency vector contains only zeros.
You should normalize your input so that it forms some kind of rough distribution around 0. A common method is this transformation:
input_signal = (feature - feature_mean) / feature_stddev
Then all zeroes would only happen if all features were exactly at the mean.
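A minimal sketch of that transformation with scikit-learn's StandardScaler (the toy matrix is a placeholder for your frequency matrix):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder feature matrix: rows = sentences, columns = n-gram counts
X_train = np.array([[3.0, 0.0, 1.0],
                    [1.0, 2.0, 0.0],
                    [2.0, 1.0, 5.0]])

scaler = StandardScaler()                 # (feature - feature_mean) / feature_stddev per column
X_scaled = scaler.fit_transform(X_train)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # columns now have mean ~0 and std ~1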
I have a cyclic method running which collects a dataset of 15,000 feature vectors with 30 dimensions (every 200 ms). My current setup simply feeds all raw feature vectors to an SVM with an RBF (radial basis function) kernel. The result is rather unconvincing, mostly because the classification is costly in terms of time. I know the dataset isn't that big, so real-time classification should be possible with the right subsampling or a reduced feature vector. The goal is to speed up the entire classification process (training/prediction) to a few milliseconds. To keep the approach unsupervised, I currently run k-means to label the feature vectors: I pick a few of the resulting clusters and assign them class 1, and all others class 0.
The idea is now the following:
collect all 15,000 (N) feature vectors with 30 (D) dimensions
run PCA on all N feature vectors
use the eigenvalues to determine a reduced feature vector with d dimensions (d < D)
feed the new set of (n < N) feature vectors (or: the eigenvectors?) to the SVM for training
Maybe a KNN approach instead of the SVM would give a similar result?
Does this approach make sense?
Any ideas to improve the process, or to change it in order to speed it up?
How do I determine the best value for d?
The classification accuracy shouldn't suffer too much from the time reduction.
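A minimal sketch of the pipeline described above, with random data standing in for the 15,000 x 30 feature vectors and the k-means-derived labels:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

# Stand-ins for the real 15000 x 30 feature vectors and the class 0/1 labels
X = np.random.rand(15000, 30)
y = np.random.randint(0, 2, 15000)

# Keep enough components to explain ~95% of the variance; inspecting the
# explained variance ratios is one common way to choose d
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print("d =", pca.n_components_, pca.explained_variance_ratio_)

# A linear SVM trains and predicts much faster than an RBF SVM at this size
clf = LinearSVC()
clf.fit(X_reduced, y)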
EDIT: Data stream mining
I was just reading about Data Stream Mining. I think this topic fits my setup quite well since I have to extract knowledge structures from continuous, rapid data records. Maybe I should replace the SVM with a Gradient Boosted Tree?
Thanks!