General questions regarding text-classification - machine-learning

I'm new to topic models, classification, etc. I have been working on a project for a while now and have read a lot of research papers. My dataset consists of short messages that are human-labeled. This is what I have come up with so far:
Since my data is short, I read about Latent Dirichlet Allocation (and all its variants), which is useful for detecting latent topics in a document.
Based on this I found a Java implementation called JGibbLDA, but since my data is labeled, there is an improved variant of it called JGibbLabeledLDA.
In most of the research papers I read good reviews of Weka, so I experimented with it on my dataset.
However, again, my dataset is labeled, and so I found an extension of Weka called Meka that has implementations for multi-label data.
Reading about multi-label data, I learned about the most commonly used approaches, such as one-vs-all and classifier chains.
The reason I am here is that I hope to get answers to the following questions:
Is LDA a good approach for my problem?
Should LDA be used together with a classifier (NB, SVM, Binary Relevance, Logistic Regression, …) or is LDA 'enough' to function as a classifier/estimator for new, unseen data?
How should I interpret the output of JGibbLDA / JGibbLabeledLDA? How do I get from these files to something that tells me which words/labels are assigned to the WHOLE message (not just to each word)?
How can I use Weka/Meka to get what I want in the previous question (in case LDA is not what I'm looking for)?
I hope someone, or more than one person, can help me figure out how to do this. The general idea is not the issue here; I just don't know how to go from literature to practice. Most of the papers either don't describe their experiments in enough detail or are too technical for my background in these topics.


Finding contradictory semantic sentences through natural language processing

I'm working on a project that aims to find semantically contradictory sentences (NLP, semantic search).
For example:
Our text is: "I ate today. The lunch was very tasty. I was an honest guest."
Query: "I had lunch with my friend"
We want to give the query to a model and have it score the meaning of each sentence relative to the query, in terms of synonyms and antonyms.
The solution that came to my mind was to first find the synonymous sentences, extract the keywords from them, look up the semantic opposites of those keywords, and then find the sentences that are semantically similar to those opposite words.
Do you think this idea is feasible? If you have a solution or experience in this area, please reply.
You have not mentioned the exact use case for your problem so I am not sure if the solution I know will help your cause. But there is an approach in NLP (using Deep learning) which helps to find whether two sentences are correlated, unrelated or contradictory.
Below is the information about the pretrained model which is trained specifically for this task ->
The dataset on which the above model is trained is given here ->
You can check the dataset to verify if your use case is related to the classification task performed on the dataset.
Since the model is already pretrained, you do not need to perform any training and can jump straight to evaluation. Once you are somewhat satisfied with the results, you can fine-tune the model a bit for your specific problem.
We can talk in comments if you need more clarification.

Different performance by different ML classifiers, what can I deduce?

I have applied an ML approach in my research using Python and scikit-learn. I found that SVM and logistic regression classifiers work best (e.g. 85% accuracy), decision trees work markedly worse (65%), and Naive Bayes works worse still (40%).
In my write-up I will state the obvious conclusion that some ML classifiers worked better than others by a large margin, but what else can I say about my learning task or data structure based on these observations?
The data set has 500,000 rows and 15 features, but some of the features are various combinations of substrings of certain text, so it naturally expands to tens of thousands of columns as a sparse matrix. I am using people's names to predict a binary class (e.g. gender), and I engineer many features from the name, such as its length, its substrings, etc.
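For readers unfamiliar with this kind of feature engineering, here is a minimal sketch of how a name might be turned into length and substring features. The function name `name_features`, the boundary markers `^` and `$`, and the n-gram sizes are assumptions made for illustration; they are not the asker's actual setup.

```python
def name_features(name, n_values=(2, 3)):
    """Map a name to a sparse dict of illustrative features."""
    s = name.strip().lower()
    feats = {"length": len(s)}
    if s:
        feats["first=" + s[0]] = 1
        feats["last=" + s[-1]] = 1
    # Mark the start and end so prefixes and suffixes become distinct substrings.
    padded = "^" + s + "$"
    for n in n_values:
        for i in range(len(padded) - n + 1):
            key = "sub=" + padded[i:i + n]
            feats[key] = feats.get(key, 0) + 1
    return feats


features = name_features("Anna")
```

Each distinct substring becomes its own column, which is how 15 nominal features can expand into tens of thousands of sparse columns. Dicts like this can be turned into a sparse matrix with scikit-learn's `DictVectorizer`.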
I recommend you visit the scikit-learn team's map on choosing the right estimator.
As describing the specifics of your own case would be an enormous task (I totally understand why you didn't do it!), I encourage you to ask yourself several questions. I think the map on 'choosing the right estimator' is a good start.
Literally, go to the 'start' node in the map and follow the path:
Is my number of samples > 50?
And so on. In the end you will arrive at some node, and you can see whether your results match the recommendations in the map (i.e. did I end up at an SVM, which gives me better results?). If so, go deeper into the documentation and ask yourself why that classifier performs better on text data, or whatever insight you get.
As I told you, we don't know the specifics of your data, but you should be able to ask such questions: what type of data do I have (text, binary, ...), how many samples, how many classes to predict, ... Ideally your data is going to give you some hints about the context of your problem, and therefore about why some estimators perform better than others.
But yeah, your question is too broad to cover in a single answer (especially without knowing the type of problem you are dealing with). You could also check whether any of those approaches is more inclined to overfit, for example.
The list of recommendations could be endless, which is why I encourage you to start by defining the type of problem you are dealing with and your data (besides the number of samples: is it normalized? Is it dispersed? Are you representing text as a sparse matrix? Are your inputs floats from 0.11 to 0.99?).
Anyway, if you want to share some specifics on your data we might be able to answer more precisely. Hope this helped a little bit, though ;)

What methods are there to classify documents?

I am trying to do document classification, but I am really confused about feature selection and tf-idf. Are they the same, or two different ways of doing classification?
I hope somebody can tell me. I am not really sure that my question will make sense to you guys.
Yes, you are confusing a lot of things.
Feature selection is the abstract term for choosing features (0 or 1). Stopword removal can be seen as feature selection.
TF is one method of extracting features from text: counting words.
IDF is one method of assigning weights to features.
Neither of them is classification... they are popular for text classification, but they are even more popular for information retrieval, which is not classification...
However, many classifiers work on numeric data, so the common process is:
1. Extract features (e.g. TF).
2. Select features (e.g. remove stopwords).
3. Weight features (e.g. IDF).
4. Train a classifier on the resulting numerical vectors.
5. Predict the classes of new/unlabeled documents.
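The five-step process described above can be sketched in plain Python. This is only an illustration under stated assumptions: the stopword list is a tiny made-up one, the IDF formula is one common smoothed variant, and the classifier is a simple nearest-centroid rule chosen for brevity (the answer does not prescribe a particular classifier). All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "is", "of", "and", "to", "in"}  # tiny illustrative list


def extract_tf(text):
    # Step 1: extract features by counting words (TF).
    return Counter(text.lower().split())


def select(tf):
    # Step 2: select features, here by dropping stopwords.
    return {w: c for w, c in tf.items() if w not in STOPWORDS}


def fit_idf(docs_tf):
    # Step 3a: learn one IDF weight per term from the training documents.
    n = len(docs_tf)
    df = Counter(w for tf in docs_tf for w in tf)
    return {w: math.log((1 + n) / (1 + d)) + 1.0 for w, d in df.items()}


def weight(tf, idf):
    # Step 3b: weight counts by IDF; terms unseen in training are dropped.
    return {w: c * idf[w] for w, c in tf.items() if w in idf}


def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def train(texts, labels):
    # Step 4: train a classifier; here, one summed TF-IDF vector per class.
    tfs = [select(extract_tf(t)) for t in texts]
    idf = fit_idf(tfs)
    centroids = defaultdict(Counter)
    for tf, y in zip(tfs, labels):
        centroids[y].update(weight(tf, idf))
    return idf, dict(centroids)


def predict(text, model):
    # Step 5: predict the class whose centroid is most similar to the new document.
    idf, centroids = model
    vec = weight(select(extract_tf(text)), idf)
    return max(centroids, key=lambda y: cosine(vec, centroids[y]))


model = train(
    ["the election of the president", "parliament votes in the election",
     "the compiler parses code", "code runs in a compiler"],
    ["politics", "politics", "cs", "cs"],
)
label = predict("a new compiler for code", model)
```

Note that TF and IDF only produce the numeric vectors; the decision itself comes from the classifier in steps 4 and 5, which could be swapped for Naive Bayes, an SVM, or anything else that accepts such vectors.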
Taking a look at this explanation may help a lot when it comes to understanding text classifiers.
TF-IDF is a good way to find a document that answers a given query, but by itself it does not assign classes to documents.
Examples that may be helpful:
1) You have a bunch of documents with subjects ranging from politics, economics, computer science and the arts. The documents belonging to each subject are separated into the appropriate directories for each subject (you have a labeled dataset). Now, you received a new document whose subject you do not know. In which directory should it be stored? A classifier can answer this question from the documents that are already labeled.
2) Now, you received a query regarding computer science. For instance, you received the query "Good methods for finding textual similarity". Which document in the directory of computer science can provide the best response to that query? TF-IDF would be a good approach to figure that out.
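The retrieval use in example 2 can be sketched as follows: weight every document and the query with TF-IDF, then rank documents by cosine similarity to the query. This is a minimal illustration, assuming whitespace tokenization and one simple IDF variant; the function names and the three sample documents are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    # Term counts per document, then weight each count by that term's IDF.
    tfs = [Counter(d.lower().split()) for d in docs]
    n = len(docs)
    df = Counter(w for tf in tfs for w in tf)
    idf = {w: math.log(n / d) + 1.0 for w, d in df.items()}
    return [{w: c * idf[w] for w, c in tf.items()} for tf in tfs], idf


def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rank(query, docs):
    """Return document indices ordered from most to least similar to the query."""
    vecs, idf = tfidf_vectors(docs)
    q_tf = Counter(query.lower().split())
    q = {w: c * idf[w] for w, c in q_tf.items() if w in idf}
    scores = [(cosine(q, v), i) for i, v in enumerate(vecs)]
    return [i for _, i in sorted(scores, key=lambda t: (-t[0], t[1]))]


docs = [
    "methods for sorting large arrays",
    "finding textual similarity between documents",
    "a history of operating systems",
]
order = rank("Good methods for finding textual similarity", docs)
```

No labels are involved here: the output is an ordering of documents for one query, not a class assignment, which is the distinction the two examples draw.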
So, when you are classifying documents, you are trying to make a decision about whether a document is a member of a particular class (like, say, 'about birds' or 'not about birds').
Classifiers predict the value of the class given a set of features. A good set of features will be highly discriminative - they will tell you a lot about whether the document is of one class or another.
Tf-idf (term frequency inverse document frequency) is a particular feature that seems to be discriminative for document classification tasks. There are others, like word counts (tf or term frequency) or whether a regexp matches the text or what have you.
Feature selection is the task of selecting good (discriminative) features. Tf-idf is probably a good feature to select.

Simple statistical yes/no classifier in WEKA

To evaluate the results of my research in labeled text classification, I need a baseline to compare against. One of my colleagues told me that one solution would be to make the simplest, dumbest classifier possible: one that makes its decision based only on the frequency of a particular label.
This means that if my dataset has a total of 100 samples and 80% of them have the label A, the classifier will label a sample 'A' 80% of the time. Since my entire research uses the Weka API, I have looked into the documentation, but unfortunately I haven't found anything about this.
So my question is: is it possible to implement such a classifier in Weka, and if yes, could someone point out how? This question is purely informational; I looked into this but did not find anything, so I hope to find an answer here.
The closest built-in baseline in Weka is ZeroR (weka.classifiers.rules.ZeroR). It simply predicts the most frequent class (in the case of nominal class attributes) or the mean (in the case of numeric class attributes). Note that it always predicts the majority class, so with 80% 'A' it answers 'A' every time rather than 80% of the time. If you want to know how to implement such a classifier yourself, look at the ZeroR source code.
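As a rough illustration outside the Weka API, the two baselines discussed here can be sketched in plain Python: an always-majority rule (what ZeroR does for nominal classes) and the frequency-proportional rule described in the question. The function names `zero_r` and `frequency_sampler` and the fixed seed are assumptions for this sketch only.

```python
import random
from collections import Counter


def zero_r(train_labels):
    """Return a predictor that always outputs the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda sample=None: majority


def frequency_sampler(train_labels, seed=0):
    """Return a predictor that draws labels in proportion to their training frequency."""
    counts = Counter(train_labels)
    labels = list(counts)
    weights = [counts[label] for label in labels]
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    return lambda sample=None: rng.choices(labels, weights=weights, k=1)[0]


train_labels = ["A"] * 80 + ["B"] * 20
always_majority = zero_r(train_labels)
proportional = frequency_sampler(train_labels)
```

Assuming the test data has the same 80/20 label distribution, the always-majority rule is expected to be right 80% of the time, while the proportional rule is expected to be right 0.8 * 0.8 + 0.2 * 0.2 = 68% of the time, so the majority rule is the stronger baseline to beat.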

Unsupervised Sentiment Analysis

I've been reading a lot of articles that explain the need for an initial set of texts that are classified as either 'positive' or 'negative' before a sentiment analysis system will really work.
My question is: Has anyone attempted just doing a rudimentary check of 'positive' adjectives vs 'negative' adjectives, taking into account any simple negators to avoid classing 'not happy' as positive? If so, are there any articles that discuss just why this strategy isn't realistic?
A classic paper by Peter Turney (2002) explains a method to do unsupervised sentiment analysis (positive/negative classification) using only the words excellent and poor as a seed set. Turney uses the mutual information of other words with these two adjectives to achieve an accuracy of 74%.
I haven't tried doing untrained sentiment analysis such as you are describing, but off the top of my head I'd say you're oversimplifying the problem. Simply analyzing adjectives is not enough to get a good grasp of the sentiment of a text; for example, consider the word 'stupid.' Alone, you would classify that as negative, but if a product review were to have '... [x] product makes their competitors look stupid for not thinking of this feature first...' then the sentiment in there would definitely be positive. The greater context in which words appear definitely matters in something like this. This is why an untrained bag-of-words approach alone (let alone an even more limited bag-of-adjectives) is not enough to tackle this problem adequately.
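The rudimentary check from the question, and the way it fails on the 'stupid' example, can be sketched like this. The word lists are tiny made-up lexicons, and the negation rule (a negator flips only the word immediately after it) is an assumption chosen for simplicity.

```python
POSITIVE = {"good", "great", "happy", "excellent"}
NEGATIVE = {"bad", "poor", "sad", "stupid"}
NEGATORS = {"not", "no", "never"}


def lexicon_sentiment(text):
    """Count positive and negative words; a negator flips only the next word."""
    score = 0
    negate = False
    for token in text.lower().split():
        word = token.strip(".,!?'\"")
        if word in NEGATORS:
            negate = True
            continue
        polarity = (word in POSITIVE) - (word in NEGATIVE)  # +1, -1 or 0
        if polarity:
            score += -polarity if negate else polarity
        negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


simple = lexicon_sentiment("I am not happy")
review = lexicon_sentiment(
    "This product makes their competitors look stupid "
    "for not thinking of this feature first"
)
```

The simple negator handles 'not happy' as intended, but the review sentence is scored as negative because 'stupid' is counted without regard to who it is aimed at, which is exactly the context problem described above.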
The pre-classified data ('training data') helps in that the problem shifts from trying to determine from scratch whether a text is of positive or negative sentiment, to trying to determine whether the text is more similar to positive texts or to negative texts, and classifying it that way. The other big point is that textual analyses such as sentiment analysis are often greatly affected by how the characteristics of texts differ from domain to domain. This is why having a good set of data to train on (that is, accurate data from within the domain in which you are working, hopefully representative of the texts you are going to have to classify) is as important as building a good system to classify with.
Not exactly an article, but hope that helps.
The paper by Turney (2002) mentioned by larsmans is a good basic one. In more recent research, Li and He [2009] introduce an approach using Latent Dirichlet Allocation (LDA) to train a model that can classify an article's overall sentiment and topic simultaneously in a totally unsupervised manner. The accuracy they achieve is 84.6%.
I tried several methods of sentiment analysis for opinion mining in reviews.
What worked best for me is the method described in Liu's book. In this book, Liu and others compare many strategies and discuss different papers on sentiment analysis and opinion mining.
Although my main goal was to extract features from the opinions, I implemented a sentiment classifier to detect the positive or negative polarity of these features.
I used NLTK for the pre-processing (word tokenization, POS tagging) and for creating the trigrams. I also used the Bayesian classifiers in this toolkit to compare against the other strategies Liu discusses.
One of the methods relies on tagging every trigram that expresses this information as pos/neg, and then training some classifier on this data.
The other method I tried, which worked better (around 85% accuracy on my dataset), was calculating the sum of the PMI (pointwise mutual information) scores between every word in the sentence and the words excellent/poor, used as seeds for the pos/neg classes.
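A minimal sketch of the seed-word PMI idea follows. It assumes that co-occurrence means "appears in the same sentence" and uses a small smoothing constant `eps` to avoid division by zero; both are choices made for this illustration, as are the function names and the toy corpus. The score for a word is PMI(word, excellent) minus PMI(word, poor), in which the word's own count and the corpus size cancel out.

```python
import math

SEEDS = ("excellent", "poor")


def semantic_orientation(word, corpus, eps=0.01):
    """PMI(word, 'excellent') - PMI(word, 'poor') from sentence co-occurrence."""
    pos_seed, neg_seed = SEEDS
    docs = [set(s.lower().split()) for s in corpus]
    near_pos = sum(1 for d in docs if word in d and pos_seed in d)
    near_neg = sum(1 for d in docs if word in d and neg_seed in d)
    if near_pos == 0 and near_neg == 0:
        return 0.0  # no evidence either way for this word
    n_pos = sum(1 for d in docs if pos_seed in d)
    n_neg = sum(1 for d in docs if neg_seed in d)
    ratio = ((near_pos + eps) * (n_neg + eps)) / ((near_neg + eps) * (n_pos + eps))
    return math.log2(ratio)


def sentence_polarity(sentence, corpus):
    """Sum the orientation of every non-seed word and report the sign."""
    words = [w for w in sentence.lower().split() if w not in SEEDS]
    total = sum(semantic_orientation(w, corpus) for w in words)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


corpus = [
    "excellent battery and bright screen",
    "the screen is excellent",
    "poor battery life",
    "poor support and slow delivery",
    "slow and poor service",
]
polarity = sentence_polarity("bright screen", corpus)
```

With a realistic corpus the counts would come from a much larger collection of text, and the quality of the result depends heavily on how well that collection matches the domain being classified.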
I tried spotting keywords using a dictionary of affect to predict the sentiment label at sentence level. Given the generality of the vocabulary (not domain-dependent), the accuracy was only about 61%. The paper is available on my homepage.
In a somewhat improved version, negation adverbs were considered. The whole system, named EmoLib, is available for demo:
I'm not sure if this helps, but you may want to look into Jacob Perkins's blog post on using NLTK for sentiment analysis.
There are no magic "shortcuts" in sentiment analysis, as with any other sort of text analysis that seeks to discover the underlying "aboutness" of a chunk of text. Attempting to shortcut proven text analysis methods through simplistic "adjective" checking or similar approaches leads to ambiguity, incorrect classification, etc., which at the end of the day gives you a poor accuracy read on sentiment. The more terse the source (e.g. Twitter), the more difficult the problem.
