Mahout LDA: how to predict the topic on a test data set?

From the Apache Mahout website https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html I can see the procedure to fit an LDA model and output the computed topics in the form of P("word"|"topic number"). However, there is no information on how the trained model can be applied to a test data set to predict the topic distribution. Or should we write our own program that uses the output conditional probabilities to find the topics over a test data set?

Please have a look at the 2009 publication by Wallach et al. titled 'Evaluation Methods for Topic Models'. Section 4 describes three methods to calculate P(z|w): one based on importance sampling, and two others called the 'Chib-style estimator' and the 'left-to-right estimator'.
Mallet has an implementation of the left-to-right estimator.


Metrics for monitoring LDA Model

We use LDA for topic modelling in production. I was wondering if there are any metrics we could use to monitor the quality of this model, to understand when the model starts to perform poorly and we need to retrain it (for example, if we have too many new topics).
We are considering calculating the ratio of the number of words from the top topic (the topic with the highest probability for a document) that appear in the document to the total number of words in the document (after all processing), and comparing it against some threshold, but maybe someone can share their experience.
You can calculate the model's coherence value and compare it with the previous one. See Michael Roeder, Andreas Both and Alexander Hinneburg: "Exploring the Space of Topic Coherence Measures", and if you're using gensim with Python, check its implementation in CoherenceModel.
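For illustration, here is a minimal gensim sketch of that monitoring idea (the toy documents are placeholders; in practice you would recompute the coherence of your production model on recent documents and compare it against earlier values):

```python
# Sketch: monitor topic quality via c_v coherence (gensim).
# The toy documents below are illustrative placeholders.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [["cat", "dog", "pet", "vet", "animal"],
         ["python", "java", "code", "developer", "software"],
         ["dog", "pet", "vet", "animal", "cat"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)

cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence='c_v')
print(cm.get_coherence())  # compare against previous runs; a sustained drop suggests retraining
```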

Natural language generation evaluation

I am building a natural language generator using LSTM networks, but now I am stuck on how to evaluate my output. Suppose I have an input training data set that consists of a dialogue act representation and the correct output for that particular dialogue act. Now suppose I generate an output sentence y from my LSTM network; how do I evaluate that sentence in comparison to the one in the data set? I mean, is there any way to compare outputs so that I can use gradient descent to train my weights?
As soon as you find the answer, you'll be able to write a nice paper about it since that's kind of an open research question right now. :)
To the best of my knowledge, your evaluation has to combine syntactic and semantic plausibility of the output, context coherence, personality consistency and dynamic progression of the discourse. There is no consensus on how to optimally measure these, but there are plenty of current papers on the topic.
Related introductory read by Liu et al: https://arxiv.org/abs/1603.08023
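For the sentence-level comparison itself, the word-overlap metrics that Liu et al. analyse (BLEU, METEOR, ROUGE) are easy to compute, although their paper shows they correlate only weakly with human judgement. Such scores are typically used for held-out evaluation; the LSTM itself is usually trained with token-level cross-entropy against the reference sentence rather than by differentiating through a metric. A minimal NLTK sketch, with made-up example sentences:

```python
# Sketch: sentence-level BLEU as a rough automatic score for one generated sentence.
# The reference/generated token lists below are made up for illustration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the", "restaurant", "serves", "cheap", "italian", "food"]
generated = ["the", "restaurant", "offers", "cheap", "italian", "food"]

smooth = SmoothingFunction().method1  # avoids zero scores on short sentences
print(sentence_bleu([reference], generated, smoothing_function=smooth))
```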

Summarization Algo for novels: Supervised learning

I want to write a learning algorithm which can automatically create summaries of articles.
E.g., there are some fiction novels (one category, considering it as a filter) in PDF format. I want to make an automated process for creating their summaries.
We can provide some sample data to implement it with a supervised learning approach.
Kindly suggest how I can implement this properly.
I am a beginner: I am pursuing Andrew Ng's course and am aware of some common algorithms (linear regression, logistic regression, neural networks), plus Udacity statistics courses, and I am ready to dive further into NLP, deep learning, etc., but the motive is to solve this. :)
Thanks in advance
The keyword is Automatic Summarization.
Generally, there are two approaches to automatic summarization: extraction and abstraction.
Extractive methods work by selecting a subset of existing words, phrases, or sentences in the original text to form the summary.
Abstractive methods build an internal semantic representation and then use natural language generation techniques to create a summary that is closer to what a human might generate.
Abstractive summarization is a lot more difficult. An interesting approach is described in A Neural Attention Model for Abstractive Sentence Summarization by Alexander M. Rush, Sumit Chopra and Jason Weston (source code based on the paper here).
A "simple" approach is used in Microsoft Word's AutoSummarize tool:
AutoSummarize determines key points by analyzing the document and assigning a score to each sentence. Sentences that contain words used frequently in the document are given a higher score. You then choose a percentage of the highest-scoring sentences to display in the summary.
You can select whether to highlight key points in a document, insert an executive summary or abstract at the top of a document, create a new document and put the summary there, or hide everything but the summary.
If you choose to highlight key points or hide everything but the summary, you can switch between displaying only the key points in a document (the rest of the document is hidden) and highlighting them in the document. As you read, you can also change the level of detail at any time.
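As a rough Python sketch of that frequency-scoring idea (an illustration only, not Word's actual AutoSummarize algorithm; the tokenization and scoring are deliberately naive):

```python
# Naive frequency-based extractive summarizer, in the spirit of the description above.
import re
from collections import Counter

def summarize(text, ratio=0.2):
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))
    # score each sentence by the document-wide frequency of the words it contains
    scored = [(sum(freq[w] for w in re.findall(r'\w+', s.lower())), i, s)
              for i, s in enumerate(sentences)]
    keep = max(1, int(len(sentences) * ratio))
    top = sorted(scored, reverse=True)[:keep]
    # present the selected sentences in their original order
    return ' '.join(s for _, i, s in sorted(top, key=lambda t: t[1]))
```

For a novel, summarize(text, ratio=0.05) would keep roughly the top-scoring 5% of sentences; real systems also add stop-word removal, normalisation by sentence length and redundancy handling.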
Anyway, automatic text summarization is an active area of machine learning / data mining with much ongoing research. You should start by reading some good overviews:
Summarization evaluation: an overview by Inderjeet Mani.
A Survey on Automatic Text Summarization by Dipanjan Das and André F.T. Martins (emphasizes extractive approaches to summarization using statistical methods).

How to output resultant documents from Weka text-classification

So we are running a multinomial naive Bayes classification algorithm on a set of 15k tweets. We first break up each tweet into a vector of word features based on Weka's StringToWordVector function. We then save the results to a new ARFF file to use as our training set. We repeat this process with another set of 5k tweets and re-evaluate the test set using the same model derived from our training set.
What we would like to do is to output each sentence that Weka classified in the test set along with its classification. We can see the general information (precision, recall, F-score) on the performance and accuracy of the algorithm, but we cannot see the individual sentences that were classified by Weka based on our classifier. Is there any way to do this?
Another problem is that ultimately our professor will give us 20k more tweets and expect us to classify this new document. We are not sure how to do this however as:
All of the data we have been working with has been classified manually, both the training and test sets; however, the data we will be getting from the professor will be UNclassified. How can we re-evaluate our model on the unclassified data if Weka requires that the attribute information must be the same as the set used to form the model and the test set we are evaluating against?
Thanks for any help!
The easiest way to accomplish these tasks is to use a FilteredClassifier. This kind of classifier integrates a Filter and a Classifier, so you can connect a StringToWordVector filter with the classifier you prefer (J48, NaiveBayes, whatever); you always keep the original training set (unprocessed text), and you apply the classifier to new tweets (unprocessed) by using the vocabulary derived by the StringToWordVector filter.
You can see how to do this in the command line in "Command Line Functions for Text Mining in WEKA" and via a program in "A Simple Text Classifier in Java with WEKA".
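If it helps to see the idea outside of WEKA, here is the analogous pattern sketched with scikit-learn in Python (an analogy only, not WEKA's API; the tweets and labels below are placeholders): the vectorizer and the classifier are bundled into one pipeline, so the vocabulary learned from the raw training tweets is applied automatically to new, unlabelled tweets, and each tweet can be printed next to its predicted class.

```python
# Sketch of the FilteredClassifier idea using scikit-learn (an analogy, not WEKA itself).
# train_tweets, train_labels and new_tweets are placeholders for your data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_tweets = ["great match today", "terrible service at the store"]  # manually labelled
train_labels = ["positive", "negative"]
new_tweets = ["what a great day"]                                       # unlabelled data

pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(train_tweets, train_labels)   # vocabulary is learned here, on raw text

# print each unprocessed tweet alongside its predicted class
for tweet, label in zip(new_tweets, pipeline.predict(new_tweets)):
    print(label, tweet)
```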

Automatically create topics with (LDA, HDP)?

I am working on CV (curriculum vitae) classification, and I have used LDA.
My results over 3 different CV concepts (Marketing, Computer, Communication), setting N=3, were good.
Now the question is: how can I create a new topic (of course by adding it to the existing topics) for a new CV with the concept of Finance (or maybe some other concept)?
In fact my aim is to generate a new topic each time a new concept appears.
I'm getting different CVs every day with different concepts, and I am unsure which algorithm (HDP, online LDA) could be useful to make my classification automatic.
LDA and other topic models are not classification methods. They should be seen as dimensionality reduction / preprocessing / synonym discovery methods in the context of supervised learning: instead of representing a document to a classifier as a bag of words, you represent it as its posterior over the topics. Don't assume that because you have 3 classes in your classification task you should choose 3 topics for LDA. Topic model parameters should be set to best model the documents (as measured by perplexity, or some other quality metric of the topic model; check David Mimno's recent work for other possibilities), and the vector of topic probabilities / posterior parameters (or whatever you think is useful) should then be fed to a supervised learning method.
You'll see this is exactly the experimental setup followed by Blei et al. in the original LDA paper.
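A rough sketch of that setup in Python, using gensim for the topic model and scikit-learn for the supervised step (the documents, labels and number of topics below are placeholders, not recommendations):

```python
# Sketch: LDA document-topic posteriors as features for a supervised classifier.
# train_docs (tokenized CVs) and train_labels (known categories) are placeholders.
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from sklearn.linear_model import LogisticRegression

train_docs = [["marketing", "campaign", "brand"], ["python", "java", "developer"]]
train_labels = ["Marketing", "Computer"]

dictionary = Dictionary(train_docs)
corpus = [dictionary.doc2bow(doc) for doc in train_docs]
lda = LdaModel(corpus, id2word=dictionary, num_topics=10, random_state=0)  # tune, e.g. by perplexity

def topic_vector(doc):
    """Dense vector of topic probabilities for one tokenized document."""
    bow = dictionary.doc2bow(doc)
    vec = [0.0] * lda.num_topics
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic_id] = prob
    return vec

clf = LogisticRegression(max_iter=1000)
clf.fit([topic_vector(d) for d in train_docs], train_labels)
print(clf.predict([topic_vector(["finance", "accounting", "budget"])]))  # classify a new, tokenized CV
```

The topic model here just acts as a feature extractor; adding a new concept such as Finance is then a matter of adding labelled examples for the supervised classifier.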
