I have a dataset of SMS messages which is ill-formatted and sparse. I tried topic modeling to extract the possible topics in each message, along with the probability of each associated topic. I need the probabilities so I can rank the topics for each message.
What I am thinking about as an alternative solution is to label my dataset manually and use a supervised classification algorithm such as Naive Bayes.
My SMS messages are sparse and contain spam content, which is why I assume topic modeling did not work well.
The challenges I am facing:
Is the alternative approach, using a supervised classification method, reasonable, or should I stick with an unsupervised method like topic modeling?
How should I process the dataset: should each message have exactly one category as its label, or can I assign multiple categories?
Is this a multi-label or multi-class classification problem?
If you know what the topics are, then use supervised Naive Bayes. Unsupervised learning can be used for class discovery.
Assigning multiple topics to a sample is not a problem.
Naive Bayes assigns a label to a sample based on the topic with the highest probability. Naturally, you can use the top x probabilities (perhaps with a threshold) to assign multiple topics.
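A minimal sketch of this thresholding idea with scikit-learn's `MultinomialNB`. The messages, labels, and the 0.2 threshold are illustrative assumptions, not from the original question:

```python
# Sketch: assign multiple topics per SMS by thresholding Naive Bayes
# class probabilities, then ranking by probability.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_msgs = [
    "win a free prize now", "call now to claim cash",
    "meeting moved to 3pm", "see you at the office",
]
train_labels = ["spam", "spam", "work", "work"]

vec = CountVectorizer()
X = vec.fit_transform(train_msgs)
clf = MultinomialNB().fit(X, train_labels)

new_msg = vec.transform(["free cash prize at the office"])
probs = clf.predict_proba(new_msg)[0]

# Keep every topic whose probability clears the threshold,
# ranked from most to least likely.
threshold = 0.2
ranked = sorted(zip(clf.classes_, probs), key=lambda t: -t[1])
topics = [(label, p) for label, p in ranked if p >= threshold]
print(topics)
```

With more than two labels, the same threshold loop yields a ranked multi-topic assignment per message.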
Related
I need to cluster a group of documents based on the intent they have, and I am planning to use LDA (Latent Dirichlet Allocation) topic modeling.
Can I get intents from topic modeling to group the documents? Are there other algorithms that cluster documents based on the intents they have? Is this approach of using topic modeling for intent clustering a good one?
I have been trying the LDA algorithm for topic modeling and am able to get a list of topics, but I am not sure whether I can consider the topics as intents themselves.
I am looking for an approach that clusters a group of documents based on the intents they have.
As stated here, LDA disregards the structure of how words interact with each other, so it is not well suited to intent modeling:
As a bag-of-words model is used to represent the documents, LDA can suffer from the same disadvantages as the bag-of-words model. The LDA model learns a document vector that predicts words inside of that document while disregarding any structure or how these words interact on a local level.
Consider the following two sentences:
This is his sister's dog (statement)
Is this his sister's dog (question)
Same words, different order, different intent.
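A quick check of this point: under a bag-of-words representation (which is what LDA sees), both sentences map to exactly the same word-count vector, so no bag-of-words model can tell them apart:

```python
# Both sentences produce identical count vectors under bag-of-words,
# since only word frequencies survive, not word order.
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["This is his sister's dog", "Is this his sister's dog"]
vec = CountVectorizer()
X = vec.fit_transform(sentences).toarray()
print(X[0], X[1])  # identical rows
```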
You will probably need labeled data, and the use of neural networks such as CNNs or LSTMs.
I have read the answer here. But I can't apply it to one of my examples, so I probably still don't get it.
Here is my example:
Suppose that my program is trying to learn PCA (principal component analysis), or the diagonalization process.
I have a matrix, and the answer is its diagonalization:
A = PDP⁻¹
If I understand correctly:
In supervised learning I will have all the trials along with their errors.
My question is:
What will I have in unsupervised learning?
Will I have an error for each trial as I go along, rather than all the errors in advance? Or is it something else?
First of all, PCA is neither used for classification nor for clustering. It is an analysis tool for data, where you find the principal components in the data. This can be used for e.g. dimensionality reduction. Supervised and unsupervised learning have no relevance here.
However, PCA can often be applied to data before a learning algorithm is used.
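A minimal sketch of that usage, applying PCA as a preprocessing step before a supervised learner (the Iris dataset and two-component reduction are illustrative choices):

```python
# PCA as preprocessing: reduce 4 features to 2 principal components,
# then fit a supervised classifier on the reduced data.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
model = make_pipeline(PCA(n_components=2), LogisticRegression(max_iter=1000))
model.fit(X, y)
print(model.score(X, y))  # training accuracy after the 4 -> 2 reduction
```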
In supervised learning, you have (as you say) a labeled set of data with "errors".
In unsupervised learning you don't have any labels, i.e., you can't validate against a ground truth at all. All you can do is cluster the data somehow. The goal is often to achieve clusters that are internally homogeneous. Success can be measured, e.g., using the within-cluster variance metric.
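As a small sketch of that metric, k-means in scikit-learn exposes the total within-cluster sum of squared distances as `inertia_` (the two-blob data here is synthetic):

```python
# Measuring clustering quality without labels: k-means reports the
# total within-cluster variance (sum of squared distances) as inertia_.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs
data = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(km.inertia_)  # lower = tighter, more homogeneous clusters
```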
Supervised Learning:
-> You give labeled example data as input along with the correct answers.
-> The algorithm will learn from it and start predicting correct results based on new inputs.
example: email spam filter
Unsupervised Learning:
-> You give just the data and don't provide any labels or correct answers.
-> The algorithm automatically analyses patterns in the data.
example: google news
In the normal case, I had earlier tried out Naive Bayes and linear SVM to classify comments related to a certain page, where I had access to training data manually labelled as spam or ham.
Now I am being told to check whether there are ways to classify comments as spam when we don't have training data. Something like getting two clusters, which would be marked as spam or ham, given any data.
I need to know certain ways to approach this problem and what would be a good way to implement this.
I am still learning and experimenting. Any help will be appreciated.
Are the new comments very different from the old comments in terms of vocabulary? Because words are almost all the classifiers for this task look at.
You can always try using your old training data and applying the classifier to the new domain. You would have to label a few examples from your new domain in order to measure performance (or better, let others do the labeling in order to get more reliable results).
If this doesn't work well, you could try domain adaptation, or look for datasets more similar to your new domain, using Google or these spam/ham corpora.
Finally, there may be some regularity or pattern in your new setting, e.g. downvotes for a comment, which may indicate spam/ham. In such cases, you could compile training data yourself. This would then be called distant supervision (you can search for papers using this keyword).
The best I could find was this research work, which mentions active learning. What I came up with is this: first I performed k-means clustering and took the central clusters (assuming 5 clusters, I took 3 clusters in descending order of length) and took 1000 messages from each. I then had these labelled by the user. The next step was training a logistic regression model on the labelled data and getting probabilities for the unlabelled data; whenever a probability is close to 0.5 (in the range 0.4 to 0.6), meaning the model is uncertain, I send that message back to be labelled, and the process continues.
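The loop described above can be sketched roughly as follows. The synthetic feature vectors, the seed-batch size, and the use of `true_y` as a stand-in for human labels are all illustrative assumptions; only the cluster-then-label-then-route-uncertain structure and the 0.4–0.6 band come from the description:

```python
# Rough active-learning sketch: 1) cluster, 2) label a seed batch,
# 3) train logistic regression, 4) route uncertain messages
# (probability in [0.4, 0.6]) back to the annotator.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 5)), rng.normal(3, 1, (200, 5))])
true_y = np.array([0] * 200 + [1] * 200)  # stands in for human labels

# Step 1: cluster, then pick a seed batch spread across clusters.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
seed = np.concatenate([np.where(km.labels_ == c)[0][:10] for c in range(5)])

# Steps 2-3: train on the (human-)labelled seed set.
clf = LogisticRegression().fit(X[seed], true_y[seed])

# Step 4: anything with spam probability in [0.4, 0.6] is uncertain
# and goes back for manual labelling; the rest is auto-labelled.
p = clf.predict_proba(X)[:, 1]
uncertain = np.where((p >= 0.4) & (p <= 0.6))[0]
print(len(uncertain), "messages need manual labels")
```

In the real setting, the newly labelled messages would be folded back into the training set and the train/route cycle repeated.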
I am working on CV (Curriculum Vitae) classification, for which I have used LDA.
My results over 3 different CV concepts (Marketing, Computer, Communication), setting N=3, were good.
Now the question is: how can I create a new topic (by adding it to the existing topics) for new CVs with a Finance concept (or maybe some other concept)?
In fact, my aim is to generate a new topic each time a new concept appears.
I get different CVs every day with different concepts, and I am unsure which algorithm (HDP, online LDA) would be useful to make my classification automatic.
LDA and other topic models are not classification methods. They should be seen as dimensionality-reduction/preprocessing/synonym-discovery methods in the context of supervised learning: instead of representing a document to a classifier as a bag of words, you represent it as its posterior over the topics.
Don't assume that because you have 3 classes in your classification task you should choose 3 topics for LDA. Topic-model parameters should be set to best model the documents (as measured by perplexity, or some other quality metric of the topic model; check David Mimno's recent work for other possibilities), and the vector of topic probabilities/posterior parameters (or whatever you think is useful) should then be fed to a supervised learning method.
You'll see this is exactly the experimental set up followed by Blei et al in the original LDA paper.
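A minimal sketch of that setup with scikit-learn: fit LDA on the documents, then feed each document's topic-probability vector to a supervised classifier. The toy documents, labels, and the choice of 4 topics (deliberately not equal to the 3 classes) are illustrative assumptions:

```python
# LDA as preprocessing: represent each document by its topic posterior,
# then train a classifier on those topic vectors.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

docs = [
    "budget revenue forecast audit", "marketing campaign brand audience",
    "python code compiler debugging", "cash flow audit revenue",
    "brand slogan campaign design", "debugging unit tests code",
]
labels = ["finance", "marketing", "computer"] * 2

vec = CountVectorizer()
X = vec.fit_transform(docs)

# Number of topics is a modelling choice, not tied to the 3 classes.
lda = LatentDirichletAllocation(n_components=4, random_state=0)
topic_vecs = lda.fit_transform(X)  # per-document topic posterior

clf = LogisticRegression(max_iter=1000).fit(topic_vecs, labels)
new_doc = vec.transform(["audit of the marketing budget"])
pred = clf.predict(lda.transform(new_doc))[0]
print(pred)
```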
I have a classification problem and I need to figure out the best approach to solve it. I have a set of training documents, where some of the sentences and/or paragraphs within the documents are labeled with some tags. Not all sentences/paragraphs are labeled. A sentence or paragraph may have more than one tag/label. What I want to do is build a model where, given a new document, it will give me suggested labels for each of the sentences/paragraphs within the document. Ideally, it would only give me high-probability suggestions.
If I use something like nltk NaiveBayesClassifier, it gives poor results, I think because it does not take into account the "unlabeled" sentences from the training documents, which will contain many similar words and phrases as the labeled sentences. The documents are legal/financial in nature and are filled with legal/financial jargon most of which should be discounted in the classification model.
Is there some better classification algorithm than Naive Bayes, or is there some way to push the unlabelled data into Naive Bayes, in addition to the labelled data from the training set?
Here's what I'd do to slightly modify your existing approach: train a single classifier for each possible tag, for each sentence. Include all sentences not expressing that tag as negative examples for the tag (this will implicitly count unlabelled examples). For a new test sentence, run all n classifiers, and retain classes scoring above some threshold as the labels for the new sentence.
I'd probably use something other than Naive Bayes. Logistic regression (MaxEnt) is the obvious choice if you want something probabilistic; SVMs are very strong if you don't care about probabilities (and I don't think you do at the moment).
This is really a sequence labelling task, and ideally you'd fold in predictions from nearby sentences too... but as far as I know, there's no principled extension to CRFs/StructSVM or other sequence tagging approaches that lets instances have multiple labels.
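The per-tag one-vs-rest scheme above can be sketched as follows. The sentences, tags, and the 0.5 threshold are illustrative assumptions; note how the unlabelled sentence simply becomes a negative example (an all-zero row) for every tag:

```python
# One binary classifier per tag; keep tags whose probability clears
# a threshold. Sentences without tags act as negatives for all tags.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

sentences = [
    "the borrower shall repay the principal",
    "interest accrues at a fixed annual rate",
    "this agreement is governed by state law",   # unlabelled
    "repayment of principal plus accrued interest",
]
tags = [{"repayment"}, {"interest"}, set(), {"repayment", "interest"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)  # unlabelled sentences become all-zero rows
vec = TfidfVectorizer()
X = vec.fit_transform(sentences)

ovr = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
probs = ovr.predict_proba(vec.transform(["accrued interest on the principal"]))[0]
suggested = [t for t, p in zip(mlb.classes_, probs) if p >= 0.5]
print(suggested)
```

In practice the threshold would be tuned per tag on held-out data rather than fixed at 0.5.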
is there some way to push the unlabelled data into naive bayes
There is no distinction between "labeled" and "unlabeled" data: Naive Bayes builds simple conditional probabilities, in particular P(label|attributes) and P(no label|attributes), so its behaviour depends heavily on the processing pipeline used, but I highly doubt that it actually ignores the unlabelled parts. If it does so for some reason, and you do not want to modify the code, you can also introduce an artificial label, "no label", for all remaining text segments.
Is there some better classification algorithm than Naive Bayes
Yes, NB is in fact the most basic model, and there are dozens of better (stronger, more general) ones that achieve better results in text tagging, including:
Hidden Markov Models (HMM)
Conditional Random Fields (CRF)
and, in general, Probabilistic Graphical Models (PGM)