Good training data for text classification by LDA? - machine-learning

I'm classifying content based on LDA into generic topics such as Music, Technology, Arts, and Science.
This is the process I'm using:
9 topics -> Music, Technology, Arts, Science, etc.
9 documents -> Music.txt, Technology.txt, Arts.txt, Science.txt, etc.
I've filled each document (.txt file) with about 10,000 lines of what I think is "pure" categorical content.
I then classify a test document to see how well the classifier is trained.
My questions are:
a) Is this an efficient way to classify text (using the above steps)?
b) Where should I be looking for "pure" topical content to fill each of these files? I need sources that are not too large (not more than about 1 GB of text data).
The classification is only on "generic" topics such as the above.

a) The method you describe sounds fine, but everything will depend on the implementation of labeled LDA that you're using. One of the best implementations I know is the Stanford Topic Modeling Toolbox. It is not actively developed anymore, but it worked great when I used it.
b) You can look for topical content on DBpedia, which has a structured ontology of topics/entities and links to the corresponding Wikipedia articles.

I suggest using a bag-of-words (BoW) representation for each class, or vectors where each column is the frequency of important keywords related to the class you want to target.
As for dictionaries, you have DBpedia, as yves mentioned, or WordNet.
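For concreteness, here is a minimal sketch of such keyword-frequency vectors with scikit-learn, reusing the .txt files from the question; the keyword list is an illustrative assumption, not a recommendation:

```python
# Sketch: per-class keyword-frequency vectors with scikit-learn.
# The keyword list is a made-up example; swap in keywords that
# matter for your own categories.
from sklearn.feature_extraction.text import CountVectorizer

keywords = ["guitar", "album", "software", "algorithm", "painting", "theorem"]
class_docs = {
    "Music": open("Music.txt").read(),
    "Technology": open("Technology.txt").read(),
}

vectorizer = CountVectorizer(vocabulary=keywords)  # count only the keywords
matrix = vectorizer.fit_transform(class_docs.values())

for cls, row in zip(class_docs, matrix.toarray()):
    print(cls, dict(zip(keywords, row)))
```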

a.) The simplest solution is surely the k-nearest-neighbors algorithm (kNN). In fact, it will classify new texts with categorical content using an overlap metric.
You can find resources here: https://github.com/search?utf8=✓&q=knn+text&type=Repositories&ref=searchresults
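If it helps, here is a minimal kNN sketch with scikit-learn; it swaps the overlap metric for cosine distance on TF-IDF vectors, and train_texts/train_labels are assumed to exist:

```python
# Sketch: kNN text classification on TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

knn = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    KNeighborsClassifier(n_neighbors=5, metric="cosine"),
)
knn.fit(train_texts, train_labels)
print(knn.predict(["new guitar album released this week"]))
```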

Dataset issue:
If you are dealing with classifying live user feeds, then I guess no single dataset will satisfy your requirement. If a new movie X is released, it might not be caught by your classifier, because the training dataset has become stale.
To stay current, I suggest Twitter training datasets: develop a dynamic pipeline that updates the classifier with the latest tweet datasets. You could select the top 15-20 hashtags for each category of your choice to get the most relevant dataset for each category.
Classifier:
Most classifiers use a bag-of-words model; you can try out various classifiers and see which gives the best result. See:
http://www.nltk.org/howto/classify.html
http://scikit-learn.org/stable/supervised_learning.html
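A minimal sketch of that comparison with scikit-learn, assuming texts and labels already exist:

```python
# Sketch: compare several bag-of-words classifiers by cross-validated accuracy.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

for clf in (MultinomialNB(), LogisticRegression(max_iter=1000), LinearSVC()):
    pipe = make_pipeline(CountVectorizer(), clf)
    scores = cross_val_score(pipe, texts, labels, cv=5)
    print(type(clf).__name__, scores.mean())
```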

Related

Is it a bad idea to use the cluster ID from clustering text data using K-means as feature to your supervised learning model?

I am building a model that will predict the lead time of products flowing through a pipeline.
I have a lot of different features; one is a string containing a few words about the purpose of the product (often abbreviations, the name of the application it will be part of, and so forth). I have previously not used this field at all when doing feature engineering.
I was thinking that it would be nice to do some type of clustering on this data, and then use the cluster ID as a feature for my model; perhaps the lead time is correlated with the type of info present in that field.
Here was my line of thinking (sketched below):
1) Cleaning & tokenizing text.
2) TF-IDF
3) Clustering
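A minimal sketch of that pipeline in scikit-learn, where descriptions, new_descriptions, and n_clusters=10 are assumptions:

```python
# Sketch: TF-IDF the descriptions, cluster them, use the cluster ID as a feature.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(descriptions)                        # steps 1-2
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)  # step 3
cluster_ids = kmeans.labels_             # feature column for the supervised model

# New data points reuse the fitted vectorizer and centroids; words unseen
# during fitting are simply ignored, which is the limitation discussed below.
new_ids = kmeans.predict(vectorizer.transform(new_descriptions))
```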
But after thinking more about it, is it a bad idea? The clustering was based on the old data; if new words are introduced in the new data, they will not be captured by the clustering algorithm, and the data should perhaps be clustered differently now. Does this mean that I would have to retrain the entire model (the k-means model and then the supervised model) whenever I want to predict new data points? Are there any best practices for this?
Are there better ways of finding clusters for text data to use as features in a supervised model?
I understand the urge to first use an unsupervised clustering algorithm to see for yourself which clusters are found, and of course you can try whether such an approach helps your task.
But as you have labeled data, you can pass the product description to the supervised model without an intermediate clustering step. Your supervised algorithm can then learn for itself if and how this feature helps in your task (of course, preprocessing such as stopword removal, cleaning, tokenizing, and feature extraction still needs to be done).
Depending on your text descriptions, I could also imagine that some simple sequence embeddings could work as feature extraction. An embedding is a vector of, for example, 300 dimensions, which describes the words in such a way that "hp office printer" and "canon ink jet" end up close to each other, while "nice leather bag" ends up farther away from the other two phrases. For example, fastText word embeddings are already trained for English. To get a single embedding for a sequence such as "hp office printer", one can take the average of the three word vectors (there are more ways to get an embedding for a whole sequence, for example doc2vec).
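A minimal sketch of the averaging idea, using one of gensim's stock pretrained vector sets (an assumption; substitute pretrained fastText vectors if you have them):

```python
# Sketch: average pretrained word vectors into a single phrase embedding.
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads on first use

def phrase_vector(phrase):
    """Average the vectors of the in-vocabulary tokens."""
    tokens = [t for t in phrase.lower().split() if t in vectors]
    return np.mean([vectors[t] for t in tokens], axis=0)

def cosine(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

a = phrase_vector("hp office printer")
b = phrase_vector("canon ink jet")
c = phrase_vector("nice leather bag")

print(cosine(a, b))  # expected to be higher ...
print(cosine(a, c))  # ... than this one
```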
But in the end you need to run tests to choose your features and methods!

Incorporating feedback to retrain WordToVec for finding document similarity

I have trained gensim's Word2Vec on a text corpus, converted it to Doc2Vec, and then used cosine similarity to find the similarity between documents. I need to suggest similar documents. Now suppose that among the top 5 suggestions for a particular document, we manually find that 3 of them are not similar. Can this feedback be incorporated in retraining the model?
It's not quite clear what you mean by "converted [a Word2Vec model] to Doc2Vec". The gensim Doc2Vec class doesn't use or require a Word2Vec model as input.
But, if you have many sets of hand-curated "this is a good suggestion" or "this is a bad suggestion" pairs for your corpus, you can use the model's scoring against all those to compare models, and train many variant models (with different model parameter values like size, window, min_count, sample, etc), picking the one that scores best on your tests.
That sort of automated-parameter-search is the most straightforward way to use performance on real evaluation data to adjust an unsupervised model like Word2Vec.
(Depending on the specifics of your data and problem-domain, you might also start to notice patterns in where the model is better or worse, that help you hand-tune parts of the data preprocessing. For example, a different handling of capitalization or tokenization might be suggested by error cases.)
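A minimal sketch of that automated parameter search, assuming corpus is a list of tokenized documents and good_pairs/bad_pairs are hand-curated (doc_index, doc_index) tuples (note that in gensim >= 4.0 the size parameter is called vector_size):

```python
# Sketch: score Word2Vec variants against hand-curated good/bad pairs.
import numpy as np
from gensim.models import Word2Vec

def doc_vector(model, tokens):
    """Average the word vectors of the tokens the model knows."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def score(model, corpus, good_pairs, bad_pairs):
    """Higher is better: good pairs should be similar, bad pairs should not."""
    docs = [doc_vector(model, d) for d in corpus]
    good = np.mean([cosine(docs[i], docs[j]) for i, j in good_pairs])
    bad = np.mean([cosine(docs[i], docs[j]) for i, j in bad_pairs])
    return good - bad

best = None
for vector_size in (50, 100, 300):
    for window in (2, 5, 10):
        model = Word2Vec(corpus, vector_size=vector_size, window=window,
                         min_count=2, sample=1e-3, epochs=20)
        s = score(model, corpus, good_pairs, bad_pairs)
        if best is None or s > best[0]:
            best = (s, vector_size, window)
```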

unsupervised learning on sentences

I have data representing comments from the operator on various activities performed on an industrial device. A comment could reflect either a routine maintenance/replacement activity, or indicate that some damage occurred and had to be repaired.
I have a set of 200,000 sentences that need to be classified into two buckets: Repair or Scheduled Maintenance (or undetermined). They have no labels, hence I am looking for an unsupervised learning based solution.
Some sample data is shown below:
"Motor coil damaged .Replaced motor"
"Belt cracks seen. Installed new belt"
"Occasional start up issues. Replaced switches"
"Replaced belts"
"Oiling and cleaning done".
"Did a preventive maintainence schedule"
The first three sentences have to be labeled as Repair, while the last three as Scheduled Maintenance.
What would be a good approach to this problem? Though I have some exposure to machine learning, I am new to NLP-based machine learning.
I see many papers related to this:
https://pdfs.semanticscholar.org/a408/d3b5b37caefb93629273fa3d0c192668d63c.pdf
https://arxiv.org/abs/1611.07897
but I wanted to understand if there is any standard approach to such problems.
It seems like you could use some reliable keywords (verbs, it seems, in this case) to create training samples for an NLP classifier. Or you could use k-means or k-medoids clustering with K=2, which would do a pretty good job of separating the set. If you want to get really involved, you could use something like Latent Dirichlet Allocation, which is a form of unsupervised topic modeling. However, for a problem like this, with the amount of data you have, the fancier you get, the more frustrated with the results you will become, IMO.
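A minimal sketch of the K=2 idea on the sample sentences above, with TF-IDF plus k-means standing in for k-medoids (on six sentences the split is illustrative, not reliable):

```python
# Sketch: cluster the sample sentences into two buckets.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "Motor coil damaged. Replaced motor",
    "Belt cracks seen. Installed new belt",
    "Occasional start up issues. Replaced switches",
    "Replaced belts",
    "Oiling and cleaning done",
    "Did a preventive maintenance schedule",
]
X = TfidfVectorizer(stop_words="english").fit_transform(sentences)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for sentence, label in zip(sentences, labels):
    print(label, sentence)
```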
Both OpenNLP and StanfordNLP have text classifiers for this, so I recommend the following if you want to go the classification route (a rough Python analogue is sketched after the list):
- Use keyword searches to produce a few thousand examples of your two categories
- Put those sentences in a file with a label based on the OpenNLP format (label |space| sentence | newline )
- Train a classifier with the OpenNLP DocumentClassifier; I recommend stemming for one of your feature generators
- After you have the model, use it in Java and classify each sentence
- Keep track of the scores, and quarantine low scores (you will have ambiguous classes, I'm sure)
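The recipe above is OpenNLP/Java; here is a rough Python analogue of the same idea, where the keyword lists are illustrative guesses and sentences is the list from the previous sketch:

```python
# Sketch: bootstrap labels from keywords, train a classifier,
# and quarantine low-confidence predictions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

REPAIR = ("damaged", "cracks", "issues", "repair")
MAINTENANCE = ("oiling", "cleaning", "preventive", "schedule")

def weak_label(sentence):
    s = sentence.lower()
    if any(k in s for k in REPAIR):
        return "repair"
    if any(k in s for k in MAINTENANCE):
        return "maintenance"
    return None  # undetermined - not used for training

train = [(s, y) for s in sentences if (y := weak_label(s)) is not None]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit([s for s, _ in train], [y for _, y in train])

# Quarantine anything the classifier is unsure about.
for s in sentences:
    confidence = clf.predict_proba([s])[0].max()
    print(s, "->", clf.predict([s])[0] if confidence > 0.6 else "UNDETERMINED")
```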
If you don't want to go that route, I recommend using a text-indexing technology du jour like Solr or Elasticsearch, or your favorite RDBMS's text indexing, to perform a "more like this" type of query, so you don't have to play the continuous model-updating game.

what methods are there to classify documents?

I am trying to do document classification, but I am really confused about feature selection versus tf-idf. Are they the same thing, or two different ways of doing classification?
I hope somebody can clear this up. I am not really sure that my question will make sense to you guys.
Yes, you are confusing a lot of things.
Feature selection is the abstract term for choosing features (0 or 1). Stopword removal can be seen as feature selection.
TF is one method of extracting features from text: counting words.
IDF is one method of assigning weights to features.
Neither of them is classification... they are popular for text classification, but they are even more popular for information retrieval, which is not classification...
However, many classifiers work on numeric data, so the common process is:
1. Extract features (e.g. TF: term counts)
2. Select features (e.g. remove stopwords)
3. Weight features (e.g. IDF)
4. Train a classifier on the resulting numerical vectors
5. Predict the classes of new/unlabeled documents
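Those five steps map directly onto a scikit-learn pipeline; a minimal sketch, assuming train_docs, train_labels, and new_docs exist:

```python
# Sketch: extract counts, select features (drop stopwords), weight with
# IDF, train, and predict - one step per pipeline stage.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("extract_select", CountVectorizer(stop_words="english")),  # steps 1-2
    ("weight", TfidfTransformer()),                             # step 3
    ("classify", MultinomialNB()),                              # step 4
])
pipe.fit(train_docs, train_labels)
print(pipe.predict(new_docs))                                   # step 5
```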
Taking a look at this explanation may help a lot when it comes to understanding text classifiers.
TF-IDF is a good way to find a document that answers a given query, but it does not necessarily assign classes to documents.
Examples that may be helpful:
1) You have a bunch of documents with subjects ranging from politics, economics, computer science and the arts. The documents belonging to each subject are separated into the appropriate directories for each subject (you have a labeled dataset). Now, you received a new document whose subject you do not know. In which directory should it be stored? A classifier can answer this question from the documents that are already labeled.
2) Now, you received a query regarding computer science. For instance, you received the query "Good methods for finding textual similarity". Which document in the directory of computer science can provide the best response to that query? TF-IDF would be a good approach to figure that out.
So, when you are classifying documents, you are trying to make a decision about whether a document is a member of a particular class (like, say, 'about birds' or 'not about birds').
Classifiers predict the value of the class given a set of features. A good set of features will be highly discriminative - they will tell you a lot about whether the document is of one class or another.
Tf-idf (term frequency-inverse document frequency) is a particular feature that seems to be discriminative for document classification tasks. There are others, like word counts (tf, or term frequency) or whether a regexp matches the text, or what have you.
Feature selection is the task of selecting good (discriminative) features. Tfidf is probably a good feature to select.

Machine learning what approach to use when the dataset contain only one-class instances?

I have a dataset from a particular domain (say sports: one class). What I want to do is, when I feed a web page to the classifier/clusterer, get a result saying whether that instance (web page) is related to sports or not.
Most of the classifiers in Weka are not capable of dealing with one-class datasets, except the LibSVM wrapper. I did some tests with LibSVM, but the problem is that during tests on an unrelated dataset, I get all of them correctly classified, even if the instances are empty! Any suggestions?
What if I use the cosine similarity measure here?
Have you seen this thread, "unary class text classification in weka", and this post: https://list.scms.waikato.ac.nz/mailman/htdig/wekalist/2007-October/011631.html ?
I'm assuming you meant that when you run the classifier against another dataset that is not "sports" it gets the results incorrectly classified (i.e. false positives) e.g. "this is sports".
Are you certain your dataset only contains one class? Did you make sure the dataset does not contain any empty instances? (don't mock, this has happened to me before).
In the comments of the previously mentioned thread there is a link to a PDF on tuning SVMs: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf - I would say SVMs are a bit harder to tune than other common classifiers.
As an alternative, can't you switch the problem to binary classification? It's much easier to get good results, and for most problems there are plenty of examples of things that are not in that class, e.g. sports websites vs. funny-image websites, programming websites, etc.
PS: you can use other algorithms for outlier detection: http://en.wikipedia.org/wiki/Outlier_detection
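If you later move beyond Weka, scikit-learn's OneClassSVM is one such outlier-detection option; a minimal sketch, assuming sports_pages and unseen_pages are lists of page texts:

```python
# Sketch: train a one-class SVM on sports pages only, then flag outliers.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(sports_pages)

ocsvm = OneClassSVM(kernel="linear", nu=0.1).fit(X)

# +1 = looks like the training class (sports), -1 = outlier.
print(ocsvm.predict(vectorizer.transform(unseen_pages)))
```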
