Bag of words Classification - machine-learning

I need to find training words and their classifications - simple categories such as Sports, Entertainment, and Politics.
Where can I find such words and their classifications? I know many universities have done bag-of-words classification. Is there any repository of training examples?

This is not exactly what you are looking for, but you might find http://labs.google.com/sets interesting.
You can put in a bunch of words, and it will spit out a list of related words, which you could recursively throw back into the first page to get even more related words.
Alternatively, download a large chunk of Wikipedia articles (where you already know the category of each page [ http://en.wikipedia.org/wiki/Special:Categories ]) and write a simple script to pick words that have a high frequency in articles from one category but a very low frequency in articles from the other categories.
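A rough sketch of that script, assuming you have already parsed the dump into a dict mapping each category name to a list of article texts (the function name and the scoring rule are just one possible choice):

```python
from collections import Counter

def distinctive_words(articles_by_category, top_n=50):
    """Return the words that are frequent in one category but rare in the others."""
    # per-category word counts (naive whitespace tokenization)
    counts = {cat: Counter(w.lower() for text in texts for w in text.split())
              for cat, texts in articles_by_category.items()}
    totals = Counter()
    for cat_counts in counts.values():
        totals.update(cat_counts)
    result = {}
    for cat, cat_counts in counts.items():
        # score = frequency inside this category / (1 + frequency in all other categories)
        scored = {w: n / (1 + totals[w] - n) for w, n in cat_counts.items()}
        result[cat] = sorted(scored, key=scored.get, reverse=True)[:top_n]
    return result
```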

You can use the 20 Newsgroups data (http://people.csail.mit.edu/jrennie/20Newsgroups) to find such words per topic. Run a support vector machine on the data; it will give you the weight of each word for each class, and you can take the top 20 or 50 words. The dataset has 20 classes such as religion, politics, sports, etc. Hope it helps.
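For example, with scikit-learn (a sketch; LinearSVC and the vectorizer settings are just reasonable defaults, not the only choice):

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
vectorizer = TfidfVectorizer(stop_words="english", max_features=20000)
X = vectorizer.fit_transform(data.data)

clf = LinearSVC().fit(X, data.target)      # one linear model per class (one-vs-rest)

terms = np.array(vectorizer.get_feature_names_out())
for i, name in enumerate(data.target_names):
    top = np.argsort(clf.coef_[i])[::-1][:20]   # 20 highest-weighted words for this class
    print(name, ", ".join(terms[top]))
```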

I do not know of such a list of words, but I can suggest using a copy of Wikipedia and its category structure. You can parse the XML version of Wikipedia (I have done that) and collect words from the different topics.

Related

Retrieving the top 5 sentences - is there an algorithm for this?

I am new to data science. This could be a dumb question, but I just want to get some opinions and confirm whether I could do this better.
My question is about getting the 5 most common/frequent sentences from a database. I know I could gather all the sentences into a list and use the Counter library to fetch the 5 most frequent ones (a minimal version of that baseline is sketched after the examples below), but I am interested to know whether there is any algorithm (ML/DL/NLP) for such a requirement. All the sentences are given by the user, and I need their top 5 (most occurring/frequent) sentences (not phrases, please).
Examples of sentences -
"Welcome to the world of Geeks"
"This portal has been created to provide well written subject"
"If you like Geeks for Geeks and would like to contribute"
"to contribute at geeksforgeeks org See your article appearing on "
"to contribute at geeksforgeeks org See your article appearing on " (occurring for the second time)
"the Geeks for Geeks main page and help thousands of other Geeks."
Note: all the sentences in my database are distinct (context-wise, and there are no exact duplicates either). The above is just an example to illustrate my requirement.
Thanks in Advance.
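For reference, the Counter baseline mentioned in the question is only a few lines (a sketch, assuming the sentences have already been pulled into a Python list):

```python
from collections import Counter

sentences = [
    "Welcome to the world of Geeks",
    "to contribute at geeksforgeeks org See your article appearing on ",
    "to contribute at geeksforgeeks org See your article appearing on ",
    "the Geeks for Geeks main page and help thousands of other Geeks.",
]  # in practice: all the sentences pulled from the database

for sentence, count in Counter(sentences).most_common(5):
    print(count, sentence)
```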
I'd suggest you start with sentence embeddings. Briefly, a sentence embedding is a vector computed for a given sentence that roughly represents its meaning.
Let's say you have n sentences in your database and you compute the sentence embedding for each one, so you now have n vectors.
Once you have the vectors, you can use dimensionality-reduction techniques such as t-SNE to visualize your sentences in 2 or 3 dimensions. In this visualization, sentences with similar meanings should ideally be close to each other, which may help you pinpoint the most frequent sentences that are also close in meaning.
I think one problem is that it is still hard to draw boundaries around the meanings of sentences, since meaning is intrinsically subjective. You may have to add some heuristics to the process described above.
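A minimal sketch of that approach, assuming the sentence-transformers package and scikit-learn are available (the model name is just one common lightweight choice, and the sentences are the examples from the question):

```python
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

sentences = [
    "Welcome to the world of Geeks",
    "This portal has been created to provide well written subject",
    "If you like Geeks for Geeks and would like to contribute",
    "to contribute at geeksforgeeks org See your article appearing on ",
    "the Geeks for Geeks main page and help thousands of other Geeks.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences)                 # one vector per sentence

# perplexity must be smaller than the number of sentences
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), s in zip(coords, sentences):
    plt.annotate(s[:20], (x, y))                     # label each point with the sentence start
plt.show()
```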
Adding to MGoksu's answer: once you have the sentence embeddings, you can apply LSH (locality-sensitive hashing) to group them into clusters.
Once you have the clusters of embeddings, it is trivial to find the clusters with the highest number of vectors.
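A minimal sketch of that grouping step using random-hyperplane hashing (one common LSH scheme for cosine similarity; plain numpy, no particular LSH library assumed):

```python
import numpy as np

def lsh_buckets(embeddings, n_planes=12, seed=0):
    """Group embedding vectors by the sign pattern of random hyperplane projections."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(embeddings.shape[1], n_planes))
    bits = (embeddings @ planes) > 0            # shape: (n_sentences, n_planes)
    buckets = {}
    for i, row in enumerate(bits):
        buckets.setdefault(tuple(row), []).append(i)
    return buckets

# the largest buckets contain the sentence indices that hash together most often:
# biggest = sorted(lsh_buckets(embeddings).values(), key=len, reverse=True)[:5]
```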

Improving Article Classifier Accuracy

I've built an article classifier based on Wikipedia data that I fetch, covering 5 classes in total.
They are:
Finance (15 articles) [1,0,0,0,0]
Sports (15 articles) [0,1,0,0,0]
Politics (15 articles) [0,0,1,0,0]
Science (15 articles) [0,0,0,1,0]
None (15 random articles not pertaining to the others) [0,0,0,0,1]
I went to Wikipedia and grabbed about 15 fairly lengthy articles from each of these categories to build a corpus I could use to train my network.
After building a lexicon of about 1000 words gathered from all of the articles, I converted each article to a word vector, along with the correct classifier label.
The word vector is a binary (multi-hot) array, while the label is a one-hot array.
For example, here is the representation of one article:
[
[0,0,0,1,0,0,0,1,0,0,... > 1000], [1,0,0,0,0] # this maps to Finance
]
So, in essence, I have this randomized list of word vectors mapped to their correct classifiers.
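A simplified sketch of that representation step (assuming whitespace tokenization and a plain Python list as the lexicon; the real lexicon has about 1000 entries):

```python
import numpy as np

def article_to_example(article_text, lexicon, label_index, n_classes=5):
    """Binary bag-of-words vector over the lexicon, plus a one-hot class label."""
    words = set(article_text.lower().split())
    features = np.array([1 if w in words else 0 for w in lexicon])
    label = np.zeros(n_classes)
    label[label_index] = 1          # e.g. index 0 -> Finance -> [1,0,0,0,0]
    return features, label
```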
My network is a 3-layer deep neural net with 500 nodes in each layer. I train the network for 30 epochs and then just display how accurate the model is at the end.
Right now I'm getting about 53% to 55% accuracy. My question is: what can I do to get this into the 90s? Is that even possible, or am I going to go crazy trying to train this thing?
Additionally, what is my main bottleneck, so to speak?
edited per comments below
Neural networks aren't really designed to run best on single machines; they work much better if you have a cluster, or at least a production-grade machine. It's also very common to eliminate the "long tail" of a corpus: if a term appears only once in a single document, you may want to drop it. You may also want to apply some stemming so that you don't capture multiple forms of the same word. I strongly advise you to try applying a TF-IDF transformation to your corpus before pruning.
Network size optimization is a field unto itself. Basically, you try adding more or fewer nodes and see where that gets you. See the following for a technical discussion:
https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
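A sketch of the stemming/pruning/TF-IDF step with scikit-learn and NLTK (the min_df threshold, the Porter stemmer, and the placeholder texts are illustrative choices):

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()
base_analyzer = TfidfVectorizer().build_analyzer()    # default tokenization and lowercasing

def stemmed_analyzer(doc):
    # stem every token so "market"/"markets"/"marketing" collapse to one feature
    return [stemmer.stem(token) for token in base_analyzer(doc)]

# min_df=2 prunes the long tail: any term seen in fewer than 2 articles is dropped
vectorizer = TfidfVectorizer(analyzer=stemmed_analyzer, min_df=2)

articles = [
    "stocks and markets moved higher today",
    "the markets rallied as stocks gained",
    "the team won the championship game",
]   # placeholder texts; the real Wikipedia articles go here
X = vectorizer.fit_transform(articles)
```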
It is impossible to know without seeing the data.
Things to try:
Transform your word vectors to TF-IDF. Are you removing stop words? You can also add bi-grams/tri-grams to your word vectors (see the sketch after this list).
Add more articles - it can be difficult to separate the classes with such a small corpus. The length of an individual document doesn't necessarily help; you want more articles.
30 epochs feels very low to me.
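For the first item in the list above, a sketch with scikit-learn (the n-gram range and placeholder texts are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

articles = [
    "the election results were announced today",
    "the team scored in the final minute of the game",
]   # placeholder texts; the real articles go here

# unigrams, bigrams and trigrams, with English stop words removed, TF-IDF weighted
vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
X = vectorizer.fit_transform(articles)
print(vectorizer.get_feature_names_out()[:10])
```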

What methods are there to classify documents?

I am trying to do document classification, but I am really confused about the difference between feature selection and tf-idf. Are they the same thing, or two different ways of doing classification?
I hope somebody can tell me; I am not really sure my question will make sense to you.
Yes, you are confusing a lot of things.
Feature selection is the abstract term for choosing which features to keep or discard (0 or 1). Stop-word removal can be seen as feature selection.
TF is one method of extracting features from text: counting words.
IDF is one method of assigning weights to features.
Neither of them is classification... they are popular for text classification, but they are even more popular for information retrieval, which is not classification...
However, many classifiers work on numeric data, so the common process is to:
1. Extract features (e.g. TF)
2. Select features (e.g. remove stop words)
3. Weight features (e.g. IDF)
4. Train a classifier on the resulting numerical vectors
5. Predict the classes of new/unlabeled documents
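A sketch of those five steps with scikit-learn (the classifier choice and the placeholder texts/labels are arbitrary):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

train_texts = ["the senate passed the budget bill", "the striker scored a late goal"]
train_labels = ["politics", "sports"]               # placeholder labeled documents

pipeline = Pipeline([
    ("tf", CountVectorizer(stop_words="english")),  # 1+2: extract term counts, drop stop words
    ("idf", TfidfTransformer()),                    # 3: weight the counts by IDF
    ("clf", LogisticRegression(max_iter=1000)),     # 4: train a classifier on the vectors
])
pipeline.fit(train_texts, train_labels)

print(pipeline.predict(["a vote on the new bill"])) # 5: predict classes of unlabeled documents
```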
Taking a look at this explanation may help a lot when it comes to understanding text classifiers.
TF-IDF is a good way to find a document that answers a given query, but it does not necessarily assign documents to classes.
Examples that may be helpful:
1) You have a bunch of documents with subjects ranging from politics, economics, and computer science to the arts. The documents belonging to each subject are separated into the appropriate directory for that subject (you have a labeled dataset). Now you receive a new document whose subject you do not know. In which directory should it be stored? A classifier can answer this question using the documents that are already labeled.
2) Now you receive a query regarding computer science, for instance "Good methods for finding textual similarity". Which document in the computer-science directory can provide the best response to that query? TF-IDF would be a good approach to figure that out.
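A sketch of that retrieval step (the documents and query are placeholders; cosine similarity over TF-IDF vectors is one standard choice):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "cosine similarity and tf-idf for measuring textual similarity",
    "graph algorithms for shortest paths",
    "convolutional networks for image recognition",
]   # placeholder documents from the computer-science directory

query = "Good methods for finding textual similarity"

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform([query])

scores = cosine_similarity(query_vector, doc_vectors).ravel()
print(docs[scores.argmax()])    # the document that best matches the query
```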
So, when you are classifying documents, you are trying to make a decision about whether a document is a member of a particular class (like, say, 'about birds' or 'not about birds').
Classifiers predict the value of the class given a set of features. A good set of features will be highly discriminative - it will tell you a lot about whether the document belongs to one class or another.
Tf-idf (term frequency - inverse document frequency) is a particular feature that tends to be discriminative for document classification tasks. There are others, like raw word counts (tf, or term frequency), or whether a regexp matches the text, and so on.
Feature selection is the task of selecting good (discriminative) features. Tf-idf is probably a good feature to select.

Classification of single sentence

I have 4 different categories, and I also have around 3000 words that belong to each of these categories. Now, if a new sentence comes in, I am able to break it into words and get more words related to it, so for each new sentence I can get 20-30 words generated from it.
Now, what is the best way to classify this sentence into one of the above-mentioned categories? I know bag of words works well.
I also looked at LDA, but it works with documents, whereas I have a list of words as a training corpus. In LDA it looks at the position of a word in the document, so I could not get meaningful results from it.
I'm not sure if I fully understand what your question is exactly.
Bag of words works well for some purposes, but in a lot of cases it throws away a lot of potentially useful information (which could be taken from word order, for example).
And assuming that you get a grammatical sentence as input, why not use your sentence as the document and still use LDA? The position of a word in your sentence can still be very meaningful.
There are plenty of classification methods available. Which one is best depends largely on your purpose. If you're new to this area, this may be interesting to have a look at: https://www.coursera.org/course/ml
Like Igor, I am also a bit confused by your problem. Be it a document or a sentence, the terms will be part of the feature set for categorization in some form. You can find the most relevant terms for each category and, using that knowledge, do a better classification of new sentences. For example, take the sentence "There is a stray dog near our layout which bites everyone who goes near to it". If you take the useful keywords from this sentence, removing stop words, they are few in number (stray, dog, layout, bites, near). You can categorize it into a bucket such as "animals_issue" (see the sketch below). If you train your system with a larger set of examples, this bag-of-words model can help. Otherwise, you can go for LDA or other topic-modelling approaches.
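A sketch of that keyword-matching idea (the category word lists, the second category, and the stop-word set are tiny hypothetical placeholders; the real lists would hold the ~3000 words per category):

```python
def classify_sentence(sentence, keywords_by_category, stopwords):
    """Pick the category whose keyword list overlaps most with the sentence."""
    tokens = {w.strip(".,").lower() for w in sentence.split()} - stopwords
    scores = {cat: len(tokens & words) for cat, words in keywords_by_category.items()}
    return max(scores, key=scores.get)

keywords_by_category = {                    # placeholder word lists (hypothetical categories)
    "animals_issue": {"stray", "dog", "bites", "bitten", "kennel"},
    "road_issue": {"pothole", "traffic", "signal", "road"},
}
stopwords = {"there", "is", "a", "near", "our", "which", "who", "goes", "to", "it"}

print(classify_sentence(
    "There is a stray dog near our layout which bites everyone who goes near to it",
    keywords_by_category, stopwords))       # -> animals_issue
```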

Binary classification of webpages where data in categories are very similar

I am working on binary classification of webpages related to a topic of my interest. I want to classify whether a webpage belongs to a certain category or not. I have a manually labelled dataset with 2 categories, positive and negative. However, my concern is that when I look at the bag-of-words from each of the categories, the features are very similar; the positive and negative webpages are indeed very close content-wise.
Some more info: the content is in English, and we are also doing stop-word removal.
How can I go about this task? Is there a different approach that can be applied to this problem?
Thanks !
You can use pairs of consecutive words instead of single words (a bag of pairs of words). The hope is that pairs of words may better capture the concept you're after; triplets of words could come next. The issue is that the dimensionality gets really high (N^2). If you can't afford that, one idea is to use the hashing trick (check the literature on random projections/hashing) on the pairs of words to bound the dimensionality.
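A sketch of the hashing-trick variant with scikit-learn (bigram features hashed into a fixed number of dimensions; the page texts are placeholders):

```python
from sklearn.feature_extraction.text import HashingVectorizer

pages = [
    "apply now for the summer internship programme",
    "our summer internship applications are closed",
]   # placeholder webpage texts

# pairs of consecutive words (bigrams), hashed into 2**16 dimensions
vectorizer = HashingVectorizer(ngram_range=(2, 2), n_features=2**16, alternate_sign=False)
X = vectorizer.transform(pages)    # HashingVectorizer is stateless, so no fit is needed
print(X.shape)
```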
