Better text documents clustering than tf/idf and cosine similarity? - machine-learning

I'm trying to cluster the Twitter stream. I want to put each tweet to a cluster that talk about the same topic. I tried to cluster the stream using an online clustering algorithm with tf/idf and cosine similarity but I found that the results are quite bad.
The main disadvantages of using tf/idf is that it clusters documents that are keyword similar so it's only good to identify near identical documents. For example consider the following sentences:
1- The website Stackoverflow is a nice place.
2- Stackoverflow is a website.
The prevoiuse two sentences will likely by clustered together with a reasonable threshold value since they share a lot of keywords. But now consider the following two sentences:
1- The website Stackoverflow is a nice place.
2- I visit Stackoverflow regularly.
Now by using tf/idf the clustering algorithm will fail miserably because they only share one keyword even tho they both talk about the same topic.
My question: is there better techniques to cluster documents?

In my experience, cosine similarity on latent semantic analysis (LSA/LSI) vectors works a lot better than raw tf-idf for text clustering, though I admit I haven't tried it on Twitter data. In particular, it tends to take care of the sparsity problem that you're encountering, where the documents just don't contain enough common terms.
Topic models such as LDA might work even better.

As mentioned in other comments and answers. Using LDA can give good tweet->topic weights.
If these weights are insufficient clustering for your needs you could look at clustering these topic distributions using a clustering algorithm.
While it is training set dependent LDA could easily bundle tweets with stackoverflow, stack-overflow and stack overflow into the same topic. However "my stack of boxes is about to overflow" might instead go into another topic about boxes.
Another example: A tweet with the word Apple could go into a number of different topics (the company, the fruit, New York and others). LDA would look at the other words in the tweet to determine the applicable topics.
"Steve Jobs was the CEO at Apple" is clearly about the company
"I'm eating the most delicious apple" is clearly about the fruit
"I'm going to the big apple when I travel to the USA" is most likely about visiting New York

Long answer:
TfxIdf is currently one of the most famous search method. What you need are some preprocessing from Natural Langage Processing (NLP). There is a lot of resources that can help you for english (for example the lib 'nltk' in python).
You must use the NLP analysis both on your querys (questions) and on yours documents before indexing.
The point is : while tfxidf (or tfxidf^2 like in lucene) is good, you should use it on annotated resource with meta-linguistics information. That can be hard and require extensive knowledge about your core search engine, grammar analysis (syntax) and the domain of document.
Short answer : The better technique is to use TFxIDF with light grammar NLP annotations, and both re-write query and indexing.

Related

Using NLP or machine learning to extract keywords off a sentence

I'm new to the ML/NLP field so my question is what technology would be most appropriate to achieve the following goal:
We have a short sentence - "Where to go for dinner?" or "What's your favorite bar?" or "What's your favorite cheap bar?"
Is there a technology that would enable me to train it providing the following data sets:
"Where to go for dinner?" -> Dinner
"What's your favorite bar?" -> Bar
"What's your favorite cheap restaurant?" -> Cheap, Restaurant
so that next time we have a similar question about an unknown activity, say, "What is your favorite expensive [whatever]" it would be able to extract "expensive" and [whatever]?
The goal is if we can train it with hundreds of variations(or thousands) of the question asked and relevant output data expected, so that it can work with everyday language.
I know how to make it even without NLP/ML if we have a dictionary of expected terms like Bar, Restaurant, Pool, etc., but we also want it to work with unknown terms.
I've seen examples with Rake and Scikit-learn for classification of "things", but I'm not sure how would I feed text into those and all those examples had predefined outputs for training.
I've also tried Google's NLP API, Amazon Lex and Wit to see how good they are at extracting entities, but the results are disappointing to say the least.
Reading about summarization techniques, I'm left with the impression it won't work with small, single-sentence texts, so I haven't delved into it.
As #polm23 mentioned for simple stuff you can use the POS tagging to do the extraction. The services you mentioned like LUIS, Dialog flow etc. , uses what is called Natural Language Understanding. They make uses of intents & entities(detailed explanation with examples you can find here). If you are concerned that your data is going online or sometimes you have to go offline, you always go for RASA.
Things you can do with RASA:
Entity extraction and sentence classification. Mention which particular term to be extracted from the sentence by tagging the word position with a variety of sentence. So if any different word comes other than what you had given in the training set it will be detected.
Uses rule-based learning and also keras LSTM for detection.
One downside when comparing with the online services is that you have to manually tag the position numbers in the JSON file for training as opposed to the click and tag features in the online services.
You can find the tutorial here.
I am having pain in my leg.
Eg I have trained RASA with a variety of sentences for identifying body part and symptom (I have limited to 2 entities only, you can add more), then when an unknown sentence (like the one above) appears it will correctly identify "pain" as "symptom" and "leg" as "body part".
Hope this answers your question!
Since "hundreds to thousands" sound like you have very little data for training a model from scratch. You might want to consider training (technically fine-tuning) a DialogFlow Agent to match sentences ("Where to go for dinner?") to intents ("Dinner"), then integrating via API calls.
Alternatively, you can invest time in fine-tuning a small pre-trained model like "Distilled BERT classifier" from "HuggingFace" as you won't need the 100s of thousands to billions of data samples required to train a production-worthy model. This can also be assessed offline and will equip you to solve other NLP problems in the future without much low-level understanding of the underlying statistics.

Find startup's industry from its description

I am using AngelList DB to categorize startups based on their industries since these startups are categorized based on community input which is misleading most of the time.
My business objective is to extract keywords that indicate to which industry this specific startup belongs to then map it to one of the industries specified in LinkedIn sheet https://developer.linkedin.com/docs/reference/industry-codes
I experimented with Azure Machine learning, where I pushed 300 startups descriptions and analyzed the keyword extraction was pretty bad and was not even close to what I am trying to achieve.
I would like to know how data scientists will approach this problem? where should I look? and where I should not? is keyword analysis tools (like Google Adwords keyword planner is a viable option)
Using Text Classification...
To be able to treat this as a classification problem, you need a training set, which is a set of AngelList entries that are labeled with correct LinkedIn categories. This can be done manually, or you can hire some Mechanical Turks to do the job for you.
Since you have ~150 categories, I'd imagine you need at least 20-30* AngelList entries for each of them. So your training set will be {input: angellist_description, result: linkedin_id}
After that, you need to dig through text classification techniques to try and optimize the accuracy/precision of your results. The book "Taming Text" has a full chapter on text classification. And a good tool to implement a text-based classifier would be Apache Solr or Apache Lucene.
* 20-30 is a quick personal estimate and not based on a scientific method. You can look up some methods online for a good estimation method.
Using Text Clustering.
Step #1
Use text clustering to extract main 'topics' from all the descriptions. (Carrot2 can be helpful here)
Input corpus of all descriptions
Process: Text Clustering using Carrot2
Output each document will be labeled with a topic
Step #2
Manually map the extracted topics into LinkedIn's categories.
Step #3
Use the output of the first two steps to traverse from company -> extracted topic -> linkedin category

Clustering of Twitter Feeds

I am new to clustering, just implemented a couple of algorithms before.
I need to cluster tweets according to their similarity.
One way is to use only hash tags, but I don't think it would be that informative. So complete tweets should be analyzed.
Moreover I was searching the web for the algorithms for clustering feeds.
One I encountered is TF-IDF. I want to know are there better algorithms which can be implemented in few hours and are better than TF-IDF.Also I would be intersetd in some informatics source about the clustering of twitter feeds.
PS: No. of tweets : 10^5
As Anony Mousse pointed out in his comment above, TF/IDF is only a normalization measure to make sure words that are overly popular among all documents don't gain too much important.
For data preparation, I'd recommend reading this and the second part of it too (linked via the above link), if you haven't already done so. It is very important to get a vector of numbers from each tweet. In general, in machine learning, it is important to get a feature vector because that way, you can apply mathematical algorithms to your data then.
Now that you have a feature vector for each tweet in your collection, things get a bit simple. There are two clustering algorithms that come to my mind that you can whip up in a couple of hours each, with maybe extensive testing taking a weekend.
K-Means Clustering
Hierarchical Clustering With Single Linkage
With 100,000 tweets only, you should actually be able to implement these algorithms on a single computer (i.e. this is not big data -- no need for cluster computing), using your favorite language (C++, Java, Python, MATLAB, etc.). Personally, I think it's easier to implement K-Means Clustering (which I have done before) compared to Hierarchical Clustering (which I have also done before).
EDIT: Please follow the below comments only if you have labeled training data, i.e. you have tweets say, with labeled sentiments (happy-user, ok-ok, bad product, angry-user, abusive-user) and the question you want to answer is: Given a new tweet, what is it's sentiment?
Here is one very good resource you should look at, to get a better understanding of K-Nearest Neighbors:
Laszlo Kozma's Slides
In general, for the other two algorithms, there are ample resources, with Wikipedia articles the best way to start. Personally, I feel K-Nearest Neighbors (shorthand k-NN) is the easiest of the three to implement and will give you quick results.

Research papers classification on the basis of title of the research paper

Dear all I am working on a project in which I have to categories research papers into their appropriate fields using titles of papers. For example if a phrase "computer network" occurs somewhere in then title then this paper should be tagged as related to the concept "computer network". I have 3 million titles of research papers. So I want to know how I should start. I have tried to use tf-idf but could not get actual results. Does someone know about a library to do this task easily? Kindly suggest one. I shall be thankful.
If you don't know categories in advance, than it's not classification, but instead clustering. Basically, you need to do following:
Select algorithm.
Select and extract features.
Apply algorithm to features.
Quite simple. You only need to choose combination of algorithm and features that fits your case best.
When talking about clustering, there are several popular choices. K-means is considered one of the best and has enormous number of implementations, even in libraries not specialized in ML. Another popular choice is Expectation-Maximization (EM) algorithm. Both of them, however, require initial guess about number of classes. If you can't predict number of classes even approximately, other algorithms - such as hierarchical clustering or DBSCAN - may work for you better (see discussion here).
As for features, words themselves normally work fine for clustering by topic. Just tokenize your text, normalize and vectorize words (see this if you don't know what it all means).
Some useful links:
Clustering text documents using k-means
NLTK clustering package
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Note: all links in this answer are about Python, since it has really powerful and convenient tools for this kind of tasks, but if you have another language of preference, you most probably will be able to find similar libraries for it too.
For Python, I would recommend NLTK (Natural Language Toolkit), as it has some great tools for converting your raw documents into features you can feed to a machine learning algorithm. For starting out, you can maybe try a simple word frequency model (bag of words) and later on move to more complex feature extraction methods (string kernels). You can start by using SVM's (Support Vector Machines) to classify the data using LibSVM (the best SVM package).
The fact, that you do not know the number of categories in advance, you could use a tool called OntoGen. The tool basically takes a set of texts, does some text mining, and tries to discover the clusters of documents. It is a semi-supervised tool, so you must guide the process a little, but it does wonders. The final product of the process is an ontology of topics.
I encourage you, to give it a try.

Naive Bayesian for Topic detection using "Bag of Words" approach

I am trying to implement a naive bayseian approach to find the topic of a given document or stream of words. Is there are Naive Bayesian approach that i might be able to look up for this ?
Also, i am trying to improve my dictionary as i go along. Initially, i have a bunch of words that map to a topics (hard-coded). Depending on the occurrence of the words other than the ones that are already mapped. And depending on the occurrences of these words i want to add them to the mappings, hence improving and learning about new words that map to topic. And also changing the probabilities of words.
How should i go about doing this ? Is my approach the right one ?
Which programming language would be best suited for the implementation ?
Existing Implementations of Naive Bayes
You would probably be better off just using one of the existing packages that supports document classification using naive Bayes, e.g.:
Python - To do this using the Python based Natural Language Toolkit (NLTK), see the Document Classification section in the freely available NLTK book.
Ruby - If Ruby is more of your thing, you can use the Classifier gem. Here's sample code that detects whether Family Guy quotes are funny or not-funny.
Perl - Perl has the Algorithm::NaiveBayes module, complete with a sample usage snippet in the package synopsis.
C# - C# programmers can use nBayes. The project's home page has sample code for a simple spam/not-spam classifier.
Java - Java folks have Classifier4J. You can see a training and scoring code snippet here.
Bootstrapping Classification from Keywords
It sounds like you want to start with a set of keywords that are known to cue for certain topics and then use those keywords to bootstrap a classifier.
This is a reasonably clever idea. Take a look at the paper Text Classication by Bootstrapping with Keywords, EM and Shrinkage by McCallum and Nigam (1999). By following this approach, they were able to improve classification accuracy from the 45% they got by using hard-coded keywords alone to 66% using a bootstrapped Naive Bayes classifier. For their data, the latter is close to human levels of agreement, as people agreed with each other about document labels 72% of the time.

Resources