Semantic analysis of tweets - twitter

I know how to communicate with Twitter and how to retrieve tweets, but I am looking to do further work on these tweets.
I have two categories, food and sports, and I want to sort tweets into them. Can anyone please suggest how to do this categorization algorithmically?
regards
Gaurav

I've been doing some work recently with Latent Dirichlet Allocation (LDA). The general idea is that documents contain words that are generated from topics. What you could try is loading a corpus of documents known to be about the topics you are interested in, updating the model with the tweets of interest, and then selecting the tweets that have strong probabilities for the same topics as your known documents.
I use R for LDA (package:topicmodels and package:lda), but I think there are some prebuilt Python tools for this too. I would probably steer away from trying to write your own unless you have a solid grounding in Bayesian statistics.
Here's the documentation for the topicmodels package: http://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf

I doubt that any set of algorithms could categorize open-domain tweets. In other words, I don't think a set of rules can categorize tweets in an open domain. You need to parse the tweets into a semantic representation customized for the categorization.

Related

How to form documents for LDA on twitter data

We have a requirement to do topic modelling on tweets from the live Twitter stream. The input goes to Spark Streaming, which stores the data in HDFS. A batch job then runs on the collected data to find the underlying topics in the tweets. For this we are using the Latent Dirichlet Allocation (LDA) algorithm. We receive the data as tweets of at most 140 characters, each stored as one row in HDFS.
I'm new to LDA and have only a basic understanding of it: the topic model is derived from word co-occurrences across n documents.
I see two options for feeding the data to LDA:
Option 1: Use each single-row tweet as one document for LDA.
Option 2: Group the rows into documents and pass these documents to LDA.
I want to understand how the distribution of vocabulary (words) over topics is affected by each option. Which option should be preferred for better topic modelling?
Also, please let me know if there is a better way to do topic modelling on Twitter data than these options.
Note: When I ran both options and displayed the results as word clouds, I could see that the distribution of words over the topics (3) differs between the two.
Any help appreciated.
Thanks in advance.
Using LDA with short documents is a bit tricky, since LDA assigns a topic to each word and a mixture of topics to each document. With short text, only a few words end up belonging to the same topic, even though a tweet will mostly contain only one topic, and this usually yields garbage topic distributions. (This is your option 1.)
I know that there's a paper and a Java tool for topic modeling on short text, but I have never used it. Here's the link to the GitHub repo.
For option 2, I think it is possible to use LDA and get coherent topics, but you need to find some semantic structure for the grouping, e.g. per source, date, keyword, hashtag, and so on.
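As a rough illustration of the grouping in option 2, here is a minimal sketch that pools tweets into one pseudo-document per hashtag before they are handed to a topic model. The function name `pool_by_hashtag`, the regular expression, and the sample tweets are my own assumptions, and hashtag pooling is only one of the groupings mentioned above.

```python
import re
from collections import defaultdict

# Assumption: a hashtag is '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")

def pool_by_hashtag(tweets):
    """Group tweets into pseudo-documents, one per lowercased hashtag.

    A tweet with several hashtags is added to every matching pool;
    tweets without any hashtag are collected under the key None.
    """
    pools = defaultdict(list)
    for text in tweets:
        tags = {tag.lower() for tag in HASHTAG.findall(text)}
        for tag in (tags or {None}):
            pools[tag].append(text)
    # Join each pool into one longer document for the topic model.
    return {tag: " ".join(texts) for tag, texts in pools.items()}

# Hypothetical sample rows, as they might be read back from storage.
tweets = [
    "Great pasta tonight #food",
    "New ramen place downtown #Food #dinner",
    "What a match #football",
    "No tags here",
]
docs = pool_by_hashtag(tweets)
```

Each value in `docs` can then be treated as one document, which gives the model more co-occurring words per document than a single tweet does.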
I would be really interested in the results you get if you apply any of the proposed options any time soon.

Find startup's industry from its description

I am using the AngelList DB to categorize startups by industry, since these startups are currently categorized by community input, which is misleading most of the time.
My business objective is to extract keywords that indicate which industry a given startup belongs to, then map it to one of the industries specified in the LinkedIn sheet https://developer.linkedin.com/docs/reference/industry-codes
I experimented with Azure Machine Learning: I pushed 300 startup descriptions and analyzed them, but the keyword extraction was pretty bad and not even close to what I am trying to achieve.
I would like to know how data scientists would approach this problem. Where should I look, and where should I not? Are keyword analysis tools (like Google AdWords Keyword Planner) a viable option?
Using Text Classification...
To be able to treat this as a classification problem, you need a training set, which is a set of AngelList entries that are labeled with the correct LinkedIn categories. This can be done manually, or you can hire workers on Mechanical Turk to do the job for you.
Since you have ~150 categories, I'd imagine you need at least 20-30* AngelList entries for each of them. So your training set will be {input: angellist_description, result: linkedin_id}
After that, you need to dig through text classification techniques to try and optimize the accuracy/precision of your results. The book "Taming Text" has a full chapter on text classification. And a good tool to implement a text-based classifier would be Apache Solr or Apache Lucene.
* 20-30 is a quick personal estimate and not based on a scientific method. You can look online for a more principled way to estimate the required sample size.
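To make the {input, result} training-set idea concrete, here is a minimal sketch of a text classifier. It uses multinomial Naive Bayes with add-one smoothing, which is my choice for illustration and not something the tools above prescribe. The descriptions and the labels "software" and "food" are made up; real labels would be LinkedIn industry IDs.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train(examples):
    """examples: list of (description, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()            # label -> number of examples
    vocab = set()
    for text, label in examples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict(model, text):
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothed log likelihood.
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training set; labels stand in for LinkedIn industry IDs.
examples = [
    ("cloud platform for developers to deploy software", "software"),
    ("mobile app analytics software for developers", "software"),
    ("restaurant delivering fresh organic food", "food"),
    ("food delivery from local restaurant kitchens", "food"),
]
model = train(examples)
guess = predict(model, "analytics platform for mobile developers")
```

With ~150 categories and only 20-30 examples each, accuracy will depend heavily on how distinctive the vocabulary of each category is, so treat this as a baseline to compare other techniques against.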
Using Text Clustering.
Step #1
Use text clustering to extract main 'topics' from all the descriptions. (Carrot2 can be helpful here)
Input: corpus of all descriptions
Process: text clustering using Carrot2
Output: each document labeled with a topic
Step #2
Manually map the extracted topics into LinkedIn's categories.
Step #3
Use the output of the first two steps to traverse from company -> extracted topic -> linkedin category
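The traversal in step #3 amounts to two lookups. A minimal sketch, assuming the clustering output and the manual mapping are available as plain dictionaries; all company names, topics, and category strings below are placeholders and not actual Carrot2 output or entries from the LinkedIn sheet.

```python
# Step 1 output (placeholder data): company -> extracted topic.
company_to_topic = {
    "Acme Analytics": "data analytics",
    "FreshBite": "meal delivery",
}

# Step 2 output (placeholder data): extracted topic -> LinkedIn category,
# filled in by hand.
topic_to_category = {
    "data analytics": "placeholder software category",
    "meal delivery": "placeholder food category",
}

def linkedin_category(company):
    """Traverse company -> extracted topic -> LinkedIn category.

    Returns None when the company or its topic has no mapping yet.
    """
    topic = company_to_topic.get(company)
    return topic_to_category.get(topic)

result = linkedin_category("FreshBite")
```

Companies that come back as `None` show which topics still need a manual mapping.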

Twitter data topical classification

So I have a data set which consists of tweets from various news organizations. I've loaded it into RapidMiner, tokenized it, and produced some n-grams of it. Now I want to be able to have RapidMiner automatically classify my data into various categories based on the topic of the tweets.
I'm pretty sure RapidMiner can do this, but according to the research I've done into it, I need a training data set to be able to show RapidMiner how I want things classified. So I need a training data set, though given the categories I wanted to classify things into, I might have to create my own.
So my questions are these:
1) Is there a publicly available training data set for Twitter data that focuses on the topic of the tweet rather than on sentiment analysis?
2) If there isn't one publicly available, how can I create my own? My idea to do it was to go through the tweets themselves and associate the tokens and n-grams with the categories I want. Some concerns I have with that are that I won't be able to manually classify enough tweets to create a training data set comprehensive enough so that I can get a good accuracy rate for the automatic classifier.
3) Any general advice for topical classification of text data would be great. This is the first time that I've done a project like this, and I'm sure there are things I could improve on. :)
There may be training corpora that work for you, but you need to say what your topic or categories are to identify it. The fact that this is Twitter may be relevant, but the data source is likely to be much less relevant to the classification accuracy you will achieve than the topic is. So if you take the infamous 20 newsgroups data set this is likely to work on Twitter as well, but only if the categories you are after are the 20 categories from that data set. If you want to classify cats vs dogs or Android vs iPhone you need to find a data set for that.
In most cases you will have to create initial labels manually, which is, as you say, a lot of work. One workaround might be to start with something simpler like a keyword search to create subsets of your tweets for which you know they deal with a particular category. Then you create the model on top of that and hope that it generalizes to identify the same categories even though the original keywords do not occur.
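The keyword-search workaround could be sketched as follows. The categories, keyword lists, and tweets are invented for illustration, and dropping tweets that match zero or several categories is my own assumption about how to keep the seed labels clean.

```python
import re

# Hypothetical seed keywords per category.
KEYWORDS = {
    "food": {"recipe", "restaurant", "dinner"},
    "sports": {"match", "goal", "league"},
}

def weak_label(tweet, keywords=KEYWORDS):
    """Return a category if exactly one category's keywords occur, else None."""
    tokens = set(re.findall(r"\w+", tweet.lower()))
    hits = [cat for cat, words in keywords.items() if tokens & words]
    return hits[0] if len(hits) == 1 else None

tweets = [
    "Late goal wins the match!",
    "Trying a new pasta recipe tonight",
    "Dinner during the match",      # ambiguous, left unlabeled
    "Markets close higher today",   # no keyword, left unlabeled
]
# Keep only the tweets that received an unambiguous label.
seed_set = [(t, weak_label(t)) for t in tweets if weak_label(t) is not None]
```

The resulting `seed_set` is the initial training data; whether a model trained on it generalizes beyond the seed keywords still has to be checked on tweets labeled by hand.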
Alternatively, depending on your application (and if you actually want to build an application), you may as well start with only a small data set and accept that you have poor classification. Then you generate classifications, show them to the users of your app, and collect some form of explicit or implicit feedback on the classification (e.g. users can flag tweets as incorrectly classified). This way you improve your training corpus and periodically update your model.
Finally, if you do not know what your topics are and you want RapidMiner to identify the topics, you may want to try clustering as opposed to classification. Just create a few clusters and look at the top words for each cluster. They may well be quite dissimilar and describe what the respective clusters are about.
I believe your third question may be a bit broad for Stack Overflow and is probably better answered by a textbook.

Clustering of Twitter Feeds

I am new to clustering, just implemented a couple of algorithms before.
I need to cluster tweets according to their similarity.
One way is to use only hash tags, but I don't think it would be that informative. So complete tweets should be analyzed.
Moreover I was searching the web for the algorithms for clustering feeds.
One I encountered is TF-IDF. I want to know whether there are better algorithms that can be implemented in a few hours. I would also be interested in some informative sources about the clustering of Twitter feeds.
PS: No. of tweets : 10^5
As Anony Mousse pointed out in his comment above, TF/IDF is only a normalization measure to make sure that words which are overly popular across all documents don't gain too much importance.
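To see the normalization effect, here is a minimal sketch of one common TF-IDF variant (relative term frequency times the log of the inverse document frequency). Other variants add smoothing or different scaling; the function name and the toy documents are my own.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return vectors

vectors = tf_idf(["the cat sat", "the dog ran", "the cat ran"])
```

In this toy corpus "the" occurs in every document, so its weight is zero, while "sat" (one document) gets a higher weight than "cat" (two documents).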
For data preparation, I'd recommend reading this and the second part of it too (linked via the above link), if you haven't already done so. It is very important to get a vector of numbers from each tweet. In general, in machine learning, it is important to get a feature vector because that way, you can apply mathematical algorithms to your data then.
Once you have a feature vector for each tweet in your collection, things get a bit simpler. There are two clustering algorithms that come to my mind that you can whip up in a couple of hours each, with maybe extensive testing taking a weekend.
K-Means Clustering
Hierarchical Clustering With Single Linkage
With 100,000 tweets only, you should actually be able to implement these algorithms on a single computer (i.e. this is not big data -- no need for cluster computing), using your favorite language (C++, Java, Python, MATLAB, etc.). Personally, I think it's easier to implement K-Means Clustering (which I have done before) compared to Hierarchical Clustering (which I have also done before).
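For a sense of how little code K-Means needs, here is a minimal sketch on small dense vectors. It seeds the centroids with the first k points to stay deterministic, which is a simplification of my own; random or k-means++ seeding is the usual choice. Real tweet vectors would be high-dimensional and sparse.

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(point, centroids):
    return min(range(len(centroids)), key=lambda i: euclid(point, centroids[i]))

def kmeans(points, k, iterations=20):
    # Simplification: seed with the first k points.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                # Mean of each coordinate over the cluster members.
                new_centroids.append([sum(dim) / len(cluster) for dim in zip(*cluster)])
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    assignments = [nearest(p, centroids) for p in points]
    return assignments, centroids

points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
assignments, centroids = kmeans(points, k=2)
```

Each pass compares every point with every centroid, so at 10^5 tweets a sparse vector representation matters more for speed than the loop structure.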
EDIT: Please follow the comments below only if you have labeled training data, i.e. you have tweets with, say, labeled sentiments (happy-user, ok-ok, bad product, angry-user, abusive-user) and the question you want to answer is: given a new tweet, what is its sentiment?
Here is one very good resource you should look at, to get a better understanding of K-Nearest Neighbors:
Laszlo Kozma's Slides
In general, for the other two algorithms, there are ample resources, with Wikipedia articles the best way to start. Personally, I feel K-Nearest Neighbors (shorthand k-NN) is the easiest of the three to implement and will give you quick results.
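A minimal k-NN sketch, assuming each labeled tweet has already been turned into a numeric feature vector. The two-dimensional vectors and the choice of Euclidean distance are placeholders for illustration; the labels reuse the sentiment names from above.

```python
import math
from collections import Counter

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs; returns the majority label."""
    neighbours = sorted(train, key=lambda item: euclid(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Made-up feature vectors for labeled tweets.
train = [
    ((0.9, 0.1), "happy-user"),
    ((0.8, 0.2), "happy-user"),
    ((0.1, 0.9), "angry-user"),
    ((0.2, 0.8), "angry-user"),
    ((0.15, 0.85), "angry-user"),
]
label = knn_predict(train, (0.85, 0.15))
```

There is no training step at all: the labeled tweets are the model, and the cost is paid at query time when distances to all of them are computed.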

Better text documents clustering than tf/idf and cosine similarity?

I'm trying to cluster the Twitter stream. I want to put each tweet into a cluster of tweets that talk about the same topic. I tried to cluster the stream using an online clustering algorithm with tf/idf and cosine similarity, but I found that the results are quite bad.
The main disadvantage of using tf/idf is that it clusters documents that are keyword-similar, so it's only good for identifying near-identical documents. For example, consider the following sentences:
1- The website Stackoverflow is a nice place.
2- Stackoverflow is a website.
The previous two sentences will likely be clustered together with a reasonable threshold value, since they share a lot of keywords. But now consider the following two sentences:
1- The website Stackoverflow is a nice place.
2- I visit Stackoverflow regularly.
Now, using tf/idf, the clustering algorithm will fail miserably, because the sentences share only one keyword even though they both talk about the same topic.
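The effect can be reproduced with a few lines. This sketch uses cosine similarity on raw term counts rather than tf/idf weights, which is a simplification: the weighting changes the exact numbers but not the cause, since the score still depends only on shared words.

```python
import math
import re
from collections import Counter

def cosine(a, b):
    """Cosine similarity between the bag-of-words vectors of two sentences."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

base = "The website Stackoverflow is a nice place."
similar_keywords = cosine(base, "Stackoverflow is a website.")
same_topic = cosine(base, "I visit Stackoverflow regularly.")
```

The first pair shares four words and scores about 0.76; the second pair shares only "Stackoverflow" and scores about 0.19, so any threshold that joins the second pair would also join many unrelated tweets.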
My question: are there better techniques to cluster documents?
In my experience, cosine similarity on latent semantic analysis (LSA/LSI) vectors works a lot better than raw tf-idf for text clustering, though I admit I haven't tried it on Twitter data. In particular, it tends to take care of the sparsity problem that you're encountering, where the documents just don't contain enough common terms.
Topic models such as LDA might work even better.
As mentioned in other comments and answers, using LDA can give good tweet->topic weights.
If these weights alone do not cluster the tweets well enough for your needs, you could feed the topic distributions into a clustering algorithm.
While it depends on the training set, LDA could easily bundle tweets with stackoverflow, stack-overflow and stack overflow into the same topic. However, "my stack of boxes is about to overflow" might instead go into another topic about boxes.
Another example: A tweet with the word Apple could go into a number of different topics (the company, the fruit, New York and others). LDA would look at the other words in the tweet to determine the applicable topics.
"Steve Jobs was the CEO at Apple" is clearly about the company
"I'm eating the most delicious apple" is clearly about the fruit
"I'm going to the big apple when I travel to the USA" is most likely about visiting New York
Long answer:
TFxIDF is currently one of the best-known search methods. What you need is some preprocessing from Natural Language Processing (NLP). There are a lot of resources that can help you for English (for example the 'nltk' library in Python).
You must apply the NLP analysis both to your queries (questions) and to your documents before indexing.
The point is: while TFxIDF (or TFxIDF^2 as in Lucene) is good, you should use it on resources annotated with meta-linguistic information. That can be hard and requires extensive knowledge of your core search engine, grammar analysis (syntax), and the domain of the documents.
Short answer: The better technique is to use TFxIDF with light grammar NLP annotations, and to rewrite both the query and the indexing.

Resources