Identify trending topics in Twitter

I am using Spark Streaming to stream real-time tweets (filtered to English only) and store them in Cassandra. I then plan to run k-means / LSI (using Spark MLlib) to identify trending topics.
I need hints on how to represent these tweets in a matrix (vector) representation. I also want to know whether it is right to train the model on the stored data and then run it on the streamed data.

It all depends on the features you are using and the language you are working in.
You could represent each tweet as a vector with one column per vocabulary word, with each value weighted by some metric such as TF-IDF. Then run k-means on a regular (or sparse) RDD.
https://spark.apache.org/docs/1.1.0/mllib-clustering.html
https://spark-summit.org/2014/wp-content/uploads/2014/07/sparse_data_support_in_mllib1.pdf
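A minimal sketch of this pipeline, using scikit-learn's `TfidfVectorizer` and `KMeans` as a stand-in for Spark MLlib (the same TF-IDF → k-means flow maps onto MLlib's `HashingTF`/`IDF` and `KMeans`); the example tweets are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

tweets = [
    "spark streaming is great for real time data",
    "cassandra stores tweets reliably",
    "k-means clusters tweets into trending topics",
    "real time tweet streams flow through spark",
]

# Each tweet becomes a sparse row: one column per vocabulary word, TF-IDF weighted.
vectors = TfidfVectorizer().fit_transform(tweets)

# Cluster the tweet vectors; each cluster approximates one "topic".
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
print(model.labels_)
```

With real data you would fit the vectorizer and model on the stored (historical) tweets and then transform incoming streamed tweets with the same vocabulary.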

Related

Is it possible to cluster data with grouped rows of data in unsupervised learning?

I am working to set up data for an unsupervised learning algorithm. The goal of the project is to group (cluster) different customers together based on their behavior on the website. Obviously, some sort of clustering algorithm is best for discovering patterns in the data we can't see as humans.
However, the database contains multiple rows for each customer (in chronological order), one for each action the customer took on the website during that visit. For example, a customer with ID #123 clicked on page 1 at time X, which would be one row in the database; then the same customer clicked another page at time Y, making another row.
My question is what algorithm or approach would you use for clustering in this given scenario? K-means is really popular for this type of problem, but I don't know if it's possible to use in this situation because of the grouping. Is it somehow possible to do cluster analysis around one specific ID that includes multiple rows?
Any help/direction of unsupervised learning I should take is appreciated.
In short:
1) Learn a fixed-length embedding (representation) of each event;
2) Learn a way to combine a sequence of such embeddings into a single representation per customer, then use your favorite unsupervised methods.
For (1), you can either craft the embeddings manually or use an encoder/decoder.
For (2), there is a range of options, from simply averaging the embeddings of each event, to training an encoder-decoder to reconstruct the original sequence of events and taking the intermediate representation (the one the decoder uses to reconstruct the sequence).
A good read on this topic (though a bit old; nowadays you also have the option of Transformer networks):
Representations for Language: From Word Embeddings to Sentence Meanings
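A minimal sketch of the simplest variant of (2): average the per-event embeddings into one fixed-length vector per customer, then feed those vectors to any clusterer. The event types and embedding values below are invented:

```python
import numpy as np

# Hypothetical 4-dimensional embeddings for three event types
# (e.g. learned by an encoder, or hand-crafted per step (1)).
event_embedding = {
    "view_page":   np.array([1.0, 0.0, 0.0, 0.0]),
    "add_to_cart": np.array([0.0, 1.0, 0.0, 0.5]),
    "checkout":    np.array([0.0, 0.0, 1.0, 1.0]),
}

# Each customer is a chronological sequence of events (multiple DB rows).
customers = {
    123: ["view_page", "view_page", "add_to_cart"],
    456: ["view_page", "checkout"],
}

# Step (2), simplest variant: average the event embeddings into one
# fixed-length vector per customer, ready for k-means or any clusterer.
customer_vectors = {
    cid: np.mean([event_embedding[e] for e in events], axis=0)
    for cid, events in customers.items()
}
print(customer_vectors[123])
```

Averaging throws away the ordering; the encoder-decoder route mentioned above is what recovers sequence information.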

Online clustering of news articles

Is there a common online algorithm to classify news dynamically? I have a huge data set of news articles classified by topic, and I consider each of those topics a cluster. Now I need to classify breaking news, and I will probably need to generate new topics (new clusters) dynamically.
The algorithm I'm using is the following:
1) I go through a group of feeds from news sites and I recognize news links.
2) For each new link, I extract the content using dragnet, and then I tokenize it.
3) I find the vector representation of all the old news and the last one using TfidfVectorizer from sklearn.
4) I find the nearest neighbor in my dataset by computing the Euclidean distance between the latest news vector and the vector representations of all the old news.
5) If that distance is smaller than a threshold, I put the article in the cluster the neighbor belongs to. Otherwise, I create a new cluster for the breaking news.
Each time a news article arrives, I re-fit all the data using TfidfVectorizer, because new dimensions (words) can appear. I can't just re-fit once per day, because I need to detect breaking events, which can relate to unknown topics. Is there a common approach more efficient than the one I am using?
If you build the vectorization yourself, adding new data becomes much easier.
You can trivially add new words as new columns that are simply 0 for all earlier documents.
Don't bake the IDF weights into the stored vectors; apply them dynamically at comparison time instead.
There are well-known, and very fast, implementations of this.
For example Apache Lucene: it can add new documents online, and it uses a variant of TF-IDF for search.
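A rough sketch of the scheme this answer describes, assuming a hand-rolled index: store raw term frequencies only, recompute (smoothed) IDF weights dynamically at comparison time, and assign each new article to its nearest neighbour's cluster or to a fresh one. The threshold and the smoothing are illustrative choices:

```python
import math
from collections import Counter

doc_tfs = []          # raw term-frequency Counters, one per document
doc_freq = Counter()  # document frequency of each term
clusters = []         # cluster id assigned to each document

def add_document(tokens, threshold=0.5):
    """Index a new article and assign it a cluster, nearest-neighbour style."""
    tf = Counter(tokens)
    for term in tf:
        doc_freq[term] += 1
    doc_tfs.append(tf)
    n = len(doc_tfs)

    def weighted(counter):
        # Dynamic (smoothed) TF-IDF: the IDF part is recomputed from the
        # current doc_freq, so stored vectors never go stale as words appear.
        return {t: c * (math.log((1 + n) / (1 + doc_freq[t])) + 1)
                for t, c in counter.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    v = weighted(tf)
    best, best_sim = None, 0.0
    for i, other in enumerate(doc_tfs[:-1]):
        sim = cosine(v, weighted(other))
        if sim > best_sim:
            best, best_sim = i, sim
    if best is not None and best_sim >= threshold:
        clusters.append(clusters[best])                  # join neighbour's cluster
    else:
        clusters.append(max(clusters, default=-1) + 1)   # new (breaking) topic
    return clusters[-1]
```

This is only the data-structure idea; for real corpora you would use an inverted index (as Lucene does) instead of scanning all stored documents.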

Using Text Sentiment as feature in machine learning model?

I am researching which features I'll have for my machine learning model, given the data I have. My data contains a lot of text, so I was wondering how to extract valuable features from it. Contrary to my previous belief, this often consists of a representation such as bag-of-words or word2vec (http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).
Because my understanding of the subject is limited, I don't understand why I can't analyze the text first to get numeric values (for example TextBlob's sentiment, https://textblob.readthedocs.io/en/dev/, or Google Cloud Natural Language, https://cloud.google.com/natural-language/).
Are there problems with this, or could I use these values as features for my machine learning model?
Thanks in advance for all the help!
Of course, you can convert text input into a single number with sentiment analysis and then use this number as a feature in your machine learning model. There is nothing wrong with this approach.
The question is what kind of information you want to extract from the text data. Sentiment analysis converts text input into a number between -1 and 1 that represents how positive or negative the text is. For example, you may want sentiment information from customers' comments about a restaurant to measure their satisfaction. In this case, it is fine to use sentiment analysis to preprocess the text data.
But again, sentiment analysis only gives an idea of how positive or negative a text is. If you want to cluster text data, sentiment information is not useful, since it provides no information about the similarity of texts. For such tasks, other approaches like word2vec or bag-of-words are used instead, because those algorithms provide a vector representation of the text rather than a single number.
In conclusion, the approach depends on what kind of information you need to extract from the data for your specific task.
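A small sketch of sentiment-as-a-feature: a toy lexicon stands in for a real sentiment API such as TextBlob's polarity score, and the resulting number is appended as one extra column to an invented numeric feature matrix:

```python
import numpy as np

# Toy stand-in for a sentiment API (e.g. TextBlob's `.sentiment.polarity`):
# a tiny hypothetical lexicon scoring text in [-1, 1]. In practice you would
# call the real library here instead.
LEXICON = {"great": 1.0, "good": 0.5, "bad": -0.5, "terrible": -1.0}

def sentiment(text):
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

# Existing numeric features for three samples (hypothetical values)...
numeric = np.array([[3, 120], [1, 45], [5, 300]], dtype=float)
texts = ["great food and good service", "terrible wait", "good"]

# ...extended with the sentiment score as one extra feature column.
features = np.hstack([numeric, np.array([[sentiment(t)] for t in texts])])
print(features)
```

Any downstream model then sees sentiment as just another numeric column alongside the rest.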

Binary recommendation algorithms

I'm currently doing some research for a school assignment. I have two data streams, one is user ratings and the other is search, click and order history (binary data) of a webshop.
I found that collaborative filtering is the best family of algorithms if you are using rating data. I found and researched these algorithms:
- Memory-based
  - User-based
    - Pearson correlation
    - Constrained Pearson
    - Vector similarity (cosine)
    - Mean squared difference
    - Weighted Pearson
    - Correlation threshold
    - Max number of neighbours
    - Weighted by correlation
    - Z-score normalization
  - Item-based
    - Adjusted cosine
    - Maximum number of neighbours
    - Similarity fusion
- Model-based
  - Regression based
  - Slope One
  - LSI/SVD
  - Regularized SVD (RSVD/RSVD2/NSVD2/SVD++)
  - Integrated neighbour based
  - Cluster-based smoothing
Now I'm looking for a way to use the binary data, but I'm having a hard time figuring out whether it is possible to use binary data instead of rating data with these algorithms, or whether there is a different family of algorithms I should be looking at.
I apologize in advance for spelling errors, since I have dyslexia and am not a native writer. Thanks marc_s for helping.
Take a look at data mining algorithms such as association rule mining (aka market basket analysis). You've come upon a tough problem in recommendation systems: unary and binary data are common, but the best algorithms for personalization don't work well with them.
Rating data can represent preference for a single user-item pair; e.g., I rate this movie 4 stars out of 5. With binary data, we have the least granular type of rating data: I either like or don't like something, or have or have not consumed it.
Be careful not to confuse binary and unary data: unary data means that you have information that a user consumed something (coded as 1, much like binary data), but you have no information about whether a user didn't like or didn't consume something (coded as NULL instead of binary data's 0). For instance, you may know that a person viewed 10 web pages, but you don't have any idea what she would have thought of other pages had she known they were available. That's unary data. You can't assume any preference information from NULL.
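A toy sketch of association rules over binary order history: each basket holds the items a user consumed (the 1s of the binary matrix), and rules A → B are scored by confidence = support(A ∪ B) / support(A). The item names and baskets are invented:

```python
from itertools import combinations

# Hypothetical binary order history: each basket is the set of items a user
# consumed; items absent from a basket are the 0s (or NULLs, for unary data).
baskets = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"phone", "charger"},
    {"laptop", "mouse"},
]

def support(itemset):
    """Fraction of baskets containing every item in the set."""
    return sum(itemset <= b for b in baskets) / len(baskets)

# Single-antecedent rules A -> B with confidence = support(A | B) / support(A).
items = sorted(set().union(*baskets))
rules = {}
for a, b in combinations(items, 2):
    for ante, cons in [({a}, {b}), ({b}, {a})]:
        if support(ante) > 0:
            rules[(tuple(ante), tuple(cons))] = support(ante | cons) / support(ante)

# Confidence that buyers of a case also buy a phone.
print(rules[(("case",), ("phone",))])
```

Real implementations (Apriori, FP-Growth) prune the candidate itemsets by minimum support instead of enumerating all pairs, but the support/confidence arithmetic is the same.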

Is training data required for collaborative filtering methods?

I'm about to start writing a recommender system for videos, mostly based on collaborative filtering as video metadata is pretty sparse, with a bit of content-based filtering as well. However, I'm not sure what to do about training data. Is training data something of importance in recommender systems, specifically in collaborative methods? If so, how can I generate that kind of data, or what type of data should I look for?
Any ML algorithm needs data. Take the matrix factorization approach, for example.
It receives an (incomplete) matrix of ratings: rows represent users, columns represent items, and a cell contains the rating that a particular user gave a particular item. By factorizing this matrix you obtain a latent vector representation for each user and each item, which lets you predict the missing ratings. Obviously, the unseen items with the highest predicted ratings are, according to the model, the most interesting to the user.
Essentially, matrix factorization learns to predict new ratings for known users and items.
