I am working on a problem where I need to cluster search phrases based on what the user is looking for (for now, let's assume they are only looking for places, such as a bookstore, a supermarket, ...).
"Where can I find a cheesecake ?"
could get clustered probabilistically to 'desserts', 'restaurants', ...
"Where can I buy groceries ?"
could get clustered probabilistically to 'supermarkets', 'vegetables', ...
To begin with, assume that the set of categories a search phrase could be assigned to already exists.
I looked into topic modeling, but I feel like I might be heading in the wrong direction. Any suggestions on how to get started or what to look into would be very helpful.
Thanks a lot.
Topic modelling certainly provides one possible solution. Induce a topic model from a large corpus, as representative as possible of the texts you're indexing and searching with. Then represent each query as the posterior over the topics given the query. If you want to obtain a clustering of queries, you could then do so on this reduced set, or if you're doing IR you could use the resulting vectors instead of the original bag of words.
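To make "the posterior over the topics given the query" concrete, here is a minimal sketch. It assumes a deliberately simplified model in which the whole query is drawn from a single topic (a mixture of unigrams rather than full LDA inference), and the topic names, word probabilities, prior and smoothing constant are all made-up placeholders standing in for a model trained on a real corpus.

```python
import math

# Hypothetical topic-word probabilities; a real system would learn these
# from a large corpus instead of hard-coding them.
TOPIC_WORD = {
    "desserts": {"cheesecake": 0.5, "find": 0.2, "buy": 0.1, "groceries": 0.05},
    "supermarkets": {"groceries": 0.5, "buy": 0.3, "find": 0.1, "cheesecake": 0.05},
}
TOPIC_PRIOR = {"desserts": 0.5, "supermarkets": 0.5}
UNSEEN = 1e-3  # assumed probability for words a topic has never seen


def topic_posterior(query):
    """Return p(topic | query) under the single-topic-per-query assumption."""
    words = [w.strip("?.,!").lower() for w in query.split()]
    words = [w for w in words if w]
    log_scores = {}
    for topic, prior in TOPIC_PRIOR.items():
        dist = TOPIC_WORD[topic]
        log_scores[topic] = math.log(prior) + sum(
            math.log(dist.get(w, UNSEEN)) for w in words
        )
    # Normalise in a numerically stable way (subtract the max log score first).
    top = max(log_scores.values())
    unnorm = {t: math.exp(s - top) for t, s in log_scores.items()}
    total = sum(unnorm.values())
    return {t: v / total for t, v in unnorm.items()}


posterior = topic_posterior("Where can I find a cheesecake ?")
```

The resulting dictionary is the reduced vector for the query; clustering or retrieval would then operate on these vectors instead of on the raw bag of words.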
If this isn't what you want, can you elaborate on the problem? What do you hope to do with the clustered queries?
I have an app that displays information about certain venues. Each venue is awarded a rating on a scale from 0-100. The app includes a map, and on the map I'd like to show the best nearby venues. (The point is to recommend to the user alternative venues that they might like.)
What is the best way to approach this problem?
If I fetch the nearest x venues, many bad venues (i.e. those with a low rating) show.
If I fetch the highest rated venues, many of them will be too far away to be useful as recommendations.
This seems like a pretty common challenge for any geolocation app, so I'm interested to know what approach other people have taken.
I have considered "scoring" each possible venue by taking into account its rating and its distance in miles.
I've also considered fetching the highest rated venues within a y mile radius, but this gets problematic because in some cities there are a lot of venues in a small area (e.g. New York) and in others it's reasonable to recommend venues that are farther away.
(This is a Rails app, and I'm using Solr with the Sunspot gem to retrieve the data. But I'm not necessarily looking for answers in code here, more just advice about the logic.)
Personally, I would implement a few formulas and use some form of A/B testing to get an idea as to which ones yield the best results on some outcome metric. What exactly that metric is is up to you. It could be clicks, or it could be something more complicated.
Start out with the simplest formula you can think of (ideally one that is computationally cheap as well) to establish a baseline. From there, you can iterate, but the absolute key concept is that you'll have hard data to tell you if you're getting better or worse, not just a hunch (perhaps that a more complicated formula is better). Even if you got your hands on Yelp's formula, it might not work for you.
For instance, as you mentioned, a single score calculated based on some linear combination of inverse distance and establishment quality would be a good starting point and you can roll it out in a few minutes. Make sure to normalize each component score in some way. Here's a possible very simple algorithm you could start with:
Filter venues as much as possible on fast-to-query attributes (by type, country, etc.)
Filter remaining venues within a fairly wide radius (you'll need to do some research into exactly how to do this in a performant way; there are plenty of posts on Stack Overflow and elsewhere on this. You'll want to index your database table on latitude and longitude, and follow a number of other best practices).
Score the remaining venues using some weights that seem intuitive to you (I arbitrarily picked 0.25 and 0.75, but they should add up to 1):
score = 0.25 * (1 - (distance - distance of closest venue) / (distance of furthest venue - distance of closest venue)) + 0.75 * (quality score / highest quality score in remaining set)
Here the closest and furthest venues are taken from the remaining set, so both components fall between 0 and 1.
Sort them by score and take the top n
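The scoring and sorting steps can be sketched as follows. This is only an illustration: it assumes the filtering has already happened, that distance is min-max normalised over the remaining set, and that each venue is a plain dictionary with hypothetical `name`, `distance` and `rating` keys.

```python
def score_venues(venues, w_distance=0.25, w_quality=0.75, top_n=3):
    """Rank venues by a weighted sum of normalised closeness and quality.

    `venues` is a list of dicts with 'name', 'distance' and 'rating' keys
    (hypothetical field names chosen for this sketch).
    """
    if not venues:
        return []
    d_min = min(v["distance"] for v in venues)
    d_max = max(v["distance"] for v in venues)
    q_max = max(v["rating"] for v in venues)
    d_range = d_max - d_min

    scored = []
    for v in venues:
        # Closeness is 1 for the nearest venue and 0 for the furthest one.
        closeness = 1.0 if d_range == 0 else 1 - (v["distance"] - d_min) / d_range
        quality = 0.0 if q_max == 0 else v["rating"] / q_max
        scored.append((w_distance * closeness + w_quality * quality, v["name"]))

    # Highest score first; ties are broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_n]


ranked = score_venues([
    {"name": "A", "distance": 0.5, "rating": 60},
    {"name": "B", "distance": 2.0, "rating": 95},
    {"name": "C", "distance": 5.0, "rating": 100},
])
```

Because the weights are ordinary parameters, each A/B variant can simply call the function with a different pair of weights.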
I would put money on Yelp using some fancy-pants version of this simple idea. They may be using machine learning to actually select the weights for each component score, but the conceptual basis is similar.
While there are plenty of possibilities for calculating formulas of varying complexity, the only way to truly know which one works best is to gather data.
I would fix the number of venues returned at say 7.
Discard all venues with scores in the lowest quartile of reviewers' scores, to avoid bad customer experiences, then return the top 7 within a postcode. If this results in fewer than 7 entries, then look to the neighboring postcodes to find the best scores to complete the list.
This would result in a list of top to mediocre scores locally, perhaps with some really good scores only a short distance away.
From a UX perspective this would easily allow users to either select a postcode/area they are interested in or allow the app to determine its location.
From a data perspective, you already have addresses. The only "tricky" bit is determining what the neighboring postcodes/areas are, but I'm sure someone has figured that out already.
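A rough sketch of that postcode-based approach is below. It assumes the neighboring postcodes are already available as a lookup table (the `neighbours` mapping is hypothetical), uses a simple index-based cut-off for the lowest quartile, and treats each venue as a dictionary with made-up `postcode` and `rating` keys.

```python
def recommend_by_postcode(venues, postcode, neighbours, limit=7):
    """Top-rated venues in a postcode, topped up from neighboring postcodes."""
    if not venues:
        return []
    # Drop venues whose rating falls in the lowest quartile overall.
    ratings = sorted(v["rating"] for v in venues)
    cutoff = ratings[len(ratings) // 4]
    eligible = [v for v in venues if v["rating"] >= cutoff]

    def best_in(codes):
        found = [v for v in eligible if v["postcode"] in codes]
        return sorted(found, key=lambda v: v["rating"], reverse=True)

    result = best_in({postcode})[:limit]
    if len(result) < limit:
        # Not enough local venues: fill the remaining slots from the neighbors.
        nearby = best_in(set(neighbours.get(postcode, [])))
        result.extend(nearby[: limit - len(result)])
    return result
```

For example, with `limit=3`, two eligible local venues and a neighboring postcode containing a highly rated venue, the result would be the two local venues followed by the best neighboring one.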
As an aside, I'm a great believer in things changing. Like restaurants changing hands or the owners waking up and getting better. I would consider offering a "dangerous" list of sub-standard eateries "at your own risk" as another form of evening entertainment. Personally I have found some of my worst dining experiences have formed some of my best dining out stories :-) And if the place has been harshly judged in the past you can sometimes find it is now a gem in the making.
First, I suggest using a Bayesian average to maintain an overall rating for each venue; more info here: https://github.com/tyrauber/acts_rateable
Then you can retrieve the nearest venues, ordered first by distance and then by rating, i.e. with two order-by clauses in your query.
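For reference, one common form of a Bayesian average is sketched below. It is not taken from the linked gem; the prior mean and prior weight are assumed tuning parameters that pull venues with few ratings towards the overall average.

```python
def bayesian_average(ratings, prior_mean, prior_weight):
    """Blend a venue's own ratings with a prior mean.

    `prior_weight` acts like a number of imaginary ratings at `prior_mean`,
    so venues with few real ratings stay close to the prior.
    """
    n = len(ratings)
    return (prior_weight * prior_mean + sum(ratings)) / (prior_weight + n)


# A single perfect rating moves the score only part of the way up...
one_vote = bayesian_average([100], prior_mean=50, prior_weight=4)
# ...while many perfect ratings get much closer to 100.
many_votes = bayesian_average([100] * 16, prior_mean=50, prior_weight=4)
```

With these numbers the single-vote venue ends up at 60 and the sixteen-vote venue at 90, which is the intended effect: a rating backed by more votes is trusted more.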
I am building an Item Based Recommender System for 10 million users who rate categories out of 20 possible categories (news categories such as politics, sport, etc.).
I would like each of them to be recommended at least one other category that they don't know (i.e. have not rated).
I ran a GenericUserBasedRecommender and asked for recommendations for each user, but it is extremely slow: roughly 1,000 users processed per minute.
My questions are:
1 - Can I run this same GenericUserBasedRecommender on Hadoop, and would it really be faster? I saw and ran an ItemBasedRecommender from the command line on a cluster, but I would rather run a user-based one.
1.5 - I saw many users not getting a single recommendation. What criterion does the algorithm use to determine whether a user gets a recommendation? I thought the users who don't get recommendations might be the ones who gave only a single rating, but I don't understand why.
2 - Is there a smarter way to deal with my problem? Maybe some clustering solution instead of recommendation? I don't exactly see how.
3 - Finally, am I right in saying that the algorithms that have no command-line interface are not meant to be used with Hadoop?
Thank you for your answers.
Sometimes you won't get recommendations for certain items or users because there are too few items over which they overlap with others. It could also be that the user has 'enough' data, but their behaviour or usage patterns are very unusual and/or disagree with the popular trends in the data.
You could perhaps try LogLikelihood or Tanimoto based ItemSimilarity.
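To show what a Tanimoto-based item similarity measures, here is a small standalone sketch. It does not call Mahout; it computes the Tanimoto (Jaccard) coefficient between the sets of users who rated two categories, using made-up data.

```python
def tanimoto(users_a, users_b):
    """Tanimoto coefficient: size of the intersection over size of the union."""
    users_a, users_b = set(users_a), set(users_b)
    union = users_a | users_b
    if not union:
        return 0.0
    return len(users_a & users_b) / len(union)


# Hypothetical data: which users rated which news category.
raters = {
    "politics": {"u1", "u2", "u3"},
    "sport": {"u2", "u3", "u4"},
    "tech": {"u5"},
}

sim_politics_sport = tanimoto(raters["politics"], raters["sport"])
sim_politics_tech = tanimoto(raters["politics"], raters["tech"])
```

Categories with no raters in common get a similarity of zero, which is one way users with very little overlap can end up with no recommendations.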
Another thing you could look into is a matrix-factorization-based model. You could use the ALSWR factorizer to generate recommendations. This family of methods decomposes the original user-item matrix into low-rank user-feature and item-feature matrices (an SVD additionally yields a diagonal matrix of singular values), which reduces the dimensionality, and then reconstructs a low-rank matrix that is as close as possible to the original. You might lose some information with this method, but the missing values in the user-item matrix are imputed, so you get estimated preference/recommendation values.
If you have the features and not just implicit ratings, you could probably experiment with clustering techniques, perhaps start with Hierarchical Clustering.
I did not quite get your last question.
It's well known how collaborative filtering (CF) is used for movie, music, and book recommendations. In the paper 'Collaborative Topic Modeling for Recommending Scientific Articles', the authors show, among other things, an example of collaborative filtering applied to ~5,500 users and ~17,000 scientific articles. With ~200,000 user-item pairs, the user-article matrix is obviously highly sparse.
What if you do collaborative filtering with matrix factorization for, say, all news articles shared on Twitter? The matrix will be even sparser than in the scientific articles case, which makes CF not very applicable. Of course, we can do some content-aware analysis (taking into account the text of an article), but that's not my focus. Or we can potentially limit our time window (focusing, say, on all news articles shared in the last day or week) to make the user-article matrix denser. Any other ideas on how to deal with the fact that the matrix is very sparse? What are the research results in the area of CF for news article recommendations? Thanks a lot in advance!
You might try using an object-to-object collaborative filter instead of a user-to-object filter. Age out related pairs (and low-incidence pairs) over time since they're largely irrelevant in your use case anyway.
I did some work on the Netflix Prize back in the day, and quickly found that I could significantly outperform the base model with regard to predicting which items were users' favorites. Unfortunately, since it's basically a rank model rather than a scalar predictor, I didn't have RMSE values to compare.
I know this method works because I wrote a production version of this same system. My early tests showed that, given a task wherein 50% of users' top-rated movies were deleted, the object-to-object model correctly predicted (i.e., "replaced") about 16x more of users' actual favorites than a basic slope-one model. Plus the table size is manageable. From there it's easy to include a profitability weight against the sort order, etc. depending on your application.
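As a rough illustration of an object-to-object filter (not the production system described above), the sketch below counts how often two items appear together in a user's history, drops low-incidence pairs, and recommends the items most often paired with what the user already has. Aging out old pairs is left out; it could be added by storing a timestamp per pair. All names and data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_pair_counts(user_items):
    """Count co-occurrences of item pairs across all users' histories."""
    counts = defaultdict(int)
    for items in user_items.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
    return counts


def related_items(owned, counts, min_count=2, top_n=3):
    """Recommend items that co-occur with `owned` at least `min_count` times."""
    owned = set(owned)
    scores = defaultdict(int)
    for (a, b), c in counts.items():
        if c < min_count:
            continue  # low-incidence pairs are treated as noise
        if a in owned and b not in owned:
            scores[b] += c
        elif b in owned and a not in owned:
            scores[a] += c
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


history = {"u1": ["a", "b", "c"], "u2": ["a", "b"], "u3": ["a", "c"], "u4": ["b", "d"]}
pair_counts = build_pair_counts(history)
suggestions = related_items({"a"}, pair_counts)
```

The pair table only grows with the number of items that actually co-occur, which is what keeps its size manageable on sparse data.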
Hope this helps! I have a working version in production but am still looking for beta clients to bang on the system... if anyone has time to give it a run I'd love to hear from you.
Jeb Stone, PhD
www.selloscope.com
I have a group of documents in MongoDB with a "description" value about the size of a tweet. I need to generate a trending topics list from this. Clearly this is a solved problem but I can't find a definitive answer/gem for getting the job done without writing the code myself.
I am using Ruby and Mongoid in my app.
Is there any ruby gem that will help with or handle this? Thanks.
I know of no such gem, but here's an algorithm you may write for yourself:
Extract n-grams from texts. Since texts are small (tweet size you said) extract all n-grams, no limit here.
"I eat icecream" => {(I), (eat), (icecream), (I eat), (eat icecream), (I eat icecream)}
Compute TF-IDF weight vectors for each text's n-grams
{(I):0.1, (eat):0.01, (icecream):0.2, (I eat):0.12, (eat icecream):0.001, (I eat icecream):0.00012}
Use cosine similarity as the similarity measure for an incremental clustering algorithm over your vectors; one option is to script the Weka library from JRuby
Order all clusters by the population size. The n-grams in the centers of largest clusters are your trendy topics.
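The first three steps above can be sketched with the standard library alone. This is an illustration under assumptions: tokens are split on whitespace, the IDF uses one common smoothed variant, and the clustering step itself is left out.

```python
import math
from collections import Counter


def ngrams(text):
    """All contiguous word n-grams of a short text, from unigrams upwards."""
    tokens = text.split()
    return [
        " ".join(tokens[i:j])
        for i in range(len(tokens))
        for j in range(i + 1, len(tokens) + 1)
    ]


def tfidf_vectors(texts):
    """One sparse TF-IDF vector (a dict) per text, over its n-grams."""
    docs = [Counter(ngrams(t)) for t in texts]
    doc_freq = Counter(gram for doc in docs for gram in doc)
    n = len(docs)
    # Smoothed IDF (an assumed choice): log((1 + n) / (1 + df)) + 1.
    return [
        {g: tf * (math.log((1 + n) / (1 + doc_freq[g])) + 1) for g, tf in doc.items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[g] for g, w in u.items() if g in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vectors = tfidf_vectors(["I eat icecream", "I eat cake", "stocks fell today"])
```

An incremental clusterer would then assign each new vector to the cluster whose center has the highest cosine similarity, or start a new cluster when no similarity exceeds a chosen threshold.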
A quick search of rubygems.org revealed that you are going to have to do some programming. This is a good thing, as a system to generically detect trends would either be hopelessly difficult to set up and tune or awful at guessing what dictates a "trend" in your application.
I'm going to make some assumptions about your application.
Let's assume users are self-categorizing their tweets by using hash tags (#). Also, let's go ahead and say a sorted count of these hash tags would determine if a topic was trending.
Now let's talk about the computer science part. Given our assumptions above, you will need to be able to quickly query and sort a collection of hashtags to figure out what is trending.
You are using MongoDB and Mongoid (with Rails), so the simplest way to do this would be to create a collection of tag documents, each containing a count of its use. Create indexes on tag and count.
When someone tweets, figure out what the hash tags are, look them up in the tags collection and increment their count. To figure out what is trending, query the tags collection and sort by count. This would get you all-time trending hash tags.
If you wanted to get more specific, instead of just storing counts, store counts broken out by time deltas (week, day, hour etc) perhaps storing them separately. You could create documents that represent your time delta instead of the individual tags and store all the tags with their counts inside.
{
  start: "start datetime",
  end: "end datetime",
  tags: {
    awesome: 3,
    cool: 2,
    boring: 2
  }
}
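The same bucketing idea can be sketched in memory, without MongoDB, to show the logic. The one-hour bucket size, the hashtag pattern and the function names are assumptions made for this illustration.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

HASHTAG = re.compile(r"#(\w+)")


def bucket_key(timestamp):
    """Label for the one-hour bucket that a timestamp falls into."""
    return timestamp.strftime("%Y-%m-%d %H:00")


def count_tags(posts):
    """Map each hourly bucket to a Counter of the hashtags used in it."""
    buckets = defaultdict(Counter)
    for timestamp, text in posts:
        tags = (tag.lower() for tag in HASHTAG.findall(text))
        buckets[bucket_key(timestamp)].update(tags)
    return buckets


def trending(buckets, key, n=3):
    """The n most used tags in one bucket, most frequent first."""
    return buckets.get(key, Counter()).most_common(n)


posts = [
    (datetime(2024, 1, 1, 9, 5), "this is #awesome and #cool"),
    (datetime(2024, 1, 1, 9, 20), "#awesome #boring"),
    (datetime(2024, 1, 1, 9, 40), "so #Awesome, so #cool, so #boring"),
    (datetime(2024, 1, 1, 10, 2), "#later"),
]
hourly = count_tags(posts)
```

Each bucket corresponds to one document of the shape shown above, with the bucket label playing the role of the start and end fields.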
You could also use a capped collection. Hope that helps, all of this really depends on what you are trying to do. You can get really crazy and calculate the trends with time decay, etc. You could read the reddit or hacker news code to get a good idea of what that is like.
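As one possible reading of "time decay", the sketch below weights each tag use by how long ago it happened, using an exponential half-life. This is a generic formula chosen for illustration, not the ranking used by reddit or Hacker News, and the six-hour half-life is an arbitrary assumption.

```python
def decayed_score(events, half_life_hours=6.0):
    """Sum of counts, each halved for every `half_life_hours` of age.

    `events` is a list of (age_in_hours, count) pairs for a single tag.
    """
    return sum(count * 0.5 ** (age / half_life_hours) for age, count in events)


# Four uses right now plus four uses six hours ago:
# the older ones count for half, giving 4 + 2 = 6.
score = decayed_score([(0, 4), (6, 4)])
```

Sorting tags by this score instead of by raw count favors tags that are being used right now over tags that were popular earlier.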
I'm writing a web robot that categorizes sites, based on their keywords/meta tags/links, into a predefined list of categories.
I've been looking at various ontology approaches and have looked at WordNet (for the hypernym/hyponym relations), ResearchCyc, and WebKB, and was wondering whether this is as hard a problem as I think, or whether it has been solved somewhere else before.
Essentially, I have large stacks of sorted keyword values and would like to use them to match against a category name. My current thoughts are to check against the category name in some kind of ontology hierarchy.
Has anyone else approached an ontology-based problem like this?
Cheers!
You might want to look at text mining research, specifically keyword mining or subject indexing.