I am trying to build a recommender system which would recommend webpages to the user based on the user's actions (Google searches, clicks; the user can also explicitly rate webpages). To get an idea of what I mean: Google News displays news articles from the web on a particular topic. In technical terms that is clustering, but my aim is similar. It will be content-based recommendation based on the user's actions.
So my questions are:
How can I crawl the internet to find related web pages?
What algorithm should I use to extract data from a web page? Are textual analysis and word frequency the only ways to do it?
Lastly, what platform is best suited for this problem? I have heard that Apache Mahout comes with some reusable algorithms; does it sound like a good fit?
As Thomas Jungblut said, one could write several books on your questions ;-)
I will try to give you a list of brief pointers, but be aware there will be no ready-to-use, off-the-shelf solution.
Crawling the internet: There are plenty of toolkits for doing this, like Scrapy for Python, crawler4j and Heritrix for Java, or WWW::Robot for Perl. For extracting the actual content from web pages, have a look at boilerpipe.
http://scrapy.org/
http://crawler.archive.org/
http://code.google.com/p/crawler4j/
https://metacpan.org/module/WWW::Robot
http://code.google.com/p/boilerpipe/
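boilerpipe does sophisticated boilerplate removal; just to illustrate the underlying idea of separating visible text from markup, here is a minimal sketch using only Python's standard library (it does none of boilerpipe's actual heuristics):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

page = ("<html><head><style>p{color:red}</style></head>"
        "<body><p>Hello world</p><script>var x=1;</script></body></html>")
print(extract_text(page))  # -> Hello world
```

Real pages need much more (encoding detection, boilerplate heuristics, malformed HTML), which is exactly what the toolkits above provide.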
First of all, you can often use collaborative filtering instead of content-based approaches. But if you want good coverage, especially in the long tail, there will be no way around analyzing the text. One thing to look at is topic modelling, e.g. latent Dirichlet allocation (LDA). Several LDA approaches are implemented in Mallet, Apache Mahout, and Vowpal Wabbit.
For indexing, search, and text processing, have a look at Lucene. It is an awesome, mature piece of software.
http://mallet.cs.umass.edu/
http://mahout.apache.org/
http://hunch.net/~vw/
http://lucene.apache.org/
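Regarding the "is word frequency the only way" question: plain bag-of-words with tf-idf weighting already goes a long way for matching pages by content, before you reach for LDA. A minimal standard-library sketch (the toy documents are made up):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute tf-idf weight vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        vectors.append(vec)
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "mahout machine learning java".split(),
    "mahout recommender java hadoop".split(),
    "ice cream flavours pricing".split(),
]
vecs = tfidf_vectors(docs)
# the two Mahout pages should be more similar to each other than to the third
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Lucene implements this kind of weighting (and much more) at scale, so in practice you would let it do the indexing and scoring.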
Besides Apache Mahout, which also contains things like LDA (see above), clustering, and text processing, there are other toolkits available if you want to focus on collaborative filtering: LensKit, which is also implemented in Java, and MyMediaLite (disclaimer: I am the main author), which is implemented in C# but also has a Java port.
http://lenskit.grouplens.org/
http://ismll.de/mymedialite
https://github.com/jcnewell/MyMediaLiteJava
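For intuition, the core idea behind the user-based collaborative filtering these toolkits implement fits in a few lines; this is only a sketch of the principle (the users and ratings are invented), not how any of the libraries work internally:

```python
import math

def similarity(a, b):
    """Cosine similarity between two users' rating dicts (item -> rating)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def predict(target, others, item):
    """Similarity-weighted average of other users' ratings for an item."""
    num = den = 0.0
    for other in others:
        if item in other:
            s = similarity(target, other)
            num += s * other[item]
            den += s
    return num / den if den else None

alice = {"page1": 5.0, "page2": 1.0}
bob   = {"page1": 4.0, "page2": 1.0, "page3": 5.0}
carol = {"page1": 1.0, "page2": 5.0, "page3": 2.0}
# alice agrees with bob, so the prediction leans towards bob's rating of 5
print(predict(alice, [bob, carol], "page3"))
```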
This should be a good read: Google news personalization: scalable online collaborative filtering
It's focused on collaborative filtering rather than content-based recommendations, but it touches on some very interesting points like scalability, item churn, algorithms, system setup, and evaluation.
Mahout has very good collaborative filtering techniques, which match what you describe: using the behaviour of the users (clicks, reads, etc.). You could introduce some content-based logic using the rescorer classes.
You might also want to have a look at Myrrix, which is in some ways the evolution of the Taste (aka recommendations) portion of Mahout. It also allows applying content-based logic on top of collaborative filtering using the rescorer classes.
If you are interested in Mahout, the Mahout in Action book would be the best place to start.
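The rescorer idea mentioned above boils down to adjusting or filtering collaborative-filtering scores with content-based logic. Mahout's actual interfaces are in Java; this is only a language-agnostic sketch, and the item metadata fields are invented:

```python
def rescore(candidates, user_topics):
    """Boost CF scores for items whose topic matches the user's interests,
    and drop unavailable items. `candidates` is a list of
    (item_id, cf_score, metadata) tuples; the metadata fields are made up."""
    rescored = []
    for item_id, score, meta in candidates:
        if meta.get("unavailable"):
            continue                 # filtering: the other job of a rescorer
        if meta.get("topic") in user_topics:
            score *= 1.5             # content-based boost
        rescored.append((item_id, score))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

candidates = [
    ("a", 2.5, {"topic": "java"}),
    ("b", 3.0, {"topic": "cooking"}),
    ("c", 4.0, {"topic": "java", "unavailable": True}),
]
print(rescore(candidates, {"java"}))  # -> [('a', 3.75), ('b', 3.0)]
```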
Related
I've been doing some research on the feasibility of building a mobile/web app that lets users say a phrase and detects the user's accent (Boston, New York, Canadian, etc.). There will be about 5 to 10 predefined phrases that a user can say. I'm familiar with some of the speech-to-text APIs that are available (Nuance, Bing, Google, etc.), but none seem to offer this additional functionality. The closest examples I've found are Google Now and Microsoft's Speaker Recognition API:
http://www.androidauthority.com/google-now-accents-515684/
https://www.microsoft.com/cognitive-services/en-us/speaker-recognition-api
Because there are going to be 5-10 predefined phrases, I'm thinking of using machine-learning software like TensorFlow or Wekinator. I'd have initial audio recorded in each accent to use as the initial data. Before I dig deeper into this path, I just wanted to get some feedback on this approach, or on whether there are better approaches out there. Let me know if I need to clarify anything.
There is no public API for such a rare task.
Accent detection, like language detection, is commonly implemented with i-vectors. A tutorial is here. An implementation is available in Kaldi.
You need a significant amount of data to train the system, even if your sentences are fixed. It might be easier to collect accented speech without focusing on the specific sentences you have.
An end-to-end TensorFlow implementation is also possible, but it would probably require too much data, since you need to separate speaker-intrinsic characteristics from accent-intrinsic ones (basically performing the factorization that i-vectors do). You can find descriptions of similar works like this and this one.
You could use (this is just an idea; you will need to experiment a lot) a neural network with as many outputs as you have possible accents, a softmax output layer, and a cross-entropy cost function.
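To make that suggestion concrete: with, say, 5 accent classes, the final layer would produce 5 scores which softmax turns into a probability distribution, and cross-entropy penalizes low probability on the true class. A standard-library sketch of just those two pieces (the logits are made-up numbers, not real network outputs):

```python
import math

def softmax(logits):
    """Convert raw network outputs into a probability distribution."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    """Negative log-likelihood of the correct accent class."""
    return -math.log(probs[true_index])

logits = [2.0, 0.5, 0.1, -1.0, 0.0]      # one score per accent class
probs = softmax(logits)
print(sum(probs))                        # should be 1.0 up to float rounding
print(cross_entropy(probs, 0))           # small when the true class has high probability
```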
I am working on a text mining project which focuses on computer technology documents, so there are many jargon terms. Tasks like part-of-speech tagging require training data to build a POS tagger, and I think this training data should come from the same domain, with words like ".NET", "COM", and "JAVA" correctly tagged.
So where can I find such a corpus? Or is there a workaround? Or can an existing tagger be tuned to handle domain-specific tasks?
Gathering training data (and defining features) is going to be the hardest step of this problem. I'm sure there are datasets out there. But an alternative option for you would be to identify a few journals or news sites that focus on your area of interest and crawl them and pull down the text, perhaps validating each article you pull down by searching for keywords. I've done that before to develop a corpus focused on elections.
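The keyword-validation step described above can start very simply; this is only a sketch, and the keyword list is of course made up for illustration:

```python
def looks_on_topic(text, keywords, min_hits=2):
    """Keep a crawled article only if enough domain keywords appear in it."""
    text = text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits >= min_hits

keywords = [".net", "java", "compiler", "api"]   # hypothetical domain terms
article = "The new Java compiler exposes a cleaner API for plugins."
print(looks_on_topic(article, keywords))  # -> True
```

Substring matching is crude (it will match "java" inside "javascript"); word-boundary matching or a proper tokenizer would be the next refinement.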
Unfortunately, where you can find such a corpus is itself domain-specific.
Catch-22: there is no general source for specialized data,
just as there is no universal software to solve domain-specific problems.
I want to recommend items that are tagged and categorized into three price categories (cheap, regular, and expensive). I know that recommendation can be achieved with Mahout, but here's why I don't know how to use it.
Mahout is based on other users' opinions, but all of the new items that I want to recommend are exactly that: new items that don't have any preferences set yet.
Is Mahout the right tool for this? Is this content-based (which Mahout doesn't support yet?), or should I use classification?
Thanks!
Since I've never built a recommender system, do not take this answer very seriously (no one has answered yet, so I'll try).
A recommendation system has to be built on some already-known (or at least partially known) data. If you have only new (unseen) data, the only possibility is to use some clustering algorithm to build clusters.
If those clusters turn out well, they can be used for training a recommendation system.
Mahout is just a tool which implements various ML methods. You can use other tools like Weka, R, ...
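To illustrate the clustering route: a bare-bones k-means over item feature vectors, standard library only. The features and data points are invented for the example; Mahout, Weka, and R all ship much more robust implementations:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Naive k-means: returns a cluster label for each point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)      # pick k distinct points as initial centers
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: math.dist(p, centers[c]))
        # update step: each center moves to the mean of its members
        for c in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels

# toy items described by (price_level, tag_overlap) features
items = [(1.0, 0.9), (1.1, 1.0), (3.0, 0.1), (3.2, 0.2)]
print(kmeans(items, 2))  # the first two and last two items end up together
```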
If you have no data at all about a new user, there's really nothing you can do to make recommendations, no matter what you do. There is zero input that would differentiate the person from anyone else.
Good systems should however be able to do something reasonable after the first input is available.
This is not a classification problem by nature, no. Nor is it a clustering problem, other answers notwithstanding.
The price categories are not core to any rec process you would use. You have other data presumably, what is it? That's important.
Finally whether or not to use Mahout depends on taste. You would use it if you want to use Java and Hadoop. And in turn you would only consider Hadoop if you had very large input, and few people have that much data (like >10M data points at least).
(Well, not quite -- my recommender pieces in Mahout pre-date Hadoop and are for on-line, smaller-scale applications. You might indeed be interested in this, if you are working in Java.)
I'm working on a site that needs to present a set of options that have no particular order. I need to sort this list based on the customer who is viewing it. I thought of doing this by generating recommendation rules and sorting the list, putting the options best suited to the customer at the top. Furthermore, I think it would be cool if, when the confidence in a recommendation is high, I could tell the customer why I'm recommending it.
For example, let's say we have an ice cream joint with a website where customers can register and place orders online. The customer information contains basic info like gender, DOB, address, etc. My goal is to mine previous orders made by customers to generate rules of the format
feature -> flavor
where feature would be information either in the profile or in the order itself (for example, we might ask how many people you are expecting to serve, their ages, etc.).
I would then pull the rules that apply to the current customer and put the ones with higher confidence at the top of the list.
My question: what's the best standard algorithm to solve this? I have some experience with apriori, and initially I thought of using it, but since I'm interested in having only one consequent, I'm now thinking other alternatives might be better suited. In any case, I'm not that knowledgeable about machine learning, so I'd appreciate any help and references.
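For single-consequent rules like this, the confidence can be estimated by direct counting over past orders, without a full apriori pass; a sketch of what I have in mind (the order history and feature encoding are invented):

```python
def rule_confidence(orders, feature, flavor):
    """confidence(feature -> flavor) = P(flavor | feature),
    estimated from past orders by direct counting."""
    with_feature = [o for o in orders if feature in o["features"]]
    if not with_feature:
        return 0.0
    hits = sum(1 for o in with_feature if o["flavor"] == flavor)
    return hits / len(with_feature)

# invented order history: profile/order features plus the ordered flavor
orders = [
    {"features": {"gender=F", "party>4"}, "flavor": "chocolate"},
    {"features": {"gender=F"}, "flavor": "chocolate"},
    {"features": {"gender=F"}, "flavor": "vanilla"},
    {"features": {"gender=M", "party>4"}, "flavor": "vanilla"},
]
print(rule_confidence(orders, "gender=F", "chocolate"))  # -> 2/3
```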
This is a recommendation problem.
First, the apriori algorithm is no longer the state of the art in recommendation systems (a related discussion is here: Using the apriori algorithm for recommendations).
Check out Chapter 9, Recommendation Systems, of the book Mining of Massive Datasets below. It's a good tutorial to start with.
http://infolab.stanford.edu/~ullman/mmds.html
Basically you have two different approaches: Content-based and collaborative filtering. The latter can be done in terms of item-based or user-based approach. There are also methods to combine the approaches to get better recommendations.
Some further readings that might be useful:
A recent survey paper on recommendation systems:
http://arxiv.org/abs/1006.5278
Amazon item-to-item collaborative filtering: http://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf
Matrix factorization techniques: http://research.yahoo4.akadns.net/files/ieeecomputer.pdf
Netflix challenge: http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/
Google news personalization: http://videolectures.net/google_datar_gnp/
Some related stackoverflow topics:
How to create my own recommendation engine?
Where can I learn about recommendation systems?
How do I adapt my recommendation engine to cold starts?
Web page recommender system
I'm mainly just looking for a discussion of approaches on how to go from decentralized, non-normalized, completely open user-submitted tags to making sense of all of it by combining them into the semantic groups they called "clusters".
Does it take actual people to figure out what people actually mean by the tags used, or can it be done simply by automatically analyzing how often the tags go together?
That kind of stuff. Feel free to elaborate wildly :) (Also, if this has been discussed elsewhere, I'd love to hear about it).
Read this article: Automated Tag Clustering. It provides a good overview of the existing approaches and describes the algorithms for tag clustering.
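On the "how often the tags go together" question: the automatic route usually starts from a tag co-occurrence matrix, from which you can derive a similarity measure and then cluster. A minimal sketch with invented tagged items (Jaccard is just one of several reasonable similarity choices):

```python
from collections import Counter
from itertools import combinations

def cooccurrence(tagged_items):
    """Count how often each pair of tags is applied to the same item."""
    pairs = Counter()
    for tags in tagged_items:
        for a, b in combinations(sorted(set(tags)), 2):
            pairs[(a, b)] += 1
    return pairs

def jaccard(pairs, tag_counts, a, b):
    """Jaccard similarity between two tags based on co-tagging."""
    together = pairs[tuple(sorted((a, b)))]
    return together / (tag_counts[a] + tag_counts[b] - together)

items = [
    ["python", "programming"],
    ["python", "programming", "web"],
    ["cooking", "recipes"],
    ["cooking", "recipes", "baking"],
]
tag_counts = Counter(t for tags in items for t in tags)
pairs = cooccurrence(items)
print(jaccard(pairs, tag_counts, "python", "programming"))  # -> 1.0
print(jaccard(pairs, tag_counts, "python", "cooking"))      # -> 0.0
```

Human review still helps for naming the resulting clusters and catching synonyms ("js" vs "javascript") that never co-occur precisely because they are interchangeable.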
Algorithms of the Intelligent Web (Manning) (esp. Chapter 4) and a book with a similar title from O'Reilly cover clustering algorithms. The Manning book starts with naive SQL approaches and moves to K-means, ROCK, and DBSCAN. It's more generalized than just focusing on tags, but easy to apply in that context. Code is presented in Java but is easily adapted to Ruby (sometimes more easily than adapting the Java code to your problem).
Chapter 5 covers classification, which is about building topologies, and discusses Bayesian algorithms.