It is well known that, at the turn of the century, Google stood out from other search engines because of its revolutionary PageRank algorithm (even though, mathematically, PageRank is just a very simple application of the Perron-Frobenius theorem). However, very little information can be found on the web about the search algorithms used by pre-Google search engines.
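(To make that parenthetical concrete, here is a rough sketch of the power iteration PageRank boils down to, run on a made-up three-page link graph; the graph and numbers are purely illustrative.)

```java
// Minimal PageRank power iteration on a made-up 3-page link graph.
public class TinyPageRank {
    public static void main(String[] args) {
        // links[i] lists the pages that page i links to (illustrative only)
        int[][] links = { {1, 2}, {2}, {0} };
        int n = links.length;
        double d = 0.85;                 // damping factor
        double[] rank = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n);

        for (int iter = 0; iter < 50; iter++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, (1 - d) / n);
            for (int i = 0; i < n; i++) {
                for (int j : links[i]) {
                    next[j] += d * rank[i] / links[i].length;
                }
            }
            rank = next;                 // converges to the dominant (Perron) eigenvector
        }
        System.out.println(java.util.Arrays.toString(rank));
    }
}
```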
Does anybody know how big names like Lycos, AltaVista, Excite, Yahoo, or Ask Jeeves performed web search and page ranking? Is there any material that documents those algorithms in more detail?
For instance, I learned from this short history of early search engines that Excite did "use statistical analysis of word relationships to improve relevancy of searches on the Internet", while Yahoo maintained "a highly regarded directory of sites that were cataloged by human editors". That's a valuable piece of information, but the descriptions are still too vague.
There is a documentary about the history of the internet (up to 2008) called Download: The History of the Internet. I think it mentioned that Yahoo curated and categorized popular websites by hand.
I think it worked like an address book: websites needed to opt in and provide Yahoo with their web address and category.
I want to store a large number of data points for user actions, like likes, tags, etc. (I have plans for both e-commerce and document management).
With the data points, I want to support functions such as
"users who loved X loved Y,Z" recommendations
"fetch more stuff similar to X,Y" clustering.
By production-ready and real-time, I mean that I can enter data points and make queries at the same time; the server will take care of answering queries and updating scores by itself.
I searched around the interwebs, and the solutions that come up fall into one of two categories:
Data-mining libraries that are mostly academic-oriented and are meant for large batch operations, not for heavy real-time queries
Hadoop/Mahout, which is production-ready and supports real-time updates and queries, but has a steep learning curve and is tough to administer.
For recommenders, Mahout has a non-distributed recommender implementation that does not use Hadoop. In fact, this is the only part that is real-time; the Hadoop-based parts are not.
I think there is little learning curve to it; see here and here for a pretty complete writeup.
Mahout in Action chapters 2-5 cover this quite well too.
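To make that concrete, here is a minimal sketch of a non-distributed, user-based Taste recommender; the CSV file name, neighborhood size, and user ID are placeholders you would adapt to your own data.

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteExample {
    public static void main(String[] args) throws Exception {
        // CSV of userID,itemID,preference -- "prefs.csv" is a placeholder
        DataModel model = new FileDataModel(new File("prefs.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // "users who loved X loved Y, Z": top 3 items for user 1
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}
```

Since the whole thing runs in-process, you can keep the DataModel updated and answer queries from the same server, which matches the "real-time" requirement better than the Hadoop jobs do.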
Please understand that for useful recommendations, the various parameters of such a system must be carefully fine-tuned. The out-of-the-box functionality many systems have (Oracle Data Mining, Microsoft data mining extensions, etc.) offers just the core functionality.
So in the end, you will not get around the "steep learning curve", I guess. That is why you need experts for data mining. If there were a point-and-click solution, it would already be integrated everywhere.
Example "similar items". I laughed hard, when Amazon once recommended me to buy two products: Debian Linux Administrators Handbook and ... Debian Linux Admininstrators Handbook WITH CD.
I hope you get the key point of this example: to a plain algorithm, the two books appear "similar" and thus a sensible combination. To a human, it is pointless to buy the same book twice. You need to teach such rules to any recommendation system, as they cannot be trivially learned from the data. There will always be good results and useless results, and you need to tune and parameterize the system carefully.
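As one way to encode such a rule with the Mahout API mentioned above, you could pass a rescorer that filters unwanted items out of the results; the set of duplicate IDs here is a hypothetical placeholder that you would have to populate with your own duplicate-detection logic.

```java
import java.util.Set;
import org.apache.mahout.cf.taste.recommender.IDRescorer;

// Sketch: drop items we consider duplicates of something the user already owns.
// How duplicatesOfOwnedItems gets filled is up to you (title matching, ISBN rules, ...).
public class NoDuplicateRescorer implements IDRescorer {
    private final Set<Long> duplicatesOfOwnedItems;

    public NoDuplicateRescorer(Set<Long> duplicatesOfOwnedItems) {
        this.duplicatesOfOwnedItems = duplicatesOfOwnedItems;
    }

    @Override
    public double rescore(long itemID, double originalScore) {
        return originalScore;                              // leave scores untouched
    }

    @Override
    public boolean isFiltered(long itemID) {
        return duplicatesOfOwnedItems.contains(itemID);    // drop known duplicates
    }
}
// Used via recommender.recommend(userID, howMany, rescorer)
```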
A recent announcement by Google about the Google Prediction API sounded very interesting. It could be useful for a project that is coming up, and would probably do a better job than some custom code I was considering.
However, there is some vendor lock-in. Google retain the trained model, and could later choose to overcharge me for it. It occurred to me that there are probably open-source equivalents, if I was willing to host the training myself (I am) and live without their ability to throw hardware at the problem at a moment's notice.
The last time I looked at third-party machine-learning code was many years ago, and there were a lot of details that needed to be carefully considered and customised for your project. Google appear to have hidden those decisions and take care of them for you. To me, this is still indistinguishable from magic, but I would like to hear whether others can do the same.
So my question is:
What alternatives to Google Prediction API exist which:
categorise data with supervised machine learning,
can be easily configured (or don't need configuration) for different kinds and scales of data-sets, and
are open-source and self-hosted (or, at the very least, provide royalty-free use of your model, without a dependence on a third party)?
Maybe Apache Mahout?
PredictionIO is an open source machine learning server for software developers to create predictive features, such as personalization, recommendation and content discovery.
I have been looking recently at tools like the Google Prediction API; one of the first ones I was pointed to was the Weka machine learning toolkit, which could be worth checking out for anyone looking.
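For anyone who wants to try it, here is a minimal sketch of supervised classification with Weka's Java API; the ARFF file name is a placeholder and the choice of classifier is arbitrary.

```java
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaSketch {
    public static void main(String[] args) throws Exception {
        // "training.arff" is a placeholder; the last attribute is the class label
        Instances data = DataSource.read("training.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Classifier tree = new J48();          // C4.5 decision tree
        tree.buildClassifier(data);

        // 10-fold cross-validation to get an idea of accuracy
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new java.util.Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```

Since the trained model lives in your own JVM, there is no vendor lock-in: you can serialize it, ship it, or retrain it whenever you like.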
I'm not sure if it's relevant, but directededge seems to be doing exactly that :)
There is a good, free-to-use service, Yandex Predictor, with a quota of 100,000 requests per day. It works for text only, and supports several languages and spell correction.
You need to get a free API key; then you can use the simple RESTful API. The API supports JSON, XML, and JSONP as output.
Unfortunately I cannot find documentation in English. You can use Google Translate.
I can translate docs if there is some demand.
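Until then, here is a rough Java sketch of calling such a RESTful API; note that the endpoint URL and parameter names below are placeholders, not the real ones, so take the actual values from the official Yandex documentation.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class PredictorCall {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and parameters -- replace with the ones from the Yandex Predictor docs
        String apiKey = "YOUR_API_KEY";
        String query = URLEncoder.encode("how can I", "UTF-8");
        URL url = new URL("https://predictor.yandex.example/complete?key=" + apiKey
                + "&lang=en&q=" + query);

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // Read the JSON response as a plain string
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
            System.out.println(body);
        }
    }
}
```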
With millions of users searching for so many things on Google, Yahoo, and so on, how can the servers handle so many concurrent searches?
I have no clue as to how they made it so scalable.
Any insight into their architecture would be welcomed.
One element: DNS load balancing.
There are plenty of resources on Google's architecture; this site has a nice list:
http://highscalability.com/google-architecture
DNS Load Balancing is correct, but it is not really the full answer to the question. Google uses a multitude of techniques, including but not limited to the following:
DNS Load Balancing (suggested; see the small lookup sketch after this list)
Clustering - as suggested, but note the following
clustered databases (the database storage and retrieval is spread over many machines)
clustered web services (analogous to DNSLB here)
An internally developed clustered/distributed file system
Highly optimised search indices and algorithms, making storage efficient and retrieval fast across the cluster
Caching of requests (squid), responses (squid), databases (in memory, see shards in the above article)
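Here is the small DNS lookup sketch mentioned above: resolving a big site's hostname often returns several A records, and clients get spread across them (exactly what you see depends on your resolver and location).

```java
import java.net.InetAddress;

public class DnsLookup {
    public static void main(String[] args) throws Exception {
        // A name like www.google.com may resolve to multiple addresses;
        // requests from different clients end up on different front ends.
        InetAddress[] addresses = InetAddress.getAllByName("www.google.com");
        for (InetAddress address : addresses) {
            System.out.println(address.getHostAddress());
        }
    }
}
```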
I've gone searching for information about this topic recently, and Wikipedia's Google Platform article was the best all-around source of information on how Google does it. However, the High Scalability blog has outstanding articles on scalability nearly every day. Be sure to check out their Google architecture article too.
The primary concept in most highly scalable applications is clustering.
Some resources regarding the cluster architecture of different search engines.
http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/googlecluster-ieee.pdf
https://opencirrus.org/system/files/OpenCirrusHadoop2009.ppt
You can also read interesting research articles at Google Research and Yahoo Research.
I'm working on a large search engine system.
However, I'm not familiar with the background.
Where can I find materials about indexing and page ranking?
You can always look at the Google Research publications. They are naturally very dense, but interesting nonetheless.
Modern Information Retrieval
A well-known and good book that will introduce you to these concepts.
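If you just want a feel for the core idea behind indexing, here is a toy inverted-index sketch; it has nothing to do with any particular engine's real implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: maps each term to the set of document IDs containing it.
public class InvertedIndex {
    private final Map<String, Set<Integer>> index = new HashMap<>();
    private final List<String> documents = new ArrayList<>();

    public void add(String document) {
        int docId = documents.size();
        documents.add(document);
        for (String term : document.toLowerCase().split("\\W+")) {
            index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    public Set<Integer> search(String term) {
        return index.getOrDefault(term.toLowerCase(), Set.of());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add("PageRank ranks web pages");
        idx.add("Inverted indexes make retrieval fast");
        System.out.println(idx.search("pages"));      // [0]
        System.out.println(idx.search("retrieval"));  // [1]
    }
}
```

Real engines layer a lot on top of this (term weighting, link analysis, sharding the index across machines), which is what the research papers above describe.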
Facebook, Skype, MySpace, etc. all have millions and millions of users. Does anyone know what their architectures look like? Are they distributed across different nodes, or do they use massive clusters?
Check the link below to read how big applications like Amazon, eBay, Flickr, Google, etc. live with high traffic.
http://highscalability.com/links/weblink/24
Interesting website for architects.
(I blogged about this earlier after a research for a BIG project - http://blog.ninethsense.com/high-scalability-how-big-applications-works/)
Memcached is used by a lot of sites with a lot of users, including Facebook. You can find lots of blogs that discuss the architecture of various high-traffic websites. A long time ago I listened to this ARCast, which I thought was quite interesting (if you do ASP.NET with SQL Server).
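As a tiny illustration of that caching pattern (using the spymemcached Java client as one example, and assuming a memcached server running on localhost):

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class CacheSketch {
    public static void main(String[] args) throws Exception {
        // Assumes a memcached server running locally on the default port 11211
        MemcachedClient cache = new MemcachedClient(new InetSocketAddress("localhost", 11211));

        String key = "user:42:profile";
        Object cached = cache.get(key);
        if (cached == null) {
            // Cache miss: pretend we loaded this from the database
            cached = "profile data for user 42";
            cache.set(key, 3600, cached);   // keep it for an hour
        }
        System.out.println(cached);
        cache.shutdown();
    }
}
```

The point is that most reads never hit the database at all, which is a big part of how these sites serve so many users.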
Facebook uses Hadoop/Hive and Erlang, among other things (see http://www.facebook.com/notes.php?id=9445547199).
This may not directly answer your question, but it is interesting to watch nonetheless. See the Facebook software stack.