Search log dataset - search-engine

I'm interested in training a query recommendation system on top of search queries, but it looks like such data is not publicly available. Have any research centers or industry labs compiled corpora of search-engine queries?
Are there any search logs from search engines such as Google, Yahoo, or Bing?
Thanks

Related

Resources for Promotion/Demotion Strategies for ML Item Recommendation Systems?

We are looking to design a system where specific items or categories of items can be boosted/promoted up or relegated/demoted down the recommendation order.
What are the common strategies or standards for doing this?
A cursory Google search did not yield anything very useful.
Though this seems like a common problem in e-commerce.
We are looking into Amazon Personalize on AWS as one option.
What is this area called in the literature? Is there a standard name used in the field or industry?
Are there introductory or survey papers?

Find commonly joined queries in Redshift

I want to get a list of the most frequently joined tables in our Redshift. Ideally with the join conditions. Reason: we're adding sortkeys and distkeys, and trying to be relatively thorough (sidenote: if you have any good tips for optimizing query runtimes, I'm eager to hear).
I know I can query STL_QUERY to get the query text, runtimes, etc. But aside from doing some manual text analysis, is there any way to see which tables are joined, by query id?
As far as I know, there is no STL table in Redshift that readily gives this information. As you mentioned, you would need to go through the queries in the STL_QUERYTEXT table and search for joins.
In terms of general performance tuning, I would suggest you look at Periscope's blog if you haven't already. And there is this.
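If you do end up doing the text analysis yourself, a rough sketch in Python might look like the following. It assumes you have already pulled the query texts out of Redshift and reassembled them into one string per query. The function name count_joins and the two regular expressions are made up for illustration. They only catch plain FROM/JOIN table names and simple equality conditions, so subqueries, CTEs, and things like EXTRACT(... FROM ...) will be missed or miscounted.

```python
import re
from collections import Counter
from itertools import combinations

# Table name that follows FROM or JOIN (schema-qualified names allowed).
TABLE_RE = re.compile(r"\b(?:from|join)\s+([a-z_][\w.]*)", re.IGNORECASE)
# Simple equality join condition such as "a.id = b.user_id".
ON_RE = re.compile(r"\bon\s+([\w.]+\s*=\s*[\w.]+)", re.IGNORECASE)


def count_joins(queries):
    """Count co-occurring table pairs and ON conditions across SQL strings."""
    pair_counts = Counter()
    cond_counts = Counter()
    for sql in queries:
        # Skip queries that contain no JOIN keyword at all.
        if not re.search(r"\bjoin\b", sql, re.IGNORECASE):
            continue
        tables = sorted({t.lower() for t in TABLE_RE.findall(sql)})
        for pair in combinations(tables, 2):
            pair_counts[pair] += 1
        for cond in ON_RE.findall(sql):
            # Normalise case and whitespace so identical conditions group together.
            cond_counts[re.sub(r"\s+", " ", cond.lower())] += 1
    return pair_counts, cond_counts


# Hypothetical query texts, for illustration only.
queries = [
    "SELECT * FROM orders o JOIN users u ON o.user_id = u.id",
    "select 1 from orders join users on orders.user_id = users.id",
    "SELECT * FROM users",
]
pairs, conditions = count_joins(queries)
print(pairs.most_common(5))
print(conditions.most_common(5))
```

Note that the conditions are counted as written, so aliases (o.user_id vs. orders.user_id) end up as separate entries. Resolving aliases back to table names would need a real SQL parser.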

Recommender System: Is it content-based filtering?

Can someone please help me clarify this?
I am currently using collaborative filtering (ALS), which returns a recommendation list with scores for the recommended items. In addition, I boost the score (+0.1) if an item has a tag that matches a preference the user has specified, such as "romantic movies". To me, this is a hybrid approach, since it boosts the collaborative filtering results with content-based filtering (please correct me if I am wrong).
Now, what if I took the same approach without collaborative filtering? Would it be considered content-based filtering, since I would still be recommending dishes based on the content and attributes of each dish that match what the user has said they like (such as "romantic movies")?
The reason I'm confused is that I've seen content-based filtering that applies an algorithm such as Naive Bayes, whereas this approach would be closer to a simple search over the items' contents.
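For concreteness, the boosting step described in the question could be sketched roughly like this in Python. The function name boost_scores, the item ids, and the tag data are all invented for illustration. The only things taken from the question are the +0.1 boost and the idea of matching item tags against user-specified preferences.

```python
def boost_scores(cf_scores, item_tags, preferred_tags, boost=0.1):
    """Re-rank CF results after boosting items that carry a preferred tag."""
    boosted = {}
    for item, score in cf_scores.items():
        # An item qualifies if it shares at least one tag with the preferences.
        has_match = bool(item_tags.get(item, set()) & preferred_tags)
        boosted[item] = score + boost if has_match else score
    # Highest score first.
    return sorted(boosted.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical ALS output and tag data.
cf_scores = {"movie_a": 0.82, "movie_b": 0.78, "movie_c": 0.40}
item_tags = {
    "movie_a": {"action"},
    "movie_b": {"romantic"},
    "movie_c": {"romantic", "comedy"},
}
ranked = boost_scores(cf_scores, item_tags, {"romantic"})
print(ranked)
```

With an empty cf_scores the function returns an empty list, because the boost only adjusts scores that already exist.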
Not sure you can do what you suggest because you have no score to boost without CF.
You are indeed using a hybrid, much the same as the Universal Recommender. To do purely content-based recommendations you have to implement two methods:
Personalized recommendations: here you have to look at the content of items the user preferred and find items that have similar content. This can be done by using something like the Mahout spark-rowsimilarity job to create a model of item: list-of-similar-items, then indexing the results with a search engine and using the user's preferred item ids as the query. This is being added to the Universal Recommender.
"People who liked this also liked these": these are items similar to the one being viewed, for example, and are the same for all users. They are not personalized and so are useful even for anonymous users with no history. This can be done with the same indexed ids as above, but using the items similar to the one being viewed as the query. One might think to use only the similar items themselves, but by using them as a query you can put the categorical boost in the search engine query and have boosted items returned. This already works in the Universal Recommender, but the similar items are not in the model yet.
That said, mixing content with collaborative filtering will almost surely give better results, since CF works better when the data is available. The only time to rely on content-based recommendations alone is when your catalog consists of one-off items that never get enough CF interactions, or you have rich content with a short lifetime, like breaking news.
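To make the personalized content-based method a bit more concrete, here is a toy sketch in plain Python. It is not how Mahout or the Universal Recommender do it: Jaccard similarity over tag sets stands in for the spark-rowsimilarity job, and a loop over a dict stands in for the search engine query. All names and data are made up.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_similar_items(item_tags, top_n=2):
    """Build the 'item: list-of-similar-items' model from item tag sets."""
    model = {}
    for item, tags in item_tags.items():
        scored = [
            (jaccard(tags, other_tags), other)
            for other, other_tags in item_tags.items()
            if other != item
        ]
        # Keep only items with some overlap; best first, ties broken by id.
        scored = [(s, o) for s, o in scored if s > 0]
        scored.sort(key=lambda so: (-so[0], so[1]))
        model[item] = [o for _, o in scored[:top_n]]
    return model


def recommend(model, liked_items, top_n=3):
    """Use the liked item ids as a 'query' against each item's similar list."""
    liked = set(liked_items)
    scored = []
    for item, similar in model.items():
        hits = len(liked & set(similar))
        if hits and item not in liked:
            scored.append((hits, item))
    scored.sort(key=lambda hi: (-hi[0], hi[1]))
    return [item for _, item in scored[:top_n]]


# Hypothetical catalog.
item_tags = {
    "m1": {"romantic", "drama"},
    "m2": {"romantic", "comedy"},
    "m3": {"action", "thriller"},
    "m4": {"action", "comedy"},
}
model = build_similar_items(item_tags)
print(recommend(model, ["m2"]))
```

In a real setup the model would be indexed in a search engine, which is also where a categorical boost could be added to the query.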
BTW, anyone who wants to help add the pure content-based part to the Universal Recommender can contact its new maintainers at ActionML.com.

Mahout Recommender - questions to setup user preference

I'm looking for some advice / guidance.
I'm working on a recommendation engine / personal assistant app, using Mahout as the framework.
What I want is for new users of the app to begin by answering 5 questions, and to use their answers to affect the recommendations, essentially feeding the answers in as user preferences.
I'm just not sure how to incorporate this into my code, and I'm not even sure where to begin looking. I've been Googling, but none of the search results really address this.
Any suggestions / advice / guidance will be greatly appreciated.
Thanks
I did just that with the new Spark Itemsimilarity implementation about a year ago. You'll need a search engine for the recommendations query because Mahout doesn't have a server. I'd suggest using the new "Universal Recommender" engine template with PredictionIO. It uses Mahout to calculate the model and Elasticsearch to serve it.
https://templates.prediction.io/PredictionIO/template-scala-parallel-universal-recommendation
PredictionIO is a framework of integrated components that provides an event server (for event storage), integration with Hadoop/HDFS, Spark, and HBase, and a REST or SDK API. All you do is install it and get the template as a plugin engine. This will provide pretty advanced recommendation queries with multiple event ingestion, a hybrid content-based method to tune results, and several methods of using popular items for backfill when no other recommendations can be made. It also uses realtime user actions for recommendations.
This last bit is super important if you want to have your users go through some training. This way they will see the benefit of training in realtime. Check this site, where I did exactly what you are talking about: https://guide.finderbots.com Notice the "Trainer". It presents you with movies and asks for a thumbs up or down on as many as you care to rate; then, when you ask for recommendations, they will be based on the realtime preferences of the user. You need to create an account first so we have a user-id.
The way I created the list for the trainer is by clustering popular items. By clustering I mean grouping items based on the users that preferred them. Clustering produces items that are differentiated because they belong to different clusters, which means different user-sets tended to like them, and the popular ones are more likely to be known by users when they go through training. These are good things to have in a trainer.
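As a very rough illustration of the idea behind the trainer list (popular items liked by different sets of users), here is a small Python sketch. It is not the clustering used on the site above. It is a greedy stand-in that walks items from most to least popular and keeps one only if its user set overlaps little with the items already chosen. The function name, the max_overlap threshold, and the data are all assumptions.

```python
def pick_trainer_items(item_users, k=3, max_overlap=0.5):
    """Greedily pick up to k popular items with mostly distinct user sets."""

    def overlap(a, b):
        # Jaccard overlap between two sets of user ids.
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    # Most popular first; ties broken by item id for determinism.
    by_popularity = sorted(item_users, key=lambda i: (-len(item_users[i]), i))
    chosen = []
    for item in by_popularity:
        users = item_users[item]
        if all(overlap(users, item_users[c]) <= max_overlap for c in chosen):
            chosen.append(item)
        if len(chosen) == k:
            break
    return chosen


# Hypothetical "item -> users who preferred it" data.
item_users = {
    "a": {1, 2, 3, 4},
    "b": {1, 2, 3},
    "c": {5, 6, 7},
    "d": {8, 9},
    "e": {5, 6},
}
print(pick_trainer_items(item_users))
```

A proper clustering of items by their user vectors, followed by taking the most popular item or two from each cluster, would be closer to what is described above.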

I want to get related searches or keywords

How can I use PHP to categorise related keywords together, for example to put shoes, boots, nike, etc. in the same category?
Any code would be appreciated.
(Warning: a fundamentalist approach) Look at MIT Reality Commons [0] and OpenCyc [1], [2]. These are two open databases of common sense. Run several searches for the categories you're interested in. You'll get some related terms for each category. Put them in a fast database of your liking, and you're set.
Also, various SEO people like to create clouds of related keywords in the meta tags of relevant pages. Take a look at the source of several such pages, then extract and filter the keywords.
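The meta-tag idea can be sketched in a few lines. The example below is in Python rather than PHP, purely for illustration, and uses only the standard library. The class and function names are made up. It assumes you have already downloaded the pages' HTML as strings, and it treats keywords that appear alongside the seed term as related.

```python
from collections import Counter
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects the comma-separated terms from <meta name="keywords"> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            self.keywords.extend(
                k.strip().lower() for k in content.split(",") if k.strip()
            )


def related_keywords(pages, seed):
    """Count keywords that co-occur with `seed` in the pages' meta tags."""
    counts = Counter()
    for html in pages:
        parser = MetaKeywordParser()
        parser.feed(html)
        found = set(parser.keywords)
        if seed in found:
            counts.update(found - {seed})
    return counts


# Hypothetical page sources.
pages = [
    '<html><head><meta name="keywords" content="shoes, boots, Nike"></head></html>',
    '<meta name="Keywords" content="shoes,sandals,boots" />',
    '<meta name="keywords" content="laptops, tablets">',
]
print(related_keywords(pages, "shoes").most_common())
```

The counts can then be filtered (for example, keep only terms seen on more than one page) and stored in whatever database you use.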

Resources