Mahout recommendation on implicit data with multiple factors - mahout

Could you please provide me some details on Mahout recommendations using data with multiple factors? I have data with user id, book, language, category, etc. Suppose a person reads a book in the thriller category, in the French language. Now, considering all those facts, I need to recommend a book to him. Could you please give me some insight on picking the right path?

This is just the thing for Mahout 1.0, where we create models for a search engine to index and query.
The models are called indicators and are lists of similar items for each item, similar in the sense that they were purchased by the same people. This is the essence of a cooccurrence recommender.
The collaborative filtering data is the book read, or its ID. If you recommend a book you can show other IDs with the same title for multiple formats (ebook, audio, paperback, etc.). The metadata can be used to skew recs toward a certain category. The language is probably a filter, unless you think your audience is usually multilingual.
Create the CF-type indicator by feeding purchases into Mahout 1.0's spark-itemsimilarity. Out will come a list of similar books for each book. Index those in a search engine. Then the simplest query is the user's history of books purchased. This will yield unskewed recommendations as an ordered list of books.
Now, to skew results toward the user's most favored category, index the categories for each item in a separate field in the index. So the index has a field for "indicators" and one for "categories". The "docs" are really the items/books in your catalog. The skewed query is (pseudo-code):
query:
field: indicators; q: "book1 book2 book3 book10" //the user's purchase history
field: categories; q: "user's-favorite-category user's-second-favorite-category"
field: language; filter: "list-of-languages-of-books-the-user-has-purchased"
You can put as many categories in the query on that field as you wish, perhaps all the categories the user has purchased from. Note the use of a language filter; you may prefer to use language as a skewing factor rather than a hard filter. In this way you can seamlessly integrate collaborative filtering recs, skewed or filtered by metadata, to get higher-quality recs. Any metadata that you think will help can be used.
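For illustration, here is a minimal sketch of that skewed query against Elasticsearch using the Python client (the index name "books", the field names, the boost value, and the pre-8.x client calls are all assumptions for this example; the same query shape maps onto Solr):

# Minimal sketch of the skewed query with the elasticsearch Python client.
# Index/field names and boost values are illustrative assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

user_history = "book1 book2 book3 book10"    # the user's purchase history
favorite_categories = "thriller mystery"     # the user's top categories
user_languages = ["french"]                  # languages the user has purchased in

query = {
    "query": {
        "bool": {
            "should": [
                # Collaborative-filtering match against the indicator field.
                {"match": {"indicators": {"query": user_history}}},
                # Metadata skew: boost items in the user's favorite categories.
                {"match": {"categories": {"query": favorite_categories, "boost": 2}}},
            ],
            # Hard language filter; drop it (or move it into "should")
            # to skew by language instead of excluding.
            "filter": [{"terms": {"language": user_languages}}],
        }
    }
}

results = es.search(index="books", body=query)
for hit in results["hits"]["hits"]:
    print(hit["_id"], hit["_score"])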
BTW, you will get even better recs if you add in other actions you have recorded, like views of book details. This calls for a specially processed indicator called a cross-cooccurrence indicator, which is also calculated by spark-itemsimilarity. In fact you can include just about any action the user takes--the entire clickstream--as separate cross-cooccurrence indicators. This will tend to greatly increase the amount of collaborative filtering data you can use in making recs and therefore improve quality.
This idea can even be extended to actions on items that are not books, like categories. If a user purchases a book they also, in a sense, purchase a category. If you record these "category purchases" as a secondary action and create a cross-cooccurrence indicator with them you can use them both to skew results and as a purchase indicator. The query would look like this:
query:
field: indicators; q: "book1 book2 book3 book10" //the user's purchase history
field: category-indicators; q: "user's-history-of-purchased-categories"
field: categories; q: "user's-favorite-category user's-second-favorite-category"
field: language; filter: "list-of-languages-of-books-the-user-has-purchased"
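Continuing the earlier Python sketch, the extra cross-cooccurrence field is just one more clause in the should list (the "category-indicators" field name and the example categories are assumptions):

# Add a clause that matches the user's purchased-category history against the
# category cross-cooccurrence indicator field.
query["query"]["bool"]["should"].append(
    {"match": {"category-indicators": {"query": "thriller crime history"}}}
)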
Read about spark-itemsimilarity at http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html, which includes some discussion of how to use a search engine (Solr, Elasticsearch) for the index and query part.
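To give a feel for the input side, here is a sketch of preparing (user, action, item) rows for spark-itemsimilarity; the raw log format below is invented for the example, and the exact command-line options for selecting the primary ("purchase") and secondary ("view") actions are described on the page linked above:

# Sketch: split a raw interaction log into (user, action, item) rows that
# spark-itemsimilarity can consume. "purchase" becomes the primary indicator,
# "view" a cross-cooccurrence indicator. The tab-separated log is an invented
# example format.
import csv

with open("interactions.log") as src, open("actions.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for line in src:
        user_id, action, book_id = line.strip().split("\t")
        if action in ("purchase", "view"):
            writer.writerow([user_id, action, book_id])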

Related

Solr join across multiple collections and fetch data from both collections

I have two Solr collections:
Ads (id, title, body, description, etc.)
AdPlacement (ad_id, placement_id, price)
Each Ad can have 500-1000 placements, with different prices.
The search use case is: I have a placement and a search keyword, and I want to find the Ads that match the keyword in the title/body/description fields, sorted by the price in the AdPlacement collection for the given placement. We would like to get the Ad details and the price in the output returned.
Is there any way to achieve this in Solr using a join across multiple collections? What I have read so far says you can only get data from one collection and use the other one just for filtering.
Solr is a document database and supports nested documents, so ideally you would model this so that your ad placement records are part of the Ad document. This would be the better way to handle your scenario. Please go through this blog, Solr Nested Objects, and the relevant Solr documentation.
If modifying the document structure is not an option, then consider this documentation, which describes the limited join support between collections.
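For the nested-document route, here is a minimal sketch of indexing an Ad with its placements as child documents and querying with a block join (the Solr URL, collection name "ads", and field names are assumptions; sorting by the child price would need additional work on top of this):

# Sketch: index an Ad with its AdPlacement rows as nested child documents,
# then query with a block-join parent query.
import requests

SOLR = "http://localhost:8983/solr/ads"

ad = {
    "id": "ad-1",
    "doc_type": "ad",
    "title": "Vintage camera",
    "body": "Barely used, great condition",
    "_childDocuments_": [
        {"id": "ad-1-p-7",  "doc_type": "placement", "placement_id": "7",  "price": 1.20},
        {"id": "ad-1-p-42", "doc_type": "placement", "placement_id": "42", "price": 0.80},
    ],
}
requests.post(f"{SOLR}/update?commit=true", json=[ad])

# Find ads matching the keyword that have a placement with placement_id 42,
# returning the matching children alongside each parent.
params = {
    "q": "title:camera OR body:camera",
    "fq": "{!parent which='doc_type:ad'}placement_id:42",
    "fl": "*,[child parentFilter='doc_type:ad']",
}
print(requests.get(f"{SOLR}/select", params=params).json())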

Indexing criteria for elasticsearch

I am working with the Twitter streaming API and am a little confused about deciding the criteria for indexing the data. Right now I have a single index that contains all the tweets in one doc_type and users in another doc_type.
Is this the best way to go about storing them, or should I create a new doc_type for every category (a category can be decided on the basis of hashtag and tweet content)?
What should be the best approach to storing such data?
Thanks in advance.
First of all, the answer to your question is that this very much depends on your use case. What is your application doing? What do you do with the tweets? How many categories do you plan to have?
In general, however, I'd go for a solution where you use the same index and the same doc_type for all tweets. This allows you to build queries and aggregations over all your tweets without thinking about the different categories. It also allows you to add new categories easily without having to change your queries.
If you want to do some classification of the tweets, you could add a category field to the tweet document stored in Elasticsearch. You can then use this category field to implement your specific application logic.
If your category names have spaces or punctuation marks, don't forget to define the category field as not_analyzed. Otherwise it will be broken up into parts.
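For example, a minimal mapping sketch for that category field (index and field names are assumptions; not_analyzed applies to the pre-5.x string type discussed here, while newer versions would use a keyword field instead):

# Sketch: create the tweets index with a not_analyzed category field so that
# multi-word category names are kept intact as single terms.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="tweets",
    body={
        "mappings": {
            "tweet": {
                "properties": {
                    "text":     {"type": "string"},
                    "category": {"type": "string", "index": "not_analyzed"},
                }
            }
        }
    },
)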

Rails advice on planning data structure

I am building an inventory tracking tool to help people track either unique items (one-offs, say a vintage T-shirt) or groups of items (a T-shirt design where I have a quantity). The data structures will be very similar, something like this:
**Item**
Title
Status (sold, for sale) <- right now this is a simple array
Location <- this is a relationship to a diff model
etc...
**Item Group**
Title
Quantity
Status ([quantity] sold, [quantity] for sale) <- this should be an hstore??
Locations ([quantity] location1, [quantity] location2) <- not sure about this yet!
etc...
I'm expecting to use different forms to gather this information, as too much complexity on the form to accommodate these differences will add difficulty for my user group.
So my questions are as follows:
What is the best data solution for this? Do I want to have two models/controllers or try to extend the Item model? How do people usually handle this sort of issue?
I do have the requirement that I need to show the user all of their inventory (items and groups) at once, but this seems the smaller task to me.
Reduce your headaches and don't differentiate between unique items and non-unique (i.e., all items have a quantity).
Then you want a "purchase" model, and an "item_purchase" model to act as a join table.
Following the layout here: guides.rubyonrails.org...
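Expressed as plain SQL rather than Rails migrations (a sketch only; table and column names are assumptions), that layout is roughly:

# Sketch of the suggested layout: every item has a quantity, and purchases are
# linked to items through an item_purchases join table. Plain SQL via sqlite3;
# in Rails this would be three models joined with has_many :through.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    quantity INTEGER NOT NULL DEFAULT 1,   -- a one-off is simply quantity 1
    location_id INTEGER
);

CREATE TABLE purchases (
    id INTEGER PRIMARY KEY,
    purchased_at TEXT NOT NULL
);

CREATE TABLE item_purchases (              -- the join table
    item_id INTEGER NOT NULL REFERENCES items(id),
    purchase_id INTEGER NOT NULL REFERENCES purchases(id),
    quantity INTEGER NOT NULL DEFAULT 1
);
""")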

Single Inheritance or Polymorphic?

I'm programming a website that allows users to post classified ads with detailed fields for different types of items they are selling. However, I have a question about the best database schema.
The site features many categories (eg. Cars, Computers, Cameras) and each category of ads have their own distinct fields. For example, Cars have attributes such as number of doors, make, model, and horsepower while Computers have attributes such as CPU, RAM, Motherboard Model, etc.
Now since they are all listings, I was thinking of a polymorphic approach, creating a parent LISTINGS table and a different child table for each of the different categories (COMPUTERS, CARS, CAMERAS). Each child table will have a listing_id that will link back to the LISTINGS TABLE. So when a listing is fetched, it would fetch a row from LISTINGS joined by the linked row in the associated child table.
LISTINGS
-listing_id
-user_id
-email_address
-date_created
-description
CARS
-car_id
-listing_id
-make
-model
-num_doors
-horsepower
COMPUTERS
-computer_id
-listing_id
-cpu
-ram
-motherboard_model
Now, is this schema a good design pattern or are there better ways to do this?
I considered single inheritance but quickly brushed off the thought because the table would get too large too quickly. But then another dilemma came to mind: if the user does a global search on all the listings, that means I will have to query each child table separately. What happens if I have over 100 different categories; wouldn't that be inefficient?
I also thought of another approach where there is a master table (meta table) that defines the fields in each category and a field table that stores the field values of each listing, but would that go against database normalization?
How would sites like Kijiji do it?
Your database design is fine; there is no reason to change what you've got. I've seen the search done a few ways. One is to have your search stored procedure join all the tables you need to search across, and index the columns to be searched. The second way I've seen it done, which worked pretty well, was to have a table that is used only for search and gets a copy of whatever fields need to be searched. Then you put triggers on those fields and update the search table.
They both have drawbacks but I preferred the first to the second.
EDIT
You need the following tables.
Categories
- Id
- Description
CategoriesListingsXref
- CategoryId
- ListingId
With this cross-reference model you can join all your listings for a given category during search. Then add a little dynamic SQL (because it's easier to understand), build up your query to include the field(s) you want to search against, and execute it.
That's it.
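A minimal sketch of that dynamic-SQL idea (Python with sqlite3; the per-category mapping of child table and searchable columns is an illustrative assumption, and in practice it could live alongside the Categories table):

# Sketch of the dynamic-SQL search over the cross-reference model above.
import sqlite3

CATEGORY_FIELDS = {
    "Cars":      ("CARS",      ["make", "model"]),
    "Computers": ("COMPUTERS", ["cpu", "motherboard_model"]),
}

def search_category(conn, category_name, term):
    table, columns = CATEGORY_FIELDS[category_name]
    field_match = " OR ".join(f"{table}.{col} LIKE ?" for col in columns)
    sql = (
        f"SELECT LISTINGS.*, {table}.* "
        "FROM Categories "
        "JOIN CategoriesListingsXref ON CategoriesListingsXref.CategoryId = Categories.Id "
        "JOIN LISTINGS ON LISTINGS.listing_id = CategoriesListingsXref.ListingId "
        f"JOIN {table} ON {table}.listing_id = LISTINGS.listing_id "
        f"WHERE Categories.Description = ? AND ({field_match})"
    )
    params = [category_name] + [f"%{term}%"] * len(columns)
    return conn.execute(sql, params).fetchall()

# Example (assuming the tables above exist in classifieds.db):
# conn = sqlite3.connect("classifieds.db")
# rows = search_category(conn, "Cars", "mustang")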
EDIT 2
This seems to be a bigger discussion than we can fit in these comment boxes. But anything we would discuss can be understood by reading the following post.
http://www.sommarskog.se/dyn-search-2008.html
It is really complete and shows you more than one way of doing it, with pros and cons.
Good luck.
I think the design you have chosen will be good for the scenario you just described, though I'm not sure the subclass tables should have their own IDs. Since a CAR is a Listing, it makes sense that the IDs come from the same "domain".
In the typical classified ads site, the data for an ad is written once and then is basically read-only. You can exploit this and store the data in a second set of tables that are more optimized for searching in just the way you want the users to search. Also, the search problem only really exists for a "general" search. Once the user picks a certain type of ad, you can switch to the sub class tables in order to do more advanced search (RAM > 4gb, cpu = overpowered).

Clubbing reprints of a book with different ISBNs

I am using the Google Books API to let a user search for a particular book and display the results. The problem is that different editions of the same book have different ISBNs. Is there any way to club these reprints together, based on the information the API returns?
I want to do this because I have the ISBNs of some of the editions in my database. So when the user searches for a book, I would like to club all the results together and display them as one result.
I'm not familiar with this use of the word "club", but it appears that you want to group different editions of the same book even though they have different ISBNs. I don't know how to do this solely with Google Books, but you can use the wonderful xISBN web service to look up alternate ISBNs for a book.
Hit a URL like
http://xisbn.worldcat.org/webservices/xid/isbn/0596002815
to get this response:
<rsp stat="ok">
<isbn>0596002815</isbn>
<isbn>1565928938</isbn>
<isbn>1565924649</isbn>
<isbn>0596158068</isbn>
<isbn>0596513984</isbn>
<isbn>1600330215</isbn>
<isbn>8371975961</isbn>
<isbn>059680539X</isbn>
<isbn>8324616489</isbn>
</rsp>
The response lists the original ISBN first, followed by all alternates known to WorldCat. You can then use the alternates for grouping.
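As a small sketch of that grouping step (assuming the service responds with XML like the example above):

# Sketch: look up alternate ISBNs via xISBN and derive a stable grouping key,
# so different editions of the same book collapse into one result.
import requests
import xml.etree.ElementTree as ET

def alternate_isbns(isbn):
    url = f"http://xisbn.worldcat.org/webservices/xid/isbn/{isbn}"
    root = ET.fromstring(requests.get(url).content)
    # The response lists the original ISBN first, then the known alternates.
    return [el.text for el in root.iter() if el.tag.endswith("isbn")]

def group_key(isbn):
    # Use the smallest ISBN in the cluster as the grouping key.
    return min(alternate_isbns(isbn) or [isbn])

print(alternate_isbns("0596002815"))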
