Display "movie name" instead of "movie id" as recommendation from Apache Mahout

I am developing a simple movie recommender system using Apache Mahout, following a short video here: https://www.youtube.com/watch?v=yD40rVKUwPI. The code for the recommender is:
public class App
{
    public static List<RecommendedItem> getRecommend(int k) throws Exception
    {
        ClassLoader classLoader = App.class.getClassLoader();
        DataModel model = new FileDataModel(new File(classLoader.getResource("data/dataset.csv").getFile()));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, model);
        UserBasedRecommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> recommendations = recommender.recommend(k, 3);
        return recommendations;
    }
}
This generates recommendations in the form of movie IDs. What I want is to display movie names instead. The dataset I am using (which produces the IDs) has the following columns in CSV form:
user_id movie_id rating
However, the MovieLens dataset has two files, one with the fields
user_id movie_id rating
and a second with
movie_id movie_name
How can I use these files to get movie names instead of IDs? Is it possible with the DataModel class, or is there some other way?
I want recommendations as
movie_name value
instead of the present
movie_id value

You likely cannot with Mahout alone. You will need to load the movie title CSV file using a CSV reader, or import it into a database, and map movie IDs back to names yourself.
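The lookup described above is independent of Mahout, so here is a minimal sketch of it in Python. It assumes the titles file has no header row and holds comma-separated `movie_id,movie_name` rows; the sample titles, the (movie_id, value) pairs, and the helper names `load_movie_names` and `with_names` are all hypothetical. In the Java code from the question, the same idea would be a map from ID to name, consulted with each RecommendedItem's getItemID() and getValue().

```python
import csv
import io


def load_movie_names(csv_file):
    """Build a {movie_id: movie_name} lookup from rows of movie_id,movie_name."""
    names = {}
    for row in csv.reader(csv_file):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        names[int(row[0])] = row[1]
    return names


def with_names(recommendations, names):
    """Swap each (movie_id, value) pair for (movie_name, value).

    IDs missing from the lookup fall back to the ID rendered as text.
    """
    return [(names.get(movie_id, str(movie_id)), value)
            for movie_id, value in recommendations]


# Hypothetical sample data standing in for the titles file and the recommender output.
sample_titles = io.StringIO("1,Toy Story\n2,Jumanji\n3,Heat\n")
names = load_movie_names(sample_titles)
recommendations = [(3, 4.5), (1, 4.0), (99, 3.5)]
named = with_names(recommendations, names)
print(named)
```

Loading the titles once into an in-memory map keeps each lookup cheap; a database table would serve the same purpose if the titles file is too large to hold in memory.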

Related

How to handle release year difference in movie recommendation

I am part of a movie recommendation project. We have developed a Doc2Vec model using gensim.
You can have a look at the gensim documentation if needed.
https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.most_similar
I trained the model, and when I took the top 10 similar movies for a film based on cast, it returned much older movies, with release_yr values such as 1960, 1950, ... So I tried including release_yr as a parameter to the gensim model, but it still shows me old movies. How can I solve this release_yr difference? When I look at the top 10 recommendations for a film, I need movies whose release_yr difference is small (for example, movies from the past 10 years, not more than that). How can I do that?
Code for the Doc2Vec model:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def d2v_doc(titles_df):
    # Tag each tokenized document with its title id so it can be looked up later.
    tagged_data = [TaggedDocument(words=_d, tags=[str(titles_df['id_titles'][i])]) for i, _d in enumerate(titles_df['doc'])]
    model_d2v = Doc2Vec(vector_size=300, min_count=10, dm=1)
    model_d2v.build_vocab(tagged_data)
    model_d2v.train(tagged_data, epochs=100, total_examples=model_d2v.corpus_count)
    return model_d2v
titles_df dataframe contains columns(id_titles, title, release_year, actors, director, writer, doc)
col_names = ['actors', 'director','writer','release_year']
titles_df['doc'] = titles_df[col_names].apply(lambda x: ' '.join(x.astype(str)), axis=1).str.split()
Code for the top 10 similar movies:
import pandas as pd

def titles_lookup(similar_doc, titles_df):
    df = pd.DataFrame(similar_doc, columns=['id_titles', 'similarity'])
    df = pd.merge(df, titles_df[['id_titles', 'title', 'release_year']], on='id_titles', how='left')
    print(df)

def demo_d2v_title(model, titles_df, id_titles):
    similar_doc = model.docvecs.most_similar(id_titles)
    titles_lookup(similar_doc, titles_df)

def demo(model, titles_df):
    print('hunt for red october')
    demo_d2v_title(model, titles_df, 'tt0099810')
The output of the top 10 similar movies for the film "hunt for red october":
   id_titles  similarity  title                     release_year
0  tt0105112    0.541722  Patriot Games                   1992.0
1  tt0267626    0.524941  K19: The Widowmaker             2002.0
2  tt0112740    0.496758  Crimson Tide                    1995.0
3  tt0052151    0.471951  Run Silent Run Deep             1958.0
4  tt1922685    0.464007  Phantom                         2013.0
5  tt0164184    0.462187  The Sum of All Fears            2002.0
6  tt0058962    0.459588  The Bedford Incident            1965.0
7  tt0109444    0.456760  Clear and Present Danger        1994.0
8  tt0063121    0.455807  Ice Station Zebra               1968.0
9  tt0146309    0.452572  Thirteen Days                   2001.0
You can see from the output that I'm still getting old movies. Please help me solve that.
Thanks in advance.
Doc2Vec only knows text similarity; it has no notion of other fields.
So if you want to discard matches according to some criterion other than text similarity, one that is only represented outside the Doc2Vec model, you'll have to do that in a separate step.
For example, you could call .most_similar() with topn=len(model.docvecs) to get back all movies, ranked. Then filter that result set by discarding any movie whose year is too far from your desired year. Finally, trim the result to the top N that you really want.
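The filter-then-trim step can be sketched as follows. This is only an illustration under stated assumptions: the helper name `filter_by_year`, its parameters, and the use of 1990 as the query film's year are hypothetical, and the sample ids, scores, and years are copied from the output in the question rather than produced by a real model.

```python
def filter_by_year(ranked, release_years, target_year, max_gap=10, topn=10):
    """Keep the best-ranked ids whose release year is within max_gap of target_year.

    ranked is a list of (id, similarity) pairs, best first, in the shape
    returned by most_similar(); release_years maps id -> year.
    """
    kept = []
    for doc_id, score in ranked:
        year = release_years.get(doc_id)
        if year is None:
            continue  # no year on record, so it cannot be compared
        if abs(year - target_year) <= max_gap:
            kept.append((doc_id, score))
            if len(kept) == topn:
                break  # already have the top N that pass the filter
    return kept


# Sample values taken from the question's output (a subset of the rows).
ranked = [
    ("tt0105112", 0.541722),
    ("tt0267626", 0.524941),
    ("tt0112740", 0.496758),
    ("tt0052151", 0.471951),
    ("tt0109444", 0.456760),
]
release_years = {
    "tt0105112": 1992.0,
    "tt0267626": 2002.0,
    "tt0112740": 1995.0,
    "tt0052151": 1958.0,
    "tt0109444": 1994.0,
}

# 1990 is an assumed year for the query film, used here for illustration.
recent = filter_by_year(ranked, release_years, target_year=1990, max_gap=10, topn=10)
print(recent)
```

With the model from the question, ranked would come from model.docvecs.most_similar(id_titles, topn=len(model.docvecs)), and release_years could be built with dict(zip(titles_df['id_titles'], titles_df['release_year'])). Because the input is already ranked, the loop can stop as soon as it has collected topn matches.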

Creating join table vs manually defining?

The app I am working on is for a print shop that prints items. Each item can have multiple print locations, and each location has its own price that is added to the final cost of the item.
I'm doing something like this in the create method:
if (@product.front_print && @product.back_print).present?
  @product.production_price = (@product.price.to_i + 5)
else
  @product.production_price = @product.price
end
price is the base price. production_price is the final cost of goods.
In this example, if there is more than one print location (i.e., front + back), $5 is added on top of the base price to get the final production_price.
The final version will have 6 possible print locations.
Should I be creating a join table for something like this, joining a Product model and a PrintLocation model?
The PrintLocation model would have ID, Title, and Price.
Is this ideal, or what would be the best way to go about it?

Grails returning one element from each object

I am trying to just return a single string from each object.
Given the following:
class Book {
String title
Date releaseDate
String author
Boolean paperback
}
For every instance of Book, I want to get an array of authors and then make them unique.
I thought you could do something like:
def authors = Book.findAllByAuthor()
This just gives me an array of Book objects.
I know I can do:
a = []
authors.each { a.add(it.author) }
a.unique()
I am almost certain there is a way to grab all authors in one line.
Any ideas?
This gives you distinct authors of any book:
Book.executeQuery("select distinct author from Book")
You can use projections to get a distinct list of authors across all books. Take a look at the createCriteria documentation for more examples.
def c = Book.createCriteria()
def authors = c.list() {
    projections {
        distinct('author')
    }
}

Updating the data-set when classifying new nominal instances

I'm using J48 to classify instances composed of both numeric and nominal values.
My problem is that I don't know which nominal values I'll come across while my program runs.
Therefore I need to update the model's nominal-attribute data "on the fly".
For instance, say I have only 2 attributes, occupation and age, and the run is as follows:
OccuptaionAttribute = {}.
input: [Piano teacher, 22].
OccuptaionAttribute = {Piano teacher}.
input: [school teacher, 30]
OccuptaionAttribute = {Piano teacher, school teacher}.
input: [Piano teacher, 40]
OccuptaionAttribute = {Piano teacher, school teacher}.
etc.
I have tried to do this manually by copying the previous attributes, adding the new attribute, and then updating the model's data.
That works fine when training the model.
But when I want to classify a new instance, say [SW engineer, 52], OccuptaionAttribute is updated:
OccuptaionAttribute = {Piano teacher, school teacher, SW engineer}, but the tree itself never "met" "SW engineer" before, so the classification cannot be completed and an exception is thrown.
Can you advise how to handle this situation?
Does Weka have any mechanism for this?
Thanks!
When training, add a placeholder value such as __other__ to your nominal attributes.
Before classifying an instance, first check whether the value of the nominal attribute has been seen before; if it has not, use the placeholder value:
Attribute attribute = instances.attribute("OccuptaionAttribute");
String s = "SW engineer";
int index = attribute.indexOfValue(s);
if (index == -1) {
    // Unseen value: fall back to the placeholder added at training time.
    index = attribute.indexOfValue("__other__");
}
When you have enough data, train again with the new values.

MVC: should more specific models be populated by more precise queries too?

If you have a Car model with 20 or so properties (and several table joins) for a carDetail page then your LINQ to SQL query will be quite large.
If you have a carListing page which uses under 5 properties (all from 1 table) then you use a CarSummary model. Should the CarSummary model be populated using the same query as the Car model?
Or should you use a separate LINQ to SQL query which would be more precise?
I am just thinking of performance but LINQ uses lazy loading anyway so I am wondering if this is an issue or not.
Create View Models to represent the different projections you require, and then use a select projection as follows.
from c in Cars
select new CarSummary
{
    Registration = c.Registration,
    ...
}
This will create a query that selects only the properties needed.
Relationships will be resolved if they are represented in the data context diagram (dbml):
select new CarSummary
{
    OwnerName = c.Owner.FirstName
}
You can also nest objects inside the projection:
select new CarSummary
{
    ...
    Owner = new OwnerSummary
    {
        OwnerName = c.Owner.FirstName,
        OwnerAge = c.Owner.Age
    }
    ...
}
If you are using the same projection in many places, it may be helpful to write a method as follows, so that the projection happens in one place.
public IQueryable<CarSummary> CreateCarSummary(IQueryable<Car> cars)
{
    return from c in cars
           select new CarSummary
           {
               ...
           };
}
This can be used like this where required:
public IQueryable<CarSummary> GetNewCars()
{
    var cars = from c in Cars
               select c;
    return CreateCarSummary(cars);
}
I think that in your case lazy loading doesn't bring much benefit as you are going to use 1 property from each table, so sooner or later to render the page you will have to perform all the joins. In my opinion you could use the same query and convert from a Car model to a CarSummary model.
If performance is actually a concern or an issue currently, you should write a separate LINQ projection query so that the SQL query selects only the 5 fields you need to populate your view model, instead of returning all 20 fields.

Resources