What is the best machine learning algorithm in my situation? - machine-learning

Assume that a tourist has no idea which city to visit. I want to recommend the top 10 cities based on his preferences about the city (budgetToTravel, isCoastel, isHistorical, withFamily, etc.).
My dataset contains features for every city, for example:
Venice, Italy (budgetToTravel='5000', isCoastel=1, isHistorical=1, withFamily=1, ...)
Berlin, Germany (budgetToTravel='6000', isHistorical=1, isCoastel=0, withFamily=1, ...)
I want to know the best machine learning algorithm for recommending the top 10 cities to visit based on the features of a tourist.

As Pierre S. stated, you can start with K-Nearest Neighbours.
This algorithm will let you do exactly what you want:
from sklearn.neighbors import NearestNeighbors

n_cities_to_recommend = 10
neigh = NearestNeighbors(n_neighbors=2, radius=1.0)  # you need to play with radius here, or scale your data to [0, 1] with a scaler first
neigh.fit(cities)
user_input = [budgetToTravel, isCoastel, isHistorical, withFamily, ...]
neigh.kneighbors([user_input], n_cities_to_recommend, return_distance=False)  # returns the ids of the closest cities
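For the scaling mentioned in the comment, a minimal sketch (assuming cities is a numeric feature matrix with one row per city, and user_input the tourist's feature vector from above):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
cities_scaled = scaler.fit_transform(cities)   # every feature now in [0, 1]
neigh.fit(cities_scaled)
user_scaled = scaler.transform([user_input])   # scale the tourist the same way
ids = neigh.kneighbors(user_scaled, n_cities_to_recommend, return_distance=False)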

Alternatively, you can use an (unsupervised) clustering algorithm like Hierarchical Clustering or K-Means Clustering to group the cities into clusters (e.g. 10 of them), and then match the person's (tourist's) features against the clusters, as in the sketch below.
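A minimal K-Means sketch of that idea (the number of clusters is just illustrative; cities is again assumed to be a scaled numeric feature matrix):

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
labels = kmeans.fit_predict(cities)                 # cluster id for every city
tourist_cluster = kmeans.predict([user_input])[0]   # cluster the tourist falls into
recommended = [i for i, c in enumerate(labels) if c == tourist_cluster][:10]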

Related

Bart Large MNLI - Get predicted label in a single column

I'm trying to classify the sentences of a specific column into three labels with Bart Large MNLI. The problem is that the output of the model is "sentence + the three labels + the scores for each label". Output example:
{'sequence': 'Growing special event set production/fabrication company is seeking a full-time accountant with experience in entertainment accounting. This position is located in a fast-paced production office located near downtown Los Angeles. Responsibilities: • Payroll management for 12+ employees, including processing new employee paperwork.',
 'labels': ['senior', 'middle', 'junior'],
 'scores': [0.5461998581886292, 0.327671617269516, 0.12612852454185486]}
What I need is to get a single column with only the label with the highest score, in this case "senior".
Any feedback that can help me do it? Right now my code looks like:
from transformers import pipeline

df_test = df.sample(frac=0.0025)  # using a sample of the df for testing the code
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
sequence_to_classify = df_test["full_description"]
candidate_labels = ['senior', 'middle', 'junior']
df_test["seniority_label"] = df_test.apply(lambda x: classifier(x.full_description, candidate_labels, multi_label=True), axis=1)
df_test.to_csv("Seniority_Classified_SampleTest.csv")
The code I've followed comes from this page, where they do receive a column with labels as output (I don't know how): https://practicaldatascience.co.uk/machine-learning/how-to-classify-customer-service-emails-with-bart-mnli
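One way to get just the top label (a hedged sketch: the zero-shot pipeline returns its labels sorted by score, highest first, so the first label is the winner):

df_test["seniority_label"] = df_test.apply(
    lambda x: classifier(x.full_description, candidate_labels, multi_label=True)["labels"][0],
    axis=1,
)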

How to handle release year difference in movie recommendation

I have been part of a movie recommendation project. We have developed a doc2vec model using gensim.
You can have a look at the gensim documentation if needed:
https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.most_similar
I trained the model, and when I took the top 10 similar movies for a film based on cast, it gives very old movies with release_yr values like 1960, 1950, .... So I tried including release_yr as a parameter to the gensim model, but it still shows me old movies. How can I solve this release_yr difference? When I look at the top 10 recommendations for a film, I need movies whose release_yr difference is small (like movies from the past 10 years, not more). How can I do that?
Code for the doc2vec model:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def d2v_doc(titles_df):
    tagged_data = [TaggedDocument(words=_d, tags=[str(titles_df['id_titles'][i])])
                   for i, _d in enumerate(titles_df['doc'])]
    model_d2v = Doc2Vec(vector_size=300, min_count=10, dm=1)
    model_d2v.build_vocab(tagged_data)
    model_d2v.train(tagged_data, epochs=100, total_examples=model_d2v.corpus_count)
    return model_d2v
The titles_df dataframe contains the columns (id_titles, title, release_year, actors, director, writer, doc):
col_names = ['actors', 'director', 'writer', 'release_year']
titles_df['doc'] = titles_df[col_names].apply(lambda x: ' '.join(x.astype(str)), axis=1).str.split()
Code for the top 10 similar movies:
import pandas as pd

def titles_lookup(similar_doc, titles_df):
    df = pd.DataFrame(similar_doc, columns=['id_titles', 'similarity'])
    df = pd.merge(df, titles_df[['id_titles', 'title', 'release_year']], on='id_titles', how='left')
    print(df)

def demo_d2v_title(model, titles_df, id_titles):
    similar_doc = model.docvecs.most_similar(id_titles)
    titles_lookup(similar_doc, titles_df)

def demo(model, titles_df):
    print('hunt for red october')
    demo_d2v_title(model, titles_df, 'tt0099810')
The output of the top 10 similar movies for the film "hunt for red october":
id_titles similarity title release_year
0 tt0105112 0.541722 Patriot Games 1992.0
1 tt0267626 0.524941 K19: The Widowmaker 2002.0
2 tt0112740 0.496758 Crimson Tide 1995.0
3 tt0052151 0.471951 Run Silent Run Deep 1958.0
4 tt1922685 0.464007 Phantom 2013.0
5 tt0164184 0.462187 The Sum of All Fears 2002.0
6 tt0058962 0.459588 The Bedford Incident 1965.0
7 tt0109444 0.456760 Clear and Present Danger 1994.0
8 tt0063121 0.455807 Ice Station Zebra 1968.0
9 tt0146309 0.452572 Thirteen Days 2001.0
You can see from the output that I'm still getting old movies. Please help me figure out how to solve that.
Thanks in advance.
Doc2Vec only knows text similarity; it doesn't have any idea of other fields.
So if you want to discard matches according to some criterion other than text similarity, one that's only represented externally to the Doc2Vec model, you'll have to do that in a separate step.
So, you could use .most_similar() with a topn=len(model.docvecs) parameter to get back all movies, ranked. Then filter that result set, discarding any whose year is too far from your desired year. Then trim that result set to the top N you really want. See the sketch below.
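A minimal sketch of that recipe, reusing the names from the question (max_gap=10 matches the "past 10 years" requirement):

def similar_recent(model, titles_df, id_titles, n=10, max_gap=10):
    # rank ALL movies by text similarity, then filter by year in a separate step
    all_hits = model.docvecs.most_similar(id_titles, topn=len(model.docvecs))
    years = titles_df.set_index('id_titles')['release_year']
    target_year = years[id_titles]
    kept = [(mid, sim) for mid, sim in all_hits
            if abs(years.get(mid, target_year) - target_year) <= max_gap]
    return kept[:n]  # trim to the top N you really want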

How to predict an item's category given its name?

Currently I have a database consisting of about 600,000 records representing merchandise with category information, which looks like this:
{'title': 'Canon camera', 'category': 'Camera'},
{'title': 'Panasonic refrigerator', 'category': 'Refrigerator'},
{'title': 'Logo', 'category': 'Toys'},
....
But there are merchandise items without category information:
{'title': 'Iphone6', 'category': ''},
So I'm wondering whether it is possible to train a text classifier based on my items' names using scikit-learn to help me predict which category the merchandise should be in. I'm framing this as a multi-class text classification problem, but there are also one-to-many pictures for each item, so maybe deep learning/Keras could also be used?
I don't know the best way to solve this problem, so any suggestion or advice is welcome. Thank you for reading.
P.S. The actual text is in Japanese.
You could build a character 2-gram / 3-gram model and calculate values such as: how often does the 3-gram "pho" appear in the category "Camera"?
trigrams = {}
for record in records:  # only the ones with categories
    title = record['title']
    cat = record['category']
    for trigram in zip(title, title[1:], title[2:]):
        if trigram not in trigrams:
            trigrams[trigram] = {category: 0 for category in categories}
        trigrams[trigram][cat] += 1
Now you can use a title's trigrams to calculate a score:
scores = []
for trigram in zip(title, title[1:], title[2:]):
    score = []
    for cat in categories:
        score.append(trigrams[trigram][cat])
    # Normalize
    sum_ = float(sum(score))
    score = [s / sum_ for s in score]
    scores.append(score)
Now scores contains a probability distribution for every trigram: P(class | trigram). It does not take into account that some classes are simply more common (the prior; see Bayes' theorem). I'm also not quite sure whether you should do something about the fact that some titles might just be really long and thus have a lot of trigrams; I guess taking the prior into account handles that already.
If it turns out that you have many trigrams missing, you could switch to bigrams, or simply do Laplace smoothing, as in the sketch below.
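For Laplace smoothing, the scoring loop above only needs add-one counts (a sketch, with trigrams and categories as defined earlier):

scores = []
for trigram in zip(title, title[1:], title[2:]):
    counts = trigrams.get(trigram, {})                       # missing trigrams become empty
    score = [counts.get(cat, 0) + 1 for cat in categories]   # add-one (Laplace) smoothing
    sum_ = float(sum(score))                                 # never zero now
    scores.append([s / sum_ for s in score])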
Edit: I've just seen that the text is in Japanese. I think the n-gram approach might be useless there. You could translate the names. However, it is probably easier to take this information from other sources (e.g. Wikipedia / Amazon / eBay?).
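If you do try the scikit-learn classifier mentioned in the question, a minimal character n-gram sketch could look like this (assuming records is the list of dicts shown above; character n-grams need no word segmentation, so they may still be worth testing on Japanese):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

titles = [r['title'] for r in records if r['category']]
labels = [r['category'] for r in records if r['category']]

clf = make_pipeline(
    TfidfVectorizer(analyzer='char', ngram_range=(2, 3)),  # char bigrams + trigrams
    LogisticRegression(max_iter=1000),
)
clf.fit(titles, labels)
print(clf.predict(['Iphone6']))  # predicted category for an unlabeled title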

Mahout for content-based recommendation

I have a list of user data (user name, age, sex, address, location, etc.) and
a set of product data (product name, cost, description, etc.).
Now I would like to build a recommendation engine that will be able to:
1. Figure out similar products, e.g.:
name : category : cost : ingredients
x : x1 : 15 : xx1, xx2, xx3
y : y1 : 14 : yy1, yy2, yy3
z : x1 : 12 : xx1, xy1
Here x and z are similar.
2. Recommend relevant products from the product list to the user.
How can this kind of recommendation engine be implemented with Mahout? Which methods are available? Is there any useful tutorial/link available? Please help.
In Mahout v1 (from https://github.com/apache/mahout) you can use "spark-rowsimilarity" to create indicators for each type of metadata: category, cost, and ingredients. This will give you three matrices containing similar items for each item based on that particular piece of metadata, which gives you a "more like this" type of recommendation. You can also try combining the metadata into one input matrix and see if that gives better results.
To personalize this, record which items the user has expressed some preference for. Index the indicator matrices in Solr, one indicator per Solr "field", all attached to the item ID (name?). Then the query is the user's history against each field. You can boost certain fields to increase their weight in the recommendations; a sketch of such a query follows below.
This is described on the Mahout site (http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html) and in some slides here: http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/
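A hedged sketch of such a personalized query (the collection and field names here are made up; it only uses Solr's standard select API with q, rows, and ^ boosts):

import requests

# the user's history, one clause per indicator field, with category boosted 2x
params = {
    'q': 'category_indicator:(x1)^2.0 OR ingredients_indicator:(xx1 xx2)',
    'rows': 10,   # top-10 recommendations
    'wt': 'json',
}
resp = requests.get('http://localhost:8983/solr/items/select', params=params)
print(resp.json()['response']['docs'])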

How does Mahout user-based recommendation work?

I am using the generic user-based recommender of the Mahout Taste API to generate recommendations.
I know it recommends based on ratings given by past users, but I don't get the mathematics behind its selection of the recommended item. For example, for user id 58:
itemid  rating
231     5
235     5.5
245     5.88
The 3 neighbors are, with itemid and ratings:
{231: 4, 254: 5, 262: 2, 226: 5}
{235: 3, 245: 4, 262: 3}
{226: 4, 262: 3}
How does it recommend 226?
Thanks in advance.
It depends on the UserSimilarity and the UserNeighborhood you have chosen for your recommender, but in general the algorithm works as follows for user u:
for every other user w:
    compute a similarity s between u and w
retain the top users, ranked by similarity, as a neighborhood n
for every item i that some user in n has a preference for, but that u has no preference for yet:
    for every other user v in n that has a preference for i:
        compute a similarity s between u and v
        incorporate v's preference for i, weighted by s, into a running average
Source: Mahout in Action http://manning.com/owen/
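A tiny sketch of that running average for the data in the question (the similarity values are made up; the real ones come from your UserSimilarity, so the actual ranking can differ):

def estimate_preference(item, neighbors):
    # neighbors: list of (similarity to u, {item_id: rating, ...})
    num = den = 0.0
    for s, prefs in neighbors:
        if item in prefs:
            num += s * prefs[item]
            den += abs(s)
    return num / den if den else None

neighbors = [
    (0.9, {231: 4, 254: 5, 262: 2, 226: 5}),
    (0.7, {235: 3, 245: 4, 262: 3}),
    (0.5, {226: 4, 262: 3}),
]
# estimate every item user 58 hasn't rated yet; the recommender returns the highest
for item in (254, 262, 226):
    print(item, estimate_preference(item, neighbors))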
