How to predict an item's category given its name? - machine-learning

Currently I have a database of about 600,000 records representing merchandise with category information, which looks like the following:
{'title': 'Canon camera', 'category': 'Camera'},
{'title': 'Panasonic refrigerator', 'category': 'Refrigerator'},
{'title': 'Logo', 'category': 'Toys'},
....
But there are also items without category information:
{'title': 'Iphone6', 'category': ''},
So I'm wondering whether it is possible to train a text classifier on my items' names using scikit-learn to help me predict which category a piece of merchandise should belong to. I'm framing this as a multi-class text classification problem, but there are also one to many pictures for each item, so maybe deep learning/Keras could be used as well?
I don't know what the best way to solve this problem is, so any suggestion or advice is welcome. Thank you for reading this.
P.S. the actual text is in Japanese

You could build a character 2-gram / 3-gram model and calculate counts such as: how often does the 3-gram "pho" appear in the category "Camera"?
trigrams = {}
for record in records:  # only the records that already have a category
    title = record['title']
    cat = record['category']
    for trigram in zip(title, title[1:], title[2:]):
        if trigram not in trigrams:
            trigrams[trigram] = {}
            for category in categories:
                trigrams[trigram][category] = 0
        trigrams[trigram][cat] += 1
Now you can use a title's trigrams to calculate a score:
scores = []
for trigram in zip(title, title[1:], title[2:]):
    score = []
    for cat in categories:
        score.append(trigrams[trigram][cat])
    # Normalize
    sum_ = float(sum(score))
    score = [s / sum_ for s in score]
    scores.append(score)
Now scores contains a probability distribution for every trigram: P(class | trigram). It does not take into account that some classes are simply more common (the prior; see Bayes' theorem). I'm also not quite sure whether you should do something about the problem that some titles might just be really long and thus have a lot of trigrams; I guess taking the prior into account handles that already.
If it turns out that you have many trigrams missing, you could switch to bigrams. Or simply do Laplace smoothing.
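A minimal sketch of the smoothing idea, reusing the trigrams and categories structures from the snippets above; the alpha constant and the helper name are just illustrative:
def smoothed_scores(title, trigrams, categories, alpha=1.0):
    # Per-trigram P(class | trigram) with add-one (Laplace) smoothing,
    # so trigrams never seen in training fall back to a uniform distribution.
    scores = []
    for trigram in zip(title, title[1:], title[2:]):
        counts = trigrams.get(trigram, {})
        score = [counts.get(cat, 0) + alpha for cat in categories]
        total = float(sum(score))
        scores.append([s / total for s in score])
    return scores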
Edit: I've just seen that the text is in Japanese. I think the n-gram approach might be useless there. You could translate the names. However, it is probably easier to just take other sources for this information (e.g. Wikipedia / Amazon / eBay?).

Related

Why is my CNN returning tokens instead of readable labels?

I am currently studying machine learning and have created a CNN using fastai that labels the category of clothing items. I built this model using the Fashion-MNIST data set.
Everything functions fine and it looks like it's predicting correctly, but I don't know how to make it return the labels and categories rather than the weird tokenized text it is returning. Where am I going wrong?
Here is some code
This is where I create the dataframe that has the category mapped to the image path.
from fastcore.all import *
ds = dataFrame.filter(['masterCategory', 'imagePath'], axis=1)
ds
masterCategory imagePath
0 Apparel ../input/fashion-product-images-small/images/1...
1 Apparel ../input/fashion-product-images-small/images/3...
2 Accessories ../input/fashion-product-images-small/images/5...
3 Apparel ../input/fashion-product-images-small/images/2...
4 Apparel ../input/fashion-product-images-small/images/5...
... ... ...
44419 Footwear ../input/fashion-product-images-small/images/1...
44420 Footwear ../input/fashion-product-images-small/images/6...
44421 Apparel ../input/fashion-product-images-small/images/1...
44422 Personal Care ../input/fashion-product-images-small/images/4...
44423 Accessories ../input/fashion-product-images-small/images/5...
44424 rows × 2 columns
Then I create a datablock
def getImages(d): return d['imagePath']
def getLabel(d): return d['masterCategory']

from fastai.vision.all import *
dblock = DataBlock(
    blocks=(ImageBlock, MultiCategoryBlock),
    get_x=getImages,
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
    get_y=getLabel,
    item_tfms=[Resize(192, method='squish')]
)
Then I use the dataloader, and when I show a batch I get these weird labels instead of the master categories.
dsets = dblock.dataloaders(ds, bs=32)
dsets.show_batch(max_n=20)
Thank you.
I found the issue. The block I needed is not MultiCategoryBlock, it is CategoryBlock. I thought that since there were multiple categories to pick from, that was what was needed, but no: MultiCategoryBlock is used to label one image with multiple categories, not to pick from multiple categories.
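For reference, a minimal sketch of the corrected DataBlock, reusing the getter functions and transforms from the question; only the block type changes:
dblock = DataBlock(
    blocks=(ImageBlock, CategoryBlock),  # single-label classification
    get_x=getImages,
    get_y=getLabel,
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
    item_tfms=[Resize(192, method='squish')]
)
dls = dblock.dataloaders(ds, bs=32)
dls.show_batch(max_n=20)  # should now show the masterCategory names instead of encoded tokens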

Bart Large MNLI - Get predicted label in a single column

I'm trying to classify the sentences of a specific column into three labels with Bart Large MNLI. The problem is that the output of the model is the sentence plus the three labels plus the scores for each label. Output example:
{'sequence': 'Growing special event set production/fabrication company
is seeking a full-time accountant with experience in entertainment
accounting. This position is located in a fast-paced production office
located near downtown Los Angeles.Responsibilities:• Payroll
management for 12+ employees, including processing new employee
paperwork.', 'labels': ['senior', 'middle', 'junior'], 'scores':
[0.5461998581886292, 0.327671617269516, 0.12612852454185486]}
What I need is to get a single column with only the label with the highest score, in this case "senior".
Any feedback that could help me do this would be appreciated. Right now my code looks like:
df_test = df.sample(frac = 0.0025)
classifier = pipeline("zero-shot-classification",
model="facebook/bart-large-mnli")
sequence_to_classify = df_test["full_description"]
candidate_labels = ['senior', 'middle', 'junior']
df_test["seniority_label"] = df_test.apply(lambda x: classifier(x.full_description, candidate_labels, multi_label=True,), axis=1)
df_test.to_csv("Seniority_Classified_SampleTest.csv")
(Using a sample of the df to test the code.)
The code I've followed comes from this page, where they do receive a column with labels as the output, though I don't know how: https://practicaldatascience.co.uk/machine-learning/how-to-classify-customer-service-emails-with-bart-mnli
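A minimal sketch of one way to keep only the top label: the zero-shot pipeline returns its labels list sorted by descending score, so the first entry is the predicted label (the helper name here is just illustrative):
def top_label(text):
    result = classifier(text, candidate_labels, multi_label=True)
    return result['labels'][0]  # labels come back sorted by descending score

df_test["seniority_label"] = df_test["full_description"].apply(top_label)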

How to handle release year difference in movie recommendation

I have been part of a movie recommendation project. We have developed a doc2vec model using gensim.
You can have a look at gensim documentation if needed.
https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.most_similar
I trained the model, and when I take the top 10 similar movies for a film based on cast, it gives me very old movies with release_yr values like 1960, 1950, and so on. So I tried including release_yr as a parameter to the gensim model, but it still shows me old movies. How can I handle this release_yr difference? When I look at the top 10 recommendations for a film, I want movies whose release_yr difference is small (say movies from the past 10 years, not more than that). How can I do that?
code for doc2vec model
def d2v_doc(titles_df):
    tagged_data = [TaggedDocument(words=_d, tags=[str(titles_df['id_titles'][i])]) for i, _d in enumerate(titles_df['doc'])]
    model_d2v = Doc2Vec(vector_size=300, min_count=10, dm=1)
    model_d2v.build_vocab(tagged_data)
    model_d2v.train(tagged_data, epochs=100, total_examples=model_d2v.corpus_count)
    return model_d2v
titles_df dataframe contains columns(id_titles, title, release_year, actors, director, writer, doc)
col_names = ['actors', 'director','writer','release_year']
titles_df['doc'] = titles_df[col_names].apply(lambda x: ' '.join(x.astype(str)), axis=1).str.split()
Code for Top 10 similar movies
def titles_lookup(similar_doc, titles_df):
    df = pd.DataFrame(similar_doc, columns=['id_titles', 'similarity'])
    df = pd.merge(df, titles_df[['id_titles', 'title', 'release_year']], on='id_titles', how='left')
    print(df)

def demo_d2v_title(model, titles_df, id_titles):
    similar_doc = model.docvecs.most_similar(id_titles)
    titles_lookup(similar_doc, titles_df)

def demo(model, titles_df):
    print('hunt for red october')
    demo_d2v_title(model, titles_df, 'tt0099810')
The output for Top 10 similar movies for film - "hunt for red october"
id_titles similarity title release_year
0 tt0105112 0.541722 Patriot Games 1992.0
1 tt0267626 0.524941 K19: The Widowmaker 2002.0
2 tt0112740 0.496758 Crimson Tide 1995.0
3 tt0052151 0.471951 Run Silent Run Deep 1958.0
4 tt1922685 0.464007 Phantom 2013.0
5 tt0164184 0.462187 The Sum of All Fears 2002.0
6 tt0058962 0.459588 The Bedford Incident 1965.0
7 tt0109444 0.456760 Clear and Present Danger 1994.0
8 tt0063121 0.455807 Ice Station Zebra 1968.0
9 tt0146309 0.452572 Thirteen Days 2001.0
You can see from the output that I'm still getting old movies. Please help me figure out how to solve this.
Thanks in advance.
Doc2Vec only knows about text similarity; it has no notion of other fields.
So if you want to discard matches according to some criterion other than text similarity, one that is only represented outside the Doc2Vec model, you'll have to do that in a separate step.
So, you could call .most_similar() with topn=len(model.docvecs) to get back all movies, ranked. Then filter that result set by discarding any whose year is too far from your desired year. Then trim that result set to the top N that you really want.
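A minimal sketch of that filter-then-trim step, reusing titles_df and the model.docvecs.most_similar() call from the question; the 10-year window and the helper name are just illustrative:
def similar_recent(model, titles_df, id_titles, max_year_diff=10, topn=10):
    year_by_id = dict(zip(titles_df['id_titles'], titles_df['release_year']))
    target_year = year_by_id[id_titles]
    # Rank every movie by text similarity, then drop those released too far from the query film
    all_similar = model.docvecs.most_similar(id_titles, topn=len(model.docvecs))
    kept = [(movie_id, sim) for movie_id, sim in all_similar
            if abs(year_by_id.get(movie_id, target_year) - target_year) <= max_year_diff]
    return kept[:topn]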

How can I retrieve words back from a word mapping using a numpy array? [TensorFlow RNN] Text classification

Here's what I have:
vocab_processor = skflow.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH)
X_train = np.array(list(vocab_processor.fit_transform(X_train)))
X_test = np.array(list(vocab_processor.transform(X_test)))
Now it creates a numpy array of the ids of the words in the word dictionary.
What should I do if I want to retrieve those words back from the dictionary?
There is a function called reverse(document), but it doesn't work in this case: it returns a list containing the <UNK> marker.
['What is most beautiful in <UNK> men is something feminine'
"The camera makes everyone a tourist in other people's reality"
'<UNK> in reality is the worst of all evils because' ...,
'<UNK> aware that no bank would do this as they'
'<UNK> keep sending you many details through the post like'
'<UNK> banking transactions should be conducted in a secure place']
This will give you an id-to-word mapping:
w_dict = {v:k for k,v in vocab_processor.vocabulary_._mapping.items()}
Then you can get the words:
words = w_dict.values()
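If the goal is to map the id arrays in X_train back to the original words, a minimal sketch using the w_dict built above (the helper name is just illustrative, and id 0 is assumed to be the padding token):
def ids_to_words(id_row, w_dict):
    # Skip padding ids (0) and any id missing from the vocabulary
    return ' '.join(w_dict[i] for i in id_row if i != 0 and i in w_dict)

recovered_sentences = [ids_to_words(row, w_dict) for row in X_train]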

Categorizing Hashtags based on similarities

I have different documents with a list of hashtags in each. I would like to group them under the most relevant hashtag (which would be present in the document itself).
E.g.: if there are #Eco, #Ecofriendly and #GoingGreen, I would like to group all of these under the most relevant and representative hashtag (say #Eco). How should I be approaching this, and what techniques and algorithms should I be looking at?
I would create a bipartite graph of documents and hashtags and use clustering on the bipartite graph:
http://www.cs.utexas.edu/users/inderjit/public_papers/kdd_bipartite.pdf
This way I am not using the content of the documents, just clustering the hashtags, which is what you wanted.
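A minimal sketch of that idea, using scikit-learn's SpectralCoclustering (an implementation of the bipartite spectral co-clustering approach in the linked paper); documents is assumed to be a list of objects with a hashtags attribute, as in the snippet in the other answer below, and n_clusters is just an illustrative choice:
import numpy as np
from sklearn.cluster import SpectralCoclustering

hashtag_list = sorted({h for D in documents for h in D.hashtags})
col = {h: j for j, h in enumerate(hashtag_list)}

# Document x hashtag incidence matrix
X = np.zeros((len(documents), len(hashtag_list)))
for i, D in enumerate(documents):
    for h in D.hashtags:
        X[i, col[h]] = 1

model = SpectralCoclustering(n_clusters=5, random_state=0).fit(X)
# model.row_labels_ groups the documents, model.column_labels_ groups the hashtags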
Your question is not very precise, and as such may have multiple answers. However, if we assume that you literally want to "group all these under the most common hashtag", then simply loop through all hashtags, compute how often they occur, and then for each document select the one with the highest number of occurrences.
Something like
N = {}
for D in documents:
    for h in D.hashtags:
        if h not in N:
            N[h] = 0
        N[h] += 1

for D in documents:
    best = None
    for h in D.hashtags:
        if best is None or N[best] < N[h]:
            best = h
    print('Document', D, 'should be tagged with', best)
