Problem
I want to find the best matching sentence from a predefined list of sentences, given a user's keywords.
A common use case would be Instagram hashtags: the user enters a few hashtags and gets a suggested sentence that best encapsulates them.
Imagine the user entered 3 hashtags:
#water #sunny #outdoor.
Our predefined sentences:
["Today is a beautiful day", "Grass is green", "Its sunny outside"].
Best match:
I guess it's not trivial to determine what the best match is; it doesn't have to be the most similar in terms of words or characters, but it should summarize the keywords best.
In our example: "Its sunny outside"
You can use an NLP model such as word2vec to get similarities between words.
You could then compare the hashtags with each word in each sentence, and find the best match.
You should use a tf-idf approach to weight the words, for better performance in your word-to-sentence matching.
You can also use doc2vec, but it's mostly implemented as a special case of word2vec, and I'm not sure how hashtags, which don't form a proper sentence, will behave.
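Here is a minimal sketch of that word-to-sentence matching idea, assuming pretrained GloVe vectors loaded through gensim's downloader and sklearn for the tf-idf weights; the hashtags, sentences, and scoring scheme are purely illustrative:

```python
# Sketch: score each predefined sentence against the hashtags using
# word2vec-style similarities, weighted by tf-idf. Assumes gensim 4.x,
# scikit-learn, and an internet connection for the pretrained vectors.
import gensim.downloader as api
from sklearn.feature_extraction.text import TfidfVectorizer

wv = api.load("glove-wiki-gigaword-50")          # pretrained word vectors

hashtags = ["water", "sunny", "outdoor"]
sentences = ["Today is a beautiful day", "Grass is green", "Its sunny outside"]

# tf-idf weights so common words contribute less to a sentence's score
tfidf = TfidfVectorizer().fit(sentences)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def sentence_score(sentence):
    words = [w.lower() for w in sentence.split() if w.lower() in wv]
    score = 0.0
    for tag in hashtags:
        if tag not in wv:
            continue
        # best idf-weighted similarity between this hashtag and any sentence word
        score += max(
            (wv.similarity(tag, w) * idf.get(w, 1.0) for w in words),
            default=0.0,
        )
    return score

best = max(sentences, key=sentence_score)
print(best)   # expected to pick "Its sunny outside" for these hashtags
```

Here each hashtag contributes its best weighted similarity to any word in the sentence; whether to sum, average, or take a maximum over hashtags is a design choice worth experimenting with.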
Related
Given a query and a document, I would like to compute a similarity score using Gensim doc2vec.
Each document consists of multiple fields (e.g., main title, author, publisher, etc.).
For training, is it better to concatenate the document fields and treat each row as a unique document or should I split the fields and use them as different training examples?
For inference, should I treat a query like a document? Meaning, should I call the model (trained over the documents) on the query?
The right answer will depend on your data & user behavior, so you'll want to try several variants.
Just to get some initial results, I'd suggest combining all fields into a single 'document', for each potential query-result, and using the (fast-to-train) PV-DBOW mode (dm=0). That will let you start seeing results, doing either some informal assessment or beginning to compile some automatic assessment data (like lists of probe queries & docs that they "should" rank highly).
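As a rough sketch of that starting point, assuming gensim 4.x and placeholder field names, records, and parameters, training and query inference could look something like this:

```python
# Sketch: combine all fields into one document per record, train PV-DBOW
# (dm=0), and treat a query as a small document at inference time.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

records = [
    {"id": "doc1", "title": "Deep Learning", "author": "Goodfellow", "publisher": "MIT Press"},
    {"id": "doc2", "title": "Speech and Language Processing", "author": "Jurafsky", "publisher": "Pearson"},
]

def to_tokens(rec):
    # concatenate all fields into a single token stream per document
    text = " ".join(str(v) for k, v in rec.items() if k != "id")
    return text.lower().split()

corpus = [TaggedDocument(to_tokens(r), [r["id"]]) for r in records]

model = Doc2Vec(corpus, dm=0, vector_size=100, epochs=40, min_count=1)  # dm=0 -> PV-DBOW

# treat the query like a document: infer a vector, then rank the stored docs
query_vec = model.infer_vector("deep learning goodfellow".split())
print(model.dv.most_similar([query_vec], topn=2))
```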
You could then try testing the idea of making the fields separate docs – either instead-of, or in addition-to, the single-doc approach.
Another option might be to create specialized word-tokens per field. That is, when 'John' appears in the title, you'd actually preprocess it to be 'title:John', and when in author, 'author:John', etc. (This might be in lieu of, or in addition to, the naked original token.) That could enhance the model to also understand the shifting senses of each token, depending on the field.
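One possible (hypothetical) preprocessing helper for such field-prefixed tokens, emitting the prefixed token and optionally the naked original as well:

```python
# Sketch: turn a field's text into 'field:token' tokens, optionally keeping
# the plain token too. Field names and behavior are illustrative choices.
def field_tokens(field, text, keep_plain=True):
    tokens = []
    for word in text.lower().split():
        tokens.append(f"{field}:{word}")   # e.g. 'title:john', 'author:john'
        if keep_plain:
            tokens.append(word)            # optionally keep the naked token too
    return tokens

print(field_tokens("title", "John Smith"))
# ['title:john', 'john', 'title:smith', 'smith']
```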
Then, providing you have enough training data, & choose other model parameters well, your search interface might also preprocess queries similarly, when the user indicates a certain field, and get improved results. (Or maybe not: it's just an idea to be tried.)
In all cases, if you need precise results – exact matches of well-specified user queries – more traditional searches like exact DB matches/greps, or full-text reverse-indexes, will outperform Doc2Vec. But when queries are more approximate, and results need filling-out with near-in-meaning-even-if-not-in-literal-tokens results, a fuzzier vector document representation may be helpful.
I am having a hard time understanding the process of building a bag-of-words. This will be a multiclass classification supervised machine learning problem wherein a webpage or a piece of text is assigned to one category from multiple pre-defined categories. Now the method that I am familiar with when building a bag of words for a specific category (for example, 'Math') is to collect a lot of webpages that are related to Math. From there, I would perform some data processing (such as removing stop words and applying TF-IDF) to obtain the bag-of-words for the category 'Math'.
Question: Another method that I am thinking of is to instead search Google for something like 'List of terms related to Math' to build my bag-of-words. I would like to ask if this method is okay?
Another question: In the context of this question, does bag-of-words and corpus mean the same thing?
Thank you in advance!
This is not what bag of words is. Bag of words is the term for a specific way of representing a given document. Namely, a document (paragraph, sentence, webpage) is represented as a mapping of the form
word: how many times this word is present in a document
for example "John likes cats and likes dogs" would be represented as: {john: 1, likes: 2, cats: 1, and: 1, dogs: 1}. This kind of representation can be easily fed into typical ML methods (especially if one assumes that total vocabulary is finite so we end up with numeric vectors).
Note that this is not about "creating a bag of words for a category". A category, in typical supervised learning, would consist of multiple documents, and each of them is independently represented as a bag of words.
In particular, this invalidates your final proposal of asking Google for words that are related to a category: this is not how typical ML methods work. You get a lot of documents, represent them as bags of words (or something else), and then perform statistical analysis (build a model) to figure out the best set of rules to discriminate between categories. These rules will usually not be simply "if the word X is present, this is related to Y".
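To make that workflow concrete, here is a hedged sketch of "represent documents as bags of words, then build a model"; the tiny corpus and labels are made up purely for illustration:

```python
# Sketch: many labeled documents -> bag-of-words/tf-idf features -> classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "the integral of a polynomial and its derivative",
    "prime numbers and modular arithmetic proofs",
    "the treaty ended the war between the two empires",
    "the revolution changed the country's government",
]
labels = ["math", "math", "history", "history"]

clf = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
clf.fit(docs, labels)

print(clf.predict(["a proof about derivatives of polynomials"]))  # likely 'math'
```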
I want to predict movie gross collections using data that is available before release, e.g. title, actors, director, studio, critic ratings, genre, etc. I found a way to numerically quantify most of these, but could not quantify the title. The title conveys a lot of useful information, such as whether the movie is a sequel, the length of the title, the associated sentiment, etc. How can I extract this information quantitatively from the title?
Bag-of-words (BoW) is the standard way to create text-based features, though I wouldn't recommend it here, since movie titles are short and many of them contain out-of-context words and named entities, which will make your feature vector even more sparse.
You can create a word2vec encoding of each word of the title and then take the mean vector as the title feature. My favorite tools for this are gensim and Facebook's fastText.
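A minimal sketch of that averaged-word-vector title feature, using pretrained GloVe vectors through gensim's downloader; the titles are placeholders:

```python
# Sketch: represent each movie title as the mean of its word vectors.
import numpy as np
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")    # pretrained word vectors

def title_vector(title):
    words = [w for w in title.lower().split() if w in wv]
    if not words:
        return np.zeros(wv.vector_size)    # fall back for out-of-vocabulary titles
    return np.mean([wv[w] for w in words], axis=0)

features = np.vstack([title_vector(t) for t in ["The Dark Knight Rises", "Frozen"]])
print(features.shape)   # (2, 50) -> ready to join with the other numeric features
```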
I am building a recommendation system for dishes. Consider a user who eats french fries and rates them a 5. I then want to give a good rating to all the ingredients the dish is made of. In the case of french fries, the linked words should be "fried", "potato", "junk food", "salty", and so on. From the word Tsatsiki I want to extract "Cucumbers", "Yoghurt", "Garlic". From Yoghurt I want to extract "Milk product", from Cucumbers "Vegetable", and so on.
What is this problem called in Natural Language Processing and is there a way to address it?
I have no data at all, and I am thinking of building a web crawler that analyzes the web for each dish. I would like the approach to be as little ad hoc as possible, and not necessarily limited to English. Is there a way, maybe within deep learning, to do this? I would like a dish to be linked not only to its ingredients but also to a category: junk food, vegetarian, Italian food, and so on.
This type of problem is called ontology engineering or ontology building. For an example of a large ontology and how it's structured, you might check out something like YAGO. It seems like you are going to be building a boutique ontology for food and then overlaying a ratings system. I don't know of any ontologies out there of the form you're looking for, but there are relevant things you should take a look at, for example, this OWL-based food ontology and this recipe ontology.
Do you have a recipe like this:
Ingredients:
*Cucumbers
*Garlic
*Yoghurt
or like this:
Grate a cucumber or chop it. Add garlic and yoghurt.
If the former, your features have already been extracted. The next step would be to convert them to a vector and recommend other recipes. The simplest way would be to do (unsupervised) clustering of the recipes.
If the latter, I suspect you can get away with a simple rule of thumb. First, use a part-of-speech tagger to extract all the nouns in the recipe. This will extract all the ingredients and a bit more (e.g. kitchen appliances, cutlery, etc.). Then look up the nouns in a database of food ingredients, such as this one.
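A rough sketch of that rule of thumb, using spaCy's part-of-speech tagger; the small ingredient set below stands in for a real food-ingredient database:

```python
# Sketch: extract nouns from free-text recipe instructions, then keep only
# those found in a (placeholder) ingredient database.
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes this model has been downloaded

recipe = "Grate a cucumber or chop it. Add garlic and yoghurt."
nouns = {tok.lemma_.lower() for tok in nlp(recipe) if tok.pos_ == "NOUN"}
print(nouns)                         # e.g. {'cucumber', 'garlic', 'yoghurt'}

known_ingredients = {"cucumber", "garlic", "yoghurt", "potato", "milk"}  # placeholder DB
print(nouns & known_ingredients)     # keep only nouns that are actual ingredients
```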
I am working on a problem where I need to cluster search phrases based on what they are looking for (for now, let's assume they are looking only for places, such as a bookstore, supermarket, ...).
"Where can I find a cheesecake ?"
could get clustered probabilistically to 'desserts', 'restaurants', ...
"Where can I buy groceries ?"
could get clustered probabilistically to 'supermarkets', 'vegetables', ...
Assume, to begin with, that a set of classes the search phrases could be assigned to already exists.
I looked into topic modeling, but I feel like I might be heading in the wrong direction. Any suggestions on how to get started or what to look into would be very helpful.
Thanks a lot.
Topic modelling certainly provides one possible solution. Induce a topic model from a large corpus, as representative as possible of the texts you're indexing and searching with. Then represent each query as the posterior over the topics given the query. If you want to obtain a clustering of queries, you could then do so on this reduced set, or if you're doing IR you could use the resulting vectors instead of the original bag of words.
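As an illustration of representing each query by its topic posterior and clustering in that reduced space, here is a toy sketch using sklearn's LDA; the corpus, queries, and parameters are made up:

```python
# Sketch: train LDA on a corpus, map each query to its topic distribution,
# then cluster queries in that reduced topic space.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

corpus = [
    "cheesecake dessert bakery cake sweet",
    "restaurant dinner menu food place",
    "supermarket groceries vegetables fruit store",
    "bookstore books novels reading shop",
]
vec = CountVectorizer()
X = vec.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)

queries = ["Where can I find a cheesecake ?", "Where can I buy groceries ?"]
Q = lda.transform(vec.transform(queries))   # each row: posterior over topics
print(Q)

# cluster the queries in the topic space instead of the raw bag of words
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Q)
print(labels)
```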
If this isn't what you want, can you elaborate on the problem? What do you hope to do with the clustered queries?