Query-document similarity with doc2vec - machine-learning

Given a query and a document, I would like to compute a similarity score using Gensim doc2vec.
Each document consists of multiple fields (e.g., main title, author, publisher, etc)
For training, is it better to concatenate the document fields and treat each row as a unique document or should I split the fields and use them as different training examples?
For inference, should I treat a query like a document? Meaning, should I call the model (trained over the documents) on the query?

The right answer will depend on your data & user behavior, so you'll want to try several variants.
Just to get some initial results, I'd suggest combining all fields into a single 'document', for each potential query-result, and using the (fast-to-train) PV-DBOW mode (dm=0). That will let you start seeing results, doing either some informal assessment or beginning to compile some automatic assessment data (like lists of probe queries & docs that they "should" rank highly).
You could then try testing the idea of making the fields separate docs – either instead-of, or in addition-to, the single-doc approach.
Another option might be to create specialized word-tokens per field. That is, when 'John' appears in the title, you'd actually preprocess it to be 'title:John', and when in author, 'author:John', etc. (This might be in lieu of, or in addition to, the naked original token.) That could enhance the model to also understand the shifting senses of each token, depending on the field.
Then, providing you have enough training data, & choose other model parameters well, your search interface might also preprocess queries similarly, when the user indicates a certain field, and get improved results. (Or maybe not: it's just an idea to be tried.)
In all cases, if you need precise results – exact matches of well-specified user queries – more traditional searches like exact DB matches/greps, or full-text reverse-indexes, will outperform Doc2Vec. But when queries are more approximate, and results need filling-out with near-in-meaning-even-if-not-in-literal-tokens results, a fuzzier vector document representation may be helpful.

Related

Using ontology to infer labels for process model

I'm trying to implement a specific type of process mining, that has been presented in this thesis [link]. It is based on HMMs and generates a process model in form of a directed graph, where:
Nodes are called intentions and correspond to hidden states
Edges are called strategies and consist of different activities
These activities correspond to the HMM's observable emissions
Intentions can be fulfilled using different strategies
A user event log consisting of user IDs, timestamps and activities is used as input. The image below is an example of such a process model. The highlighted nodes and edges resemble the path that has been predicted using the Viterbi algorithm.
You can see that the graph's nodes and edges only carry numeric labels, which allow to distinguish between the different strategies and intentions. In order to make these labels more meaningful to the human reader, I'd like to infer some suitable labels.
My idea is to use an ontology to obtain those labels. After some research I figured out that I probably needed to do something that is generally referred to as "ontology learning". For this I would need to create some axioms in RDF/OWL format and then use these as input for a reasoner, that would infer an ontology.
Is this approach correct and reasonable to achieve my goal?
If this is the way to go, I will need some tool to generate axioms in an automated way. So far I couldn't find any tool that would do that completely out-of-the-box. Based on what I've seen so far I conclude that I would need to define some kind of mapping between the original data and the desired axioms. I took a closer look at protégé, which offers a plugin for spreadsheets. It seems to be based on the MappingMasterDSL project [link].
I've also found an interesting paper [link] on ontology learning where an RNN-based model is trained in a end-to-end fashion to translate definitory sentences into OWL formulae. BUT: My user event log data does not contain any natural sentences. Its activities are defined by tokens derived from HTML elements of the user interface. Therefore the RNN-based approach does not seem to be applicable here. (For the interested reader, the related project can be found here [link])
Isn't there really any easier way than hand-crafting the axioms' schema(ta)?
Assuming that I have created my axioms and inferred an ontology, I would like to use the strategies' (edges') observable activities (emissions) to infer a suitable label. I guess I would need to query my ontology somehow. I could use the activity names as parameters for my query and look for some related entities that reveal the desired label. I'm expecting something like:
"I have a strategy with ID=3, that strategy can be executed with
actions a, b and c, give me all entities of the ontology, that
have these actions as property value and show and give me all related
labels for those entities"
But where would the data for the labels actually come from?
I think I'm missing some important step during the process of ontology learning. Where do I find an additional data source for the labels and how do I relate this data to my ontology's entities?
Also I'm wondering if there is a way to incorporate the inherent knowledge of the process model's topology into my ontology.

How to pre process a class data (with a large number of unique values) before feeding it to machine learning model?

Let's say I have a large data from an online gaming platform (like steam) which has 'date, user_id, number_of_hours_played, no_of_games' and I have to write a model to predict how many hours a user will play in future for a given date. Now, user_id has a large number of unique values (in millions). I know for class data we can use one hot encoding, but not sure what to do when I have millions of unique classes. Also, suggest if we can use any other method to preprocess the data.
Using directly the user id in the model is not a good idea, since that would result like you said into a large number of features, but also in overfitting since you would get one id per line (If I understood correctly your data). It would also make your model useless in case of a new user id and you would have to retrain your model each time you have a new user.
What I would recommand in the first place is to drop this variable and try to build a model with only the other variables.
Another Idea that you could try is to perform a clustering on the users you have based on other features, and then pass the cluster as a feature instead of the user id, but I don't know if this is a good idea since I don't know the kind of data you have.
Also, you are talking about making a prediction on a given date. The data you described doesn't suggest that but if you have the number of hours per multiple dates, this is closer to a time series prediction problem, which is different from a 'classic' regression problem.

Searching for list of terms using Google in order to build a bag-of-words for a particular category

I am having a hard time understanding the process of building a bag-of-words. This will be a multiclass classfication supervised machine learning problem wherein a webpage or a piece of text is assigned to one category from multiple pre-defined categories. Now the method that I am familiar with when building a bag of words for a specific category (for example, 'Math') is to collect a lot of webpages that are related to Math. From there, I would perform some data processing (such as remove stop words and performing TF-IDF) to obtain the bag-of-words for the category 'Math'.
Question: Another method that I am thinking of is to instead search in google for something like 'List of terms related to Math' to build my bag-of-words. I would like to ask if this is method is okay?
Another question: In the context of this question, does bag-of-words and corpus mean the same thing?
Thank you in advance!
This is not what bag of words is. Bag of words is the term to describe a specific way of representing a given document. Namely, a document (paragraph, sentence, webpage) is represented as a mapping of form
word: how many times this word is present in a document
for example "John likes cats and likes dogs" would be represented as: {john: 1, likes: 2, cats: 1, and: 1, dogs: 1}. This kind of representation can be easily fed into typical ML methods (especially if one assumes that total vocabulary is finite so we end up with numeric vectors).
Note, that this is not about "creating a bag of words for a category". Category, in typical supervised learning would consist of multiple documents, and each of them independently is represented as a bag of words.
In particular this invalidates your final proposal of asking google for words that are related to category - this is not how typical ML methods work. You get a lot of documents, represent them as bag of words (or something else) and then perform statistical analysis (build a model) to figure out the best set of rules to discriminate between categories. These rules usually will not be simply "if the word X is present, this is related to Y".

Detect common features in multidimensional data

I am designing a system for anomaly detection.
There are multiple approaches for building such system. I choose to implement one facet of such system by detection of features shared by the majority of samples. I acknowledge the possible insufficiencies of such method but for my specific use-case: (1) It suffices to know that a new sample contains (or lacks) features shared by the majority of past data to make a quick decision.(2) I'm interested in the insights such method will offer to the data.
So, here is the problem:
Consider a large data set with M data points, where each data point may include any number of {key:value} features. I choose to model a training dataset by grouping all the features observed in the data (the set of all unique keys) and setting it as the model's feature space. I define each sample by setting its values for existing keys and None for values in features it does not include.
Given this training data set I want to determine which features reoccur in the data; and for such reoccurring features, do they mostly share a single value.
My question:
A simple solution would be to count everything - for each of the N features calculate the distribution of values. However as M and N are potentially large, I wonder if there is a more compact way to represent the data or more sophisticated method to make claims about features' frequencies.
Am I reinventing an existing wheel? If there's an online approach for accomplishing such task it would be even better.
If I understand correctly your question,
you need to go over all the data anyway, so why not using hash?
Actually two hash tables:
Inner hash table for the distribution of feature values.
Outer hash table for feature existence.
In this way, the size of the inner hash table will indicate how is the feature common in your data, and the actual values will indicate how they differ one another. Another thing to notice is that you go over your data only once, and the time complexity for every operation (almost) on hash tables (if you allocate enough space from the beginning) is O(1).
Hope it helps

Which machine learning model should be used in this situation?

Recently I'm working on my course project, it's an android app that can automatically help fill consuming form based on the user's voice. So here is one sample sentence:
So what I want to do is let the app fill forms automatically, my forms have several fields: time(yesterday), location(MacDonald), cost(10 dollars), type(food). Here the "type" field will include food, shopping, transport, etc.
I have used the word-splitting library to split the sentence into several parts and parse it, so I can already extract the time, location and cost fields from the user's voice.
What I want to do is deduce the "type" field with some kind of machine learning model. So there should be some records in advance, input by user manually to train the model. After training, when new record comes in, I first extract the time, location and cost fields, and then calculate the type field based on the model.
But I don't know how to represent the location field, should I use a dictionary to include many famous locations and use index to represent the location? If so, which kind of machine learning method should I use to model this requirement?
I would start with the Naive Bayes classifier. The links below should be useful in understanding it:
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
http://cs229.stanford.edu/notes/cs229-notes2.pdf
http://scikit-learn.org/stable/modules/naive_bayes.html
I wonder if time and cost are that discriminative/informative in comparison to location for your task.
In general, look at the following link on working with text data (it should be useful even if you dont know python):
http://scikit-learn.org/dev/tutorial/text_analytics/working_with_text_data.html
It should include three stages:
Feature Representation:
One way to represent the features is the Bag-of-Word representation, which you fix an order of the dictionary and use a word frequency vector to represent the documents. See https://en.wikipedia.org/wiki/Bag-of-words_model for details.
Data and Label Collection:
Basically, in this stage, you should prepare some [feature]-[type] pairs to training your model, which can be tedious or expensive. If you had already published your app, and collected a lot of [sentence]-[type] pair (probably chosen by app user), you can extract the features and build a training set.
Model Learning:
Cdeepakroy has suggested a good choice of the model: Naive Bayes, which is very efficient for classification task like this. At this stage, you can just find a suitable package, insert your training data, and enjoy the classifier it returns.

Resources