I have a similarity problem here. I want to predict the traffic of a new rule using historical data (the traffic of rules implemented in the past). Traffic here means the number of times a rule matched a Person. Here is an example of a rule:
Person.Age<20 and
(Person.number_of_children==3 or Person.married==True) and
Person.Work==student and
Person.Car.isSportCar==False and
Person.Car.Color in [blue,pink,red]
As you can see, in a rule there are a lot of attributes linked with Boolean expressions. The rule matches a person if he and his car satisfy some criteria. To predict the traffic of a rule I have to find a distance or a similarity metric between my rules but I find it difficult to flatten the rules in a column expression. If I do it I’ll lose information and here is why:
An example of column presentation of my rule :
Person.Age : 20
Person.number_of_children:3
Person.married:True
Person.work:student
Person.Car.isSportCar:False
Person.Car.Color:[blue,pink,red]
With this I lose the ‘OR’ and ‘<’ and ‘in’
Is flattening my rules expression a good idea or is there another one? Should I convert my rules to another data structure (A tree data structure for example) to better catch the similarity value between them? Do you have some suggestions?
In a case like this I would convert the rule specifications to sets, so that flattening them makes sense, and then compute a Jaccard similarity. Jaccard similarity is the size of the intersection divided by the size of the union of the two sets; the Jaccard distance is one minus that value. Finally, weight the different attributes (or skip the weighting and use a single set for everything).
For example, given:
Person.Age<20 and (Person.number_of_children==3 or
Person.married==True) and Person.Work==student and
Person.Car.isSportCar==False and Person.Car.Color in [blue,pink,red]
and:
Person.Age<15 and (Person.number_of_children==2 or
Person.married==False) and Person.Work==student and
Person.Car.isSportCar==False and Person.Car.Color in [pink,red,white]
Convert them to something like this. For the first rule:
Person.Age (5,5,5,5)
Person.Relatives (Child,Child,Child,Wife)
Person.CarColor (blue,pink,red)
For the second rule:
Person.Age (5,5,5)
Person.Relatives (Child,Child)
Person.CarColor (pink,red,white)
And then your Jaccard similarity per attribute will be something like:
Person.Age = 3/4
Person.Relatives = 2/4
Person.CarColor = 2/4
And aggregate them (weighted if necessary).
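The per-attribute scores above can be reproduced with a short Python sketch. It assumes the collections are treated as multisets (so the repeated 5 and Child entries count), which is one reading of the example; `multiset_jaccard` and `rule_similarity` are hypothetical helper names, and the equal weights are only a placeholder.

```python
from collections import Counter

def multiset_jaccard(a, b):
    """Jaccard similarity for multisets: sum of min counts over sum of max counts."""
    ca, cb = Counter(a), Counter(b)
    intersection = sum((ca & cb).values())  # element-wise minimum of counts
    union = sum((ca | cb).values())         # element-wise maximum of counts
    return intersection / union if union else 1.0

def rule_similarity(rule_1, rule_2, weights=None):
    """Weighted average of the per-attribute similarities (equal weights by default)."""
    attributes = sorted(set(rule_1) | set(rule_2))
    weights = weights or {name: 1.0 for name in attributes}
    total = sum(weights[name] for name in attributes)
    return sum(
        weights[name] * multiset_jaccard(rule_1.get(name, ()), rule_2.get(name, ()))
        for name in attributes
    ) / total

rule_1 = {
    "Person.Age": [5, 5, 5, 5],
    "Person.Relatives": ["Child", "Child", "Child", "Wife"],
    "Person.CarColor": ["blue", "pink", "red"],
}
rule_2 = {
    "Person.Age": [5, 5, 5],
    "Person.Relatives": ["Child", "Child"],
    "Person.CarColor": ["pink", "red", "white"],
}

for name in rule_1:
    print(name, multiset_jaccard(rule_1[name], rule_2[name]))
print("aggregate:", rule_similarity(rule_1, rule_2))
```

Subtracting each score from 1 gives the corresponding Jaccard distance if a distance is what the downstream model expects.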
Let me suggest another approach:
Base the similarity score on the percentage of people for whom the two rules give the same result. Of course you'll need a large and heterogeneous population.
If both rules have a similar result for most of the population (e.g. "false") - you may base the score only on test cases where at least one of the rules' result is "true".
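A minimal sketch of this population-based score follows. The two rules are simplified stand-ins written as Python predicates, the four-person population is made up purely for illustration, and `agreement_similarity` is a hypothetical name; a real run would need the large, heterogeneous population mentioned above.

```python
def rule_a(person):
    return person["age"] < 20 and person["work"] == "student"

def rule_b(person):
    return person["age"] < 15 and person["work"] == "student"

def agreement_similarity(rule_1, rule_2, population, ignore_double_false=True):
    """Share of people on whom both rules give the same result.

    With ignore_double_false=True, people matched by neither rule are skipped,
    so the score is based only on cases where at least one rule is true.
    """
    agree = considered = 0
    for person in population:
        r1, r2 = rule_1(person), rule_2(person)
        if ignore_double_false and not (r1 or r2):
            continue
        considered += 1
        agree += r1 == r2
    return agree / considered if considered else 1.0

# Toy population, for illustration only.
population = [
    {"age": 12, "work": "student"},
    {"age": 17, "work": "student"},
    {"age": 18, "work": "employee"},
    {"age": 30, "work": "student"},
]

print(agreement_similarity(rule_a, rule_b, population))
print(agreement_similarity(rule_a, rule_b, population, ignore_double_false=False))
```

When double-false cases are ignored, the score equals the Jaccard similarity of the two sets of matched people.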
I am doing a college project where I need to compare a string with a list of other strings. I want to know whether there is a library that can do this.
Suppose I have a table called : DOCTORS_DETAILS
Other Table names are : HOSPITAL_DEPARTMENTS , DOCTOR_APPOINTMENTS, PATIENT_DETAILS,PAYMENTS etc.
Now I want to calculate which of those is most relevant to DOCTOR_DETAILS.
The expected output would be:
DOCTOR_APPOINTMENTS - More relevant because the term DOCTOR appears in both strings
PATIENT_DETAILS - The term DETAILS is present in both strings
HOSPITAL_DEPARTMENTS - least relevant
PAYMENTS - least relevant
Therefore I want to compute relevance based on the number of terms shared by the two strings in question.
Ex : DOCTOR_DETAILS -> DOCTOR_APPOINTMENT (1/2) > DOCTOR_ADDRESS_INFORMATION (1/3) > DOCTOR_SPECIALIZATION_DEGREE_INFORMATION (1/4) > PATIENT_INFO (0/2)
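The scoring in the example above (shared terms divided by the number of terms in the candidate name) can be sketched without any library. This is only one reading of the example; `term_overlap` and `rank_tables` are hypothetical names, and the split on underscores is an assumption about how the table names are built.

```python
def term_overlap(main, candidate):
    """Fraction of the candidate's underscore-separated terms that also occur in main."""
    main_terms = set(main.upper().split("_"))
    candidate_terms = candidate.upper().split("_")
    shared = sum(1 for term in candidate_terms if term in main_terms)
    return shared / len(candidate_terms)

def rank_tables(main, candidates):
    """Candidates ordered from most to least relevant to main."""
    return sorted(candidates, key=lambda name: term_overlap(main, name), reverse=True)

candidates = [
    "PATIENT_INFO",
    "DOCTOR_SPECIALIZATION_DEGREE_INFORMATION",
    "DOCTOR_APPOINTMENT",
    "DOCTOR_ADDRESS_INFORMATION",
]
print(rank_tables("DOCTOR_DETAILS", candidates))
```

Exact term matching treats DOCTOR and DOCTORS as different terms, so stemming or subword-based vectors would be needed to catch such near matches.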
Semantic similarity is a common NLP problem. There are multiple approaches to look into, but at their core they all are going to boil down to:
1) Turn each piece of text into a vector
2) Measure the distance between vectors, and call closer vectors more similar
Three possible ways to do step 1 are:
tf-idf
fasttext
bert-as-service
To do step 2, you almost certainly want to use cosine similarity (a higher value means the vectors are closer). It is pretty straightforward in Python; here is an implementation from a blog post:
import numpy as np

def cos_sim(a, b):
    """Takes 2 vectors a, b and returns the cosine similarity according
    to the definition of the dot product
    """
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)
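To make the two steps concrete without downloading a model, here is a dependency-free sketch. It swaps in character-trigram counts as the vectors, which is a crude stand-in for tf-idf or fasttext rather than any of the three options listed; `char_ngrams`, `cosine` and `rank_by_similarity` are hypothetical names.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Step 1: turn a string into a sparse vector of character n-gram counts."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Step 2: cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_by_similarity(main, candidates):
    """Candidates ordered from most to least similar to main."""
    main_vec = char_ngrams(main)
    return sorted(candidates, key=lambda c: cosine(main_vec, char_ngrams(c)), reverse=True)

tables = ["HOSPITAL_DEPARTMENTS", "DOCTOR_APPOINTMENTS", "PATIENT_DETAILS", "PAYMENTS"]
print(rank_by_similarity("DOCTORS_DETAILS", tables))
```

This only captures surface overlap between the strings; it knows nothing about meaning, which is where learned word vectors come in.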
For your particular use case, my instincts say to use fasttext. The official site shows how to download pretrained word vectors, but you will want to download a full pretrained model instead (see this GH issue; use https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.bin.zip).
You'd then want to do something like:
import fasttext

model = fasttext.load_model("model_filename.bin")

def order_tables_by_name_similarity(main_table, candidate_tables):
    '''Note: we use a fasttext model, not just pretrained vectors, so we get subword information
    you can modify this to also output the distances if you need them
    '''
    main_v = model[main_table]
    similarity_to_main = lambda w: cos_sim(main_v, model[w])
    return sorted(candidate_tables, key=similarity_to_main, reverse=True)

order_tables_by_name_similarity("DOCTORS_DETAILS", ["HOSPITAL_DEPARTMENTS", "DOCTOR_APPOINTMENTS", "PATIENT_DETAILS", "PAYMENTS"])
# outputs: ['PATIENT_DETAILS', 'DOCTOR_APPOINTMENTS', 'HOSPITAL_DEPARTMENTS', 'PAYMENTS']
If you need to put this in production, the giant model size (6.7GB) might be an issue. At that point, you'd want to build your own model, and constrain the model size. You can probably get roughly the same accuracy out of a 6MB model!
I am currently implementing an algorithm from a paper that I will be using in my master's work, but I've run into problems with the time some operations take.
Before I get into details, I just want to add that my data set comprises roughly 4 million data points.
I have two lists of tuples that I get from a framework (annoy) that calculates the cosine similarity between a vector and every other vector in the dataset. The final format is like this:
[(name1, cosine), (name2, cosine), ...]
Because of the algorithm, I have two such lists with the same names (first value of the tuple) in them, but two different cosine similarities. What I have to do is sum the cosines from both lists, then sort the result and get the top-N highest cosine values.
My issue is that it is taking too long. My current code for this is as follows:
from collections import defaultdict

def topN(self, user, session):
    upref = self.m2vTN.get_user_preference(user)
    spref = self.sm2vTN.get_user_preference(session)
    # list of tuples 1
    most_su = self.indexer.most_similar(upref, len(self.m2v.wv.vocab))
    # list of tuples 2
    most_ss = self.indexer.most_similar(spref, len(self.m2v.wv.vocab))
    # concat both lists and add into a dict
    d = defaultdict(int)
    for l, v in (most_ss + most_su):
        d[l] += v
    # convert the dict into a list, and then sort it
    _list = list(d.items())
    _list.sort(key=lambda x: x[1], reverse=True)
    return [x[0] for x in _list[:self.N]]
How do I make this code faster? I've tried using threads, but I'm not sure whether that will make it faster. Getting the lists is not the problem here; the concatenation and sorting are.
Thanks! English is not my native language, so sorry for any misspelling.
What do you mean by "too long"? How large are the two lists? Is there a chance your model, and interim results, are larger than RAM and thus forcing virtual-memory paging (which would create frustrating slowness)?
If you are in fact getting the cosine-similarity with all vectors in the model, the annoy-indexer isn't helping any. (Its purpose is to get a small subset of nearest-neighbors much faster, at the expense of perfect accuracy. But if you're calculating the similarity to every candidate, there's no speedup or advantage to using ANNOY.)
Further, if you're going to combine all of the distances from two such calculations, there's no need for the sorting that most_similar() usually does - it just makes combining the values more complex later. For the gensim vector-models, you can supply a False-ish topn value to just get the unsorted distances to all model vectors, in order. Then you'd have two large arrays of the distances, in the model's same native order, which are easy to add together elementwise. For example:
udists = self.m2v.most_similar(positive=[upref], topn=False)
sdists = self.m2v.most_similar(positive=[spref], topn=False)
combined_dists = udists + sdists
The combined_dists aren't labeled, but will be in the same order as self.m2v.index2entity. You could then sort them, in a manner similar to what the most_similar() method itself does, to find the ranked closest. See for example the gensim source code for that part of most_similar():
https://github.com/RaRe-Technologies/gensim/blob/9819ce828b9ed7952f5d96cbb12fd06bbf5de3a3/gensim/models/keyedvectors.py#L557
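As a library-independent illustration of that last step, the sketch below adds two score lists that share the same order and picks the top N without sorting everything. It uses plain Python lists and the standard-library heapq module rather than the gensim internals linked above; `top_n_combined` and the toy data are hypothetical.

```python
import heapq

def top_n_combined(names, scores_a, scores_b, n):
    """Add two aligned score lists elementwise and return the n best (name, score) pairs."""
    combined = [a + b for a, b in zip(scores_a, scores_b)]
    # nlargest keeps only n candidates at a time, avoiding a full sort of all entries.
    best = heapq.nlargest(n, range(len(combined)), key=combined.__getitem__)
    return [(names[i], combined[i]) for i in best]

names = ["a", "b", "c", "d"]
user_scores = [0.1, 0.9, 0.5, 0.3]
session_scores = [0.2, 0.1, 0.6, 0.9]
print(top_n_combined(names, user_scores, session_scores, 2))
```

With NumPy arrays, the same idea is usually expressed with elementwise addition followed by a partial sort, which avoids the Python-level loop over millions of entries.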
Finally, you might not need to be doing this calculation yourself at all. You can provide more-than-one vector to most_similar() as the positive target, and then it will return the vectors closest to the average of both vectors. For example:
sims = self.m2v.most_similar(positive=[upref, spref], topn=len(self.m2v))
This won't be the same value/ranking as your other sum, but may behave very similarly. (If you wanted less-than-all of the similarities, then it might make sense to use the ANNOY indexer this way, as well.)
The search algorithm is a breadth-first search. I'm not sure how to store the terms of an equation in an open list. The function f(x) has the form ax^e1 + bx^e2 + cx^e3 + k, where a, b and c are coefficients and k is a constant. All exponents, coefficients, and constants are integers between 0 and 5.
Initial state: the problem-solving process should start from any one of the terms ax^e1, bx^e2, cx^e3, k.
The algorithm gradually expands the number of terms in each level of the list.
My question is how to add the terms to an equation from an open queue.
The general problem that you are dealing with belongs to the area of regression analysis, and several techniques are available to find a function that fits a given data set, including the popular least squares methods for finding the line of best fit (a brief starting point is the related page on Wikipedia, but if you want to go deeper into this topic, you should look at the research papers out there).
If you want to stick with the breadth-first search algorithm, although this kind of approach is not common for such a problem, you first need to define all the elements of a search problem, namely (for more information, see Chapter 3 of the book by Russell and Norvig, Artificial Intelligence: A Modern Approach):
Initial state: Some initial values for the different terms.
Actions: in your case it should be a change in the different terms. Note that you should discretize the changes in the values.
Transition function: function that determines the new states given a state and an action.
Goal test: a check to recognize whether a state is a goal state or not, and so to terminate the search. There are different ways to define this test in a regression problem. One way is to set a threshold for the sum of squared errors.
Step cost: The cost for an action. In such an abstract problem, probably you can consider the unweighted distance from the initial state on the search graph.
Note that you should carefully think about these elements, as, for example, they determine how efficient your search would be or whether you will have cycles in the search graph.
After you have defined all of the elements of the search problem, you basically have to implement:
Node, that contains information about the parent, the state, and the current cost;
Function to expand a given node that returns the successor nodes (according to the transition function, the actions, and the step cost);
Goal test;
The actual search algorithm. At the beginning, the queue holds the node containing the initial state (or one node per possible initial state). Afterwards, each time a node is taken from the front of the queue, its successor nodes are appended to the back.
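The pieces listed above can be put together in a small sketch. It makes several simplifying assumptions that are not in the question: a state is a tuple of (coefficient, exponent) terms, the constant k is represented as a term with exponent 0, zero coefficients are skipped because they contribute nothing, the action is "add one more term", and the goal test is a threshold on the sum of squared errors. The names `bfs_fit`, `evaluate` and `sse` are hypothetical.

```python
from collections import deque
from itertools import product

def evaluate(terms, x):
    """Value of the polynomial described by a tuple of (coefficient, exponent) terms."""
    return sum(coeff * x ** exp for coeff, exp in terms)

def sse(terms, data):
    """Sum of squared errors of the polynomial on (x, y) data points."""
    return sum((evaluate(terms, x) - y) ** 2 for x, y in data)

def bfs_fit(data, max_terms=4, threshold=0):
    # Every term that may appear: coefficient 1..5, exponent 0..5.
    all_terms = list(product(range(1, 6), range(0, 6)))
    # Initial states: every single-term equation goes into the open queue.
    queue = deque((term,) for term in all_terms)
    seen = set(queue)
    while queue:
        state = queue.popleft()          # FIFO order gives breadth-first search
        if sse(state, data) <= threshold:  # goal test
            return state
        if len(state) < max_terms:
            for term in all_terms:       # action: add one more term
                child = tuple(sorted(state + (term,)))
                if child not in seen:    # avoid revisiting the same set of terms
                    seen.add(child)
                    queue.append(child)
    return None

# Data generated from f(x) = 2x^2 + 3, for illustration.
data = [(x, 2 * x ** 2 + 3) for x in range(-3, 4)]
print(bfs_fit(data))
```

Because the queue is processed level by level, the first state that passes the goal test uses the smallest number of terms under these assumptions.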
I am trying to implement relation extraction between verb pairs. I want to use dependency path from one verb to the other as a feature for my classifier (predicts if relation X exists or not). But I am not sure how to encode the dependency path as a feature. Following are some example dependency paths, as space separated relation annotations from StanfordCoreNLP Collapsed Dependencies:
nsubj acl nmod:from acl nmod:by conj:and
nsubj nmod:into
nsubj acl:relcl advmod nmod:of
It is important to keep in mind that these paths are of variable length and that a relation can reappear without any restriction.
Two compromise ways of encoding this feature that come to mind are:
1) Ignore the sequence, and just have one feature for each relation with its value being the number of times it appears in the path
2) Have a sliding window of length n, and have one feature for each possible pair of relations with the value being the number of times those two relations appeared consecutively. I suppose this is how one encodes n-grams. However, the number of possible relations is 50, which means I cannot really go with this approach.
Any suggestions are welcome.
We had a project that built a classifier based off of dependency paths. I asked the group member who developed the system, and he said:
indicator feature for the whole path
So if you have the training data point (verb1 -e1-> w1 -e2-> w2 -e3-> w3 -e4-> verb2, relation1) the feature would be (e1-e2-e3-e4)
And he also did ngram sequences, so for that same data point, you would also have (e1), (e2), (e3), (e4), (e1-e2), (e2-e3), (e3-e4), (e1-e2-e3), (e2-e3-e4)
He also recommended collapsing appositive edges to make the paths smaller.
Also, I should note that he developed a set of high precision rules for each relation, and used this to create a large set of training data.
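The whole-path indicator and the n-gram features described above can be generated along these lines. This is only a sketch of the idea, not the group member's actual system; `path_features`, the feature-name prefixes and the cap of three relations per n-gram are assumptions.

```python
from collections import Counter

def path_features(path, max_n=3):
    """Sparse features for a space-separated dependency path.

    Produces one indicator for the whole path plus counts of every
    contiguous relation n-gram up to max_n relations long.
    """
    relations = path.split()
    features = Counter()
    features["path=" + "-".join(relations)] = 1
    for n in range(1, max_n + 1):
        for i in range(len(relations) - n + 1):
            features["ngram=" + "-".join(relations[i:i + n])] += 1
    return features

print(path_features("nsubj acl:relcl advmod nmod:of"))
```

Because only n-grams that actually occur in the training paths become features, the feature space stays far smaller than the full set of possible relation pairs or triples. The resulting dictionaries can be fed to any classifier that accepts sparse feature maps.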
Just want to clarify one thing: the same attribute can appear in a decision tree many times, as long as the occurrences are in different "branches", right?
For obvious reasons, it does not make sense to use the same decision within the same branch.
On different branches, this reasoning obviously does not hold.
Consider the classic XOR(x,y) problem. You can solve it with a two layer decision tree, but you will need to split on the same attribute in both branches.
If x is true:
    If y is true: return false
    If y is false: return true
If x is false:
    If y is true: return true
    If y is false: return false
Another example is the following: assume your data is positive for x in [0, 1], and negative outside. A good tree would be the following:
If x > 1: return negative
If x <= 1:
    If x >= 0: return positive
    If x < 0: return negative
It's not the same decision, so it can make sense to use x twice.
In general, you can do whatever you want, as long as you keep the structure of a "tree". Trees can be customized in many ways, and while there can be redundancy, it doesn't undermine the tree's validity.
Binary attributes shouldn't appear twice in the same branch; that would be redundant. However, continuous attributes can appear in the same branch several times.
If the attribute is categorical, it cannot be used as the split attribute more than once. If the attribute is numerical, in principle it can be used many times, but the standard decision tree algorithm (the C4.5 algorithm) is not implemented that way.
The following description is based on the assumption that the attributes are all categorical.
From the explanation perspective, a decision tree is explainable: how an instance is labeled can be explained by the attributes (and their values) used on the path from the root to the leaf. Therefore, it does not make sense to have duplicate attributes in one branch of the tree.
From the algorithm perspective, once an attribute is selected as the split attribute, it has no chance of being selected again under the attribute selection criteria, e.g. its information gain would be 0. This is because all the instances have the same value for that attribute once they have been filtered by it. Using the attribute again cannot bring more information for classification.
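The zero-gain argument can be checked on a toy example. The sketch below uses a made-up four-row dataset where the label is "x and y"; `entropy` and `information_gain` are hypothetical helpers implementing the usual definitions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, label="label"):
    """Entropy before the split minus the weighted entropy after splitting on attribute."""
    base = entropy([row[label] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[label] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Toy data: label is True only when both x and y are True.
rows = [
    {"x": True, "y": True, "label": True},
    {"x": True, "y": False, "label": False},
    {"x": False, "y": True, "label": False},
    {"x": False, "y": False, "label": False},
]

# Branch reached after the root has already split on x == True.
branch = [row for row in rows if row["x"]]

print(information_gain(rows, "x"))    # positive at the root
print(information_gain(branch, "x"))  # no gain left inside the branch
print(information_gain(branch, "y"))  # the other attribute is still informative
```

Inside the branch every row has the same value of x, so splitting on x again leaves the subset unchanged and the gain is zero, while y still separates the classes.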