I have a text corpus of many sentences, with some named entities marked within it.
For example, the sentence:
what is the best restaurant in wichita texas?
which is tagged as:
what is the best restaurant in <location>?
I want to expand this corpus by taking or sampling all the sentences already in it and replacing the named entities with other similar entities of the same types, e.g. replacing "wichita texas" with "new york", so the corpus will be bigger (more sentences) and more complete (more entities covered). I have lists of similar entities, including ones which don't appear in the corpus but which I would like to have some probability of inserting in my replacements.
Can you recommend a method or direct me to a paper regarding this?
For your specific question:
This type of work, assuming you have an organized list of named entities (like a separate list for 'places', 'people', etc), generally consists of manually removing potentially ambiguous names (for example, 'jersey' could be removed from your places list to avoid instances where it refers to the garment). Once you're confident you removed the most ambiguous names, simply select an appropriate tag for each group of terms ("location" or "person", for instance). In each sentence containing one of these words, replace the word with the tag. Then you can perform some basic expansion with the programming language of your choice so that each sentence containing 'location' is repeated with every location name, each sentence containing 'person' is repeated with every person name, etc.
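A minimal sketch of that expansion step in Python (the tag name, entity lists, sampling probability, and number of copies are placeholders to adapt to your own corpus; it also gives entities from your external lists, which never appear in the corpus, a chance of being inserted):

import random

# Tagged sentences from the corpus; the tag names are just examples.
templates = ["what is the best restaurant in <location>?"]

# Entities seen in the corpus, plus extra ones from external lists.
seen_entities = {"location": ["wichita texas", "new york"]}
extra_entities = {"location": ["springfield illinois", "boise idaho"]}

P_EXTRA = 0.3  # assumed probability of sampling an entity that never appears in the corpus

def sample_entity(tag):
    # Mix in-corpus and out-of-corpus entities with a fixed probability.
    if extra_entities.get(tag) and random.random() < P_EXTRA:
        return random.choice(extra_entities[tag])
    return random.choice(seen_entities[tag])

def expand(template, n=5):
    # Produce n surface sentences by filling each tag with a sampled entity.
    sentences = []
    for _ in range(n):
        sentence = template
        for tag in seen_entities:
            sentence = sentence.replace(f"<{tag}>", sample_entity(tag))
        sentences.append(sentence)
    return sentences

for t in templates:
    print(expand(t, n=3))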
For a general overview of clustering using word-classes, check out the seminal Brown et al. paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.9919&rep=rep1&type=pdf
Related
Given a query and a document, I would like to compute a similarity score using Gensim doc2vec.
Each document consists of multiple fields (e.g., main title, author, publisher, etc)
For training, is it better to concatenate the document fields and treat each row as a unique document or should I split the fields and use them as different training examples?
For inference, should I treat a query like a document? Meaning, should I call the model (trained over the documents) on the query?
The right answer will depend on your data & user behavior, so you'll want to try several variants.
Just to get some initial results, I'd suggest combining all fields into a single 'document', for each potential query-result, and using the (fast-to-train) PV-DBOW mode (dm=0). That will let you start seeing results, doing either some informal assessment or beginning to compile some automatic assessment data (like lists of probe queries & docs that they "should" rank highly).
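For instance, a rough PV-DBOW setup in gensim could look like the sketch below (the field names, toy records, and parameter values are assumptions for illustration, not recommendations):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each record's fields are concatenated into one 'document'; doc_id tags it.
records = [
    {"doc_id": "doc1", "title": "graph databases", "author": "john doe", "publisher": "acme"},
    {"doc_id": "doc2", "title": "vector search", "author": "jane roe", "publisher": "acme"},
]

def to_tagged_doc(rec):
    # Naive whitespace tokenization; swap in your real preprocessing.
    words = f"{rec['title']} {rec['author']} {rec['publisher']}".lower().split()
    return TaggedDocument(words=words, tags=[rec["doc_id"]])

corpus = [to_tagged_doc(r) for r in records]

# dm=0 selects the fast-to-train PV-DBOW mode.
model = Doc2Vec(corpus, dm=0, vector_size=100, min_count=1, epochs=40)

# Treat the query like a mini-document: infer a vector, then rank the stored docs.
query_vec = model.infer_vector("who published graph databases".lower().split())
print(model.dv.most_similar([query_vec], topn=2))  # .dv is the gensim 4.x attribute name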
You could then try testing the idea of making the fields separate docs – either instead-of, or in addition-to, the single-doc approach.
Another option might be to create specialized word-tokens per field. That is, when 'John' appears in the title, you'd actually preprocess it to be 'title:John', and when in author, 'author:John', etc. (This might be in lieu of, or in addition to, the naked original token.) That could enhance the model to also understand the shifting senses of each token, depending on the field.
Then, providing you have enough training data, & choose other model parameters well, your search interface might also preprocess queries similarly, when the user indicates a certain field, and get improved results. (Or maybe not: it's just an idea to be tried.)
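A tiny sketch of that field-prefixing idea (whether to also keep the naked token, as noted above, is one of the things to test):

def prefix_tokens(field_name, text, keep_naked=True):
    # 'John' in the title becomes 'title:john'; optionally keep the plain token too.
    tokens = []
    for tok in text.lower().split():
        tokens.append(f"{field_name}:{tok}")
        if keep_naked:
            tokens.append(tok)
    return tokens

doc_words = prefix_tokens("title", "Graph Databases") + prefix_tokens("author", "John Doe")
# A query that the user restricts to the author field gets the same treatment:
query_words = prefix_tokens("author", "John")
print(doc_words, query_words)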
In all cases, if you need precise results – exact matches of well-specified user queries – more traditional searches like exact DB matches/greps, or full-text reverse-indexes, will outperform Doc2Vec. But when queries are more approximate, and results need filling-out with near-in-meaning-even-if-not-in-literal-tokens results, a fuzzier vector document representation may be helpful.
I am having a hard time understanding the process of building a bag-of-words. This will be a multiclass classification supervised machine learning problem wherein a webpage or a piece of text is assigned to one category from multiple pre-defined categories. The method that I am familiar with when building a bag of words for a specific category (for example, 'Math') is to collect a lot of webpages that are related to Math. From there, I would perform some data processing (such as removing stop words and computing TF-IDF) to obtain the bag-of-words for the category 'Math'.
Question: Another method that I am thinking of is to instead search on Google for something like 'List of terms related to Math' to build my bag-of-words. I would like to ask if this method is okay?
Another question: In the context of this question, does bag-of-words and corpus mean the same thing?
Thank you in advance!
This is not what bag of words is. Bag of words is the term for a specific way of representing a given document. Namely, a document (paragraph, sentence, webpage) is represented as a mapping of the form
word: how many times this word is present in a document
for example "John likes cats and likes dogs" would be represented as: {john: 1, likes: 2, cats: 1, and: 1, dogs: 1}. This kind of representation can be easily fed into typical ML methods (especially if one assumes that total vocabulary is finite so we end up with numeric vectors).
Note that this is not about "creating a bag of words for a category". A category, in typical supervised learning, consists of multiple documents, and each of them is independently represented as a bag of words.
In particular, this invalidates your final proposal of asking Google for words that are related to a category; this is not how typical ML methods work. You get a lot of documents, represent them as bags of words (or something else), and then perform statistical analysis (build a model) to figure out the best set of rules to discriminate between categories. These rules usually will not simply be "if word X is present, this is related to Y".
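A bare-bones illustration of that workflow with scikit-learn (the toy documents and labels are made up; any vectorizer/classifier pair would do):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A handful of labelled documents; each is turned into a bag of words internally.
docs = [
    "the integral of a polynomial and the derivative rules",
    "prime numbers and modular arithmetic proofs",
    "the recipe uses flour sugar and butter",
    "roast the vegetables and season with salt",
]
labels = ["math", "math", "cooking", "cooking"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)

print(model.predict(["a proof about prime polynomials"]))  # expected: ['math']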
In the template for training CRF++, how can I include a custom dictionary.txt file for listed companies, another for popular European foods, for example, or just about any category?
Then provide sample training data for each category so that it learns how those specific named entities are used within a context for that category.
In this way, I, as well as the system, can be sure it correctly understood how certain named entities are structured in a text, whether a tweet or a Pulitzer-prize-winning news article, instead of providing hundreds of megabytes of data.
This would be rather cool. The model would have a definite dictionary of known entities (which does not need to be expanded) and a statistical approach to how those known entities are structured in human text.
PS - Just for clarity, I'm not yearning for a regex NER. Those are only cool if you've got lots in the dictionary, lots of rules and lots of spare time.
I think what you are talking about is a gazetteer list (dictionary.txt).
You would have to include a corresponding feature for each word in the training data and then specify it in the template file.
For example: your list contains the entity Hershey's,
and the training data has the sentence: I like Hershey's chocolates.
So when you arrange the data in CoNLL format (for CRF++), you can add a column (with value 1 or 0, indicating whether the word is present in the dictionary); this column will have the value 0 for all words except Hershey's.
You also have to include this column as a feature in the template file.
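A minimal sketch of what that could look like (the column order, the label scheme, and the feature tag below are illustrative; adjust the column index to your own file):

Training data (token, gazetteer flag, label):
I           0   O
like        0   O
Hershey's   1   B-BRAND
chocolates  0   O
.           0   O

Template line referencing the gazetteer column (column index 1) of the current token:
U10:%x[0,1]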
To get a better understanding of the template file and NER training with CRF++, you can watch the videos below and comment your doubts :)
1) https://youtu.be/GJHeTvDkIaE
2) https://youtu.be/Ur5umC4BwN4
EDIT: (after viewing the OP's comment)
Sample Training Data with extra features: https://pastebin.com/fBgu8c67
I've added 3 features. The IsCountry feature value (1 or 0) can be obtained from a gazetteer list of countries. The other 2 features can be computed offline. Note that the headers are added in the file for reference only and should not be included in the training data file.
Sample Template File for the above data : https://pastebin.com/LPvAGCVL
Note that the test data should also be in the same format as the train data, with the same features / same number of columns.
I'm implementing abstractive summarization based on this paper, and I'm having trouble deciding the best way to implement the graph so that it can be used for multi-domain analysis. Let's start with Twitter as an example domain.
For every tweet, each sentence would be graphed like this (ex: "#stackoverflow is a great place for getting help #graphsftw"):
(#stackoverflow)-[next]->(is)
-[next]->(a)
-[next]->(great)
-[next]->(place)
-[next]->(for)
-[next]->(getting)
-[next]->(help)
-[next]->(#graphsftw)
This would yield a graph similar to the one outlined in the paper.
To have a kind of domain layer for each word, I'm adding them to the graph like this (with properties including things like part of speech):
MERGE (w:Word:TwitterWord {orth: "word" }) ON CREATE SET ... ON MATCH SET ...
In the paper, they set a property on each word {SID:PID}, which describes the sentence id of the word (SID) and also the position of each word in the sentence (PID); so in the example sentence "#stackoverflow" would have a property of {1:1}, "is" would be {1:2}, "#graphsftw" {1:9}, etc. Each subsequent reference to the word in another sentence would add an element to the {SID:PID} property array: [{1:x}, {n:n}].
It doesn't seem like having sentence and positional information as an array of elements contained within a property of each node is efficient, especially when dealing with multiple word-domains and sub-domains within each word layer.
For each word layer or domain like Twitter, what I want to do is get an idea of what's happening around specific domain/layer entities like mentions and hashtags; in this example, #stackoverflow and #graphsftw.
What is the best way to add subdomain layers on top of, for example, a 'Twitter' layer, such that different words are directed towards specific domain entities like #hashtags and #mentions? I could use a separate label for each subdomain, like :Word:TwitterWord:Stackoverflow, but that would give my graph a ton of separate labels.
If I include the subdomain entities in a node property array, then it seems like traversal would become an issue.
Since all tweets and extracted entities like #mentions and #hashtags are being graphed as nodes/vertices prior to the word-graph step, I could have edges going from #hashtags and #mentions to words. Or, I could have edges going from tweets to words with the entities as an edge property. Basically, I'm looking for a structure that is the "cheapest" in terms of both storage and traversal.
Any input on how generally to structure this graph would be greatly appreciated. Thanks!
You could also put the domains / positions on the relationships (and perhaps also add a source-id).
OTOH you can also infer that information as long as your relationships represent the original sentence.
You could then either aggregate the relationships dynamically to compute the strengths or have a separate "composite" relationship that aggregates all the others into a counter or sum.
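As a rough sketch of that relationship-property idea, here is how it might look through the official Neo4j Python driver (the connection details, labels, and property names are assumptions, not a recommendation):

from neo4j import GraphDatabase

# Placeholder connection details.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

LINK_WORDS = """
MERGE (a:Word:TwitterWord {orth: $w1})
MERGE (b:Word:TwitterWord {orth: $w2})
CREATE (a)-[:NEXT {sid: $sid, pid: $pid, source: $source}]->(b)
"""

def graph_sentence(tokens, sid, source):
    # One NEXT relationship per adjacent word pair; sentence id, position and
    # source id live on the relationship instead of in a node property array.
    with driver.session() as session:
        for pid, (w1, w2) in enumerate(zip(tokens, tokens[1:]), start=1):
            session.run(LINK_WORDS, w1=w1, w2=w2, sid=sid, pid=pid, source=source)

graph_sentence("#stackoverflow is a great place for getting help #graphsftw".split(),
               sid=1, source="tweet:12345")
driver.close()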
I am trying to implement an approach from a paper to disambiguate an entity. The process consists of two steps, a training phase and a disambiguation phase. I would like to ask about the training phase: I do not quite understand how to get the prototype vectors, as this paragraph explains:
In the training phase, we compute, for each word or phrase that is linked at least 10 times to a particular entity, what we called a prototype vector: this is a tf.idf-weighted, normalized list of all terms which occur in one of the neighbourhoods (we consider 10 words to the left and right) of the respective links. Note that one and the same word or phrase can have several such prototype vectors, one for each entity linked from some occurrence of that word or phrase in the collection.
They have used the approach on Wikipedia and use the links from Wikipedia as the training set.
Could someone give me an example of the prototype vector as explained there, please? I am a beginner in this field.
Here is a sketch of what the prototype vector is about:
The first thing to note is that a word in Wikipedia can be a hyperlink to a Wikipedia page (which we will call an entity). This entity is related in some way to the word, yet the same word could link to different entities.
"for each word or phrase that is linked at least 10 times to a particular entity"
Across Wikipedia, we count the number of times that wordA links to entityB; if it's over 10, we continue (writing down the entities the links occur in):
[(wordA, entityA1), (wordA, entityA2), ...]
Here wordA occurs in entityA1, where it links to entityB, etc.
"list of all terms which occur in one of the neighbourhoods of the respective links"
In entityA1, wordA has ten words to its left and right (we show only 4 on either side). Suppose wordA is 'entity' and it links to entityB from this sentence fragment:
are developed and the entity relationships between these data
Its neighbourhood is then:
['are', 'developed', 'and', 'the', 'relationships', 'between', 'these', 'data']
Each pair (wordA, entityAi) gives us such a list; we concatenate them all.
"tf.idf-weighted, normalized list"
Basically, tf.idf means you should give common words less "weight" than less-common words. For example, 'and' and 'the' are very common words, so we give them less meaning (for their being next to 'entity') than 'relationships' or 'between'.
Normalised means we (essentially) count the number of times a word occurs (the more it occurs, the more associated we think it is with wordA), then multiply this count by the weight to get a score with which to sort the list... putting the most-frequent, least-common words at the top.
"Note that one and the same word or phrase can have several such prototype vectors"
The vector depends not only on wordA but also on entityB; you could think of it as a mapping:
(wordA, entityB) -> tf.idf-weighted, normalized list (as described above)
(wordA, entityB2) -> a different tf.idf-weighted, normalized list
This makes the point that links to the entity cat from the word 'cat' are less likely to have the neighbour 'batman' than links to Catwoman.
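A small end-to-end sketch of the computation, purely for illustration (the input format, the toy link data, and the exact idf formula are assumptions about how you would extract links from the Wikipedia dump; the paper additionally keeps only pairs linked at least 10 times):

import math
from collections import Counter, defaultdict

# Each record: (anchor word, target entity, tokens within +/-10 words of the link).
# In practice these come from parsing the Wikipedia dump; here they are toy data.
links = [
    ("entity", "EntityB", ["are", "developed", "and", "the", "relationships", "between", "these", "data"]),
    ("entity", "EntityB", ["the", "relationships", "between", "tables", "are", "modelled"]),
    ("entity", "EntityC", ["the", "cat", "in", "the", "comic", "universe"]),
]

# Group the neighbourhood windows per (anchor, entity) pair.
contexts = defaultdict(list)
for anchor, entity, window in links:
    contexts[(anchor, entity)].append(window)

# Document frequency of each term across pairs (the idf part).
df = Counter()
for windows in contexts.values():
    df.update(set(t for w in windows for t in w))
n_pairs = len(contexts)

def prototype_vector(windows):
    # tf: how often each term occurs in the concatenated neighbourhoods.
    tf = Counter(t for w in windows for t in w)
    vec = {t: tf[t] * math.log(n_pairs / df[t]) for t in tf}
    # L2-normalise so prototype vectors are comparable across pairs.
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: round(v / norm, 3) for t, v in vec.items()}

# One prototype vector per (word, entity) pair, as in the quoted paragraph.
prototypes = {pair: prototype_vector(ws) for pair, ws in contexts.items()}
print(prototypes[("entity", "EntityB")])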