When is entity replacement necessary for relation extraction? - machine-learning

In this tutorial on "Training a Machine Learning Classifier for Relation Extraction from Medical Literature", the author performs entity replacement because "we don’t want the model to learn according to a specific entity name, but we want it to learn according to the structure of the text".
Is this generally true or does it depend on the dataset or the used models?

Entity replacement, much like other text transformation techniques such as stemming and lemmatization, is usually part of the relation extraction process because it increases the number of observations per feature. Whether that increase in ratio helps depends on the size of the dataset, the quality of the features, the type of feature extraction, and the complexity of the model.
A good rule of thumb is to define your objective and subsequently your acceptable representation based on your understanding of the dataset. For instance, the given tutorial sets out to understand the relationship between miRNA and genes. The author is okay with grouping miRNA-335, miRNA-342, miRNA-100 and others, under the same entity name.
In scenarios where you don't have domain understanding of the corpora, you can start without entity replacement, inspect the results, and assess the model's bias-variance tradeoff. Then, if required, try entity replacement after experimenting with some clustering techniques.
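As an illustration, here is a minimal sketch of entity replacement, assuming the entities of interest match simple patterns such as miRNA-335 or a small gene lexicon (the regex, gene list, and placeholder tokens below are hypothetical, not taken from the tutorial):

    import re

    def replace_entities(text):
        # Collapse all miRNA identifiers (e.g. miRNA-335, miRNA-100) into one token.
        text = re.sub(r"\bmiRNA-\d+\b", "MIRNA", text)
        # A gene lexicon would be handled the same way; a tiny hypothetical one here.
        for gene in ["BRCA1", "TP53", "SOX4"]:
            text = re.sub(rf"\b{gene}\b", "GENE", text)
        return text

    print(replace_entities("miRNA-335 suppresses SOX4 expression"))
    # -> "MIRNA suppresses GENE expression"

This way the classifier sees the sentence structure around MIRNA and GENE rather than the individual identifiers.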

Related

Incorporating feedback to retrain WordToVec for finding document similarity

I have trained Gensim's WordToVec on a text corpus, converted it to DocToVec and then used cosine similarity to find the similarity between documents. I need to suggest similar documents. Now suppose among the top 5 suggestions for a particular document, we manually find that 3 of them are not similar. Can this feedback be incorporated in retraining the model?
It's not quite clear what you mean by "converted [a Word2Vec model] to DocToVec". The gensim Doc2Vec class doesn't use or require a Word2Vec model as input.
But, if you have many sets of hand-curated "this is a good suggestion" or "this is a bad suggestion" pairs for your corpus, you can use the model's scoring against all those to compare models, and train many variant models (with different model parameter values like size, window, min_count, sample, etc), picking the one that scores best on your tests.
That sort of automated parameter search is the most straightforward way to use performance on real evaluation data to adjust an unsupervised model like Word2Vec.
(Depending on the specifics of your data and problem-domain, you might also start to notice patterns in where the model is better or worse, that help you hand-tune parts of the data preprocessing. For example, a different handling of capitalization or tokenization might be suggested by error cases.)
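As an illustration, here is a minimal parameter-search sketch with gensim, assuming you have hand-curated "should be similar" and "should differ" word pairs to score against (the toy corpus, pair lists, and scoring function are hypothetical; note that gensim 4+ names the dimensionality parameter vector_size where older versions used size):

    from itertools import product
    from gensim.models import Word2Vec

    # Tiny toy corpus; in practice use your own tokenized documents.
    corpus = [["the", "car", "is", "a", "vehicle"],
              ["a", "vehicle", "like", "a", "car", "drives"],
              ["the", "banana", "is", "a", "fruit"],
              ["a", "fruit", "like", "a", "banana", "ripens"]] * 50

    good_pairs = [("car", "vehicle")]   # curated "should be similar" pairs
    bad_pairs  = [("car", "banana")]    # curated "should differ" pairs

    def score(model):
        # Higher is better: good pairs should be close, bad pairs far apart.
        good = sum(model.wv.similarity(a, b) for a, b in good_pairs)
        bad  = sum(model.wv.similarity(a, b) for a, b in bad_pairs)
        return good - bad

    best_model, best_score = None, float("-inf")
    for size, window in product([50, 100], [2, 5]):
        model = Word2Vec(corpus, vector_size=size, window=window,
                         min_count=1, sample=1e-3, epochs=20)
        s = score(model)
        if s > best_score:
            best_model, best_score = model, s

The same loop extends naturally to min_count, sample, and the other parameters mentioned above.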

Entity Type Recognition: Finding an Entity's Dominant Type from its Description

I've been working on a research project. I have a database of Wikipedia descriptions of a large number of entities, including sportspersons, politicians, actors, etc. The aim is to determine the type of entity using the descriptions. I have access to some data with the predicted type of entity which is quite accurate. This will be my training data. What I would like to do is train a model to predict the dominant type of entity for the rest of the data.
What I've done till now:
Extracted the first paragraph, H1, H2 headers of Wiki description of the entity.
Extracted the category list of the entity on the wiki page (the bottom 'Categories' section present on any page, like here).
Finding the type of entity can be difficult for entities that are associated with two or more concepts, like an actor who later became a politician.
I want to ask how I can create a model out of the raw data that I have. What variables should I use to train the model?
Also are there any Natural Language Processing techniques that can be helpful for this purpose? I know POS taggers can be helpful in this case.
My search over the internet has not been very successful. I've stumbled across research papers and blogs like this one, but none of them have relevant information for this purpose. Any ideas would be appreciated. Thanks in advance!
EDIT 1:
The input data is the first paragraph of the Wikipedia page of the entity. For example, for this page, my input would be:
Alan Stuart Franken (born May 21, 1951) is an American comedian, writer, producer, author, and politician who served as a United States Senator from Minnesota from 2009 to 2018. He became well known in the 1970s and 1980s as a performer on the television comedy show Saturday Night Live (SNL). After decades as a comedic actor and writer, he became a prominent liberal political activist, hosting The Al Franken Show on Air America Radio.
My extracted information is the first paragraph of the page, the string of all the 'Categories' (bottom part of the page), and all the headers of the page.
From what I gather, you would like a classifier which takes text input and predicts from a list of predefined categories.
I am not sure what your level of expertise is, so I will give a high-level overview in case others would like to learn about the subject as well.
Like all NLP tasks which use ML, you are going to have to transform your textual domain to a numerical domain by a process of featurization.
Process the text and labels
Determine the relevant features
Create numerical representation of features
Train and Test on a Classifier
Process the text and labels
The text might have some strange markers or artifacts that need to be modified to make it more "clean"; this is a standard text normalisation step.
Then you will keep the associated categories as labels for the texts.
It will end up being something like the following:
For each wiki article:
Normalise wiki article text
Save the associated category labels with the text for training
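A minimal sketch of that loop, assuming the article texts and their category lists are already available as Python strings and lists (the cleaning rules here are only illustrative):

    import re

    def normalise(text):
        # Basic normalisation: lowercase, drop reference markers and odd characters.
        text = text.lower()
        text = re.sub(r"\[\d+\]", " ", text)       # citation markers like [1]
        text = re.sub(r"[^a-z0-9\s]", " ", text)   # keep simple alphanumeric tokens
        return re.sub(r"\s+", " ", text).strip()

    # Hypothetical input: list of (article_text, category_list) pairs.
    articles = [("Alan Stuart Franken (born May 21, 1951) is an American comedian...",
                 ["American comedians", "United States senators from Minnesota"])]

    training_data = [(normalise(text), labels) for text, labels in articles]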
Determine the relevant features
Some features you seem to have mentioned are:
Dominant field (actor, politician)
Header information
Syntactic information (POS Tags) are local (token level), but can be used to extract specific features such as if words are proper nouns or not.
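For instance, POS tags from nltk can be turned into simple token-level features such as "is this word a proper noun" (a sketch of one possible feature, not the only option):

    import nltk
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    text = "Alan Franken served as a United States Senator from Minnesota"
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    proper_nouns = [word for word, tag in tagged if tag in ("NNP", "NNPS")]
    print(proper_nouns)  # e.g. ['Alan', 'Franken', 'United', 'States', ...]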
Create numerical representation of features
Luckily, there are ways of automatically encoding text, such as doc2vec, which can produce a document vector from the text. Then you can add additional bespoke features that seem relevant.
You will then have a vector representation of features relevant to this text as well as the labels (categories).
This will become your training data.
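A minimal gensim Doc2Vec sketch for turning each normalised article into a fixed-length vector (the parameter values and sample data are only illustrative):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Hypothetical training_data: list of (normalised_text, label_list) pairs.
    training_data = [("alan franken is an american comedian and politician",
                      ["American comedians", "American politicians"])]

    docs = [TaggedDocument(words=text.split(), tags=[i])
            for i, (text, _) in enumerate(training_data)]

    model = Doc2Vec(docs, vector_size=100, window=5, min_count=1, epochs=40)

    # Feature vector for a new (or existing) document.
    vector = model.infer_vector("actor who later became a politician".split())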
Train and Test on a Classifier
Now train and test on a classifier of your choice.
Your data is multi-label: each document can carry several category labels, so you will need a classifier (or wrapper) that can predict many labels at once.
Try something simple first, just to see whether things work as you expect.
You should evaluate your results with a cross-validation routine such as k-fold cross-validation, using standard metrics (precision, recall, F1).
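One simple way to do this with scikit-learn (a sketch; the feature vectors would come from the Doc2Vec step above, and the random data and classifier choice are only placeholders):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.model_selection import cross_val_score

    # Stand-ins for document vectors and their category labels.
    X = np.random.rand(200, 100)
    labels = [["politician"], ["actor"], ["actor", "politician"], ["sportsperson"]] * 50
    y = MultiLabelBinarizer().fit_transform(labels)

    # One binary classifier per label, evaluated with 5-fold cross-validation.
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    scores = cross_val_score(clf, X, y, cv=5, scoring="f1_micro")
    print(scores.mean())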
Clarification
Just to help clarify: this task is not really a named entity recognition task. It is a kind of multi-label classification task, where the labels are the categories defined on the Wikipedia pages.
Named entity recognition means finding named entities in a document, such as people and places (usually something noun-like). It is typically done at the token level, whereas your task appears to be at the document level.

metric learning for information retrieval in semi-structured text?

I am interested in parsing semi-structured text. Assuming I have a text with labels of the kind: year_field, year_value, identity_field, identity_value, ..., address_field, address_value and so on.
These fields and their associated values can appear anywhere in the text, but usually they are near each other; more generally the text is organized in a (very) rough matrix, and quite often the value comes just after the associated field, possibly with some uninteresting information in between.
The number of different formats can be up to several dozen, and the formats are not that rigid (do not count on spacing; moreover, some information can be added or removed).
I am looking toward machine learning techniques to extract all those (field,value) of interest.
I think metric learning and/or conditional random fields (CRF) could be of a great help, but I have not practical experience with them.
Has anyone already encountered a similar problem?
Any suggestion or literature on this topic?
Your task, if I understand correctly, is to extract all pre-defined entities from a text. What you describe here is exactly named entity recognition.
Stanford has a Named Entity Recognizer that you can download and use (Python/Java and more).
Regarding the models you are considering (CRF, for example): the hard part here is getting the training data, i.e. sentences with the entities already labeled. This is why you should consider using a pre-trained model, or using someone else's data to train your own (again, the model will only recognize the entity types it saw during training).
A good choice for an already-trained model in Python is nltk's Information Extraction module.
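For example, a minimal run of nltk's built-in NE chunker (a sketch using nltk's pre-trained English models; your own field/value entity types would still need custom training data):

    import nltk
    for resource in ("punkt", "averaged_perceptron_tagger",
                     "maxent_ne_chunker", "words"):
        nltk.download(resource, quiet=True)

    sentence = "John Smith lives at 12 Main Street in Boston since 2015."
    tokens = nltk.word_tokenize(sentence)
    tree = nltk.ne_chunk(nltk.pos_tag(tokens))
    print(tree)  # PERSON, GPE, etc. subtrees mark the recognized entities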
Hope this sums it up

Topic Detection by Clustering Keywords

I want to do text classification based on the keywords that appear in the text, because I do not have sample data to use Naive Bayes for text classification.
Example:
my document contains a few words such as "family, mother, father, children ...", so the category of the document is family; or it contains "football, tennis, score ...", so the category is sport.
What is the best algorithm in this case? And is there any Java API for this problem?
What you have are feature labels, i.e., labels on features rather than instances. There are a few methods for exploiting these, but usually it is assumed that one has instance labels (i.e., labels on documents) in addition to feature labels. This paradigm is referred to as dual-supervision.
Anyway, I know of at least two ways to learn from labeled features alone. The first is Generalized Expectation Criteria, which penalizes model parameters for diverging from a priori beliefs (e.g., that "mother" ought usually to correlate with "family"). This method has the disadvantage of being somewhat complex, but the advantage of having a nicely packaged, open-source Java implementation in the Mallet toolkit (see here, specifically).
A second option would basically be to use Naive Bayes and give large priors to the known word/class associations -- e.g., P("family"|"mother") = .8, or whatever. All unlabeled words would be assigned some prior, presumably reflecting the class distribution. You would then effectively be making decisions based only on the prevalence of classes and the labeled term information. Settles proposed a model like this recently, and there is a web tool available.
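To make the second option concrete, here is a rough sketch of scoring documents against hand-specified word/class associations in a Naive-Bayes-like way (the seed words, probabilities, and default prior are invented for illustration; this is not Settles' actual model):

    import math

    # Hand-specified beliefs: P(class | seed word). Values are illustrative.
    seed_priors = {
        "family": {"mother": 0.8, "father": 0.8, "children": 0.7},
        "sport":  {"football": 0.8, "tennis": 0.8, "score": 0.6},
    }
    default_prior = 0.5  # neutral value for words with no label information

    def classify(text):
        tokens = text.lower().split()
        scores = {}
        for label, word_probs in seed_priors.items():
            # Sum log-probabilities; unlabeled words contribute the neutral default.
            scores[label] = sum(math.log(word_probs.get(t, default_prior))
                                for t in tokens)
        return max(scores, key=scores.get)

    print(classify("the mother and father took the children home"))  # -> family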
You will likely need an auxiliary data set for this. You cannot rely on your data set to convey the information that "dad", "father" and "husband" have a similar meaning.
You can try to mine for co-occurrences to detect near-synonyms, but this is not very reliable.
WordNet and similar resources are probably a good place to disambiguate such words.
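For example, nltk's WordNet interface lets you look up word senses and measure how related two words are (a small illustration; coverage and sense disambiguation remain issues in practice):

    import nltk
    nltk.download("wordnet", quiet=True)
    from nltk.corpus import wordnet as wn

    # Look up the senses of each word.
    print(wn.synsets("dad"))
    print(wn.synsets("father"))

    # Compare the first noun senses of two words; higher means more related.
    dad = wn.synsets("dad", pos=wn.NOUN)[0]
    father = wn.synsets("father", pos=wn.NOUN)[0]
    print(dad.path_similarity(father))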
You can download the freebase topic collection: http://wiki.freebase.com/wiki/Topic_API.

Scalable Classifier For Finding Missing Attributes

I have a large sparse matrix representing attributes for millions of entities. For example, one record, representing an entity, might have attributes "has(fur)", "has(tail)", "makesSound(meow)", and "is(cat)".
However, this data is incomplete. For example, another entity might have all the attributes of a typical "is(cat)" entity, but it might be missing the "is(cat)" attribute. In this case, I want to determine the probability that this entity should have the "is(cat)" attribute.
So the problem I'm trying to solve is determining which missing attributes each entity should contain. Given an arbitrary record, I want to find the top N most likely attributes that are missing but should be included. I'm not sure what the formal name is for this type of problem, so I'm unsure what to search for when researching current solutions. Is there a scalable solution for this type of problem?
My first idea is to simply calculate the conditional probability for each missing attribute (e.g. P(is(cat)|has(fur) and has(tail) and ...)), but that seems like a very slow approach. Plus, as I understand the traditional calculation of conditional probability, I imagine I'd run into problems where my entity contains a few unusual attributes that aren't common with other is(cat) entities, causing the conditional probability to be zero.
My second idea is to train a Maximum Entropy classifier for each attribute, and then evaluate it based on the entity's current attributes. I think the probability calculation would be much more flexible, but this would still have scalability problems, since I'd have to train separate classifiers for potentially millions of attributes. In addition, if I wanted to find the top N most likely attributes to include, I'd still have to evaluate all the classifiers, which would likely take forever.
Are there better solutions?
This sounds like a typical recommendation problem. Think of each attribute as a 'movie rating' and each row as a 'person': for each person, you want to find the movies that they would probably like but haven't rated yet.
You should look at some of the more successful approaches to the Netflix Challenge. The dataset is pretty large, so efficiency is a high priority. A good place to start might be the paper 'Matrix Factorization Techniques for Recommender Systems'.
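To illustrate the idea (not the exact approach from the paper), here is a tiny matrix factorization sketch that runs stochastic gradient descent over the observed entity/attribute entries; a reconstructed score close to 1 for a missing cell suggests the attribute probably applies (the data, factor dimension, and learning rate are all made up):

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical observed (entity, attribute, value) triples; 1.0 means "has attribute".
    observed = [(0, 0, 1.0), (0, 1, 1.0), (0, 2, 1.0),  # entity 0: fur, tail, is(cat)
                (1, 0, 1.0), (1, 1, 1.0),                # entity 1: fur, tail
                (2, 3, 1.0)]                             # entity 2: has(wings)
    n_entities, n_attrs, k = 3, 4, 2

    P = rng.normal(scale=0.1, size=(n_entities, k))  # entity factors
    Q = rng.normal(scale=0.1, size=(n_attrs, k))     # attribute factors

    lr, reg = 0.05, 0.01
    for epoch in range(200):
        for i, j, r in observed:
            p_old = P[i].copy()
            err = r - p_old @ Q[j]
            P[i] += lr * (err * Q[j] - reg * p_old)
            Q[j] += lr * (err * p_old - reg * Q[j])

    # Predicted strength of the missing "is(cat)" attribute (index 2) for entity 1.
    print(P[1] @ Q[2])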
If you have a large data set and you're worried about scalability, I would look into Apache Mahout. Mahout is a machine learning and data mining library that might help you with your project; in particular, it has some of the most well-known algorithms already built in:
Collaborative Filtering
User and Item based recommenders
K-Means, Fuzzy K-Means clustering
Mean Shift clustering
Dirichlet process clustering
Latent Dirichlet Allocation
Singular value decomposition
Parallel Frequent Pattern mining
Complementary Naive Bayes classifier
Random forest decision tree based classifier
High performance Java collections (previously Colt collections)
