metric learning for information retrieval in semi-structured text? - parsing

I am interested in parsing semi-structured text. Assume I have a text with labels of the kind: year_field, year_value, identity_field, identity_value, ..., address_field, address_value, and so on.
These fields and their associated values can be anywhere in the text, but they are usually near each other. More generally, the text is organized in a (very) rough matrix, but quite often the value comes just after the associated field, possibly with some uninteresting information in between.
There can be up to several dozen different formats, and they are not rigid (do not count on spacing; moreover, some information can be added or removed).
I am looking at machine learning techniques to extract all the (field, value) pairs of interest.
I think metric learning and/or conditional random fields (CRF) could be of great help, but I have no practical experience with them.
Has anyone already encountered a similar problem?
Any suggestions or literature on this topic?

Your task, if I understand correctly, is to extract all pre-defined entities from a text. What you describe here is exactly named entity recognition.
Stanford provides the Stanford Named Entity Recognizer, which you can download and use (from Python, Java and more).
Regarding the models you are considering (CRF, for example): the hard part is getting the training data, i.e. sentences with the entities already labeled. This is why you should consider getting a trained model, or using someone else's data to train your model (again, the model will recognize only the kinds of entities it saw during training).
A great choice for an already-trained model in Python is NLTK's information extraction module.
Hope this sums it up
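To make the CRF suggestion a bit more concrete: a sequence labeller is trained on one feature dictionary and one label per token. The sketch below only builds those inputs. The feature names, the sample line and the tokenisation by whitespace are all assumptions made for illustration; the resulting lists would then be handed to whichever CRF library you choose.

```python
import re

def token_features(tokens, i):
    """Features for the token at position i, including its immediate neighbours."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),
        "looks_like_year": bool(re.fullmatch(r"(19|20)\d{2}", tok)),
        "ends_with_colon": tok.endswith(":"),  # field names are often followed by ':'
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

def sentence_features(tokens):
    """One feature dict per token, in order."""
    return [token_features(tokens, i) for i in range(len(tokens))]

# Hypothetical labelled line, using the label names from the question.
tokens = "Year: 2012 Name: John Smith".split()
labels = ["year_field", "year_value", "identity_field", "identity_value", "identity_value"]
features = sentence_features(tokens)
```

The neighbour features are what let the model learn that a value tends to come just after its field, even when the spacing varies.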

Related

AutoML NL - Training model based on ICD10-CM - Amount of text required

We are currently working on integrating ICD10-CM for our medical company, to be used for patient diagnosis. ICD10-CM is a coding system for diagnoses.
I tried to import the ICD10-CM data as description-code pairs, but it didn't work because AutoML needs more text for each code (label). I found a dataset on Kaggle, but it only contained hrefs to an ICD10 website. I did find out that the website contains multiple texts and descriptions associated with the codes, which could be used to train our desired model.
Kaggle Dataset:
https://www.kaggle.com/shamssam/icd10datacom
Sample of a page from ICD10data.com:
https://www.icd10data.com/ICD10CM/Codes/A00-B99/A15-A19/A17-/A17.0
Most notable fields are:
- Approximate Synonyms
- Clinical Information
- Diagnosis Index
If I build a dataset from the sentences found on these pages and assign them to their codes (labels), will it be enough to train an AutoML model? Each label would then have two or more texts instead of just one, but still far fewer than 100 per code, unlike the datasets in demos and tutorials.
From what I can see here, the disease codes have a tree-like structure: for instance, all L00-L99 codes refer to "Diseases of the skin and subcutaneous tissue", while the L00-L08 codes refer to "Infections of the skin and subcutaneous tissue", and so on.
What I mean is that the problem is not 90000 examples for 90000 independent labels, but a decision tree: you take several decisions, each depending on the previous one. The first step would be choosing which of the roughly 15 most general categories fits best, then choosing among its subcategories, and so on.
In this sense, AutoML is probably not the best product, given that you cannot implement a specially designed decision tree model that takes all of this into account.
Another way of using AutoML would be to train a separate model for each decision and then combine the different models. This would work easily for the first layer of decisions but would become very time-consuming: the number of models to train grows exponentially with the level of accuracy you want (by accurate I mean affirming that a code is in L00-L08 rather than just in L00-L99).
I hope this helps you better understand the problem and the different approaches you can take to it!
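As a rough illustration of the tree structure, the sketch below turns a single code into a path of increasingly specific ranges, so that each level could be used as the target of its own classifier. Only the two ranges quoted above come from the source; the `HIERARCHY` table layout and the function name are assumptions, and a real table would have to be built from the full ICD10-CM index.

```python
# Hypothetical fragment of the hierarchy: (level, first_code, last_code, description).
HIERARCHY = [
    (1, "L00", "L99", "Diseases of the skin and subcutaneous tissue"),
    (2, "L00", "L08", "Infections of the skin and subcutaneous tissue"),
]

def label_path(code, hierarchy=HIERARCHY):
    """Return the ranges containing `code`, from most general to most specific."""
    prefix = code[:3].upper()  # e.g. "L03.1" -> "L03"
    matches = [
        (level, f"{first}-{last}")
        for level, first, last, _ in hierarchy
        if first <= prefix <= last  # equal-length strings compare character by character
    ]
    return [name for _, name in sorted(matches)]

print(label_path("L03.1"))  # ['L00-L99', 'L00-L08']
print(label_path("L50"))    # ['L00-L99']
```

With such paths, the first-layer model is trained on the first element of each path, and a second-layer model is trained only on the examples that fall under a given first-layer category.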

Entity Type Recognition: Finding an Entity's Dominant Type from its Description

I've been working on a research project. I have a database of Wikipedia descriptions of a large number of entities, including sportspersons, politicians, actors, etc. The aim is to determine the type of each entity from its description. I have access to some data with a predicted entity type that is quite accurate. This will be my training data. What I would like to do is train a model to predict the dominant entity type for the rest of the data.
What I've done so far:
- Extracted the first paragraph and the H1 and H2 headers of the Wiki description of the entity.
- Extracted the category list of the entity on the wiki page (the bottom 'Categories' section present on any page, like here).
Finding the type of entity can be difficult for entities that are associated with two or more concepts, like an actor who later became a politician.
How do I create a model out of the raw data that I have? What are the variables that I should use to train the model?
Also, are there any Natural Language Processing techniques that can be helpful for this purpose? I know POS taggers can be helpful in this case.
My search on the internet has not been very successful. I've stumbled across research papers and blogs like this one, but none of them has relevant information for this purpose. Any ideas would be appreciated. Thanks in advance!
EDIT 1:
The input data is the first paragraph of the Wikipedia page of the entity. For example, for this page, my input would be:
Alan Stuart Franken (born May 21, 1951) is an American comedian, writer, producer, author, and politician who served as a United States Senator from Minnesota from 2009 to 2018. He became well known in the 1970s and 1980s as a performer on the television comedy show Saturday Night Live (SNL). After decades as a comedic actor and writer, he became a prominent liberal political activist, hosting The Al Franken Show on Air America Radio.
The information I have extracted is the first paragraph of the page, the string of all the 'Categories' (bottom part of the page), and all the headers of the page.
From what I gather, you would like a classifier that takes text as input and predicts from a list of predefined categories.
I am not sure what your level of expertise is, so I will give a high-level overview in case other people would like to learn about the subject.
Like all NLP tasks that use ML, you are going to have to transform your textual domain into a numerical domain through a process of featurization. The steps are:
1. Process the text and labels
2. Determine the relevant features
3. Create numerical representation of features
4. Train and Test on a Classifier
Process the text and labels
The text might contain strange markers or other things that need to be modified to make it "cleaner". This is a standard text normalisation step.
Then you will have to keep the related categories as labels for the texts.
It will end up being something like the following:
For each wiki article:
    Normalise the wiki article text
    Save the associated category labels with the text for training
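The loop above could look roughly like this in Python. It is a minimal sketch: the particular cleaning rules (dropping bracketed markers, lowercasing, squeezing whitespace), the `articles` structure and the category labels are assumptions chosen for illustration, not a recommended normalisation recipe.

```python
import re
import unicodedata

def normalise(text):
    """Unify Unicode forms, drop bracketed markers, squeeze whitespace, lowercase."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\[[^\]]*\]", " ", text)  # e.g. footnote markers such as [1]
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()

# Hypothetical input records; the labels here are made up for the example.
articles = [
    {
        "text": "Alan Stuart Franken (born May 21, 1951) is an American comedian[1], writer, and politician.",
        "categories": ["comedian", "politician"],
    },
]

# One (clean text, labels) pair per article, ready for feature extraction.
training_pairs = [(normalise(a["text"]), a["categories"]) for a in articles]
```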
Determine the relevant features
Some features you seem to have mentioned are:
- Dominant field (actor, politician)
- Header information
- Syntactic information (POS tags): these are local (token-level), but can be used to extract specific features, such as whether words are proper nouns.
Create numerical representation of features
Luckily, there are ways of learning such a representation automatically, such as doc2vec, which builds a document vector from the text. You can then append additional bespoke features that seem relevant.
You will then have a vector representation of the features relevant to each text, as well as the labels (categories).
This will become your training data.
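The idea of "document vector plus bespoke features" can be sketched without any library. Note the substitution: plain word counts stand in for doc2vec here, simply to keep the example self-contained, and the two bespoke features (number of headers, text length) are assumptions picked for illustration.

```python
from collections import Counter

def build_vocabulary(texts):
    """Map each distinct word to a column index (alphabetical order)."""
    vocab = sorted({w for t in texts for w in t.split()})
    return {w: i for i, w in enumerate(vocab)}

def featurize(text, header_count, vocab):
    """Word-count vector followed by two hand-made features."""
    counts = Counter(text.split())
    bow = [counts.get(w, 0) for w in vocab]       # dicts keep insertion order, so this follows the indices
    bespoke = [header_count, len(text.split())]   # hypothetical extra features
    return bow + bespoke

texts = ["american comedian and politician", "american actor"]
vocab = build_vocabulary(texts)
print(featurize(texts[0], 3, vocab))  # [0, 1, 1, 1, 1, 3, 4]
```

In practice the bespoke features usually need scaling so that they do not dominate, or vanish next to, the text dimensions.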
Train and Test on a Classifier
Now train and test on a classifier of your choice.
Your data is one-to-many, as each text can have several labels that you will try to predict.
Try something simple first, just to see if things work as you expect.
You should test your results with a cross-validation routine such as k-fold validation, using standard metrics (precision, recall, F1).
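For reference, here is a small sketch of both pieces: splitting the data into k folds and scoring multi-label predictions. The interleaved fold assignment and the micro-averaging of the scores are choices made for this example; other splitting and averaging schemes are equally valid.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx); fold f tests every k-th sample starting at f."""
    indices = list(range(n_samples))
    for fold in range(k):
        test = indices[fold::k]
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test

def precision_recall_f1(true_labels, predicted_labels):
    """Micro-averaged scores; each element is the set of labels for one document."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        tp += len(truth & pred)   # labels predicted and correct
        fp += len(pred - truth)   # labels predicted but wrong
        fn += len(truth - pred)   # labels missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

folds = list(k_fold_indices(6, 3))
scores = precision_recall_f1(
    [{"actor", "politician"}, {"athlete"}],
    [{"actor"}, {"athlete", "actor"}],
)
```

For each fold you would train on `train`, predict on `test`, and average the scores over the folds.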
Clarification
Just to help clarify, this task is not really a named entity recognition task. It is a kind of multi-label classification task, where the labels are the categories defined on the Wikipedia pages.
Named entity recognition is finding meaningful named entities in a document, such as people and places: usually something noun-like. It is usually done at the token level, whereas your task seems to be at the document level.

What methods are there to classify documents?

I am trying to do document classification, but I am really confused about feature selection and tf-idf. Are they the same thing, or two different ways of doing classification?
I hope somebody can explain. I am not really sure my question makes sense.
Yes, you are confusing a lot of things.
Feature selection is the abstract term for choosing features (each feature is either dropped or kept, i.e. 0 or 1). Stopword removal can be seen as feature selection.
TF is one method of extracting features from text: counting words.
IDF is one method of assigning weights to features.
Neither of them is classification... they are popular for text classification, but they are even more popular for information retrieval, which is not classification...
However, many classifiers work on numeric data, so the common process is to:
1. Extract features (e.g. TF).
2. Select features (e.g. remove stopwords).
3. Weight features (e.g. IDF).
4. Train a classifier on the resulting numerical vectors.
5. Predict the classes of new/unlabeled documents.
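The five steps can be sketched end to end in plain Python. Everything specific here is an assumption made for illustration: the tiny stopword list, the toy training documents, the `log(n / df) + 1` form of IDF, and the use of a 1-nearest-neighbour rule with cosine similarity as the classifier (any classifier that accepts numeric vectors could take its place).

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "is", "in"}  # tiny illustrative list

def extract_tf(text):
    """Step 1: extract features by counting words."""
    return Counter(text.lower().split())

def select(tf):
    """Step 2: select features by dropping stopwords."""
    return {w: c for w, c in tf.items() if w not in STOPWORDS}

def idf_table(docs):
    """Step 3: rarer words get larger weights."""
    n = len(docs)
    df = Counter(w for d in docs for w in d)  # each document counts a word once
    return {w: math.log(n / df[w]) + 1.0 for w in df}

def weight(tf, idf):
    return {w: c * idf.get(w, 0.0) for w, c in tf.items()}

def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

train = [
    ("the election of the president", "politics"),
    ("parliament votes in the election", "politics"),
    ("the compiler is a program", "computer science"),
    ("a program and the algorithm", "computer science"),
]
tfs = [select(extract_tf(text)) for text, _ in train]
idf = idf_table(tfs)
vectors = [weight(tf, idf) for tf in tfs]  # Step 4: "training" is storing labelled vectors

def predict(text):
    """Step 5: return the label of the most similar training document."""
    query = weight(select(extract_tf(text)), idf)
    best = max(range(len(train)), key=lambda i: cosine(query, vectors[i]))
    return train[best][1]

print(predict("an algorithm in a program"))  # computer science
```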
Taking a look at this explanation may help a lot when it comes to understanding text classifiers.
TF-IDF is a good way to find a document that answers a given query, but it does not by itself assign classes to documents.
Examples that may be helpful:
1) You have a bunch of documents with subjects ranging from politics, economics, computer science and the arts. The documents belonging to each subject are separated into the appropriate directories for each subject (you have a labeled dataset). Now, you received a new document whose subject you do not know. In which directory should it be stored? A classifier can answer this question from the documents that are already labeled.
2) Now, you received a query regarding computer science. For instance, you received the query "Good methods for finding textual similarity". Which document in the directory of computer science can provide the best response to that query? TF-IDF would be a good approach to figure that out.
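The retrieval use in example 2 can be sketched as follows: each document is scored by summing, over the query words, the word's count in the document times its inverse document frequency. The three documents and their names are made up for the example, and the query is lowercased to match them.

```python
import math
from collections import Counter

# Hypothetical contents of the "computer science" directory.
docs = {
    "doc_a": "methods for finding textual similarity between documents",
    "doc_b": "sorting methods for large arrays",
    "doc_c": "methods for parsing source code",
}
tokenised = {name: text.split() for name, text in docs.items()}
df = Counter(w for words in tokenised.values() for w in set(words))

def tfidf_score(query, words):
    """Sum of tf * idf over the query words that occur anywhere in the collection."""
    tf = Counter(words)
    return sum(tf[w] * math.log(len(docs) / df[w]) for w in query.split() if w in df)

query = "good methods for finding textual similarity"
ranking = sorted(tokenised, key=lambda name: tfidf_score(query, tokenised[name]), reverse=True)
print(ranking[0])  # doc_a
```

Words that appear in every document ("methods", "for") get an IDF of zero and so do not influence the ranking, which is exactly the effect the weighting is meant to have.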
So, when you are classifying documents, you are trying to make a decision about whether a document is a member of a particular class (like, say, 'about birds' or 'not about birds').
Classifiers predict the value of the class given a set of features. A good set of features will be highly discriminative - they will tell you a lot about whether the document is of one class or another.
Tf-idf (term frequency inverse document frequency) is a particular feature that seems to be discriminative for document classification tasks. There are others, like word counts (tf or term frequency) or whether a regexp matches the text or what have you.
Feature selection is the task of selecting good (discriminative) features. Tf-idf is probably a good feature to select.

How to include datetimes and other priority information for clustering?

I want to cluster text. I kinda understand the concept of clustering text-only content from Mahout in Action:
1. Make a mapping (int -> term) of all terms in the input and store it in a dictionary.
2. Convert all input documents into normalized sparse vectors.
3. Do the clustering.
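The first two steps can be illustrated in a few lines of plain Python. This is only a sketch of the idea, not of Mahout's own tooling: the function names, the whitespace tokenisation and the choice of Euclidean (unit-length) normalisation are assumptions.

```python
import math
from collections import Counter

def build_dictionary(documents):
    """Step 1: map int -> term for every term in the input."""
    terms = sorted({t for doc in documents for t in doc.lower().split()})
    return dict(enumerate(terms))

def to_sparse_vector(document, dictionary):
    """Step 2: return {term_id: weight}, scaled to unit Euclidean length."""
    term_to_id = {term: i for i, term in dictionary.items()}
    counts = Counter(term_to_id[t] for t in document.lower().split() if t in term_to_id)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {i: c / norm for i, c in counts.items()} if norm else {}

documents = ["beach trip beach", "office meeting"]
dictionary = build_dictionary(documents)
vector = to_sparse_vector(documents[0], dictionary)
```

Only the non-zero entries are stored, which is what makes the representation sparse.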
I want to cluster text as well as other information like date-time, location, people I was with. For example, I want documents made in a 10-day visit to a distant place to be placed into a distinct cluster.
I know I must write my own tool for making vectors from date-time, location, tags and (natural) text. How do I approach this? Should I use built-in tools to vectorize the text and then integrate that output into my own vectors? What about weighting the dimensions?
I can't give you full implementation details, as I'm not sure of them, but I can help you with one piece of the puzzle. You will almost certainly need some context analysis to extract entities (such as locations, times/dates and person names).
For this, take a look at OpenNLP:
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html
In particular, look at the POS tagger and the name finder.
Once you have extracted the relevant entities, you may be able to do something with them using Mahout classification (once you have extracted enough entities to train your model), but of this I am not sure.
Good luck.

Topic Detection by Clustering Keywords

I want to do text classification based on the keywords that appear in the text, because I do not have sample data to train naive Bayes for text classification.
Example:
If my document contains words such as "family, mother, father, children, ...", its category is family; if it contains "football, tennis, score, ...", its category is sport.
What is the best algorithm in this case? And is there a Java API for this problem?
What you have are feature labels, i.e., labels on features rather than instances. There are a few methods for exploiting these, but usually it is assumed that one has instance labels (i.e., labels on documents) in addition to feature labels. This paradigm is referred to as dual-supervision.
Anyway, I know of at least two ways to learn from labeled features alone. The first is Generalized Expectation Criteria, which penalizes model parameters for diverging from a priori beliefs (e.g., that "mother" ought usually to correlate with "family"). This method has the disadvantage of being somewhat complex, but the advantage of having a nicely packaged, open-source Java implementation in the Mallet toolkit (see here, specifically).
A second option would basically be to use Naive Bayes and give large priors to the known word/class associations, e.g., P("family"|"mother") = .8, or whatever. All unlabeled words would be assigned some prior, presumably reflecting the class distribution. You would then effectively be making decisions based only on the prevalence of classes and the labeled term information. Settles proposed a model like this recently, and there is a web-tool available.
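A minimal sketch of that second option follows, written in Python for brevity even though the question asks about Java. It is not Settles' model: the keyword lists come from the question, the value 0.8 from the example above, and the uniform class prior and the way the remaining 0.2 is shared among the other classes are assumptions. The scoring uses Bayes' rule in the form P(word | class) ∝ P(class | word) / P(class), so unlabeled words contribute nothing.

```python
import math

LABELLED_WORDS = {
    "family": {"family", "mother", "father", "children"},
    "sport": {"football", "tennis", "score"},
}
CLASS_PRIOR = {"family": 0.5, "sport": 0.5}  # assumed class distribution
STRONG = 0.8                                 # assumed P(class | labelled word)

def word_class_prob(word, cls):
    """P(cls | word): strong for labelled words, the class prior otherwise."""
    for owner, words in LABELLED_WORDS.items():
        if word in words:
            if owner == cls:
                return STRONG
            return (1.0 - STRONG) / (len(LABELLED_WORDS) - 1)
    return CLASS_PRIOR[cls]

def classify(text):
    """Pick the class with the highest Naive Bayes log-score."""
    scores = {}
    for cls, prior in CLASS_PRIOR.items():
        score = math.log(prior)
        for word in text.lower().split():
            # P(word | cls) is proportional to P(cls | word) / P(cls)
            score += math.log(word_class_prob(word, cls) / prior)
        scores[cls] = score
    return max(scores, key=scores.get)

print(classify("my mother and father watched tennis"))  # family
print(classify("the football score"))                   # sport
```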
You will likely need an auxiliary data set for this. You cannot rely on your own data set to convey the information that "dad", "father" and "husband" have a similar meaning.
You can try to mine co-occurrences to detect near-synonyms, but this is not very reliable.
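For what it is worth, co-occurrence mining in its simplest form just counts how often two words appear in the same document. The sketch below does that on made-up documents; it is a starting point only, and, as noted, frequently co-occurring words are related but not necessarily near-synonyms.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count, for each unordered word pair, the documents containing both words."""
    pairs = Counter()
    for doc in documents:
        words = sorted(set(doc.lower().split()))  # sorted so each pair has one canonical order
        pairs.update(combinations(words, 2))
    return pairs

documents = ["father and mother at home", "mother and children", "football score"]
counts = cooccurrence_counts(documents)
print(counts[("father", "mother")])  # 1
```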
WordNet and similar resources are probably a good place to look to disambiguate such words.
You can download the freebase topic collection: http://wiki.freebase.com/wiki/Topic_API.

Resources