Finding custom entities using NLP - machine-learning

Basically, from a paragraph, I have to find two entities: Role and Organization.
An Org should be captured along with its branch location, but not its full address, if one is provided in the paragraph.
A Role can appear in brackets and right before the brackets at the same time, like
Org as role (the "Role")
A Role can have multiple words.
An example paragraph would be:
XXXX, dated as of November 10, 2050, among: (i) <ORG_NAME_1 PLC>,
a public company incorporated under the laws and having
its registered office at the 123 Penss Aven, NY, USA (the “<ROLE1>”), (ii) <ORG_NAME_2 PLC>, a public limited company incorporated
under the laws having its registered office at Manhattan Blvd, Seattle, WA, as <ROLE2> (the
“<ROLE2>”), (iii) the , Guarantors named in some random text hereto, (iv)
<ORG_NAME_3>, N.A., Belfast Branch, as <ROLE3> (the “<ROLE3>”), (v) <ORG_NAME_4> , as <ROLE4>, <ROLE5> and <ROLE6>, and (vi) <ORG_NAME_5> Deutschland AG, as <ROLE7>.
After processing, the desired result would link each role with its organization:
<ROLE1> --> <ORG_NAME_1 PLC>
<ROLE2> --> <ORG_NAME_2 PLC>
<ROLE3> --> <ORG_NAME_3>, N.A., Belfast Branch
<ROLE4>, <ROLE5> and <ROLE6> --> <ORG_NAME_4>
<ROLE7> --> <ORG_NAME_5> Deutschland AG
Another example would be
XXXX dated as of November 28, 2027 among <ORG_NAME_1> A/S, a company incorporated
under the laws of Australia (the “<ROLE1>”), the Guarantors (as defined herein), <ORG_NAME_2>, N.A., Montreal
Branch, as <ROLE2>, <ROLE3>, <ROLE4>, <ROLE5>, <ROLE6> and <ROLE7>
After processing, desired result should be:
<ROLE1> --> <ORG_NAME_1> A/S
<ROLE2>, <ROLE3>, <ROLE4>, <ROLE5>, <ROLE6> and <ROLE7> --> <ORG_NAME_2>, N.A., Montreal Branch
I tried to use PoS tagging and NER but did not get the desired results.
I played with Stanford NLP for NER, but organizations are not detected properly. I tried to train it on my own data, but the accuracy is not acceptable: it does not detect all organizations and tags them as OTHER instead. I did not tweak the actual CRF model.
I played with NLTK in Python and tried to make some rules around NNP (proper noun), but roles are sometimes tagged as verbs and sometimes as nouns, and the tag sometimes depends on letter case as well, so I am not sure this is the right approach.
There is not much variety in the paragraph patterns; I can post 1 or 2 more examples of different patterns if needed. The roles are fixed, around 40 of them, and the organizations are dynamic.
Please suggest any specific papers or models I should read up on. Thanks.

Related

Create a solution to automatically split addresses into their separate components using python

I am trying to find a solution for automatically splitting addresses into their separate components using Python.
Below is some sample data:
| Full Address | Street Number | Street | City | State | Zip Code |
|---|---|---|---|---|---|
| 661 Camel Back Road Tulsa Oklahoma 74120 | 661 | Camel Back Road | Tulsa | Oklahoma | |
| 68 Gnatty Creek Road Roslyn New York 11576 | 68 | Gnatty Creek Road | Roslyn | New York | |
| 1 Raccoon Run Seattle Washington 98119 | 1 | Raccoon Run | Seattle | Washington | |
| 616 Friendship Lane Santa Clara California 95054 | 616 | Friendship Lane | Santa Clara | California | 95054 |
| 3878 Grand Avenue Maitland Florida 32751 | 3878 | Grand Avenue | Maitland | Florida | 32751 |
The above data is a representation of what I am trying to achieve.
On the left is my input address, and on the right is the result after it has been split automatically.
The problem, which cannot be seen in this oversimplified example, is that the input addresses don't come in the same order and will include components such as names of buildings.
My options so far are the following:
REGEX
MACHINE LEARNING MODEL
The REGEX option is familiar, but it will still be largely inaccurate. I need this solution to be as accurate as possible.
The MACHINE LEARNING MODEL option is more difficult in that I am not aware of any model or framework capable of classifying multiple categories at once.
Can anyone help?
So far I haven't really started on the REGEX, in anticipation of major gaps in the capturing groups.
I think the only way to do this and get a fairly accurate result is to get the list of zip codes, for instance from here:
https://www.zipcode.com.ng/2022/06/list-of-5-digit-zip-codes-united-states.html?m=1
and a list of US cities.
Then you can match the zip code, state and city to the lists.
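The lookup idea above can be sketched roughly as follows. This is only a minimal illustration, not a complete parser: the STATES and CITIES sets are tiny hypothetical stand-ins for the full lists, the helper names are made up, and it assumes the zip code is the last 5-digit number in the string and that whatever is left over is the street.

```python
import re

# Hypothetical, tiny lookup tables; real ones would be loaded from full
# lists of US states and cities.
STATES = {"Oklahoma", "New York", "Washington", "California", "Florida"}
CITIES = {"Tulsa", "Roslyn", "Seattle", "Santa Clara", "Maitland"}

ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")


def take_known(text, names):
    """Find the longest known name in text; return (name, text without it)."""
    for name in sorted(names, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, text):
            return name, re.sub(pattern, " ", text, count=1)
    return None, text


def split_address(address):
    parts = {"street_number": None, "street": None,
             "city": None, "state": None, "zip_code": None}
    rest = address

    # Assumption: the last 5-digit (or ZIP+4) number is the zip code.
    zip_matches = list(ZIP_RE.finditer(rest))
    if zip_matches:
        last = zip_matches[-1]
        parts["zip_code"] = last.group(0)
        rest = rest[:last.start()] + " " + rest[last.end():]

    # Match state and city against the lookup lists, in any position.
    parts["state"], rest = take_known(rest, STATES)
    parts["city"], rest = take_known(rest, CITIES)

    # Whatever remains is treated as the street, with an optional number.
    rest = " ".join(rest.split())
    match = re.match(r"(\d+)\s+(.*)", rest)
    if match:
        parts["street_number"], parts["street"] = match.group(1), match.group(2)
    elif rest:
        parts["street"] = rest
    return parts


result = split_address("616 Friendship Lane Santa Clara California 95054")
```

Because the components are matched against lists rather than by position, the same function also handles a reordered input such as "Tulsa Oklahoma 74120 661 Camel Back Road". Building names and other extra components would still end up in the street field.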

Extract some keywords like rent, deposit, liabilities etc. from unstructured document

I am writing an algorithm to extract some keywords like rent, deposit, liabilities etc. from a rent agreement document. I used a naive Bayes classifier, but it is not giving the desired output.
My training data looks like this:
train = [
("refundable security deposit Rs 50000 numbers equal 5 months","deposit"),
("Lessee pay one month's advance rent Lessor","security"),
("eleven (11) months commencing 1st march 2019","duration"),
("commence 15th feb 2019 valid till 14th jan 2020","startdate")]
The code below does not return the desired keyword:
classifier.classify(test_data_features)
Please share if there are any libraries in NLP to accomplish this.
It seems you need to build your own NER (Named Entity Recognizer) for parsing your unstructured documents.
For that you need to tag every word of each sentence with a label. Based on the surrounding words and the context window, your trained NER will then be able to give you the results you are looking for.
Check the Stanford CoreNLP implementation of NER.
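To make "tagging every word" and "context window" more concrete, here is a small sketch in plain Python. The label names (B-DEPOSIT, B-AMOUNT and so on) and the window_features helper are hypothetical and only illustrate the shape of the training data; an actual NER toolkit has its own input format and feature set.

```python
def window_features(tokens, i, size=2):
    """Features for token i: the token itself plus neighbours within `size` positions."""
    feats = {"word": tokens[i].lower(), "is_digit": tokens[i].isdigit()}
    for offset in range(-size, size + 1):
        if offset == 0:
            continue
        j = i + offset
        # Positions outside the sentence are padded.
        feats[f"word[{offset:+d}]"] = tokens[j].lower() if 0 <= j < len(tokens) else "<PAD>"
    return feats


# Hypothetical hand-labelled sentence in BIO format:
# B- starts an entity, I- continues it, O is outside any entity.
tagged = [
    ("refundable", "O"),
    ("security", "B-DEPOSIT"),
    ("deposit", "I-DEPOSIT"),
    ("Rs", "B-AMOUNT"),
    ("50000", "I-AMOUNT"),
]
tokens = [word for word, _ in tagged]
labels = [label for _, label in tagged]

# One feature dict per token; a sequence model would be trained on (X, labels).
X = [window_features(tokens, i) for i in range(len(tokens))]
```

Each token gets one label and one feature dictionary, so the model sees "50000" together with the fact that "rs" and "deposit" precede it.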

Applying MACHINE learning in biological text data

I am trying to solve the following question - Given a text file containing a bunch of biological information, find out the one gene which is {up/down}regulated. Now, for this I have many such (60K) files and have annotated some (1000) of them as to which gene is {up/down}regulated.
Conditions -
Many sentences in the file have some gene name mention and some of them also have neighboring text that can help one decide if this is indeed the gene being modulated.
Some files also have NO gene modulated. But these still have gene mentions.
Given this, I wanted to ask (having absolutely no background in ML), what sequence learning algorithm/tool do I use that can take in my annotated (training) data (after probably converting the text to vectors somehow!) and can build a good model on which I can then test more files?
Example data -
Title: Assessment of Thermotolerance in preshocked hsp70(-/-) and
(+/+) cells
Organism: Mus musculus
Experiment type: Expression profiling by array
Summary: From preliminary experiments, HSP70 deficient MEF cells display moderate thermotolerance to a severe heatshock of 45.5 degrees after a mild preshock at 43 degrees, even in the absence of hsp70 protein. We would like to determine which genes in these cells are being activated to account for this thermotolerance. AQP has also been reported to be important.
Keywords: thermal stress, heat shock response, knockout, cell culture, hsp70
Overall design: Two cell lines are analyzed - hsp70 knockout and hsp70 rescue cells. 6 microarrays from the (-/-)knockout cells are analyzed (3 Pretreated vs 3 unheated controls). For the (+/+) rescue cells, 4 microarrays are used (2 pretreated and 2 unheated controls). Cells were plated at 3k/well in a 96 well plate, covered with a gas permeable sealer and heat shocked at 43degrees for 30 minutes at the 20 hr time point. The RNA was harvested at 3hrs after heat treatment
Here my main gene is hsp70 and it is down-regulated (deducible from hsp(-/-) or HSP70 deficient). Many other gene names are also there like AQP.
There could be another file with no gene modified at all. In fact, more files have no actual gene modulation than those who do, and all contain gene name mentions.
Any idea would be great!!
If you have no background in ML, I suggest buying one of the existing commercial products. These products were in development for decades with team budgets in the millions.
What you are trying to do is not that simple. For example, a lot of papers contain negative statements made by first citing the original statement from another paper and then negating it. In your example, how are you going to handle this:
AQP has also been reported to be important by Doe et al. However, this study suggests that this might not be the case.
Also, if you are looking into a large corpus of biomedical research papers, or for that matter any corpus of research papers, you will find tons of papers that suggest something, for example that a gene is up-regulated or not, and then one paper published in Cell magazine claiming that all previous research has been mistaken.
To make matters worse, gene/protein names are not that stable. Apart from a few famous ones like P53, there is a bunch of run-of-the-mill ones that are initially thought to be one gene but later turn out to be two different things. When this happens, there are two ways the community handles it: either both genes get new names (usually with some designator at the end) or, if the split is uneven, the larger class retains the original name and the second one gets a new name. To compound the problem, after the split not all researchers get the memo instantly, so there is still a stream of publications using the old name.
These are just two simple problems; there are hundreds of these.
If you are doing this for personal enrichment, here are some suggestions:
Build a language model on biomedical papers. Existing language models are usually built from news-wire sources or from social media data. All three of the corpora claim to be written in English language. But in reality these are three different languages with their own grammar and vocabulary
Look into things like embeddings and word2vec.
Look into Kaggle competitions, this is somewhat popular topic there.
Subscribe to KDD and BIBM magazines or find them in nearby library. There are 100s of papers on this subject.

How to tag text based on its category using OpenNLP?

I want to tag text based on the category it belongs to ...
For example ...
"Clutch and gear is monitored using microchip " -> clutch /mechanical , gear/mechanical , microchip / electronic
"software used here to monitor hydrogen levels" -> software/computer , hydrogen / chemistry ..
How can I do this using OpenNLP or other NLP engines?
My work
I tried an NER model, but it needs a large training corpus, which I don't have.
My need
Is there any ready-made training corpus available for NER or classification? (It must contain scientific and engineering words.)
If you want to create a set of class labels for an entire sentence, then you will want to use the Doccat lib. With Doccat you would get a prob distribution for each chunk of text.
With Doccat your sample would produce something like this:
"Clutch and gear is monitored using microchip " -> mechanical 0.85847568, electronic 0.374658
With Doccat you will lose the keyword -> class label mapping, so if you really need it, Doccat might not cut it.
As for NER, OpenNLP has an addon called Modelbuilder-addon that may help you. It is designed to expedite the creation of NER models. You can create a file/list of as many terms for each category as you can think of, then create a file of a bunch of sentences, then use the addon to create an NER model from the seed terms and the file of sentences. See this post, where I described it before with a code example. You will have to pull down the addon from SVN.
OpenNLP: foreign names does not get recognized

How to estimate the quality of a web page?

I'm doing a university project, that must gather and combine data on a user provided topic. The problem I've encountered is that Google search results for many terms are polluted with low quality autogenerated pages and if I use them, I can end up with wrong facts. How is it possible to estimate the quality/trustworthiness of a page?
You may think "nah, Google engineers have been working on the problem for 10 years and he's asking for a solution", but if you think about it, a search engine must provide up-to-date content, and if it marks a good page as a bad one, users will be dissatisfied. I don't have such limitations, so if the algorithm accidentally marks some good pages as bad, that wouldn't be a problem.
Here's an example:
Say the input is buy aspirin in south la. Try to Google search it. The first 3 results are already deleted from the sites, but the fourth one is interesting: radioteleginen.ning.com/profile/BuyASAAspirin (I don't want to make an active link)
Here's the first paragraph of the text:
The bare of purchasing prescription drugs from Canada is big
in the U.S. at this moment. This is
because in the U.S. prescription drug
prices bang skyrocketed making it
arduous for those who bang limited or
concentrated incomes to buy their much
needed medications. Americans pay more
for their drugs than anyone in the
class.
The rest of the text is similar and then the list of related keywords follows. This is what I think is a low quality page. While this particular text seems to make sense (except it's horrible), the other examples I've seen (yet can't find now) are just some rubbish, whose purpose is to get some users from Google and get banned 1 day after creation.
N-gram Language Models
You could try training one n-gram language model on the autogenerated spam pages and one on a collection of other non-spam webpages.
You could then simply score new pages with both language models to see if the text looks more similar to the spam webpages or regular web content.
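As a rough illustration of the two-model idea, here is a minimal sketch of a bigram language model with add-one smoothing in plain Python. The BigramLM class, the toy training texts, and the choice of a vocabulary size shared by both models (so that their scores are comparable) are all assumptions made for this example; a real system would use a proper toolkit and far more data.

```python
import math
from collections import Counter


class BigramLM:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, texts, vocab_size):
        self.vocab_size = vocab_size
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        for text in texts:
            tokens = ["<s>"] + text.lower().split() + ["</s>"]
            self.context_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens, tokens[1:]))

    def log_prob(self, text):
        """Log-probability of text under the model, i.e. an estimate of log P(Text|class)."""
        tokens = ["<s>"] + text.lower().split() + ["</s>"]
        return sum(
            math.log((self.bigram_counts[(a, b)] + 1)
                     / (self.context_counts[a] + self.vocab_size))
            for a, b in zip(tokens, tokens[1:])
        )


# Toy, made-up training data.
spam_texts = ["buy cheap aspirin online now", "cheap drugs buy online now"]
ham_texts = ["aspirin is used to reduce fever and pain",
             "the study measured pain relief in patients"]

# Shared vocabulary size: all training words, the end marker, and one
# extra slot that reserves probability mass for unseen words.
all_words = {w for t in spam_texts + ham_texts for w in t.lower().split()}
vocab_size = len(all_words | {"</s>"}) + 1

spam_lm = BigramLM(spam_texts, vocab_size)
ham_lm = BigramLM(ham_texts, vocab_size)

page = "buy cheap drugs online"
looks_like_spam = spam_lm.log_prob(page) > ham_lm.log_prob(page)
```

Scores are kept as log-probabilities because the raw probabilities of whole pages are far too small to represent directly.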
Better Scoring through Bayes Law
When you score a text with the spam language model, you get an estimate of the probability of finding that text on a spam web page, P(Text|Spam). The notation reads as the probability of Text given Spam (page). The score from the non-spam language model is an estimate of the probability of finding the text on a non-spam web page, P(Text|Non-Spam).
However, the term you probably really want is P(Spam|Text) or, equivalently P(Non-Spam|Text). That is, you want to know the probability that a page is Spam or Non-Spam given the text that appears on it.
To get either of these, you'll need to use Bayes Law, which states
P(A|B) = P(B|A)P(A) / P(B)
Using Bayes law, we have
P(Spam|Text)=P(Text|Spam)P(Spam)/P(Text)
and
P(Non-Spam|Text)=P(Text|Non-Spam)P(Non-Spam)/P(Text)
P(Spam) is your prior belief that a page selected at random from the web is a spam page. You can estimate this quantity by counting how many spam web pages there are in some sample, or you can even use it as a parameter that you manually tune to trade off precision and recall. For example, giving this parameter a high value will result in fewer spam pages being mistakenly classified as non-spam, while giving it a low value will result in fewer non-spam pages being accidentally classified as spam.
The term P(Text) is the overall probability of finding Text on any webpage. If we ignore that P(Text|Spam) and P(Text|Non-Spam) were determined using different models, this can be calculated as P(Text)=P(Text|Spam)P(Spam) + P(Text|Non-Spam)P(Non-Spam). This sums out the binary variable Spam/Non-Spam.
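The calculation of P(Spam|Text) can be sketched as below, assuming the two language models return log-probabilities (as such models usually do). The function name posterior_spam and the default prior of 0.5 are assumptions for this example; the prior must lie strictly between 0 and 1.

```python
import math


def posterior_spam(log_p_text_given_spam, log_p_text_given_ham, prior_spam=0.5):
    """Return P(Spam|Text) from the two log-likelihoods and the prior P(Spam)."""
    # Log of the numerators P(Text|Spam)P(Spam) and P(Text|Non-Spam)P(Non-Spam).
    log_num_spam = log_p_text_given_spam + math.log(prior_spam)
    log_num_ham = log_p_text_given_ham + math.log(1.0 - prior_spam)

    # log P(Text) is the log of the sum of both numerators. Subtracting the
    # maximum first (log-sum-exp) avoids underflow for very small probabilities.
    m = max(log_num_spam, log_num_ham)
    log_p_text = m + math.log(math.exp(log_num_spam - m) + math.exp(log_num_ham - m))

    return math.exp(log_num_spam - log_p_text)


# Example with P(Text|Spam) = 0.2, P(Text|Non-Spam) = 0.1 and equal priors.
p = posterior_spam(math.log(0.2), math.log(0.1), prior_spam=0.5)
```

With equal priors the posterior here is 0.2 / (0.2 + 0.1), so the page is judged twice as likely to be spam as not.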
Classification Only
However, if you're not going to use the probabilities for anything else, you don't need to calculate P(Text). Rather, you can just compare the numerators P(Text|Spam)P(Spam) and P(Text|Non-Spam)P(Non-Spam). If the first one is bigger, the page is most likely a spam page, while if the second one is bigger, the page is most likely non-spam. This works since the equations above for both P(Spam|Text) and P(Non-Spam|Text) are normalized by the same P(Text) value.
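A minimal sketch of this comparison, again assuming the models return log-probabilities; the function name classify and the example scores are made up for illustration.

```python
import math


def classify(log_p_text_given_spam, log_p_text_given_ham, prior_spam=0.5):
    """Compare the two numerators in log space; P(Text) is never computed."""
    spam_score = log_p_text_given_spam + math.log(prior_spam)
    ham_score = log_p_text_given_ham + math.log(1.0 - prior_spam)
    return "spam" if spam_score > ham_score else "non-spam"


# Same text scores, two different priors.
with_even_prior = classify(-10.0, -12.0, prior_spam=0.5)
with_low_prior = classify(-10.0, -12.0, prior_spam=0.01)
```

The second call shows the effect of the prior as a tuning knob: with a low P(Spam), a text that scores slightly better under the spam model is still labelled non-spam.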
Tools
In terms of software toolkits you could use for something like this, SRILM would be a good place to start and it's free for non-commercial use. If you want to use something commercially and you don't want to pay for a license, you could use IRST LM, which is distributed under the LGPL.
Define the 'quality' of a web page. What is the metric?
If someone was looking to buy fruit, then searching for 'big sweet melons' will give many results that contain images of a 'non textile' slant.
The markup and hosting of those pages may however be sound engineering ..
But a page of a dirt farmer presenting his high quality, tasty and healthy produce might be visible only in IE4.5 since the html is 'broken' ...
For each result set per keyword query, do a separate Google query to find the number of sites linking to each site; if no other site links to it, exclude it. I think this would be a good start at least.
If you are looking for performance-related metrics, then YSlow (a plugin for Firefox) could be useful.
http://developer.yahoo.com/yslow/
You can use a supervised learning model to do this type of classification. The general process goes as follows:
1. Get a sample set for training. This will need to provide examples of documents you want to cover. The more general you want to be, the larger the example set you need to use. If you want to just focus on websites related to aspirin, then that shrinks the necessary sample set.
2. Extract features from the documents. This could be the words pulled from the website.
3. Feed the features into a classifier such as the ones provided in MALLET or WEKA.
4. Evaluate the model using something like k-fold cross validation.
5. Use the model to rate new websites.
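The steps above can be sketched end to end in plain Python. This is only a toy stand-in and not MALLET or WEKA: the NaiveBayes class, the bag-of-words features, the k_fold_accuracy helper, and the six made-up documents are all assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter, defaultdict


def features(text):
    """Bag-of-words features: lower-cased word counts."""
    return Counter(text.lower().split())


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(features(doc))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        def score(label):
            counts = self.word_counts[label]
            total = sum(counts.values()) + len(self.vocab) + 1
            s = math.log(self.label_counts[label] / sum(self.label_counts.values()))
            for word, n in features(doc).items():
                s += n * math.log((counts[word] + 1) / total)
            return s
        return max(self.label_counts, key=score)


def k_fold_accuracy(docs, labels, k=3):
    """Hold out every k-th document in turn, train on the rest, count hits."""
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, len(docs), k))
        train = [(d, l) for i, (d, l) in enumerate(zip(docs, labels))
                 if i not in test_idx]
        model = NaiveBayes().fit(*zip(*train))
        correct += sum(model.predict(docs[i]) == labels[i] for i in test_idx)
    return correct / len(docs)


# Step 1: a tiny, made-up labelled sample set.
docs = [
    "buy cheap pills online now",
    "clinical study of aspirin dosage",
    "cheap pills buy now discount",
    "aspirin dosage and clinical effects",
    "discount pills online buy cheap",
    "study of aspirin effects and dosage",
]
labels = ["bad", "good", "bad", "good", "bad", "good"]

# Steps 2 to 4: features are extracted inside fit/predict; evaluate with k folds.
accuracy = k_fold_accuracy(docs, labels, k=3)

# Step 5: rate a new page with a model trained on everything.
rating = NaiveBayes().fit(docs, labels).predict("cheap pills online")
```

On data this small and this cleanly separated the cross-validated accuracy says little; the point is only the shape of the train, evaluate, and predict loop.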
Your remark that you don't mind marking a good site as bad is about recall. Recall measures, of the pages that really are good, how many you actually marked as good. Precision measures, of the pages you marked as good, how many really are good. Since you state that your goal is to be precise and recall isn't as important, you can tweak your model to have higher precision at the cost of recall.
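For reference, both measures can be computed from a list of predictions and the true labels as in the sketch below. It treats "good" as the positive class, so precision is the share of pages marked good that really are good, and recall is the share of really good pages that were marked good. The function name and the example labels are made up.

```python
def precision_recall(predicted, actual, positive="good"):
    """Precision and recall for the `positive` class."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)  # correctly marked good
    fp = sum(p == positive and a != positive for p, a in pairs)  # bad page marked good
    fn = sum(p != positive and a == positive for p, a in pairs)  # good page marked bad
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# A cautious classifier: it throws away two good pages but never lets a bad one in.
actual = ["good", "good", "good", "bad", "bad"]
predicted = ["good", "bad", "bad", "bad", "bad"]
precision, recall = precision_recall(predicted, actual)
```

This is the trade-off described in the question: every page that is kept is trustworthy (high precision), at the price of discarding some good pages (low recall).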

Resources