UIMA Ruta rule for the regex below - machine-learning

I need to annotate the cases below. I have blocks of text and need to extract the sub-blocks that contain bank names. For example, in the following text block I need to annotate "Bank of America" as a bank name:
hereinafter described and hereinafter referred to as Owner and Bank
of America NA, successor in interest from
There could be many cases for a bank name:
Bank of America [Bank at the start]
Royal Bank of Scotland [Bank in the middle]
Yes Bank [Bank at the end]
etc.
So the span to annotate depends entirely on the bank name.
I'm not able to write a generic rule that covers all cases. So far I've tried the rules below:
- Rule 1
W[0,3] BankNameKeyWord W[0,3] {-> MARK(BANKNAME,1,3)}; (looks at up to three words on either side of the keyword)
- Rule 2
W? W? W? BankNameKeyWord W? W? W? {-> MARK(BANKNAME,1,7)};
I'm looking for a generic approach that covers all cases.

You could maybe apply a rule like:
(CW[0,3] #BankNameKeyWord SW.ct=="of"? CW[0,3]) {-> MARK(BANKNAME,1,3)};
but this does not solve your problem. As the comments note, you need some linguistic preprocessing such as a chunker. If it's just bank names, you could consider a dictionary.
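For illustration only, here is the dictionary idea as a minimal Python sketch outside Ruta (the bank list is an invented stand-in for a real gazetteer; in Ruta itself you would put the names in a WORDLIST file and mark them with MARKFAST):

import re

# Invented stand-in for a real bank gazetteer (in Ruta: a WORDLIST file).
BANK_NAMES = ["Bank of America", "Royal Bank of Scotland", "Yes Bank"]

# Longest names first, so "Royal Bank of Scotland" beats shorter overlaps.
pattern = re.compile("|".join(
    re.escape(n) for n in sorted(BANK_NAMES, key=len, reverse=True)))

text = ("hereinafter described and hereinafter referred to as Owner and "
        "Bank of America NA, successor in interest from")
for m in pattern.finditer(text):
    print(m.start(), m.end(), m.group())  # span offsets and the matched name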
DISCLAIMER: I am a developer of UIMA Ruta

Related

Text recommendation based on keywords

I need some advice on the following problem.
I'm given a set of keywords weighted by percentage and need to find the text in a database that best matches those keywords. I will give an example.
I'm presented with these keywords:
Sun (90%)
National Park (85%; some keywords contain two words)
Landmark (60%)
Now let's say my database contains three text entries, e.g.:
Going-to-the-Sun Road is a scenic mountain road in the Rocky Mountains of the western United States, in Glacier National Park in Montana.
Everybody has a little bit of the sun and moon in them. Everybody has a little bit of man, woman, and animal in them.
A hybrid car is one that uses more than one means of propulsion - that means combining a petrol or diesel engine with an electric motor.
Obviously the first text is the one that best describes the given set of keywords, so this is what I want to recommend to the user. The second text somewhat relates to the "sun" keyword, so it could be an acceptable choice too.
The third text is totally irrelevant and should only be recommended as a last resort, when everything else fails.
I'm totally new to this kind of thing, so I need some advice on which technologies/algorithms I should use. It seems like some machine learning (NLP) or some kind of fuzzy logic is involved; I'm not really sure.
You need to use a combination of query-term boosting and synonyms.
Also look into: Is there a way to do fuzzy string matching for words on string?
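To make the boosting idea concrete, here is a minimal weighted-keyword scoring sketch in plain Python, using the weights and texts from the question (a real system would delegate this to a search engine with query-term boosting, e.g. a Lucene-style query such as sun^0.9, plus synonym expansion):

# Score each text by the summed weights of the keywords it contains.
keywords = {"sun": 0.90, "national park": 0.85, "landmark": 0.60}

texts = [
    "Going-to-the-Sun Road is a scenic mountain road in the Rocky Mountains "
    "of the western United States, in Glacier National Park in Montana.",
    "Everybody has a little bit of the sun and moon in them.",
    "A hybrid car is one that uses more than one means of propulsion.",
]

def score(text):
    # Lowercase once; multi-word keywords match as plain substrings.
    lower = text.lower()
    return sum(w for kw, w in keywords.items() if kw in lower)

# Rank descending: the Glacier National Park text scores 1.75, the "sun and
# moon" quote 0.90, the hybrid-car text 0.0.
for s, t in sorted(((score(t), t) for t in texts), reverse=True):
    print(round(s, 2), t[:60])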

Best ways to parse job descriptions (JD parsing)?

I am a developer with little knowledge of text parsing.
I need to parse job descriptions and extract some outputs. I need to extract the following fields from a job description:
Job Responsibilities,
Qualification,
Specialization,
Domain,
Skills Required,
Job Description,
Work Experience Min,
Work Experience Max,
Industry,
Occupation,
Functional Area,
Currency,
Salary,
Salary Type,
Employment Type,
Work Authorisation,
Required Visa Status,
Required English Level,
Country,
State,
City,
Zipcode,
Address of Job.
To accomplish this, I am using regex pattern matching, but the output quality is often low. It sometimes requires an exact pattern to identify the parameters, so it fails many times.
I have found other approaches too.
Named Entity Recognition:
Using Stanford NLP, I am able to get the location and address, but I don't know how I can train the model for the other parameters, or whether that is even possible.
Fuzzy logic:
I did some research on using fuzzy logic to validate the results.
My questions are:
1. What are the approaches to accomplish JD parsing?
2. How effective is NER?
3. Is there any conceivable way to use fuzzy logic in JD text parsing?
Any help would be really appreciated.
I think you can try dependency parsing if regex doesn't work accurately. NER will not cover all the fields you need. How to extract employment type is something I would like to learn from you as well.
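For concreteness, here is a minimal sketch of the regex approach the asker describes (the toy JD and field patterns are invented for illustration); it also shows why pure regex is brittle:

import re

# Invented toy job description with labelled sections.
jd = """
Job Description: Build and maintain data pipelines.
Skills Required: Python, SQL, Spark
Work Experience Min: 3
Work Experience Max: 5
City: Austin
"""

fields = ["Job Description", "Skills Required",
          "Work Experience Min", "Work Experience Max", "City"]

# One pattern per field. This only works when the JD labels its sections
# exactly this way -- which is precisely why regex alone fails so often.
parsed = {}
for field in fields:
    m = re.search(re.escape(field) + r"\s*:\s*(.+)", jd)
    if m:
        parsed[field] = m.group(1).strip()

print(parsed)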

NLP parsing multiple questions contained in one single query

If a single query from the user contains multiple questions belonging to different categories, how can they be identified, split and parsed?
Eg -
User - what is the weather now and tell me my next meeting
Parser - {:weather => "what is the weather", :schedule => "tell me my next meeting"}
The parser identifies the parts of the sentence where the question belongs to two different categories.
User - show me hotels in san francisco for tomorrow that are less than $300 but not less than $200 are pet friendly have a gym and a pool with 3 or 4 stars staying for 2 nights and dont include anything that doesnt have wifi
Parser - {:hotels => ["show me hotels in san francisco",
"for tomorrow", "less than $300 but not less than $200",
"pet friendly have a gym and a pool",
"with 3 or 4 stars", "staying for 2 nights", "with wifi"]}
The parser identifies that the question belongs to only one category but has additional steps for fine-tuning the answer, and it creates an array ordered according to the steps to take.
From what I can understand, this requires a sentence segmenter, a multi-label classifier and coreference resolution.
But the sentence segmenters I have come across depend heavily on grammar and punctuation.
Multi-label classifiers, like a well-trained naive Bayes classifier, work in most cases, but since they are multi-label, they often output multiple categories for sentences which clearly belong to one class. Depending solely on the array outputs to check the labels present would fail.
A multi-class classifier is also good for checking the array output of probable categories, but obviously it does not separate the different parts of the sentence very accurately, much less indicate how to proceed with the next step.
As a first step, how can I tune a sentence segmenter to correctly split the sentence without any strict grammar rules? Good accuracy there would help a lot in classification.
"As a first step, how can I tune a sentence segmenter to correctly split the sentence without any strict grammar rules?"
Instead of doing this, I'd suggest you use the parse tree directly (either a dependency parse or a constituency parse).
Here I'm showing the output of the dependency parse, and you can see that the two segments are separated via a "CONJ" arrow:
(from here: http://deagol.cs.illinois.edu:8080/)
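To do the same split programmatically rather than in the online demo, here is a sketch using spaCy (my substitution; the answer itself only links the demo), cutting the token stream at coordinating conjunctions, which the dependency parser labels "cc":

import spacy  # assumes the en_core_web_sm model is installed

nlp = spacy.load("en_core_web_sm")
doc = nlp("what is the weather now and tell me my next meeting")

# Split the token stream at coordinating conjunctions (dependency label "cc").
segments, current = [], []
for tok in doc:
    if tok.dep_ == "cc":
        segments.append(" ".join(current))
        current = []
    else:
        current.append(tok.text)
segments.append(" ".join(current))

print(segments)  # e.g. ['what is the weather now', 'tell me my next meeting']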
Another solution worth trying is ClausIE:
https://gate.d5.mpi-inf.mpg.de/ClausIEGate/ClausIEGate?inputtext=what+is+the+weather+now+and+tell+me+my+next+meeting++&processCcAllVerbs=true&processCcNonVerbs=true&type=true&go=Extract
If you want something for segmentation that doesn't depend heavily on grammar, then chunking comes to mind. There is a section on it in the NLTK book; the approach its authors take depends only on part-of-speech tags (a minimal sketch follows below).
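A minimal chunking sketch along the lines of the NLTK book (it relies only on POS tags; the punkt tokenizer and POS tagger models must be downloaded first):

import nltk

# One-time setup: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
sentence = "show me hotels in san francisco for tomorrow"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# Chunk grammar in the NLTK-book style: noun phrases built purely from
# part-of-speech tags (optional determiner, adjectives, then nouns).
grammar = r"NP: {<DT>?<JJ>*<NN.*>+}"
tree = nltk.RegexpParser(grammar).parse(tagged)
tree.pprint()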
BTW, Jurafsky and Martin's 3rd edition of Speech and Language Processing covers chunking in the parsing chapter, and it also contains chapters on information retrieval and chatbots.

Machine learning: which algorithm fits for questions answering

I want to build an ML program that talks to the user and gets some input from them.
The ML program then analyzes the input data (keywords) and predicts the best solution.
So you are looking at an AI application which needs some sort of machine intelligence for processing natural language.
Let us say the language of choice here is English. There are many things to consider before building such a system:
Dependency parsing
Word Sense Disambiguation
Verb Sense Disambiguation
Coreference Resolution
Semantic Role Labelling
Universe of knowledge.
In brief, you need to build all of the above essential modules before you can generate your response.
You need to decide what kind of problem you are working on: is it an open-domain or a closed-domain problem? In other words, what is the scope of knowledge of this application?
For example, Google Now is an open-domain application which can take practically any possible input.
But some applications pertain to a particular task, like automating food orders in an app, where the scope of the questions that can be asked is limited.
Once that is decided, you need to parse your input sentence, and dependency parsing is the way to go. You can use the Stanford CoreNLP suite to achieve most of the NLP tasks mentioned above.
Once the input sentence is parsed and you have the subjects, objects, etc., it is time to disambiguate the words in the sentence, as a particular word can have different meanings.
Then disambiguate the verb meaning, identifying the type of verb (for example, "return" could mean going back to a place or giving something back).
Then you need to perform coreference resolution, meaning mapping the nouns, pronouns and other entities in a given context. For example:
My name is John. I work at ABC company.
Here "I" in the second sentence refers to John.
This helps us answer questions like "Where does John work?". Since John was named only in the first sentence and his workplace was mentioned in the second, coreference resolution lets us map them together.
The next task at hand is semantic role labelling, which basically means labelling all the arguments in a sentence with respect to each of its verbs.
For example: John killed Mary.
Here the verb is "kill", and John and Mary are its arguments. John takes the role A0 and Mary the role A1, where the definitions of these roles for each verb are given in a large frame and argument annotation framework created by the NLP community; here A0 means the person who killed, and A1 the person who was killed.
Once you have identified A0 and A1, just look into the definition of the "kill" frame and return A0 for the killer and A1 for the victim.
Another important task is identifying when your system must respond with an answer. For this you need to know whether the given sentence is a declarative/assertive sentence or an interrogative one; you can check that simply by seeing whether the input sentence ends with a question mark.
Now to answer your question:
Let us say your input to the application is:
Input 1: John killed Mary.
Clearly this is an assertive sentence, so just store it and process it as described above.
Now the next input is:
Input 2: Who killed Mary?
This is an interrogative sentence, so you need to come up with a reply.
Now find the semantic role labels of input 1 and input 2, and return the word of input 1 which matches the argument of "Who" in sentence 2.
In this case "who" would be labelled A0, and John is labelled A0 in the stored fact, so simply return John (a toy sketch of this lookup follows below).
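A toy sketch of just this matching step, assuming an SRL system has already produced the role labels (the dictionary below is hand-written for illustration, not real SRL output):

# Stored fact from input 1, with roles as an SRL system would label them.
facts = {"kill": {"A0": "John", "A1": "Mary"}}

def respond(text, verb, questioned_role):
    # Only interrogative sentences get an answer (the question-mark check
    # from above); declaratives would be stored instead.
    if not text.strip().endswith("?"):
        return None
    return facts.get(verb, {}).get(questioned_role)

# "Who killed Mary?" -- "who" is labelled A0, so look up A0 of the kill frame.
print(respond("Who killed Mary?", "kill", "A0"))  # John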
Most of the NLP modules mentioned can be used directly via Stanford CoreNLP; however, if you want to implement some algorithms on your own, you can go through recent publications at EMNLP, NIPS, ICML, CoNLL, etc. to understand them better and implement the one which best suits you.
Good luck!

Does an algorithm exist to identify different queries/questions in sentence?

I want to identify the different queries within a sentence.
For example, "Who is Bill Gates and where was he born?" or "Who is Bill Gates, where was he born?" contains two queries:
Who is Bill Gates?
Where was Bill Gates born?
I have worked on coreference resolution, so I can identify that "he" refers to Bill Gates; the resolved sentence is "Who is Bill Gates, where was Bill Gates born".
Likewise:
MGandhi is a good guy. Where was he born?
single query
Who is MGandhi and where was he born?
2 queries
Who is MGandhi, where was he born and died?
3 queries
India won the world cup against Australia, when?
1 query (when India won the WC against Australia)
I can perform coreference resolution, but I can't figure out how to distinguish the queries within the sentence.
How can I do this?
I checked various sentence parsers, but as this is pure NLP stuff, a sentence parser alone does not identify it.
I tried to find "sentence disambiguation", by analogy with "word sense disambiguation", but nothing like that seems to exist.
Any help or suggestion would be much appreciated.
Natural language is full of exceptions. Especially in English, it is often said that there are more exceptions than rules. So it is almost impossible to get a completely accurate solution that works every single time, but using a parser you can achieve reasonably good performance.
I like to use the Berkeley parser for such tasks. Their online demo includes a graphical representation of the parse tree, which is extremely helpful when trying to formulate heuristics.
For example, consider the question "Who is Bill Gates and where was he born?". The parse tree looks like this:
Clearly, you can split the tree at the central conjunction (CC) node to extract the individual queries (see the sketch below). In general, this will be easy if the parsed sentence is simple (there will be only one query) or compound (the individual queries can be split by looking at conjunction nodes, as above).
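Here is a sketch of that split using NLTK's Tree, on a hand-written bracketing that stands in for the Berkeley parser's output (the exact bracketing is mine, for illustration):

from nltk.tree import Tree

# Hand-written stand-in for the Berkeley parser's bracketed output.
parse = Tree.fromstring(
    "(SBARQ"
    " (SBARQ (WHNP (WP Who)) (SQ (VBZ is) (NP (NNP Bill) (NNP Gates))))"
    " (CC and)"
    " (SBARQ (WHADVP (WRB where)) (SQ (VBD was) (NP (PRP he)) (VP (VBN born)))))"
)

# Split the root's children at every coordinating conjunction (CC) node.
queries, current = [], []
for child in parse:
    if isinstance(child, Tree) and child.label() == "CC":
        queries.append(current)
        current = []
    else:
        current.append(child)
queries.append(current)

for q in queries:
    print(" ".join(leaf for sub in q for leaf in sub.leaves()))
# Who is Bill Gates
# where was he born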
Another more complex example in your question has three queries, such as "Who is Gandhi and where did he work and live?". The parse tree:
Again, you can see the conjunction node which splits "Who is Gandhi" and "where did he work and live". The parse does not, however, split the second query into two, as you would ideally want. And that brings us to the hardest part of what you are trying to do: dealing (computationally, of course) with what is known as right node raising. This is a linguistic construct where common parts get shared.
For example, consider the question "When and how did he suffer a setback?". What it really asks is (a) when did he suffer a setback?, and (b) how did he suffer a setback? Right node raising cannot be resolved by parse trees alone. It is, in fact, one of the harder problems in computational linguistics, and belongs to the domain of hardcore academic research.
