How to check whether a phrase "functions" as a noun in a sentence - machine-learning

In addition to nouns and noun phrases, there are other constructs in English that can also function as nouns. A gerund, for example, can be used as a noun: you need good habits such as "being polite".
In an app I'm developing, I need to find all the components functioning as nouns. I tried various chunking tools (NLTK, etc.), but they all seem to recognize only nouns and noun phrases and nothing else.
These chunkers also don't recognize complements as part of an NP; for example, "the fact that she's alive" will not be a single chunk even though the words together act as a noun in the sentence.
Is there any tool that can do the trick?
Thanks.

I'm afraid this level of control will require a proper statistical parser; for example, the Stanford Parser gives the following tree for your sample phrase:
(ROOT
  (NP (DT the) (NN fact)
    (SBAR (IN that)
      (S
        (NP (PRP she))
        (VP (VBZ is)
          (ADJP (JJ alive)))))
    (. .)))
recognizing that the whole segment is an NP. For the case of a gerund:
(ROOT
  (S
    (VP (VB thank)
      (NP (PRP you))
      (PP (IN for)
        (NP (NN listening))))
    (. .)))
The Stanford Parser provides an API you can use from your app.
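For example, here is a minimal sketch that drives a CoreNLP server from Python through NLTK and collects every NP subtree; it assumes a CoreNLP server is already running on localhost:9000:

# Minimal sketch: assumes a Stanford CoreNLP server is running locally on port 9000.
from nltk.parse.corenlp import CoreNLPParser

parser = CoreNLPParser(url='http://localhost:9000')

sentence = "You need good habits such as being polite."
tree = next(parser.raw_parse(sentence))

# Every subtree labelled NP is a candidate noun-like constituent,
# including NPs that contain clausal complements such as
# "the fact that she's alive".
for np in tree.subtrees(filter=lambda t: t.label() == 'NP'):
    print(' '.join(np.leaves()))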

Since SyntaxNet produces dependency parse trees, you would need to write some heuristics to get such information. A constituency parser could give you this information more directly, but would lack information about the role that the nodes play in the tree (for example, you wouldn't know whether the NP is the subject of the verb or the direct object).
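As a rough sketch of what such heuristics might look like (using spaCy here rather than SyntaxNet, and assuming the dependency labels its English model produces), you could take the whole subtree of any token that fills a nominal or clausal argument slot and treat it as one noun-like unit:

# Rough heuristic over a dependency parse (spaCy here, not SyntaxNet);
# the label set below is an assumption about the English model's scheme.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The fact that she's alive surprised everyone.")

NOUN_LIKE_DEPS = {"nsubj", "nsubjpass", "dobj", "pobj", "csubj", "ccomp", "xcomp"}

for token in doc:
    if token.dep_ in NOUN_LIKE_DEPS:
        chunk = " ".join(t.text for t in token.subtree)
        print(token.dep_, "->", chunk)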

#Roy I agree with Slav, as I had the same problem with the word "open": in my sentence, "open" was an imperative verb but SyntaxNet marked it as an adjective. I am not from a computer science background, but I wrote a very simple and basic algorithm to fix the problem; you can see it here

Related

Epsilon(ε) productions and LR(0) grammars and LL(1) grammars

In many places (for example, in this answer here), I have seen it written that an LR(0) grammar cannot contain ε productions.
Also, on Wikipedia I have seen statements like: an ε-free LL(1) grammar is also SLR(1).
Now the problem which I am facing is that I cannot reason out the logic behind these statements.
Well, I know that LR(0) grammars generate the languages accepted by a DPDA by empty stack, i.e. the language they generate must have the prefix property. [This prefix property can, however, be dealt with if we assume an end marker, and with an end marker any language satisfies it. Many texts, like Sipser's Theory of Computation, assume this end marker to simplify their arguments.] That being said, we can say (informally?) that a grammar is LR(0) if no state in the canonical collection of LR(0) items has a shift-reduce conflict or a reduce-reduce conflict.
With this background, I tried to consider the following grammar:
S -> Aa
A -> ε
canonical collection of LR(0) items
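Writing the item sets out explicitly (with the usual augmented start production S' -> S):

I0: S' -> .S
    S  -> .Aa
    A  -> .          (reduce A -> ε)
    goto on S -> I1, goto on A -> I2

I1: S' -> S.         (accept)

I2: S  -> A.a        (shift a -> I3)

I3: S  -> Aa.        (reduce S -> Aa)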
In the above DFA, I find that there is no state which has a shift-reduce conflict or reduce-reduce conflict.
So this grammar should be LR(0) as per my analysis. But it also has an ε-production.
Isn't this example contradicting the statement:
"no grammar with ε productions can be LR(0)"
I guess if I know the logic behind the above quoted statement then I can understand the concept better.
Actually, my main problem arose with the statement:
An ε-free LL(1) grammar is also SLR(1).
When I asked one of my friends, he argued that since the LL(1) grammar is ε-free, it is LR(0), and hence it is SLR(1).
But I could not understand his logic either. When I asked him for his reasoning, he started sharing posts regarding "a grammar with ε productions can never be LR(0)"...
But personally I cannot see how "an ε-free LL(1) grammar is SLR(1)". Is it really related to the above property that "a grammar with ε productions cannot be LR(0)"? If so, please do help me out. If not, should I ask a separate question for the second confusion?
I got my compiler-design concepts from the dragon book by Ullman only, and my knowledge of the theory of computation from Ullman and a few other texts such as Sipser and Linz.
A notable feature of your grammar is that A could just be eliminated. It serves absolutely no purpose. (By "eliminated", I mean simply removing all references to it, leaving the productions otherwise intact.)
It is true that its existence doesn't preclude the grammar from being LR(0). Similarly, a grammar with an unreachable non-terminal and an ε-production for that non-terminal could also be LR(0).
So it would be more accurate to say that a grammar cannot be LR(0) if it has a productive non-terminal with both an ε-production and some other productive production. But since we usually only consider reduced grammars without pointless non-terminals, I'm not sure that this additional pedantry serves much purpose.
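To see the kind of conflict behind that statement, consider as an illustration a grammar in which such a non-terminal A is reachable and used before some terminal, say S -> A c with A -> ε and A -> b. The start state's closure then contains both A -> .b (a shift on b) and A -> . (a reduce by A -> ε), which is exactly an LR(0) shift-reduce conflict.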
As for your question about ε-free LL(1) grammars, here's a rough outline:
If an ε-free grammar is not LR(0), then there is some state with both a shift and a reduce action. Since the grammar is ε-free, that state was reached by way of a shift or a goto. The previous state must then have had two different productions with the same FIRST set, contradicting the LL(1) condition.

Why does bison not convert the grammar automatically?

I am learning about lexers and parsers, so I am reading the classic book flex & bison (John Levine, O'Reilly Media).
An example is given that cannot be parsed by bison:
phrase      : cart_animal AND CART | work_animal AND PLOW
cart_animal : HORSE | GOAT
work_animal : HORSE | OX
I understand very well why it could not. Indeed, it requires TWO symbols of lookahead.
But, with a simple modification, it can be parsed:
phrase      : cart_animal CART | work_animal PLOW
cart_animal : HORSE AND | GOAT AND
work_animal : HORSE AND | OX AND
I wonder why bison is not able to transform the grammar automatically in simple cases like that?
Because simple cases like that are pretty much all artificial, and for real-life examples the transformation is difficult or impossible.
To be clear, if you have an LR(k) grammar with k>1 and you know the value of k, there is a mechanical transformation with which you can make an equivalent LR(1) grammar, and moreover you can, with some juggling, fix the reduction actions so that they have the same effect (at least, as long as they don't contain side effects). I don't know any parser generator which does that, in part because correctly translating the reduction actions will be tricky, and in part because the resulting LR(1) grammar is typically quite large, even for small values of k.
But, as I mentioned above, you need to know the value of k to perform this transformation, and it turns out that there is no algorithm which can take a grammar and tell you whether it is LR(k). So all you could do is try successively larger values of k until you find one which works, or you decide to give up.

Why can't a top-down parser handle left recursion?

I wanted to know why top-down parsers cannot handle left recursion, and why we therefore need to eliminate left recursion, as mentioned in the dragon book.
Think of what it's doing. Suppose we have a left-recursive production rule A -> Aa | b, and right now we try to match that rule. So we're checking whether we can match an A here, but in order to do that, we must first check whether we can match an A here. That sounds impossible, and it mostly is. Using a recursive-descent parser, that obviously represents an infinite recursion.
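To make that concrete, here is a tiny sketch (hypothetical code, only for illustration) of what a naive recursive-descent procedure for A -> Aa | b would look like; the function calls itself before consuming a single token, so it never terminates:

# Naive recursive-descent procedure for the left-recursive rule  A -> A a | b.
# parse_A recurses on itself before consuming any input, so it loops forever
# (in Python it dies with a RecursionError instead of making progress).
def parse_A(tokens, pos):
    # Alternative 1: A -> A a
    after_a = parse_A(tokens, pos)        # immediate self-call, pos unchanged
    if after_a is not None and after_a < len(tokens) and tokens[after_a] == 'a':
        return after_a + 1
    # Alternative 2: A -> b
    if pos < len(tokens) and tokens[pos] == 'b':
        return pos + 1
    return None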
It is possible using more advanced techniques that are still top-down, for example see [1] or [2].
[1]: Richard A. Frost and Rahmatullah Hafiz. A new top-down parsing algorithm to accommodate ambiguity and left recursion in polynomial time. SIGPLAN Notices, 41(5):46–54, 2006.
[2]: R. Frost, R. Hafiz, and P. Callaghan. Modular and efficient top-down parsing for ambiguous left-recursive grammars. ACL-IWPT, pp. 109–120, 2007.
Top-down parsers cannot handle left recursion
A top-down parser cannot handle left recursive productions. To understand why not, let's take a very simple left-recursive grammar.
S → a
S → S a
There is only one token, a, and only one nonterminal, S. So the parsing table has just one entry. Both productions must go into that one table entry.
The problem is that, on lookahead a, the parser cannot know if another a comes after the lookahead. But the decision of which production to use depends on that information.
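Concretely, FIRST(a) = {a} and FIRST(S a) = {a}, so both productions land in the single cell of the LL(1) table (the layout below is just an illustration):

        |  a
   -----+---------------------
     S  |  S → a    S → S a      (two productions in one cell: a conflict)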

Which parser is most suitable for [biomedical] relation extraction?

I have read about constituency parsers and dependency parsers, but I am confused about which would be the best choice.
My task is to extract relationships from English Wikipedia text (other sources may also be included later). What I need is a semantic path (with only the most important information) between the two entities of interest. For instance,
from the text:
"In America, diabetes is, as everybody knows, a common disease."
I need the information:
"diabetes is disease"
Which parser implementation would you suggest? Stanford? MaltParser? Or another?
Any clue is appreciated.
You mean a syntactic parser vs a dependency parser? The online Stanford Parser shows you how these parses are different.
Syntactic Parse
(ROOT
  (S
    (PP (IN In)
      (NP (NNP America)))
    (, ,)
    (NP (NNP diabetes))
    (VP (VBZ is) (, ,)
      (PP (IN as)
        (NP (NN everybody) (NNS knows)))
      (, ,)
      (NP (DT a) (JJ common) (NN disease)))))
Dependency Parse (collapsed)
prep_in(disease-13, America-2)
nsubj(disease-13, diabetes-4)
cop(disease-13, is-5)
nn(knows-9, everybody-8)
prep_as(disease-13, knows-9)
det(disease-13, a-11)
amod(disease-13, common-12)
root(ROOT-0, disease-13)
They are not that different actually (see Collins' thesis or Nivre's book for more details), but I find dependency parses easier to work with. As you can see, you get a direct relation for diabetes -> disease. Then you can attach the copula.
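For example, once you have the collapsed dependencies as plain-text triples like the ones above, attaching the copula to the nsubj relation takes only a few lines (the triple format and the little regex below are just an illustration, not part of any parser's API):

# Pull "diabetes is disease" out of collapsed Stanford dependencies
# given as plain-text triples; parsing the text format is only illustrative.
import re

deps_text = """
nsubj(disease-13, diabetes-4)
cop(disease-13, is-5)
"""

triples = re.findall(r'(\w+)\(([^-]+)-\d+, ([^-]+)-\d+\)', deps_text)
rels = {(rel, gov): dep for rel, gov, dep in triples}

for (rel, gov), dep in rels.items():
    if rel == "nsubj":
        cop = rels.get(("cop", gov), "")
        print(dep, cop, gov)              # -> diabetes is disease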
Of course, a dependency parser like the Stanford dependency parser would be the right choice for you. I would recommend using the BLLIP reranking parser with David McClosky's biomedical model for getting the phrase structure and then converting to dependencies with Stanford Dependencies. This way you would get better dependency trees/graphs for biomedical text.

Prolog Parsing Output

I'm doing a piece of university coursework, and I'm stuck with some Prolog.
The coursework is to make a really rudimentary Watson (the machine that answers questions on Jeopardy).
Anyway, I've managed to make it output the following:
noun_phrase(det(the),np2(adj(traitorous),np2(noun(tostig_godwinson)))),
verb_phrase(verb(was),np(noun(slain)))).
But the coursework specifies that I now need to extract the first and second noun, and the verb, to make a more concise sentence; i.e. [Tostig_godwinson, was, slain].
I much prefer programming in languages like C etc., so I'm a bit stuck. If this were a procedural language, I'd use parsing tools, but Prolog doesn't have any... What do I need to do to extract those parts?
Thank you in advance
In Prolog, the language is the parsing tool. Use the univ (=..) operator to do term inspection:
% find terminal nodes (words) in Tree
terminal(Tree, Type, Item) :-
    Tree =.. [Type, Item],        % a leaf such as noun(slain): Type = noun, Item = slain
    atomic(Item).
terminal(Tree, Type, Item) :-
    Tree =.. [_ | Subtrees],      % decompose any inner node into its list of subtrees
    member(Node, Subtrees),
    terminal(Node, Type, Item).
Now get a list of all nouns with findall(N, terminal(Tree, noun, N), Nouns) and get the nth1 element.
