How to create dependency graph (parse tree) for random sentences. Is there any predined grammer to parse english sentences using nltk.
Example:
I want to make a parse tree for the sentence
“A large company needs a sustainable business model.”
which should look like this.
Please suggest me how this can be done.
This question is a near-duplicate of 3125926. But I'll elaborate just a little on the answer given there.
I don't have personal experience with dependency parsing under NLTK, but according to the accepted answer, the integration with MaltParser is documented at http://nltk.googlecode.com/svn/trunk/doc/api/nltk.parse.malt.MaltParser-class.html
If for some reason MaltParser doesn't suit your needs, you might also take a look at MSTParser and the Stanford Parser. I think those three options are the best-known, and I expect one (or all) of them will work for you.
Note that the Stanford Parser includes routines to convert from constituency trees and between several of the standard dependency representations, so if you need a specific format, you might look at the format-conversion arguments to the edu.stanford.nlp.trees.EnglishGrammaticalStructure class.
e.g., to convert from constituency trees to basic dependencies:
java -cp stanford-parser.jar edu.stanford.nlp.trees.EnglishGrammaticalStructure -treeFile <input trees> -basic
Related
I was entirely amazed by how Coq's parser is implemented. e.g.
https://softwarefoundations.cis.upenn.edu/lf-current/Imp.html#lab347
It's so crazy that the parser seems ok to take any lexeme by giving notation command and subsequent parser is able to parse any expression as it is. So what it means is the grammar must be context sensitive. But this is so flexible that it absolutely goes beyond my comprehension.
Any pointers on how this kind of parser is theoretically feasible? How should it work? Any materials or knowledge would work. I just try to learn about this type of parser in general. Thanks.
Please do not ask me to read Coq's source myself. I want to check the idea in general but not a specific implementation.
Indeed, this notation system is very powerful and it was probably one of the reasons of Coq's success. In practice, this is a source of much complication in the source code. I think that #ejgallego should be able to tell you more about it but here is a quick explanation:
At the beginning, Coq's documents were evaluated sentence by sentence (sentences are separated by dots) by coqtop. Some commands can define notations and these modify the parsing rules when they are evaluated. Thus, later sentences are evaluated with a slightly different parser.
Since version 8.5, there is also a mechanism (the STM) to evaluate a document fully (many sentences in parallel) but there is some special mechanism for handling these notation commands (basically you have to wait for these to be evaluated before you can continue parsing and evaluating the rest of the document).
Thus, contrary to a normal programming language, where the compiler will take a document, pass it through the lexer, then the parser (parse the full document in one go), and then have an AST to give to the typer or other later stages, in Coq each command is parsed and evaluated separately. Thus, there is no need to resort to complex contextual grammars...
I'll drop my two cents to complement #Zimmi48's excellent answer.
Coq indeed features an extensible parser, which TTBOMK is mainly the work of Hugo Herbelin, built on the CAMLP4/CAMLP5 extensible parsing system by Daniel de Rauglaudre. Both are the canonical sources for information about the parser, I'll try to summarize what I know but note indeed that my experience with the system is short.
The CAMLPX system basically supports any LL1 grammar. Coq exposes to the user the whole set of grammar rules, allowing the user to redefine them. This is the base mechanism on which extensible grammars are built. Notations are compiled into parsing rules in the Metasyntax module, and unfolded in a latter post-processing phase. And that really is AFAICT.
The system itself hasn't changed much in the whole 8.x series, #Zimmi48's comments are more related to the internal processing of commands after parsing. I recently learned that Coq v7 had an even more powerful system for modifying the parser.
In words of Hugo Herbelin "the art of extensible parsing is a delicate one" and indeed it is, but Coq's achieved a pretty great implementation of it.
If I want to train the Stanford Neural Network Dependency Parser for another language, there is a need for a "treebankLanguagePack"(TLP) but the information about this TLP is very limited:
particularities of your treebank and the language it contains
If I have my "treebank" in another language that follows the same format as PTB, and my data is using CONLL format. The dependency format follows the "Universal Dependency" UD. Do I need this TLP?
As of the current CoreNLP release, the TreebankLanguagePack is used within the dependency parser only to 1) determine the input text encoding and 2) determine which tokens count as punctuation [1].
Your best bet for a quick solution, then, is probably to stick with the UD English TreebankLanguagePack. You should do this by specifying the property language as "UniversalEnglish" (whether you're accessing the dependency parser via code or command line). If you're using the dependency parser via the CoreNLP main entry point, this property key should be depparse.language.
Technical details
Two very subtle details follow. You probably don't need to worry about these if you're just trying to hack something together at first, but it's probably good to mention so that you can avoid apocalyptic / head-smashing bugs in the future.
Evaluation and punctuation: If you do choose to stick with UniversalEnglish, be aware that there is a hack in the evaluation code that overrides the punctuation set for English parsing in particular. Any changes you make to punctuation in PennTreebankLanguagePack (the TLP used for the UniversalEnglish language) will be ignored! If you need to get around this, it should be enough to copy and paste the PennTreebankLanguagePack into your own codebase and name it something different.
Potential memory leak: When building parse results to be returned to the user, the dependency parser draws from a pool of cached GrammaticalRelation objects. This cache does not live-update. This means that if you have relations which aren't formally defined in the language you specified via the language property, they will lead to the instantiation of a new object whenever those relations show up in parser predictions. (This can be a big deal memory-wise if you happen to store the parse objects somewhere.)
[1]: Punctuation is excluded during evaluation. This is a standard "cheat" used throughout the dependency parsing literature.
I've heard that "real compiler writers" roll their own handmade parser rather than using parser generators. I've also heard that parser generators don't cut it for real-world languages. Supposedly, there are many special cases that are difficult to implement using a parser generator. I have my doubts about this:
Theoretically, a GLR parser generator should be able to handle most programming language designs (except maybe C++...)
I know of at least one production language that uses a parser generator: Ruby [1].
When I took my compilers class in school, we used a parser generator.
So my question: Is it reasonable to write a production compiler using a parser generator, or is using a parser generator considered a poor design decision by the compiler community?
[1] https://github.com/ruby/ruby/blob/trunk/parse.y
For what it's worth, GCC used a parser generator pre-4.0 I believe, then switched to a hand written recursive descent parser because it was easier to maintain and extend.
Parser generators DO "cut it" for "real" languages, but the amount of work to transform your grammar into something workable grows exponentially.
Edit: link to the GCC document detailing the change with reasons and benefits vs cost analysis: http://gcc.gnu.org/wiki/New_C_Parser.
I worked for a company for a few years where we were more or less writing compilers. We weren't concerned much with performance; just reducing the amount of work/maintenance. We used a combination of generated parsers + handwritten code to achieve this. The ideal balance is to automate the easy, repetitive parts with the parser generator and then tackle the hard stuff in custom functions.
Sometimes a combination of both methods, is used, like generating code with a parser, and later, modifying "by hand" that code.
Other way is that some scanner (lexer) and parser tools allow them to add custom code, additional to the grammar rules, called "semantic actions". A good example of this case, is that, a parser detects generic identifiers, and some custom code, transform some specific identifiers into keywords.
EDIT:
add "semantic actions"
Is there existing software for discriminative reranking, such as that used by the Charniak NLP parser, Shen, Sarkar, and Och's parser or Shen and Joshi's techniques? I'd like something that I can easily adapt for my own uses, which are similar to parse reranking.
Charniak-Johnson Reranking
The source code for the Charniak-Johnson (CJ) reranking parser is freely available, you can download a copy here.
The reranker is a separate code module that takes as input n-best lists of parses, so it's trivial to decouple it from the parsing front end.
SVM-rank
Alternatively, the package SVM-rank, from Thorsten Joachim's lab at Cornell, is a general purpose ranker. It might be easier to go with this package, if what you want to do deviates significantly from what's being done by the Charniak-Johnson parser.
I have stumbled upon the following F77 yacc grammar: http://yaxx.cvs.sourceforge.net/viewvc/yaxx/yaxx/fortran/fortran.y?revision=1.3&view=markup.
How can I make a Fortran 77 parser out of this file using Happy?
Why is there some C?/C++? code in that .y file?
UPDATE: Thank you for your replies!
I've been playing with two fresh approaches for a while now:
extracting and modifiying the parser from the source code package bundled with a paper titled Parametric Fortran,
writing a grammar from scratch with the help of BNFC.
I've got both to parse simple code excerpts already. I'll keep people in the know should something usable come into existence within this century ^__^" hehe.
P/S: Want to see whether I could gather enough momentum on my own to initiate a project for an automatic differentiation engine to replace a binary-only one we depend on for the time being. For entertainment at the initial stages: I'm watching Love Shuffle! It's a very enjoyable J-Drama! Highly recommendable ...
The C is the semantic action for reducing the stack when the syntax is read in. These actions are in C because the definition is intended for Bison/Yacc which produces a C source file.
If you want to use Happy, port the BNF to the Happy definition syntax and write your semantics in Haskell.
Just the tip of the iceberg for getting anything useful however.
If you don't have a copy already, invest in the Dragon Book (Compilers: Principles, Techniques & tools by Aho, Lam, Sethi, Ullman - Pearson)
Why the other answers are true in the general sense, in that you'll need to write your own actions to do anything meaningful the Yacc definition that you linked to actually doesn't have any actions associated with the grammar rules. What it does is that it defines the yyerror function and some code for extracting values from yylval based on the token type.
If you have no clue what yyerror/yylval are about you should read a bison/flex tutorial. The Dragon book is also a good resource if you're more serious about this. There are also some excellent handouts from a Stanford course on compilers floating around the Net, which are based on the book.
You'll need an AST to build that can be constructed in an equivalent way to the C fragments in the Yacc file.
Use BNFC and write your own grammar from scratch! BNFC works wonders and you could do your parsing exactly as you desire.