Parsing CoNLL-U files with NLTK

I know there are CoNLL-U parsers in Python. I would just like to get confirmation that NLTK does not have a native routine to parse CoNLL-U (or other CoNLL formats with dependency syntax).
Looking at the code, it seems HEAD and DEP are not among the permitted column types of the conll corpus reader. This is very unexpected: CoNLL-U is very popular nowadays, dependency syntax has been a core feature of many CoNLL formats for about 15 years, and this gap is not documented anywhere, so I'm pretty sure I'm overlooking something.

The Python library conllu can parse it.
courtesy: this answer to "Why can't I read in .conll file with Python (confusing parse-error)?"
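For illustration, here is a minimal sketch with conllu; the parse() call and the token fields (form, head, deprel) are the library's documented CoNLL-U fields, and the two-token fragment is made up:

    from conllu import parse

    # A tiny CoNLL-U fragment (columns: ID, FORM, LEMMA, UPOS, XPOS,
    # FEATS, HEAD, DEPREL, DEPS, MISC), tab-separated.
    data = (
        "1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_\n"
        "2\tcat\tcat\tNOUN\t_\t_\t0\troot\t_\t_\n"
        "\n"
    )

    for sentence in parse(data):
        for token in sentence:
            # HEAD and DEPREL are first-class fields here.
            print(token["form"], token["head"], token["deprel"])
    # -> The 2 det
    #    cat 0 root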

Related

What makes libadalang special?

I have been reading about libadalang [1] [2] and I am very impressed by it. However, I was wondering whether this technique has already been used, i.e. whether any other language has a library for syntactically and semantically analyzing its code. Is this a unique approach?
C and C++: libclang "The C Interface to Clang provides a relatively small API that exposes facilities for parsing source code into an abstract syntax tree (AST), loading already-parsed ASTs, traversing the AST, associating physical source locations with elements within the AST, and other facilities that support Clang-based development tools." (See libtooling for a C++ API)
Python: See the ast module in the Python Language Services section of the Python Library manual. (The other modules in that section can be useful as well.) A minimal example follows this list.
JavaScript: The ongoing ESTree effort is attempting to standardize parsing services over different JavaScript engines.
C# and Visual Basic: See the .NET Compiler Platform ("Roslyn").
I'm sure there are lots more; those ones just came off the top of my head.
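As a quick illustration of the Python ast route from the list above (standard library only; the toy source is made up):

    import ast

    source = "def add(a, b):\n    return a + b\n"

    tree = ast.parse(source)            # source text -> AST
    for node in ast.walk(tree):         # generic traversal of all nodes
        if isinstance(node, ast.FunctionDef):
            print(node.name, [arg.arg for arg in node.args.args])
    # -> add ['a', 'b']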
For a practical and theoretical grounding, you should definitely (re)visit the classic textbook Structure and Interpretation of Computer Programs by Abelson & Sussman (1st edition 1985, 2nd edition 1996), which helped popularise the idea of metacircular interpretation: treating a computer program as a formal data structure which can be interpreted (or otherwise analysed) programmatically.
You can see "libadalang" as ASIS Mark II. AdaCore seems to be attempting to rethink ASIS in a way that will support both what ASIS already can do, and more lightweight operations, where you don't require the source to compile, to provide an analysis of it.
Hopefully the final API will be nicer than that of ASIS.
So no, it is not a unique approach. It has already been done for Ada. (But I'm not aware of similar libraries for other languages.)

Verilog gate level parser

I want to parse Verilog gate-level code and store the data in a data structure (e.g. a graph).
Then I want to do something with the gates in C/C++ and output a corresponding Verilog file.
(I would like to build one program whose input and output are both Verilog gate-level code: input.v => myProgram => output.v.)
Is there any library or open-source code to do this?
I found that it can be done with Flex and Bison, but I have no idea how to use them.
There was a similar question a few days ago about doing this in Ruby, in which I pointed to my Verilog parser gem. I'm not sure it is robust enough for you, though; I would love feedback, bug reports, and feature requests.
There are Perl Verilog parsers out there, but I have not used any of them directly and I avoid Perl; hopefully others can add info about other parsers.
I have used Verilog-Perl successfully to parse Verilog code. It is well-maintained: it even supports the recent SystemVerilog extensions.
Yosys (https://github.com/cliffordwolf/yosys) is a framework for Verilog synthesis written in C++. Yosys is still under construction, but if you only want to read and write gate-level netlists it can do what you need.
PS: A reference manual (that also covers the C++ APIs) is on the way. I've written ~100 pages already, but can't publish it before I've finished my BSc. thesis (another month or so).
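For the input.v => myProgram => output.v flow asked about above, here is a minimal sketch that drives Yosys as an external tool from Python; yosys -p runs the given semicolon-separated commands, and the file names are the question's placeholders:

    import subprocess

    # Read a gate-level netlist and write it back out; read_verilog and
    # write_verilog are standard Yosys commands.
    subprocess.run(
        ["yosys", "-p", "read_verilog input.v; write_verilog output.v"],
        check=True,
    )

Anything graph-like you want to do to the netlist would go between those two commands, either as further Yosys passes or through the C++ API mentioned above.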

Any references for parsing incomplete or incorrect code?

Can anybody point me at references on techniques for parsing code that contains syntax errors, or is missing necessary punctuation, for example?
The application that I'm working on is an IDE, where we'd like to provide features like "jump to definition", auto-complete, and refactoring features, without requiring the source to be syntactically correct at the moment the functions are invoked.
Most parser code I've seen appears to work on the principle of "fail early", rather than focusing on error recovery or parsing partially-complete code.
Have you tried ANTLR?
In "The Definitive ANTLR Reference", section 10.7 Automatic Error Recovery Strategy for 5 pages Terrence talks about this. He references Algorithms + Data Structures = Programs, A Note on Error Recovery in Recursive Descent Parsers, Efficient and Comfortable Error Recovery in Recursive Descent Parsers.
Also see the pages from the web site:
Error reporting and recovery
ANTLR 3.0 Error Reporting and Recovery
Custom Syntax Error Recovery
Also check the ANTLR tag for access to the ANTLR forum, where Terence Parr answers questions. He answers some questions here too, as The ANTLR Guy.
Also, the new version, ANTLR 4, is due out soon, as is its book.
Sorry to sound like a sales pitch, but I have been using ANTLR for years because it is used by lots of people and in production systems, has several solid runtime targets (Java, C, C#), has a very active community and a web site, has books, is evolving and maintained, is open source (BSD license), is easy to use, and has some GUI tools.
One of the people working on a GUI for ANTLR 4 that has syntax highlighting and auto-completion, among other useful IDE editing features, is Sam Harwell. If you can reach him through the ANTLR forum, he might be able to help you out.
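For what it's worth, error reporting in ANTLR 4's Python runtime is hooked by swapping the error listener; here is a hedged sketch (MyParser is a hypothetical generated class, while syntaxError, removeErrorListeners, and addErrorListener are the runtime's standard hooks):

    from antlr4.error.ErrorListener import ErrorListener

    class CollectingErrorListener(ErrorListener):
        def __init__(self):
            self.errors = []

        def syntaxError(self, recognizer, offendingSymbol, line, column, msg, e):
            # Record the error instead of printing it; ANTLR's default
            # strategy recovers (e.g. single-token insertion/deletion)
            # and keeps producing a parse tree.
            self.errors.append(f"{line}:{column} {msg}")

    # Typical wiring with a generated parser (hypothetical MyParser):
    #   parser = MyParser(tokens)
    #   parser.removeErrorListeners()
    #   listener = CollectingErrorListener()
    #   parser.addErrorListener(listener)
    #   tree = parser.startRule()  # tree is still produced; listener.errors lists the issues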
I don’t know of any papers or tutorials, but uu-parsinglib is a Haskell parsing library that can recover from syntax errors in a general fashion. If, for example, ; was expected but int was received, the parser can continue as though ; were inserted at that source position.
It’s up to you where the parser will fail and where it will proceed with corrections, and the results will be delivered alongside a set of the errors corrected during parsing. Even if you don’t intend to implement your parsing code in Haskell, an examination of the library may offer you some insight. Or you can write a parser in Haskell and call it from C.
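To make that insertion-style recovery concrete, here is a minimal sketch in Python rather than Haskell (the toy grammar and token names are made up): when an expected token is missing, the parser records an error, pretends the token was there, and keeps going.

    # Toy grammar: stmt -> NAME EQ NUM SEMI
    def parse_stmts(tokens):
        errors, stmts, pos = [], [], 0

        def expect(kind):
            nonlocal pos
            if pos < len(tokens) and tokens[pos][0] == kind:
                pos += 1
                return tokens[pos - 1]
            # Recovery: act as though the expected token had been inserted.
            errors.append(f"expected {kind} at token {pos}")
            return (kind, None)

        while pos < len(tokens):
            start = pos
            name = expect("NAME")
            expect("EQ")
            num = expect("NUM")
            expect("SEMI")
            stmts.append((name[1], num[1]))
            if pos == start:  # nothing consumed: skip a token to avoid looping
                pos += 1
        return stmts, errors

    # The '=' is missing in the second statement; parsing still completes.
    toks = [("NAME", "x"), ("EQ", "="), ("NUM", "1"), ("SEMI", ";"),
            ("NAME", "y"), ("NUM", "2"), ("SEMI", ";")]
    print(parse_stmts(toks))
    # -> ([('x', '1'), ('y', '2')], ['expected EQ at token 5'])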
Research on "Island grammars" may interest you. It's been a while since I looked at them, but I believe that they are supposed to reasonably handle cases where there are many chunks of nonsense in the file. I didn't have much luck with CiteSeer (oddly; usually it's pretty good), but Google Scholar found a number of relevant papers. Generating robust parsers using island grammars looks like a good place to start.

What front-end can I use with RPython to implement a language?

I've looked high and low for examples of implementing a language using the RPython toolchain, but the only one I've been able to find so far is this one in which the author writes a simple BF interpreter. Because the grammar is so simple, he doesn't need to use a parser/lexer generator. Is there a front-end out there that supports developing a language in RPython?
Thanks!
I'm not aware of any general lexer or parser generator targeting RPython specifically. Some with Python output may work, but I wouldn't bet on it. However, there's a set of parsing tools in rlib.parsing. It seems quite usable. OTOH, there's a warning in the documentation: It's reportedly still in development, experimental, and only used for the Prolog interpreter so far.
Alternatively, you can write the front end by hand. Lexers can be annoying and unnatural, granted (you may be able to rip out the utility modules for DFAs used by the Python implementation). But parsers are a piece of cake if you know the right algorithms. I'm a huge fan of "top down operator precedence parsers", a.k.a. "Pratt parsers", which are reasonably simple (recursive descent) but make all expression parsing issues (nesting, precedence, associativity, etc.) a breeze. There's depressingly little information on them, but the few blog posts were sufficient for me (see the sketch after this list):
One by Crockford (wouldn't recommend it though, it throws a whole lot of unrelated stuff into the parser and thus obscures it),
another one at effbot.org (uses Python),
and a third by a sadly even-less-famous guy who's developing a language himself, Robert Nystrom.
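A minimal Pratt-style sketch in plain Python (the tiny expression grammar is illustrative, and it evaluates while parsing to keep the code short):

    import re

    def tokenize(src):
        # integers and single-character operators; whitespace is skipped
        return [("NUM", int(t)) if t.isdigit() else ("OP", t)
                for t in re.findall(r"\d+|[+\-*/()]", src)] + [("END", "")]

    LBP = {"+": 10, "-": 10, "*": 20, "/": 20}  # left binding powers

    class Parser:
        def __init__(self, src):
            self.toks, self.pos = tokenize(src), 0

        def advance(self):
            tok = self.toks[self.pos]
            self.pos += 1
            return tok

        def parse(self, rbp=0):
            kind, val = self.advance()
            # "nud": how a token starts an expression
            if kind == "NUM":
                left = val
            elif val == "(":
                left = self.parse(0)
                assert self.advance()[1] == ")", "expected )"
            else:
                raise SyntaxError(f"unexpected {val!r}")
            # "led": keep consuming operators that bind tighter than rbp;
            # precedence and associativity fall out of the binding powers
            while (self.toks[self.pos][0] == "OP"
                   and LBP.get(self.toks[self.pos][1], 0) > rbp):
                op = self.advance()[1]
                right = self.parse(LBP[op])
                left = {"+": left + right, "-": left - right,
                        "*": left * right, "/": left / right}[op]
            return left

    print(Parser("1 + 2 * (3 - 1)").parse())  # -> 5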
Alex Gaynor has ported David Beazley's excellent PLY to RPython. Its documentation is quite good, and he even gave a talk about using it to implement an interpreter at PyCon US 2013.
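That port is published as rply; here is a minimal sketch of its API with a toy grammar of my own (LexerGenerator, ParserGenerator, production, and getstr are the library's documented entry points):

    from rply import LexerGenerator, ParserGenerator

    lg = LexerGenerator()
    lg.add("NUMBER", r"\d+")
    lg.add("PLUS", r"\+")
    lg.ignore(r"\s+")

    pg = ParserGenerator(["NUMBER", "PLUS"])

    @pg.production("expr : expr PLUS NUMBER")
    def expr_plus(p):
        return p[0] + int(p[2].getstr())

    @pg.production("expr : NUMBER")
    def expr_num(p):
        return int(p[0].getstr())

    lexer, parser = lg.build(), pg.build()
    print(parser.parse(lexer.lex("1 + 2 + 3")))  # -> 6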

Are there any examples or documentation on using the Castalia source parser?

While I have written plenty of recursive parsers before, I have recently become interested in the Castalia Delphi Parser (why re-invent the wheel?). I know this parser has been used in many projects over the years, but finding any documentation for it seems difficult.
Where exactly can I find the documentation? Or, as an alternative, are there any clear-cut examples of using it in a real-life parsing scenario?
The idea is to use Castalia for syntax verification of Delphi units and, if possible, to generate a node tree of a program (with classes, their methods, parameters, result datatypes, if/then/else; basically a full map of a unit or program). You could think of it as "half a script runtime" that never actually runs any code, just breaks it down into its most fundamental aspects.
Why don't you use JvInterpreterParser? It has only 2-3 unit dependencies, can easily be modified to fit your needs, and you can also improve the speed: in an old test I parsed an 80 MB file in about 6 seconds on a Pentium 4 running at about 2.8 GHz.
Using the parser is described here:
http://delphiblog.twodesk.com/using-the-castalia-delphi-parser
The post also references some projects that are using the parser.
Here is another one:
https://github.com/LaKraven/MonkeyMixer
