Text parsing library

A colleague of mine works on a universal text parsing library based on C# lambdas. The core looks cool, but unfortunately for me he has hardcoded a grammar specific to his own task: math expression evaluation. So I will not use it as I had intended before I saw the API. Now I'm looking for another library that meets at least some of my requirements. It has to:
Be able to load a grammar from an external file -- say, XML, YAML or JSON.
Return an AST from the grammar, and a parse tree built from any text.
Work fast enough to load a C# grammar and then parse a large code file.
I'd prefer a library whose grammar file format is simple enough that writing a grammar for math expressions is easy, and that is open source and written in C# or C++.
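Ideally, usage would look something like this (a purely hypothetical API I sketched just to illustrate the requirements; none of these type or method names come from a real library):

    // Hypothetical sketch of the kind of API I'm after; not a real library.
    var grammar = Grammar.LoadFromFile("math.grammar.json"); // requirement 1: grammar from external file
    var parser  = new GenericParser(grammar);
    ParseTree tree = parser.Parse("2 * (3 + 4)");            // requirement 2: parse tree from any text
    AstNode ast = tree.ToAst();                              // ...plus an AST shaped by the grammar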
UPDATED: point 2 has been corrected.

You might check out Text Transformer, which claims to be some kind of universal text processing language. I have no specific experience with it.
Building robust language front ends and usable processing tools is actually a lot of work.
If you want to process computer languages in a generic way, you might consider our DMS Software Reengineering Toolkit, a kind of generalized compiler technology for parsing, analyzing, transforming, and/or generating code (or any other kind of formal document).
DMS will accept arbitrary context-free grammars for languages, automatically builds an AST with no additional specification effort on your part, and is designed to handle not only large files but very large sets of files in a single computation. Normally, people who want to process code need pattern recognition, code analysis, and code transformation capabilities; DMS has all of these built in. It also has a variety of predefined, mature grammars for a wide variety of computer languages, both well-known (C, C++, C#, COBOL, Java, JavaScript, ...) and otherwise (Natural, EGL, Python, MATLAB, ...), and has been used to carry out massive automated analyses and transformations on programs in these various languages.
DMS does not meet your open-source or C#/C++ implementation requirements. It is implemented as a set of domain-specific languages for describing grammars, analyzers, transformations, prettyprinters, and scripting, and it allows parallel execution so that complex analyses can run faster than single-threaded programs.

Related

What are common properties in an Abstract Syntax Tree (AST)?

I'm new to compiler design and have been watching a series of YouTube videos by Ravindrababu Ravula.
I am creating my own language for fun, and I'm parsing it to an Abstract Syntax Tree (AST). My understanding is that these trees can be portable, provided they follow the same structure as other languages.
How can I create an AST that will be portable?
Side notes:
My parser is currently written in JavaScript, but I might move it to C#.
I've been looking at SpiderMonkey's specs for guidance. Is that a good approach?
Portability (however defined) is not likely to be your primary goal in building an AST. Few (if any) compiler frameworks provide a clear interface which allows the use of an external AST, and particular AST structures tend to be badly-documented and subject to change without notice. (Even if they are well-documented, the complexity of a typical AST implementation is challenging.)
An AST is very tied to the syntactic details of a language, as well as to the particular parsing strategy being used. While it is useful to be able to repurpose ASTs for multiple tasks -- compiling, linting, pretty-printing, interactive editing, static analysis, etc. -- the conflicting demands of these different use cases tend to increase complexity. Particularly at the beginning stages of language development, you'll want to give yourself a lot of scope for rapid prototyping.
The most tempting reason for portable ASTs would be to use some other language as a target, thereby saving the cost of writing a code generator, etc. However, in practice it is usually easier to generate the textual representation of the other language from your own AST than to force your parser to use a foreign AST. Even better is to target a well-documented virtual machine (LLVM, .NET IL, JVM, etc.), which is often not much more work than generating, say, C code.
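As a toy illustration of that last point (invented C# types, not from any particular compiler): keep your own node shapes and give each one a method that renders target-language text, rather than contorting your parser to build a foreign tree:

    // Toy AST whose nodes know how to emit C source text (illustrative sketch).
    abstract class Expr
    {
        public abstract string EmitC(); // render this subtree as C source
    }

    class IntLit : Expr
    {
        public int Value;
        public override string EmitC() => Value.ToString();
    }

    class Add : Expr
    {
        public Expr Left, Right;
        public override string EmitC() => $"({Left.EmitC()} + {Right.EmitC()})";
    }

    // new Add { Left = new IntLit { Value = 1 },
    //           Right = new IntLit { Value = 2 } }.EmitC()  ==>  "(1 + 2)"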
You might want to take a look at the LLVM Kaleidoscope tutorial (the second section covers ASTs, although implemented in C++). Also, you might find this question on a sister site interesting reading. And finally, if you are going to do your implementation in JavaScript, you should at least take a look at the jison parser generator, which takes a lot of the grunt work out of maintaining a parser and scanner (and thus allows for easier experimentation).

abstract syntax tree for imperative languages

I am looking for an abstract syntax tree representation that can be used for common imperative languages (Java, C, python, ruby, etc). I would like this to be as close to source as possible (as opposed to something like LLVM). I found Rose online but it is only able to handle C and Fortran. Does this exist?
You won't find "one" universal AST that can represent many languages. People have been searching for 50 years.
The essential reason is that an AST node implicitly represents the precise language semantics of the operator it encodes, and different languages have different semantics for what are apparently the same operators.
For example, the "+" operator in modern Fortran will add integers, reals, complex values, and slices of arrays of such things. Java's "+" will add integers and reals, and will glue strings together. If I wrote "a+b" in a "universal AST", how would you know which semantic effect the corresponding AST encoded?
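To make that concrete, here is a toy C# sketch (types invented for this answer): as soon as you try to give a single "universal" Add node an actual meaning, you are forced to pick one language's semantics:

    // One "universal" Add node cannot carry every language's "+" semantics.
    class Add
    {
        public object Left, Right;

        public object Evaluate()
        {
            // Java's "+": if either side is a string, glue strings together.
            if (Left is string || Right is string)
                return Left.ToString() + Right.ToString();

            // Fortran's "+" would instead have to add integers, reals, complex
            // values, or whole array slices elementwise. There is no way to
            // honor both behaviors without tagging the node with a language.
            return (int)Left + (int)Right;
        }
    }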
What you can do is build a system in which the ASTs for different languages are represented uniformly, so that you can share tool infrastructure across many languages. This is done by many Program Transformation Systems (PTS), where you provide the grammar (or pick one from an available library), and the PTS parses and builds an AST using its uniform representation. Most PTS provide additional support to analyze and transform the code.
So, it sounds as if all you need is a PTS and some sweat to define a grammar. That's really not true: getting a grammar right for a real language is actually pretty hard. Worse, there's a lot to Life After Parsing, because you need the meaning of symbols and additional inferences such as control and data flow analysis. So you need full front ends (e.g., parsing, name/type resolution, flow analysis, ...), or as much of them as you can get, if you don't want to be distracted for months before beginning your real work.
What this means in practice is you want to find a tool that handles the languages of interest to you, with mature front ends already available:
Rose (which you already found) handles C, C++ and Fortran. It has no built-in parsing capability of its own; its front ends are custom built, so it is apparently hard to extend to other languages. But it has good flow analysis capabilities and provides means to transform the code via hand-written AST walks/smashes.
Clang handles C and C++, likewise with hand-built front ends. It can also transform code, again by hand-written AST walks/smashes, with a small amount of pattern-matching support. As I understand it, you have to use the LLVM part of Clang to do flow analysis.
Our DMS Software Reengineering Toolkit has full front ends for C, C++, Java and COBOL, and full parsers for many more languages such as Python. DMS provides pattern-based analysis and source-to-source transformation. It operates directly from a grammar (see one for Oberon, Niklaus Wirth's latest language). (I don't know of any tool that handles Ruby, which is famously hard to parse; I understand its grammar is ambiguous, and DMS is good at handling ambiguous grammars.)

Source of parsers for programming languages?

I'm dusting off an old project of mine which calculates a number of simple metrics about large software projects. One of the metrics is the length of files/classes/methods. Currently my code "guesses" where class/method boundaries are, based on a very crude algorithm: traverse the file, maintaining a "current depth" and adjusting it whenever you encounter unquoted brackets; when you return to the level at which a class or method began, consider it exited. However, there are many problems with this procedure, and a "simple" way of detecting when your depth has changed is not always effective.
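In rough terms, the current guessing logic is the following simplified sketch (C# here, though the project's language doesn't really matter; this is illustrative, not my actual code):

    using System.IO;

    // Crude depth tracker: counts unquoted braces, ignoring comments,
    // escapes, and char literals -- exactly the kind of thing that breaks.
    int depth = 0;
    bool inString = false;
    foreach (char c in File.ReadAllText("SomeSource.cs"))
    {
        if (c == '"') inString = !inString;  // naive string handling
        if (inString) continue;
        if (c == '{') depth++;               // possible class/method entry
        if (c == '}') depth--;               // back at opening depth => exited
    }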
To make this give accurate results, I need to use the canonical way (in each language) of detecting function definitions, class definitions and depth changes. This amounts to writing a simple parser to generate parse trees containing at least these elements for every language I want my project to be applicable to.
Obviously parsers have been written for all these languages before, so it seems like I shouldn't have to duplicate that effort (even though writing parsers is fun). Is there some open-source project which collects ready-to-use parser libraries for a bunch of source languages? Or should I just be using ANTLR to make my own from scratch? (Note: I'd be delighted to port the project to another language to make use of a great existing resource, so if you know of one, it doesn't matter what language it's written in.)
If you want language-accurate parsing, especially in the face of language complications such as macros and preprocessor conditionals, you need full language parsers. These are actually quite a lot of work to construct, and most languages don't lend themselves nicely to the various kinds of parser generators around. Nor are most authors of a language parser interested in other languages; they tend to choose some parser generator that isn't obviously a huge roadblock when they start, implement their parser for the specific purpose they intend, and move on.
Consequence: there are very few libraries of language definitions around that are defined using a single formalism or a shared foundation. The ANTLR crowd maintains one of the larger sets, IMHO, although as far as I can tell most of those parsers are not quite production-capable. There's always Bison, which has been around long enough that you'd expect a library of language definitions to have been collected somewhere, but I've never seen one.
I've spent the last 15 years defining foundation machinery for program analysis and transformation, and building just such a library, called the DMS Software Reengineering Toolkit. It has production-quality parsers for C, C++, C#, Java, COBOL (IBM Enterprise version), JCL, PHP, Python, etc. Your opinion may of course vary from mine, but these are used daily with DMS to carry out mass-change tasks on large bodies of code.
I don't know of any others where the set of language definitions is mature and built on a single foundation. It may be that IBM's compilers are such a set, but IBM doesn't offer the machinery or the language definitions.
If all you want to do is compute simple metrics, you might be able to live with just lexers and ad hoc nest-counting (as you've described). Even that is harder than it looks to get right in most cases (check out Python's, Perl's and PHP's crazy string syntaxes). When all is said and done, even C takes a surprising amount of work just to define an accurate lexer: we have several thousand lines of sophisticated regular expressions to cover all the strange lexemes you find in Microsoft and/or GNU C.
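For a flavor of why: even a first-cut pattern for plain C string literals already needs an escape-aware alternation, and it still ignores prefixes (L, u8), line continuations, and the vendor extensions. An illustrative sketch, nowhere near production:

    using System.Text.RegularExpressions;

    // First approximation of a C string literal: a quote, then any number of
    // escaped characters or characters that aren't a quote/backslash/newline,
    // then a closing quote. Real lexers need far more than this.
    var cStringLiteral = new Regex(@"""([^""\\\n]|\\.)*""");

    // cStringLiteral.IsMatch("\"hello\\n\"")  ==>  true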
Because DMS has consistently defined, mature parsers for many languages, it follows that DMS has consistently defined, mature lexers for the same languages. We actually built a Source Code Search Engine (SCSE) that provides fast search across large bodies of code in multiple languages; it works by lexing the languages it encounters and indexing those lexemes for fast lookup. The SCSE happens to compute the kind of metrics you are discussing as it indexes the code base, pretty much the way you describe, except that it has these language-accurate lexers to use.
You might be interested in gcc-xml if you are parsing C++. Java CUP has grammars for the Java language.

ANTLR vs. Happy vs. other parser generators

I want to write a translator between two languages, and after some reading on the Internet I've decided to go with ANTLR. I had to learn it from scratch, but aside from some trouble with eliminating left recursion, everything has gone fine so far.
However, today someone told me to check out Happy, a Haskell-based parser generator. I have no Haskell knowledge, so I could use some advice on whether Happy is indeed better than ANTLR and whether it's worth learning.
Specifically, what concerns me is that my translator needs to support macro substitution, which I have no idea how to do in ANTLR yet. Maybe this is easier to do in Happy?
Or if you think other parser generators are even better, I'd be glad to hear about them.
People keep believing that if they just get a parser, they've got it made when building language tools. That's just wrong. Parsers get you to the foothills of the Himalayas; then you need to start climbing seriously.
If you want industrial-strength support for building language translators, see our DMS Software Reengineering Toolkit. DMS provides:
Unicode-based lexers
full context-free parsers (left recursion? No problem! Arbitrary lookahead? No problem. Ambiguous grammars? No problem)
full front ends for C, C#, COBOL, Java, C++, JavaScript, ...
(including full preprocessors for C and C++)
automatic construction of ASTs
support for building symbol tables with arbitrary scoping rules
attribute grammar evaluation, to build analyzers that leverage the tree structure
support for control and data flow analysis (as well as realizations of this for full C, Java and COBOL),
source-to-source transformations using the syntax of the source AND the target language
AST to source code prettyprinting, to reproduce target language text
Regarding the OP's request to handle macros: our C, COBOL and C++ front ends handle their respective language preprocessing by a) the traditional method of full expansion or b) non-expansion (where practical) to enable post-parsing transformation of the macros themselves. While DMS as a foundation doesn't specifically implement macro processing, it can support the construction and transformation of same.
As an example of a translator built with DMS, see the discussion of converting JOVIAL to C for the B-2 bomber. This was a 100% translation of more than 1 MSLOC of hard real-time code. [It may amuse you to know that we were never allowed to see the actual program being translated (top secret).] And yes, JOVIAL has a preprocessor, and yes, we translated most JOVIAL macros into equivalent C versions.
[Haskell is a cool programming language, but it doesn't do anything like this by itself. This isn't about what's expressible in the language. It's about figuring out what machinery is required to support the task of manipulating programs, and spending 100 man-years building it.]

Most effective way to parse C-like definition strings?

I've got a set of function definitions written in a C-like language, with some additional keywords that can be put before some arguments (the same way as "unsigned" or "register", for example), and I need to analyze these lines, as well as some function stubs, and generate actual C code from them.
Is it correct that Flex/Yacc are the most appropriate tools for this?
Will it be slower than writing a shell or Python script using regexps (which may become a big pain, I suppose, if the number of additional keywords grows and their effects diverge), given that I have zero experience with analyzers/parsers (though I know how LALR does its job)?
Are there any good materials on Lex/Yacc that cover similar problems? All the papers I could find use the same primitive example of a "toy" calculator.
Any help will be appreciated.
ANTLR is commonly used (as are Lex/Yacc).
ANTLR, ANother Tool for Language Recognition, is a language tool that provides a framework for constructing recognizers, interpreters, compilers, and translators from grammatical descriptions containing actions in a variety of target languages.
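If you go the ANTLR route with a C# target, driving a generated parser looks roughly like this (a sketch against the ANTLR 4 C# runtime; it assumes you've run the ANTLR tool on a hypothetical grammar named Expr whose top rule is expr, so ExprLexer and ExprParser are generated names):

    using System;
    using Antlr4.Runtime;
    using Antlr4.Runtime.Tree;

    // ExprLexer and ExprParser are generated by ANTLR from Expr.g4.
    var input  = new AntlrInputStream("1 + 2 * 3");
    var lexer  = new ExprLexer(input);
    var tokens = new CommonTokenStream(lexer);
    var parser = new ExprParser(tokens);
    IParseTree tree = parser.expr();              // start parsing at rule 'expr'
    Console.WriteLine(tree.ToStringTree(parser)); // LISP-style dump of the tree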
There is also the Lemon parser, which features a less restrictive grammar. The downside is that you're married to Lemon; rewriting a parser's grammar to something else when you discover some limitation sucks. The upside is that it's really easy to use, and it's self-contained: you can drop it in-tree and not worry about checking for the presence of other tools.
SQLite3 uses it, as do several other popular projects. I'm not saying use it because SQLite does, but perhaps give it a try if time permits.
That depends entirely on your definition of "effective". If you have all the time in the world, the fastest parser would be a hand-written pull parser. They take a long time to develop and debug, but today no parser generator beats hand-written code in terms of runtime performance.
If you want something that can parse valid C within a week or so, use a parser generator. The code will be fast enough and most parser generators come with a grammar for C already which you can use as a starting point (avoiding 90% of the common mistakes).
Note that regexps are not suitable for parsing recursive structures. This approach would be both slower than using a generator and more error-prone than a hand-written pull parser.
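As a tiny illustration of why regexps fall short here (a hand-rolled sketch, not production code): recognizing nesting requires recursion or a counter, which a classic regular expression does not have:

    // Minimal recursive-descent recognizer for nested parenthesized groups:
    //   group ::= '(' (group | atom)* ')'
    static bool ParseGroup(string s, ref int i)
    {
        if (i >= s.Length || s[i] != '(') return false;
        i++;                                   // consume '('
        while (i < s.Length && s[i] != ')')
            if (!ParseGroup(s, ref i)) i++;    // nested group, or skip one atom char
        if (i >= s.Length) return false;       // ran out of input: unbalanced
        i++;                                   // consume ')'
        return true;
    }

    // int pos = 0;
    // ParseGroup("(a(b)c)", ref pos) && pos == 7  ==>  true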
Actually, it depends on how complex your language is and whether it's really close to C or not.
Still, you could use Lex as a first step, even just for the regular expressions.
I would go for Lex + Menhir and OCaml, but any Flex/Yacc combination would be fine.
The main problem with plain Bison (the GNU implementation of Yacc) stems from the C typing: you have to describe your whole tree (and all the manipulation functions). Using OCaml would be much easier.
For what you want to do, our DMS Software Reengineering Toolkit is likely a very effective solution.
DMS is designed specifically to support custom analyzers/code generators of the type you are discussing. It provides very strong facilities for defining arbitrary language parsers/analyzers (tested on 30+ real languages, including several complete dialects of C, C++, Java, C#, and COBOL).
DMS automates the construction of ASTs (so you don't have to do anything but get the grammar right to have a usable AST), enables the construction of custom analyses of exactly the pattern-directed kind you indicated, can construct new C-specific ASTs representing the code you want to generate, and can spit them out as compilable C source text. The pre-existing C definitions for DMS can likely be bent to cover your C-like language.
