Parsing code into components

I need to parse some code and break it into components, because I want to compute some statistics about the code, such as the number of code lines, the positions of conditional statements, and so on.
Is there any tool that I can use for that?

ANTLR is a nice tool that works with many languages, has good documentation, and includes many sample grammars for languages.
You can also go old-school and use Yacc and Lex (or the GNU versions, Bison and Flex), which have a pretty good book on generating parsers, as well as the classic dragon book.
It might be overkill, however, and you might just want to use Ruby or even JavaScript.
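If the code you want to measure happens to be in a language that ships its own parser, you may not need a parser generator at all. As a minimal sketch, assuming the input is Python source (the question does not say which language is being analyzed), the standard `ast` module can give you line counts and the positions of conditional statements. The names `code_stats` and `SAMPLE` are made up for this example.

```python
import ast

def code_stats(source: str) -> dict:
    """Collect a few simple metrics from Python source text."""
    tree = ast.parse(source)
    # Count non-blank lines that are not pure comment lines.
    code_lines = sum(
        1 for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )
    # Record the line number of every if/elif statement.
    if_lines = sorted(
        node.lineno for node in ast.walk(tree) if isinstance(node, ast.If)
    )
    return {"code_lines": code_lines, "if_lines": if_lines}

SAMPLE = """\
# demo
def sign(x):
    if x < 0:
        return -1
    elif x == 0:
        return 0
    return 1
"""

stats = code_stats(SAMPLE)
print(stats)
```

Note that this simple line count treats docstrings as code; a real metrics tool would make that choice configurable.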

You mention two distinct tasks:
Refactor code into modular components.
Run static code analysis to get code metrics.
I recommend:
ReSharper, as the top code refactoring tool out there (assuming you're a .NET guy).
NCover and NDepend for static code analysis. You get the number of lines of code, cyclomatic complexity, abstractness vs. instability diagrams... all of the cool stuff.
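To give an idea of what a metric like cyclomatic complexity measures, here is a rough sketch in Python using the standard `ast` module. It is only an approximation (one plus the number of decision points) and is not how NDepend or any other tool computes it; `cyclomatic_complexity`, `complexity_report` and `SAMPLE` are names invented for the example.

```python
import ast

# Node types treated as decision points in this rough approximation.
DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)

def cyclomatic_complexity(func: ast.FunctionDef) -> int:
    """Approximate McCabe complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(func):
        if isinstance(node, ast.BoolOp):
            # "a or b or c" adds one extra path per additional operand.
            complexity += len(node.values) - 1
        elif isinstance(node, DECISION_NODES):
            complexity += 1
    return complexity

def complexity_report(source: str) -> dict:
    """Map each function name in the source to its approximate complexity."""
    tree = ast.parse(source)
    return {
        node.name: cyclomatic_complexity(node)
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef)
    }

SAMPLE = """\
def classify(n):
    if n < 0 or n > 100:
        return "out of range"
    for d in (2, 3):
        if n % d == 0:
            return "divisible"
    return "other"
"""

report = complexity_report(SAMPLE)
print(report)
```

Nested functions are counted towards their enclosing function as well, which is one of several simplifications here.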

Related

Is using lex/yacc (or flex/bison) an overkill for configuration file parsing?

For the last couple of weeks I have been reading about and playing with flex/bison. The main goal is to parse a structured configuration file with nested groups and lists.
flex/bison seem very powerful but too complicated.
I surveyed a few open source projects, and the only example I found of configuration parsing using Bison was ntpd; other projects build their own parser and lexer.
Is it really the right tool for the job? Or is it better to build a recursive descent parser by hand (maybe with flex as a lexer)?
It's entirely appropriate. If you are versed in bison you can throw it together way quicker than you could write an RDP or some kind of ad-hoc parser. Might take a little longer if it's your first go at it - but it might also be a good way to learn.
It will also help you design your grammar - if you accidentally make it ambiguous, you'll get a reduce/reduce (R/R) conflict right away, rather than getting way down to a deep dark place in your RDP and finding you have no way out...
I don't believe it's too complicated. Besides, handwritten parsers are poorly maintainable, compared to autogenerated parsers.
The biggest problem with GNU Bison and Flex is that there is no good tutorial for C++. There are plenty of badly written C examples with global variables, which doesn't help the reputation of Bison/Flex. Your perception may change when you have a working example.
Here is a working C++ solution using Bison 3 and Flex. Encapsulate it in your own namespace and voila - you can stuff your project with a gazillion parsers for everything.
https://github.com/ezaquarii/bison-flex-cpp-example
There are lots of home-brew configuration file syntaxes that have been developed using primitive ad-hoc approaches, such as splitting a line into a name and value based on simple tokenizing. Such approaches tend to have limitations, and Java properties files come to mind as a particularly bad configuration format.
When you have made the decision to define a lexical and BNF specification for your configuration syntax, you are already ahead of the game. Whether you then choose to implement that specification via hand-written code or via tools such as flex & bison is just a relatively unimportant implementation detail.
When I designed and implemented Config4*, I chose the hand-written code approach, for reasons I discuss in one of the Config4* manuals. However, I agree with the advice from BadZen: if you are already comfortable using flex and bison, then using them will probably save time compared to using a hand-written lexer and recursive-descent parser.
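For comparison, here is roughly what the hand-written route looks like: a small lexer plus a recursive-descent parser for a configuration syntax with nested groups and lists. This is a minimal sketch in Python; the syntax itself (`name = value;`, `group { ... }`, `[a, b]`) is invented for the example and is not the syntax of Config4*, ntpd or any other project mentioned above.

```python
import re

# One alternative per token kind; whitespace has no named group.
TOKEN_RE = re.compile(r'''
    \s+
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<NUMBER>\d+)
  | "(?P<STRING>[^"]*)"
  | (?P<PUNCT>[{}\[\]=,;])
''', re.VERBOSE)

def tokenize(text):
    """Return a list of (kind, text) tokens ending with an EOF marker."""
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        pos = m.end()
        if m.lastgroup:  # None for skipped whitespace
            tokens.append((m.lastgroup, m.group(m.lastgroup)))
    tokens.append(("EOF", ""))
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.i = 0

    def peek(self):
        return self.tokens[self.i]

    def take(self, kind, text=None):
        tok = self.tokens[self.i]
        if tok[0] != kind or (text is not None and tok[1] != text):
            raise SyntaxError(f"expected {text or kind}, got {tok[1]!r}")
        self.i += 1
        return tok[1]

    def entries(self, closer):
        # entries := (NAME '=' value ';' | NAME '{' entries '}')*
        result = {}
        while self.peek() != closer:
            name = self.take("NAME")
            if self.peek() == ("PUNCT", "{"):
                self.take("PUNCT", "{")
                result[name] = self.entries(("PUNCT", "}"))
                self.take("PUNCT", "}")
            else:
                self.take("PUNCT", "=")
                result[name] = self.value()
                self.take("PUNCT", ";")
        return result

    def value(self):
        # value := NUMBER | STRING | '[' (value (',' value)*)? ']'
        kind, _ = self.peek()
        if kind == "NUMBER":
            return int(self.take("NUMBER"))
        if kind == "STRING":
            return self.take("STRING")
        self.take("PUNCT", "[")
        items = []
        if self.peek() != ("PUNCT", "]"):
            items.append(self.value())
            while self.peek() == ("PUNCT", ","):
                self.take("PUNCT", ",")
                items.append(self.value())
        self.take("PUNCT", "]")
        return items

def parse_config(text):
    return Parser(tokenize(text)).entries(("EOF", ""))

SAMPLE = '''
server {
    name = "alpha";
    ports = [80, 443];
    tls { depth = 2; }
}
'''
print(parse_config(SAMPLE))
```

Each grammar rule maps to one method, which is what makes this style easy to write; the price is that nothing checks the grammar for ambiguity the way bison does.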

Cross-platform parser development - What are the options?

I'm currently working on a project that makes use of a custom language with a simple context-free grammar.
Due to the project's characteristics the same language will have to be used on several platforms, especially mobile ones. Currently, I'm using my small hand-written Java parser (for the Android platform). Soon, I'll have to write basically the same parser for JavaScript and later possibly also for C# (Windows Phone) and Objective C (iOS). There is an additional chance that I'll also have to write it for PHP.
My question is: What options are there to simplify the parser development process? Do I really have to write basically the same parser for each platform or is there a less work-intensive way?
From a development process point of view the best alternative would enable me to write a grammar definition which would then automatically be compiled into a parser.
However, basically the only cross-platform parser generator I've found so far is the GOLD Parser, which supports two of my target platforms (Java and C#). It would really be awesome if you could point me to other alternatives.
In case you don't know about other cross-platform compiler-compilers: Do you have hints how to structure the code towards future language extensibility?
I commend https://en.wikipedia.org/wiki/Comparison_of_parser_generators to your attention: if we restrict the domain to Java and C/C++, it suggests APG, GOLD, SableCC, and SLK (amongst others) as being cross-language enough for your stated goals. (I'm also requiring that the action code be separated from the grammar rather than inline, since the latter would defeat the purpose.) If you want JavaScript as well, it looks like your choices are APG (GPL-licensed) and WaxEye (MIT-licensed).
If your language is reasonably simple then I would say to just go with whichever you think will be easiest to integrate into your build environment(s) and has a reasonable match with how you think. Unless parsing time is a huge fraction of your application's total workload, parsing speed should not be an issue -- although table size and memory usage might matter in a mobile context. If your grammar is "simple enough," (i.e. not Perl, for instance) I would expect any of those tools to work.
Have a look at ANTLR. I am using it for transforming Java code and it is really great. Moreover, you can find different grammars here.
REx parser generator supports the required targets, except for Objective C and PHP (code generators for those might be possible). It has not yet been published as open source, though, and there is no decent documentation, just sample grammars. But there are projects that are using it successfully, e.g. xqlint. Here is a paper describing the experience from that project.

Learning incremental compilation design

There are a lot of books and articles about creating compilers that do the whole compilation job in one go. But what about the design of incremental compilers/parsers, which are used by IDEs? I'm familiar with the first class of compilers, but I have never worked with the second.
I tried to read some articles about Eclipse Java Development Tools, but they describe how to use the complete infrastructure (i.e. the APIs) instead of describing the internal design (i.e. how it works internally).
My goal is to implement an incremental compiler for my own programming language. Which books or articles would you recommend?
This book is worth a look: Building a Flexible Incremental Compiler Back-End.
Quote from Ch. 10 "Conclusions":
This paper has explored the design of the back-end of an incremental compilation system. Rather than building a single fixed incremental compiler, this paper has presented a flexible framework for constructing such systems in accordance with user needs.
I think this is what you are looking for...
Edit:
So you plan to create something that is known as a "cross compiler"?!
I started a new attempt. So far I can't provide the ultimate reference. If you plan such a big project, I'm sure you are an experienced programmer, so you may already know these links.
Compilers.net
A list of compilers, including cross compilers (translators). Unfortunately some of the links are broken, but 'Toba' still works and has a link to its source code. Maybe this can inspire you.
clang: a C language family frontend for LLVM
OK, it's for LLVM, but the source is available in an SVN repository and it is a front end for a compiler (translator). Maybe this can inspire you as well.
I'm going to disagree with conventional wisdom on this one because most conventional wisdom makes unwritten assumptions about your goals, such as complete language designs and the need for extreme efficiency. From your question, I am assuming these goals:
learn about writing your own language
play around with your language until it looks elegant
try to emit code into another language or byte code for actual execution.
You want to build a hacking harness and a recursive descent parser.
Here is what you might want to build for a harness, using just a text based processor.
Change the code fragment (now "AT 0700 SET HALLWAY LIGHTS ON FULL")
Compile the fragment
Change the code file (now "tests.l")
Compile from file
Toggle Lexer output (now ON)
Toggle Emitter output (now ON)
Toggle Run on home hardware (now OFF)
Your command, sire?
You will probably want to write your code in Python or some other scripting language. You are optimizing your speed of play, not execution. A recursive descent parser might look like:
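def cmd_at():
    # Handles "AT <time> ..."; next_token, next_num, emit and match_token
    # are helpers from your own lexer and emitter.
    if next_token.type == cTIME:
        time = next_token.text  # assumed to hold the raw digits, e.g. "0700"
        num = next_num()
        # Hours are the first two digits, minutes the rest.
        emit("events.setAlarm(events.DAILY, converttime(" + time[0:2] + ", "
             + time[2:] + "), func_" + str(num) + ");")
        match_token(cTIME)
        match_token(LOCATION)
        ...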
So you need to write:
A little menu for hacking.
Some lexing routines, to return different tokens for numbers, reserved words, and the like.
A bunch of logic for what your language does.
This approach is aimed at speeding up the cycle for hacking together the language. When you have finished this approach, then you reach for BISON, test harnesses, etc.
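To make the idea concrete, here is a small self-contained sketch of the non-interactive part of such a harness: a lexer, one recursive-descent rule and an emitter for the example fragment from the menu. The grammar rule, the token kinds and the shape of the emitted `events.setAlarm(...)` line are guesses made for illustration, not a definition of the language above.

```python
import re

RESERVED = {"AT", "SET", "ON", "OFF", "FULL"}

def lex(text):
    """Turn a command line into (kind, text) tokens."""
    tokens = []
    for word in text.split():
        if re.fullmatch(r"\d{4}", word):
            tokens.append(("TIME", word))
        elif word in RESERVED:
            tokens.append((word, word))  # reserved words are their own kind
        else:
            tokens.append(("NAME", word))
    tokens.append(("EOF", ""))
    return tokens

class FragmentCompiler:
    def __init__(self, text):
        self.tokens = lex(text)
        self.pos = 0
        self.emitted = []

    def peek(self):
        return self.tokens[self.pos][0]

    def match(self, kind):
        found_kind, found_text = self.tokens[self.pos]
        if found_kind != kind:
            raise SyntaxError(
                f"expected {kind}, found {found_text or 'end of input'!r}")
        self.pos += 1
        return found_text

    def cmd_at(self):
        # cmd_at := AT TIME SET NAME+ (ON | OFF) [FULL]
        self.match("AT")
        time = self.match("TIME")
        self.match("SET")
        device = [self.match("NAME")]
        while self.peek() == "NAME":
            device.append(self.match("NAME"))
        state = self.match("OFF") if self.peek() == "OFF" else self.match("ON")
        if self.peek() == "FULL":
            state += "_" + self.match("FULL")
        self.match("EOF")
        self.emitted.append(
            'events.setAlarm(events.DAILY, converttime(%d, %d), "%s", "%s");'
            % (int(time[:2]), int(time[2:]), "_".join(device).lower(), state)
        )

def compile_fragment(text, show_tokens=False):
    compiler = FragmentCompiler(text)
    if show_tokens:  # plays the role of the "Toggle Lexer output" switch
        print(compiler.tokens)
    compiler.cmd_at()
    return compiler.emitted

print(compile_fragment("AT 0700 SET HALLWAY LIGHTS ON FULL"))
```

The menu loop itself is just a few `input()` calls around `compile_fragment`, which is why a scripting language is a comfortable fit for this stage.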
Making your own language can be a wonderful journey! Expect to learn. Do not expect to get rich.
I see that there is an accepted answer, but I think that some additional material could be usefully included on this page.
I read the Wikipedia article on this topic and it linked to a DDJ article from 1997:
http://www.drdobbs.com/cpp/codestore-and-incremental-c/184410345?pgno=1
The meat of the article is the first page. It explains that the code in the editor is divided into pieces that are "incorporated" into a "CodeStore" (database). The pieces are incorporated via a work queue which contains unincorporated pieces. A piece of code may be parsed and returned to the work queue multiple times, with some failure on each attempt, until it goes through successfully. The database includes dependencies between the pieces so that when the source code is edited the effects on the edited piece and other pieces can be seen and these pieces can be reprocessed.
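Here is a toy model of that work-queue idea, as I understand it from the description above. It is a sketch in Python, not the design from the article: the class and method names are invented, and "failure" is simplified to "a piece this one depends on has not been incorporated yet".

```python
from collections import deque

class CodeStore:
    """Toy model: a piece is incorporated once everything it uses is."""

    def __init__(self):
        self.incorporated = {}  # name -> frozenset of names it depends on
        self.queue = deque()    # unincorporated (name, deps) pieces

    def submit(self, name, deps=()):
        self.queue.append((name, frozenset(deps)))

    def process(self):
        """Retry queued pieces until a full pass makes no progress."""
        stalled = 0
        while self.queue and stalled < len(self.queue):
            name, deps = self.queue.popleft()
            if all(dep in self.incorporated for dep in deps):
                self.incorporated[name] = deps
                stalled = 0
            else:
                self.queue.append((name, deps))  # failed this attempt
                stalled += 1
        return [name for name, _ in self.queue]  # pieces still unresolved

    def _dependents(self, name):
        """All incorporated pieces that depend on name, directly or not."""
        found, frontier = set(), [name]
        while frontier:
            current = frontier.pop()
            for other, deps in self.incorporated.items():
                if current in deps and other not in found:
                    found.add(other)
                    frontier.append(other)
        return found

    def edit(self, name, deps=()):
        """Re-queue an edited piece plus everything that depends on it."""
        affected = self._dependents(name) | {name}
        for piece in affected:
            old_deps = self.incorporated.pop(piece, None)
            if piece != name and old_deps is not None:
                self.submit(piece, old_deps)
        self.submit(name, deps)
        return affected

store = CodeStore()
store.submit("main", ["helper", "util"])
store.submit("helper", ["util"])
store.submit("util")
print(store.process())           # pieces that could not be incorporated
print(list(store.incorporated))  # order in which pieces went through
```

After an edit, only the edited piece and its dependents go back through the queue; everything else stays in the store, which is the whole point of the incremental design.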
I believe other systems approach the problem differently. Java presents different problems than C/C++ but has advantages as well, so Eclipse perhaps has a different design.

Example code for dynamic parsing techniques

I would like to learn how to write dynamic parsers to perform tasks such as code-completion, highlighting, etc.
I have read the dragon book and written some parsers, but I would like more experience with handling incorrect code, especially code as it is being written.
IDEs like Eclipse and NetBeans obviously include code for stuff like this, but where?
What other projects / books might be relevant?
LISP or functional examples are also welcome.
Take a look at http://www.antlr.org/.
Check out Xtext. It uses an ANTLR parser behind the scenes, but generates a syntax-highlighting editor, content assist, outlining, and many other features for you.
See http://www.eclipse.org/Xtext/
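On the "handling incorrect code" part of the question: one classic technique is panic-mode recovery, where the parser records the error, skips ahead to a synchronizing token, and carries on so that later statements still get parsed. Below is a minimal sketch in Python for a made-up language of `name = number ;` statements. It is not how Eclipse, NetBeans or Xtext do it; all names are invented for the example.

```python
import re

def tokenize(text):
    # Identifiers, integers, and single-character punctuation.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", text)

def parse_assignments(text):
    """Parse 'name = number ;' statements, recovering after errors."""
    tokens = tokenize(text)
    pos = 0
    statements, errors = [], []

    def expect(pattern, what):
        nonlocal pos
        if pos < len(tokens) and re.fullmatch(pattern, tokens[pos]):
            pos += 1
            return tokens[pos - 1]
        found = tokens[pos] if pos < len(tokens) else "end of input"
        raise SyntaxError(f"token {pos}: expected {what}, found {found!r}")

    while pos < len(tokens):
        try:
            name = expect(r"[A-Za-z_]\w*", "a name")
            expect(r"=", "'='")
            value = int(expect(r"\d+", "a number"))
            expect(r";", "';'")
            statements.append((name, value))
        except SyntaxError as err:
            errors.append(str(err))
            # Panic mode: discard tokens up to and including the next ';'.
            while pos < len(tokens) and tokens[pos] != ";":
                pos += 1
            pos += 1
    return statements, errors

statements, errors = parse_assignments("a = 1; b = ; c = 3; d 4")
print(statements)
print(errors)
```

The trade-off is visible if a `;` itself is missing: the parser throws away the following statement while looking for the next synchronizing token. Editors that parse code as it is typed usually need something more refined than this.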

Learning More About Parsing

I have been programming since 1999 for work and fun. I want to learn new things, and lately I've been focused on parsing, as a large part of my job is reading, integrating and analyzing data. I also have a large number of repetitive tasks that I think I could express in very simple domain-specific languages if the overhead was low enough. I have a few questions about the subject.
1. Most of my current parsing code doesn't define a formal grammar. I usually hack something together in my language of choice because that's easy, I know how to do it and I can write that code very fast. It's also easy for other people I work with to maintain. What are the advantages and disadvantages of defining a grammar and generating a real parser (as one would do with ANTLR or YACC) to parse things, compared with the hacks that most programmers use to write parsers?
2. What are the best parser generation tools for writing grammar-based parsers in C++, Perl and Ruby? I've looked at ANTLR and haven't found much about using ANTLRv3 with a C++ target, but otherwise that looks interesting. What are the other tools that are similar to ANTLR that I should be reading about?
3. What are the canonical books and articles that someone interested in learning more about parsing should read? A course in compilers unfortunately wasn't part of my education, so basic material is very welcome. I've heard great things about the Dragon Book, but what else is out there?
On 1., I would say the main advantage is maintainability -- making a little change to the language just means making a correspondingly-small change to the grammar, rather than minutely hacking through the various spots in the code that may have something to do with what you want changed... orders of magnitude better productivity and smaller risk of bugs.
On 2. and 3., I can't suggest much beyond what you already found (I mostly use Python and pyparsing, and could comment from experience on many Python-centered parse frameworks, but for C++ I mostly use good old yacc or bison anyway, and my old gnarled copy of the Dragon Book -- not the latest edition, actually -- is all I keep at my side for the purpose...).
Here's my take on your (very good) questions:
I think a parser benefits most from non-trivial situations where a grammar actually exists. You have to know about how parsers and grammars work to think of that technique, and not every developer does.
lex/yacc are older Unix tools that might be usable for you as a C++ developer. Maybe Bison as well.
ANTLR and its attendant book are very good. "Writing Compilers and Interpreters" has C++ examples which you might like.
The GoF Interpreter pattern is another technique for writing "little languages". Take a look at that.
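For anyone who hasn't seen the Interpreter pattern, the gist is one class per grammar rule, each knowing how to evaluate itself against a context. A minimal sketch in Python, with a tiny arithmetic "language" and class names chosen just for the example:

```python
from dataclasses import dataclass

class Expr:
    """Base class: every node can interpret itself in an environment."""
    def interpret(self, env: dict) -> float:
        raise NotImplementedError

@dataclass
class Num(Expr):
    value: float
    def interpret(self, env):
        return self.value

@dataclass
class Var(Expr):
    name: str
    def interpret(self, env):
        return env[self.name]  # look the variable up in the context

@dataclass
class Add(Expr):
    left: Expr
    right: Expr
    def interpret(self, env):
        return self.left.interpret(env) + self.right.interpret(env)

@dataclass
class Mul(Expr):
    left: Expr
    right: Expr
    def interpret(self, env):
        return self.left.interpret(env) * self.right.interpret(env)

# price * qty + 5, built by hand; a parser would normally build this tree.
expr = Add(Mul(Var("price"), Var("qty")), Num(5))
print(expr.interpret({"price": 2, "qty": 3}))
```

The pattern covers only the evaluation side. You still need something, hand-written or generated, to turn text into the tree.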
Let's Build A Compiler is a step-by-step tutorial on how to write a simple compiler. The code is written in Delphi (Pascal), but it's basic enough to easily translate into most other languages.
I would have a serious look at monadic combinator-based parsing (which often also deals with lexical analysis) in Haskell. I found it quite an eye opener; it's amazing how easily you can build a parser from scratch using this method. It's so easy, in fact, that it's often faster to write your own parser than it is to try to use existing libraries.
The most famous example is probably Parsec which has a good user guide that explains how to use it. There is a list of ports of this library to other languages (including C++ and Ruby) listed on the Parsec page of the Haskell wiki, though I'm not familiar with them and so I can't say how close they are to using Parsec in Haskell.
If you want to learn how these work internally and how to write your own, I recommend starting with Chapter 8 ("Functional Parsers") from Graham Hutton's Programming in Haskell. Once you understand that chapter well (which will probably take several readings), you'll be set.
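To show the flavour of the technique without assuming any Haskell, here is a rough transliteration of the core idea into Python. A parser is just a function from (text, position) to either a (value, new position) pair or `None`, and combinators build bigger parsers from smaller ones. This is a sketch for illustration only, not Parsec's API and not the code from Hutton's book; every name here is made up.

```python
def char(pred):
    """Parser for one character satisfying pred."""
    def parse(text, pos):
        if pos < len(text) and pred(text[pos]):
            return text[pos], pos + 1
        return None
    return parse

def pure(value):
    """Parser that consumes nothing and succeeds with value."""
    return lambda text, pos: (value, pos)

def bind(p, f):
    """Monadic bind: run p, then let its value choose the next parser."""
    def parse(text, pos):
        result = p(text, pos)
        if result is None:
            return None
        value, pos = result
        return f(value)(text, pos)
    return parse

def either(p, q):
    """Try p; if it fails, try q from the same position."""
    def parse(text, pos):
        result = p(text, pos)
        return result if result is not None else q(text, pos)
    return parse

def many1(p):
    """One or more repetitions of p, collected into a list."""
    def parse(text, pos):
        values = []
        result = p(text, pos)
        while result is not None:
            value, pos = result
            values.append(value)
            result = p(text, pos)
        return (values, pos) if values else None
    return parse

# number := digit+
number = bind(many1(char(str.isdigit)), lambda ds: pure(int("".join(ds))))

def sum_expr():
    """sum := number ('+' number)*, evaluated while parsing."""
    def rest(total):
        more = bind(char(lambda c: c == "+"),
                    lambda _: bind(number, lambda n: rest(total + n)))
        return either(more, pure(total))
    return bind(number, rest)

print(sum_expr()("1+20+300", 0))
```

In Haskell, `bind` is the `>>=` operator and do-notation hides the nested lambdas, which is a large part of why the approach feels so lightweight there.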
In Perl, the Parse::RecDescent module is the first place to start. Add "tutorial" to the module name and Google should be able to find plenty of tutorials to get you started.
Defining a grammar using BNF, EBNF or something similar is easier, and later on you will have a better time maintaining it. Also, you can find a lot of examples of grammar definitions. Last but not least, if you are going to talk about your grammar to someone else in the field, it is better if you are both speaking the same language (BNF, EBNF etc.).
Writing your own parsing code is like reinventing the wheel and is prone to errors. It is also less maintainable. Of course, it can be more flexible, and for small projects it might also be a good choice, but using an existing parser generator that takes a grammar and spits out the code should cover most of our needs.
For C++ I would also suggest lex/yacc. For Ruby this looks like a decent choice: Coco/R(uby)
Funny timing: I spent lots of this morning wondering about state machines and parsers, and trying to figure out how I could learn more about them.
For 2, you might take a look at Ragel (it's good for C++ and Ruby).
Here's a tutorial on a self-contained (10 pages!), completely portable compiler-compiler which can be used to design and implement "low overhead" DSLs very quickly:
http://www.bayfronttechnologies.com/mc_tutorial.html
This site walks you through Val Schorre's 1964 paper on MetaII. Yes, 1964. And it is amazing. This is how I learned about compilers back in 1970.

Resources