How to develop language syntax checker?

How to develop language syntax checker? - token

I want to develop a syntax checker tool for my web-project. The aim is to analyze the ECMAScript 6 syntax.
I know, that there are some tools like BabelJs, where there exist such tools (but developed in NodeJs environment), but I want try to make such tool, because of getting new knowledge.
What shout I start to read, which books and articles?
I shall highlight what I want, I don't want compiler/interpreter, I just want a syntax checker.

You'll need to:
Know the ECMAScript 6 specification in excruciating detail;
Build a lexer
Build a parser
You need to learn about classic lexing and parsing, from a good source of the theory. With that background you can think about building the lexer and parser for ECMAScript 6, which is what does the basic syntax checking.
See https://en.wikipedia.org/wiki/Parsing, especially the references. Ideal is the Aho/Ullman/Sethi book on compiling.
Don't expect this to be easy (most parser-newbies make this error); parsing is actually a pretty complex topic. Expect to spend a lot of effort to learn how to do this right. You'll also need to learn how to build syntax error recovery into your parser if if is going to check for syntax and complain; techniques for doing this aren't as well documented/taught.
Loud hint: building lexers and parsers is a lot easier if you use lexer-generator and parser-generator tools. You still need the compiler basics to understand what these do.
Here's a list of lexer ("regular expression") and parser generators to choose from: http://en.wikipedia.org/wiki/Comparison_of_parser_generators
JavaScript has some features ("semicolon insertion") which make it singularly hard to parse by most conventional methods. So what you will need to do is learn the theory to get the basics right, and then learn how to bend parsers to handle strange cases like JavaScript.
You'll also need that special parser bending to extract the JavaScript from the HTML pages that make up your "web project".

Related

Is using lex/yacc (or flex/bison) an overkill for configuration file parsing?

For the last couple of weeks I kept reading and playing with flex/bison, the main goal is to parse structured configuration file with nested groups and lists.
flex/bison seems very powerful but too complicated.
I surveyed few open source project and the only example I found for configuration parsing using Bison was ntpd, other projects build their own parser and lexer.
Is it really the right tool for the job? or is it better to build a recursive descent parser by hand (may be with flex as a lexer)?!

It's entirely appropriate. If you are versed in bison you can throw it together way quicker than you could write an RDP or some kind of ad-hoc parser. Might take a little longer if it's your first go at it - but it might also be a good way to learn.
It will also help you design your grammar - if you accidentally make it ambiguous, you'll get a R/R conflict right away, rather than getting way down to a depp dark place in your RDP and finding you have no way out...

I don't believe it's too complicated. Besides, handwritten parsers are poorly maintainable, compared to autogenerated parsers.
The biggest problem with GNU Bison and Flex is that there is no good tutorial for C++. There are plenty of badly written C examples with global variables, which doesn't help Bison/Flex reputation. Your percepsion may change when you have a working example.
Here is a working C++ solution using Bison 3 and Flex. Encapsulate it in your own namespace and voila - you can stuff your project with gazilion parsers for everything.
https://github.com/ezaquarii/bison-flex-cpp-example

There are lots of home-brew configuration file syntaxes that have been developed using primitive ad-hoc approaches, such as splitting a line into a name and value based on simple tokenizing. Such approaches tend to have limitations, and Java properties files come to mind as a particularly bad configuration format.
When you have made the decision to define a lexical and BNF specification for your configuration syntax, you are already ahead of the game. Whether you then choose to implement that specification via hand-written code or via tools such as flex & bison is just a relatively unimportant implementation detail.
When I designed and implemented Config4*, I choose the hand-written code approach, for reasons I discuss in one of the Config4* manuals. However, I agree with the advice from BadZen: if you are already comfortable using flex and bison, then using them will probably save time compared to using a hand-written lexer and recursive-descent parser.

Writing a code formatting tool for a programming language

I'm looking into the feasibility of writing a code formatting tool for the Apex language, a Salesforce.com variation on Java, and perhams VisualForce, its tag based markup language.
I have no idea on where to start this, apart from feeling/knowing that writing a language parser from scratch is probably not the best approach.
I have a fairly thin grasp of what Antlr is and what it does, but conceptually, I'm imagining one could 'train' antlr to understand the syntax of Apex. I could then get a structured version of the code in a data structure (AST?) which I could then walk to produce correctly formatted code.
Is this the right concept? Is Antlr a tool to do that? Any links to a brief synopsis on this? I'm looking for investing a few days in this task, not months, and I'm not sure if its even vaguely achievable.

Since Apex syntax is similar to Java, I'd look at Eclipse's JDT. Edit down the Java grammar to match Apex. Do the same w/ formatting rules/options. This is more than a few days of work.

Steven Herod wrote:
... I'm imagining one could 'train' antlr to understand the syntax of Apex. ...
What do you mean by "'train' antlr"? "Train" as in artificial intelligence (training a neural-net)? If so, then you are mistaken.
Steven Herod wrote:
... get a structured version of the code in a data structure (AST?) which I could then walk to produce correctly formatted code.
Is this the right concept? Is Antlr a tool to do that?
Yes, more or less. You write a grammar that precisely defines the language you want to parse. Then you use ANTLR which will generate a lexer (tokenizer) and parser based on the grammar file. You can let the parser create an AST from your input source and then walk the AST and emit (custom) output/code.
Steven Herod wrote:
... I'm looking for investing a few days in this task, not months, and I'm not sure if its even vaguely achievable.
Well, I don't know you of course, but I'd say writing a grammar for a language similar to Java, and then emitting output by walking the AST within just a couple of days is impossible, even more so for someone new to ANTLR. I am fairly familiar with ANTLR, but I couldn't do it in just a few days. Note that I'm only talking about the "parsing-part", after you've done that, you'll need to integrate this in some text editor. This all looks to be more a project of several months, not even weeks, let alone several days.
So, in short, if all you want to do is write a custom code highlighter, ANTLR isn't your best choice.
You could have a look at Xtext which uses ANTLR under the hood. To quote their website:
With Xtext you can easily create your own programming languages and domain-specific languages (DSLs). The framework supports the development of language infrastructures including compilers and interpreters as well as full blown Eclipse-based IDE integration. ...
But I doubt you'll have an Eclipse plugin up and running within just a few days.
Anyway, best of luck!

Our DMS Software Reengineering Toolkit is designed to do this as kind poker-pot ante necessary to do any kind of automated software reengineering project.
DMS allows one to define a grammar, similar to ANTLR's (and other parser generator) styles. Unlike ANTLR (and other parser generators), DMS uses a GLR parser, which means you don't have to bend the language grammar rules to meet the requirements of the parser generator. If you can write an context-free grammar, DMS will convert that into a parser for that language. This means in fact you can get a working, correct grammar up considerably faster than with typical LL or L(AL)R parser generators.
Unlike ANTLR (and other parser generators), there is no additional work to build the AST; it is automatically constructed. This means you spend zero time write tree-building rules and none debugging them.
DMS additionally provides a pretty-printing specification language, specifying text boxes stack vertically, horizontally, or indented, in which you can define the "format" that is used to convert the AST back into completely legal, nicely formatted source text. None of the well known parser generators provide any help here; if you want to prettyprint the tree, you get to do a great deal of custom coding. For more details on this, see my SO answer to Compiling an AST back to source. What this means is you can build a prettyprinter for your grammar in an (intense) afternoon by simply annotating the grammar rules with box layout directives.
DMS's lexer is very careful to capture comments and "lexical formats" (was that number octal? What kind of quotes did that string have? Escaped characters?) so that they can be regenerated correctly. Parse-to-AST and then prettyprint-AST-to-text round trips arbitrarily ugly code into formatted code following the prettyprinting rules. (This round trip is the poker ante: if you want go further, to actually manipulate the AST, you still want to be able to regenerate valid source text).
We recently built parser/prettyprinters for EGL. This took about a week end to end. Granted, we are expert at our tools.
You can download any of a number of different formatters built using DMS from our web site, to see what such formatting can do.
EDIT July 2012: Last week (5 days) using DMS, from scratch we (I personally) built a fully compliant IEC61131-3 "Structured Text" (industrial control language, Pascal-like) parser and prettyprinter. (It handles all the examples from the standards documents).

Reverse engineering a language to get a parser is hard. Very hard! Even if it's very close to Java.
But why reinvent the wheel?
There is a wonderful Apex parser implementation as part of the Force.com IDE on GitHub. It's just a jar without source code but you can use it for whatever you want. And the developers behind it are really supportive and helpful.
We are currently building an Apex module of the famous Java static code analyzer PMD here. And we use Salesforce.com internal parser. It works like a charm.
And hey, it's an open source project and we need contributers of any kind ;-)

Can Xtext be used for parsing general purpose programming languages?

I'm currently developing a general-purpose agent-based programming language (its syntaxt will be somewhat inspired by Java, and we are also using object in this language).
Since the beginning of the project we were doubtful about the fact of using ANTLR or Xtext. At that time we found out that Xtext was implementing a subset of the feature of ANTLR. So we decided to use ANLTR for our language losing the possibility to have a full-fledged Eclipse editor for free for our language (such a nice features provided by Xtext).
However, as the best of my knowledge, this summer the Xtext project has done a big step forward. Quoting from the link:
What are the limitations of Xtext?
Sven: You can implement almost any kind of programming language or DSL
with Xtext. There is one exception, that is if you need to use so
called 'Semantic Predicates' which is a rather complicated thing I
don't think is worth being explained here. Very few languages really
need this concept. However the prominent example is C/C++. We want to
look into that topic for the next release.
And that is also reinforced in the Xtext documentation:
What is Xtext? No matter if you want to create a small textual domain-specific language (DSL) or you want to implement a full-blown
general purpose programming language. With Xtext you can create your
very own languages in a snap. Also if you already have an existing
language but it lacks decent tool support, you can use Xtext to create
a sophisticated Eclipse-based development environment providing
editing experience known from modern Java IDEs in a surprisingly short
amount of time. We call Xtext a language development framework.
If Xtext has got rid of its past limitations why is it still not possible to find a complex Xtext grammar for the best known programming languages (Java, C#, etc.)?
On the ANTLR website you can find tons of such grammar examples, for what concerns Xtext instead the only sample I was able to find is the one reported in the documentation. So maybe Xtext is still not mature to be used for implementing a general purpose programming language? I'm a bit worried about this... I would not start to re-write the grammar in Xtext for then to recognize that it was not suited for that.

I think nobody implemented Java or C++ because it is a lot of work (even with Xtext) and the existing tools and compilers are excellent.
However, you could have a look at Xbase and Xtend, which is the expression language we ship with Xtext. It is built with Xtext and is quite a good proof for what you can build with Xtext. We have done that in about 4 person months.
I did a couple of screencasts on Xtend:
http://blog.efftinge.de/2011/03/xtend-screencast-part-1-basics.html
http://blog.efftinge.de/2011/03/xtend-screencast-part-2-switch.html
http://blog.efftinge.de/2011/03/xtend-screencast-part-3-rich-strings-ie.html
Note, that you can simply embed Xbase expressions into your language.

I can't speak for what Xtext is or does well.
I can speak to the problem of developing robust tools for processing real languages, based on our experience with the DMS Software Reengineering Toolkit, which we imagine is a language manipulation framework.
First, parsing of real languages usually involves something messy in lexing and/or parsing, due to the historical ways these languages have evolved. Java is pretty clean. C# has context-dependent keywords and a rudimentary preprocessor sort of like C's. C has a full blown preprocessor. C++ is famously "hard to parse" due to ambiguities in the grammar and shenanigans with template syntax. COBOL is fairly ugly, doesn't have any reference grammars, and comes in a variety of dialects. PHP will turn you to stone if you look at it because it is so poorly defined. (DMS has parsers for all of these, used in anger on real applications).
Yet you can parse all of these with most of the available parsing technologies if you try hard enough, usually by abusing the lexer or the parser to achieve your goals (how the GNU guys abused Bison to parse C++ by tangling lexical analysis with symbol table lookup is a nice ugly case in point). But it takes a lot of effort to get the language details right, and the reference manuals are only close approximations of the truth with respect to what the compilers really accept.
If Xtext has a decent parsing engine, one can likely do this with Xtext. A brief perusal of the Xtext site sounds like the lexers and parsers are fairly decent. I didn't see anything about the "Semantic Predicate"s; we have them in DMS and they are lifesavers in some of the really dark corners of parsing. Even using the really good parsing technology (we use GLR parsers), it would be very hard to parse COBOL data declarations (extracting their nesting structure during the parse) without them.
You have an interesting problem in that your language isn't well defined yet. That will make your initial parsers somewhat messy, and you'll revise them a lot. Here's where strong parsing technology helps you: if you can revise your grammar easily you can focus on what you want your language to look like, rather than focusing on fighting the lexer and parser. The fact that you can change your language definition means in fact that if Xtext has some limitations, you can probably bend your language syntax to match without huge amounts of pain. ANTLR does have the proven ability to parse a language pretty much as you imagine it, modulo the usual amount of parser hacking.
What is never discussed is what else is needed to process a language for real. The first thing you need to be able to do is to construct ASTs, which ANTLR and YACC will help you do; I presume Xtext does also. You also need symbol tables, control and data flow analysis (both local and global), and machinery to transform your language into something else (presumably more executable). Doing just symbol tables you will find surprisingly hard; C++ has several hundred pages of "how to look up an identifier"; Java generics are a lot tougher to get right than you might expect. You might also want to prettyprint the AST back to source code, if you want to offer refactorings. (EDIT: Here both ANTLR and Xtext offer what amounts to text-template driven code generation).
Yet these are complex mechanisms that take as much time, if not more than building the parser. The reason DMS exists isn't because it can parse (we view this just as the ante in a poker game), but because all of this other stuff is very hard and we wanted to amortize the cost of doing it all (DMS has, we think, excellent support for all of these mechanisms but YMMV).
On reading the Xtext overview, it sounds like they have some support for symbol tables but it is unclear what kind of assumption is behind it (e.g., for C++ you have to support multiple inheritance and namespaces).
If you are already started down the ANTLR road and have something running, I'd be tempted to stay the course; I doubt if Xtext will offer you a lot of additional help. If you really really want Xtext's editor, then you can probably switch at the price of restructuring what grammar you have (this is a pretty typical price to pay when changing parsing paradigms). Expect most of your work to appear after you get the parser right, in an ad hoc way. I doubt you will find Xtext or ANTLR much different here.

I guess the most simple answer to your question is: Many general purpose languages can be implemented using Xtext. But since there is no general answer to which parser-capabilities a general purpose languages needs, there is no general answer to your questions.
However, I've got a few pointers:
With Xtext 2.0 (released this summer), Xtext supports syntactic predicates. This is one of the most requested features to handle ambiguous syntax without enabling antlr's backtracking.
You might want to look at the brand-new languages Xbase and Xtend, which are (judging based on their capabilities) general-purpose and which are developed using Xtext. Sven has some nice screen casts in his blog: http://blog.efftinge.de/
Regarding your question why we don't see Xtext-grammars for Java, C++, etc.:
With Xtext, a language is more than just a grammar, so just having a grammar that describes a language's syntax is a good starting point but usually not an artifact valuable enough for shipping. The reason is that with an Xtext-grammar you also define the AST's structure (Abstract Syntax Tree, and an Ecore Model in fact) including true cross references. Since this model is the main internal API of your language people usually spend a lot of thought designing it. Furthermore, to resolve cross references (aka linking) you need to implement scoping (as it is called in Xtext). Without a proper implementation of scoping you can either not have true cross references in your model or you'll get many lining errors.
A guess my point is that creating a grammar + designing the AST model + implementing scoping is just a little more effort that taking a grammar from some language-zoo and translating it to Xtext's syntax.

In the programming languages specifications, why is it that lexical analysis not translatable?

In all of the standard specifications for programming languages, why is it that you cannot directly translate the lexical analysis/layout to a grammar that is ready to be plugged into and working?
I can understand that it would be impossible to adapt it for the likes of Flex/Bison, Lex/Yacc, Antlr and so on, and furthermore to make it readable for humans to understand.
But surely, if it is a standard specification, it should be a simple copy/paste the grammar layout and instead end up with loads of shift/reduce errors as a result which can back fire and hence, produce an inaccurate grammar.
In other words, why did they not make it readable for use by a grammar/parser tool straight-away?
Maybe it is a debatable thing I don't know...
Thanks,
Best regards,
Tom.

In other words, why did they not make
it readable for use by a
grammar/parser tool straight-away?
Standards documents are intended to be readable by humans, not parser generators.

It is easy for humans to look at a grammar and know what the author intended, however, a computer needs to have a lot more hand holding along the way.
Specifically, these specifications are generally not LL(1) or LR(1). As such, lookaheads are needed, conflicts need to be resolved. True, this could be done in the language specification, but then it is source code for a lexical analyzer, not a language specification.

I agree with your sentiment, but the guys writing standards can't win on this.
To make the lexer/grammar work for a parser generator directly-out-of-standard, the standard writers would have to choose a specific one. (What choice would the COBOL standard folks have made in 1958?)
The popular ones (LEX, YACC, etc.)
are often not capable of handling reference grammars, written for succinctness and clarity, and so would be a poor (e.g. non-)choice.
More exotic ones (Earley, GLR) might be more effective because they allow infinite lookahead and ambiguity, but are harder to find. So if a specific tool like this
was chosen you would not get what you wanted, which is a grammar that works with the parser generator you have.
Having said that, the DMS Software Reengineering Toolkit uses a GLR parser generator. We don't have to massage reference grammars a lot to get them to work, and DMS now handles a lot of languages, including ones that are famously hard such as C++. IMHO, this is as close to your ideal as you are likely to get.

Learning More About Parsing

I have been programming since 1999 for work and fun. I want to learn new things, and lately I've been focused on parsing, as a large part of my job is reading, integrating and analyzing data. I also have a large number of repetitive tasks that I think I could express in very simple domain-specific languages if the overhead was low enough. I have a few questions about the subject.
Most of my current parsing code don't define a formal grammar. I usually hack something together in my language of choice because that's easy, I know how to do it and I can write that code very fast. It's also easy for other people I work with to maintain. What are the advantages and disadvantages of defining a grammar and generating a real parser (as one would do with ANTLR or YACC) to parse things compared with the hacks that most programmers used to write parsers?
What are the best parser generation tools for writing grammar-based parsers in C++, Perl and Ruby? I've looked at ANTLR and haven't found much about using ANTLRv3 with a C++ target, but otherwise that looks interesting. What are the other tools that are similar to ANTLR that I should be reading about?
What are the canonical books and articles that someone interested in learning more about parsing? A course in compilers unfortunately wasn't part of my education, so basic material is very welcome. I've heard great things about the Dragon Book, but what else is out there?

On 1., I would say the main advantage is maintainability -- making a little change to the language just means making a correspondingly-small change to the grammar, rather than minutely hacking through the various spots in the code that may have something to do with what you want changed... orders of magnitude better productivity and smaller risk of bugs.
On 2. and 3., I can't suggest much beyond what you already found (I mostly use Python and pyparsing, and could comment from experience on many Python-centered parse frameworks, but for C++ I mostly use good old yacc or bison anyway, and my old gnarled copy of the Dragon Book -- not the latest edition, actually -- is all I keep at my side for the purpose...).

Here's my take on your (very good) questions:
I think a parser benefits most from non-trivial situations where a grammar actually exists. You have to know about how parsers and grammars work to think of that technique, and not every developer does.
lex/yacc are older Unix tools that might be usable for you as a C++ developer. Maybe Bison as well.
ANTRL and its attendant book are very good. "Writing Compilers and Interpreters" has C++ examples which you might like.
The GoF Interpreter pattern is another technique for writing "little languages". Take a look at that.

Let's Build A Compiler is a step-by-step tutorial on how to write a simple compiler. The code is written in Delphi (Pascal), but it's basic enough to easily translate into most other languages.

I would have a serious look at monadic combinator-based parsing (which often also deals with lexical analysis) in Haskell. I found it quite an eye opener; it's amazing how easily you can build a parser from scratch using this method. It's so easy, in fact, that it's often faster to write your own parser than it is to try to use existing libraries.
The most famous example is probably Parsec which has a good user guide that explains how to use it. There is a list of ports of this library to other languages (including C++ and Ruby) listed on the Parsec page of the Haskell wiki, though I'm not familiar with them and so I can't say how close they are to using Parsec in Haskell.
If you want to learn how these work internally and how to write your own, I recommend starting with Chapter 8 ("Functional Parsers") from Graham Hutton's Programming in Haskell. Once you understand that chapter well (which will probably take several readings), you'll be set.

In perl, the Parse::RecDescent modules is the first place to start. Add tutorial to the module name and Google should be able to find plenty of tutorials to get you started.

Defining a grammar using BNF, EBNF or something similar, is easier and later on you will have a better time maintaining it. Also, you can find a lot of examples of grammar definitions. Last but not least, if you are going to talk about your grammar to someone else on the field, it is better if you are both speaking the same language (BNF, EBNF etc.).
Writing your own parsing code is like reinventing the wheel and is prone to errors. It is also less maintainable. Of course, it can be more flexible, and for small projects it might also be a good choice, but using an existing parser generator that takes a grammar and spits out the code should cover most of our needs.
For C++ I would also suggest lex/yacc. For Ruby this looks like a decent choice: Coco/R(uby)

Funny timing: I spent lots of this morning wondering about state machines and parsers, and trying to figure out how I could learn more about them.
For 2, you might take a look at Ragel (it's good for C++ and Ruby).

Here's a tutorial on a self-contained (10 pages!), completely portable compiler-compiler
which can be used to design and implement "low overhead" DSLs very quickly:
http://www.bayfronttechnologies.com/mc_tutorial.html
This site walks you through Val Schorre's 1964 paper on MetaII.
Yes, 1964. And it is amazing. This is how I learned about compilers
back in 1970.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart