(Background: This question was inspired by "Is C++ context-free or context-sensitive?", and came up while I am writing a simple compiler using jflex/cup myself.)
If they are written using a lexer/parser generator, how do we specify the grammar?
Since code like
a b(c);
could be interpreted as either a function declaration or a local variable definition, how could we handle it in the grammar definition file?
Another example could be the token ">>" in the following code:
std::vector<std::vector<int>> foo;
int a = 1000 >> 4;
Thanks
Are the compilers of C++ written using a lexer/parser generator?
It depends. Some are, some aren't.
GCC originally did use GNU bison, but was re-written a couple of years ago with a hand-written parser. If I have understood that correctly, the main reason was that writing the parser by hand gives you more control over the parser state, and specifically, how much "extraneous" data to keep in there, so that you can generate better error messages.
If they are written using a lexer/parser generator, how do we specify the grammar?
This depends on which parser generator you are using.
Since code like
a b(c);
could be interpreted as either a function declaration or a local variable definition, how could we handle it in the grammar definition file?
Some parser generators may be powerful enough to handle this directly.
Some aren't. Parser generators that aren't powerful enough by themselves often have a concept of semantic actions, which allow you to attach code written in an arbitrarily powerful language to parser rules; yacc, for example, allows you to attach C code to its rules.
Otherwise, you will have to handle it during semantic analysis.
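To make the first example concrete: one common workaround (often called the "lexer hack", used in several C and C++ front ends, and implementable via semantic actions or a hand-written layer between the lexer and the parser) is to let the symbol table built so far decide whether an identifier is handed to the parser as a type-name token or as an ordinary identifier token. The sketch below is only a minimal illustration of that idea, not code from any real compiler; SymbolTable, classify and the token kinds are invented names.

#include <iostream>
#include <string>
#include <unordered_set>

// Hypothetical symbol table: it only remembers which identifiers name types.
struct SymbolTable {
    std::unordered_set<std::string> typeNames;
    bool isType(const std::string& id) const { return typeNames.count(id) > 0; }
};

enum class TokenKind { TypeName, Identifier };

// The "lexer hack": before handing an identifier to the parser, look it up in
// the symbol table and emit either a TypeName or an Identifier token. The
// grammar can then contain non-conflicting rules along the lines of
//     function_declaration : TypeName Identifier '(' TypeName ')' ';'
//     variable_definition  : TypeName Identifier '(' Identifier ')' ';'
// (Real C++ has many more cases; this is deliberately simplified.)
TokenKind classify(const SymbolTable& syms, const std::string& id) {
    return syms.isType(id) ? TokenKind::TypeName : TokenKind::Identifier;
}

int main() {
    SymbolTable syms;
    syms.typeNames.insert("a");      // suppose "struct a;" was seen earlier
    // For "a b(c);" the interpretation hinges on what "c" currently names:
    //   - if "c" names a type, the statement declares a function b(c) -> a
    //   - if "c" names a variable, it defines a variable b initialized with c
    if (classify(syms, "c") == TokenKind::TypeName)
        std::cout << "a b(c); parsed as a function declaration\n";
    else
        std::cout << "a b(c); parsed as a variable definition\n";
}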
Can a parser generated by tree-sitter be used both for syntax highlighting and for the compiler itself? If not, why not?
It would be counterproductive to write two different parsers and maintain them.
Note: I haven't used tree-sitter yet, but I am considering using it to highlight the syntax of my own programming language. Because of that, I may misunderstand how its parser actually works.
I am planning to implement a meta language on top of Xtext. In other words, I am using the Xtext grammar to define my own meta language. This meta language can then be used to define a language (using the syntax that I defined). Using the defined language, a model can be created by the user.
Hence, I would like to use Xtext/Xtend as a generator for parser generators. This would enable me to add as many meta levels as I like. My understanding is that Xtext itself is defined using Xtext, so this should be possible?
The problem is that I don't know how to approach this, as I am not an expert in Xtext or parser generator frameworks in general. Any solutions/approaches/hints are welcomed.
Update (more details and motivation)
Xtext can be used to generate anything, so I could write a generator based on Xtext that generates a parser. This could be done by specifying my meta language's grammar, using Xtext to generate a parser for that grammar, so I would have access to an AST that represents a model written in my meta language. However, from here on, I would be left alone to do whatever I want with the AST, e.g. generate a parser (because the AST represents the grammar of a user-defined language). But as Xtext has the specific ability to generate parsers, I was thinking of reusing this feature instead of implementing my own parser generator based on the AST of a grammar.
My motivation is the wish to define my own DSL grammar language (as a replacement for Xtext), while still being able to use the infrastructure provided by the Xtext project.
I came to the following solution:
A grammar that was written using my grammar language will be parsed by Xtext. Next, the resulting AST is transformed to the Xtext grammar language AST, which can be used as input for the existing parser generator.
In general, given some grammar language l1, a model written in this language will be parsed and the resulting AST will be transformed to the AST of the grammar language l2 that was used to specify l1. This step is repeated until we have an AST representing a model of the Xtext grammar language, which will be used to generate the new parser.
Naturally, any information added with the definition of a new grammar language will be lost in each transformation step. Therefore, the infrastructure that is developed around a grammar language has the responsibility to create some kind of functionality that makes this information available to a higher language developed using the grammar language.
For a different approach, see:
WWW.XTRAN-LLC.com/xtran.html#parse-gen
In a nutshell, I got tired of creating parsers for XTRAN, our Expert System whose rules language manipulates computer languages, data, and text, so I created a parsing engine that directly executes EBNF at parse time (as opposed to generating parsing code, as e.g. Lex/YACC and ANTLR do). Since XTRAN must also render code content represented in its Internal Representation / AST (after it's manipulated) as source code text, I created a corresponding rendering engine that executes (a much simpler form of) EBNF at render time.
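As a rough illustration of what "executing EBNF at parse time" means (this is not XTRAN's engine, just a toy sketch): the grammar is kept as a data structure and a generic matcher walks it recursively, instead of emitting parser source code ahead of time. The toy below ignores EBNF operators like *, + and ?, tokenization, ambiguity and error reporting; all names are invented for illustration.

#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// A rule is a list of alternatives; each alternative is a sequence of symbols.
// A symbol in single quotes is a literal, anything else is a rule reference.
using Sequence = std::vector<std::string>;
using Rule     = std::vector<Sequence>;
using Grammar  = std::map<std::string, Rule>;

// Try to match rule `name` at position `pos` of `input`; return the new
// position on success, std::string::npos on failure. No code generation:
// the grammar itself is interpreted while parsing.
std::size_t match(const Grammar& g, const std::string& name,
                  const std::string& input, std::size_t pos) {
    for (const Sequence& alt : g.at(name)) {
        std::size_t p = pos;
        bool ok = true;
        for (const std::string& sym : alt) {
            if (sym.front() == '\'') {                        // literal token
                std::string lit = sym.substr(1, sym.size() - 2);
                if (input.compare(p, lit.size(), lit) == 0) p += lit.size();
                else { ok = false; break; }
            } else {                                          // rule reference
                std::size_t q = match(g, sym, input, p);
                if (q == std::string::npos) { ok = false; break; }
                p = q;
            }
        }
        if (ok) return p;
    }
    return std::string::npos;
}

int main() {
    // expr ::= term '+' term | term ;   term ::= 'x' | 'y' ;
    Grammar g = {
        {"expr", {{"term", "'+'", "term"}, {"term"}}},
        {"term", {{"'x'"}, {"'y'"}}},
    };
    std::string input = "x+y";
    std::size_t consumed = match(g, "expr", input, 0);
    std::cout << (consumed == input.size() ? "parsed\n" : "failed\n");
}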
I know that it's possible to use, for example, bison-generated Java files in a Scala project, but are there any native "grammar to Scala" LALR(1) generators?
Another plug here: ScalaBison is close to LALR(1) and lets you use Scala in the actions.
I'm not really answering the original question, and please excuse the plug, but you may be interested in our sbt-rats plugin for the sbt tool. It uses the Rats! parser generator for Java, but makes it easier to use from Scala.
Rats! uses parsing expression grammars as its syntax description formalism, not context-free grammars and definitely not LALR(1) grammars. sbt-rats also has a high-level syntax definition language that in most cases means you do not need to write semantic actions to get a syntax tree that represents your input. The plugin will optionally generate case classes for the tree representation and a pretty-printer for the tree structure.
To have a general-purpose documentation system that can extract inline documentation from multiple languages, a parser for each language is needed. A parser generator (which actually doesn't have to be that complete or efficient) is thus needed.
http://antlr.org/ is a nice parser generator that already has a number of grammars for popular languages. Are there better alternatives, i.e. simpler ones that support generating parsers for even more languages out of the box?
If you're only looking for "partial parsing", then you could use ANTLR's option to partially "lex" a token stream and ignore the rest of the tokens. You can do that by enabling filter=true in a lexer grammar. The lexer then tries to match any token you defined in your grammar, and when it can't match one of them, it advances one single character (and ignores it) and then tries to match one of your tokens again at the next character:
lexer grammar Foo;
options {filter=true;}
StringLiteral
: ...
;
CharLiteral
: ...
;
SingleLineComment
: ...
;
MultiLineComment
: ...
;
When implemented properly, you can get the MultiLineComments (/* ... */) from a Java file quite easily without being afraid of single-line comments and string or char literals messing things up.
Obviously, your source files need to be valid in order to tokenize them properly; otherwise you get strange results!
My compiler uses Dypgen. This is a user-extensible GLR parser with lots of enrichments, so it can parse many languages. The bootstrap grammar is EBNF-like (it supports *, + and ? directly in your productions). It is powerful enough to dynamically load extensions, a fact my compiler leverages: the bulk of my programming language has its syntax dynamically loaded at compiler startup.
Dypgen is written in OCaml and generates OCaml code.
There is a C++ GLR parser called Elkhound which is powerful enough to parse most of C++.
However, for your actual requirements, you do not really need to do any serious parsing: a regular expression matching engine is probably good enough. Google's RE2 may be suitable (it provides most PCRE functionality, is a lot faster, and has a C++ interface).
Although this is less accurate, it is good enough because you can demand that inline documentation adhere to some simple formats. Most existing inline docs already do so for just this reason.
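For instance, extracting doc comments that follow a fixed /** ... */ convention needs nothing more than a regular expression. The sketch below uses std::regex for portability rather than RE2 (whose exact API is not shown here), and it will of course be fooled by a stray */ inside a string literal, which is exactly the accuracy trade-off mentioned above; the function name and the comment convention are assumptions for illustration.

#include <iostream>
#include <regex>
#include <string>
#include <vector>

// Extract "/** ... */" documentation blocks from source text with a regular
// expression. No real parsing is done, which is usually good enough if inline
// docs are required to follow a simple, fixed format.
std::vector<std::string> extractDocComments(const std::string& source) {
    // [\s\S]*? = non-greedy "any character, including newline"
    static const std::regex docBlock(R"(/\*\*([\s\S]*?)\*/)");
    std::vector<std::string> docs;
    for (std::sregex_iterator it(source.begin(), source.end(), docBlock), end;
         it != end; ++it) {
        docs.push_back((*it)[1].str());
    }
    return docs;
}

int main() {
    std::string src =
        "/** Adds two numbers. */\nint add(int a, int b) { return a + b; }\n";
    for (const auto& d : extractDocComments(src))
        std::cout << "doc:" << d << "\n";
}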
Where I work we used to use GOLD Parser. This is a lot simpler than ANTLR and supports multiple languages. We have since moved to ANTLR, however, as we needed to do more complex parsing, which we found ANTLR was better at than GOLD.
By concept/function/implementation, what are the differences between compilers and parsers?
A compiler is often made up of several components, one of which is a parser.
A common set of components in a compiler is:
Lexer - break the program up into words.
Parser - check that the syntax of the sentences is correct.
Semantic Analysis - check that the sentences make sense.
Optimizer - edit the sentences for brevity.
Code generator - output something with equivalent semantic meaning using another vocabulary.
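To make the hand-off between those stages concrete, here is a deliberately tiny sketch in C++. All names and the toy "stack machine" output are invented for illustration, and the tree/IR stages are collapsed into plain values to keep it short; a real compiler would build a syntax tree and an intermediate representation instead.

#include <cctype>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

// Lexer: break the program up into "words" (single digits and '+').
std::vector<char> lex(const std::string& src) {
    std::vector<char> tokens;
    for (char c : src)
        if (std::isdigit(static_cast<unsigned char>(c)) || c == '+')
            tokens.push_back(c);
    return tokens;
}

// Parser (plus a trivial semantic check): the "sentence" must look like
// digit ('+' digit)*.
void parse(const std::vector<char>& tokens) {
    if (tokens.empty() || tokens.size() % 2 == 0)
        throw std::runtime_error("syntax error");
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        bool expectDigit = (i % 2 == 0);
        bool isDigit = std::isdigit(static_cast<unsigned char>(tokens[i])) != 0;
        if (expectDigit != isDigit)
            throw std::runtime_error("syntax error");
    }
}

// Optimizer: constant-fold the whole expression ("edit the sentence for brevity").
int optimize(const std::vector<char>& tokens) {
    int sum = 0;
    for (char c : tokens)
        if (c != '+') sum += c - '0';
    return sum;
}

// Code generator: say the same thing in another vocabulary (a toy stack machine).
std::string generate(int value) {
    return "PUSH " + std::to_string(value) + "\nPRINT\n";
}

int main() {
    const std::string source = "1+2+3";
    const auto tokens = lex(source);  // lexical analysis
    parse(tokens);                    // syntax + trivial semantic analysis
    std::cout << generate(optimize(tokens));
}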
To add a little bit:
As mentioned elsewhere, Small C is a recursive descent compiler that generated code as it parsed. Basically syntactic analysis, semantic analysis, and code generation in one pass. As I recall, it also lexed in the parser.
A long time ago, I wrote a C compiler (actually several: the Introl-C family for microcontrollers) that used recursive descent and did syntax and semantic checking during the parse and produced a tree representation of the program from which code was generated.
Today, I'm working on a compiler that does source -> tokens -> AST -> IR -> code, pretty much as I described above.
A parser just reads a text into an internal, more abstract representation, often a tree or graph of some sort.
A compiler translates such an internal representation into another format. Most often this means converting source code into executable programs. But the target doesn't have to be machine code. It can be another programming language as well; the compiler would still be a compiler. Obviously a compiler needs a parser to actually read its input.
A compiler always has a parser inside. The parser just processes the language and returns a tree representation of it; the compiler then generates something from that tree, either actual machine code or another language.
A parser is one element of a compiler.
Are you looking for the differences between an interpreter and a compiler?
A parser takes in raw data and parses it into a tree structure. This syntax tree is then passed on to a generator, which will turn it into whatever it is supposed to generate.
So, a parser is a part of a compiler.
In general, a parser is a part of the compiler, but the compiler is designed to convert the received script into machine-readable code or sometimes into another language.
A compiler is a special type of computer program that translates a human readable text file into a form that the computer can more easily understand. At its most basic level, a computer can only understand two things, a 1 and a 0. At this level, a human will operate very slowly and find the information contained in the long string of 1s and 0s incomprehensible. A compiler is a computer program that bridges this gap.
A parser is a piece of software that evaluates the syntax of a script when it is executed on a web server. For scripting languages used on the web, the parser works like a compiler might work in other types of application development environments. Parsers are commonly used in script development because they can evaluate code when the script is executed and do not require that the code be compiled first.