I was wondering whether the standard Scala parser combinators contain a parser that accepts the same identifiers that the Scala language itself accepts (as specified in the Scala Language Specification, Section 1.1).
The StdTokenParsers trait has an ident parser, but it rejects identifiers like empty_?.
(If there is indeed no such parser, I could also just instantiate the Scala parser itself, but that wouldn't be as lightweight anymore.)
Not a standard parser combinator, but there are canonical tools for testing Scala id-ness in scala.tools.nsc.util.Chars. No need to instantiate either Global or a Scala scanner.
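For instance, a rough sketch built on those helpers; isScalaIdent is my own approximation, covering alphanumeric and mixed identifiers like empty_? but not pure operator identifiers, backquoted identifiers, or keywords:

// Requires scala-compiler on the classpath.
import scala.tools.nsc.util.Chars._

def isScalaIdent(s: String): Boolean = s.nonEmpty && {
  // split into the leading letter/digit/underscore run and the rest
  val (alnum, rest) = s.span(c => isIdentifierPart(c))
  isIdentifierStart(s.head) &&
    (rest.isEmpty ||
      // an operator-character suffix is only legal right after an underscore
      (alnum.endsWith("_") && rest.forall(c => isOperatorPart(c))))
}

// isScalaIdent("empty_?") == true
// isScalaIdent("foo?")    == false (no underscore before the operator chars)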
Related
I am planning to implement a meta language on top of Xtext. In other words, I am using the Xtext grammar to define my own meta language. This meta language can then be used to define a language (using the syntax that I defined). Using the defined language, a model can be created by the user.
Hence, I would like to use Xtext/Xtend as a generator for parser generators. This would enable me to add as many meta levels as I like. My understanding is that Xtext itself is defined using Xtext, so this should be possible?
The problem is that I don't know how to approach this, as I am not an expert in Xtext or parser generator frameworks in general. Any solutions/approaches/hints are welcome.
Update (more details and motivation)
Xtext can be used to generate anything, so I could write a generator based on Xtext that generates a parser. I would specify my meta language's grammar and use Xtext to generate a parser for it, giving me access to an AST that represents a model written in my meta language. From there on, however, I would be on my own to do whatever I want with the AST, e.g. generate a parser (because the AST represents the grammar of a user-defined language). But since Xtext already has the ability to generate parsers, I was thinking of reusing that feature instead of implementing my own parser generator based on the AST of a grammar.
My motivation is the wish to define my own DSL grammar language (as a replacement for Xtext), while still being able to use the infrastructure provided by the Xtext project.
I came to the following solution:
A grammar that was written using my grammar language will be parsed by Xtext. Next, the resulting AST is transformed to the Xtext grammar language AST, which can be used as input for the existing parser generator.
In general, given some grammar language l1, a model written in this language will be parsed and the resulting AST will be transformed to the AST of the grammar language l2 that was used to specify l1. This step is repeated until we have an AST representing a model of the Xtext grammar language, which will be used to generate the new parser.
Naturally, any information added with the definition of a new grammar language will be lost in each transformation step. Therefore, the infrastructure developed around a grammar language is responsible for providing functionality that makes this information available to the languages defined with it.
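To make this concrete, here is a hypothetical sketch of one such lowering step; the MyGrammarAst/XtextGrammarAst types and toXtext are stand-ins I made up for illustration, not Xtext's real grammar model (which is an EMF model):

// Made-up ASTs for a meta grammar language and for the Xtext grammar language.
case class MyRule(name: String, alternatives: List[List[String]])
case class MyGrammarAst(rules: List[MyRule])
case class XtextRule(name: String, body: String)
case class XtextGrammarAst(rules: List[XtextRule])

// One lowering step: my grammar language -> the Xtext grammar language.
def toXtext(g: MyGrammarAst): XtextGrammarAst =
  XtextGrammarAst(g.rules.map { r =>
    XtextRule(r.name, r.alternatives.map(_.mkString(" ")).mkString(" | "))
  })

// Each extra meta level repeats such a step until an XtextGrammarAst remains,
// which the stock Xtext parser generator can consume.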
For a different approach, see:
WWW.XTRAN-LLC.com/xtran.html#parse-gen
In a nutshell, I got tired of creating parsers for XTRAN, our Expert System whose rules language manipulates computer languages, data, and text, so I created a parsing engine that directly executes EBNF at parse time (as opposed to creating parsing code, as e.g. lex/yacc and ANTLR do). Since XTRAN must also render code content represented in its Internal Representation / AST (after it's manipulated) as source code text, I created a corresponding rendering engine that executes (a much simpler form of) EBNF at render time.
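For readers unfamiliar with the idea, here is a minimal Scala sketch (emphatically not XTRAN's engine) of what "directly executing EBNF at parse time" can mean: the grammar is plain data walked by a generic interpreter, so no parser code is ever generated:

// A tiny EBNF-like grammar representation, interpreted at parse time.
sealed trait Expr
case class Lit(s: String)         extends Expr // terminal
case class Ref(rule: String)      extends Expr // nonterminal reference
case class Seq2(a: Expr, b: Expr) extends Expr // sequence
case class Alt(a: Expr, b: Expr)  extends Expr // ordered choice

type Grammar = Map[String, Expr]

// Returns the remaining input on success.
def parse(g: Grammar, e: Expr, in: String): Option[String] = e match {
  case Lit(s)     => if (in.startsWith(s)) Some(in.drop(s.length)) else None
  case Ref(r)     => parse(g, g(r), in)
  case Seq2(a, b) => parse(g, a, in).flatMap(rest => parse(g, b, rest))
  case Alt(a, b)  => parse(g, a, in).orElse(parse(g, b, in))
}

// expr ::= "a" expr | "a"
val g: Grammar = Map("expr" -> Alt(Seq2(Lit("a"), Ref("expr")), Lit("a")))
// parse(g, Ref("expr"), "aaa") == Some("")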
(Background: this was inspired by Is C++ context-free or context-sensitive? while I am writing a simple compiler of my own using JFlex/CUP.)
If they are written using a lexer/parser generator, how do we specify the grammar?
Since code like
a b(c);
could be interpreted as either a function declaration or a local variable definition, how could we handle it in the grammar definition file?
Another example could be the token ">>" in the following code:
std::vector<std::vector<int>> foo;
int a = 1000 >> 4;
Thanks
Are the compilers of C++ written using a lexer/parser generator?
It depends. Some are, some aren't.
GCC originally used GNU Bison, but was rewritten a couple of years ago with a hand-written parser. If I have understood that correctly, the main reason was that writing the parser by hand gives you more control over the parser state, and specifically, how much "extraneous" data to keep in there, so that you can generate better error messages.
If they are written using a lexer/parser generator, how do we specify the grammar?
This depends on which parser generator you are using.
Since code like
a b(c);
could be interpreted as either a function declaration or a local variable definition, how could we handle it in the grammar definition file?
Some parser generators may be powerful enough to handle this directly.
Some aren't. Some parser generators which aren't powerful enough have a concept of semantic actions that allows you to attach code written in an arbitrarily powerful language to parser rules. E.g. yacc allows you to attach C code to rules.
Otherwise, you will have to handle it during semantic analysis.
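One classic way to handle the a b(c); case during analysis is lexer feedback: the lexer consults a symbol table that the parser keeps up to date, so an identifier is tokenized differently once it is known to name a type. A Scala sketch with made-up names, not taken from any particular compiler:

sealed trait Token
case class Ident(name: String)    extends Token
case class TypeName(name: String) extends Token

// The parser adds every declared type to knownTypes as it goes.
class Lexer(var knownTypes: Set[String]) {
  def classify(word: String): Token =
    if (knownTypes(word)) TypeName(word) else Ident(word)
}

val lex = new Lexer(knownTypes = Set("a"))
lex.classify("a") // TypeName("a") -> a b(c); parses as a declaration
lex.classify("x") // Ident("x")    -> x b(c); parses as an expression

With the two token kinds distinct, the declaration and function-call rules in the grammar no longer conflict.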
I know that it's possible to use, for example, bison-generated Java files in a Scala project, but are there any native "grammar to Scala" LALR(1) generators?
Another plug here: ScalaBison is close to LALR(1) and lets you use Scala in the actions.
I'm not really answering the original question, and please excuse the plug, but you may be interested in our sbt-rats plugin for the sbt tool. It uses the Rats! parser generator for Java, but makes it easier to use from Scala.
Rats! uses parsing expression grammars as its syntax description formalism, not context-free grammars and definitely not LALR(1) grammars. sbt-rats also has a high-level syntax definition language that in most cases means you do not need to write semantic actions to get a syntax tree that represents your input. The plugin will optionally generate case classes for the tree representation and a pretty-printer for the tree structure.
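To illustrate the PEG difference, here is a small example using Scala's own parser combinators rather than Rats! syntax (their | likewise commits to the first alternative that succeeds): in a PEG, the order of the alternatives is the disambiguation rule, so the if-with-else form is simply listed before the else-less form instead of creating a grammar conflict.

import scala.util.parsing.combinator.RegexParsers

object Tiny extends RegexParsers {
  // Ordered choice: the longer if-else form must be tried first.
  def stmt: Parser[String] =
    ("if" ~ id ~ "then" ~ stmt ~ "else" ~ stmt ^^ (_ => "if-else")) |
    ("if" ~ id ~ "then" ~ stmt ^^ (_ => "if")) |
    (id ^^ (_ => "expr"))
  def id: Parser[String] = "[a-z]+".r
}

// Tiny.parseAll(Tiny.stmt, "if a then b else c") succeeds as "if-else"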
Similar to Generating n statements from context-free grammars, I want to randomly generate sentences from a grammar.
What is a good parser generator for manipulating the actual grammar productions themselves? I want the parser generator to actually give me access to the productions (production objects?).
If I had a grammar akin to:
start_symbol ::= foo
foo ::= bar | baz
What is a good parser generator for:
giving me the starting production symbol
allowing me to choose one production from the RHS of the start symbol (foo in this case)
giving me the production options for foo
Clearly every parser generator has internal representations for productions and methods of associating a production with its RHS, but which one makes those internals easy to manipulate?
Note: the blog entry linked to from the other SO question I mentioned has some sort of custom CFG parser. I want to use an actual grammar for a real parser, not generate my own grammar parser.
It should be pretty easy to write a grammar that matches the grammar that a parser generator accepts. (With an open source parser generator, you ought to be able to fetch such a grammar from the parser generator's source code; they all tend to have self-grammars.) With that, you can then parse any grammar the parser generator accepts.
If you want to manipulate the parsed grammar, you'll need an abstract syntax tree for it. You can make most parser generators build a tree, either through built-in mechanisms or ad hoc code you add.
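As a sketch of what you can then do with such a tree, random sentence generation is just a walk that picks one RHS per nonterminal. The Grammar representation below is made up for illustration, not any particular generator's internals:

import scala.util.Random

// Nonterminals map to their alternatives; anything without a rule is a terminal.
case class Grammar(start: String, rules: Map[String, List[List[String]]])

def generate(g: Grammar, sym: String, rnd: Random = new Random): List[String] =
  g.rules.get(sym) match {
    case None       => List(sym) // terminal
    case Some(alts) => // pick one RHS at random and expand it
      alts(rnd.nextInt(alts.size)).flatMap(s => generate(g, s, rnd))
  }

// The grammar from the question:
val g = Grammar("start_symbol", Map(
  "start_symbol" -> List(List("foo")),
  "foo"          -> List(List("bar"), List("baz"))))
// generate(g, g.start) == List("bar") or List("baz")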
I'm looking for a parser generator for a reasonably complex language (similar in complexity to Python itself) which works with Python3. If it can generate an AST automatically, this would be a bonus, but I'm fine if it just calls rules while parsing. I have no special requirements, nor does it have to be very efficient/fast.
LEPL isn't exactly a parser generator - it's better! The parsers are defined in Python code and constructed at runtime (hence some inefficiency, but much easier to use). It uses operator overloading to construct a quite readable DSL: things like c = a & b | b & c for the BNF c ::= a b | b c.
You can pass the results of a (sub-)parser to an arbitrary callable, and this is very useful for AST generation (also useful for converting e.g. number literals to Python-level number objects). It's a recursive descent parser, so you'd better avoid left recursion in the grammar (there are memoization objects that can make left recursion work, but "Lepl's support for them has historically been unreliable (buggy)").
ANTLR can generate a lexer and/or parser in Python. You can also use it to create ASTs and iterator-like structures to walk the AST (called tree grammars).
See ANTLR get and split lexer content for an ANTLR demo that produces an AST with the Python target.