I want to idly code a little Prolog preprocessor (sometime in the future) ... and use Prolog directly to do that.
As I don't want to start reinventing square wheels immediately, does the Prolog community know of a ready-made grammar (a DCG program?) that hoovers up tokens or text at one end and spits out terms at the other?
If possible, use term expansion (see SWI-Prolog: Conditional compilation and program transformation).
If you want to parse Prolog, consider the paper:
Prolog Coding Guidelines: Status and Tool Support by Nogatz et al., presented at the International Conference on Logic Programming 2019
There is an implementation:
https://github.com/fnogatz/plammar
I am currently using the lark parser for Python to try and read in some problem specifications. I am getting confused about what the "proper" syntax is for Extended Backus-Naur form, especially about how the LHS and RHS are separated. The Wikipedia page uses an equals sign =; lark expects just a colon (see the lark cheat sheet); other sources use the ::= separator - e.g. the Atom ebnf package.
Is there a definitive answer? The official ISO spec seems to suggest that the "defining-symbol" should be = but there seems to be wriggle room in the spec. So why all the different versions?
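For concreteness, here is roughly the kind of lark grammar I mean, using its colon as the defining symbol (the rules are a toy example I made up, not my real specification):

```python
from lark import Lark

# Toy lark grammar: rule names are lowercase, the defining symbol is a colon,
# and common terminals can be imported and ignored. This is only to show the
# notation, not any particular problem specification.
grammar = r"""
    start: term ("+" term)*
    term: NUMBER

    %import common.NUMBER
    %import common.WS
    %ignore WS
"""

parser = Lark(grammar)
print(parser.parse("1 + 2 + 3").pretty())
```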
Since the world hasn't yet appointed a Lord High Commissioner of Grammar Formalisms, there is no definitive syntax. You're certainly free to use the ISO "Extended BNF" standard, particularly if you're writing some other ISO standard, but don't expect it to be implemented by a parser generator, even one which extends normal BNF. (There's no definitive standard for BNF, either.)
I have no way of knowing what was going on in the minds of the authors of the ISO standard, but I suspect that their expectations were realistic: it's intended to allow precise description of syntaxes for standards documents, but there are many features which are not suitable for automated implementation (including a way of writing rule restrictions in English to be used when the formalism isn't sufficiently general). It's often possible to automatically extract (most of) a grammar from an ISO standard, but the task is neither simple nor -- as far as I can see -- intended to be simple, since most ISO standards are not distributed as plain text documents and extracting formatted text from either PDF or HTML formats presents its own challenges.
The options you present for punctuation are most of the common ones, although mathematicians often write BNF using ⇒ to separate left- and right-hand sides. (Unfortunately, most keyboards lack that useful character.)
I'm personally not fond of the ::= separator, although it is used by various parser generators. It seems to me to be way too much typing for a simple punctuator, and it is also annoyingly difficult to align with alternatives flagged with |. But to each their own.
I was amazed by how Coq's parser is implemented, e.g.:
https://softwarefoundations.cis.upenn.edu/lf-current/Imp.html#lab347
It seems remarkable that the parser can accept whatever lexemes a Notation command introduces, and that the subsequent parser can then parse any expression written with them as-is. That suggests the grammar must be context-sensitive, yet it is so flexible that it goes beyond my comprehension.
Any pointers on how this kind of parser is theoretically feasible? How does it work? Any materials or references would help; I just want to learn about this type of parser in general. Thanks.
Please don't ask me to read Coq's source myself; I want to understand the idea in general, not one specific implementation.
Indeed, this notation system is very powerful and it was probably one of the reasons for Coq's success. In practice, it is a source of much complication in the source code. I think that #ejgallego should be able to tell you more about it, but here is a quick explanation:
At the beginning, Coq's documents were evaluated sentence by sentence (sentences are separated by dots) by coqtop. Some commands can define notations and these modify the parsing rules when they are evaluated. Thus, later sentences are evaluated with a slightly different parser.
Since version 8.5, there is also a mechanism (the STM) to evaluate a document fully (many sentences in parallel) but there is some special mechanism for handling these notation commands (basically you have to wait for these to be evaluated before you can continue parsing and evaluating the rest of the document).
Thus, contrary to a normal programming language, where the compiler takes a document, passes it through the lexer, then the parser (parsing the full document in one go), and then hands an AST to the type checker or other later stages, in Coq each command is parsed and evaluated separately, so there is no need to resort to complex contextual grammars...
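To make that idea concrete, here is a toy sketch (in Python, and emphatically not how Coq actually implements it) of a loop that processes one sentence at a time and lets a notation-defining sentence extend the table that later sentences are parsed with:

```python
import re

# Toy model of sentence-by-sentence evaluation with an extensible parser.
# The "Notation" syntax, rule table, and node names below are all invented
# for illustration; only the overall shape mirrors the explanation above.

notations = {}   # surface pattern -> constructor name (grows as we go)

def parse(sentence):
    for pattern, head in notations.items():
        m = re.fullmatch(pattern, sentence)
        if m:
            return (head, *m.groups())
    return ("atom", sentence)

def process(sentence):
    if sentence.startswith("Notation "):
        # e.g.  Notation plus ::= (\d+) \+ (\d+)
        _, head, _, pattern = sentence.split(maxsplit=3)
        notations[pattern] = head          # later sentences see the new rule
        return ("notation_defined", head)
    return parse(sentence)

doc = [
    "1 + 2",                               # parsed before the notation exists
    r"Notation plus ::= (\d+) \+ (\d+)",   # extends the parser
    "1 + 2",                               # now parsed with the new rule
]
for s in doc:
    print(process(s))
```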
I'll drop my two cents to complement #Zimmi48's excellent answer.
Coq indeed features an extensible parser, which TTBOMK is mainly the work of Hugo Herbelin, built on the CAMLP4/CAMLP5 extensible parsing system by Daniel de Rauglaudre. Both are the canonical sources for information about the parser; I'll try to summarize what I know, but note that my experience with the system is limited.
The CAMLPX system basically supports any LL(1) grammar. Coq exposes the whole set of grammar rules to the user, allowing the user to redefine them. This is the base mechanism on which extensible grammars are built. Notations are compiled into parsing rules in the Metasyntax module, and unfolded in a later post-processing phase. And that's really it, AFAICT.
The system itself hasn't changed much in the whole 8.x series, #Zimmi48's comments are more related to the internal processing of commands after parsing. I recently learned that Coq v7 had an even more powerful system for modifying the parser.
In the words of Hugo Herbelin, "the art of extensible parsing is a delicate one", and indeed it is, but Coq has achieved a pretty great implementation of it.
I've been learning compiler principles recently. I notice that all the examples in textbooks describe a language's lexical analyzer using "lex" or "flex" with regular expressions to show how to analyze input source files.
Does this indicate that lexical analysis for all known programming languages can be done with a type-3 grammar? Or are the textbooks just using simple examples to illustrate the ideas?
Most lexemes in most languages can be identified with regular expressions, but there are exceptions. (When it comes to parsing computer languages, there are always exceptions. Without exception.)
For example, you cannot match a C++ raw string literal with a regex. You cannot tell without syntactic analysis whether /= in a JavaScript program is the single lexeme used to indicate divide-and-assign, or whether it is the start of a regular expression which matches a string starting with =. Languages which allow nested comments (unlike C) require something a bit more powerful.
But it's enormously easier to write a few regexes than to write a full state machine in raw C, so there is a lot of motivation to find ways of bending flex to your will for a few exceptional cases. And flex cooperates to a certain extent by providing features which allow you to escape from the regex straitjacket when necessary. In an advanced class on lexical analysis you might learn more about these features.
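To make the "more powerful than a regex" point concrete, here is a small illustration (in Python, with made-up (* ... *) comment delimiters): nested comments need a depth counter, which no single regular expression can keep, while ordinary lexemes reduce to plain regexes.

```python
import re

def skip_nested_comment(text, pos):
    """Skip a comment delimited by (* ... *) which may nest (delimiters are
    invented for this example). Returns the index just past the comment."""
    assert text.startswith("(*", pos)
    depth, i = 0, pos
    while i < len(text):
        if text.startswith("(*", i):
            depth += 1
            i += 2
        elif text.startswith("*)", i):
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise SyntaxError("unterminated comment")

# Ordinary lexemes, by contrast, are perfectly happy as regexes:
NUMBER = re.compile(r"\d+")
IDENT  = re.compile(r"[A-Za-z_]\w*")

print(skip_nested_comment("(* outer (* inner *) still outer *) x", 0))
```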
I've been learning Erlang recently and I have a question.
I have an expression like this: (~(2+1)).
I want to parse it into Polish notation, e.g. {unaryMin, {add, 2, 1}}.
How do I start?
If you want to parse something, from a simple formula to a programming language, you should start by learning about grammars, languages, and compiler-compilers. Learning how to parse and translate/interpret something into another format is a very common task for any programmer (pretty much everything has a compiler/interpreter, even your image viewer, web browser, etc.), so it's very important to learn about those things.
For Erlang, LYSE has a chapter about making a reverse Polish notation calculator here, and for converting an infix expression to a prefix/postfix one, you should read about the shunting-yard algorithm.
Erlang also has its own versions of yacc & lex: yecc and leex.
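If you want to see the shape of such a parser before diving into yecc/leex, here is a rough sketch (in Python purely for illustration; the tokenizer, node names, and grammar are all made up) of a tiny recursive-descent parser that turns (~(2+1)) into a prefix tree analogous to {unaryMin, {add, 2, 1}}:

```python
def tokenize(s):
    # Split the input into (kind, value) tokens; only digits, ~, +, ( ) here.
    out, i = [], 0
    while i < len(s):
        c = s[i]
        if c.isspace():
            i += 1
        elif c.isdigit():
            j = i
            while j < len(s) and s[j].isdigit():
                j += 1
            out.append(("num", int(s[i:j])))
            i = j
        elif c in "~+()":
            out.append((c, c))
            i += 1
        else:
            raise SyntaxError(f"unexpected character {c!r}")
    return out

def parse_expr(toks, i):
    # Parse one expression starting at token i; return (tree, next index).
    kind, val = toks[i]
    if kind == "~":                       # unary minus
        node, i = parse_expr(toks, i + 1)
        return ("unaryMin", node), i
    if kind == "(":
        node, i = parse_expr(toks, i + 1)
        kind, _ = toks[i]
        if kind == "+":                   # binary addition
            right, i = parse_expr(toks, i + 1)
            node = ("add", node, right)
            kind, _ = toks[i]
        assert kind == ")"
        return node, i + 1
    if kind == "num":
        return val, i + 1
    raise SyntaxError(f"unexpected token {kind}")

tree, _ = parse_expr(tokenize("(~(2+1))"), 0)
print(tree)                               # ('unaryMin', ('add', 2, 1))
```

In Erlang you would express the same two halves with leex (the token rules) and yecc (the grammar rules, with actions that build tuples like {unaryMin, ...}).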
I have stumbled upon the following F77 yacc grammar: http://yaxx.cvs.sourceforge.net/viewvc/yaxx/yaxx/fortran/fortran.y?revision=1.3&view=markup.
How can I make a Fortran 77 parser out of this file using Happy?
Why is there some C?/C++? code in that .y file?
UPDATE: Thank you for your replies!
I've been playing with two fresh approaches for a while now:
extracting and modifying the parser from the source code package bundled with a paper titled Parametric Fortran,
writing a grammar from scratch with the help of BNFC.
I've got both to parse simple code excerpts already. I'll keep people in the know should something usable come into existence within this century ^__^" hehe.
P/S: Want to see whether I could gather enough momentum on my own to initiate a project for an automatic differentiation engine to replace a binary-only one we depend on for the time being. For entertainment at the initial stages: I'm watching Love Shuffle! It's a very enjoyable J-Drama! Highly recommendable ...
The C code consists of the semantic actions that run when the parser reduces the stack as the syntax is read in. These actions are in C because the definition is intended for Bison/Yacc, which produce a C source file.
If you want to use Happy, port the BNF to the Happy definition syntax and write your semantics in Haskell.
That's just the tip of the iceberg for getting anything useful, however.
If you don't have a copy already, invest in the Dragon Book (Compilers: Principles, Techniques & Tools by Aho, Lam, Sethi, and Ullman; Pearson).
While the other answers are true in the general sense (you'll need to write your own actions to do anything meaningful), the Yacc definition that you linked to actually doesn't have any actions associated with the grammar rules. What it does is define the yyerror function and some code for extracting values from yylval based on the token type.
If you have no clue what yyerror/yylval are about you should read a bison/flex tutorial. The Dragon book is also a good resource if you're more serious about this. There are also some excellent handouts from a Stanford course on compilers floating around the Net, which are based on the book.
You'll need to design an AST that can be constructed in a way equivalent to what the C fragments in the Yacc file do.
Use BNFC and write your own grammar from scratch! BNFC works wonders and you could do your parsing exactly as you desire.