What is the definitive EBNF syntax?

I am currently using the lark parser for Python to try to read in some problem specifications. I am getting confused about what the "proper" syntax is for Extended Backus-Naur form, especially about how the LHS and RHS are separated. The Wikipedia page uses an equals sign (=), lark expects just a colon (see the lark cheat sheet), and other sources use the ::= separator, e.g. the Atom ebnf package.
Is there a definitive answer? The official ISO spec seems to suggest that the "defining-symbol" should be =, but there seems to be wiggle room in the spec. So why all the different versions?

Since the world hasn't yet appointed a Lord High Commissioner of Grammar Formalisms, there is no definitive syntax. You're certainly free to use the ISO "Extended BNF" standard, particularly if you're writing some other ISO standard, but don't expect it to be implemented by a parser generator, even one which extends normal BNF. (There's no definitive standard for BNF, either.)
I have no way of knowing what was going on in the minds of the authors of the ISO standard, but I suspect that their expectations were realistic: it's intended to allow precise description of syntaxes for standards documents, but there are many features which are not suitable for automated implementation (including a way of writing rule restrictions in English to be used when the formalism isn't sufficiently general). It's often possible to automatically extract (most of) a grammar from an ISO standard, but the task is neither simple nor -- as far as I can see -- intended to be simple, since most ISO standards are not distributed as plain text documents and extracting formatted text from either PDF or HTML formats presents its own challenges.
The options you present for punctuation are most of the common ones, although mathematicians often write BNF using ⇒ to separate left- and right-hand sides. (Unfortunately, most keyboards lack that useful character.)
I'm personally not fond of the ::= separator, although it is used by various parser generators. It seems to me to be way too much typing for a simple punctuator, and it is also annoyingly difficult to align with alternatives flagged with |. But to each their own.
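For what it's worth, here is a minimal sketch of what the lark flavour looks like in practice (the grammar and sample input below are invented for illustration; the same rules would be written with = ... ; in ISO EBNF, or with ::= in other dialects):

from lark import Lark

# Illustrative only: lark separates LHS and RHS with a colon, quotes literal
# terminals, and uses regex-style repetition operators.
grammar = r"""
    start: pair ("," pair)*
    pair: CNAME ":" INT

    %import common.CNAME
    %import common.INT
    %import common.WS
    %ignore WS
"""

parser = Lark(grammar)
print(parser.parse("alpha: 1, beta: 22").pretty())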

Related

How is Coq's parser implemented?

I was amazed when I saw how Coq's parser is implemented, e.g.:
https://softwarefoundations.cis.upenn.edu/lf-current/Imp.html#lab347
It seems the parser is happy to accept almost any lexeme introduced by a Notation command, and the resulting parser can then parse any expression using that notation as-is. That would suggest the grammar is context-sensitive, but the flexibility goes well beyond my comprehension.
Any pointers on how this kind of parser is theoretically feasible, and how it should work? Any materials or references would help; I'm just trying to learn about this type of parser in general. Thanks.
Please do not ask me to read Coq's source myself. I want to understand the idea in general, not a specific implementation.
Indeed, this notation system is very powerful and it was probably one of the reasons for Coq's success. In practice, it is a source of much complication in the source code. I think that #ejgallego should be able to tell you more about it, but here is a quick explanation:
At the beginning, Coq documents were evaluated sentence by sentence (sentences are separated by dots) by coqtop. Some commands can define notations, and these modify the parsing rules when they are evaluated. Thus, later sentences are evaluated with a slightly different parser.
Since version 8.5, there is also a mechanism (the STM) to evaluate a document fully (many sentences in parallel) but there is some special mechanism for handling these notation commands (basically you have to wait for these to be evaluated before you can continue parsing and evaluating the rest of the document).
Thus, contrary to a normal programming language, where the compiler takes a document, passes it through the lexer, then the parser (parsing the full document in one go), and then hands an AST to the typer or other later stages, in Coq each command is parsed and evaluated separately. So there is no need to resort to complex contextual grammars...
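As a rough illustration of that idea (this is not Coq's actual machinery; it's a toy sketch in Python with lark, and all the rule names are invented): each "sentence" either extends the grammar or is parsed with the parser as extended so far.

from lark import Lark

# Toy sketch: the grammar is a mutable table of rules; a notation-like command
# replaces a rule, and the parser is rebuilt before the next sentence is read.
rules = {
    "start": "start: expr",
    "expr": "expr: NUMBER",
    "_common": "%import common.NUMBER\n%import common.WS\n%ignore WS",
}

def build_parser():
    return Lark("\n".join(rules.values()))

parser = build_parser()
print(parser.parse("42").pretty())

# "Evaluating" a notation command changes the parsing rules, so later
# sentences are read by a slightly different parser.
rules["expr"] = 'expr: NUMBER | expr "+" NUMBER'
parser = build_parser()
print(parser.parse("1 + 2").pretty())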
I'll drop my two cents to complement #Zimmi48's excellent answer.
Coq indeed features an extensible parser, which TTBOMK is mainly the work of Hugo Herbelin, built on the CAMLP4/CAMLP5 extensible parsing system by Daniel de Rauglaudre. Both authors are the canonical sources for information about the parser; I'll try to summarize what I know, but note that my experience with the system is limited.
The CAMLPX system basically supports any LL(1) grammar. Coq exposes the whole set of grammar rules to the user, allowing the user to redefine them. This is the base mechanism on which extensible grammars are built. Notations are compiled into parsing rules in the Metasyntax module, and unfolded in a later post-processing phase. And that's really it, AFAICT.
The system itself hasn't changed much over the whole 8.x series; #Zimmi48's comments relate more to the internal processing of commands after parsing. I recently learned that Coq v7 had an even more powerful system for modifying the parser.
In the words of Hugo Herbelin, "the art of extensible parsing is a delicate one", and indeed it is, but Coq has achieved a pretty great implementation of it.

Can regular expressions express all kinds of lexical analysis requirements?

I've been learning compiler principles recently. I notice that all the examples in textbooks describe a language's lexical analyzer using "lex" or "flex" with regular expressions to show how to analyze input source files.
Does this indicate that all known programming languages can have their lexical analysis implemented with a type-3 grammar? Or is it just that textbooks use simple samples to show the ideas?
Most lexemes in most languages can be identified with regular expressions, but there are exceptions. (When it comes to parsing computer languages, there are always exceptions. Without exception.)
For example, you cannot match a C++ raw string literal with a regex. You cannot tell without syntactic analysis whether /= in a JavaScript program is the single lexeme used to indicate divide-and-assign, or whether it is the start of a regular expression which matches a string starting with =. Languages which allow nested comments (unlike C) require something a bit more powerful.
But it's enormously easier to write a few regexes than to write a full state machine in raw C, so there is a lot of motivation to find ways of bending flex to your will for the few exceptional cases. And flex cooperates to a certain extent by providing features which allow you to escape from the regex straitjacket when necessary. In an advanced class on lexical analysis you might learn more about these features.
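To make that concrete, here is a sketch in plain Python rather than flex (the token classes are invented): a few regexes cover the ordinary tokens, while nested comments, one of the exceptions mentioned above, fall back to a small hand-written loop, because a single regex cannot count nesting depth.

import re

# Illustrative only: NUM/ID/OP are made-up token classes for this sketch.
TOKEN_RE = re.compile(r"(?P<NUM>\d+)|(?P<ID>[A-Za-z_]\w*)|(?P<OP>[-+*/=])")

def tokenize(src):
    pos = 0
    while pos < len(src):
        if src[pos].isspace():                  # ignorable whitespace
            pos += 1
        elif src.startswith("(*", pos):         # nested comment, OCaml-style
            depth, pos = 1, pos + 2
            while depth and pos < len(src):
                if src.startswith("(*", pos):
                    depth, pos = depth + 1, pos + 2
                elif src.startswith("*)", pos):
                    depth, pos = depth - 1, pos + 2
                else:
                    pos += 1
        else:
            m = TOKEN_RE.match(src, pos)
            if not m:
                raise SyntaxError(f"unexpected character at {pos}")
            pos = m.end()
            yield m.lastgroup, m.group()

print(list(tokenize("a = 1 + (* outer (* inner *) *) 2")))
# [('ID', 'a'), ('OP', '='), ('NUM', '1'), ('OP', '+'), ('NUM', '2')]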

Lexical Analysis of a Scripting Language

I am trying to create a simple script language for a resource API. The resource API mainly creates game resources in a structured manner. What I want is to deal with this API without writing C++ programs each time I want a resource, so we (my instructor from uni and I) decided to create a simple script language to create/edit resource files without compiling every time. There are also some other irrelevant factors, such as needing a command-line interface rather than a GUI program.
Anyway, here is a script sample:
<path>.<command> -<options>
/Graphics[3].add "blabla.png"
I didn't design this script language; the owner of the API did. The part before '.', as you can guess, is the path, and the part after '.' is the actual command plus some options, flags, etc. As a first step, I tried to write a grammar for the left part, because I thought I could use it while searching for information about lexical analyzers and parsers. The problem is that I am inexperienced when it comes to parsing and programming languages, and I am not sure whether it's correct or not. Here are some more examples and the grammar of the left side.
dir -> '/' | '/' path
path -> object '/' path | object
object -> number | string '[' number ']'
The notation of this grammar may be a mess, I don't know. There are 5 different possibilities for an object; they are:
String
"String"
Number
String[Number]
"String"[Number]
It has to start with the '/' symbol, and if that is the only symbol, I will accept it as the root.
Now my problem is: how can I lexically analyze this script? Is there a special method? What should my lexical analyzer do and not do (I read that some lexical analyzers also do syntactic analysis up to a point)? Do you think the grammar, etc. is technically appropriate? What kind of parsing method should I use (recursive descent, LL, etc.)? I am trying to make this a technically appropriate piece of work. It's not commercial, so I have time and can learn lexical analysis and parsing properly. I don't want to use a parser library.
What should my lexical analyzer do and not do?
It should:
recognize tokens
ignore ignorable whitespace and comments (if there are such things)
optionally, keep track of source location in order to produce meaningful error messages.
It should not attempt to parse the input, although that will be very tempting with such a simple language.
From what I can see, you have the following tokens:
punctuation: /, ., linear-white-space, new-line
numbers
unquoted strings (often called "atoms" or "ids")
quoted strings (possibly the same token type as unquoted strings)
I'm not sure what the syntax for -options is, but that might include more possibilities.
Choosing to return linear-white-space (that is, a sequence consisting only of tabs and spaces) as a token is somewhat questionable; it complicates the grammar considerably, particularly since there are probably places where whitespace is ignorable, such as the beginning and end of a line. But I have the intuition that you do not want to allow whitespace inside a path, and that you plan to require it between the command name and its arguments. That is, you want to prohibit both of these:
/left /right[3] .whimper "hello, world"
/left/right[3].whimper"hello, world"
But maybe I'm wrong. Maybe you're happy to accept both. That would be simpler, because if you accept both, then you can just ignore linear-whitespace altogether.
By the way, experience has shown that using new-line to separate commands can be awkward; sooner or later you will need to break a command into two lines in order to avoid having to buy an extra monitor to see the entire line. The convention (used by bash and the C preprocessor, amongst others) of putting a \ as the last character on a line to be continued is possible, but can lead to annoying bugs (like having an invisible space following the \ and thus preventing it from really continuing the line).
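To make the token list above concrete, here is a sketch using lark from Python (the rule names, the shape of -options, and the decision to ignore whitespace entirely are all assumptions on my part, since the full syntax isn't specified):

from lark import Lark

# Hypothetical grammar for the path/command language described above.
grammar = r"""
    line: path "." NAME option* [ESCAPED_STRING]
    path: "/" (object ("/" object)*)?
    object: NAME ["[" INT "]"]
          | ESCAPED_STRING ["[" INT "]"]
          | INT
    option: "-" NAME

    NAME: /[A-Za-z_]\w*/
    %import common.INT
    %import common.ESCAPED_STRING
    %import common.WS
    %ignore WS
"""

parser = Lark(grammar, start="line")
print(parser.parse('/Graphics[3].add "blabla.png"').pretty())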
From here down is 100% personal opinion, offered for free. So take it for what it's worth.
I am trying to make it technically appropriate piece of work. It's not commercial so I have time thus I can learn lexical analysis and parsing better. I don't want to use a parser library.
There is a contradiction here, in my opinion. Or perhaps two contradictions.
A technically appropriate piece of work would use standard tools; at least a lexical generator and probably a parser generator. It would do that because, properly used, the lexical and grammatical descriptions provided to the tools document precisely the actual language, and the tools guarantee that the desired language is what is actually recognized. Writing ad hoc code, even simple lexical recognizers and recursive descent parsers, for all that it can be elegant, is less self-documenting, less maintainable, and provides fewer guarantees of correctness. Consequently, best practice is "use standard tools".
Secondly, I disagree with your instructor (if I understand their proposal correctly, based on your comments) that writing ad hoc lexers and parsers aids in understanding lexical and parsing theory. In fact, it may be counterproductive. Bottom-up parsing, which is incredibly elegant both theoretically and practically, is almost impossible to write by hand and totally impossible to read. Consequently, many programmers prefer to use recursive-descent or Pratt parsers, because they understand the code. However, such parsers are not as powerful as a bottom-up parser (particularly GLR or Earley parsers, which are fully general), and their use leads to unnecessary grammatical compromises.
You don't need to write a regular expression library to understand regular expressions. The libraries abstract away the awkward implementation details (and there are lots of them, and they really are awkward) and let you concentrate on the essence of creating and using regular expressions.
In the same way, you do not need to write a compiler in order to understand how to program in C. After you have a good basis in C, you can improve your understanding (maybe) by learning how it translates into machine code, but unless you plan a career in compiler writing, knowing the details of obscure optimization algorithms is not going to make you a better programmer. Or, at least, it's not first on your agenda.
Similarly, once you really understand regular expressions, you might find writing a library interesting. Or not -- you might find it incredibly frustrating and give up after a couple of months of hard work. Either way, you will appreciate existing libraries more. But learn to use the existing libraries first.
And the same with parser generators. If you want to learn how to translate an idea for a programming language into something precise and implementable, learn how to use a parser generator. Only after you have mastered the theory of parsing should you even think of focusing on low-level implementations.

Is there a parser generator that can use the Wirth syntax?

i.e.: http://en.wikipedia.org/wiki/Wirth_syntax_notation
It seems like most use BNF / EBNF ...
The distinction made by the Wikipedia article looks to me like it is splitting hairs. "BNF/EBNF" has long meant writing grammar rules in roughly the following form:
nonterminal = right_hand_side end_rule_marker
As with other silly language differences ("{" in C, begin in Pascal, endif vs. fi), you can get very different-looking notations with identical meaning by choosing different, er, syntax for end_rule_marker and what you are allowed to say for the right_hand_side.
Normally people allow literal tokens in (your choice of) quotes, other nonterminal names, and, for EBNF, various "choice" operators, typically | or / for alternatives, * or + for "repeat", [ ... ] or ? for optional, etc.
Because people designing language syntax are playing with syntax, they seem to invent their own every time they write some down. (Check the various syntax formalisms in the language standards; none of them are the same). Yes, we'd all be better off if there were one standard way to write this stuff. But we don't do that for C or C++ or C# (not even Microsoft could follow their own standard); why should BNF be any different?
Because people who build parser generators usually use them to parse their own syntax, they can and do easily define their own notation for each parser generator. I've never seen one that did precisely the "WSN" version at Wikipedia, and I doubt I ever will in practice.
Does it matter much? Well, no. What really matters is the power of the parser generator behind the notation. You have to bend most grammars to match the power (well, weakness) of the parser generator; for instance, for LL-style generators, your grammar rules can't have left recursion (which WSN allows, according to my reading). People building parser generators also want to make it convenient to express non-parsing issues (such as how to build tree nodes), and so they add extra notation for those, too.
So what really drives the parser generator syntax are weaknesses of the parser generator to handle arbitrary languages, extra goodies deemed valuable for the parser generator, and the mood of the guy implementing it.

Good grammar for date data type for recursive descent parser LL(1)

I'm building a custom expression parser and evaluator for a production environment, to provide a limited DSL to the users. The parser itself, like the DSL, needs to be simple. The parser is going to be built in an exotic language that doesn't support dynamic expression parsing and has no parser generator tools available.
My current decision is to go for a recursive descent approach with an LL(1) grammar, so that even programmers with no previous experience in expression evaluation could quickly learn how the code works.
It has to handle mixed expressions made up of several data types: decimals, percentages, strings and dates. And dates in the format dd/mm/yyyy are easy to confuse with a string of division ops.
Is there a good solution to this problem?
My own solution, aimed at keeping the parser simple, involves prefixing dates with a special symbol, let's say an apostrophe:
<date> ::= <apostr><digit><digit>/<digit><digit>/<digit><digit><digit><digit>
<apostr> ::= '
<digit> ::= '0'..'9'
First off, I'm a fan of LL parsers, so I approve of your approach heartily. Note that one of the newer popular parser generators (ANTLR) is LL. If you allow more look-ahead, rather than restricting yourself to LL(1), you can do pretty much anything you'd ever want to do with an LR(1) parser, but the code will be far clearer, more reliable, and easier to debug.
I don't know enough about your overall grammar to be able to tell. It is possible you might be able to design things so that the LL parser can always tell from context whether it is looking at an integer expression or a date constant. However, assuming you can't, yeah, you'd need some kind of way to tell the difference. The only other thing I can think of would be to use a backslash as a separator instead of a slash, but that's kinda ugly.
An LL-like lexerless parser with infinite lookahead is what you need, namely a PEG:
http://en.wikipedia.org/wiki/Parsing_expression_grammar
With an ordered choice it is quite easy to avoid this confusion between date literals and division of constants.
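For instance, here is a drastically simplified sketch of the ordered-choice idea in Python (the grammar fragment and names are invented): at each primary, the parser tries the full date shape first and only then falls back to a plain number, so "01/02/2003" is never read as two divisions.

import re

DATE = re.compile(r"\d{2}/\d{2}/\d{4}")   # first alternative: a whole date
NUM = re.compile(r"\d+(\.\d+)?")          # fallback: a plain number

class Parser:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def primary(self):
        m = DATE.match(self.text, self.pos)
        if m:
            self.pos = m.end()
            return ("date", m.group())
        m = NUM.match(self.text, self.pos)
        if m:
            self.pos = m.end()
            return ("num", m.group())
        raise SyntaxError(f"unexpected input at position {self.pos}")

    def expr(self):                        # expr <- primary ('/' primary)*
        node = self.primary()
        while self.text.startswith("/", self.pos):
            self.pos += 1
            node = ("div", node, self.primary())
        return node

print(Parser("01/02/2003").expr())   # ('date', '01/02/2003')
print(Parser("10/2").expr())         # ('div', ('num', '10'), ('num', '2'))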
When a language is intended for human input, defining it is as much a matter of
adding syntax constraints to ensure unambiguous and easy parsing
removing/bending syntax to ensure that the language feels intuitive, "natural", to the intended human audience.
Satisfying the second requirement is much harder than the first one, and requires insight into the
intended use cases of the language
Which type of keyboard/input device is available? Are there some characters among the allowed characters which are hard to produce or to see on the display?
Which tokens / expression will be frequently used, which will be required only occasionally?
Are users frequently inputting short, ad-hoc code snippets, or are the programs meant to be reused and modified over long periods
... etc.
background/culture of the intended audience
Which common practices/idioms from other regular (and plain natural) languages can or should be reused if possible?
Should one favor a terse but cryptic style, or a more explicit, but more verbose style?
... etc.
Basically, it is hard to make suggestions about a language syntax, without a good grasp of the intended usage and users.
Nevertheless, I'd like to suggest the following for the date format question:
Use an alternative format for date values altogether; one that would be "natural" enough to users but distinctive enough that a regular grammar can describe it.
For example, one that uses a 3-letter abbreviation for the month (downside: the DSL becomes tied to English or another tongue; but also an advantage: the ambiguity for humans about which part is the day and which is the month is removed). Tentatively:
dd-mmm-yyyy (may seem unnatural in cultures where the prevailing date order starts with the month; maybe yyyy-mmm-dd then?)
mmm-dd-yyyy (better for the above-mentioned cultures)
ddmmmyyyy (avoids the dashes, but imposes leading zeros)
MnnDnnYyyyy (using "M", "D" and "Y" (or others) as explicit prefixes; this is completely culture-neutral, but maybe a bit awkward...)
Anyway, just ideas... Applicability will vary with the human/cultural factors mentioned, and with the rest of the syntax. For example, the above may imply that variables be explicitly marked (that's one of the reasons many languages use the $ prefix, for example), to avoid possible conflicts with [odd, but possible] variable identifiers.
In a nutshell, the idea is to replace the need for a special character prefix (which may collide with the use of these characters in mathematical and other expressions) by making the 12 month tags a good enough discriminator for the parser.
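For what it's worth, with a month tag the date token becomes trivially regular; here is a sketch of the kind of pattern a lexer could use (illustrative only, assuming English month abbreviations):

import re

DATE = re.compile(r"\d{2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{4}")
print(bool(DATE.fullmatch("05-Mar-2024")))   # True: a date token
print(bool(DATE.fullmatch("05/03/2024")))    # False: left to the expression grammar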
