Backus-Naur Form uses ::= between the left and right sides of the production rules of a grammar. Wikipedia tells me that notation evolved from :≡. Does either of those symbols have a name?
Based on rici's tip that Unicode simply calls it DOUBLE COLON EQUALS, it doesn't seem there is another official name.
Identifiers typically consist of underscores, digits, and uppercase and lowercase letters, where the first character is not a digit. When writing lexers, it is common to have helper functions such as is_digit or is_alnum. If one were to implement such a function to scan a character used in an identifier, what would it be called? Clearly, is_identifier is wrong, as that would be the entire token that the lexer scans and not the individual character. I suppose is_alnum_or_underscore would be accurate, though quite verbose. For something as common as this, I feel like there should be a single word for it.
Unicode Annex 31 (Unicode Identifier and Pattern Syntax, UAX31) defines a framework for the definition of the lexical syntax of identifiers, which is probably as close as we're going to come to a standard terminology. UAX31 is used (by reference) by Python and Rust, and has been approved for C++23. So I guess it's pretty well mainstream.
UAX31 defines three sets of identifier characters, which it calls Start, Continue and Medial. All Start characters are also Continue characters; no Medial character is a Continue character.
That leads to the simple regular expression (UAX31-D1 Default Identifier Syntax):
<Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*
A programming language which claims conformance with UAX31 does not need to adopt the exact membership of each of these sets, but it must explicitly spell out the deviations in what's called a "profile". (There are seven other requirements, which are not relevant to this question. See the document if you want to fall down a very deep rabbit hole.)
That can be simplified even more, since neither UAX31 nor (as far as I know) the profile for any major language places any characters in Medial. So you can go with the flow and just define two categories: identifier-start and identifier-continue, where the first one is a subset of the second one.
You'll see that in a number of grammar documents:
Python:
identifier ::= xid_start xid_continue*
Rust:
IDENTIFIER_OR_KEYWORD : XID_Start XID_Continue*
                      | _ XID_Continue+
C++:
identifier:
    identifier-start
    identifier identifier-continue
So that's what I'd suggest. But there are many other possibilities:
Swift: calls the sets identifier-head and identifier-characters.
Java: calls them JavaLetter and JavaLetterOrDigit.
C: defines identifier-nondigit and identifier-digit; Continue would be the union of the two sets.
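If all you need are the two lexer helper predicates the question asks about, here is a minimal sketch in Python; the names is_identifier_start and is_identifier_continue are my own suggestion, and the trick relies on the fact that Python's str.isidentifier() already implements the XID_Start/XID_Continue profile described above:

def is_identifier_start(ch: str) -> bool:
    # True for XID_Start characters (plus '_', which Python's profile adds).
    return ch.isidentifier()

def is_identifier_continue(ch: str) -> bool:
    # "a" + ch is a valid identifier exactly when ch is a Continue character.
    return ("a" + ch).isidentifier()

assert is_identifier_start("_") and not is_identifier_start("1")
assert is_identifier_continue("1") and not is_identifier_continue("-")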
The initial title of this question was: "Why does my lexer rule not work until I change it to a parser rule?" The contents below are related to that question. I then found new information and changed the title; please see my comment!
My ANTLR grammar (only the "Spaces" rule and its use is important).
Because my input comes from an OCR source, there can be multiple whitespace characters in a row; on the other hand, I need to recognize the spaces, because they have meaning for the text structure.
For this reason in my grammar I defined
Spaces: Space (Space Space?)?;
but this throws the error above - the whitespace is not recognized.
So when I replace it with a parser rule (lowercase!) in my grammar
spaces: Space (Space Space?)?;
the error seems to be solved (subsequent errors appear - not part of this question).
So why is the error solved then in this concrete case when using a parser rule instead of a lexer rule?
And in general - when to use a lexer rule and when a parser rule?
Thank you, guys!
A single space is being recognized as a Space and not as a Spaces, since it matches both lexical rules and Space comes first in the grammar file. (You can see that token type 1 is being recognized; Spaces would be type 9 by my count.)
ANTLR uses the common "maximum munch" lexical strategy, in which the token recognized corresponds to the longest possible match, with ties between patterns matching the same longest prefix broken by their order in the file. When you put Spaces first in the file, it wins the tie rule. If you make it a parser rule instead of a lexical rule, then it gets applied after the unambiguous lexical rule for Space.
Do you really want to allow only up to three spaces? Otherwise, you could just ditch Space and define Spaces as " "+.
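To make the tie-breaking concrete, here is a toy maximal-munch tokenizer in Python; the rule set and names are made up for illustration and are not how ANTLR is actually implemented:

import re

# Toy lexer rules, in grammar-file order.
RULES = [
    ("Space",  re.compile(r" ")),
    ("Spaces", re.compile(r" (?:  ?)?")),   # one, two, or three spaces
    ("Word",   re.compile(r"\w+")),
]

def tokenize(text):
    pos = 0
    while pos < len(text):
        matches = [(name, rx.match(text, pos)) for name, rx in RULES]
        matches = [(name, m) for name, m in matches if m]
        if not matches:
            raise SyntaxError("no rule matches at position %d" % pos)
        # Longest match wins; on a tie, max() keeps the earliest rule.
        name, m = max(matches, key=lambda nm: nm[1].end())
        yield name, m.group()
        pos = m.end()

print(list(tokenize("ab cd")))   # single space: Space wins the length tie
print(list(tokenize("ab  cd")))  # two spaces: Spaces is the longer match

With Spaces listed before Space instead, a single space would be tokenized as Spaces, which is exactly the reordering effect described above.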
I'm using ANTLR4 for a class I'm taking right now and I seem to understand most of it, but I can't figure out what '+' does. All I can tell is that it's usually after a set of characters in brackets.
The plus is one of the BNF operators in ANTLR that let you specify the cardinality of an expression. There are three of them: plus, star (a.k.a. the Kleene operator), and question mark. Their meanings are easy to understand:
Question mark stands for: zero or one
Plus stands for: one or more
Star stands for: zero or more
Such an operator applies to the expression that directly precedes it, e.g. ab+ (one a and one or more b's), [AB]? (zero or one of either A or B) or a (b | c | d)* (a followed by zero or more occurrences of either b, c or d).
ANTLR4 also uses a special construct to denote non-greedy matches. The syntax is one of the BNF operators followed by a question mark (+?, *?, ??). This is useful when you have an introducer match, then any content, and then a closing token. Take for instance a string (quote, any char, quote). With a greedy match, ANTLR4 would match multiple strings as one (up to the final quote). A non-greedy match, however, only matches up to the first end token found (here the quote char).
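The string example is easy to reproduce with a quick regex check in Python; this only illustrates greediness in general, not ANTLR itself:

import re
text = '"first" and "second"'
print(re.findall(r'".*"', text))    # greedy: ['"first" and "second"']
print(re.findall(r'".*?"', text))   # non-greedy: ['"first"', '"second"']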
Side note: I don't know what ?? could be useful for, since it matches a single entry, hence greediness doesn't play a role there.
Actually, these operators are not part of traditional BNF, but rather of the Extended Backus-Naur Form. Operators like these are one reason it's easier (or even possible) to write certain grammars in EBNF than in old-school BNF, which lacks them.
Neither of the two commonly referenced lexer generators, cl-lex and lispbuilder-lexer, allows for state variables in the "action blocks", making it impossible to recognize a C-style multi-line comment, for example.
What is a lexer generator in Common Lisp that can recognize a C-style multi-line comment as a token?
Correction: this lexer actually needs to recognize nested, balanced multi-line comments (not exactly C-style), so I can't do away with state variables.
You can recognize a C-style multiline comment with the following regular expression:
[/][*][^*]*[*]+([^*/][^*]*[*]+)*[/]
It should work with any library which uses POSIX-compatible extended regex syntax; although it is a bit hard to read because * is used extensively both as an operator and as a literal character, it uses no non-regular features. It does rely on inverted character classes ([^*], for example) matching the newline character, but as far as I know that is pretty well universal, even in regex engines where a wildcard does not match newline.
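For instance, Python's re module accepts the pattern unchanged, so you can verify it quickly (the sample input is made up):

import re

comment = re.compile(r"[/][*][^*]*[*]+([^*/][^*]*[*]+)*[/]")
src = "int x; /* a * comment ** with stars */ int y;"
print(comment.search(src).group())   # /* a * comment ** with stars */

Note that this handles only non-nested comments; the nested, balanced comments from the correction above are beyond regular expressions and genuinely need a depth counter or state variable.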
I am trying to understand how to use EBNF to define a formal grammar, in particular a sequence of words separated by a space, something like
<non-terminal> [<word>[ <word>[ <word>[ ...]]] <non-terminal>
What is the correct way to define a word terminal?
What is the correct way to represent required whitespace?
How are optional, repetitive lists represented?
Are there any show-by-example tutorials on EBNF anywhere?
Many thanks in advance!
You have to decide whether your lexical analyzer is going to return a token (terminal) for the spaces. You also have to decide how it (the lexical analyzer) is going to define words, or whether your grammar is going to do that (in which case, what is the lexical analyzer going to return as terminals?).
For the rest, it is mostly a question of understanding the niceties of EBNF notation, which is an ISO standard (ISO 14977:1996 — and it is available as a free download from Freely Available Standards, which you can also get to from ISO), but it is a standard that is largely ignored in practice. (The languages I deal with — C, C++, SQL — use a BNF notation in the defining documents, but it is not EBNF in any of them.)
Whatever you want the correct definition of a word to be; it's your decision. You need to think about how you'd want to treat the name P. J. O'Neill, for example. What tokens will the lexical analyzer return for that?
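For example, with a naive, hypothetical terminal definition where a word is just a run of letters, a quick Python check shows how badly that name fragments:

import re
# word = letters only; everything else (except whitespace) is punctuation
print(re.findall(r"[A-Za-z]+|[^A-Za-z\s]", "P. J. O'Neill"))
# -> ['P', '.', 'J', '.', 'O', "'", 'Neill']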
This is closely related to the previous issue: what are the terminals that the lexical analyzer is going to return?
Optional repetitive lists are enclosed in { and } braces, or you can use the Kleene Star notation.
There is a paper Extended BNF — A generic base standard by R. S. Scowen that explains EBNF. There's also the Wikipedia entry on EBNF.
I think that a non-empty, space-separated word list might be defined (in ISO notation, which uses a comma for concatenation) as:
non_empty_word_list = word, { space, word } ;
where all the names there are non-terminals. You'd need to define those in terms of the relevant terminals of your system.
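As a sanity check, here is a direct Python transcription of that rule in recursive-descent style; the terminal definitions (word = a run of ASCII letters, space = a single blank) are placeholders for whatever your lexical analyzer actually returns:

import re

WORD = re.compile(r"[A-Za-z]+")   # hypothetical 'word' terminal

def parse_word_list(text, pos=0):
    words = []
    m = WORD.match(text, pos)              # the rule requires at least one word
    if not m:
        raise SyntaxError("expected a word at position %d" % pos)
    words.append(m.group())
    pos = m.end()
    while pos < len(text) and text[pos] == " ":   # { space, word }
        m = WORD.match(text, pos + 1)
        if not m:
            raise SyntaxError("expected a word at position %d" % (pos + 1))
        words.append(m.group())
        pos = m.end()
    return words

print(parse_word_list("hello world foo"))   # ['hello', 'world', 'foo']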