Classifying a metachar in a antlr4 grammar - parsing

I am writing up a document to describe a basic regular expression parser that I've written, and I'm a bit unclear on what exactly constitutes a "meta-character". For example, of the following three items:
$
\d
[a-z]
What is the definition of a 'meta-character' (not an escape sequence), and how would the above be classified properly?

Related

What exactly is a lexer's job?

I'm writing a lexer for Markdown. In the process, I realized that I do not fully understand what its core responsibility should be.
The most common definition of a lexer is that it translates an input stream of characters into an output stream of tokens.
Input → Output
(characters) (tokens)
That sounds quite simple at first, but the question that arises here is how much semantic interpretation the lexer should do before handing over its output of tokens to the parser.
Take this example of Markdown syntax:
### Headline
*This* is an emphasized word.
It might be translated by a lexer into the following series of tokens:
Lexer 1 Output
.headline("Headline")
.emphasis("This")
.text"(" is an emphasized word.")
But it might as well be translated on a more granular level, depending on the grammar (or the set of lexemes) used:
Lexer 2 Output
.controlSymbol("#")
.controlSymbol("#")
.controlSymbol("#")
.text(" Headline")
.controlSymbol("*")
.text("This")
.controlSymbol("*")
.text"(" is an emphasized word.")
It seems a lot more practical to have the lexer produce an output similar to that of Lexer 1, because the parser will then have an easier job. But it also means that the lexer needs to semantically understand what the code means. It's not merely mapping a sequence of characters to a token. It needs to look ahead and identify patterns. (For example, it needs to be able to be able to distinguish between **Hey* you* and **Hey** you. It cannot simply translate a double asterisk ** into .openingEmphasis, because that depends on the following context.)
According to this Stackoverflow post and the CommonMark definition, it seems to make sense to first break down the Markdown input into a number of blocks (representing one or more lines) and then analyze the contents of each block in a second step. With the example above, this would mean the following:
.headlineBlock("Headline")
.paragraphBlock("*This* is an emphasized word.")
But this wouldn't count as a valid sequence of tokens because some of the lexemes ("*") have not been parsed yet and it wouldn't be right to pass this paragraphBlock to the parser.
So here's my question:
Where do you draw the line?
How much semantic work should the lexer do? Is there some hard cut in the definition of a lexer that I am not aware of?
What would be the best way to define a grammar for the lexer?
BNF is used to describe many languages / create lexers and parsers
MOST use a Look right 1 to define a unambiguous format.
Recently I was looking at playing with SQL BNF
https://github.com/ronsavage/SQL/blob/master/sql-92.bnf
I made the decision that my lexer would return only terminal token strings. Similar to your option 1.
'('
')'
'KEWORDS'
'-- comment eol'
'12.34'
...
Any rule that defined the syntax tree would be left to the parser.
<Document> := <lines>
<Lines> := <line> [<Lines>]
<line> := ...

Grammar: Precedence of grammar alternatives

This is a very basic question about grammar alternatives. If you have the following alternative:
Myalternative: 'a' | .;
Myalternative2: 'a' | 'b';
Would 'a' have higher priority over the '.' and over 'b'?
I understand that this may also depend on the behaviour of the parser generated by this syntax but in pure theoretical grammar terms could you imagine these rules being matched in parallel i.e. test against 'a' and '.' at the same time and select the one with highest priority? Or is the 'a' and . ambiguous due to the lack of precedence in grammars?
The answer depends primarily on the tool you are using, and what the semantics of that tool is. As written, this is not a context-free grammar in canonical form, and you'd need to produce that to get a theoretical answer, because only in that way can you clarify the intended semantics.
Since the question is tagged antlr, I'm going to guess that this is part of an Antlr lexical definition in which . is a wildcard character. In that case, 'a' | . means exactly the same thing as ..
Since MyAlternative matches everything that MyAlternative2 matches, and since MyAlternative comes first in the Antlr lexical definition, MyAlternative2 can never match anything. Any single character will be matched by MyAlternative (unless there is some other lexical rule which matches a longer sequence of input characters).
If you put the definition of MyAlternative2 first in the grammar file, then a or b would be matched as MyAlternative2, while any other character would be matched as MyAlternative.
The question of precedence within alternatives is meaningless. It doesn't matter whether MyAlternative considers the match of an a to be a match of a or a match of .. It is, in the end, a match of MyAlternative, and that symbol can only have one associated action.
Between lexical rules, there is a precedence relationship: The first one wins. (More accurately, as far as I know, Antlr obeys the usual convention that the longest match wins; between two rules which both match the same longest sequence, the first one in the grammar file wins.) That is not in any way influenced by alternative bars in the rules themselves.

What type of parser is needed for this grammar?

I have a grammar that I do not know what type of parser I need in order to parse it other than I do not believe the grammar is LL(1). I am thinking I need a parser with backtracking or LL(*) of some sort. The grammar I have came up with (which may need some rewriting) is:
S: Rules
Rules: Rule | Rule Rules
Rule: id '=' Ids
Ids: id | Ids id
The language I am trying to generate looks something like this:
abc = def g hi jk lm
xy = aaa bbb ccc ddd eee fff jjj kkk
foo = bar ha ha
Zero or more Rule that contain a left identifier followed by an equal sign followed by one or more identifers. The part that I think I will have a problem writing a parser for is that the grammar allows any amount of id in a Rule and that the only way to tell when a new Rule starts is when it finds id =, which would require backtracking.
Does anyone know the classification of this grammar and the best method of parsing, for a hand written parser?
The grammar that generates an identifier followed by an equals sign followed by a finite sequence of identifiers is regular. This means that strings in the language can be parsed using a DFA or regular expression. No need for fancy nondeterministic or LL(*) parsers.
To see that the language is regular, let Id = U {a : a ∈ Γ}, where Γ ⊂ Σ is the set of symbols that can occur in identifiers. The language you are trying to generate is denoted by the regular expression
Id+ =( Id+)* Id+
Setting Γ = {a, b, ..., z}, examples of strings in the language of the regular expression are:
look = i am in a regular language
hey = that means i can be recognized by a dfa
cool = or even a regular expression
There is no need to parse your language using powerful parsing techniques. This is one case where parsing using regular expressions or DFA is both appropriate and optimal.
edit:
Call the above regular expression R. To parse R*, generate a DFA recognizing the language of R*. To do this, generate an NFA recognizing the language of R* using the algorithm obtainable from Kleene's theorem. Then convert the NFA into a DFA using the subset construction. The resultant DFA will recognize all strings in R*. Given a representation of the constructed DFA in your implementation language, the required actions - for instance,
Add the last identifier parsed to the right-hand side of the current declaration statement being parsed
Add the last declaration statement parsed to a list of parsed declarations, and use the last identifier parsed to begin parsing a new declaration statement
can be encoded into the states of the DFA. In reality, using Kleene's theorem and the subset construction is probably unnecessary for such a simple language. That is, you can probably just write a parser with the above two actions without implementing an automaton. Given a more complicated regular langauge (for instance, the lexical structure of a programming langauge), the conversion would be the best option.

Which parser generator would be useful for manipulating the productions themselves?

Similar to Generating n statements from context-free grammars, I want to randomly generate sentences from a grammar.
What is a good parser generator for manipulating the actual grammar productions themselves? I want the parser generator to actually give me access to the productions (production objects?).
If I had a grammar akin to:
start_symbol ::= foo
foo ::= bar | baz
What is a good parser generator for:
giving me the starting production symbol
allow me to choose one production from RHS of the start symbol ( foo in this case)
give me the production options for foo
Clearly every parser has internal representations for productions and methods of associating the production with its RHS, but which parser would be easy to manipulate these internals?
Note: the blog entry linked to from the other SO question I mentioned has some sort of custom CFG parser. I want to use an actual grammar for a real parser, not generate my own grammar parser.
It should be pretty easy to write a grammar, that matches the grammar that a parser generator accepts. (With an open source parser genrator, you ought to be able to fetch such a grammar from the parser generator source code; they all then to have self-grammars). With that, you can then parse any grammar the parser generator accepts.
If you want to manipulate the parsed grammar, you'll need an abstract syntax tree of same. You can make most parser generators build a tree, either by using built-in mechanisms or ad hoc code you add.

Does the recognition of numbers belong in the scanner or in the parser?

When you look at the EBNF description of a language, you often see a definition for integers and real numbers:
integer ::= digit digit* // Accepts numbers with a 0 prefix
real ::= integer "." integer (('e'|'E') integer)?
(Definitions were made on the fly, I have probably made a mistake in them).
Although they appear in the context-free grammar, numbers are often recognized in the lexical analysis phase. Are they included in the language definition to make it more complete and it is up to the implementer to realize that they should actually be in the scanner?
Many common parser generator tools -- such as ANTLR, Lex/YACC -- separate parsing into two phases: first, the input string is tokenized. Second, the tokens are combined into productions to create a concrete syntax tree.
However, there are alternative techniques that do not require tokenization: check out backtracking recursive-descent parsers. For such a parser, tokens are defined in a similar way to non-tokens. pyparsing is a parser generator for such parsers.
The advantage of the two-step technique is that it usually produces more efficient parsers -- with tokens, there's a lot less string manipulation, string searching, and backtracking.
According to "The Definitive ANTLR Reference" (Terence Parr),
The only difference between [lexers and parsers] is that the parser recognizes grammatical structure in a stream of tokens while the lexer recognizes structure in a stream of characters.
The grammar syntax needs to be complete to be precise, so of course it includes details as to the precise format of identifiers and the spelling of operators.
Yes, the compiler engineer decides but generally it is pretty obvious. You want the lexer to handle all the character-level detail efficiently.
There's a longer answer at Is it a Lexer's Job to Parse Numbers and Strings?

Resources