Good way to test that lexical tokens are accurate - parsing

I have started writing a grammar by defining all the Lexical tokens. Let me just give a made-up basic example:
// TestLexer.g4
// Be able to lex x+4+1.2
lexer grammar TestLexer;
VAR: [a-z]+;
PLUS: '+';
INT: [0-9]+;
DEC: [0-9]+ '.' [0-9]+;
Now, I want to make sure that all my lexical tokens are correct and accurate. Of course the 'end-product' is to create an actual valid parser, but in the meantime I want to make sure that all my tokens are OK (it is trivial in the example above, but in the actual grammar I have 100+ tokens, some quite complex).
Is there a suggested way to validate that all my tokens are OK? As an example, I thought maybe the most basic example would be doing:
program: TOKEN* EOF;
TOKEN: PLUS|INT|VAR|DEC;
It becomes a bit tedious to add in every single token (especially fixing them as they change). But is the above the best way to do it, or are there other, better ways to do this (perhaps even a mode in the TestRig itself to do this?).
What worked (from Mike's answer):
$ antlr4 TestLexer.g4
$ javac TestLexer.java
$ grun TestLexer tokens -tokens
x+1+1.2
[@0,0:0='x',<VAR>,1:0]
[@1,1:1='+',<'+'>,1:1]
[@2,2:2='1',<INT>,1:2]
[@3,3:3='+',<'+'>,1:3]
[@4,4:6='1.2',<DEC>,1:4]
[@5,7:6='<EOF>',<EOF>,1:7]

You can just use the command line tool with the -tokens option.
(This is the tool that the website recommends setting up as the grun alias.)
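If you'd rather not go through grun, you can also dump the tokens programmatically with the ANTLR Java runtime. This is a minimal sketch (not part of the original answer) assuming the TestLexer class generated above; it prints roughly the same information as -tokens:

import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.Token;

public class DumpTokens {
    public static void main(String[] args) {
        // Feed a sample string to the generated lexer and print every token.
        TestLexer lexer = new TestLexer(CharStreams.fromString("x+1+1.2"));
        for (Token t : lexer.getAllTokens()) {
            String name = lexer.getVocabulary().getSymbolicName(t.getType());
            System.out.printf("%-4s '%s' at %d:%d%n",
                    name, t.getText(), t.getLine(), t.getCharPositionInLine());
        }
    }
}

Either route works for eyeballing a large token set; the programmatic one is easier to turn into automated checks over many sample inputs.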

Related

Difference between Parser.tokens and Lexer.tokens?

Normally when I export or grun a grammar to a target language it gives me two .tokens files. For example in the following:
lexer grammar TestLexer;
NUM : [0-9]+;
OTHER : ABC;
fragment NEWER : [xyz]+;
ABC : [abc]+;
I get a token for each non-fragment, and I get two identical files:
# Parser.tokens
NUM=1
OTHER=2
ABC=3
# Lexer.tokens
NUM=1
OTHER=2
ABC=3
Are these files always the same? I tried defining a token in the parser but since I've defined it as parser grammar it doesn't allow that, so I would assume these two files would always be the same, correct?
Grammars are always processed as individual lexer and parser grammars. If a combined grammar is used, it is temporarily split into two grammars and processed individually. Each processing step produces a tokens file (the list of found lexer tokens). The tokens file is the link between lexers and parsers. When you set a tokenVocab value, it is actually the tokens file that is used. That also means you don't need a lexer grammar if you have a tokens file.
I'm not sure about the parser.tokens file. It might be useful for grammar imports.
And then you can specify a tokenVocab for lexer grammars too, which allows you to explicitly assign numeric values to tokens. That can come in handy if you have to check for token ranges (e.g. all keywords) in platform code. I cannot check this currently, but it might be that using this feature leads to tokens files with different content.
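To illustrate the tokenVocab link with the grammars above (a sketch, not from the answer): a separate parser grammar can pull in the lexer's tokens file like this, and it will then use exactly the numbers recorded in TestLexer.tokens:

// TestParser.g4 (hypothetical companion to the TestLexer above)
parser grammar TestParser;
options { tokenVocab = TestLexer; }   // token types are read from TestLexer.tokens

list: (NUM | OTHER | ABC)* EOF;       // lexer token names resolve to the numbers in that file

Without the tokenVocab option, the parser grammar would typically assign its own token numbers, which is exactly the situation where the two .tokens files can drift apart.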

What exactly is a lexer's job?

I'm writing a lexer for Markdown. In the process, I realized that I do not fully understand what its core responsibility should be.
The most common definition of a lexer is that it translates an input stream of characters into an output stream of tokens.
Input → Output
(characters) (tokens)
That sounds quite simple at first, but the question that arises here is how much semantic interpretation the lexer should do before handing over its output of tokens to the parser.
Take this example of Markdown syntax:
### Headline
*This* is an emphasized word.
It might be translated by a lexer into the following series of tokens:
Lexer 1 Output
.headline("Headline")
.emphasis("This")
.text"(" is an emphasized word.")
But it might as well be translated on a more granular level, depending on the grammar (or the set of lexemes) used:
Lexer 2 Output
.controlSymbol("#")
.controlSymbol("#")
.controlSymbol("#")
.text(" Headline")
.controlSymbol("*")
.text("This")
.controlSymbol("*")
.text"(" is an emphasized word.")
It seems a lot more practical to have the lexer produce an output similar to that of Lexer 1, because the parser will then have an easier job. But it also means that the lexer needs to semantically understand what the code means. It's not merely mapping a sequence of characters to a token. It needs to look ahead and identify patterns. (For example, it needs to be able to distinguish between **Hey* you* and **Hey** you. It cannot simply translate a double asterisk ** into .openingEmphasis, because that depends on the following context.)
According to this Stackoverflow post and the CommonMark definition, it seems to make sense to first break down the Markdown input into a number of blocks (representing one or more lines) and then analyze the contents of each block in a second step. With the example above, this would mean the following:
.headlineBlock("Headline")
.paragraphBlock("*This* is an emphasized word.")
But this wouldn't count as a valid sequence of tokens because some of the lexemes ("*") have not been parsed yet and it wouldn't be right to pass this paragraphBlock to the parser.
So here's my question:
Where do you draw the line?
How much semantic work should the lexer do? Is there some hard cut in the definition of a lexer that I am not aware of?
What would be the best way to define a grammar for the lexer?
BNF is used to describe many languages and to create lexers and parsers.
Most use one token of lookahead ("look right 1") to keep the format unambiguous.
Recently I was looking at playing with SQL BNF
https://github.com/ronsavage/SQL/blob/master/sql-92.bnf
I made the decision that my lexer would return only terminal token strings. Similar to your option 1.
'('
')'
'KEYWORDS'
'-- comment eol'
'12.34'
...
Any rule that defined the syntax tree would be left to the parser.
<Document> := <Lines>
<Lines> := <Line> [<Lines>]
<Line> := ...
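To make the terminal-token approach concrete, here is a small Java sketch (my own illustration with made-up token names, not part of the answer) that emits only raw control symbols and text runs for the Markdown example above, leaving all headline/emphasis structure to the parser:

import java.util.ArrayList;
import java.util.List;

// A terminal-only tokenizer: the lexer emits raw control symbols and text runs
// and makes no structural decisions at all.
public class MarkdownScanner {
    enum Kind { HASH, STAR, NEWLINE, TEXT }
    record Tok(Kind kind, String text) {}

    static List<Tok> scan(String input) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '#')       { out.add(new Tok(Kind.HASH, "#"));     i++; }
            else if (c == '*')  { out.add(new Tok(Kind.STAR, "*"));     i++; }
            else if (c == '\n') { out.add(new Tok(Kind.NEWLINE, "\n")); i++; }
            else {
                // Everything up to the next control symbol or newline is one TEXT token.
                int start = i;
                while (i < input.length() && "#*\n".indexOf(input.charAt(i)) < 0) i++;
                out.add(new Tok(Kind.TEXT, input.substring(start, i)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        scan("### Headline\n*This* is an emphasized word.\n")
                .forEach(t -> System.out.println(t.kind() + " " + t.text().replace("\n", "\\n")));
    }
}

Deciding whether *...* is emphasis, or whether ### starts a headline, is then entirely the parser's problem, which keeps the lexer free of look-ahead and context.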

Tokenising whitespace in a half-whitespace sensitive language?

I have made some searches, including taking a second look through the red Dragon Book in front of me, but I haven't found a clear answer to this. Most people are talking about whitespace-sensitivity in terms of indentation, but that's not my case.
I want to implement a transpiler for a simple language. This language has a concept of a "command", which is a reserved keyword followed by some arguments. To give you an idea of what I'm talking about, a sequence of commands may look something like this:
print "hello, world!";
set running 1;
while running #
read progname;
launch progname;
print "continue? 1 = yes, 0 = no";
readint running;
#
Informally, you can view the grammar as something along the lines of
<program> ::= <statement> <program>
<statement> ::= while <expression> <sequence>
| <command> ;
<sequence> ::= # <program> #
| <statement>
<command> ::= print <expression>
| set <variable> <expression>
| read <variable>
| readint <variable>
| launch <expression>
<expression> ::= <variable>
| <string>
| <int>
For simplicity, we can define the following:
<string> is an arbitrary sequence of characters surrounded by quotes
<int> is a sequence of characters '0'..'9'
<variable> is a sequence of characters 'a'..'z'
Now this would ordinarily not be any problem. In fact, given just this specification I have a working implementation, where the lexer silently eats all whitespace. However, here's the catch:
Arguments to commands must be separated by whitespace!
In other words, it should be illegal to write
while running#print"hello";#
even though this clearly isn't ambiguous as far as the grammar is concerned. I have had two ideas on how to solve this.
Output a token whenever some whitespace is consumed, and include whitespace in the grammar. I suspect this will make the grammar a lot more complicated.
Rewrite the grammar so instead of "hard-coding" the arguments of each command, I have a production rule for "arguments" taking care of whitespace. It may look something like
<command> ::= <cmdtype> <arguments>
<arguments> ::= <argument> <arguments>
<argument> ::= <expression>
<cmdtype> ::= print | set | read | readint | launch
Then we can make sure the lexer somehow (?) takes care of leading whitespace whenever it encounters an <argument> token. However, this moves the complexity of dealing with the arity (among other things?) of built-in commands into the parser.
How is this normally solved? When the grammar of a language requires whitespace in particular places but leaves it optional almost everywhere else, does it make sense to deal with it in the lexer or in the parser?
I wish I could fudge the specification of the language just a teeny tiny bit because that would make it much simpler to implement, but unfortunately this is a backward-compatibility issue and not possible.
Backwards compatibility is usually taken to apply only to correct programs; accepting a program which previously would have been rejected as a syntax error cannot alter the behaviour of any valid program and thus does not violate backwards compatibility.
That might not be relevant in this case, but since it would, as you note, simplify the problem considerably, it seemed worth mentioning.
One solution is to pass whitespace on to the parser, and then incorporate it into the grammar; normally, you would define a terminal, WS, and from that a non-terminal for optional whitespace:
<ows> ::= WS |
If you are careful to ensure that only one of the terminal and the non-terminal is valid in any context, this does not affect parsability, and the resulting grammar, while a bit cluttered, is still readable. The advantage is that it makes the whitespace rules explicit.
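For instance, the question's command rule might be rewritten along these lines, with WS required between the command keyword and each argument (a sketch of the idea, not a complete grammar):

<command> ::= print WS <expression>
            | set WS <variable> WS <expression>
            | read WS <variable>
            | readint WS <variable>
            | launch WS <expression>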
Another option is to handle the issue in the lexer; that might be simple but it depends on the precise nature of the language.
From your description, it appears that the goal is to produce a syntax error if two tokens are not separated by whitespace, unless one of the tokens is "self-delimiting"; in the example shown, I believe the only such token is the semicolon, since you seem to indicate that # must be whitespace-delimited. (It could be that your complete language has more self-delimiting tokens, but that does not substantially alter the problem.)
That can be handled with a single start condition in the lexer (assuming you are using a lexer generator which allows explicit states); reading whitespace puts you in a state in which any token is valid (which is the initial state, INITIAL if you are using a lex-derivative). In the other state, only self-delimiting tokens are valid. The state after reading a token will be the restricted state unless the token is self-delimiting.
This requires pretty well every lexer action to include a state transition action, but leaves the grammar unaltered. The effect is to move the clutter from the parser to the scanner, at the cost of obscuring the whitespace rules. But it might be less clutter and it will certainly simplify a future transition to a whitespace-agnostic dialect, if that is in your plans.
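To illustrate the two-state idea outside of a lex state machine, here is a hand-rolled Java sketch (my own illustration; the token patterns and the choice of ';' as the only self-delimiting token are taken from the discussion above, the rest is invented):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// After any token that is not self-delimiting, the scanner refuses to start
// another such token until it has seen whitespace -- the same effect as the
// restricted start condition described above.
public class SpacedScanner {
    private static final List<Pattern> TOKENS = List.of(
            Pattern.compile("\"[^\"]*\""),   // string
            Pattern.compile("[0-9]+"),       // int
            Pattern.compile("[a-z]+"),       // keyword or variable
            Pattern.compile("#"));           // block marker (must be space-delimited)

    public static List<String> scan(String src) {
        List<String> out = new ArrayList<>();
        boolean needSeparator = false;       // the "restricted" state
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) { needSeparator = false; i++; continue; }
            if (c == ';') { out.add(";"); needSeparator = false; i++; continue; }  // self-delimiting
            if (needSeparator) throw new RuntimeException("missing whitespace before offset " + i);
            Matcher m = match(src, i);
            out.add(m.group());
            i = m.end();
            needSeparator = true;            // next ordinary token must be preceded by whitespace
        }
        return out;
    }

    private static Matcher match(String src, int pos) {
        for (Pattern p : TOKENS) {
            Matcher m = p.matcher(src).region(pos, src.length());
            if (m.lookingAt()) return m;
        }
        throw new RuntimeException("unexpected character at offset " + pos);
    }

    public static void main(String[] args) {
        System.out.println(scan("print \"hello\";"));   // [print, "hello", ;]
        try { scan("print\"hello\";"); }                 // rejected: no separator
        catch (RuntimeException e) { System.out.println(e.getMessage()); }
    }
}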
There is a different scenario, which is a posix-like shell in which identifiers (called "words" in the shell grammar) are not limited to alphabetic characters, but might include any non-self-delimiting character. In a posix shell, print"hello, world" is a single word, distinct from the two token sequence print "hello, world". (The first one will eventually be dequoted into the single token printhello, world.)
That scenario can really only be handled lexically, although it is not necessarily complicated. It might be a guide to your problem as well; for example, you could add a lexical rule which accepts any string of characters other than whitespace and self-delimiting characters; the maximal munch rule will ensure that action is only taken if the token cannot be recognised as an identifier or a string (or other valid tokens), so you can just throw an error in the action.
That is even simpler than the state-based lexer, but it is somewhat less flexible.

basic questions in BISON for parsing

I am new to bison and I have some basic questions which I hope you can help me with:
Which one is right from the following:
%left ’*’ ’/’
or
%left '*' '/'
That is, instead of getting the token from the lexer, do I use this to define it in the parser file?
Can I define a rule like this:
EXP -> EXP "and" EXP
instead of
EXP -> EXP AND EXP //AND here is a token
If I have LEX and BISON files for building a parser, which one should include the other? And if I have used a common header file, in which one of them should that file be included?
If the BISON algorithm finds a match according to one of the rules, what happens first: does it reduce and then perform the action defined for the matched rule, or does it perform the action first and then reduce the stack?
It's tough to tell what you are asking due to your formatting, but I think the answer is no. %left just defines a token (exactly like %token does) and in addition sets a precedence level for that token. You still have to "get" the token by recognizing it in your lexer and returning the appropriate token value.
While you can use "and", you don't want to because it's almost impossible to get right. It's much better to use AND or and (no quotes). The difference is that using quotes creates a token that is not output as a #define in the .tab.h file, so there's no easy way to generate that token in your lexer.
There are a number of ways to do it. The easiest is to have neither include the other and have the lex file include the header generated by bison's -d flag -- this is what most examples do. It is also possible to directly include the lex.yy.c file in the 3rd section of the .y file OR include the .tab.c in the top section of the .l file (but not both!), in which case you'll only compile the one file.
It executes the action for the rule first (so the values for the items on the RHS are available while the action is executing), and then does the stack reduction, replacing the RHS items with the value the action put into $$.
I somewhat disagree with Chris on point 2. It's better to use "and" because then in the error messages the parser will report things about "and" rather than about TOK_AND or t_AND, which certainly do not make sense to the user.
And it's not that hard to get it right: provided you inserted
%token TOK_AND "and"
somewhere, you can use either "and" or TOK_AND in the grammar file. But, IMHO, the former is much clearer.
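A small sketch tying this together with the earlier point about when actions run (the lexer, yylex(), and error handling are omitted; NUMBER is just a stand-in token): once the %token alias above is in place, "and" can be used directly in rules, and the action sees $1 and $3 before the reduction replaces them with $$.

%token NUMBER
%token TOK_AND "and"
%left TOK_AND
%%
expr : expr "and" expr   { $$ = $1 && $3; }   /* the action runs first; the reduction
                                                 then pops the three RHS values and
                                                 pushes $$ in their place */
     | NUMBER
     ;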

Can a CSP (Communicating Sequential Processes) parser be generated using ANTLR?

Can I write a parser for Communicating Sequential Processes (CSP) in ANTLR? I think it uses left recursion, as in the statement
VMS = (coin → (choc → VMS))
The complete language specification can be found in CSPM: A Reference Manual.
So it is not an LL grammar. Am I right?
In general, even if you have a grammar with left recursion, you can refactor the grammar to remove it. So ANTLR is reasonably likely to be able to process your grammar. There's no a priori reason you can't write a CSP grammar for ANTLR.
Whether the one you have is suitable is another question.
If your quoted phrase is a grammar rule, it doesn't have left recursion. (If it is, I don't understand the syntax of your grammar rules, especially why the parentheses [terminals?] would be unbalanced; that's pretty untraditional.)
So ANTLR should be able to process it, modulo converting it to ANTLR grammar rule syntax.
You didn't show the rest of the grammar, so one can't have an opinion about the rest of it.
The case above does not have left recursion. It would look something like the following. Note this is a simplified version; CSP is much more complicated. I'm just showing that it is possible.
assignment : PROCNAME EQ process
;
process : LPAREN EVENT ARROW process RPAREN
| PROCNAME
;
Besides, you can factor out left recursion with ANTLRWorks' 'Remove Left Recursion' function.
CSPs are definitely possible in ANTLR; check http://www.carnap.info/ for an example.
