Parser grammar rule is being ignored - parsing

The Goal
The goal is interpret plain text content and recognise patterns e.g. Arithmetic, Comments, Units of Measurements.
Example Input
This would be entered by a user.
# This is an example comment
10 + 10
// Another comment
This is one line of text
Tested
Expected Parse Tree
The goal of my grammar is to generate a tree that would look like this (if anyone has a better method I'd be interested to hear).
Note: The 10 + 10 is being recognised as an arithmetic rule.
Current Parse Tree aka The Problem
Below is the current output from the lexer and parser.
Note: The 10 + 10 is being recognised as an text and not the arithmetic rule.
Grammar Definition
The logic of the grammar at a high levels is as follows:
Parse line by line
Determine the line content if not fall back to text
grammar ContentParser;
/*
* Tokens
*/
NUMBER: '-'? [0-9]+;
LPARAN: '(';
RPARAN: ')';
POW: '^';
MUL: '*';
DIV: '/';
ADD: '+';
SUB: '-';
LINE_COMMENT: '#' TEXT | '//' TEXT;
TEXT: ~[\n\r]+;
EOL: '\r'? '\n';
/*
* Rules
*/
start: file;
file: line+ EOF;
line: content EOL;
content
: comment
| arithmetic
| text
;
// Custom Content Types
comment: LINE_COMMENT;
/// Example taken from ANTLR Docs
arithmetic:
NUMBER # Number
| LPARAN inner = arithmetic RPARAN # Parentheses
| left = arithmetic operator = POW right = arithmetic # Power
| left = arithmetic operator = (MUL | DIV) right = arithmetic # MultiplicationOrDivision
| left = arithmetic operator = (ADD | SUB) right = arithmetic # AdditionOrSubtraction;
text: TEXT;
My Understanding
The content rule should check for a match of the comment rule then followed by the arithmetic rule and finally falling back to the text rule which matches any character apart from newlines.
However, I believe that the lexer is being greedy on the TEXT tokens which is causing issues but I'm not sure.
(I'm still learning ANTLR)

When you are writing a parser, it's always a good idea to print out the tokens for the input.
In the current grammar, 10 + 10 is recognized by the lexer as TEXT, which is not what is needed. The reason it is text is because that is the longest string matched by a rule. It does not matter in this case that the TEXT rule occurs after the NUMBER rule in the grammar. The rule is that Antlr lexers will always match the longest string possible of the given lexer rules. But, if it can match two or more lexer rules where the strings are of equal length, then the first rule "wins". The lexer works pretty much independently of the parser.
There is no way to reliably have spaces in a text string, and not have them in arithmetic. The fix is to push spaces and tabs into an "off-channel" stream, then reconstruct the text by looking at the start and end character indices of the first and last tokens for the text tree node. The tree is a little messier, but it does what you need.
Also, you should just name the grammar as "Context" not "ContextParser" because you end up with "ContextParserParser.java" and "ContextParserLexer.java" when you generate the parser--rather confusing. I also took liberty to remove labeling an variables (I don't used them because I work with XPath expressions on the tree). And, I reordered and reformatted the grammar to be single line, sort alphabetically in order to find rules quicker in a text editor rather than require an IDE to navigate around.
A grammar that does all this is:
grammar Content;
arithmetic: NUMBER | LPARAN arithmetic RPARAN | arithmetic POW arithmetic | arithmetic (MUL | DIV) arithmetic | arithmetic (ADD | SUB) arithmetic ;
comment: LINE_COMMENT;
content : comment | arithmetic | text ;
file: line+ EOF;
line: content? EOL;
start: file;
text: TEXT+;
ADD: '+';
DIV: '/';
LINE_COMMENT: '#' STUFF | '//' STUFF;
LPARAN: '(';
MUL: '*';
NUMBER: '-'? [0-9]+;
POW: '^';
RPARAN: ')';
SUB: '-';
fragment STUFF : ~[\n\r]* ;
EOL: '\r'? '\n';
WS : [ \t]+ -> channel(HIDDEN);
TEXT: .; // Must be last lexer rule, and only one char in length.

Related

Use character as operator between numbers, otherwise treat it as a token ANTLR4

I'm making a language in ANTLR, where a sequence of digits is a number. A sequence of digits, letters and an underscore is, however, an identifier. So, for example:
These are numbers: 234, 0243, 0, 11
These are identifiers: foo, 2foo, foo2, 2y8
However, I also have operators, like multiplication, addition, division... They all work fine, except for one operator, which is the scientific operator e (or E). Unlike in most other languages, where the e for a scientific number is considered part of the number itself (like 2e3), in my language the e is considered to be an operator itself. So for example, (2+5)e4 is valid.
However, this brings an issue: since e is a letter, unless I separe my scientific number into spaces, ANTLR recognizes 2e3 as an identifier instead of the operation 2-e-3.
I'd like the language to always treat e as an operator, and not as part of an identifier, if what's on both sides of the e is something other than a letter or an underscore, or if it's alone. So, for example:
The following are treated as operations: 2e3, 2 e 3, 2E3, 2e 3, 2 e3, 23.4e78.9, 5e($my_var), .0e4, -7e3, 7e+9, 0e-4, (2)e(3), 2e(hello)
The following are treated as identifiers: hello, 2ey9, w3e6, 6ee7, 7ep, e9, e.
I have the following minimal reproduction example:
grammar ScientificExample;
file: expr* EOF;
expr
: UNARY expr
| L_PAREN expr R_PAREN
| expr SCIENTIFIC expr
| atom;
atom: NUMBER | IDENTIFIER;
WS : [ \r\t\n]+ -> skip ;
SCIENTIFIC: [eE];
NUMBER: ( [0-9]* '.' )? [0-9]+;
L_PAREN: '(';
R_PAREN: ')';
IDENTIFIER: [a-zA-Z0-9_]+;
UNARY: [+-];
As I understand it, ANTLR gives priority to whichever parsing rule produces the largest possible result, which might be why it's not giving priority to scientific being first. Any ideas how I might engineer this so priority is given to valid scientific expressions, as long as both members are an expression that isn't a simple IDENTIFIER atom, or the lack thereof?
I've thought of over-engineering the IDENTIFIER lexing rule, however since ANTLR doesn't really have lookahead and lookbehind expressions, I'm not completely sure how I'd achieve that.

parsing maxscript - problem with newlines

I am trying to create parser for MAXScript language using their official grammar description of the language. I use flex and bison to create the lexer and parser.
However, I have run into following problem. In traditional languages (e.g. C) statements are separated by a special token (; in C). But in MAXScript expressions inside a compound expression can be separated either by ; or newline. There are other languages that use whitespace characters in their parsers, like Python. But Python is much more strict about the placement of the newline, and following program in Python is invalid:
# compile error
def
foo(x):
print(x)
# compile error
def bar
(x):
foo(x)
However in MAXScript following program is valid:
fn
foo x =
( // parenthesis start the compound expression
a = 3 + 2; // the semicolon is optional
print x
)
fn bar
x =
foo x
And you can even write things like this:
for
x
in
#(1,2,3,4)
do
format "%," x
Which will evaluate fine and print 1,2,3,4, to the output. So newlines can be inserted into many places with no special meaning.
However if you insert one more newline in the program like this:
for
x
in
#(1,2,3,4)
do
format "%,"
x
You will get a runtime error as format function expects to have more than one parameter passed.
Here is part of the bison input file that I have:
expr:
simple_expr
| if_expr
| while_loop
| do_loop
| for_loop
| expr_seq
expr_seq:
"(" expr_semicolon_list ")"
expr_semicolon_list:
expr
| expr TK_SEMICOLON expr_semicolon_list
| expr TK_EOL expr_semicolon_list
if_expr:
"if" expr "then" expr "else" expr
| "if" expr "then" expr
| "if" expr "do" expr
// etc.
This will parse only programs which use newline only as expression separator and will not expect newlines to be scattered in other places in the program.
My question is: Is there some way to tell bison to treat a token as an optional token? For bison it would mean this:
If you find newline token and you can shift with it or reduce, then do so.
Otherwise just discard the newline token and continue parsing.
Because if there is no way to do this, the only other solution I can think of is modifying the bison grammar file so that it expects those newlines everywhere. And bump the precedence of the rule where newline acts as an expression separator. Like this:
%precedence EXPR_SEPARATOR // high precedence
%%
// w = sequence of whitespace tokens
w: %empty // either nothing
| TK_EOL w // or newline followed by other whitespace tokens
expr:
w simple_expr w
| w if_expr w
| w while_loop w
| w do_loop w
| w for_loop w
| w expr_seq w
expr_seq:
w "(" w expr_semicolon_list w ")" w
expr_semicolon_list:
expr
| expr w TK_SEMICOLON w expr_semicolon_list
| expr TK_EOL w expr_semicolon_list %prec EXPR_SEPARATOR
if_expr:
w "if" w expr w "then" w expr w "else" w expr w
| w "if" w expr w "then" w expr w
| w "if" w expr w "do" w expr w
// etc.
However this looks very ugly and error-prone, and I would like to avoid such solution if possible.
My question is: Is there some way to tell bison to treat a token as an optional token?
No, there isn't. (See below for a longer explanation with diagrams.)
Still, the workaround is not quite as ugly as you think, although it's not without its problems.
In order to simplify things, I'm going to assume that the lexer can be convinced to produce only a single '\n' token regardless of how many consecutive newlines appear in the program text, including the case where there are comments scattered among the blank lines. That could be achieved with a complex regular expression, but a simpler way to do it is to use a start condition to suppress \n tokens until a regular token is encountered. The lexer's initial start condition should be the one which suppresses newline tokens, so that blank lines at the beginning of the program text won't confuse anything.
Now, the key insight is that we don't have to insert "maybe a newline" markers all over the grammar, since every newline must appear right after some real token. And that means that we can just add one non-terminal for every terminal:
tok_id: ID | ID '\n'
tok_if: "if" | "if" '\n'
tok_then: "then" | "then" '\n'
tok_else: "else" | "else" '\n'
tok_do: "do" | "do" '\n'
tok_semi: ';' | ';' '\n'
tok_dot: '.' | '.' '\n'
tok_plus: '+' | '+' '\n'
tok_dash: '-' | '-' '\n'
tok_star: '*' | '*' '\n'
tok_slash: '/' | '/' '\n'
tok_caret: '^' | '^' '\n'
tok_open: '(' | '(' '\n'
tok_close: ')' | ')' '\n'
tok_openb: '[' | '[' '\n'
tok_closeb: ']' | ']' '\n'
/* Etc. */
Now, it's just a question of replacing the use of a terminal with the corresponding non-terminal defined above. (No w non-terminal is required.) Once we do that, bison will report a number of shift-reduce conflicts in the non-terminal definitions just added; any terminal which can appear at the end of an expression will instigate a conflict, since the newline could be absorbed either by the terminal's non-terminal wrapper or by the expr_semicolon_list production. We want the newline to be part of expr_semicolon_list, so we need to add precedence declarations starting with newline, so that it is lower precedence than any other token.
That will most likely work for your grammar, but it is not 100% certain. The problem with precedence-based solutions is that they can have the effect of hiding real shift-reduce conflict issues. So I'd recommend running bison on the grammar and verifying that all the shift-reduce conflicts appear where expected (in the wrapper productions) before adding the precedence declarations.
Why token fallback is not as simple as it looks
In theory, it would be possible to implement a feature similar to the one you suggest. [Note 1]
But it's non-trivial, because of the way the LALR parser construction algorithm combines states. The result is that the parser might not "know" that the lookahead token cannot be shifted until it has done one or more reductions. So by the time it figures out that the lookahead token is not valid, it has already performed reductions which would have to be undone in order to continue the parse without the lookahead token.
Most parser generators compound the problem by removing error actions corresponding to a lookahead token if the default action in the state for that token is a reduction. The effect is again to delay detection of the error until after one or more futile reductions, but it has the benefit of significantly reducing the size of the transition table (since default entries don't need to be stored explicitly). Since the delayed error will be detected before any more input is consumed, the delay is generally considered acceptable. (Bison has an option to prevent this optimisation, however.)
As a practical illustration, here's a very simple expression grammar with only two operators:
prog: expr '\n' | prog expr '\n'
expr: prod | expr '+' prod
prod: term | prod '*' term
term: ID | '(' expr ')'
That leads to this state diagram [Note 2]:
Let's suppose that we wanted to ignore newlines pythonically, allowing the input
(
a + b
)
That means that the parser must ignore the newline after the b, since the input might be
(
a + b
* c
)
(Which is fine in Python but not, if I understand correctly, in MAXScript.)
Of course, the newline would be recognised as a statement separator if the input were not parenthesized:
a + b
Looking at the state diagram, we can see that the parser will end up in State 15 after the b is read, whether or not the expression is parenthesized. In that state, a newline is marked as a valid lookahead for the reduction action, so the reduction action will be performed, presumably creating an AST node for the sum. Only after this reduction will the parser notice that there is no action for the newline. If it now discards the newline character, it's too late; there is now no way to reduce b * c in order to make it an operand of the sum.
Bison does allow you to request a Canonical LR parser, which does not combine states. As a result, the state machine is much, much bigger; so much so that Canonical-LR is still considered impractical for non-toy grammars. In the simple two-operator expression grammar above, asking for a Canonical LR parser only increases the state count from 16 to 26, as shown here:
In the Canonical LR parser, there are two different states for the reduction term: term '+' prod. State 16 applies at the top-level, and thus the lookahead includes newline but not ) Inside parentheses the parser will instead reach state 26, where ) is a valid lookahead but newline is not. So, at least in some grammars, using a Canonical LR parser could make the prediction more precise. But features which are dependent on the use of a mammoth parsing automaton are not particularly practical.
One alternative would be for the parser to react to the newline by first simulating the reduction actions to see if a shift would eventually succeed. If you request Lookahead Correction (%define parse.lac full), bison will insert code to do precisely this. This code can create significant overhead, but many people request it anyway because it makes verbose error messages more accurate. So it would certainly be possible to repurpose this code to do token fallback handling, but no-one has actually done so, as far as I know.
Notes:
A similar question which comes up from time to time is whether you can tell bison to cause a token to be reclassified to a fallback token if there is no possibility to shift the token. (That would be useful for parsing languages like SQL which have a lot of non-reserved keywords.)
I generated the state graphs using Bison's -g option:
bison -o ex.tab.c --report=all -g ex.y
dot -Tpng -oex.png ex.dot
To produce the Canonical LR, I defined lr.type to be canonical-lr:
bison -o ex_canon.c --report=all -g -Dlr.type=canonical-lr ex.y
dot -Tpng -oex_canon.png ex_canon.dot

Flex and Bison - Grammar that sometimes care about spaces

Currently I'm trying to implement a grammar which is very similar to ruby. To keep it simple, the lexer currently ignores space characters.
However, in some cases the space letter makes big difference:
def some_callback(arg=0)
arg * 100
end
some_callback (1 + 1) + 1 # 300
some_callback(1 + 1) + 1 # 201
some_callback +1 # 100
some_callback+1 # 1
some_callback + 1 # 1
So currently all whitespaces are being ignored by the lexer:
{WHITESPACE} { ; }
And the language says for example something like:
UnaryExpression:
PostfixExpression
| T_PLUS UnaryExpression
| T_MINUS UnaryExpression
;
One way I can think of to solve this problem would be to explicitly add whitespaces to the whole grammar, but doing so the whole grammar would increase a lot in complexity:
// OLD:
AdditiveExpression:
MultiplicativeExpression
| AdditiveExpression T_ADD MultiplicativeExpression
| AdditiveExpression T_SUB MultiplicativeExpression
;
// NEW:
_:
/* empty */
| WHITESPACE _;
AdditiveExpression:
MultiplicativeExpression
| AdditiveExpression _ T_ADD _ MultiplicativeExpression
| AdditiveExpression _ T_SUB _ MultiplicativeExpression
;
//...
UnaryExpression:
PostfixExpression
| T_PLUS UnaryExpression
| T_MINUS UnaryExpression
;
So I liked to ask whether there is any best practice on how to solve this grammar.
Thank you in advance!
Without having a full specification of the syntax you are trying to parse, it's not easy to give a precise answer. In the following, I'm assuming that those are the only two places where the presence (or absence) of whitespace between two tokens affects the parse.
Differentiating between f(...) and f (...) occurs in a surprising number of languages. One common strategy is for the lexer to recognize an identifier which is immediately followed by an open parenthesis as a "FUNCTION_CALL" token.
You'll find that in most awk implementations, for example; in awk, the ambiguity between a function call and concatenation is resolved by requiring that the open parenthesis in a function call immediately follow the identifier. Similarly, the C pre-processor macro definition directive distinguishes between #define foo(A) A (the definition of a macro with arguments) and #define foo (A) (an ordinary macro whose expansion starts with a ( token.
If you're doing this with (f)lex, you can use the / trailing-context operator:
[[:alpha:]_][[:alnum:]_]*/'(' { yylval = strdup(yytext); return FUNC_CALL; }
[[:alpha:]_][[:alnum:]_]* { yylval = strdup(yytext); return IDENT; }
The grammar is now pretty straight-forward:
call: FUNC_CALL '(' expression_list ')' /* foo(1, 2) */
| IDENT expression_list /* foo (1, 2) */
| IDENT /* foo * 3 */
This distinction will not be useful in all syntactic contexts, so it will often prove useful to add a non-terminal which will match either identifier form:
name: IDENT | FUNC_CALL
But you will need to be careful with this non-terminal. In particular, using it as part of the expression grammar could lead to parser conflicts. But in other contexts, it will be fine:
func_defn: "def" name '(' parameters ')' block "end"
(I'm aware that this is not the precise syntax for Ruby function definitions. It's just for illustrative purposes.)
More troubling is the other ambiguity, in which it appears that the unary operators + and - should be treated as part of an integer literal in certain circumstances. The behaviour of the Ruby parser suggests that the lexer is combining the sign character with an immediately following number in the case where it might be the first argument to a function. (That is, in the context <identifier><whitespace><sign><digits> where <identifier> is not an already declared local variable.)
That sort of contextual rule could certainly be added to the lexical scanner using start conditions, although it's more than a little ugly. A not-fully-fleshed out implementation, building on the previous:
%x SIGNED_NUMBERS
%%
[[:alpha:]_][[:alnum:]_]*/'(' { yylval.id = strdup(yytext);
return FUNC_CALL; }
[[:alpha:]_][[:alnum:]_]*/[[:blank:]] { yylval.id = strdup(yytext);
if (!is_local(yylval.id))
BEGIN(SIGNED_NUMBERS);
return IDENT; }
[[:alpha:]_][[:alnum:]_]*/ { yylval.id = strdup(yytext);
return IDENT; }
<SIGNED_NUMBERS>[[:blank:]]+ ;
/* Numeric patterns, one version for each context */
<SIGNED_NUMBERS>[+-]?[[:digit:]]+ { yylval.integer = strtol(yytext, NULL, 0);
BEGIN(INITIAL);
return INTEGER; }
[[:digit:]]+ { yylval.integer = strtol(yytext, NULL, 0);
return INTEGER; }
/* ... */
/* If the next character is not a digit or a sign, rescan in INITIAL state */
<SIGNED_NUMBERS>.|\n { yyless(0); BEGIN(INITIAL); }
Another possible solution would be for the lexer to distinguish sign characters which follow a space and are directly followed by a digit, and then let the parser try to figure out whether or not the sign should be combined with the following number. However, this will still depend on being able to distinguish between local variables and other identifiers, which will still require the lexical feedback through the symbol table.
It's worth noting that the end result of all this complication is a language whose semantics are not very obvious in some corner cases. The fact that f+3 and f +3 produce different results could easily lead to subtle bugs which might be very hard to detect. In many projects using languages with these kinds of ambiguities, the project style guide will prohibit legal constructs with unclear semantics. You might want to take this into account in your language design, if you have not already done so.

Dealing with overloaded symbols in ambiguous grammars in ANTLR4

I am trying to write a parser for a dialect of Answer Set Programming (ASP) which, in terms of grammar, looks like Prolog with some extensions.
One extension, for instance is expansion, meaning that fact(1..3). for instance is expanded in fact(1). fact(2). fact(3).. Notice that the language understands INT and FLOAT numbers and uses . also as a terminator.
In some cases the parser fails to distinguish between integers, floats, extensions and separators because I reckon the language is clearly ambiguous. In that cases, I have to explicitly separate tokens with white spaces. Any Prolog or ASP parser, however, correctly deals with such productions. I read that ANTLR4 can disambiguate problematic productions autonomously, but probably it needs some help but I don't know how to do! ;-) I read something like here and here, but apparently they did not help me.
Could somebody please tell me what to do to overcome this ambiguity?
Please notice that I cannot change the language because it is quite standard.
In order to simplify the experts' work, I created a minimal working example that follows.
grammar Test;
program:
statement* ;
statement: // DOT is the statement terminator
range DOT |
intNum DOT |
floatNum DOT ;
intNum: // not needed, but helps in TestRig
INT;
floatNum: // not needed, but helps in TestRig
FLOAT;
range: // defines an expansion
INT DOTS INT ;
DOTS: '..';
DOT: '.';
FLOAT: DIGIT+ '.' DIGIT* | '.' DIGIT+ ;
INT: DIGIT+ ;
WS: [ \t\r\n]+ -> skip ;
fragment NONZERO : [1-9] ;
fragment DIGIT : [0] | NONZERO ;
I use the following input:
1 .
1. .
1.5 .
.5 .
1 .. 5 .
1.
1..
1.5.
.5.
1..5.
And I get the following errors which instead are parsed corrected by other tools:
line 8:0 extraneous input '1.' expecting '.'
line 11:2 extraneous input '.5' expecting '.'
Many thanks in advance!
Before your DOTS rule, add a unique rule for the statement terminal dot and disambiguate the DOTS rule (and change your other rules to use the TERMINAL):
TERMINAL: DOT { isTerminal(1) }? ;
DOTS: DOT DOT { !isTerminal(2) }? ;
DOT: '.';
where the predicate method simply looks ahead on the _input character stream to see if, at the current token index, the next character is white space. Put something like this in an #member block in your grammar:
public boolean isTerminal(int la) {
int offset = _tokenStartCharIndex + 1 + la;
String s = _input.getText(Interval.of(offset, offset));
if (Character.isWhitespace(s.charAt(0))) {
return true;
}
return false;
}
May have to do a bit more work if whitespace is valid between a DOTS and the trailing INT.
I recommend shifting the work to the parser.
If the lexer can't decide if 1..2 is 1. .2 or 1 .. 2 leave if up to the parser.
Maybe there is a context in which it can be interpreted as the first alternative and another context in which it may be interpreted as the second alternative.
Btw: 1..2. could be interpreted as 1 .. 2 . (range) or as 1. . 2 . (floatNum, intNum). How do you want to deal with this?
The following grammar should parse everything. But note that . . is treated as dots as well as 1 . 23 is a floatNum! You can check these tough while parsing or after parsing (depending on whether it should influence the parsing or not).
grammar Test;
program:
statement* ;
statement: // DOT is the statement terminator
range DOT |
intNum DOT |
floatNum DOT ;
intNum: // not needed, but helps in TestRig
INT;
floatNum:
INT DOT INT? | DOT INT ;
range: // defines an expansion
INT dots INT ;
dots : DOT DOT;
DOT: '.';
INT: DIGIT+ ;
WS: [ \t\r\n]+ -> skip ;
fragment NONZERO : [1-9] ;
fragment DIGIT : [0] | NONZERO ;
Prolog does not accept 1. as a float. This feature makes your grammar significantly more ambiguous, so maybe try removing that feature.

Overloading multiplication using menhir and OCaml

I have written a lexer and parser to analyze linear algebra statements. Each statement consists of one or more expressions followed by one or more declarations. I am using menhir and OCaml to write the lexer and parser.
For example:
Ax = b, where A is invertible.
This should be read as A * x = b, (A, invertible)
In an expression all ids must be either an uppercase or lowercase symbol. I would like to overload the multiplication operator so that the user does not have to type in the '*' symbol.
However, since the lexer also needs to be able to read strings (such as "invertible" in this case), the "Ax" portion of the expression is sent over to the parser as a string. This causes a parser error since no strings should be encountered in the expression portion of the statement.
Here is the basic idea of the grammar
stmt :=
| expr "."
| decl "."
| expr "," decl "."
expr :=
| term
| unop expr
| expr binop expr
term :=
| <int> num
| <char> id
| "(" expr ")"
decl :=
| id "is" kinds
kinds :=
| <string> kind
| kind "and" kinds
Is there some way to separate the individual characters and tell the parser that they should be treated as multiplication? Is there a way to change the lexer so that it is smart enough to know that all character clusters before a comma are ids and all clusters after should be treated as strings?
It seems to me you have two problems:
You want your lexer to treat sequences of characters differently in different places.
You want multiplication to be indicated by adjacent expressions (no operator in between).
The first problem I would tackle in the lexer.
One question is why you say you need to use strings. This implies that there is a completely open-ended set of things you can say. It might be true, but if you can limit yourself to a smallish number, you can use keywords rather than strings. E.g., invertible would be a keyword.
If you really want to allow any string at all in such places, it's definitely still possible to hack a lexer so that it maintains a state describing what it has seen, and looks ahead to see what's coming. If you're not required to adhere to a pre-defined grammar, you could adjust your grammar to make this easier. (E.g., you could use commas for only one purpose.)
For the second problem, I'd say you need to add adjacency to your grammar. I.e., your grammar needs a rule that says something like term := term term. I suspect it's tricky to get this to work correctly, but it does work in OCaml (where adjacent expressions represent function application) and in awk (where adjacent expressions represent string concatenation).

Resources