How to tokenize a phrase spanning multiple lines in ANTLR4

I want to tokenize the phrase "SINGULAR EXECUTIVE OF MINIMUM QUANTIA" when it is written across multiple lines. It is pretty simple if the full phrase is on one line:
foo bar foo bar foo bar SINGULAR EXECUTIVE OF MINIMUM QUANTIA foo bar foo bar foo bar foo bar
foo bar foo bar foo bar foo bar foo bar foo bar foo bar foo bar foo bar foo bar foo bar foo bar
but I cannot tokenize it when the phrase is split across two lines:
foo bar foo bar foo bar SINGULAR EXECUTIVE OF
MINIMUM QUANTIA foo bar foo bar foo bar foo bar
foo bar foo bar foo bar foo bar foo bar foo bar foo bar foo bar
This is my lexer:
SPECIALWORD: S I N G U L A R ' ' E X E C U T I V E ' ' O F ' ' M I N I M U M ' ' Q U A N T I A;
fragment A: ('a'|'A'|'á'|'Á');
...
fragment Z: ('z'|'Z');
WORDUPPER: UCASE_LETTER UCASE_LETTER+;
WORDLOWER: LCASE_LETTER LCASE_LETTER+;
WORDCAPITALIZE: UCASE_LETTER LCASE_LETTER+;
LCASE_LETTER: 'a'..'z' | 'ñ' | 'á' | 'é' | 'í' | 'ó' | 'ú';
UCASE_LETTER: 'A'..'Z' | 'Ñ' | 'Á' | 'É' | 'Í' | 'Ó' | 'Ú';
INT: DIGIT+;
DIGIT: [0-9];
WS : [ \t\r\n]+ -> skip;
ERROR: . ;
I have tried using a line break in the lexer rule:
SPECIALWORD: S I N G U L A R ' ' E X E C U T I V E ' ' O F [\n] M I N I M U M ' ' Q U A N T I A;
but it does not work; I guess it is because the lexer tokenizes line by line.

So what you actually want is to allow a combination of the five words to become a certain token, while allowing an arbitrary amount of whitespace between them. That is exactly the default working principle of ANTLR4-based parsers. Your attempt to put all of this into one single lexer token is what makes things complicated.
Instead define your (key) words as:
SINGULAR_SYMBOL: S I N G U L A R;
EXECUTIVE_SYMBOL: E X E C U T I V E;
OF_SYMBOL: O F;
MINIMUM_SYMBOL: M I N I M U M;
QUANTIA_SYMBOL: Q U A N T I A;
and define a parser rule to parse these as a special sentence:
singularExec: SINGULAR_SYMBOL EXECUTIVE_SYMBOL OF_SYMBOL MINIMUM_SYMBOL QUANTIA_SYMBOL;
Together with your WS rule, this will match any combination of whitespace between the individual symbols.
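For reference, a minimal stand-alone grammar built around this idea might look like the following sketch (the grammar name is made up, the letter fragments are written as character sets, and only the letters actually needed are listed):
grammar SpecialSentence;

singularExec: SINGULAR_SYMBOL EXECUTIVE_SYMBOL OF_SYMBOL MINIMUM_SYMBOL QUANTIA_SYMBOL;

SINGULAR_SYMBOL: S I N G U L A R;
EXECUTIVE_SYMBOL: E X E C U T I V E;
OF_SYMBOL: O F;
MINIMUM_SYMBOL: M I N I M U M;
QUANTIA_SYMBOL: Q U A N T I A;

// whitespace, including line breaks, never reaches the parser
WS: [ \t\r\n]+ -> skip;

fragment A: [aAáÁ];   fragment C: [cC];     fragment E: [eEéÉ];   fragment F: [fF];
fragment G: [gG];     fragment I: [iIíÍ];   fragment L: [lL];     fragment M: [mM];
fragment N: [nN];     fragment O: [oOóÓ];   fragment Q: [qQ];     fragment R: [rR];
fragment S: [sS];     fragment T: [tT];     fragment U: [uUúÚ];   fragment V: [vV];
fragment X: [xX];
Because WS is skipped, the five tokens may be separated by any mix of spaces, tabs, and line breaks, so the two-line input matches singularExec just like the one-line version; the surrounding foo/bar words would be handled by the WORDUPPER/WORDLOWER/WORDCAPITALIZE rules of the full grammar.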

Your revised rule matches if there is exactly one \n and no other character between "OF" and "MINIMUM". However, your input contains a space before the line break. Thus the rule does not match.
If you remove the space from the input or you adjust your rule to allow spaces before the line break, it will match.
You'll probably want to use either [ \n]+, which allows an arbitrary number of spaces and/or line breaks (you might want to throw in \t and \r as well for good measure), or ' '* '\n' ' '*, if you still want to restrict it to a single line break but allow any number of spaces around it.
That said, you'll probably have an easier time if you make each word its own token.
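If you do want to stay with a single SPECIALWORD token, a sketch of the adjusted rule, allowing arbitrary whitespace between every pair of words rather than only between "OF" and "MINIMUM", could be:
SPECIALWORD: S I N G U L A R [ \t\r\n]+ E X E C U T I V E [ \t\r\n]+ O F [ \t\r\n]+ M I N I M U M [ \t\r\n]+ Q U A N T I A;
Keep in mind that the matched whitespace then becomes part of the token text, which is one more reason the one-token-per-word approach from the other answer is usually the cleaner choice.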

Related

How to support different language versions in my lexer/parser

I am wondering what is the best way to support different versions of a language in my grammar.
I am working on modifying an existing grammar for a language, and there is a new version of the language introducing new keywords and additional syntax that I should be able to parse. However, existing codebases written in the language may already use these new keywords as identifiers, for example, so I have to make this extension optional.
So my question is: what is the preferred way to write conditional lexer and parser rules based on a boolean value? Semantic predicates come to mind, but I am relatively new to ANTLR and I'm not sure whether it is a good idea to use them for such a purpose.
I have had very good success with semantic predicates in the MySQL grammar to support various MySQL versions. This includes new features, removed features, and features that were valid only for a certain MySQL version range. Additionally, you can use the semantic predicates to tell the user in which version a specific syntax would be valid, but you have to parse the predicates yourself for that.
As an example, in the following excerpt a new import statement is conditionally added:
simpleStatement:
    // DDL
    ...
    | {serverVersion >= 80000}? importStatement
I have a field serverVersion in my common recognizer class, from which both the generated lexer and parser classes derive. This field is set to a valid version right before the parsing process is triggered.
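If you prefer not to maintain a hand-written base class, such a field can also be declared directly in the grammar with a members action. A minimal self-contained sketch (the grammar, rule, and token names as well as the version constant are made up, not taken from the MySQL grammar):
grammar Versioned;

@members {
    // set this right after constructing the parser, e.g. via a setter
    public int serverVersion = 80000;
}

statement: {serverVersion >= 80000}? importStatement | selectStatement;
importStatement: IMPORT_SYMBOL IDENTIFIER ';';
selectStatement: SELECT_SYMBOL IDENTIFIER ';';

IMPORT_SYMBOL: [iI][mM][pP][oO][rR][tT];
SELECT_SYMBOL: [sS][eE][lL][eE][cC][tT];
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]*;
WS: [ \t\r\n]+ -> skip;
With serverVersion below 80000 the predicate fails, so import foo; is rejected while select foo; still parses.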
You can also guard keywords in the lexer with this approach, as shown in this excerpt from the MySQL lexer:
MASTER_SYMBOL: M A S T E R;
MASTER_TLS_VERSION_SYMBOL: M A S T E R '_' T L S '_' V E R S I O N {serverVersion >= 50713}?;
MASTER_USER_SYMBOL: M A S T E R '_' U S E R;
MASTER_HEARTBEAT_PERIOD_SYMBOL: M A S T E R '_' H E A R T B E A T '_' P E R I O D?;
MATCH_SYMBOL: M A T C H; // SQL-2003-R
MAX_CONNECTIONS_PER_HOUR_SYMBOL: M A X '_' C O N N E C T I O N S '_' P E R '_' H O U R;
MAX_QUERIES_PER_HOUR_SYMBOL: M A X '_' Q U E R I E S '_' P E R '_' H O U R;
MAX_ROWS_SYMBOL: M A X '_' R O W S;
MAX_SIZE_SYMBOL: M A X '_' S I Z E;
MAX_STATEMENT_TIME_SYMBOL:
    M A X '_' S T A T E M E N T '_' T I M E {50704 < serverVersion && serverVersion < 50708}?
;
MAX_SYMBOL: M A X { setType(determineFunction(MAX_SYMBOL)); }; // SQL-2003-N
MAX_UPDATES_PER_HOUR_SYMBOL: M A X '_' U P D A T E S '_' P E R '_' H O U R;
MAX_USER_CONNECTIONS_SYMBOL: M A X '_' U S E R '_' C O N N E C T I O N S;
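The practical effect of such a lexer guard is that when the predicate evaluates to false, the guarded rule cannot match, and the lexer falls back to whatever other rule matches the same text, typically the generic identifier rule. A stripped-down sketch of that interplay (the rule name and version number are invented, not taken from the MySQL grammar):
// serverVersion must also be visible in the lexer, e.g. via @lexer::members or a lexer base class
NEW_KEYWORD_SYMBOL: [nN][eE][wW] '_' [kK][eE][yY][wW][oO][rR][dD] {serverVersion >= 80000}?;
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]*;
With a serverVersion below 80000, the input new_keyword is simply tokenized as IDENTIFIER, which is exactly what lets existing code keep using newer keywords as plain identifiers.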
There are two approaches you can take:
1. If the additional syntax is not valid in the earlier version of the grammar, and the interpretation of previously valid expressions does not change, only then can you consider using something like semantic predicates to decide which parts of the input are parsed with the new grammar and which with the old one.
Example: extending an integer calculator to support floats. 1.0 is invalid in the earlier grammar, and the new grammar does not change the semantics of integer calculations such as 1.
This condition is not as easy to meet as it may seem; the conditions can be quite nuanced, particularly if the grammar or its new versions are complex.
2. Have two versions of the lexer/parser and switch between them independently, as #lex-li suggests. This is the safe path that does not have to deal with semantic changes to old expressions caused by the additions of the new grammar syntax.

How do the try and <|> functions from the parsers library work?

(I use the trifecta parser library.) I'm trying to make a parser that parses integers into Right and literal sequences (letters, digits, and "-" are allowed) into Left:
*Lib> parseString myParser mempty "123 qwe 123qwe 123-qwe-"
Success [Right 123,Left "qwe",Left "123qwe",Left "123-qwe-"]
This is what I came up with:
myParser :: Parser [Either String Integer]
myParser = sepBy1 (try (Right . read <$> (some digit <* notFollowedBy (choice [letter, char '-'])))
                     <|> Left <$> some (choice [alphaNum, char '-']))
                  (char ' ')
My problem is that I don't understand why try is needed there (and in any other similar situations). When try is not used, an error appears:
*Lib> parseString myParser mempty "123 qwe 123qwe 123-qwe-"
Failure (ErrInfo {_errDoc = (interactive):1:12: error: expected: digit
1 | 123 qwe 123qwe 123-qwe-<EOF>
  |            ^ , _errDeltas = [Columns 11 11]})
So try puts the parsing cursor back to where we started on failure. Imagine try isn't used:
123qwe
   ^ failed there; the cursor position remains there
On the other hand, <|> is like "either": when the first parser fails, it should run the second parser, Left <$> some (choice [alphaNum, char '-']), and consume just "qwe".
Somewhere my reasoning must be wrong.
The second parser would indeed consume the "qwe" part if only it were given a chance to run. But it isn't given that chance.
Look at the definition of (<|>) for Parser:
Parser m <|> Parser n = Parser $ \ eo ee co ce d bs ->
  m eo (\e -> n (\a e' -> eo a (e <> e')) (\e' -> ee (e <> e')) co ce d bs) co ce d bs
Hmm... maybe it is not such a good idea to look at that, but let's push through nevertheless. To make sense of all those eo, ee, etc., let's look at their explanations in the Parser definition:
The first four arguments are behavior continuations:
epsilon success: the parser has consumed no input and has a result as well as a possible Err; the position and chunk are unchanged (see pure)
epsilon failure: the parser has consumed no input and is failing with the given Err; the position and chunk are unchanged (see empty)
committed success: the parser has consumed input and is yielding the result, set of expected strings that would have permitted this parse to continue, new position, and residual chunk to the continuation.
committed failure: the parser has consumed input and is failing with a given ErrInfo (user-facing error message)
In your case we clearly have a "committed failure": the Right parser has consumed some input and failed. So it is going to call the fourth continuation, denoted ce in the definition of (<|>).
And now look at the body of the definition: the fourth continuation is passed to parser m unchanged:
m eo (\e -> n (\a e' -> eo a (e <> e')) (\e' -> ee (e <> e')) co ce d bs) co ce d bs

(it is the ce in the trailing co ce d bs, the argument list handed directly to m)
This means that the parser returned from (<|>) calls the fourth continuation in every case in which parser m calls it, which in turn means that it fails with a "committed failure" whenever parser m fails with a "committed failure". That is exactly what you observe.
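try is what turns such a committed failure back into an epsilon failure, as if no input had been consumed, which is what gives (<|>) the chance to run its second parser. A minimal self-contained demonstration of the difference (the parser names are made up):
import Control.Applicative (some, (<|>))
import Text.Trifecta

-- on "12q": some digit consumes "12", then notFollowedBy letter fails, so the whole
-- left branch is a committed failure and (<|>) never tries the right branch
committed :: Parser String
committed = (some digit <* notFollowedBy letter) <|> some alphaNum

-- identical, except the left branch is wrapped in try: its failure is rewound to
-- "no input consumed", so (<|>) falls through to some alphaNum
recovered :: Parser String
recovered = try (some digit <* notFollowedBy letter) <|> some alphaNum

main :: IO ()
main = do
  print (parseString committed mempty "12q")  -- Failure ...
  print (parseString recovered mempty "12q")  -- Success "12q"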

Is it possible to transform this grammar to be LR(1)?

The following grammar generates the sentences a,a, a,b, b,b, c,b, ..., h,b. Unfortunately it is not LR(1), so it cannot be used with tools such as yacc.
S -> a comma a.
S -> C comma b.
C -> a | b | c | d | e | f | g | h.
Is it possible to transform this grammar to be LR(1) (or even LALR(1), LL(k) or LL(1)) without the need to expand the nonterminal C and thus significantly increase the number of productions?
Not as long as you have the nonterminal C unchanged preceding comma in some rule.
In that case it is clear that the parser cannot decide, having seen an a with lookahead comma, whether to shift (continuing S -> a comma a) or to reduce the a to C (for S -> C comma b). So with C unchanged, this grammar is not LR(1), as you have said.
But the solution lies in the two phrases "having seen an a" and "C unchanged". You asked whether there is a fix that does not expand C. There isn't, but you can expand C "a little bit" by removing a from C, since that is the source of the problem:
S -> a comma a .
S -> a comma b .
S -> C comma b .
C -> b | c | d | e | f | g | h .
So, we did not "significantly" increase the number of productions.

How to write a parsec parser for a list of interspersed elements?

Let's say the input looks something like foo#1 bar baz-3.qux [...]. I want to write a parser that only consumes the input up until the first space before the [, which means foo#1 bar baz-3.qux (without the trailing space).
How should I approach this using parsec?
I can imagine something like
foo = many1 $ letter <|> digit <|> oneOf " #-."
but this consumes even the space at the end, which I'd like to avoid. What is a general approach to parsing a list of things interspersed with another thing? (Imagine it's not just a space, but something that would also need to be parsed).
P.S: I'm looking for the most general solution possible, not a clever hack that solves this particular example.
I think what you're looking for is exactly notFollowedBy. Something like
foo = many1 $ letter
          <|> digit
          <|> oneOf "#-."
          <|> (try $ char ' ' >> notFollowedBy (char '[') >> return ' ')
You can abstract out the pattern to get the general function of course:
endedBy :: (Show y) => Parser x -> Parser x -> Parser y -> Parser [x]
endedBy p final terminal = many1 $ p <|> t
  where
    t = try $ do
          x <- final
          notFollowedBy terminal
          return x

foo' = endedBy (letter <|> digit <|> oneOf "#-.") (char ' ') (char '[')
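For completeness, a self-contained runnable version of the above (this assumes parsec with String input, so Parser is Text.Parsec.String.Parser; the sample input is made up):
import Text.Parsec
import Text.Parsec.String (Parser)

-- keep consuming 'p' or 'final', but only accept 'final' when 'terminal' does not follow it
endedBy :: (Show y) => Parser x -> Parser x -> Parser y -> Parser [x]
endedBy p final terminal = many1 $ p <|> t
  where
    t = try $ do
          x <- final
          notFollowedBy terminal
          return x

foo' :: Parser String
foo' = endedBy (letter <|> digit <|> oneOf "#-.") (char ' ') (char '[')

main :: IO ()
main = print (parse foo' "" "foo#1 bar baz-3.qux [rest]")
-- expected output: Right "foo#1 bar baz-3.qux"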

How to remove left-recursion in the following grammar?

Unfortunately, ANTLR cannot handle direct left recursion when the rule has parameters. The only viable option is to remove the left recursion. Is there a way to remove the left recursion in the following grammar?
a[int x]
    : b a[$x] c
    | a[$x - 1]
      (
          c a[$x - 1]
          | b c
      )
    ;
The problem is in the second alternative involving left recursion. Any kind of help would be much appreciated.
Without the parameters and easier formatting, it would look like this:
a
    : b a c
    | a (c a | b c)
    ;
When a's left-recursive alternative is matched n times, it just means that (c a | b c) will be matched n times, prepended with the terminating b a c (the first alternative). That means this rule will always start with b a c, followed by zero or more occurrences of (c a | b c):
a
    : b a c (c a | b c)*
    ;
