Velocity variables use the following notation (see the Velocity User Guide):
The shorthand notation of a variable consists of a leading "$" character followed by a VTL Identifier. A VTL Identifier must start with an alphabetic character (a .. z or A .. Z). The rest of the characters are limited to the following types of characters:
alphabetic (a .. z, A .. Z)
numeric (0 .. 9)
underscore ("_")
I want to use a lexer mode to split the normal text from the variables, so I wrote something like this:
// default mode
DOLLAR : '$' -> pushMode(VARIABLE);
TEXT : ~[$]+? -> skip;
mode VARIABLE;
ID : [a-zA-Z] [a-zA-Z0-9-_]*;
???? : XXX -> popMode; // how can I pop mode to default?
Because the variable notation has no explicit end character, I don't know how to determine where a variable ends.
Maybe I got it wrong?
You would pop out of that scope like this:
mode VARIABLE;
ID : [a-zA-Z] [a-zA-Z0-9-_]* -> popMode;
Here's a quick demo:
lexer grammar VelocityLexer;
DOLLAR : '$' -> more, pushMode(VARIABLE);
TEXT : ~[$]+ -> skip;
mode VARIABLE;
// the `-` needs to be escaped!
ID : [a-zA-Z] [a-zA-Z0-9\-_]* -> popMode;
Note the more in the DOLLAR rule, which causes the $ to be included in the ID token. If you leave it out, you end up with two tokens ($ and foo) for the input $foo.
Test the grammar with the following Java class:
import org.antlr.v4.runtime.*;

public class Main {
    public static void main(String[] args) {
        VelocityLexer lexer = new VelocityLexer(CharStreams.fromString("<strong>$Mu</strong>$foo..."));
        CommonTokenStream tokenStream = new CommonTokenStream(lexer);
        tokenStream.fill();
        for (Token t : tokenStream.getTokens()) {
            System.out.printf("%-10s '%s'\n", VelocityLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
        }
    }
}
which will print:
ID '$Mu'
ID '$foo'
EOF '<EOF>'
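For comparison, if DOLLAR did not use more, the same input should produce two tokens per variable, something like:
DOLLAR '$'
ID 'Mu'
DOLLAR '$'
ID 'foo'
EOF '<EOF>'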
However, I think a lexical mode is not a good choice for something as simple as an ID. Why not simply do the following?
lexer grammar VelocityLexer;
DOLLAR : '$' [a-zA-Z] [a-zA-Z0-9\-_]*;
TEXT : ~[$]+ -> skip;
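With that single-rule version, the same Main class should print the whole variable as one DOLLAR token:
DOLLAR '$Mu'
DOLLAR '$foo'
EOF '<EOF>'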
I am stuck at the following parsing problem:
Parse some text string that may contain zero or more elements from a limited character set, up to but not including one of a set of termination characters. Content/no content should be indicated through Maybe. Termination characters may appear in the string in escaped form. Parsing should fail on any inadmissible character.
This is what I came up with (simplified):
import Control.Applicative ((<|>))
import Data.Text (Text)
import qualified Data.Text as T
import qualified Text.Megaparsec as MP
import qualified Text.Megaparsec.Char as MC
-- Predicate for admissible characters, not including the control characters.
isAdmissibleChar :: Char -> Bool
...
-- Predicate for control characters that need to be escaped.
isControlChar :: Char -> Bool
...
-- The escape character.
escChar :: Char
...
pComponent :: Parser (Maybe Text)
pComponent = do
    t <- MP.many (escaped <|> regular)
    if null t then return Nothing else return $ Just (T.pack t)
  where
    regular = MP.satisfy isAdmissibleChar <|> fail "Inadmissible character"
    escaped = do
        _ <- MC.char escChar
        MP.satisfy isControlChar -- only control characters may be escaped
Say, admissible characters are uppercase ASCII, escape is '\', and control is ':'.
Then, the following parses correctly: ABC\:D:EF to yield ABC:D.
However, parsing ABC&D, where & is inadmissible, does yield ABC whereas I would expect an error message instead.
Two questions:
Why does fail end parsing instead of failing the parser?
Is the above a sensible way to approach the problem, or is there a "proper", canonical way to parse such terminated strings that I am not aware of?
many has to allow its sub-parser to fail once without the whole parse failing. For example, many (char 'A') *> char 'B', while parsing "AAAB", has to fail to parse the B to know it got to the end of the As.
You might want manyTill, which allows you to recognise the terminator explicitly. Something like this:
MP.manyTill (escaped <|> regular) (MP.satisfy isControlChar)
"ABC&D" would give an error here assuming '&' isn't accepted by isControlChar.
Or, if you want to parse more than one component, you might keep your existing definition of pComponent and use it with sepBy or similar, like:
MP.sepBy pComponent (MP.satisfy isControlChar)
If you also check for end-of-file after this, like:
MP.sepBy pComponent (MP.satisfy isControlChar) <* MP.eof
then "ABC&D" should give an error again, because the '&' will end the first component but will not be accepted as a separator.
What a parser normally does is extract from the input stream whatever portion it is supposed to accept. That's the usual rule.
Here, it seems you want the parser to accept strings that are followed by something specific: from your examples, either end of file (eof) or the character ':'. So you might want to consider lookahead.
Environment and auxiliary functions:
import Data.Void (Void)
import qualified Data.Text as T
import qualified Text.Megaparsec as MP
import qualified Text.Megaparsec.Char as MC
type Parser = MP.Parsec Void T.Text
-- Predicate for admissible characters, not including the control characters.
isAdmissibleChar :: Char -> Bool
isAdmissibleChar ch = elem ch ['A' .. 'Z']
-- Predicate for control characters that need to be escaped.
isControlChar :: Char -> Bool
isControlChar ch = elem ch ":"
-- The escape character:
escChar :: Char
escChar = '\\'
Termination parser, to be used for look ahead:
termination :: Parser ()
termination = MP.eof MP.<|> do
    _ <- MP.satisfy isControlChar
    return ()
Modified pComponent parser:
pComponent :: Parser (Maybe T.Text)
pComponent = do
    txt <- MP.many (escaped MP.<|> regular)
    MP.lookAhead termination -- **CHANGE HERE**
    if null txt then return Nothing else return $ Just (T.pack txt)
  where
    regular = MP.satisfy isAdmissibleChar MP.<|> fail "Inadmissible character"
    escaped = do
        _ <- MC.char escChar
        MP.satisfy isControlChar -- only control characters may be escaped
Testing utility:
tryParse :: String -> IO ()
tryParse str = do
    let res = MP.parse pComponent "(noname)" (T.pack str)
    putStrLn $ show res
Let's try to rerun your examples:
$ ghci
λ>
λ> :load q67809465.hs
λ>
λ> str1 = "ABC\\:D:EF"
λ> putStrLn str1
ABC\:D:EF
λ>
λ> tryParse str1
Right (Just "ABC:D")
λ>
So that is successful, as desired.
λ>
λ> tryParse "ABC&D"
Left (ParseErrorBundle {bundleErrors = TrivialError 3 (Just (Tokens ('&' :| ""))) (fromList [EndOfInput]) :| [], bundlePosState = PosState {pstateInput = "ABC&D", pstateOffset = 0, pstateSourcePos = SourcePos {sourceName = "(noname)", sourceLine = Pos 1, sourceColumn = Pos 1}, pstateTabWidth = Pos 8, pstateLinePrefix = ""}})
λ>
So that fails, as desired.
Trying our two acceptable termination contexts:
λ> tryParse "ABC:&D"
Right (Just "ABC")
λ>
λ>
λ> tryParse "ABCDEF"
Right (Just "ABCDEF")
λ>
fail does not end parsing in general. It just continues with the next alternative. In this case it selects the empty list alternative introduced by the many combinator, so it stops parsing without an error message.
I think the best way to solve your problem is to specify that the input must end in a termination character, that means that it cannot "succeed" halfway like this. You can do that with the notFollowedBy or lookAhead combinators. Here is the relevant part of the megaparsec tutorial.
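For instance, a minimal sketch of the notFollowedBy variant (my own illustration, reusing the Parser type and the predicates from the other answer, not code from the tutorial):
pComponent' :: Parser (Maybe T.Text)
pComponent' = do
    txt <- MP.many (escaped MP.<|> regular)
    -- Demand that the next character, if any, is a terminator; notFollowedBy
    -- succeeds exactly when its argument parser fails, and consumes no input.
    MP.notFollowedBy (MP.satisfy (not . isControlChar))
    if null txt then return Nothing else return (Just (T.pack txt))
  where
    regular = MP.satisfy isAdmissibleChar
    escaped = MC.char escChar *> MP.satisfy isControlChar
On "ABC&D" this fails at the '&' instead of returning Just "ABC"; the explicit fail branch is no longer needed, since any inadmissible character now trips the notFollowedBy check.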
I gave ANTLR4 the following parser and lexer grammars in separate files (based on a simple grammar for BNF grammars):
parser grammar BNFParser;
options {tokenVocab = BNFLexer;}
compileUnit
    : grammar_rule+
    ;

grammar_rule
    : NON_TERMINAL COLON (OR? grammar_rule_alternative)* SEMICOLON
    ;

grammar_rule_alternative
    : (NON_TERMINAL|TERMINAL)+
    ;
and
lexer grammar BNFLexer;
TERMINAL : [A-Z][A-Za-z0-9_]*;
NON_TERMINAL : [a-z][A-Za-z0-9_]*;
OR : '|';
COLON : ':';
SEMICOLON : ';';
WS
: [ \t\r\n]+ -> skip
;
The main program:
private static void Main(string[] args) {
    StreamReader reader = new StreamReader(args[0]);
    AntlrInputStream stream = new AntlrInputStream(reader);
    BNFLexer lexer = new BNFLexer(stream);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    BNFParser parser = new BNFParser(tokens);
    IParseTree root = parser.compileUnit();
    Console.WriteLine(root.ToStringTree());
}
I also supplied the following test file for testing the grammar:
compileunit : x a
;
x : S b
;
S : compileunit f
;
Please notice from the lexer grammar that non-terminals begin with a lowercase letter while terminals begin with an uppercase letter. The given grammar has an error: the third rule uses a capital letter (S) to define the non-terminal S. The expected behaviour would be to report this as an error. On the contrary, parsing succeeds by consuming the first two rules and ignoring the third rule for S, without reporting any error. I have also looked at the generated files and noticed the following:
try {
    EnterOuterAlt(_localctx, 1);
    {
        State = 7;
        _errHandler.Sync(this);
        _la = _input.La(1);
        do {
            {
                {
                    State = 6; grammar_rule();
                }
            }
            State = 9;
            _errHandler.Sync(this);
            _la = _input.La(1);
        } while ( _la==NON_TERMINAL );
    }
}
catch (RecognitionException re) {
    _localctx.exception = re;
    _errHandler.ReportError(this, re);
    _errHandler.Recover(this, re);
}
The above code shows that the parser expects a NON_TERMINAL token at the start of a grammar_rule, which is what I expect. However, what happens when this is not the case? Another weird issue is that the CommonTokenStream object containing the tokens recognized by the lexer holds only the tokens up to the end of the second rule, and none of the tokens of the third rule (S). Is this proper behaviour?
Add an EOF token to your main rule (compileUnit). That will force the parser to use all input until EOF and report an error if that didn't fully match.
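In this grammar only the start rule needs to change, something like:
compileUnit
    : grammar_rule+ EOF
    ;
With EOF in place, the unmatched input starting at the S rule can no longer be silently dropped and should be reported as a syntax error.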
I am trying to write an Xtext grammar for configuration files (known by the .ini extension).
For instance, I'd like to successfully parse
[Section1]
a = Easy123
b = This *is* valid too
[Section_2]
c = Voilà # inline comments are ignored
My problem is matching the property value (what's on the right of the '=').
My current grammar works if the property matches the ID terminal (eg a = Easy123).
PropertyFile hidden(SL_COMMENT, WS):
    sections+=Section*;

Section:
    '[' name=ID ']'
    (NEWLINE properties+=Property)+
    NEWLINE+;

Property:
    name=ID (':' | '=') value=ID ';'?;

terminal WS:
    (' ' | '\t')+;

terminal NEWLINE:
    // New line on DOS or Unix
    '\r'? '\n';

terminal ID:
    ('A'..'Z' | 'a'..'z') ('A'..'Z' | 'a'..'z' | '_' | '-' | '0'..'9')*;

terminal SL_COMMENT:
    // Single line comment
    '#' !('\n' | '\r')*;
I don't know how to generalize the grammar to match any text (eg c = Voilà).
I certainly need to introduce a new terminal
Property:
    name=ID (':' | '=') value=TEXT ';'?;
Question is: how should I define this TEXT terminal?
I have tried
terminal TEXT: ANY_OTHER+;
This raises a warning
The following token definitions can never be matched because prior tokens match the same input: RULE_INT,RULE_STRING,RULE_ML_COMMENT,RULE_ANY_OTHER
(I think it doesn't matter).
Parsing fails with
Required loop (...)+ did not match anything at input 'à'
terminal TEXT: !('\r'|'\n'|'#')+;
This raises a warning
The following token definitions can never be matched because prior tokens match the same input: RULE_INT
(I think it doesn't matter).
Parsing fails with
Missing EOF at [Section1]
terminal TEXT: ('!'|'$'..'~'); (which covers most characters, except # and ")
No warning during the generation of the lexer/parser.
However, parsing fails with
Mismatch input 'Easy123' expecting RULE_TEXT
Extraneous input 'This' expecting RULE_TEXT
Required loop (...)+ did not match anything at 'is'
Thanks for your help (and I hope this grammar can be useful for others too)
This grammar does the trick:
grammar org.xtext.example.mydsl.MyDsl hidden(SL_COMMENT, WS)

generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
import "http://www.eclipse.org/emf/2002/Ecore"

PropertyFile:
    sections+=Section*;

Section:
    '[' name=ID ']'
    (NEWLINE+ properties+=Property)+
    NEWLINE+;

Property:
    name=ID value=PROPERTY_VALUE;

terminal PROPERTY_VALUE: (':' | '=') !('\n' | '\r')*;

terminal WS:
    (' ' | '\t')+;

terminal NEWLINE:
    // New line on DOS or Unix
    '\r'? '\n';

terminal ID:
    ('A'..'Z' | 'a'..'z') ('A'..'Z' | 'a'..'z' | '_' | '-' | '0'..'9')*;

terminal SL_COMMENT:
    // Single line comment
    '#' !('\n' | '\r')*;
The key is that you do not try to cover the complete semantics in the grammar alone, but take other services into account, too. The terminal rule PROPERTY_VALUE consumes the complete value, including the leading assignment operator and the optional trailing semicolon.
Now just register a value converter service for that language and take care of the insignificant parts of the input there:
import org.eclipse.xtext.conversion.IValueConverter;
import org.eclipse.xtext.conversion.ValueConverter;
import org.eclipse.xtext.conversion.ValueConverterException;
import org.eclipse.xtext.conversion.impl.AbstractDeclarativeValueConverterService;
import org.eclipse.xtext.conversion.impl.AbstractIDValueConverter;
import org.eclipse.xtext.conversion.impl.AbstractLexerBasedConverter;
import org.eclipse.xtext.nodemodel.INode;
import org.eclipse.xtext.util.Strings;
import com.google.inject.Inject;
public class PropertyConverters extends AbstractDeclarativeValueConverterService {

    @Inject
    private AbstractIDValueConverter idValueConverter;

    @ValueConverter(rule = "ID")
    public IValueConverter<String> ID() {
        return idValueConverter;
    }

    @Inject
    private PropertyValueConverter propertyValueConverter;

    @ValueConverter(rule = "PROPERTY_VALUE")
    public IValueConverter<String> PropertyValue() {
        return propertyValueConverter;
    }

    public static class PropertyValueConverter extends AbstractLexerBasedConverter<String> {
        @Override
        protected String toEscapedString(String value) {
            return " = " + Strings.convertToJavaString(value, false);
        }

        public String toValue(String string, INode node) {
            if (string == null)
                return null;
            try {
                String value = string.substring(1).trim();
                if (value.endsWith(";")) {
                    value = value.substring(0, value.length() - 1);
                }
                return value;
            } catch (IllegalArgumentException e) {
                throw new ValueConverterException(e.getMessage(), node, e);
            }
        }
    }
}
The following test case will succeed after you register the service in the runtime module like this:
@Override
public Class<? extends IValueConverterService> bindIValueConverterService() {
    return PropertyConverters.class;
}
Test case:
import org.junit.runner.RunWith
import org.eclipse.xtext.junit4.XtextRunner
import org.xtext.example.mydsl.MyDslInjectorProvider
import org.eclipse.xtext.junit4.InjectWith
import org.junit.Test
import org.eclipse.xtext.junit4.util.ParseHelper
import com.google.inject.Inject
import org.xtext.example.mydsl.myDsl.PropertyFile
import static org.junit.Assert.*
@RunWith(typeof(XtextRunner))
@InjectWith(typeof(MyDslInjectorProvider))
class ParserTest {

    @Inject
    ParseHelper<PropertyFile> helper

    @Test
    def void testSample() {
        val file = helper.parse('''
            [Section1]
            a = Easy123
            b : This *is* valid too;
            [Section_2]
            # comment
            c = Voilà # inline comments are ignored
        ''')
        assertEquals(2, file.sections.size)
        val section1 = file.sections.head
        assertEquals(2, section1.properties.size)
        assertEquals("a", section1.properties.head.name)
        assertEquals("Easy123", section1.properties.head.value)
        assertEquals("b", section1.properties.last.name)
        assertEquals("This *is* valid too", section1.properties.last.value)
        val section2 = file.sections.last
        assertEquals(1, section2.properties.size)
        assertEquals("Voilà # inline comments are ignored", section2.properties.head.value)
    }
}
The problem (or one problem, anyway) with parsing a format like that is that, since the text part may contain = characters, a line like foo = bar will be interpreted as a single TEXT token, not as an ID followed by a '=' followed by a TEXT. I can see no way to avoid that without disallowing (or requiring escaping of) = characters in the text part.
If that is not an option, I think the only solution is to make a token type LINE that matches an entire line and then take it apart yourself. You'd do that by removing TEXT and ID from your grammar and replacing them with a token type LINE that matches everything up to the next line break or comment sign and must start with a valid ID. So something like this:
LINE :
    ('A'..'Z' | 'a'..'z') ('A'..'Z' | 'a'..'z' | '_' | '-' | '0'..'9')*
    WS* '=' WS*
    !('\r' | '\n' | '#')+
    ;
This token would basically replace your Property rule.
Of course this is a rather unsatisfactory solution, as it gives you the entire line as a string and you still have to pick it apart yourself to separate the ID from the text part. It also prevents you from highlighting the ID part or the = sign, as the entire line is one token and you can't highlight part of a token (as far as I know). Overall this does not buy you all that much over not using Xtext at all, but I don't see a better way.
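For illustration, picking a LINE token apart afterwards could look something like this (a hypothetical helper of my own, not part of Xtext):
// Hypothetical post-processing: split the text of one LINE token into
// the ID part and the value part at the first '='.
static String[] splitLine(String line) {
    int eq = line.indexOf('=');
    String name = line.substring(0, eq).trim();
    String value = line.substring(eq + 1).trim();
    return new String[] { name, value };
}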
As a workaround, I have changed the Property rule to:
Property:
    name=ID ':' value=ID ';'?;
Now, of course, = is no longer in conflict, but this is certainly not a good solution, because properties are usually defined as name=value.
Edit: Actually, my input is a specific property file, and the properties are known in advance.
My code now looks like this:
Section:
    '[' name=ID ']'
    (NEWLINE (properties+=AbstractProperty)?)+;

AbstractProperty:
    ADef | BDef;

ADef:
    'A' (':'|'=') ID;

BDef:
    'B' (':'|'=') Float;
There is an extra benefit: the property names are known as keywords, and colored as such. However, autocompletion only suggests '[' :(
I am trying to write an ANTLR grammar for the PHP serialize() format, and everything seems to work fine except for strings. The problem is that the format of serialized strings is:
s:6:"length";
In terms of regexes, a rule like s:(\d+):".{\1}"; would describe this format if only backreferences were allowed in the "number of matches" count (but they are not).
But I cannot find a way to express this for either a lexer or parser grammar: the whole idea is to make the number of characters read depend on a backreference describing the number of characters to read, as in Fortran Hollerith constants (i.e. 6HLength), not on a string delimiter.
This example from the ANTLR grammar for Fortran seems to point the way, but I don't see how. Note that my target language is Python, while most of the doc and examples are for Java:
// numeral literal
ICON {int counter=0;} :
    /* other alternatives */
    // hollerith
    'h' ({counter>0}? NOTNL {counter--;})* {counter==0}?
    {
        $setType(HOLLERITH);
        String str = $getText;
        str = str.replaceFirst("([0-9])+h", "");
        $setText(str);
    }
    /* more alternatives */
    ;
Since input like s:3:"a"b"; is valid, you can't define a String token in your lexer, unless the first and last double quote are always the start and end of your string. But I guess this is not the case.
So, you'll need a lexer rule like this:
SString
    : 's:' Int ':"' ( . )* '";'
    ;
In other words: match an s:, then an integer value followed by :", then zero or more characters that can be anything, ending with ";. But you need to tell the lexer to stop consuming characters once the count given by Int is reached. You can do that by mixing some plain code into your grammar, wrapped inside { and }. So first convert the value the Int token holds into an integer variable called chars:
SString
    : 's:' Int {chars = int($Int.text)} ':"' ( . )* '";'
    ;
Now embed some code inside the ( . )* loop to stop it consuming as soon as chars is counted down to zero:
SString
    : 's:' Int {chars = int($Int.text)} ':"' ( {if chars == 0: break} . {chars = chars-1} )* '";'
    ;
and that's it.
A little demo grammar:
grammar Test;

options {
    language=Python;
}

parse
    : (SString {print 'parsed: [\%s]' \% $SString.text})+ EOF
    ;

SString
    : 's:' Int {chars = int($Int.text)} ':"' ( {if chars == 0: break} . {chars = chars-1} )* '";'
    ;

Int
    : '0'..'9'+
    ;
(note that you need to escape the % inside your grammar!)
And a test script:
import antlr3
from TestLexer import TestLexer
from TestParser import TestParser
input = 's:6:"length";s:1:""";s:0:"";s:3:"end";'
char_stream = antlr3.ANTLRStringStream(input)
lexer = TestLexer(char_stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = TestParser(tokens)
parser.parse()
which produces the following output:
parsed: [s:6:"length";]
parsed: [s:1:""";]
parsed: [s:0:"";]
parsed: [s:3:"end";]
I am using Scala's combinator parsers by extending scala.util.parsing.combinator.syntactical.StandardTokenParsers. This class provides the following methods:
def ident : Parser[String] for parsing identifiers and
def numericLit : Parser[String] for parsing a number (decimal I suppose)
I am using scala.util.parsing.combinator.lexical.Scanners from scala.util.parsing.combinator.lexical.StdLexical for lexing.
My requirement is to parse a hexadecimal number (without the 0x prefix) which can be of any length. Basically a grammar like: ([0-9]|[a-f])+
I tried integrating a regex parser, but there are type issues there. Other ways of extending the lexer delimiters and grammar rules lead to "token not found" errors!
As I thought, the problem can be solved by extending the behavior of the lexer, not the parser. The standard lexer accepts only decimal digits, so I created a new lexer:
class MyLexer extends StdLexical {
  override type Elem = Char
  override def digit = ( super.digit | hexDigit )
  lazy val hexDigits = Set[Char]() ++ "0123456789abcdefABCDEF".toArray
  lazy val hexDigit = elem("hex digit", hexDigits.contains(_))
}
And my parser (which has to be a StandardTokenParser) can be extended as follows:
object ParseAST extends StandardTokenParsers {
  override val lexical: MyLexer = new MyLexer()
  lexical.delimiters += ("(", ")", ",", "#")
  ...
}
The construction of the "number" from the digits is taken care of by the StdLexical class:
class StdLexical {
  ...
  def token: Parser[Token] =
    ...
    | digit ~ rep(digit) ^^ { case first ~ rest => NumericLit(first :: rest mkString "") }
}
Since StdLexical gives back the parsed number just as a String, this is not a problem for me, as I am not interested in the numeric value anyway.
You can use the RegexParsers with an action associated to the token in question.
import scala.util.parsing.combinator._

object HexParser extends RegexParsers {
  val hexNum: Parser[Int] =
    """[0-9a-f]+""".r ^^ { case s: String => Integer.parseInt(s, 16) }

  def seq: Parser[Any] = repsep(hexNum, ",")
}
This defines a parser that reads comma-separated hex numbers with no 0x prefix, and it actually returns an Int.
val result = HexParser.parse(HexParser.seq, "1, 2, f, 10, 1a2b34d")
scala> println(result)
[1.21] parsed: List(1, 2, 15, 16, 27439949)
Note that there is no way to distinguish numbers in decimal notation. Also, I'm using Integer.parseInt, which is limited to the size of your Int. To handle numbers of any length you may have to make your own parser and use BigInteger or arrays.
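As a minimal sketch of that last idea (my own variant, not part of the answer above): Scala's BigInt can parse a string in a given radix, so swapping it in removes the length limit:
import scala.util.parsing.combinator._

object BigHexParser extends RegexParsers {
  // BigInt(s, 16) builds an arbitrary-precision value, so the hex
  // literal can be of any length.
  val hexNum: Parser[BigInt] = """[0-9a-f]+""".r ^^ (s => BigInt(s, 16))

  def seq: Parser[Any] = repsep(hexNum, ",")
}
Usage is the same as above, e.g. BigHexParser.parse(BigHexParser.seq, "1, 2, f, 10, 1a2b34d").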