Tokenize one word into multiple tokens based on position ANTLR4

Tokenize one word into multiple tokens based on position ANTLR4 - parsing

I'm writing a parser for RPG 2. RPG 2 is position based and I use predicates to achieve this. However there is one place I'm stuck.
One statement in RPG is like
26 C N20 'PICK' CHAINORDOFIL 09
Here CHAIN spans from position 27 to 32 and ORDOFIL spans from 33 to 41.
My rules to match Chain and the Identifier that follows is like
CALCULATION_OPERATIONS_CHAIN : CHAIN_T {(getCharPositionInLine()>=27) &&(getCharPositionInLine()<=32)}? ->type(CHAIN_T);
CALCULATION_FACTOR2_1: IDENT_T {(getCharPositionInLine()>=33) && (getCharPositionInLine()<=42)}? ->type(IDENT_T);
But my problem is "CHAINORDOFIL" matches in the second rule (as IDENT_T).
What can I do to match CHAIN in the first rule and ORDOFIL in the second?
Any suggestion? Thanks in advance

I was able to solve this by using Modes.
Once I match the CHAIN I switch to a mode and handle the Identifier (ORDOFIL) in there.
The rules go like this
CALCULATION_OPERATIONS_CHAIN : CHAIN_T {(getCharPositionInLine()>=27) && (getCharPositionInLine()<=32)}? ->type(CHAIN_T),pushMode(CHAIN);
mode CHAIN;
CHAIN_WHITESPACE_T : ' ' ->skip;
CHAIN_FACTOR2_1: IDENT_T {(getCharPositionInLine()>=33) && (getCharPositionInLine()<=42)}? ->type(IDENT_T);
CHAIN_ENDLINE : NEWLINE -> type(EOL),popMode,popMode ;

Related

How to understand ANTLRWorks 1.5.2 MismatchedTokenException(80!=21)

I'm testing a simple grammar (shown below) with simple input strings and get the following error message from the Antlrworks interpreter: MismatchedTokenException(80!=21).
My input (abc45{r24}) means "repeat the keys a, b, c, 4 and 5, 24 times."
ANTLRWorks 1.5.2 Grammar:
expr : '(' (key)+ repcount ')' EOF;
key : KEY | digit ;
repcount : '{' 'r' count '}';
count : (digit)+;
digit : DIGIT;
DIGIT : '0'..'9';
KEY : ('a'..'z'|'A'..'Z') ;
Inputs:
(abc4{r4}) - ok
(abc44{r4}) - fails NoViableAltException
(abc4 4{r4}) - ok
(abc4{r45}) - fails MismatchedTokenException(80!=21)
(abc4{r4 5}) - ok
The parse succeeds with input (abc4{r4}) (single digits only).
The parse fails with input (abc44{r4}) (NoViableAltException).
The parse fails with input (abc4{r45}) (MismatchedTokenException(80!=21)).
The parse errors go away if I put a space between 44 or 45 to separate the individual digits.
Q1. What does NoViableAltException mean? How can I interpret it to look for a problem in the grammar/input pair?
Q2. What does the expression 80!=21 mean? Can I do anything useful with the information to look for a problem in the grammar/input pair?
I don't understand why the grammar has a problem reading successive digits. I thought my expressions (key)+ and (digit)+ specify that successive digits are allowed and would be read as successive individual digits.
If someone could explain what I'm doing wrong, I would be grateful. This seems like a simple problem, but hours later, I still don't understand why and how to solve it. Thank you.
UPDATE:
Further down in my simple grammar file I had a lexer rule for FLOAT copied from another grammar. I did not think to include it above (or check it as a source of the errors) because it was not used by any parser rule and would never match my input characters. Here is the FLOAT grammar rule (which contains sequences of DIGITs):
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
If I delete the whole rule, all my test cases above parse successfully. If I leave any one of the three FLOAT clauses in the grammar/lexer file, the parses fail as shown above.
Q3. Why does the FLOAT rule cause failures in the parse? The DIGIT lexer rule appears first, and so should "win" and be used in preference to the FLOAT rule. Besides, the FLOAT rule doesn't match the input stream.
I hazard a guess that the lexer is skipping the DIGIT rule getting stuck in the FLOAT rule, even though FLOAT comes after DIGIT in the input file.
SCREENSHOTS
I took these two screenshots after Bart's comment below to show the parse failures that I am experiencing. Not that it matters, but ANTLRWorks 1.5.2 will not accept the syntax SPACE : [ \t\r\n]+; regular expression syntax in Bart's kind replies. Maybe the screenshots will help. They show all the rules in my grammar file.
The only difference in the two screenshots is that one input has two sets of multiple digits and the other input string has only set of multiple digits. Maybe this extra info will help somehow.

If I remember correctly, ANTLR's v3 lexer is less powerful than v4's version. When the lexer gets the input "123x", this first 3 chars (123) are consumed by the lexer rule FLOAT, but after that, when the lexer encounters the x, it knows it cannot complete the FLOAT rule. However, the v3 lexer does not give up on its partial match and tries to find another rule, below it, that matches these 3 chars (123). Since there is no such rule, the lexer throws an exception. Again, not 100% sure, this is how I remember it.
ANTLRv4's lexer will give up on the partial 123 match and will return 23 to the char stream to create a single KEY token for the input 1.
I highly suggest you move away from v3 and opt for the more powerful v4 version.

Lua Pattern, problem for combination handle

I want to capture some strings, but how come this is not working? I noticed that using [] it only detects each individual character, I wanted to know if it is possible with more characters
I want to take these combinations, but it's wrong
A ||
Z <<
O ~~~
O..
Current Code:
C = [[
A
B|
C<
Z<<
O~~~
O.
O..
]]
C = C:gsub("(\n%a[(||)(<<)(~~~)(%.%.%.)])",function(a)
print(a)
end)
Output:
B|
C<
Z<
O~
O.
O.

Your Pattern should be something like: (\n%a[|<~%.]+).
Placing a ( inside a lua pattern set just adds ( to the list of chars that could be matched it does not make a "sub-set" or force a required match length.
Lua patterns do not match multiple chars if repeated in a single set. to match multiple chars you need to use the +, * or use multiple instance of the set like this: (\n%a[|<~%.][|<~%.][|<~%.]).
Issues with this are that multiple instances of the set must all match, while if the + is used you have variability in the length of instances you could match such as one . rather than three.
You can not enforce granularity to match 2 different lengths of characters. By this, I mean you can not match specifically O<< and O~~~ in the same pattern while not matching O<<<, O~~ or O<<~.
Resources to learn more about Lua patterns:
FHUG - Understanding Lua Patterns

How can I combine words with numbers when pattern matching in LUA?

I'm trying to match any strings that come in that follow the format Word 100.00% ~(45.56, 34.76) in LUA. As such, I'm looking to do a regex close (in theory) to this:
%D%s[%d%.%d]%%(%d.%d, %d.%d)
But I'm having no luck so far. LUA's patterns are weird.
What am I missing?

Your pattern is close you neglected to allow for multiple instances of a digit you can do this by using a + at like %d+.
You also did not use [,( and . correctly in the pattern.
[s in a pattern will create a set of chars that you are trying to match such as [abc] means you are looking to match any as bs or c at that position.
( are used to define a capture so the specific values you want returned rather then the whole string in the event of a match, in order to use it as a char you for the match you need to escape it with a %.
. will match any character rather then specifically a . you will need to add a % to escape if you want to match a . specifically.
local str = "Word 100.00% ~(45.56, 34.76)"
local pattern = "%w+%s%d+%.%d+%%%s~%(%d+%.%d+, %d+%.%d+%)"
print(string.match(str, pattern))
Here you will see the input string print if it matches the pattern otherwise you will see nil.
Suggested resource: Understanding Lua Patterns

Handling arbitrary text blocks in an Xtext grammar

In an effort to better understand Xtext, I'm working on writing a grammar and have hit a roadblock. I've boiled it down to the following scenario. I have some input such as this:
thing {abc}
{def}
There may be keywords (e.g.'thing') followed by other language elements (e.g. ID) in braces. Or, there can just be a block of content inside braces. This content should simply be passed along to the parser en masse.
If I try something like this:
Model: (things+=AThing | blocks+=ABlock)*;
AThing : 'thing' '{' name = ID '}';
ABlock : block=BLOCK;
terminal BLOCK:'{' -> '}';
and parse the sample text above, I get an error:
'mismatched input '{abc}' expecting '{'' on ABlock, offset 6, length 5
So, '{abc}' is being matched by the BLOCK terminal rule, which I understand. But how do I alter the grammar to properly handle the sample input? I've been wrestling with this problem for a while and have come up empty. So it's either something very simple that I've missed, or the problem is really complex and I don't realize it. Any enlightenment would be greatly appreciated.

Parsing happens in two stages: tokenizer and lexical. In the first one the text input is divided into tokens, in the second one the tokens are matched against lexical rules. Broadly something like (with some arbitrary language):
1st phase:
text: class X { this ; }
----- --- --- ---- --- ---
tokens: ID ID LB ID SC RB
2nd phase:
Is there a rule that starts with a 'class' string?
YES: Is the next expected token an ID?
YES: Is the next expected token a LB?
...
NO: Is there another rule that starts with 'class'?
...
NO: Is there a rule that starts with an ID token?
...
The lexer implementation is a bit more complex, but I hope you get the idea.
The issue with your grammar is that your termial BLOCK rule is used during the first phase, hence you get
thing {abc} {def}
----- ----- -----
ID BLOCK BLOCK
That is why the error message says if found '{abc}' and not a '{'. The lexer matched the thing and was expecting the next token to be a '{' but it got a BLOCK.
If you want arbitrary text inside the block, I don't think you can use '{' to identify the name of things.

This looks like what is mentioned here:
A quite common case requiring backtracking is when your language uses the same delimiter pair for two different semantics
So the simplest solution seems to use different delimiters. Otherwise you may have to look into enabling backtracking.

Regex pattern for checking name with limited characters

I have a requirement to validate username with some special characters in it('-) and white space in it. I am able to achieve this with the help of the following regex -
^[a-zA-z]+([ '-][a-zA-Z]+)*$
But I am unable to add a limit to the maximum number of characters that user can enter say 25 to this particular regex. So can any one please explain how to do the same for the above regex? Thanks.

You may add a negative lookahead at the beginning disallowing 26 or more chars:
^(?!.{26})[a-zA-Z]+([ '-][a-zA-Z]+)*$
^^^^ ^
You also have a typo [A-z], it must be [A-Z]. See Difference between regex [A-z] and [a-zA-Z].
The negative lookahead (the (?!...) construct above) is anchored at the start of the string (meaning it is placed right after ^), and the length check is performed only once, right before parsing with the main, consuming pattern part.
You can also see more on how a negative lookahead works here.

Just add a lookahead at the beginning to match only for a max of 25 characters by adding a (?=.{1,25}$).
Your new regex will be: ^(?=.{1,25}$)[a-zA-z]+([ '-][a-zA-Z]+)*$

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Tokenize one word into multiple tokens based on position ANTLR4 - parsing

Related

How to understand ANTLRWorks 1.5.2 MismatchedTokenException(80!=21)

Lua Pattern, problem for combination handle

How can I combine words with numbers when pattern matching in LUA?

Handling arbitrary text blocks in an Xtext grammar

Regex pattern for checking name with limited characters

Categories

Resources