Why won't my JavaCC lexer/parser accept this input?

I am creating a lexer/parser which should accept strings that belong to an infinite set of languages.
One such string is "a <2L>AA <2U>a <2L>AA <2U>a</2U></2L></2U></2L>".
The set of languages is defined as follows:
Base language, L0
A string from L0 consists of several blocks separated by space characters.
At least one block must be present.
A block is an odd-length sequence of lowercase letters (a-z).
No spaces are allowed before the first block or after the last one.
The number of spaces between blocks must be odd.
Example of string belonging to L0:
zyx abcba m xyzvv
There is one space character between zyx and abcba, there are three spaces
between abcba and m, and only one between m and xyzvv. No other space characters are present in the string.
Language L1
A string from L1 consists of several blocks separated by space characters.
At least one block must be present.
There are two kinds of blocks. A block of the first kind must be
an even-length sequence of uppercase letters (A-Z). A block of the
second kind must have the shape <2U>. . .</2U>, where . . . stands
for any string from L0.
No spaces are allowed before the first block or after the last one.
The number of spaces between blocks must be odd.
Example of string belonging to L1:
YZ <2U>abc zzz</2U> ABBA <2U>kkkkk</2U> KM
Note that five spaces separate YZ and <2U>abc zzz</2U>, and three spaces divide abc from zzz. Otherwise single spaces are used as separators. There is no space in front of YZ and no space follows KM.
Language L2
A string from L2 consists of several blocks separated by space characters.
At least one block must be present.
There are two kinds of blocks. A block of the first kind must be
an odd-length sequence of lowercase letters (a-z). A block of the
second kind must have the shape <2L>. . .</2L>, where . . . stands
for any string from L1.
No spaces are allowed before the first block or after the last one.
The number of spaces between blocks must be odd.
Example of string belonging to L2:
abc <2L>AA ZZ <2U>a bcd</2U></2L> z <2L><2U>abcde</2U></2L>
Single spaces are used as separators inside the sentence given above, but any other odd number of spaces would also lead to a valid L2 sentence.
Languages L{2k + 1}, k > 0
A string from L{2k + 1} consists of several blocks separated by space characters. At least one block must be present.
There are two kinds of blocks. A block of the first kind must be
an even-length sequence of uppercase letters (A-Z). A block of the
second kind must have the shape <2U>. . .</2U>, where . . . stands
for any string from L{2k}.
No spaces are allowed before the first block or after the last one.
The number of spaces between blocks must be odd.
Languages L{2k + 2}, k > 0
A string from L{2k + 2} consists of several blocks separated by space
characters. At least one block must be present.
There are two kinds of blocks. A block of the first kind must be
an odd-length sequence of lowercase letters (a-z). A block of the
second kind must have the shape <2L>. . .</2L>, where . . . stands
for any string from L{2k + 1}.
No spaces are allowed before the first block or after the last one.
The number of spaces between blocks must be odd.
The code for my lexer/parser is as follows:
PARSER_BEGIN(Assignment)
/** A parser which determines if user's input belongs to any one of the set of acceptable languages. */
public class Assignment {
public static void main(String[] args) {
try {
Assignment parser = new Assignment(System.in);
parser.Start();
System.out.println("YES"); // If the user's input belongs to any of the set of acceptable languages, then print YES.
} catch (ParseException e) {
System.out.println("NO"); // If the user's input does not belong to any of the set of acceptable languages, then print NO.
}
}
}
PARSER_END(Assignment)
//** A token which matches any lowercase letter from the English alphabet. */
TOKEN :
{
< #L_CASE_LETTER: ["a"-"z"] >
}
//* A token which matches any uppercase letter from the English alphabet. */
TOKEN:
{
< #U_CASE_LETTER: ["A"-"Z"] >
}
//** A token which matches an odd number of lowercase letters from the English alphabet. */
TOKEN:
{
< ODD_L_CASE_LETTER: <L_CASE_LETTER>(<L_CASE_LETTER><L_CASE_LETTER>)* >
}
//** A token which matches an even number of uppercase letters from the English alphabet. */
TOKEN:
{
< EVEN_U_CASE_LETTERS: (<U_CASE_LETTER><U_CASE_LETTER>)+ >
}
//* A token which matches the string "<2U>" . */
TOKEN:
{
< OPEN_UPPER: "<2U>" >
}
//* A token which matches the string "</2U>". */
TOKEN:
{
< CLOSE_UPPER: "</2U>" >
}
//* A token which matches the string "<2L>". */
TOKEN:
{
< OPEN_LOWER: "<2L>" >
}
//* A token which matches the string "</2L>". */
TOKEN:
{
< CLOSE_LOWER: "</2L>" >
}
//* A token which matches an odd number of white spaces. */
TOKEN :
{
< ODD_WHITE_SPACE: " "(" "" ")* >
}
//* A token which matches an EOL character. */
TOKEN:
{
< EOL: "\n" | "\r" | "\r\n" >
}
/** This production matches strings which belong to the base language L^0. */
void Start() :
{}
{
LOOKAHEAD(3)
<ODD_L_CASE_LETTER> (<ODD_WHITE_SPACE> <ODD_L_CASE_LETTER>)* <EOL> <EOF>
|
NextLanguage()
|
LOOKAHEAD(3)
NextLanguageTwo()
|
EvenLanguage()
}
/** This production matches strings which belong to language L^1. */
void NextLanguage():
{}
{
(<OPEN_UPPER> (PseudoStart()) <CLOSE_UPPER>)+ (<ODD_WHITE_SPACE> UpperOrPseudoStart())* <EOL> <EOF>
|
(<EVEN_U_CASE_LETTERS>)+ (<ODD_WHITE_SPACE> UpperOrPseudoStart())* <EOL> <EOF>
}
/** This production matches either an even number of uppercase letters, or a string from L^0, encased within the tags <2U> and </2U>. */
void UpperOrPseudoStart() :
{}
{
<EVEN_U_CASE_LETTERS>
|
<OPEN_UPPER> (PseudoStart()) <CLOSE_UPPER>
}
/** This production matches strings from L^0, in a similar way to Start(); however, the strings that it matches do not have EOL or EOF characters after them. */
void PseudoStart() :
{}
{
<ODD_L_CASE_LETTER> (<ODD_WHITE_SPACE> <ODD_L_CASE_LETTER>)*
}
/** This production matches strings which belong to language L^2. */
void NextLanguageTwo() :
{}
{
(<ODD_L_CASE_LETTER>)+ (<ODD_WHITE_SPACE> LowerOrPseudoNextLanguage())* <EOL> <EOF>
|
(<OPEN_LOWER> PseudoNextLanguage() <CLOSE_LOWER>)+ (<ODD_WHITE_SPACE> LowerOrPseudoNextLanguage())* <EOL> <EOF>
}
/** This production matches either an odd number of lowercase letters, or a string from L^1, encased within the tags <2L> and </2L>. */
void LowerOrPseudoNextLanguage() :
{}
{
<ODD_L_CASE_LETTER>
|
<OPEN_LOWER> PseudoNextLanguage() <CLOSE_LOWER>
}
/** This production matches strings from L^1, in a similar way to NextLanguage(); however, the strings that it matches do not have EOL or EOF characters after them. */
void PseudoNextLanguage() :
{}
{
(<OPEN_UPPER> (PseudoStart()) <CLOSE_UPPER>)+ (<ODD_WHITE_SPACE> UpperOrPseudoStart())*
|
(<EVEN_U_CASE_LETTERS>)+ (<ODD_WHITE_SPACE> UpperOrPseudoStart())*
}
/** This production matches strings which belong to any of the languages L^{2k + 2}, where k > 0 (the infinite set of even languages). */
void EvenLanguage() :
{}
{
(<ODD_L_CASE_LETTER>)+ (<ODD_WHITE_SPACE> EvenLanguageAuxiliary())* <EOL> <EOF>
|
(CommonPattern())+ (<ODD_WHITE_SPACE> EvenLanguageAuxiliary())* <EOL> <EOF>
}
/** This production is an auxiliary production that helps when parsing strings from any of the even set of languages. */
void EvenLanguageAuxiliary() :
{}
{
CommonPattern()
|
<ODD_L_CASE_LETTER>
}
void CommonPattern() :
{}
{
<OPEN_LOWER> <EVEN_U_CASE_LETTERS> <ODD_WHITE_SPACE> <OPEN_UPPER> <ODD_L_CASE_LETTER> (<ODD_WHITE_SPACE> CommonPattern())+ <CLOSE_UPPER> <CLOSE_LOWER>
}
Several times now, I have inputted the string "a <2L>AA <2U>a <2L>AA <2U>a</2U></2L></2U></2L>".
However, each time, NO is printed out on the terminal.
I have looked through my code carefully several times, checking the order in which I think the input string should be parsed; but, I haven't been able to find any errors in my logic or reasons why the string isn't being accepted.
Could I have some suggestions as to why it isn't being accepted, please?

The following steps helped to solve the problem.
Run the following commands:
javacc -debug_parser Assignment.jj
javac Assignment*.java
Then, run the lexer/parser (by typing java Assignment) and then input the string:
"a <2L>AA <2U>a <2L>AA <2U>a</2U></2L></2U></2L>"
The resulting trace of parser actions shows that the production NextLanguageTwo() is called on this string, rather than the desired EvenLanguage() production.
Tracing through NextLanguageTwo() shows that it matches the first eight tokens in the input string.
So, using a lookahead of 9, although inefficient, causes the input string to be accepted. That is, modify the Start() production by changing the second lookahead value (just above the call to NextLanguageTwo()) from 3 to 9.
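As a cross-check that does not depend on JavaCC's lookahead settings, the language definition from the question can also be written as a small hand-written recognizer. The following is only a sketch under one assumption: since every L{k} is contained in L{k+2}, the union of all the languages reduces to two mutually recursive sentence kinds (lowercase sentences whose tagged blocks hold uppercase sentences, and vice versa). The class name NestedBlocks and its methods are made up for this illustration and are not part of the original grammar.

```java
/** Hand-written recognizer for the union of the languages L0, L1, L2, ... (illustrative sketch). */
class NestedBlocks {
    private final String s;
    private int pos = 0;

    private NestedBlocks(String s) { this.s = s; }

    /** True if the whole input is a sentence of the lowercase kind or of the uppercase kind. */
    static boolean accepts(String input) {
        for (boolean lower : new boolean[] { true, false }) {
            NestedBlocks p = new NestedBlocks(input);
            if (p.sentence(lower) && p.pos == input.length()) return true;
        }
        return false;
    }

    /** sentence := block (odd run of spaces, block)* */
    private boolean sentence(boolean lower) {
        if (!block(lower)) return false;
        while (pos < s.length() && s.charAt(pos) == ' ') {
            int start = pos;
            while (pos < s.length() && s.charAt(pos) == ' ') pos++;
            if ((pos - start) % 2 == 0) return false;  // even number of spaces
            if (!block(lower)) return false;            // a block must follow the spaces
        }
        return true;
    }

    /** block := letters of the right case and parity | tagged sentence of the other kind */
    private boolean block(boolean lower) {
        String open = lower ? "<2L>" : "<2U>";
        String close = lower ? "</2L>" : "</2U>";
        if (s.startsWith(open, pos)) {
            pos += open.length();
            if (!sentence(!lower) || !s.startsWith(close, pos)) return false;
            pos += close.length();
            return true;
        }
        int start = pos;
        while (pos < s.length() && isLetter(s.charAt(pos), lower)) pos++;
        int len = pos - start;
        // Lowercase blocks have odd length; uppercase blocks have even, non-zero length.
        return lower ? len % 2 == 1 : len > 0 && len % 2 == 0;
    }

    private static boolean isLetter(char c, boolean lower) {
        return lower ? (c >= 'a' && c <= 'z') : (c >= 'A' && c <= 'Z');
    }

    public static void main(String[] args) {
        String input = "a <2L>AA <2U>a <2L>AA <2U>a</2U></2L></2U></2L>";
        System.out.println(accepts(input) ? "YES" : "NO");
    }
}
```

Because each block kind is recognisable from its first character, this recognizer never needs more than one character of lookahead, which suggests the grammar itself could be restructured so that a large LOOKAHEAD value is unnecessary.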

Are any of your inputs being accepted? I copied your code to my computer and found that for any correct input (as far as I can tell from the definition of your language), it always outputs 'NO'.

Related

How to achieve capturing groups in flex lex?

I wanted to match for a string which starts with a '#', then matches everything until it matches the character that follows '#'. This can be achieved using capturing groups like this: #(.)[^(?1)]*(?1)(EDIT this regex is also erroneous). This matches #$foo$, does not match #%bar&, matches first 6 characters of #"foo"bar.
But since flex lex does not support capturing groups, what is the workaround here?
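For reference, the intended behaviour can be pinned down in a regex engine that does support backreferences. The sketch below uses java.util.regex purely to illustrate the target semantics; it is not something flex can run, and the class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DelimitedMatch {
    // '#', any one character (captured as group 1), then any characters that are
    // not that delimiter, then the delimiter again via the backreference \1.
    static final Pattern DELIMITED = Pattern.compile("#(.)(?:(?!\\1).)*\\1");

    /** Returns the matched prefix of the input, or null if the input does not start with a match. */
    static String matchPrefix(String input) {
        Matcher m = DELIMITED.matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        for (String s : new String[] { "#$foo$", "#%bar&", "#\"foo\"bar" }) {
            System.out.println(s + " -> " + matchPrefix(s));
        }
    }
}
```

The negative lookahead (?!\1) plays the role that the erroneous [^(?1)] was meant to play: it excludes whichever character was captured as the delimiter.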
As you say, (f)lex does not support capturing groups, and it certainly doesn't support backreferences.
So there is no simple workaround, but there are workarounds. Here are a few possibilities:
You can read the input one character at a time using the input() function, until you find the matching character (but you have to create your own buffer to store the characters, because characters read by input() are not added to the current token). This is not the most efficient because reading one character at a time is a bit clunky, but it's the only interface that (f)lex offers. (The following snippet assumes you have some kind of expandable stringBuilder; if you are using C++, this would just be replaced with a std::string.)
#.  { StringBuilder sb = string_builder_new();
      int delim = yytext[1];
      for (;;) {
        int next = input();
        if (next == delim) break;
        if (next == EOF) { /* Signal error */; break; }
        string_builder_addchar(sb, next);
      }
      yylval = string_builder_release(sb);
      return DELIMITED_STRING;
    }
Even less efficiently, but perhaps more conveniently, you can get (f)lex to accumulate the characters in yytext using yymore(), matching one character at a time in a start condition:
%x DELIMITED
%%
  int delim;
#.                  { delim = yytext[1]; BEGIN(DELIMITED); }
<DELIMITED>.|\n     { /* With yymore(), yytext accumulates, so test the last character */
                      if (yytext[yyleng - 1] == delim) {
                        yylval = strndup(yytext, yyleng - 1);
                        BEGIN(INITIAL);
                        return DELIMITED_STRING;
                      }
                      yymore();
                    }
<DELIMITED><<EOF>>  { /* Signal unterminated string error */ }
The most efficient solution (in (f)lex) is to just write one rule for each possible delimiter. While that's a lot of rules, they could be easily generated with a small script in whatever scripting language you prefer. And, actually, there are not that many rules, particularly if you don't allow alphabetic and non-printing characters to be delimiters. This has the additional advantage that if you want Perl-like parenthetic delimiters (#(Hello) instead of #(Hello(), you can just modify the individual pattern to suit (as I've done below). [Note 1] Since all the actions are the same, it might be easier to use a macro for the action, making it easier to modify.
/* Ordinary punctuation */
#:[^:]*: { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
#![^!]*! { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
#\.[^.]*\. { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
/* Matched pairs */
#<[^>]*> { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
#\[[^]]*] { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
/* Trap errors */
# { /* Report unmatched or invalid delimiter error */ }
If I were writing a script to generate these rules, I would use hexadecimal escapes for all the delimiter characters rather than trying to figure out which ones needed escapes.
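A generator along those lines might look like the following sketch. It is only an illustration: the choice of Java, the class name DelimiterRules, and the decision to treat every printable ASCII punctuation character as a symmetric delimiter are assumptions, and matched pairs such as < > or [ ] would still need their own hand-written patterns.

```java
class DelimiterRules {
    /** Builds one flex rule for a delimiter, spelling the character as a \xHH escape. */
    static String ruleFor(char delim) {
        String hex = String.format("\\x%02x", (int) delim);
        return "#" + hex + "[^" + hex + "]*" + hex
             + " { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }";
    }

    /** Printable ASCII characters that are neither letters nor digits. */
    static boolean isDelimiter(char c) {
        return c > ' ' && c < 127 && !Character.isLetterOrDigit(c);
    }

    public static void main(String[] args) {
        // Emit one rule per candidate delimiter; paste the output into the rules section.
        for (char c = '!'; c <= '~'; c++) {
            if (isDelimiter(c)) System.out.println(ruleFor(c));
        }
    }
}
```

Using the hexadecimal form everywhere means the generator never has to decide whether a character such as ], ^ or \ needs escaping inside or outside a character class.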
Notes:
1. Perl requires nested balanced parentheses in constructs like that. But you can't do that with regular expressions; if you wanted to reproduce Perl behaviour, you'd need to use some variation on one of the other suggestions. I'll try to revisit this answer later to address that feature.

(f)lex the difference between PRINTA$ and PRINT A$

I am parsing BASIC:
530 FOR I=1 TO 9:C(I,1)=0:C(I,2)=0:NEXT I
The patterns that are used in this case are:
FOR { return TOK_FOR; }
TO { return TOK_TO; }
NEXT { return TOK_NEXT; }
(many lines later...)
[A-Za-z_#][A-Za-z0-9_]*[\$%\!#]? {
yylval.s = g_string_new(yytext);
return IDENTIFIER;
}
(many lines later...)
[ \t\r\l] { /* eat non-string whitespace */ }
The problem occurs when the spaces are removed, which was common in the era of 8k RAM. So the line that is actually found in Super Star Trek is:
530 FORI=1TO9:C(I,1)=0:C(I,2)=0:NEXTI
Now I know why this is happening: "FORI" is longer than "FOR", it's a valid IDENTIFIER in my pattern, so it matches IDENTIFIER.
The original rule in MS BASIC was that variable names could be only two characters, so there was no * so the match would fail. But this version is also supporting GW BASIC and Atari BASIC, which allow variables with long names. So "FORI" is a legal variable name in my scanner, so that matches as it is the longest hit.
When I look at the manual, the only similar example deliberately returns an error. It seems what I need is "match the ID, but only if it's not the same as a defined %token". Is there such a thing?
It's easy to recognise keywords even if they have an identifier concatenated. What's tricky is deciding in what contexts you should apply that technique.
Here's a simple pattern for recognising keywords, using trailing context:
tail [[:alnum:]]*[$%!#]?
%%
FOR/{tail} { return TOK_FOR; }
TO/{tail} { return TOK_TO; }
NEXT/{tail} { return TOK_NEXT; }
/* etc. */
[[:alpha:]]{tail} { /* Handle an ID */ }
Effectively, that just extends the keyword match without extending the matched token.
But I doubt the problem is so simple. How should FORFORTO be tokenised, for example?
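To make the consequences of that policy concrete, here is a small stand-alone sketch that mimics the rules above outside flex: at each position a keyword prefix wins over an identifier, and anything else falls back to an identifier, a number, or a single character. The class name, the NUM and ID token spellings, and the number rule are assumptions added for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

class KeywordFirstLexer {
    // Order matters, as in a flex file: the first keyword that fits wins.
    static final String[] KEYWORDS = { "FOR", "TO", "NEXT" };

    static List<String> tokenize(String src) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == ' ' || c == '\t') { i++; continue; }
            String kw = keywordAt(src, i);
            if (kw != null) {                        // keyword wins the tie against ID
                out.add(kw);
                i += kw.length();
            } else if (Character.isLetter(c)) {      // ID: letter, alphanumerics, optional suffix
                int j = i + 1;
                while (j < src.length() && Character.isLetterOrDigit(src.charAt(j))) j++;
                if (j < src.length() && "$%!#".indexOf(src.charAt(j)) >= 0) j++;
                out.add("ID(" + src.substring(i, j) + ")");
                i = j;
            } else if (Character.isDigit(c)) {       // assumed number rule: a run of digits
                int j = i;
                while (j < src.length() && Character.isDigit(src.charAt(j))) j++;
                out.add("NUM(" + src.substring(i, j) + ")");
                i = j;
            } else {                                 // any other character stands for itself
                out.add(String.valueOf(c));
                i++;
            }
        }
        return out;
    }

    /** Returns the first keyword that starts at the given position, or null. */
    static String keywordAt(String src, int at) {
        for (String kw : KEYWORDS) {
            if (src.startsWith(kw, at)) return kw;
        }
        return null;
    }
}
```

With this policy FORI=1TO9 splits as intended, but FORFORTO becomes FOR FOR TO and a variable named TOTAL becomes TO followed by the identifier TAL, which is exactly the kind of context question raised above.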

Reducing insane flex lexer expansion?

I have written a flex lexer to handle the text in BYOND's .dmi file format. The contents inside are (key, value) pairs delimited by '='. Valid keys are all essentially keywords (such as "width"), and invalid keys are not errors: they are just ignored.
Interestingly, the current state of BYOND's .dmi parser uses everything prior to the '=' as its keyword, and simply ignores any excess junk. This means "\twidth123" is recognized as "width".
The crux of my problem is in allowing for this irregularity. In doing so my generated lexer expands from ~40-50KB to ~13-14MB. For reference, I present the following contrived example:
%option c++ noyywrap
fill [^=#\n]*
%%
{fill}version{fill} { return 0; }
{fill}width{fill} { return 0; }
{fill}height{fill} { return 0; }
{fill}state{fill} { return 0; }
{fill}dirs{fill} { return 0; }
{fill}frames{fill} { return 0; }
{fill}delay{fill} { return 0; }
{fill}loop{fill} { return 0; }
{fill}rewind{fill} { return 0; }
{fill}movement{fill} { return 0; }
{fill}hotspot{fill} { return 0; }
%%
fill is the rule that is used to merge the keywords with "anything before the =". Running flex on the above yields a ~13MB lex.yy.cc on my computer. Simply removing the kleene star (*) in the fill rule yields a 45KB lex.yy.cc file; however, obviously, this then makes the lexer incorrect.
Are there any tricks, flex options, or lexer hacks to avoid this insane expansion? The only things I can think of are:
Disallow "width123" to represent "width", which is undesirable as then technically-correct files could not be parsed.
Make one rule that is simply [^=\n]+ to return some identifier token, and pick out the keyword in the parser. This seems suboptimal to me as well, particularly because different keywords have different value types and it seems most natural to be able to handle "'width' '=' INT" and "'version' '=' FLOAT" in the parser instead of "ID '=' VALUE" followed by picking out the keyword in the identifier, making sure the value is of the right type, etc.
I could make the rule {fill}(width|height|version|...){fill}, which does indeed keep the generated file small. However, while regular expression parsers tend to produce "captures," flex just gives me yytext and re-parsing that for a keyword to produce the desired token seems to be very undesirable in terms of algorithmic complexity.
Make fill a separate rule of its own that does nothing, and remove it from all the other rules, and separate its definition from whitespace for clarity:
whitespace [ \t\f]
fill [^#=\n]
%%
{whitespace}+ ;
{fill}+ ;
I would probably also avoid building the keywords into the lexer and just use an identifier [a-zA-Z]+ rule that does a table lookup. And finally add a rule to catch the =:
. return yytext[0];
to let the parser handle all special characters.
This is not really a problem flex is "good at", but it can be solved if it is precisely defined. In particular, it is important to know which of the keywords should be returned if the random string of letters before the = contains more than one keyword. For example, suppose the input is:
garbage_widtheight_moregarbage = 42
Now, is that setting the width or the height?
Remember that flex scanners will choose the rule with longest match, and of rules with equally long matches, the first one in the lexical description.
So the model presented in the OP:
fill [^=#\n]*
%%
{fill}width{fill} { return 0; }
{fill}height{fill} { return 0; }
/* SNIP */
will always prefer width to height, because the matches will be the same length (both terminate at the last character before the =), and the width pattern comes first in the file. If the rules were written in the opposite order, height would be preferred.
On the other hand, if you removed the second {fill}:
{fill}width { return 0; }
{fill}height { return 0; }
then the last keyword in the input (in this case, height) will be preferred, because that one has the longer match.
The most likely requirement, however, is that the first keyword be recognized, so neither of the preceding will work. In order to match the first keyword, it is necessary to first match the shortest possible sequence of {fill}. And since flex does not implement non-greedy repetition, that can only be done with a character-by-character span.
Here's an example, using start conditions. Note that we hold onto the keyword token until we actually find the =, in case the = is not found.
/* INITIAL: beginning of a line
* FIND_EQUAL: keyword recognized, looking for the =
* VALUE: = recognized, lexing the right-hand side
* NEXT_LINE: find the next line and continue the scan
*/
%x FIND_EQUAL VALUE NEXT_LINE
%%
  int keyword;
[#=].* /* Skip comments and lines with no recognizable keyword */
version { keyword = KW_VERSION; BEGIN(FIND_EQUAL); }
width { keyword = KW_WIDTH; BEGIN(FIND_EQUAL); }
height { keyword = KW_HEIGHT; BEGIN(FIND_EQUAL); }
/* etc. */
.|\n /* Skip any other single character, or newline */
<FIND_EQUAL>{
[^=#\n]*"=" { BEGIN(VALUE); return keyword; }
"#".* { BEGIN(INITIAL); }
\n { BEGIN(INITIAL); }
}
<VALUE>{
"#".* { BEGIN(INITIAL); }
\n { BEGIN(INITIAL); }
[[:blank:]]+ ; /* Ignore space and tab characters */
[[:digit:]]+ { yylval.ival = atoi(yytext);
BEGIN(NEXT_LINE); return INTEGER;
}
[[:digit:]]+"."[[:digit:]]*|"."[[:digit:]]+ {
yylval.fval = atof(yytext);
BEGIN(NEXT_LINE); return FLOAT;
}
\"([^"]|\\.)*\" { char* s = malloc(yyleng - 1);
yylval.sval = s;
/* Remove quotes and escape characters */
yytext[yyleng - 1] = '\0';
do {
if (*++yytext == '\\') ++yytext;
*s++ = *yytext;
} while (*yytext);
BEGIN(NEXT_LINE); return STRING;
}
/* Other possible value token types */
. BEGIN(NEXT_LINE); /* bad character in value */
}
<NEXT_LINE>.*\n? BEGIN(INITIAL);
In the escape-removal code, you might want to translate things like \n. And you might also want to avoid string values with physical newlines. And a bunch of etceteras. It's only intended as a model.
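The "first keyword wins" policy can also be stated independently of flex, which is handy for testing a scanner against it. The following is a minimal sketch with an assumed class name and helper; it only looks at the text before the = and treats a # before the = as the start of a comment, mirroring the start-condition model above.

```java
class FirstKeyword {
    static final String[] KEYWORDS = {
        "version", "width", "height", "state", "dirs", "frames",
        "delay", "loop", "rewind", "movement", "hotspot"
    };

    /** Returns the keyword that starts earliest in the text before '=', or null if there is none. */
    static String firstKeyword(String line) {
        int eq = line.indexOf('=');
        int hash = line.indexOf('#');
        // No '=' at all, or a comment starts before the '=': nothing to report.
        if (eq < 0 || (hash >= 0 && hash < eq)) return null;
        String key = line.substring(0, eq);
        String best = null;
        int bestAt = Integer.MAX_VALUE;
        for (String kw : KEYWORDS) {
            int at = key.indexOf(kw);
            if (at >= 0 && at < bestAt) {
                best = kw;
                bestAt = at;
            }
        }
        return best;
    }
}
```

For the earlier example, garbage_widtheight_moregarbage = 42, this reports width, because width starts at an earlier position than height.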

How to get such pattern matching of regular expression in lex

I want to match a specific pattern with a regular expression in lex, but I have failed to do so. The input should look like
noun wordname:wordmeaning
I can successfully match noun and wordname, but I couldn't design a pattern for wordmeaning. My code is:
int state;
char *meaning;
char *wordd;
^verb { state=VERB; }
^adj { state = ADJ; }
^adv { state = ADV; }
^noun { state = NOUN; }
^prep { state = PREP; }
^pron { state = PRON; }
^conj { state = CONJ; }
//my try but failed
[:\a-z] {
meaning=yytext;
printf(" Meaning is getting detected %s", meaning);
}
[a-zA-Z]+ {
word=yytext;
}
Example input:
noun john:This is a name
Now word should be equal to john and meaning should be equal to This is a name.
Agreeing that lex states (also known as start conditions) are the way to go (odd, but there are no useful tutorials).
Briefly:
your application can be organized as states, using one for "noun", one for "john" and one for the definition (after the colon).
at the top of the lex file, declare the states, e.g.,
%s TYPE NAME VALUE
the capitals are not necessary, but since you are defining constants, that is a good convention.
next to the patterns, put those state names in < > brackets to tell lex that the patterns are used only in those states. You can list more than one state, comma-separated, when it matters. But your lex file probably does not need that.
one state is predefined: INITIAL.
your program switches states using the BEGIN() macro, in actions, e.g.,
{ BEGIN(TYPE); }
if your input is well-formed, it's simple: as each "type" is recognized, it begins the NAME state.
in the NAME state, your lexer looks for whatever you think a name should be, e.g.,
<NAME>[[:alpha:]][[:alnum:]]* { my_name = strdup(yytext); }
the name ends with a colon, so
<NAME>":" { BEGIN(VALUE); }
the value is then everything until the end of the line, e.g.,
<VALUE>.* { my_value = strdup(yytext); BEGIN(INITIAL); }
whether you switch to INITIAL or TYPE depends on what other things you might add to your lexer (such as ignoring comment lines and whitespace).
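The three-state idea can be sketched outside lex as an ordinary state machine. This is only an analogy to start conditions, written in Java with made-up names (EntryParser, parse); it assumes a single blank separates the type from the name and that the value runs to the end of the line.

```java
class EntryParser {
    enum State { TYPE, NAME, VALUE }

    /** Splits "type name:value" into {type, name, value}; returns null on malformed input. */
    static String[] parse(String line) {
        State state = State.TYPE;
        StringBuilder type = new StringBuilder();
        StringBuilder name = new StringBuilder();
        StringBuilder value = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            switch (state) {
                case TYPE:   // like the patterns that recognise "noun", "verb", ...
                    if (Character.isLetter(c)) type.append(c);
                    else if (c == ' ' && type.length() > 0) state = State.NAME;
                    else return null;
                    break;
                case NAME:   // like the <NAME> patterns; ':' switches to VALUE
                    if (Character.isLetterOrDigit(c)) name.append(c);
                    else if (c == ':' && name.length() > 0) state = State.VALUE;
                    else if (c == ' ' && name.length() == 0) { /* skip extra blanks */ }
                    else return null;
                    break;
                case VALUE:  // like <VALUE>.* : everything up to the end of the line
                    value.append(c);
                    break;
            }
        }
        if (state != State.VALUE) return null;
        return new String[] { type.toString(), name.toString(), value.toString() };
    }
}
```

Each case label corresponds to one start condition, and each assignment to state corresponds to a BEGIN() call in a lex action.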
Further reading:
Start conditions (flex documentation)
Introduction to Flex

ANTLR best way to include meta-data in lexing/parsing (custom objects, kind of annotation)

I plan to include text metadata (like bold, font-size, etc.) in the process of parsing to achieve better recognition.
For instance, I have a given structure where a word on its own line (word\r\n) that is bold and sized 24px is the title of some article. To get better recognition results, I want to take both the characters and the metadata into account. In terms of ANTLR, I'm not sure how this could best be done. I'd like to do something like:
1. Wrap each character of the original text in a custom object with fields for the metadata and pass that to ANTLR.
2. Preprocess the text and insert annotations for the metadata at specific places, to be considered by the grammar.
I would really like to take option 1, but I'm not sure which parts of ANTLR I need to subclass. Do I have to start at the ANTLRInputStream object in order to get a proper stream for a subclassed Lexer, which produces custom Tokens for a subclassed Parser, and so on? Is there a more elegant way, especially for querying the tokens while parsing with actions in a {} block?
If anyone has some hints and/or experiences this would be great!
EDIT:
Here is a more specific, simple example: I have a file which includes the encoding of the metadata, which I parse beforehand. The actual text, including newlines, looks like the following:
entryOne
Here is some content one.
entryTwo
Here is some content two.
Here the titles entryOne and entryTwo originally have a font size of 24px and the content has a font size of 12px (as exemplary values). Char by char, I create a new instance of a custom object encapsulating the character as a String and the font size.
I initialize respective objects for each of the characters with fields of the font-size, e.g for the first letter of entryOne like
MyChar aTitelChar = new MyChar("e", 24);
For the content, like the second line Here is some content one. I create instances of MyChar like:
MyChar aContentChar= new MyChar("H", 12);
All characters of the texts are wrapped in instances of the below MyChar-Class and added to a List<MyChar> in order to produce a new input for ANTLR.
Below is the Java class for the characters:
public class MyChar {
private int fontSizePx;
private String text;
public MyChar(String text, int fontSizePx) {
this.text = text;
this.fontSizePx = fontSizePx;
}
public int getFontSizePx() {
return fontSizePx;
}
public String getText() {
return text;
}
}
I want that my grammar matches the above two entries (or more formatted this way) which in turn consist each of a title and a content which is terminated with a fullstop. This grammar could look like this:
rule: entry+ NEWLINE
;
entry:
title
content
;
title:
letters NEWLINE
;
content:
(letters)+ '.' NEWLINE
;
letters:
LETTERS
;
LETTERS:
('a'..'z' | 'A'..'Z')+
;
WS:
(' ' | '\t' | '\f' ) + {$channel = HIDDEN;};
NEWLINE:'\r'? '\n';
Now, for instance, what I want to do is to find out whether it's really the title of an entry by checking the font size of all letters that make up the title token before the title rule returns. In case the input conforms to the grammar but is actually some kind of mistake (the original metadata-encoded file starts with something that conforms to the title rule but is actually content), the author of the grammar could sort that out if he knows that the original font size for titles is 24 and checks this. If one of the letter tokens doesn't have font size 24, throw an exception, don't return, or do something else appropriate.
The thing I'm pondering is where to plug in the List<MyChar> to provide this functionality (querying metadata while parsing in the context of ANTLR). I'm experimenting with ANTLR's classes, but as I'm new to ANTLR I thought some of the experienced users could point me in the right direction: where would be good insertion points for custom objects? Should I start by implementing CharStream and overriding some methods? Is there perhaps something ANTLR provides which I haven't found yet?
Here's one way to accomplish what I think you're going for, using the parser to manage matching input to metadata. Note that I made whitespace significant because it's part of the content and can't be skipped. I also made periods part of content to simplify the example, rather than using them as a marker.
SysEx.g
grammar SysEx;
@header {
import java.util.List;
}
@parser::members {
private List<MyChar> metadata;
private int curpos;
private boolean isTitleInput(String input) {
return isFontSizeInput(input, 24);
}
private boolean isContentInput(String input){
return isFontSizeInput(input, 12);
}
private boolean isFontSizeInput(String input, int fontSize){
List<MyChar> sublist = metadata.subList(curpos, curpos + input.length());
System.out.println(String.format("Testing metadata for input=\%s, font-size=\%d", input, fontSize));
int start = curpos;
//move our metadata pointer forward.
skipInput(input);
for (int i = 0, count = input.length(); i < count; ++i){
MyChar chardata = sublist.get(i);
char c = input.charAt(i);
if (chardata.getText().charAt(0) != c){
//This character doesn't match the metadata (ERROR!)
System.out.println(String.format("Content mismatch at metadata position \%d: metadata=(\%s,\%d); input=\%c", start + i, chardata.getText(), chardata.getFontSizePx(), c));
return false;
} else if (chardata.getFontSizePx() != fontSize){
//The font is wrong.
System.out.println(String.format("Format mismatch at metadata position \%d: metadata=(\%s,\%d); input=\%c", start + i, chardata.getText(), chardata.getFontSizePx(), c));
return false;
}
}
//All characters check out.
return true;
}
private void skipInput(String str){
curpos += str.length();
System.out.println("\t\tMoving metadata pointer ahead by " + str.length() + " to " + curpos);
}
}
rule[List<MyChar> metadata]
@init {
this.metadata = metadata;
}
: entry+ EOF
;
entry
: title content
{System.out.println("Finished reading entry.");}
;
title
: line {isTitleInput($line.text)}? newline {System.out.println("Finished reading title " + $line.text);}
;
content
: line {isContentInput($line.text)}? newline {System.out.println("Finished reading content " + $line.text);}
;
newline
: (NEWLINE{skipInput($NEWLINE.text);})+
;
line returns [String text]
@init {
StringBuilder builder = new StringBuilder();
}
@after {
$text = builder.toString();
}
: (ANY{builder.append($ANY.text);})+
;
NEWLINE:'\r'? '\n';
ANY: .; //whitespace can't be skipped because it's content.
A title is a line that matches the title metadata (size 24 font) followed by one or more newline characters.
A content is a line that matches the content metadata (size 12 font) followed by one or more newline characters. As mentioned above, I removed the check for a period for simplification.
A line is a sequence of characters that does not include newline characters.
A validating semantic predicate (the {...}? after line) is used to validate that the line matches the metadata.
Here is the code I used to test the grammar (minus imports, for brevity):
SysExGrammar.java
public class SysExGrammar {
public static void main(String[] args) throws Exception {
//Create some metadata that matches our input.
List<MyChar> matchingMetadata = new ArrayList<MyChar>();
appendMetadata(matchingMetadata, "entryOne\r\n", 24);
appendMetadata(matchingMetadata, "Here is some content one.\r\n", 12);
appendMetadata(matchingMetadata, "entryTwo\r\n", 24);
appendMetadata(matchingMetadata, "Here is some content two.\r\n", 12);
parseInput(matchingMetadata);
System.out.println("Finished example #1");
//Create some metadata that doesn't match our input (negative test).
List<MyChar> mismatchingMetadata = new ArrayList<MyChar>();
appendMetadata(mismatchingMetadata, "entryOne\r\n", 24);
appendMetadata(mismatchingMetadata, "Here is some content one.\r\n", 12);
appendMetadata(mismatchingMetadata, "entryTwo\r\n", 12); //content font size!
appendMetadata(mismatchingMetadata, "Here is some content two.\r\n", 12);
parseInput(mismatchingMetadata);
System.out.println("Finished example #2");
}
private static void parseInput(List<MyChar> metadata) throws Exception {
//Test setup
InputStream resource = SysExGrammar.class.getResourceAsStream("SysExTest.txt");
CharStream input = new ANTLRInputStream(resource);
resource.close();
SysExLexer lexer = new SysExLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
SysExParser parser = new SysExParser(tokens);
parser.rule(metadata);
System.out.println("Parsing encountered " + parser.getNumberOfSyntaxErrors() + " syntax errors");
}
private static void appendMetadata(List<MyChar> metadata, String string,
int fontSize) {
for (int i = 0, count = string.length(); i < count; ++i){
metadata.add(new MyChar(string.charAt(i) + "", fontSize));
}
}
}
SysExTest.txt (note that this uses Windows newlines, \r\n)
entryOne
Here is some content one.
entryTwo
Here is some content two.
Test output (trimmed; the second example has deliberately-mismatched metadata):
Parsing encountered 0 syntax errors
Finished example #1
Parsing encountered 2 syntax errors
Finished example #2
This solution requires that each MyChar corresponds to a character in the input (including newline characters, although you can remove that limitation if you like -- I would remove it if I didn't already have this answer written up ;) ).
As you can see, it's possible to tie the metadata to the parser and everything works as expected. I hope this helps.
