How to convert this code into an ANTLR v4 Groovy grammar?

I am using a Groovy grammar with ANTLR to parse Groovy files, namely the following one:
https://github.com/groovy/groovy-core/blob/master/src/main/org/codehaus/groovy/antlr/groovy.g
I converted the entire grammar to ANTLR v4, though it took me two full days. But I am still wondering how to convert this into an ANTLR v4 grammar:
tokens {
BLOCK; MODIFIERS; OBJBLOCK; SLIST; METHOD_DEF; VARIABLE_DEF;
INSTANCE_INIT; STATIC_INIT; TYPE; CLASS_DEF; INTERFACE_DEF; TRAIT_DEF;
PACKAGE_DEF; ARRAY_DECLARATOR; EXTENDS_CLAUSE; IMPLEMENTS_CLAUSE;
PARAMETERS; PARAMETER_DEF; LABELED_STAT; TYPECAST; INDEX_OP;
POST_INC; POST_DEC; METHOD_CALL; EXPR;
IMPORT; UNARY_MINUS; UNARY_PLUS; CASE_GROUP; ELIST; FOR_INIT; FOR_CONDITION;
FOR_ITERATOR; EMPTY_STAT; FINAL="final"; ABSTRACT="abstract";
UNUSED_GOTO="goto"; UNUSED_CONST="const"; UNUSED_DO="do";
STRICTFP="strictfp"; SUPER_CTOR_CALL; CTOR_CALL; CTOR_IDENT; VARIABLE_PARAMETER_DEF;
STRING_CONSTRUCTOR; STRING_CTOR_MIDDLE;
CLOSABLE_BLOCK; IMPLICIT_PARAMETERS;
SELECT_SLOT; DYNAMIC_MEMBER;
LABELED_ARG; SPREAD_ARG; SPREAD_MAP_ARG; //deprecated - SCOPE_ESCAPE;
LIST_CONSTRUCTOR; MAP_CONSTRUCTOR;
FOR_IN_ITERABLE;
STATIC_IMPORT; ENUM_DEF; ENUM_CONSTANT_DEF; FOR_EACH_CLAUSE; ANNOTATION_DEF; ANNOTATIONS;
ANNOTATION; ANNOTATION_MEMBER_VALUE_PAIR; ANNOTATION_FIELD_DEF; ANNOTATION_ARRAY_INIT;
TYPE_ARGUMENTS; TYPE_ARGUMENT; TYPE_PARAMETERS; TYPE_PARAMETER; WILDCARD_TYPE;
TYPE_UPPER_BOUNDS; TYPE_LOWER_BOUNDS; CLOSURE_LIST;MULTICATCH;MULTICATCH_TYPES;
}
Until these lexer tokens are correct I won't get the correct output. The token file I got looks like this:
T__0=1
T__1=2
T__2=3
T__3=4
T__4=5
T__5=6
T__6=7
T__7=8
T__8=9
T__9=10
T__10=11
T__11=12
T__12=13
T__13=14
T__14=15
T__15=16
T__16=17
T__17=18
T__18=19
T__19=20
T__20=21
T__21=22
T__22=23
T__23=24
T__24=25
T__25=26
T__26=27
T__27=28
T__28=29
T__29=30
T__30=31
T__31=32
T__32=33
T__33=34
T__34=35
T__35=36
T__36=37
T__37=38
T__38=39
T__39=40
T__40=41
T__41=42
T__42=43
T__43=44
T__44=45
T__45=46
T__46=47
T__47=48
T__48=49
T__49=50
T__50=51
T__51=52
QUESTION=53
LPAREN=54
RPAREN=55
LBRACK=56
RBRACK=57
LCURLY=58
RCURLY=59
COLON=60
COMMA=61
DOT=62
ASSIGN=63
COMPARE_TO=64
EQUAL=65
IDENTICAL=66
LNOT=67
BNOT=68
NOT_EQUAL=69
NOT_IDENTICAL=70
PLUS=71
PLUS_ASSIGN=72
INC=73
MINUS=74
MINUS_ASSIGN=75
DEC=76
STAR=77
STAR_ASSIGN=78
MOD=79
MOD_ASSIGN=80
SR=81
SR_ASSIGN=82
BSR=83
BSR_ASSIGN=84
GE=85
GT=86
SL=87
SL_ASSIGN=88
LE=89
LT=90
BXOR=91
BXOR_ASSIGN=92
BOR=93
BOR_ASSIGN=94
LOR=95
BAND=96
BAND_ASSIGN=97
LAND=98
SEMI=99
RANGE_INCLUSIVE=100
RANGE_EXCLUSIVE=101
TRIPLE_DOT=102
SPREAD_DOT=103
OPTIONAL_DOT=104
ELVIS_OPERATOR=105
MEMBER_POINTER=106
REGEX_FIND=107
REGEX_MATCH=108
STAR_STAR=109
STAR_STAR_ASSIGN=110
CLOSABLE_BLOCK_OP=111
WS=112
NLS=113
SL_COMMENT=114
SH_COMMENT=115
ML_COMMENT=116
STRING_LITERAL=117
REGEXP_LITERAL=118
DOLLAR_REGEXP_LITERAL=119
DOLLAR_REGEXP_CTOR_END=120
IDENT=121
AT=122
FINAL=123
ABSTRACT=124
UNUSED_GOTO=125
UNUSED_DO=126
STRICTFP=127
UNUSED_CONST=128
'package'=1
'import'=2
'static'=3
'def'=4
'class'=5
'interface'=6
'enum'=7
'trait'=8
'extends'=9
'super'=10
'void'=11
'boolean'=12
'byte'=13
'char'=14
'short'=15
'int'=16
'float'=17
'long'=18
'double'=19
'as'=20
'private'=21
'public'=22
'protected'=23
'transient'=24
'native'=25
'threadsafe'=26
'synchronized'=27
'volatile'=28
'default'=29
'implements'=30
'this'=31
'throws'=32
'if'=33
'else'=34
'while'=35
'switch'=36
'for'=37
'in'=38
'return'=39
'break'=40
'continue'=41
'throw'=42
'assert'=43
'case'=44
'try'=45
'finally'=46
'catch'=47
'false'=48
'instanceof'=49
'new'=50
'null'=51
'true'=52
'?'=53
'('=54
')'=55
'['=56
']'=57
'{'=58
'}'=59
':'=60
','=61
'.'=62
'='=63
'<=>'=64
'=='=65
'==='=66
'!'=67
'~'=68
'!='=69
'!=='=70
'+'=71
'+='=72
'++'=73
'-'=74
'-='=75
'--'=76
'*'=77
'*='=78
'%'=79
'%='=80
'>>'=81
'>>='=82
'>>>'=83
'>>>='=84
'>='=85
'>'=86
'<<'=87
'<<='=88
'<='=89
'<'=90
'^'=91
'^='=92
'|'=93
'|='=94
'||'=95
'&'=96
'&='=97
'&&'=98
';'=99
'..'=100
'..<'=101
'...'=102
'*.'=103
'?.'=104
'?:'=105
'.&'=106
'=~'=107
'==~'=108
'**'=109
'**='=110
'->'=111
'#'=122
'final'=123
'abstract'=124
'goto'=125
'do'=126
'strictfp'=127
'const'=128
All the tokens starting with T__ are wrong, and their corresponding value in the generated Groovy lexer is null.
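In ANTLR v4 the tokens section may only declare imaginary token types: the entries are comma-separated, and literal assignments such as FINAL="final" are no longer allowed. Keywords become ordinary lexer rules instead. A minimal sketch of the v4 equivalent (only a few of the tokens shown):
tokens {
    BLOCK, MODIFIERS, OBJBLOCK, SLIST, METHOD_DEF, VARIABLE_DEF
    // ... the remaining imaginary tokens, comma-separated ...
}
// v2-style keyword assignments become plain lexer rules:
FINAL        : 'final';
ABSTRACT     : 'abstract';
STRICTFP     : 'strictfp';
UNUSED_GOTO  : 'goto';
UNUSED_CONST : 'const';
UNUSED_DO    : 'do';
The T__0 ... T__51 entries in the tokens file appear because literals like 'package' or 'class' are used directly in parser rules without a matching named lexer rule, so v4 invents anonymous T__n types for them. Giving each literal an explicit lexer rule (PACKAGE : 'package'; CLASS : 'class'; and so on) makes those tokens show up under their proper names instead of null.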

Related

How to highlight QScintilla using ANTLR4?

I'm trying to learn ANTLR4 and I'm already having some issues with my first experiment.
The goal here is to learn how to use ANTLR to syntax highlight a QScintilla component. To practice a little bit I've decided I'd like to learn how to properly highlight *.ini files.
First things first, in order to run the MCVE you'll need to:
Download ANTLR4 and make sure it works; read the instructions on the main site.
Install the Python ANTLR runtime: pip install antlr4-python3-runtime
Generate the lexer/parser of ini.g4:
grammar ini;
start : section (option)*;
section : '[' STRING ']';
option : STRING '=' STRING;
COMMENT : ';' ~[\r\n]*;
STRING : [a-zA-Z0-9]+;
WS : [ \t\n\r]+;
by running antlr ini.g4 -Dlanguage=Python3 -o ini
Finally, save main.py:
import textwrap

from PyQt5.Qt import *
from PyQt5.Qsci import QsciScintilla, QsciLexerCustom

from antlr4 import *
from ini.iniLexer import iniLexer
from ini.iniParser import iniParser


class QsciIniLexer(QsciLexerCustom):

    def __init__(self, parent=None):
        super().__init__(parent=parent)

        lst = [
            {'bold': False, 'foreground': '#f92472', 'italic': False},  # 0 - deeppink
            {'bold': False, 'foreground': '#e7db74', 'italic': False},  # 1 - khaki (yellowish)
            {'bold': False, 'foreground': '#74705d', 'italic': False},  # 2 - dimgray
            {'bold': False, 'foreground': '#f8f8f2', 'italic': False},  # 3 - whitesmoke
        ]
        style = {
            "T__0": lst[3],
            "T__1": lst[3],
            "T__2": lst[3],
            "COMMENT": lst[2],
            "STRING": lst[0],
            "WS": lst[3],
        }

        for token in iniLexer.ruleNames:
            token_style = style[token]

            foreground = token_style.get("foreground", None)
            background = token_style.get("background", None)
            bold = token_style.get("bold", None)
            italic = token_style.get("italic", None)
            underline = token_style.get("underline", None)
            index = getattr(iniLexer, token)

            if foreground:
                self.setColor(QColor(foreground), index)
            if background:
                self.setPaper(QColor(background), index)

    def defaultPaper(self, style):
        return QColor("#272822")

    def language(self):
        return self.lexer.grammarFileName

    def styleText(self, start, end):
        view = self.editor()
        code = view.text()
        lexer = iniLexer(InputStream(code))
        stream = CommonTokenStream(lexer)
        parser = iniParser(stream)

        tree = parser.start()
        print('parsing'.center(80, '-'))
        print(tree.toStringTree(recog=parser))

        lexer.reset()
        self.startStyling(0)
        print('lexing'.center(80, '-'))

        while True:
            t = lexer.nextToken()
            print(lexer.ruleNames[t.type - 1], repr(t.text))

            if t.type != -1:
                len_value = len(t.text)
                self.setStyling(len_value, t.type)
            else:
                break

    def description(self, style_nr):
        return str(style_nr)


if __name__ == '__main__':
    app = QApplication([])
    v = QsciScintilla()
    lexer = QsciIniLexer(v)
    v.setLexer(lexer)
    v.setText(textwrap.dedent("""\
        ; Comment outside
        [section s1]
        ; Comment inside
        a = 1
        b = 2
        [section s2]
        c = 3 ; Comment right side
        d = e
        """))
    v.show()
    app.exec_()
and run it. If everything went well, you should get this outcome:
Here are my questions:
As you can see, the outcome of the demo is far from usable; you definitely don't want that, it's really disturbing. Instead, you'd like behaviour similar to what all the IDEs out there provide. Unfortunately I don't know how to achieve that. How would you modify the snippet to provide such behaviour?
Right now I'm trying to mimic highlighting similar to the snapshot below:
You can see in that screenshot that the highlighting differs on variable assignments (variables = deeppink, values = yellowish), but I don't know how to achieve that. I've tried this slightly modified grammar:
grammar ini;
start : section (option)*;
section : '[' STRING ']';
option : VARIABLE '=' VALUE;
COMMENT : ';' ~[\r\n]*;
VARIABLE : [a-zA-Z0-9]+;
VALUE : [a-zA-Z0-9]+;
WS : [ \t\n\r]+;
and then changing the styles to:
style = {
    "T__0": lst[3],
    "T__1": lst[3],
    "T__2": lst[3],
    "COMMENT": lst[2],
    "VARIABLE": lst[0],
    "VALUE": lst[1],
    "WS": lst[3],
}
but if you look at the lexing output you'll see there is no distinction between VARIABLE and VALUE, because of rule-order precedence in the ANTLR grammar (both rules match the same input, so the one defined first always wins). So my question is: how would you modify the grammar/snippet to achieve such a visual appearance?
The problem is that the lexer needs to be context sensitive: everything on the left hand side of the = needs to be a variable, and to the right of it a value. You can do this by using ANTLR's lexical modes. You start off by classifying successive non-spaces as being a variable, and when encountering a =, you move into your value-mode. When inside the value-mode, you pop out of this mode whenever you encounter a line break.
Note that lexical modes only work in a lexer grammar, not the combined grammar you now have. Also, for syntax highlighting, you probably only need the lexer.
Here's a quick demo of how this could work (stick it in a file called IniLexer.g4):
lexer grammar IniLexer;
SECTION
: '[' ~[\]]+ ']'
;
COMMENT
: ';' ~[\r\n]*
;
ASSIGN
: '=' -> pushMode(VALUE_MODE)
;
KEY
: ~[ \t\r\n]+
;
SPACES
: [ \t\r\n]+ -> skip
;
UNRECOGNIZED
: .
;
mode VALUE_MODE;
VALUE_MODE_SPACES
: [ \t]+ -> skip
;
VALUE
: ~[ \t\r\n]+
;
VALUE_MODE_COMMENT
: ';' ~[\r\n]* -> type(COMMENT)
;
VALUE_MODE_NL
: [\r\n]+ -> skip, popMode
;
If you now run the following script:
source = """
; Comment outside
[section s1]
; Comment inside
a = 1
b = 2
[section s2]
c = 3 ; Comment right side
d = e
"""
from antlr4 import *
# Assuming the generated module is IniLexer.py, produced by:
#   antlr4 -Dlanguage=Python3 IniLexer.g4
from IniLexer import IniLexer

lexer = IniLexer(InputStream(source))
stream = CommonTokenStream(lexer)
stream.fill()

for token in stream.tokens[:-1]:
    print("{0:<25} '{1}'".format(IniLexer.symbolicNames[token.type], token.text))
you will see the following output:
COMMENT '; Comment outside'
SECTION '[section s1]'
COMMENT '; Comment inside'
KEY 'a'
ASSIGN '='
VALUE '1'
KEY 'b'
ASSIGN '='
VALUE '2'
SECTION '[section s2]'
KEY 'c'
ASSIGN '='
VALUE '3'
COMMENT '; Comment right side'
KEY 'd'
ASSIGN '='
VALUE 'e'
And an accompanying parser grammar could look like this:
parser grammar IniParser;
options {
tokenVocab=IniLexer;
}
sections
: section* EOF
;
section
: COMMENT
| SECTION section_atom*
;
section_atom
: COMMENT
| KEY ASSIGN VALUE
;
which would parse your example input into the following parse tree:
I already implemented something like this in C++.
https://github.com/tora-tool/tora/blob/master/src/editor/tosqltext.cpp
I sub-classed the QScintilla class and implemented a custom lexer based on the ANTLR-generated source.
You might even use the ANTLR parser (I did not use it); QScintilla allows you to have more than one analyzer (having different weights), so you can periodically perform some semantic check on the text. What cannot be done easily in QScintilla is to associate a token with some additional data.
Syntax highlighting in Scintilla is done by dedicated highlighter classes, which are lexers. A parser is not well suited for this kind of work, because the syntax highlighting feature must keep working even if the input contains errors. A parser is a tool to verify the correctness of the input: two totally different tasks.
So I recommend you stop thinking about using ANTLR4 for that and instead take one of the existing lex classes and create a new one for the language you want to highlight.

Libclang tokenKinds

Where can I find the token kinds in libclang?
For example, I know these token kinds exist:
eof, r_paren, l_paren, r_brace, l_brace
Where can I find the rest of the token kinds?
Thank you.
In libclang, the token kinds are CXToken_Punctuation, CXToken_Keyword, CXToken_Identifier, CXToken_Literal, and CXToken_Comment. (link)
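For instance, you can list every token in a file together with its libclang kind via clang_tokenize and clang_getTokenKind. A minimal sketch (error handling omitted; "test.c" is a placeholder input file):
#include <stdio.h>
#include <clang-c/Index.h>

int main(void) {
    CXIndex index = clang_createIndex(0, 0);
    CXTranslationUnit tu = clang_parseTranslationUnit(
        index, "test.c", NULL, 0, NULL, 0, CXTranslationUnit_None);

    /* Tokenize the whole translation unit. */
    CXSourceRange range = clang_getCursorExtent(clang_getTranslationUnitCursor(tu));
    CXToken *tokens;
    unsigned numTokens;
    clang_tokenize(tu, range, &tokens, &numTokens);

    for (unsigned i = 0; i < numTokens; ++i) {
        /* kind is one of the five CXTokenKind values listed above. */
        CXTokenKind kind = clang_getTokenKind(tokens[i]);
        CXString spelling = clang_getTokenSpelling(tu, tokens[i]);
        printf("%u: kind=%d '%s'\n", i, (int)kind, clang_getCString(spelling));
        clang_disposeString(spelling);
    }

    clang_disposeTokens(tu, tokens, numTokens);
    clang_disposeTranslationUnit(tu);
    clang_disposeIndex(index);
    return 0;
}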
In clang, the list of token kinds can be found by preprocessing the file clang/Basic/TokenKinds.def after adding the definition #define TOK(X) X (a minimal snippet showing this expansion follows the list). Doing this gives me the following list:
unknown,
eof,
eod,
code_completion,
comment,
identifier,
raw_identifier,
numeric_constant,
char_constant,
wide_char_constant,
utf8_char_constant,
utf16_char_constant,
utf32_char_constant,
string_literal,
wide_string_literal,
angle_string_literal,
utf8_string_literal,
utf16_string_literal,
utf32_string_literal,
l_square,
r_square,
l_paren,
r_paren,
l_brace,
r_brace,
period,
ellipsis,
amp,
ampamp,
ampequal,
star,
starequal,
plus,
plusplus,
plusequal,
minus,
arrow,
minusminus,
minusequal,
tilde,
exclaim,
exclaimequal,
slash,
slashequal,
percent,
percentequal,
less,
lessless,
lessequal,
lesslessequal,
greater,
greatergreater,
greaterequal,
greatergreaterequal,
caret,
caretequal,
pipe,
pipepipe,
pipeequal,
question,
colon,
semi,
equal,
equalequal,
comma,
hash,
hashhash,
hashat,
periodstar,
arrowstar,
coloncolon,
at,
lesslessless,
greatergreatergreater,
kw_auto,
kw_break,
kw_case,
kw_char,
kw_const,
kw_continue,
kw_default,
kw_do,
kw_double,
kw_else,
kw_enum,
kw_extern,
kw_float,
kw_for,
kw_goto,
kw_if,
kw_inline,
kw_int,
kw_long,
kw_register,
kw_restrict,
kw_return,
kw_short,
kw_signed,
kw_sizeof,
kw_static,
kw_struct,
kw_switch,
kw_typedef,
kw_union,
kw_unsigned,
kw_void,
kw_volatile,
kw_while,
kw__Alignas,
kw__Alignof,
kw__Atomic,
kw__Bool,
kw__Complex,
kw__Generic,
kw__Imaginary,
kw__Noreturn,
kw__Static_assert,
kw__Thread_local,
kw___func__,
kw___objc_yes,
kw___objc_no,
kw_asm,
kw_bool,
kw_catch,
kw_class,
kw_const_cast,
kw_delete,
kw_dynamic_cast,
kw_explicit,
kw_export,
kw_false,
kw_friend,
kw_mutable,
kw_namespace,
kw_new,
kw_operator,
kw_private,
kw_protected,
kw_public,
kw_reinterpret_cast,
kw_static_cast,
kw_template,
kw_this,
kw_throw,
kw_true,
kw_try,
kw_typename,
kw_typeid,
kw_using,
kw_virtual,
kw_wchar_t,
kw_alignas,
kw_alignof,
kw_char16_t,
kw_char32_t,
kw_constexpr,
kw_decltype,
kw_noexcept,
kw_nullptr,
kw_static_assert,
kw_thread_local,
kw_concept,
kw_requires,
kw_co_await,
kw_co_return,
kw_co_yield,
kw__Decimal32,
kw__Decimal64,
kw__Decimal128,
kw___null,
kw___alignof,
kw___attribute,
kw___builtin_choose_expr,
kw___builtin_offsetof,
kw___builtin_types_compatible_p,
kw___builtin_va_arg,
kw___extension__,
kw___imag,
kw___int128,
kw___label__,
kw___real,
kw___thread,
kw___FUNCTION__,
kw___PRETTY_FUNCTION__,
kw___auto_type,
kw_typeof,
kw___FUNCDNAME__,
kw___FUNCSIG__,
kw_L__FUNCTION__,
kw___is_interface_class,
kw___is_sealed,
kw___is_destructible,
kw___is_nothrow_destructible,
kw___is_nothrow_assignable,
kw___is_constructible,
kw___is_nothrow_constructible,
kw___has_nothrow_assign,
kw___has_nothrow_move_assign,
kw___has_nothrow_copy,
kw___has_nothrow_constructor,
kw___has_trivial_assign,
kw___has_trivial_move_assign,
kw___has_trivial_copy,
kw___has_trivial_constructor,
kw___has_trivial_move_constructor,
kw___has_trivial_destructor,
kw___has_virtual_destructor,
kw___is_abstract,
kw___is_base_of,
kw___is_class,
kw___is_convertible_to,
kw___is_empty,
kw___is_enum,
kw___is_final,
kw___is_literal,
kw___is_pod,
kw___is_polymorphic,
kw___is_trivial,
kw___is_union,
kw___is_trivially_constructible,
kw___is_trivially_copyable,
kw___is_trivially_assignable,
kw___underlying_type,
kw___is_lvalue_expr,
kw___is_rvalue_expr,
kw___is_arithmetic,
kw___is_floating_point,
kw___is_integral,
kw___is_complete_type,
kw___is_void,
kw___is_array,
kw___is_function,
kw___is_reference,
kw___is_lvalue_reference,
kw___is_rvalue_reference,
kw___is_fundamental,
kw___is_object,
kw___is_scalar,
kw___is_compound,
kw___is_pointer,
kw___is_member_object_pointer,
kw___is_member_function_pointer,
kw___is_member_pointer,
kw___is_const,
kw___is_volatile,
kw___is_standard_layout,
kw___is_signed,
kw___is_unsigned,
kw___is_same,
kw___is_convertible,
kw___array_rank,
kw___array_extent,
kw___private_extern__,
kw___module_private__,
kw___declspec,
kw___cdecl,
kw___stdcall,
kw___fastcall,
kw___thiscall,
kw___vectorcall,
kw___forceinline,
kw___unaligned,
kw___super,
kw___global,
kw___local,
kw___constant,
kw___private,
kw___generic,
kw___kernel,
kw___read_only,
kw___write_only,
kw___read_write,
kw___builtin_astype,
kw_vec_step,
kw___builtin_omp_required_simd_align,
kw_pipe,
kw___pascal,
kw___vector,
kw___pixel,
kw___bool,
kw_half,
kw___bridge,
kw___bridge_transfer,
kw___bridge_retained,
kw___bridge_retain,
kw___covariant,
kw___contravariant,
kw___kindof,
kw__Nonnull,
kw__Nullable,
kw__Null_unspecified,
kw___ptr64,
kw___ptr32,
kw___sptr,
kw___uptr,
kw___w64,
kw___uuidof,
kw___try,
kw___finally,
kw___leave,
kw___int64,
kw___if_exists,
kw___if_not_exists,
kw___single_inheritance,
kw___multiple_inheritance,
kw___virtual_inheritance,
kw___interface,
kw___builtin_convertvector,
kw___unknown_anytype,
annot_cxxscope,
annot_typename,
annot_template_id,
annot_primary_expr,
annot_decltype,
annot_pragma_unused,
annot_pragma_vis,
annot_pragma_pack,
annot_pragma_parser_crash,
annot_pragma_captured,
annot_pragma_dump,
annot_pragma_msstruct,
annot_pragma_align,
annot_pragma_weak,
annot_pragma_weakalias,
annot_pragma_redefine_extname,
annot_pragma_fp_contract,
annot_pragma_ms_pointers_to_members,
annot_pragma_ms_vtordisp,
annot_pragma_ms_pragma,
annot_pragma_opencl_extension,
annot_pragma_openmp,
annot_pragma_openmp_end,
annot_pragma_loop_hint,
annot_module_include,
annot_module_begin,
annot_module_end
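For reference, the expansion described above is the classic X-macro trick; a minimal sketch of it (clang's own clang/Basic/TokenKinds.h does essentially the same):
/* Expand clang/Basic/TokenKinds.def into an enum of all token kinds. */
enum TokenKind {
#define TOK(X) X,
#include "clang/Basic/TokenKinds.def"
  NUM_TOKENS
};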

"Expected token" using lemon parser generator

Is there a known way to generate an "Expected token" list when a syntax error happens ? I'm using Lemon as parser generator.
This seems to work:
%syntax_error {
    int n = sizeof(yyTokenName) / sizeof(yyTokenName[0]);
    for (int i = 0; i < n; ++i) {
        int a = yy_find_shift_action(yypParser, (YYCODETYPE)i);
        if (a < YYNSTATE + YYNRULE) {
            printf("possible token: %s\n", yyTokenName[i]);
        }
    }
}
It tries all possible tokens and prints those that are applicable in the current parser state.
Note that when an incorrect token arrives, the parser doesn't immediately call syntax_error; it first tries to reduce what's on the stack, hoping the token can be shifted afterwards. Only when nothing else can be reduced and the current token still cannot be shifted does the parser call syntax_error. The reductions change the parser state, which means you may see fewer tokens than would have been applicable before the reductions. It should be sufficient for error reporting, though.
There is no direct method to generate such a list in Lemon. But you can try to do this using the debug output of the Lemon tool and the debug trace of the generated parser. After a call to the ParseTrace function, the generated parser prints the list of shifts and reduces it applies to the input stream. The last shift before the syntax error contains the number of the current state before the error. Find this state in the *.out file for your parser and see the list of expected tokens for it.
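Enabling that trace is a single extra call before you start feeding tokens. A minimal sketch using Lemon's default entry-point names (ParseTrace is only compiled in when the parser is built without NDEBUG):
#include <stdio.h>
#include <stdlib.h>

/* Prototypes of the Lemon-generated entry points (default names). */
void *ParseAlloc(void *(*mallocProc)(size_t));
void  ParseFree(void *parser, void (*freeProc)(void *));
void  ParseTrace(FILE *stream, char *zPrefix);

int main(void) {
    void *parser = ParseAlloc(malloc);
    ParseTrace(stderr, "parser: ");  /* echo every shift/reduce to stderr */
    /* ... call Parse(parser, TOKEN_CODE, tokenValue) for each token ... */
    ParseFree(parser, free);
    return 0;
}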
The modern versions of Lemon use something like the following:
%syntax_error {
    for (int32_t i = 1, a = 0; i < YYNTOKEN; ++i) {
        a = yy_find_shift_action((YYCODETYPE)i, yypParser->yytos->stateno);
        if (a != YY_ERROR_ACTION) {
            // Token i can be shifted in the current state,
            // i.e. it is one of the expected tokens.
        }
    }
}

How to avoid building intermediates and useless AST nodes with ANTLR3?

I wrote an ANTLR3 grammar subdivided into smaller rules to increase readability.
For example:
messageSequenceChart:
    'msc' mscHead bmsc 'endmsc' end
;

// where mscHead is a shortcut to:

mscHead:
    mscName mscParameterDecl? timeOffset? end
    mscInstInterface? mscGateInterface
;
I know the built-in ANTLR AST building feature allows the user to declare intermediate AST nodes that won't be in the final AST. But what if you build the AST by hand?
messageSequenceChart returns [msc::MessageSequenceChart* n = 0]:
    'msc' mscHead bmsc 'endmsc' end
    {
        $n = new msc::MessageSequenceChart(/* mscHead subrule accessors like $mscHead.mscName.n ? */
                                           $bmsc.n);
    }
;

mscHead:
    mscName mscParameterDecl? timeOffset? end
;
The documentation does not talk about such a thing. So it looks like I will have to create nodes for every intermediate rule to be able to access its subrules' results.
Does anyone know a better solution?
Thank you.
You can solve this by letting your sub-rule(s) return multiple values and accessing only those you're interested in.
The following demo shows how to do it. Although it is not in C, I am confident that you'll be able to adjust it so that it fits your needs:
grammar Test;
parse
: sub EOF {System.out.printf("second=%s\n", $sub.second);}
;
sub returns [String first, String second, String third]
: a=INT b=INT c=INT
{
$first = $a.text;
$second = $b.text;
$third = $c.text;
}
;
INT
: '0'..'9'+
;
SPACE
: ' ' {$channel=HIDDEN;}
;
And if you parse the input "12 34 56" with the generated parser, second=34 is printed to the console, as you can see after running:
import org.antlr.runtime.*;

public class Main {
    public static void main(String[] args) throws Exception {
        TestLexer lex = new TestLexer(new ANTLRStringStream("12 34 56"));
        TokenStream tokens = new TokenRewriteStream(lex);
        TestParser parser = new TestParser(tokens);
        parser.parse();
    }
}
So a shortcut from the parse rule, like $sub.INT or $sub.$a, to access one of the three INT tokens is not possible, unfortunately.

Can anyone help me convert this ANTLR 2.0 grammar file to ANTLR 3.0 syntax?

I've converted the 'easy' parts (fragment, #header and #member declarations, etc.), but since I'm new to ANTLR I have a really hard time converting the tree statements etc.
I use the following migration guide.
The grammar file can be found here....
Below you can find some examples where I run into problems:
For instance, I have problems with:
n3Directive0!:
d:AT_PREFIX ns:nsprefix u:uriref
{directive(#d, #ns, #u);}
;
or
propertyList![AST subj]
: NAME_OP! anonnode[subj] propertyList[subj]
| propValue[subj] (SEMI propertyList[subj])?
| // void : allows for [ :a :b ] and empty list "; .".
;
propValue [AST subj]
: v1:verb objectList[subj, #v1]
// Reverse the subject and object
| v2:verbReverse subjectList[subj, #v2]
;
subjectList![AST oldSub, AST prop]
: obj:item { emitQuad(#obj, prop, oldSub) ; }
(COMMA subjectList[oldSub, prop])? ;
objectList! [AST subj, AST prop]
: obj:item { emitQuad(subj,prop,#obj) ; }
(COMMA objectList[subj, prop])?
| // Allows for empty list ", ."
;
n3Directive0!:
d=AT_PREFIX ns=nsprefix u=uriref
{directive($d, $ns, $u);}
;
You have to use '=' for assignments.
Tokens can then be used in your code as '$tokenname.getText()', etc.
Rule results can then be used as '$rulename.result'.
If a rule declares named return values, you have to use those names instead of 'result'.
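Applied to one of the problematic rules above, the same mechanical changes would look roughly like this (a sketch, not tested against the full grammar: the '!' suffix is dropped, labels use '=' instead of ':', and $obj.tree stands in for #obj, assuming output=AST is set):
subjectList[AST oldSub, AST prop]
    : obj=item { emitQuad($obj.tree, prop, oldSub); }
      (COMMA subjectList[oldSub, prop])?
    ;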
