JavaCC Token is not matched - token

I'm trying to write a parser for a simple language, and I've reached a point where I don't know how to handle this problem. Here is my .jj file:
options
{
STATIC = false;
LOOKAHEAD=2;
//DEBUG_LOOKAHEAD = true;
DEBUG_TOKEN_MANAGER=true;
FORCE_LA_CHECK = true;
DEBUG_PARSER = true;
JDK_VERSION = "1.7";
}
PARSER_BEGIN(Parser)
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
public class Parser{
private static BufferedWriter bufferFileWriter;
private static FileWriter fWriter;
public static void main(String args []) throws ParseException, IOException
{
Parser parser = new Parser(System.in);
fWriter = new FileWriter("result", true);
bufferFileWriter = new BufferedWriter(fWriter);
parser.program();
// TO DO
}
}
PARSER_END(Parser)
SKIP :
{
" "
| "\r"
| "\t"
| "\n"
}
TOKEN : /* OPERATORS */
{
< PLUS : "+" >
| < MINUS : "-" >
| < MULTIPLY : "*" >
| < DIVIDE : "/" >
| < MODULO : "%" >
| < ASSIG : ":=" >
| < EQUAL : "==" >
| < DIFF : "!=" >
| < SMALLER : "<" >
| < GRATER : ">" >
| < S_OR_EQU: "<=" >
| < G_OR_EQU: "=>" >
}
TOKEN : /*KEY WORDS FROM LANGUAGE */
{
< VAR: "VAR">
| < BEGIN : "BEGIN" >
| < END : "END" >
| < IF : "IF" >
| < ELSE : "ELSE" >
| < THEN : "THEN" >
| < WHILE: "WHILE" >
| < DO : "DO" >
| < READ : "READ" >
| < WRITE : "WRITE" >
| < SEMICOL : ";" >
}
TOKEN :
{
< VALUE : < ID > | < NUMBER > >
| < NUMBER : (< DIGIT >)+ >
| < #DIGIT : [ "0"-"9" ] >
| < ID : (["a"-"z"])+ >
}
void program():
{}
{
varDeclarations()< BEGIN > commands() < END >
}
void varDeclarations():
{}
{
< VAR >
{
System.out.println("past VAR token");
}
(< ID >
)+
}
void commands():
{}
{
(LOOKAHEAD(3)
command())+
}
void command():
{
Token t;
}
{
assign()
|< IF >condition()< THEN >commands()< ELSE >commands()< END >
|< WHILE >condition()< DO >commands()< END >
|< READ >
t=< ID >
{
try
{
fWriter.append("LOAD "+t.image);
System.out.println("LOAD "+t.image);
}
catch(IOException e)
{
};
}
< SEMICOL >
|< WRITE >
t = < VALUE >< SEMICOL >
}
void assign():
{
Token t;
}
{
t=< ID >
{
}
< ASSIG >expression(t)< SEMICOL >
}
void condition():
{}
{
< VALUE > condOperator() < VALUE >
}
void condOperator():
{}
{
< EQUAL > | < DIFF > | < SMALLER > | < S_OR_EQU > | < GRATER > | < G_OR_EQU >
}
Token operator():
{
Token tok;
}
{
tok=< PLUS >
{
System.out.println(tok.image);
return tok;
}
|tok=< MINUS >
{
System.out.println(tok.image);
return tok;
}
|tok=< MULTIPLY >
{
System.out.println(tok.image);
return tok;
}
|tok=< DIVIDE >
{
System.out.println(tok.image);
return tok;
}
|tok=< MODULO >
{
System.out.println(tok.image);
return tok;
}
}
void expression(Token writeTo):
{
Symbol s;
Token t1, t2, t3;
}
{
t1 = < VALUE >
t2 = operator()
t3 = < VALUE >
< SEMICOL >
{
if(t2.image.equals("+"))
{
try
{
fWriter.append("ADD "+t1.image+" "+t2.image);
System.out.println("ADD "+t1.image+" "+t2.image);
}catch(IOException e)
{
}
}
}
}
Writing to the file is not important at the moment.
And this is the text I want to parse:
VAR
a b
BEGIN
READ a ;
READ b ;
WHILE a != b DO
IF a < b THEN (* a <-> b *)
a := a + b ;
b := a - b ;
a := a - b ;
ELSE
END
a := a - b ;
END
WRITE a ;
END
And this is the output I get from the debugger:
mother-ship $ java Parser test
Call: program
Call: varDeclarations
As you can see, the parser enters the varDeclarations method, but why can't it match the token for the word VAR?
I'll be grateful for any help.
@Theodore I did as you suggested, but it didn't work. Maybe I'm compiling and executing the program the wrong way?
This is a copy of my console:
$javacc Parser.jj
Java Compiler Compiler Version 5.0 (Parser Generator)
(type "javacc" with no arguments for help)
Reading from file Parser.jj . . .
File "TokenMgrError.java" is being rebuilt.
File "ParseException.java" is being rebuilt.
File "Token.java" is being rebuilt.
File "SimpleCharStream.java" is being rebuilt.
Parser generated successfully.
$ javac *.java
$ java Parser VAR a
Call: program
Call: varDeclarations

I had no problem getting your parser to recognize the "VAR" keyword. The problem is that the "a" is tokenized as a VALUE token, while the parser expects an ID token after the "VAR" keyword. (See input and output below.)
The rule for VALUE has precedence over the rule for ID by virtue of being first: both rules match "a" with the same length, and on such a tie the rule that appears earlier in the file wins. (See question 3.3 in the FAQ.)
What you probably should do is replace the token rule you have now for VALUE with the following production.
void Value() : {} { <ID> | <NUMBER> }
Input:
VAR
a
Output:
Call: program
Call: varDeclarations
Current character : V (86) at line 1 column 1
Possible string literal matches : { "VAR" }
Current character : A (65) at line 1 column 2
Possible string literal matches : { "VAR" }
Current character : R (82) at line 1 column 3
No more string literal token matches are possible.
Currently matched the first 3 characters as a "VAR" token.
****** FOUND A "VAR" MATCH (VAR) ******
Consumed token: <"VAR" at line 1 column 1>
past VAR token
Skipping character : \n (10)
Current character : a (97) at line 2 column 1
No string literal matches possible.
Starting NFA to match one of : { <VALUE> }
Current character : a (97) at line 2 column 1
Currently matched the first 1 characters as a <VALUE> token.
Possible kinds of longer matches : { <VALUE>, <ID> }
Current character : \n (10) at line 2 column 2
Currently matched the first 1 characters as a <VALUE> token.
Putting back 1 characters into the input stream.
****** FOUND A <VALUE> MATCH (a) ******
Return: varDeclarations
Return: program
Exception in thread "main" tokenNotMatched.ParseException: Encountered " <VALUE> "a "" at line 2, column 1.
Was expecting:
<ID> ...
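The tie-breaking behaviour described above can be imitated outside JavaCC. The following is only a sketch in plain Java, not JavaCC's generated token manager: the class TieBreakDemo and the method winner are hypothetical names, and java.util.regex patterns stand in for the token definitions from the question. It picks the longest match and, when lengths are equal, keeps the rule that was declared first.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of JavaCC or of the question's grammar.
class TieBreakDemo {
    /**
     * Returns the name of the rule that wins at the start of the input:
     * the longest match wins, and on equal length the rule declared earliest wins.
     * The map must preserve declaration order (for example a LinkedHashMap).
     */
    static String winner(Map<String, String> rules, String input) {
        String best = null;
        int bestLen = 0;
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            Matcher m = Pattern.compile(rule.getValue()).matcher(input);
            // Strictly greater: a later rule cannot displace an earlier one on a tie.
            if (m.lookingAt() && m.end() > bestLen) {
                best = rule.getKey();
                bestLen = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Same declaration order as in the question: VALUE comes before ID.
        Map<String, String> asInQuestion = new LinkedHashMap<>();
        asInQuestion.put("VALUE", "[a-z]+|[0-9]+");
        asInQuestion.put("NUMBER", "[0-9]+");
        asInQuestion.put("ID", "[a-z]+");
        System.out.println(winner(asInQuestion, "a")); // prints VALUE

        // Without the VALUE token, the same input is recognized as ID.
        Map<String, String> withoutValue = new LinkedHashMap<>();
        withoutValue.put("NUMBER", "[0-9]+");
        withoutValue.put("ID", "[a-z]+");
        System.out.println(winner(withoutValue, "a")); // prints ID
    }
}
```

With VALUE declared first, ID can never win on input made only of lowercase letters, which is why moving the choice between ID and NUMBER into a parser production resolves the error.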

Related

F# "This value is not a function and cannot be applied" when trying to add (+)

The m.Count + m.StepSize in the F# code, 4th line from the bottom, returns the error This value is not a function and cannot be applied.
I can't see why + isn't being interpreted as an infix function instead of m.Count.
Why is this line a problem?
type Model =
    { Count: int
      StepSize: int }
type Msg =
    | Increment
    | Decrement
    | SetStepSize of int
    | Reset
let init =
    { Count = 0
      StepSize = 1 }
let canReset = (<>) init
let update msg m =
    match msg with
    | Increment -> { m with Count = m.Count + m.StepSize }
    | Decrement -> { m with Count = m.Count - m.StepSize }
    | SetStepSize x -> { m with StepSize = x }
    | Reset -> init

Resolving ANTLR ambiguity while matching specific Types

I'm starting to explore ANTLR and I'm trying to match this format: (test123 A0020 )
Where :
test123 is an Identifier of max 10 characters ( letters and digits )
A : Time indicator ( for Am or Pm ), one letter can be either "A" or "P"
0020 : 4 digit format representing the time.
I tried this grammar :
IDENTIFIER
:
( LETTER | DIGIT ) +
;
INT
:
DIGIT+
;
fragment
DIGIT
:
[0-9]
;
fragment
LETTER
:
[A-Z]
;
WS : [ \t\r\n(\s)+]+ -> channel(HIDDEN) ;
formatter: '(' information ')';
information :
information '/' 'A' INT
|IDENTIFIER ;
How can I resolve the ambiguity and get the time format matched as 'A' INT not as IDENTIFIER?
Also how can I add checks like length of token to the identifier?
I know that this doesn't work in ANTLR: IDENTIFIER : (DIGIT | LETTER ) {2,10}
UPDATE:
I changed the rules to use semantic checks, but I still have the same ambiguity between the identifier and the time format. Here are the modified rules:
formatter
: information
| information '-' time
;
time :
timeMode timeCode;
timeMode:
{ getCurrentToken().getText().matches("[A,C]")}? MOD
;
timeCode: {getCurrentToken().getText().matches("[0-9]{4}")}? INT;
information: {getCurrentToken().getText().length() <= 10 }? IDENTIFIER;
MOD: 'A' | 'C';
The problem shows up in the parse tree: A0023 is matched as timeMode, and the parser complains that timeCode is missing.
Here is a way to handle it:
grammar Test;
@lexer::members {
private boolean isAhead(int maxAmountOfCharacters, String pattern) {
final Interval ahead = new Interval(this._tokenStartCharIndex, this._tokenStartCharIndex + maxAmountOfCharacters - 1);
return this._input.getText(ahead).matches(pattern);
}
}
parse
: formatter EOF
;
formatter
: information ( '-' time )?
;
time
: timeMode timeCode
;
timeMode
: TIME_MODE
;
timeCode
: {getCurrentToken().getType() == IDENTIFIER_OR_INTEGER && getCurrentToken().getText().matches("\\d{4}")}?
IDENTIFIER_OR_INTEGER
;
information
: {getCurrentToken().getType() == IDENTIFIER_OR_INTEGER && getCurrentToken().getText().matches("\\w*[a-zA-Z]\\w*")}?
IDENTIFIER_OR_INTEGER
;
IDENTIFIER_OR_INTEGER
: {!isAhead(6, "[AP]\\d{4}(\\D|$)")}? [a-zA-Z0-9]+
;
TIME_MODE
: [AP]
;
SPACES
: [ \t\r\n] -> skip
;
A small test class:
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
public class Main {
private static void indent(String lispTree) {
int indentation = -1;
for (final char c : lispTree.toCharArray()) {
if (c == '(') {
indentation++;
for (int i = 0; i < indentation; i++) {
System.out.print(i == 0 ? "\n " : " ");
}
}
else if (c == ')') {
indentation--;
}
System.out.print(c);
}
}
public static void main(String[] args) throws Exception {
TestLexer lexer = new TestLexer(new ANTLRInputStream("1P23 - A0023"));
TestParser parser = new TestParser(new CommonTokenStream(lexer));
indent(parser.parse().toStringTree(parser));
}
}
will print:
(parse
(formatter
(information 1P23) -
(time
(timeMode A)
(timeCode 0023))) <EOF>)
for the input "1P23 - A0023".
EDIT
ANTLR can also display the parse tree in a UI component. If you do this instead:
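import java.util.Arrays;
import org.antlr.v4.gui.TreeViewer;
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
public class Main {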
public static void main(String[] args) throws Exception {
TestLexer lexer = new TestLexer(new ANTLRInputStream("1P23 - A0023"));
TestParser parser = new TestParser(new CommonTokenStream(lexer));
new TreeViewer(Arrays.asList(TestParser.ruleNames), parser.parse()).open();
}
}
the following dialog will appear:
Tested with ANTLR version 4.5.2-1
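To see what the lookahead predicate in the IDENTIFIER_OR_INTEGER rule decides, the same regular expression can be tried in plain Java, without ANTLR. This is only a sketch: the class TimeLookaheadDemo and the method looksLikeTime are hypothetical names, and the clamping to the end of the input is an assumption made here to stand in for what the character stream does in the real lexer.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class; it only reuses the regular expression from the grammar.
class TimeLookaheadDemo {
    // Same pattern as in the isAhead(6, ...) call: A or P, four digits,
    // then either a non-digit or the end of the inspected text.
    private static final Pattern TIME = Pattern.compile("[AP]\\d{4}(\\D|$)");

    /** Looks at no more than 6 characters starting at 'start' and tests the pattern. */
    static boolean looksLikeTime(String input, int start) {
        int end = Math.min(input.length(), start + 6); // assumption: clamp at end of input
        return TIME.matcher(input.substring(start, end)).matches();
    }

    public static void main(String[] args) {
        String input = "1P23 - A0023";
        System.out.println(looksLikeTime(input, 0)); // prints false: "1P23 -" is not a time
        System.out.println(looksLikeTime(input, 7)); // prints true: "A0023" is a time
    }
}
```

When the check is true, the predicate in the grammar is false, so IDENTIFIER_OR_INTEGER cannot match at that position and the single letter is left to TIME_MODE.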
Using semantic predicates (check this amazing QA), you can define parser rules for your specific model, having logic checks that the information can be parsed. Note this is only an option for parser rules, not lexer rules.
information
: information '/' meridien time
| text
;
meridien
: am
| pm
;
am: {input.LT(1).getText() == "A"}? IDENTIFIER;
pm: {input.LT(1).getText() == "P"}? IDENTIFIER;
time: {input.LT(1).getText().length == 4}? INT;
text: {input.LT(1).getText().length <= 10}? IDENTIFIER;
compileUnit
: alfaNum time
;
alfaNum : (ALFA | MOD | NUM)+;
time : MOD NUM+;
MOD: 'A' | 'P';
ALFA: [a-zA-Z];
NUM: [0-9];
WS
: ' ' -> channel(HIDDEN)
;
You can avoid the ambiguity by including MOD in the alfaNum rule.

Using record types in FSYACC

In FSYACC it is common to have terminals that result in tuples. However, for convenience I want to use a record type instead. For example, if I have the following in my Abstract Syntax Tree (AbstractSyntaxTree.fsl):
namespace FS
module AbstractSyntaxTree =
type B = { x : int; y : int }
type Either =
| Record of B
| Tuple of int * string
type A =
| Int of int
| String of string
| IntTuple of Either
I'm not clear on the correct syntax in FSYACC (parser.fsy), because if I use:
%start a
%token <string> STRING
%token <System.Int32> INT
%token ATOMTOKEN TUPLETOKEN EOF
%type < A > a
%%
a:
| atomS { $1 }
| atomI { $1 }
| either { $1 }
atomI:
| ATOMTOKEN INT { Int($2) }
atomS:
| ATOMTOKEN STRING { String($2) }
either:
| TUPLETOKEN INT INT { Record {x=$2;y=$3} } // !!!
| TUPLETOKEN TUPLETOKEN INT STRING { Tuple( $3, $4) } // !!!
I would expect the type B and the Tuple to be inferred. However, FSYACC gives the error for both of the lines marked with "!!!":
This expression was expected to have type A but here has type Either
What is the correct syntax for the "either" production on the last two lines?
Don't you mean IntTuple($2, $3) as opposed to B($2, $3)? I'd try IntTuple{x=$2; y=$3}
EDIT: this works:
module Ast
type B = { x : int; y : int }
type A =
| Int of int
| String of string
| IntTuple of B
and
%{
open Ast
%}
%start a
%token <string> STRING
%token <System.Int32> INT
%token ATOMTOKEN TUPLETOKEN
%type < Ast.A > a
%%
a:
| atom { $1 }
| tuple { $1 }
atom:
| ATOMTOKEN INT { Int($2) }
| ATOMTOKEN STRING { String($2) }
tuple:
| TUPLETOKEN INT INT { IntTuple {x = $2; y = $3} }
EDIT 2: Note that the line %type < Ast.A > a requires your non-terminal a to be of type Ast.A. Since a returns the result of the non-terminal tuple directly, tuple must also be of type Ast.A. That is why you have to wrap the record in IntTuple: the syntax is IntTuple {x = $2; y = $3}, not just {x = $2; y = $3}.

Error "Cannot find a constructor"

I'm currently trying Rascal to create a small DSL. I tried to modify the Pico example, but I'm currently stuck. The following code parses examples like a = 3, b = 7 begin declare x : natural, field real # cells blubb; x := 5.7 end perfectly, but the implode function fails with the error message "Cannot find a constructor for PROGRAM". I tried various constructor declarations, but none seemed to fit. Is there a way to see what the expected constructor looks like?
Syntax:
module BlaTest::Syntax
import Prelude;
lexical Identifier = [a-z][a-z0-9]* !>> [a-z0-9];
lexical NaturalConstant = [0-9]+;
lexical IntegerConstant = [\-+]? NaturalConstant;
lexical RealConstant = IntegerConstant "." NaturalConstant;
lexical StringConstant = "\"" ![\"]* "\"";
layout Layout = WhitespaceAndComment* !>> [\ \t\n\r%];
lexical WhitespaceAndComment
= [\ \t\n\r]
| @category="Comment" "%" ![%]+ "%"
| @category="Comment" "%%" ![\n]* $
;
start syntax Program
= program: {ExaOption ","}* exadomain "begin" Declarations decls {Statement ";"}* body "end"
;
syntax Domain = "domain" "{" ExaOption ", " exaoptions "}"
;
syntax ExaOption = Identifier id "=" Expression val
;
syntax Declarations
= "declare" {Declaration ","}* decls ";" ;
syntax Declaration
= variable_declaration: Identifier id ":" Type tp
| field_declaration: "field" Type tp "#" FieldLocation fieldLocation Identifier id
;
syntax FieldLocation
= exacell: "cells"
| exanode: "nodes"
;
syntax Type
= natural:"natural"
| exareal: "real"
| string :"string"
;
syntax Statement
= asgStat: Identifier var ":=" Expression val
| ifElseStat: "if" Expression cond "then" {Statement ";"}* thenPart "else" {Statement ";"}* elsePart "fi"
| whileStat: "while" Expression cond "do" {Statement ";"}* body "od"
;
syntax Expression
= id: Identifier name
| stringConstant: StringConstant stringconstant
| naturalConstant: NaturalConstant naturalconstant
| realConstant: RealConstant realconstant
| bracket "(" Expression e ")"
> left conc: Expression lhs "||" Expression rhs
> left ( add: Expression lhs "+" Expression rhs
| sub: Expression lhs "-" Expression rhs
)
;
public start[Program] program(str s) {
return parse(#start[Program], s);
}
public start[Program] program(str s, loc l) {
return parse(#start[Program], s, l);
}
Abstract:
module BlaTest::Abstract
public data TYPE = natural() | string() | exareal();
public data FIELDLOCATION = exacell() | exanode();
public alias ExaIdentifier = str;
public data PROGRAM = program(list[OPTION] exadomain, list[DECL] decls, list[STATEMENT] stats);
public data DOMAIN
= domain_declaration(list[OPTION] options)
;
public data OPTION
= exaoption(ExaIdentifier name, EXP exp)
;
public data DECL
= variable_declaration(ExaIdentifier name, TYPE tp)
| field_declaration(TYPE tp, FIELDLOCATION fieldlocation, ExaIdentifier name)
;
public data EXP
= id(ExaIdentifier name)
| naturalConstant(int iVal)
| stringConstant(str sVal)
| realConstant(real rVal)
| add(EXP left, EXP right)
| sub(EXP left, EXP right)
| conc(EXP left, EXP right)
;
public data STATEMENT
= asgStat(ExaIdentifier name, EXP exp)
| ifElseStat(EXP exp, list[STATEMENT] thenpart, list[STATEMENT] elsepart)
| whileStat(EXP exp, list[STATEMENT] body)
;
anno loc TYPE@location;
anno loc PROGRAM@location;
anno loc DECL@location;
anno loc EXP@location;
anno loc STATEMENT@location;
anno loc OPTION@location;
public alias Occurrence = tuple[loc location, ExaIdentifier name, STATEMENT stat];
Load:
module BlaTest::Load
import IO;
import Exception;
import Prelude;
import BlaTest::Syntax;
import BlaTest::Abstract;
import BlaTest::ControlFlow;
import BlaTest::Visualize;
public PROGRAM exaload(str txt) {
PROGRAM p;
try {
p = implode(#PROGRAM, parse(#Program, txt));
} catch ParseError(loc l): {
println("Parse error at line <l.begin.line>, column <l.begin.column>");
}
return p; // return will fail in case of error
}
public Program exaparse(str txt) {
Program p;
try {
p = parse(#Program, txt);
} catch ParseError(loc l): {
println("Parse error at line <l.begin.line>, column <l.begin.column>");
}
return p; // return will fail in case of error
}
Thanks a lot,
Chris
Unfortunately the current implode facility depends on a hidden semantic assumption, namely that the non-terminals in the syntax definition have the same name as the types in the data definitions. So if the non-terminal is called "Program", it should not be called "PROGRAM" but "Program" in the data definition.
We are looking for a smoother way of integrating concrete and abstract syntax trees, but for now please decapitalize your data names.

Translate one term differently in one program

I am trying to write a frontend for a kind of program. There are 2 particularities:
1) When we meet a string beginning with =, I want to read the rest of the string as a formula instead of a string value. For instance, "123", "TRUE", "TRUE+123" are considered having string as type, while "=123", "=TRUE", "=TRUE+123" are considered having Syntax.formula as type. By the way,
(* in syntax.ml *)
and expression =
| E_formula of formula
| E_string of string
...
and formula =
| F_int of int
| F_bool of bool
| F_Plus of formula * formula
| F_RC of rc
and rc =
| RC of int * int
2) Inside the formula, some strings are interpreted differently from outside. For instance, in a command R4C5 := 4, R4C5, which is actually a variable, is considered an identifier, while in "=123+R4C5", which is to be translated to a formula, R4C5 is translated as RC (4,5): rc.
So I don't know how to realize this with 1 or 2 lexers, and 1 or 2 parsers.
At the moment, I am trying to do it all in 1 lexer and 1 parser. Here is part of the code, which doesn't work: it still considers R4C5 an identifier instead of an rc:
(* in lexer.mll *)
let begin_formula = double_quote "="
let end_formula = double_quote
let STRING = double_quote ([^ "=" ])* double_quote
rule token = parse
...
| begin_formula { BEGIN_FORMULA }
| 'R' { R }
| 'C' { C }
| end_formula { END_FORMULA }
| lex_identifier as li
{ try Hashtbl.find keyword_table (lowercase li)
with Not_found -> IDENTIFIER li }
| STRING as s { STRING s }
...
(* in parser.mly *)
expression:
| BEGIN_FORMULA f = formula END_FORMULA { E_formula f }
| s = STRING { E_string s }
...
formula:
| i = INTEGER { F_int i }
| b = BOOL { F_bool b }
| f0 = formula PLUS f1 = formula { F_Plus (f0, f1) }
| rc { F_RC $1 }
rc:
| R i0 = INTEGER C i1 = INTEGER { RC (i0, i1) }
Could anyone help?
New idea: I am thinking of sticking with 1 lexer + 1 parser and creating an entry point for formulas in the lexer, as we normally do for comments. Here are some updates in lexer.mll and parser.mly:
(* in lexer.mll *)
rule token = parse
...
| begin_formula { formula lexbuf }
...
| INTEGER as i { INTEGER (int_of_string i) }
| '+' { PLUS }
...
and formula = parse
| end_formula { token lexbuf }
| INTEGER as i { INTEGER_F (int_of_string i) }
| 'R' { R }
| 'C' { C }
| '+' { PLUS_F }
| _ { raise (Lexing_error ("unknown in formula")) }
(* in parser.mly *)
expression:
| formula { E_formula f }
...
formula:
| i = INTEGER_F { F_int i }
| f0 = formula PLUS_F f1 = formula { F_Plus (f0, f1) }
...
I have done some tests, for instance parsing "=R4". The problem is that it recognizes R correctly, but it considers 4 an INTEGER instead of an INTEGER_F. It seems that formula lexbuf needs to be called again from time to time in the body of the formula entry point (though I don't understand why lexing in the body of the token entry point works without always mentioning token lexbuf). I have tried several possibilities: | 'R' { R; formula lexbuf }, | 'R' { formula lexbuf; R }, etc., but none of them worked. Could anyone help?
I think the simplest choice would be to have two different lexers and two different parsers; call the lexer and parser for formulas from inside the global parser. After the fact you can see how much is shared between the two grammars, and factor things out where possible.
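The idea of a separate reader for formulas that is called from the outer one can be sketched in a few lines. The example below is written in Java rather than with ocamllex and menhir, and it is only an illustration under assumptions: the class FormulaModeDemo and its methods are hypothetical, results are rendered as strings that imitate the constructors from syntax.ml, and the formula reader handles only integers, RnCm references and "+" (booleans are left out).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; it imitates the two-reader structure, not the OCaml code.
class FormulaModeDemo {
    private static final Pattern RC = Pattern.compile("R(\\d+)C(\\d+)");

    /** Outer reader: string contents starting with '=' are handed to the formula reader. */
    static String readStringLiteral(String contents) {
        if (contents.startsWith("=")) {
            return "E_formula(" + readFormula(contents.substring(1)) + ")";
        }
        return "E_string(\"" + contents + "\")";
    }

    /** Formula reader: a left-associative sum of terms. */
    static String readFormula(String text) {
        int plus = text.lastIndexOf('+');
        if (plus >= 0) {
            return "F_Plus(" + readFormula(text.substring(0, plus))
                    + ", " + readTerm(text.substring(plus + 1)) + ")";
        }
        return readTerm(text);
    }

    /** Inside a formula, RnCm is a cell reference, never an identifier. */
    static String readTerm(String term) {
        Matcher m = RC.matcher(term);
        if (m.matches()) {
            return "F_RC(RC(" + m.group(1) + ", " + m.group(2) + "))";
        }
        if (term.matches("\\d+")) {
            return "F_int(" + term + ")";
        }
        throw new IllegalArgumentException("unknown in formula: " + term);
    }

    public static void main(String[] args) {
        System.out.println(readStringLiteral("=123+R4C5")); // E_formula(F_Plus(F_int(123), F_RC(RC(4, 5))))
        System.out.println(readStringLiteral("R4C5"));      // E_string("R4C5")
    }
}
```

Because the formula reader only ever sees the text after "=", it can give R4C5 its own meaning without touching the rules that treat R4C5 as an identifier elsewhere.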

Resources