Why are my syntax errors in Jison not being "propagated"?

This is the code that I have:
%lex
%options flex
%{
// Used to store the parsed data
if (!('regions' in yy)) {
yy.regions = {
settings: {},
tables: [],
relationships: []
};
}
%}
text [a-zA-Z][a-zA-Z0-9]*
%%
\n\s* return 'NEWLINE';
[^\S\n]+ ; // ignore whitespace other than newlines
"." return '.';
"," return ',';
"-" return '-';
"=" return '=';
"=>" return '=>';
"<=" return '<=';
"[" return '[';
"settings]" return 'SETTINGS';
"tables]" return 'TABLES';
"relationships]" return 'RELATIONSHIPS';
"]" return ']';
{text} return 'TEXT';
<<EOF>> return 'EOF';
/lex
%left ','
%start source
%%
source
: content EOF
{
console.log(yy.regions);
console.log("\n" + JSON.stringify(yy.regions));
return yy.regions;
}
| NEWLINE content EOF
{
console.log(yy.regions);
console.log("\n" + JSON.stringify(yy.regions));
return yy.regions;
}
| NEWLINE EOF
| EOF
;
content
: '[' section content
| '[' section
;
section
: SETTINGS NEWLINE settings_content
| TABLES NEWLINE tables_content
| RELATIONSHIPS NEWLINE relationships_content
;
settings_content
: settings_line NEWLINE settings_content
| settings_line NEWLINE
| settings_line
;
settings_line
: text '=' text
{ yy.regions.settings[$1] = $3; }
;
tables_content
: tables_line NEWLINE tables_content
| tables_line NEWLINE
| tables_line
;
tables_line
: table_name
{ yy.regions.tables.push({ name: $table_name, fields: [] }); }
| field_list
{
var tableCount = yy.regions.tables.length;
var tableIndex = tableCount - 1;
yy.regions.tables[tableIndex].fields.push($field_list);
}
;
table_name
: '-' text
{ $$ = $text; }
;
field_list
: text
{ $$=[]; $$.push($text); }
| field_list ',' text
{ $field_list.push($text); $$ = $field_list; }
;
relationships_content
: relationships_line NEWLINE relationships_content
| relationships_line NEWLINE
| relationships_line
;
relationships_line
: relationship_key '=>' relationship_key
{
yy.regions.relationships.push({
pkTable: $1,
fkTable: $3
});
}
| relationship_key '<=' relationship_key
{
yy.regions.relationships.push({
pkTable: $3,
fkTable: $1
});
}
;
relationship_key
: text '.' text
{ $$ = { name: $1, field: $3 }; }
| text
{ $$ = { name: $1 }; }
;
text
: TEXT
{ $$ = $TEXT; }
;
It's used to parse this kind of code:
[settings]
DefaultFieldType = string
[tables]
-table1
id, int, PK
username, string, NULL
password, string
-table2
id, int, PK
itemName, string
itemCount, int
[relationships]
table1 => table2
foo.test => bar.test2
Into this kind of JSON:
{ settings: { DefaultFieldType: 'string' },
tables:
[ { name: 'table1', fields: [Object] },
{ name: 'table2', fields: [Object] } ],
relationships:
[ { pkTable: [Object], fkTable: [Object] },
{ pkTable: [Object], fkTable: [Object] } ] }
However, I don't get a proper syntax error when the input is invalid. When I go to the Jison demo and try to parse 5*PI 3^2, I get the following error:
Parse error on line 1:
5*PI 3^2
-----^
Expecting 'EOF', '+', '-', '*', '/', '^', ')', got 'NUMBER'
which is expected. But when I change the last line of the code which I wish to parse from:
foo.test => bar.test2
to something like
foo.test => a bar.test2
I get the following error:
throw new _parseError(str, hash);
^
TypeError: Function.prototype.toString is not generic
I traced this to the generated parser code which looks like this:
if (hash.recoverable) {
this.trace(str);
} else {
function _parseError (msg, hash) {
this.message = msg;
this.hash = hash;
}
_parseError.prototype = Error;
throw new _parseError(str, hash);
}
So this leads me to believe that there is something wrong in how I structured my code and how I handled parsing but I have no idea what that might be.
It seems like it might have something to do with error recovery. If that is correct, how is that supposed to be used? Am I supposed to add the 'error' rule upwards to every element all the way to the source root?

Your grammar seems to work as expected on the Jison demo page, at least with the browser I'm using (Firefox 46.0.1). Judging by the amount of activity in the git repository around the code that you cite, I suspect that the version of Jison you are using has one of these bugs:
https://github.com/zaach/jison/issues/328
https://github.com/zaach/jison/issues/318
I think the Jison version on the demo page is older, not newer, so if grabbing the current code from GitHub doesn't work, you could try using an older version.
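As a rough illustration of why the generated snippet quoted in the question can surface as a TypeError instead of a parse error: assigning the Error constructor itself as the prototype makes thrown objects inherit toString from Function.prototype, which refuses to run on a non-function. The sketch below reproduces that pattern in isolation. The names BrokenParseError, FixedParseError and describe are made up for this example, and the "fixed" variant is only one conventional way to set up the inheritance, not necessarily what any Jison release does. The exact TypeError message also differs between JavaScript engines and versions.

```javascript
// Same shape as the code in the generated parser quoted in the question.
function BrokenParseError(msg, hash) {
    this.message = msg;
    this.hash = hash;
}
// Problem: the prototype is the Error *constructor*, not Error.prototype,
// so instances inherit toString from Function.prototype.
BrokenParseError.prototype = Error;

// Hypothetical repair: inherit from Error.prototype instead.
function FixedParseError(msg, hash) {
    this.message = msg;
    this.hash = hash;
}
FixedParseError.prototype = Object.create(Error.prototype);
FixedParseError.prototype.constructor = FixedParseError;
FixedParseError.prototype.name = 'ParseError';

// Converts an error to a string the way a runtime would when reporting it,
// and reports the failure if the conversion itself throws.
function describe(err) {
    try {
        return String(err);
    } catch (e) {
        return 'stringification failed: ' + e.name;
    }
}

const broken = describe(new BrokenParseError('Parse error on line 1', {}));
const fixed = describe(new FixedParseError('Parse error on line 1', {}));
console.log(broken);
console.log(fixed);
```

With the broken variant, the original parse error message is lost as soon as something tries to print the thrown object, which matches the symptom described in the question.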

Related

How can I tell pest.rs to flatten grammar?

Let's say I have a rule like,
key = { ASCII_ALPHA ~ ( ASCII_ALPHA | "_" )+ }
value = { (!NEWLINE ~ ANY)+ }
keyvalue = { key ~ "=" ~ value? }
option = { key }
This supports a
K=V
K=
K
This is what I want: a way to set or unset a key, and a way to specify an option. What I don't like is the syntax for option, which produces an AST like this,
rule: option,
span: Span {
str: "check_local_user",
start: 302,
end: 318,
},
inner: [
Pair {
rule: key,
span: Span {
str: "check_local_user",
start: 302,
end: 318,
},
inner: [],
},
],
I don't like that my option has an inner pair with key. I just want option to have the same grammar as key. Is there any method in Pest.rs to write the grammar such that
inner = { myStuff }
outer = { inner }
gets flattened to
outer = { myStuff }
Using the atomic rule modifier @, I could accomplish this.
option = @{ key }
It's documented as,
Any rules called by atomic rules do not generate token pairs.

Processing character returned by yyless within a start condition in yacc

For the code snippet below, the "ASSN: =" block for {EQ} is not triggered for an input of "CC=gcc\n". I don't understand why: the equals character is being passed on, since it is processed by the next rule, for {CHAR}.
How can I ensure that the {EQ} rule is processed when the equals character is 'pushed' back by yyless?
The byacc code is pretty much empty with a single dummy rule, but with the relevant %token lines.
%{
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include "y.tab.h"
extern YYSTYPE yylval;
%}
%x ASSIGNMENT
%option noyywrap
DIGIT [0-9]
ALPHA [A-Za-z]
SPACE [ ]
TAB [\t]
WS [ \t]+
NEWLINE (\n|\r|\r\n)
IDENT [A-Za-z_][A-Za-z_0-9]+
EQ =
CHAR [^\r\n]+
%%
<*>"#"{CHAR}{NEWLINE}
({IDENT}{EQ})|({IDENT}{WS}{EQ}) {
yylval.strval = strndup(yytext,
strlen(yytext)-1);
printf("NORM: %s\n", yylval.strval);
yyless(strlen(yytext)-1);
BEGIN(ASSIGNMENT);
return TOK_IDENT;
}
<ASSIGNMENT>{
{EQ} {
printf("ASSN: =\n");
return TOK_ASSIGN;
}
{CHAR} {
printf("ASSN: %s\n", yytext);
return TOK_STRING;
}
{NEWLINE} {
BEGIN(INITIAL);
}
}
{WS}
{NEWLINE}
. {
printf("DOT : %s\n", yytext);
}
<*><<EOF>> {
printf("EOF\n");
return 0;
}
%%
int main()
{
printf("Start\n\n");
int ret;
while( (ret = yylex()) ) {
printf("LEX : %u\n", ret);
}
printf("\nEnd\n");
}
Example output:
Start
NORM: CC
LEX : 257
ASSN: =gcc
LEX : 259
EOF
End
My issue was that flex always prefers the longest match, so {CHAR}, which also matches "=gcc", was always winning over {EQ}. I solved this by introducing another Start Condition to consume the {EQ}{WS}? before passing to
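
The longest-match behaviour described above can be mimicked in a few lines. This is a simplified JavaScript model, not flex itself: the rules table and the pickRule function are invented for illustration, and they only imitate the selection policy (longest match wins, and on a tie the rule listed first wins) for the two ASSIGNMENT rules from the question.

```javascript
// Hypothetical miniature of the two rules active in the ASSIGNMENT
// start condition, in the order they appear in the question.
const rules = [
    { name: 'EQ', pattern: /^=/ },
    { name: 'CHAR', pattern: /^[^\r\n]+/ },
];

// Picks the rule whose match at the start of the input is longest.
// A later rule only replaces an earlier one if its match is strictly longer.
function pickRule(input, ruleList) {
    let best = null;
    for (const rule of ruleList) {
        const m = rule.pattern.exec(input);
        if (m && (best === null || m[0].length > best.text.length)) {
            best = { name: rule.name, text: m[0] };
        }
    }
    return best;
}

// Input left over after yyless() pushed back the "=" in "CC=gcc\n".
const afterYyless = pickRule('=gcc\n', rules);
// A bare "=" followed by a newline: both rules match one character.
const loneEquals = pickRule('=\n', rules);
console.log(afterYyless);
console.log(loneEquals);
```

In this model {EQ} only wins when nothing but the "=" precedes the newline, which is consistent with the "ASSN: =gcc" line in the example output.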

How to build a JvmModelInferrer method body from an XExpression and append boilerplate code

In a JvmModelInferrer, when constructing the body of a method or constructor, how do you insert both an XExpression from the grammar
body = op.body
and additional "boilerplate" code, for example
body = [
append(
'''
System.out.println("BOILERPLATE");
'''
)
]
I can achieve either but not both.
For a minimal working example, consider the following canonical Xbase grammar,
grammar org.example.xbase.entities.Entities with org.eclipse.xtext.xbase.Xbase
generate entities "http://www.example.org/xbase/entities/Entities"
Model:
importSection=XImportSection?
entities+=Entity*;
Entity:
'entity' name=ID ('extends' superType=JvmParameterizedTypeReference)? '{'
attributes += Attribute*
operations += Operation*
'}';
Attribute:
'attr' (type=JvmTypeReference)? name=ID ('=' initexpression=XExpression)? ';';
Operation:
'op' (type=JvmTypeReference)? name=ID
'(' (params+=FullJvmFormalParameter (',' params+=FullJvmFormalParameter)*)? ')'
body=XBlockExpression;
and JvmModelInferrer,
package org.example.xbase.entities.jvmmodel
import com.google.inject.Inject
import org.eclipse.xtext.xbase.jvmmodel.AbstractModelInferrer
import org.eclipse.xtext.xbase.jvmmodel.IJvmDeclaredTypeAcceptor
import org.eclipse.xtext.xbase.jvmmodel.JvmTypesBuilder
import org.example.xbase.entities.entities.Entity
class EntitiesJvmModelInferrer extends AbstractModelInferrer {
@Inject extension JvmTypesBuilder
def dispatch void infer(Entity entity, IJvmDeclaredTypeAcceptor acceptor, boolean isPreIndexingPhase) {
acceptor.accept(entity.toClass("entities."+entity.name)) [
documentation = entity.documentation
if (entity.superType != null)
superTypes += entity.superType.cloneWithProxies
entity.attributes.forEach[
a |
val type = a.type ?: a.initexpression?.inferredType
members += a.toField(a.name, type) [
documentation = a.documentation
if (a.initexpression != null)
initializer = a.initexpression
]
members += a.toGetter(a.name, type)
members += a.toSetter(a.name, type)
]
entity.operations.forEach[
op |
members += op.toMethod(op.name, op.type ?: inferredType) [
documentation = op.documentation
for (p : op.params) {
parameters += p.toParameter(p.name, p.parameterType)
}
// body = [
// append(
// '''
// System.out.println("BOILERPLATE");
// '''
// )
// ]
body = op.body
]
]
]
}
}
As the comments suggest, I would like to insert "boilerplate" code into the body of the method, before the XExpression itself. While I can insert the boilerplate, or the expression, I cannot work out how to do both.
This does not work. The only thing you can do is to infer two methods:
methodWithBoilerplate() {
//pre
methodwithoutboilerplate
//post
}
methodwithoutboilerplate() {
usercode goes here
}
For the first, use body = '''code here'''.
For the second, use body = op.body.

Parsing alphanumeric identifiers with underscores and calling from top level parser

I'm trying to do something similar to this question, except that underscores should be allowed from the second character onwards, not just camel case.
I can test the parser in isolation successfully, but when it is composed into a higher-level parser, I get errors.
Take the following example:
#![allow(dead_code)]
#[macro_use]
extern crate nom;
use nom::*;
type Bytes<'a> = &'a [u8];
#[derive(Clone,PartialEq,Debug)]
pub enum Statement {
IF,
ELSE,
ASSIGN((String)),
IDENTIFIER(String),
EXPRESSION,
}
fn lowerchar(input: Bytes) -> IResult<Bytes, char>{
if input.is_empty() {
IResult::Incomplete(Needed::Size(1))
} else if (input[0] as char)>='a' && 'z'>=(input[0] as char) {
IResult::Done(&input[1..], input[0] as char)
} else {
IResult::Error(error_code!(ErrorKind::Custom(1)))
}
}
named!(identifier<Bytes,Statement>, map_res!(
recognize!(do_parse!(
lowerchar >>
//alt_complete! is not having the effect it's supposed to so the alternatives need to be ordered from shortest to longest
many0!(alt!(
complete!(is_a!("_"))
| complete!(take_while!(nom::is_alphanumeric))
)) >> ()
)),
|id: Bytes| {
//println!("{:?}",std::str::from_utf8(id).unwrap().to_string());
Result::Ok::<Statement, &str>(
Statement::IDENTIFIER(std::str::from_utf8(id).unwrap().to_string())
)
}
));
named!(expression<Bytes,Statement>, alt_complete!(
identifier //=> { |e: Statement| e }
//| assign_expr //=> { |e: Statement| e }
| if_expr //=> { |e: Statement| e }
));
named!(if_expr<Bytes,Statement>, do_parse!(
if_statement: preceded!(
tag!("if"),
delimited!(tag!("("),expression,tag!(")"))
) >>
//if_statement: delimited!(tag!("("),tag!("hello"),tag!(")")) >>
if_expr: expression >>
//else_statement: opt_res!(tag!("else")) >>
(Statement::IF)
));
#[cfg(test)]
mod tests {
use super::*;
use IResult::*;
//use IResult::Done;
#[test]
fn ident_token() {
assert_eq!(identifier(b"iden___ifiers"), Done::<Bytes, Statement>(b"" , Statement::IDENTIFIER("iden___ifiers".to_string())));
assert_eq!(identifier(b"iden_iFIErs"), Done::<Bytes, Statement>(b"" , Statement::IDENTIFIER("iden_iFIErs".to_string())));
assert_eq!(identifier(b"Iden_iFIErs"), Error(ErrorKind::Custom(1))); // Supposed to fail since not valid
assert_eq!(identifier(b"_den_iFIErs"), Error(ErrorKind::Custom(1))); // Supposed to fail since not valid
}
#[test]
fn if_token() {
assert_eq!(if_expr(b"if(a)a"), Error(ErrorKind::Alt)); // Should have passed
assert_eq!(if_expr(b"if(hello)asdas"), Error(ErrorKind::Alt)); // Should have passed
}
#[test]
fn expr_parser() {
assert_eq!(expression(b"iden___ifiers"), Done::<Bytes, Statement>(b"" , Statement::IDENTIFIER("iden___ifiers".to_string())));
assert_eq!(expression(b"if(hello)asdas"), Error(ErrorKind::Alt)); // Should have been able to recognise an IF statement via expression parser
}
}
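For reference, the identifier shape that the tests above exercise can be written down as a single regular expression. This is a JavaScript sketch rather than nom code, and it reflects one reading of the intended rule (one lowercase ASCII letter, then any mix of ASCII letters, digits and underscores). IDENTIFIER and matchIdentifier are names made up for this illustration.

```javascript
// Assumed reading of the intended rule: a lowercase ASCII letter first,
// then zero or more ASCII letters, digits or underscores.
const IDENTIFIER = /^[a-z][A-Za-z0-9_]*/;

// Returns the matched identifier and the remaining input, or null when
// the input does not start with a valid identifier.
function matchIdentifier(input) {
    const m = IDENTIFIER.exec(input);
    return m ? { identifier: m[0], rest: input.slice(m[0].length) } : null;
}

// Inputs taken from the tests in the question, plus the text inside "if(".
const cases = ['iden___ifiers', 'iden_iFIErs', 'Iden_iFIErs', '_den_iFIErs', 'hello)asdas'];
const results = cases.map(matchIdentifier);
console.log(results);
```

The last case shows the behaviour a composed parser depends on: the identifier has to stop at ")" and hand the rest of the input back to the caller.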

changing text of rule in antlr4 using setText

I want to change every entry in a CSV file to 'BlahBlah'.
For that I have this ANTLR grammar:
grammar CSV;
file : hdr row* row1;
hdr : row;
row : field (',' value1=field)* '\r'? '\n'; // '\r' is optional at the end of a row of CSV file ..
row1 : field (',' field)* '\r'? '\n'?;
field
: TEXT
{
$setText("BlahBlah");
}
| STRING
|
;
TEXT : ~[,\n\r"]+ ;
STRING : '"' ('""' | ~'"')* '"' ;
But when I run ANTLR 4 on this grammar, I get:
error(63): CSV.g4:13:3: unknown attribute reference setText in $setText
make: *** [run] Error 1
Why is setText not supported in ANTLR 4, and is there an alternative way to replace the text?
There are a couple of problems here.
First, you have to identify the receiver of the setText method. You probably want
field : TEXT { $TEXT.setText("BlahBlah"); }
| STRING
;
Second, setText is not defined in the Token interface.
Typically, you create your own token class extending CommonToken, together with a corresponding token factory class, and set TokenLabelType (in the options block) to your token class name. The setText method in CommonToken will then be visible.
tl;dr:
Given the following grammar (derived from original CSV.g4 sample and grammar attempt of OP (cf. question)):
grammar CSVBlindText;
@header {
import java.util.*;
}
/** Derived from rule "file : hdr row+ ;" */
file
locals [int i=0]
: hdr ( rows+=row[$hdr.text.split(",")] {$i++;} )+
{
System.out.println($i+" rows");
for (RowContext r : $rows) {
System.out.println("row token interval: "+r.getSourceInterval());
}
}
;
hdr : row[null] {System.out.println("header: '"+$text.trim()+"'");} ;
/** Derived from rule "row : field (',' field)* '\r'? '\n' ;" */
row[String[] columns] returns [Map<String,String> values]
locals [int col=0]
@init {
$values = new HashMap<String,String>();
}
@after {
if ($values!=null && $values.size()>0) {
System.out.println("values = "+$values);
}
}
// rule row cont'd...
: field
{
if ($columns!=null) {
$values.put($columns[$col++].trim(), $field.text.trim());
}
}
( ',' field
{
if ($columns!=null) {
$values.put($columns[$col++].trim(), $field.text.trim());
}
}
)* '\r'? '\n'
;
field
: TEXT
| STRING
|
;
TEXT : ~[',\n\r"]+ {setText( "BlahBlah" );} ;
STRING : '"' ('""'|~'"')* '"' ; // quote-quote is an escaped quote
One has:
$> antlr4 -no-listener CSVBlindText.g4
$> grep setText CSVBlindText*java
CSVBlindTextLexer.java: setText( "BlahBlah" );
Compiling it works flawlessly:
$> javac CSVBlindText*.java
Testdata (the users.csv file just renamed):
$> cat blinded_by_grammar.csv
User, Name, Dept
parrt, Terence, 101
tombu, Tom, 020
bke, Kevin, 008
Yields in test:
$> grun CSVBlindText file blinded_by_grammar.csv
header: 'BlahBlah,BlahBlah,BlahBlah'
values = {BlahBlah=BlahBlah}
values = {BlahBlah=BlahBlah}
values = {BlahBlah=BlahBlah}
3 rows
row token interval: 6..11
row token interval: 12..17
row token interval: 18..23
So it looks as if the setText() should be injected before the semicolon of a production and not between alternatives (wild guessing here ;-)
Previous iterations below:
Just guessing, as I 1) have no working ANTLR 4 available currently and 2) have not written ANTLR 4 grammars for quite some time: maybe without the dollar sign ($)?
grammar CSV;
file : hdr row* row1;
hdr : row;
row : field (',' value1=field)* '\r'? '\n'; // '\r' is optional at the end of a row of CSV file ..
row1 : field (',' field)* '\r'? '\n'?;
field
: TEXT
{
setText("BlahBlah");
}
| STRING
|
;
TEXT : ~[,\n\r"]+ ;
STRING : '"' ('""' | ~'"')* '"' ;
Update: Now that I have ANTLR 4.5.2 available (at least via brew; not 4.5.3), I dug into this, also to answer a comment from the OP below: the setText() call will be generated in the lexer Java module if the grammar is well defined. Unfortunately, debugging ANTLR 4 grammars for a dilettante like me is ... but it is nevertheless a very nice language construction kit IMO.
Sample session:
$> antlr4 -no-listener CSV.g4
$> grep setText CSVLexer.java
setText( String.valueOf(getText().charAt(1)) );
The grammar used:
(hacked up from example code retrieved via:
curl -O http://media.pragprog.com/titles/tpantlr2/code/tpantlr2-code.tgz )
grammar CSV;
@header {
import java.util.*;
}
/** Derived from rule "file : hdr row+ ;" */
file
locals [int i=0]
: hdr ( rows+=row[$hdr.text.split(",")] {$i++;} )+
{
System.out.println($i+" rows");
for (RowContext r : $rows) {
System.out.println("row token interval: "+r.getSourceInterval());
}
}
;
hdr : row[null] {System.out.println("header: '"+$text.trim()+"'");} ;
/** Derived from rule "row : field (',' field)* '\r'? '\n' ;" */
row[String[] columns] returns [Map<String,String> values]
locals [int col=0]
@init {
$values = new HashMap<String,String>();
}
@after {
if ($values!=null && $values.size()>0) {
System.out.println("values = "+$values);
}
}
// rule row cont'd...
: field
{
if ($columns!=null) {
$values.put($columns[$col++].trim(), $field.text.trim());
}
}
( ',' field
{
if ($columns!=null) {
$values.put($columns[$col++].trim(), $field.text.trim());
}
}
)* '\r'? '\n'
;
field
: TEXT
| STRING
| CHAR
|
;
TEXT : ~[',\n\r"]+ ;
STRING : '"' ('""'|~'"')* '"' ; // quote-quote is an escaped quote
/** Convert 3-char 'x' input sequence to string x */
CHAR: '\'' . '\'' {setText( String.valueOf(getText().charAt(1)) );} ;
Compiling works:
$> javac CSV*.java
Now test with a matching weird csv file:
a,b
"y",'4'
As:
$> grun CSV file foo.csv
line 1:0 no viable alternative at input 'a'
line 1:2 no viable alternative at input 'b'
header: 'a,b'
values = {a="y", b=4}
1 rows
row token interval: 4..7
So in conclusion, I suggest reworking the logic of the grammar (I presume inserting "BlahBlahBlah" was not essential but merely a debugging hack).
And citing http://www.antlr.org/support.html :
ANTLR Discussions
Please do not start discussions at stackoverflow. They have asked us to
steer discussions (i.e., non-questions/answers) away from Stackoverflow; we
have a discussion forum at Google specifically for that:
https://groups.google.com/forum/#!forum/antlr-discussion
We can discuss ANTLR project features, direction, and generally argue about
whatever we want at the google discussion forum.
I hope this helps.
