Basic problem with yacc/lex - parsing

I have some problems with a very simple yacc/lex program. I may have forgotten some basic steps (it's been a long time since I've used these tools).
In my lex file I define some basic rules, like:
word    [a-zA-Z][a-zA-Z]*
%%
":"       return(PV);
{word}    {
              yylval = yytext;
              printf("yylval = %s\n", yylval);
              return(WORD);
          }
"\n"      return(ENDLINE);
In my yacc file, the beginning of my grammar (where TranslationUnit is my %start) is:
TranslationUnit:
      /* Nothing */
    | InfoBlock Data
    ;
InfoBlock:
      /* Nothing */
    | InfoBlock InfoExpression {}
    ;
InfoExpression:
      WORD PV WORD ENDLINE { printf("$1 = %s\n", $1);
                             printf("$2 = %s\n", $2);
                             printf("$3 = %s\n", $3);
                             printf("$4 = %s\n", $4);
                           }
    | ... /* other things */
    ;
Data:
    ... /* other things */
When I run my program with the input:
keyword : value
I thought that I would get at least:
$1 = keyword
$2 = keyword // yylval not changed for token PV
$3 = value
$4 = value // yylval not changed for token ENDLINE
Actually I get:
$1 = keyword : value
$2 = keyword : value
$3 = value
$4 = value
I do not understand this result. I studied grammars some time ago, and even though I don't remember everything right now, I don't see any major mistake...
Thanks in advance for your help.

The trouble is that unless you save the token, Lex/Yacc goes on to overwrite the space, or point to different memory, etc. So you need to stash the information that's crucial to you before it gets modified. Your printing in the Lex code should have shown you that the yylval values were accurate at the point when the lexer (lexical analyzer) was called.
See also SO 2696470 where the same basic problem was encountered and diagnosed.
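A minimal sketch of the usual fix, assuming yylval is declared as char * (e.g. #define YYSTYPE char * visible to both files, with <string.h> included in the lex prologue): duplicate the lexeme with strdup() before returning, because yytext points into the scanner's internal buffer, which is rewritten as scanning continues.
{word}    {
              yylval = strdup(yytext);   /* copy out of the scanner's buffer */
              printf("yylval = %s\n", yylval);
              return(WORD);
          }
With this change, $1 and $3 print "keyword" and "value" as expected; $2 and $4 still print stale values because the PV and ENDLINE rules never assign yylval. Remember to free() the duplicated strings in the parser actions once you are done with them.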

Related

Implement heredocs with trim indent using PEG.js

I'm working on a language similar to Ruby called gaiman and I'm using PEG.js to generate the parser.
Do you know if there is a way to implement heredocs with proper indentation?
xxx = <<<END
hello
world
END
The output should be:
"hello
world"
I need this because this code doesn't look very nice:
def foo(arg) {
    if arg == "here" then
        return <<<END
xxx
xxx
END
    end
end
This is a function where the user wants to return:
"xxx
xxx"
I would prefer the code to look like this:
def foo(arg) {
    if arg == "here" then
        return <<<END
        xxx
        xxx
        END
    end
end
If I trim all the lines, the user will not be able to use a string with leading spaces when they want one. Does anyone know if PEG.js allows this?
I don't have any code for heredocs yet; I just want to be sure that what I want is possible.
EDIT:
So I've tried to implement heredocs, and the problem is that PEG.js doesn't allow back-references.
heredoc = "<<<" marker:[\w]+ "\n" text:[\s\S]+ marker {
    return text.join('');
}
It says that the marker is not defined. As for trimming, I think I can use the location() function.
I don't think that's a reasonable expectation for a parser generator; few if any would be equal to the challenge.
For a start, recognising the here-string syntax is inherently context-sensitive, since the end-delimiter must be a precise copy of the delimiter provided after the <<< token. So you would need a custom lexical analyser, and that means that you need a parser generator which allows you to use a custom lexical analyser. (So a parser generator which assumes you want a scannerless parser might not be the optimal choice.)
Recognising the end of the here-string token shouldn't be too difficult, although you can't do it with a single regular expression. My approach would be to use a custom scanning function which breaks the here-string into a series of lines, concatenating them as it goes until it reaches a line containing only the end-delimiter.
Once you've recognised the text of the literal, all you need to normalise the spaces in the way you want is the column number at which the <<< starts. With that, you can trim each line in the string literal. So you only need a lexical scanner which accurately reports token position. Trimming wouldn't normally be done inside the generated lexical scanner; rather, it would be the associated semantic action. (Equally, it could be a semantic action in the grammar. But it's always going to be code that you write.)
When you trim the literal, you'll need to deal with the cases in which it is impossible, because the user has not respected the indentation requirement. And you'll need to do something with tab characters; getting those right probably means that you'll want a lexical scanner which computes visible column positions rather than character offsets.
I don't know if peg.js corresponds with those requirements, since I don't use it. (I did look at the documentation, and failed to see any indication as to how you might incorporate a custom scanner function. But that doesn't mean there isn't a way to do it.) I hope that the discussion above at least lets you check the detailed documentation for the parser generator you want to use, and otherwise find a different parser generator which will work for you in this use case.
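To make the scanning approach above concrete, here is a minimal sketch in C; it is deliberately not tied to PEG.js or any generator API, and all names are invented for illustration. It collects body lines, stripping a fixed indent (the column at which the <<< token starts) from each, until it reaches a line containing only the end-delimiter.
#include <stdio.h>
#include <string.h>

/* Sketch: read lines until one consists solely of `delim` (after at
 * most `indent` leading spaces), copying each stripped body line into
 * `body`. Returns the number of body lines, or -1 if EOF is reached
 * before the delimiter. Tab handling and reallocation are omitted. */
static int scan_heredoc(FILE *fp, const char *delim, int indent,
                        char body[][256], int max_lines)
{
    char line[256];
    int n = 0;
    while (fgets(line, sizeof line, fp)) {
        line[strcspn(line, "\n")] = '\0';
        int skip = 0;
        while (skip < indent && line[skip] == ' ')
            skip++;                       /* tolerate under-indented lines */
        if (strcmp(line + skip, delim) == 0)
            return n;                     /* end-delimiter line found */
        if (n < max_lines)
            strcpy(body[n++], line + skip);
    }
    return -1;                            /* unterminated here-string */
}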
Here is an implementation of heredocs in Peggy, the successor to PEG.js (which is not maintained anymore). This code was based on the GitHub issue.
heredoc = "<<<" begin:marker "\n" text:($any_char+ "\n")+ _ end:marker (
    &{ return begin === end; }
    / '' { error(`Expected matched marker "${begin}", but marker "${end}" was found`); }
) {
    const loc = location();
    const min = loc.start.column - 1;
    const re = new RegExp(`\\s{${min}}`);
    return text.map(line => {
        return line[0].replace(re, '');
    }).join('\n');
}
any_char = (!"\n" .)
marker_char = (!" " !"\n" .)
marker "Marker" = $marker_char+
_ "whitespace"
  = [ \t\n\r]* { return []; }
EDIT: the above didn't work with another piece of code after the heredoc; here is a better grammar:
{ let heredoc_begin = null; }

heredoc = "<<<" beginMarker "\n" text:content endMarker {
    const loc = location();
    const min = loc.start.column - 1;
    const re = new RegExp(`^\\s{${min}}`, 'mg');
    return {
        type: 'Literal',
        value: text.replace(re, '')
    };
}
__ = (!"\n" !" " .)
marker 'Marker' = $__+
beginMarker = m:marker { heredoc_begin = m; }
endMarker = "\n" " "* end:marker &{ return heredoc_begin === end; }
content = $(!endMarker .)*

How to highlight QScintilla using ANTLR4?

I'm trying to learn ANTLR4 and I'm already having some issues with my first experiment.
The goal here is to learn how to use ANTLR to syntax highlight a QScintilla component. To practice a little bit I've decided I'd like to learn how to properly highlight *.ini files.
First things first, in order to run the MCVE you'll need to:
Download ANTLR4 and make sure it works; read the instructions on the main site
Install the Python ANTLR runtime: pip install antlr4-python3-runtime
Generate the lexer/parser of ini.g4:
grammar ini;
start : section (option)*;
section : '[' STRING ']';
option : STRING '=' STRING;
COMMENT : ';' ~[\r\n]*;
STRING : [a-zA-Z0-9]+;
WS : [ \t\n\r]+;
by running antlr ini.g4 -Dlanguage=Python3 -o ini
Finally, save main.py:
import textwrap

from PyQt5.Qt import *
from PyQt5.Qsci import QsciScintilla, QsciLexerCustom

from antlr4 import *
from ini.iniLexer import iniLexer
from ini.iniParser import iniParser


class QsciIniLexer(QsciLexerCustom):

    def __init__(self, parent=None):
        super().__init__(parent=parent)

        lst = [
            {'bold': False, 'foreground': '#f92472', 'italic': False},  # 0 - deeppink
            {'bold': False, 'foreground': '#e7db74', 'italic': False},  # 1 - khaki (yellowish)
            {'bold': False, 'foreground': '#74705d', 'italic': False},  # 2 - dimgray
            {'bold': False, 'foreground': '#f8f8f2', 'italic': False},  # 3 - whitesmoke
        ]
        style = {
            "T__0": lst[3],
            "T__1": lst[3],
            "T__2": lst[3],
            "COMMENT": lst[2],
            "STRING": lst[0],
            "WS": lst[3],
        }

        for token in iniLexer.ruleNames:
            token_style = style[token]

            foreground = token_style.get("foreground", None)
            background = token_style.get("background", None)
            bold = token_style.get("bold", None)
            italic = token_style.get("italic", None)
            underline = token_style.get("underline", None)
            index = getattr(iniLexer, token)

            if foreground:
                self.setColor(QColor(foreground), index)
            if background:
                self.setPaper(QColor(background), index)

    def defaultPaper(self, style):
        return QColor("#272822")

    def language(self):
        return self.lexer.grammarFileName

    def styleText(self, start, end):
        view = self.editor()
        code = view.text()
        lexer = iniLexer(InputStream(code))
        stream = CommonTokenStream(lexer)
        parser = iniParser(stream)

        tree = parser.start()
        print('parsing'.center(80, '-'))
        print(tree.toStringTree(recog=parser))

        lexer.reset()
        self.startStyling(0)

        print('lexing'.center(80, '-'))
        while True:
            t = lexer.nextToken()
            print(lexer.ruleNames[t.type - 1], repr(t.text))
            if t.type != -1:
                len_value = len(t.text)
                self.setStyling(len_value, t.type)
            else:
                break

    def description(self, style_nr):
        return str(style_nr)


if __name__ == '__main__':
    app = QApplication([])
    v = QsciScintilla()
    lexer = QsciIniLexer(v)
    v.setLexer(lexer)
    v.setText(textwrap.dedent("""\
        ; Comment outside
        [section s1]
        ; Comment inside
        a = 1
        b = 2
        [section s2]
        c = 3 ; Comment right side
        d = e
    """))
    v.show()
    app.exec_()
and run it. If everything went well, you should get this outcome:
Here are my questions:
As you can see, the outcome of the demo is far from usable; you definitely don't want that, it's really distracting. Instead, you'd like behaviour similar to that of all the IDEs out there. Unfortunately I don't know how to achieve that. How would you modify the snippet to provide such behaviour?
Right now I'm trying to mimic highlighting similar to the snapshot below:
You can see in that screenshot that the highlighting differs on variable assignments (variables = deeppink, values = yellowish), but I don't know how to achieve that. I've tried using this slightly modified grammar:
grammar ini;
start : section (option)*;
section : '[' STRING ']';
option : VARIABLE '=' VALUE;
COMMENT : ';' ~[\r\n]*;
VARIABLE : [a-zA-Z0-9]+;
VALUE : [a-zA-Z0-9]+;
WS : [ \t\n\r]+;
and then changing the styles to:
style = {
    "T__0": lst[3],
    "T__1": lst[3],
    "T__2": lst[3],
    "COMMENT": lst[2],
    "VARIABLE": lst[0],
    "VALUE": lst[1],
    "WS": lst[3],
}
but if you look at the lexing output you'll see there is no distinction between VARIABLE and VALUE, because of rule precedence in the ANTLR grammar. So my question is: how would you modify the grammar/snippet to achieve such a visual appearance?
The problem is that the lexer needs to be context sensitive: everything on the left hand side of the = needs to be a variable, and to the right of it a value. You can do this by using ANTLR's lexical modes. You start off by classifying successive non-spaces as being a variable, and when encountering a =, you move into your value-mode. When inside the value-mode, you pop out of this mode whenever you encounter a line break.
Note that lexical modes only work in a lexer grammar, not the combined grammar you now have. Also, for syntax highlighting, you probably only need the lexer.
Here's a quick demo of how this could work (stick it in a file called IniLexer.g4):
lexer grammar IniLexer;
SECTION
: '[' ~[\]]+ ']'
;
COMMENT
: ';' ~[\r\n]*
;
ASSIGN
: '=' -> pushMode(VALUE_MODE)
;
KEY
: ~[ \t\r\n]+
;
SPACES
: [ \t\r\n]+ -> skip
;
UNRECOGNIZED
: .
;
mode VALUE_MODE;
VALUE_MODE_SPACES
: [ \t]+ -> skip
;
VALUE
: ~[ \t\r\n]+
;
VALUE_MODE_COMMENT
: ';' ~[\r\n]* -> type(COMMENT)
;
VALUE_MODE_NL
: [\r\n]+ -> skip, popMode
;
If you now run the following script:
source = """
; Comment outside
[section s1]
; Comment inside
a = 1
b = 2
[section s2]
c = 3 ; Comment right side
d = e
"""
lexer = IniLexer(InputStream(source))
stream = CommonTokenStream(lexer)
stream.fill()
for token in stream.tokens[:-1]:
    print("{0:<25} '{1}'".format(IniLexer.symbolicNames[token.type], token.text))
you will see the following output:
COMMENT '; Comment outside'
SECTION '[section s1]'
COMMENT '; Comment inside'
KEY 'a'
ASSIGN '='
VALUE '1'
KEY 'b'
ASSIGN '='
VALUE '2'
SECTION '[section s2]'
KEY 'c'
ASSIGN '='
VALUE '3'
COMMENT '; Comment right side'
KEY 'd'
ASSIGN '='
VALUE 'e'
And an accompanying parser grammar could look like this:
parser grammar IniParser;
options {
tokenVocab=IniLexer;
}
sections
: section* EOF
;
section
: COMMENT
| SECTION section_atom*
;
section_atom
: COMMENT
| KEY ASSIGN VALUE
;
which would parse your example input into the following parse tree:
I already implemented something like this in C++.
https://github.com/tora-tool/tora/blob/master/src/editor/tosqltext.cpp
I sub-classed the QScintilla class and implemented a custom lexer based on ANTLR-generated source.
You might even use the ANTLR parser (I did not use it); QScintilla allows you to have more than one analyzer (with different weights), so you can periodically perform semantic checks on the text. What cannot be done easily in QScintilla is to associate a token with some additional data.
Syntax highlighting in Scintilla is done by dedicated highlighter classes, which are lexers. A parser is not well suited to this kind of work, because the syntax highlighting feature must work even if the input contains errors. A parser is a tool to verify the correctness of the input - two totally different tasks.
So I recommend you stop thinking about using ANTLR4 for that, and just take one of the existing lexer classes and create a new one for the language you want to highlight.

Reducing insane flex lexer expansion?

I have written a flex lexer to handle the text in BYOND's .dmi file format. The contents inside are (key, value) pairs delimited by '='. Valid keys are all essentially keywords (such as "width"), and invalid keys are not errors: they are just ignored.
Interestingly, the current state of BYOND's .dmi parser uses everything prior to the '=' as its keyword, and simply ignores any excess junk. This means "\twidth123" is recognized as "width".
The crux of my problem is in allowing for this irregularity. In doing so my generated lexer expands from ~40-50KB to ~13-14MB. For reference, I present the following contrived example:
%option c++ noyywrap
fill [^=#\n]*
%%
{fill}version{fill} { return 0; }
{fill}width{fill} { return 0; }
{fill}height{fill} { return 0; }
{fill}state{fill} { return 0; }
{fill}dirs{fill} { return 0; }
{fill}frames{fill} { return 0; }
{fill}delay{fill} { return 0; }
{fill}loop{fill} { return 0; }
{fill}rewind{fill} { return 0; }
{fill}movement{fill} { return 0; }
{fill}hotspot{fill} { return 0; }
%%
fill is the rule that is used to merge the keywords with "anything before the =". Running flex on the above yields a ~13MB lex.yy.cc on my computer. Simply removing the Kleene star (*) in the fill rule yields a 45KB lex.yy.cc file; however, obviously, this then makes the lexer incorrect.
Are there any tricks, flex options, or lexer hacks to avoid this insane expansion? The only things I can think of are:
Disallow "width123" to represent "width", which is undesirable as then technically-correct files could not be parsed.
Make one rule that is simply [^=\n]+ to return some identifier token, and pick out the keyword in the parser. This seems suboptimal to me as well, particularly because different keywords have different value types and it seems most natural to be able to handle "'width' '=' INT" and "'version' '=' FLOAT" in the parser instead of "ID '=' VALUE" followed by picking out the keyword in the identifier, making sure the value is of the right type, etc.
I could make the rule {fill}(width|height|version|...){fill}, which does indeed keep the generated file small. However, while regular expression parsers tend to produce "captures," flex just gives me yytext and re-parsing that for a keyword to produce the desired token seems to be very undesirable in terms of algorithmic complexity.
Make fill a separate rule of its own that does nothing, and remove it from all the other rules, and separate its definition from whitespace for clarity:
whitespace [ \t\f]
fill [^#=\n]
%%
{whitespace}+ ;
{fill}+ ;
I would probably also avoid building the keywords into the lexer and just use an identifier [a-zA-Z]+ rule that does a table lookup (sketched after the next snippet). And finally add a rule to catch the =:
. return yytext[0];
to let the parser handle all special characters.
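A minimal sketch of that table lookup in C (the token codes and table contents are hypothetical; in a real setup the codes would come from the generated y.tab.h):
#include <string.h>

/* Hypothetical token codes for illustration. */
enum { KW_WIDTH = 258, KW_HEIGHT, KW_VERSION, IDENT };

static const struct { const char *name; int token; } keywords[] = {
    { "width",   KW_WIDTH   },
    { "height",  KW_HEIGHT  },
    { "version", KW_VERSION },
};

/* Called from the identifier rule:  [a-zA-Z]+  return lookup(yytext); */
static int lookup(const char *word)
{
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(word, keywords[i].name) == 0)
            return keywords[i].token;
    return IDENT;   /* unknown keys are not errors; the parser ignores them */
}
This keeps the generated tables tiny, because the scanner only has to recognize the shape of an identifier, not every keyword pattern.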
This is not really a problem flex is "good at", but it can be solved if it is precisely defined. In particular, it is important to know which of the keywords should be returned if the random string of letters before the = contains more than one keyword. For example, suppose the input is:
garbage_widtheight_moregarbage = 42
Now, is that setting the width or the height?
Remember that flex scanners will choose the rule with longest match, and of rules with equally long matches, the first one in the lexical description.
So the model presented in the OP:
fill [^=#\n]*
%%
{fill}width{fill} { return 0; }
{fill}height{fill} { return 0; }
/* SNIP */
will always prefer width to height, because the matches will be the same length (both terminate at the last character before the =), and the width pattern comes first in the file. If the rules were written in the opposite order, height would be preferred.
On the other hand, if you removed the second {fill}:
{fill}width { return 0; }
{fill}height { return 0; }
then the last keyword in the input (in this case, height) will be preferred, because that one has the longer match.
The most likely requirement, however, is that the first keyword be recognized, so neither of the preceding will work. In order to match the first keyword, it is necessary to first match the shortest possible sequence of {fill}. And since flex does not implement non-greedy repetition, that can only be done with a character-by-character scan.
Here's an example, using start conditions. Note that we hold onto the keyword token until we actually find the =, in case the = is not found.
/* INITIAL: beginning of a line
 * FIND_EQUAL: keyword recognized, looking for the =
 * VALUE: = recognized, lexing the right-hand side
 * NEXT_LINE: find the next line and continue the scan
 */
%x FIND_EQUAL VALUE NEXT_LINE
%%
  int keyword;
[#=].*       /* Skip comments and lines with no recognizable keyword */
version      { keyword = KW_VERSION; BEGIN(FIND_EQUAL); }
width        { keyword = KW_WIDTH; BEGIN(FIND_EQUAL); }
height       { keyword = KW_HEIGHT; BEGIN(FIND_EQUAL); }
  /* etc. */
.|\n         /* Skip any other single character, or newline */
<FIND_EQUAL>{
  [^=#\n]*"="  { BEGIN(VALUE); return keyword; }
  "#".*        { BEGIN(INITIAL); }
  \n           { BEGIN(INITIAL); }
}
<VALUE>{
  "#".*        { BEGIN(INITIAL); }
  \n           { BEGIN(INITIAL); }
  [[:blank:]]+ ;   /* Ignore space and tab characters */
  [[:digit:]]+ { yylval.ival = atoi(yytext);
                 BEGIN(NEXT_LINE); return INTEGER;
               }
  [[:digit:]]+"."[[:digit:]]*|"."[[:digit:]]+ {
                 yylval.fval = atof(yytext);
                 BEGIN(NEXT_LINE); return FLOAT;
               }
  \"([^"]|\\.)*\" { char* s = malloc(yyleng - 1);
                    yylval.sval = s;
                    /* Remove quotes and escape characters */
                    yytext[yyleng - 1] = '\0';
                    do {
                      if (*++yytext == '\\') ++yytext;
                      *s++ = *yytext;
                    } while (*yytext);
                    BEGIN(NEXT_LINE); return STRING;
                  }
  /* Other possible value token types */
  .            BEGIN(NEXT_LINE);   /* bad character in value */
}
<NEXT_LINE>.*\n? BEGIN(INITIAL);
In the escape-removal code, you might want to translate things like \n. And you might also want to avoid string values with physical newlines. And a bunch of etceteras. It's only intended as a model.
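For instance, the backslash translation could be a small helper like this (a hypothetical extension of the model above, not part of the original answer):
/* Translate the character that followed a backslash; unrecognized
 * escapes pass through unchanged, so "\q" yields 'q'. */
static char unescape(char c)
{
    switch (c) {
    case 'n': return '\n';
    case 't': return '\t';
    case 'r': return '\r';
    default:  return c;
    }
}
In the string-copying loop, the branch that skips the backslash would then store unescape(*yytext) instead of *yytext.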

Weird ANTLR grammar rule

I have found an old file that defines ANTLR grammar rules like this:
rule_name[ ParamType *param ] > [ReturnType *retval]:
    <<
    $retval = NULL;
    OtherType1 *new_var1 = NULL;
    OtherType2 *new_var2 = NULL;
    >>
    subrule1[ param ] > [ $retval ]
    | subrule2 > [new_var2]
    <<
    if( new_var2 == SOMETHING ){
        $retval = something_related_to_new_var2;
    }
    else{
        $retval = new_var2;
    }
    >>
    {
        somethingelse > [new_var_1]
        <<
        /* Do something with new_var_1 */
        $retval = new_var_1;
        >>
    }
    ;
I'm not an ANTLR expert and it's the first time I have seen this kind of syntax for a rule definition.
Does anybody know where I can find documentation/information about this?
Even a keyword for a Google search is welcome.
Edit:
It should be ANTLR Version 1.33MR33.
OK, I found it! Here is the guide:
http://www.antlr2.org/book/pcctsbk.pdf
I quote the interesting parts of the PDF that answer my question.
1) Page 47:
poly > [float r]
    : <<float f;>>
      term>[$r] ( "\+" term>[f] <<$r += f;>> )*
    ;
Rule poly is defined to have a return value called $r via the "> [float r]" notation; this is similar to the output redirection character of UNIX shells. Setting the value of $r sets the return value of poly. The first action after the ":" is an init-action (because it is the first action of a rule or subrule). The init-action defines a local variable called f that will be used in the (...)* loop to hold the return value of the term.
2) Page 85:
A rule looks like:
rule : alternative1
     | alternative2
     ...
     | alternativen
     ;
where each alternative production is composed of a list of elements that can be references to rules, references to tokens, actions, predicates, and subrules. Argument and return value definitions look like the following, where there are n arguments and m return values:
rule[arg1,...,argn] > [retval1,...,retvalm] : ... ;
The syntax for using a rule mirrors its definition:
a : ... rule[arg1,...,argn] > [v1,...,vm] ...
  ;
Here, the various vi receive the return values from the rule rule; each vi must be an l-value.
3) Page 87:
Actions are of the form <<...>> and contain user-supplied C or C++ code that must be executed during the parse.

ANTLR Tree Grammar for loops

I'm trying to implement a parser by directly reading a tree walker and implementing the commands needed for the compiler on the fly. So if I have a command like:
statement
    :
    ^('WRITE' expression)
    {
        // Here is the command that is created by my tree parser
        ch.emitRO("OUT", 0, 0, 0, "write out the value of ac");
        // and then I handle it in my other classes
    }
    ;
I want it to write OUT 0,0,0; to a file. That's my grammar.
I have a problem, though, with the loop section in my grammar. It is:
'WHILE'^ expression 'DO' stat_seq 'ENDDO'
and in the tree parser:
doWhileStatement
    :
    ^('WHILE' expression 'DO' stat_seq 'ENDDO')
    ;
What I want to do is directly parse the code from the while loop into the commands I need. I came up with this solution but it doesn't work:
doWhileStatement
    :
    ^('WHILE' e=expression head='DO'
        {
            int loopHead = ((CommonTree) head).getTokenStartIndex();
        }
        stat_seq
        {
            if ($e.result == 1) {
                input.seek(loopHead);
                doWhileStatement();
            }
        }
    'ENDDO')
    ;
For the record, here are some of the other commands I've written (ignore the code written in brackets; it's for the generation of the commands in a text file):
stat_seq
    :
    (statement)+
    ;
statement
    :
    ^(':=' ID e=expression) { variables.put($ID.text, e); }
    | ^('WRITE' expression)
      {
          ch.emitRM("LDC", ac, $expression.result, 0, "pass the expression value to the ac reg");
          ch.emitRO("OUT", ac, 0, 0, "write out the value of ac");
      }
    | ^('READ' ID)
      {
          ch.emitRO("IN", ac, 0, 0, "read value");
      }
    | ^('IF' expression 'THEN'
      {
          ch.emitRM("LDC", ac1, $expression.result, 0, "pass the expression result to the ac reg");
          int savedLoc1 = ch.emitSkip(1);
      }
      sseq1=stat_seq
      'ELSE'
      {
          int savedLoc2 = ch.emitSkip(1);
          ch.emitBackup(savedLoc1);
          ch.emitRM("JEQ", ac1, savedLoc2 + 1, 0, "skip as many places as needed depending on the expression");
          ch.emitRestore();
      }
      sseq2=stat_seq
      {
          int savedLoc3 = ch.emitSkip(0);
          ch.emitBackup(savedLoc2);
          ch.emitRM("LDC", PC_REG, savedLoc3, 0, "skip for the else command");
          ch.emitRestore();
      }
      'ENDIF')
    | doWhileStatement
    ;
Any help would be appreciated, thank you.
I found it! For everyone who has the same problem: I did it like this and it's working:
^('WHILE'
    {int c = input.index();}
    expression
    {int s = input.index();}
    .* )   // .* is a sequence of statements
{
    int next = input.index(); // index of node following WHILE
    input.seek(c);
    match(input, Token.DOWN, null);
    pushFollow(FOLLOW_expression_in_statement339);
    int condition = expression();
    state._fsp--;
    // there is a problem here
    // expression() seemed to be reading from the grammar file and I couldn't
    // get it to read from the tree walker rule somehow
    // It printed something like no viable alt at input 'DOWN'
    // I googled it and found this mistake
    // So I copied the code from the normal while statement
    // and pasted it here and it works like a charm
    // Normally there should only be int condition = expression()
    while ( condition == 1 ) {
        input.seek(s);
        stat_seq(); // stat_seq is a sequence of statements: (statement ';')+
        input.seek(c);
        match(input, Token.DOWN, null); // Copied value from EvaluatorWalker.java
        // because I couldn't find another way to do it
        pushFollow(FOLLOW_expression_in_statement339);
        condition = expression();
        state._fsp--;
        System.out.println("condition:" + condition + " i:" + variables.get("i"));
    }
    input.seek(next);
}
I described the problem in the comments of my code. If anyone can help me out and explain how to do this properly, I would be grateful. It's odd that there is nearly no feedback on the correct way to implement loops within a tree grammar on the fly.
Regards,
Alex
