Is $1 a usable token in Lex? - flex-lexer

If I have:
if {
yylval = $1;
}
Is this legal? If not, is there another way to reference what was matched?
(Please don't say yylval = 'if'; it's not dynamic, and I want to use it in some more complicated scenarios.)

No. $1 and friends refer to the semantic values of the terminal and non-terminal symbols in a grammar rule, so they only make sense in a yacc/bison action, not in a lex rule. I don't know what you're trying to do exactly, but normally you would have a set of rules like this:
"if" { return IF; }
"else" { return ELSE; }
[0-9]+ { yylval.intValue = atoi(yytext); return INTEGER; }
etc., where IF and ELSE are defined in y.tab.h as a result of being declared in your .y file via the %token directive.
please don't say yylval = 'if', it's not dynamic
Neither is a lex rule. Your purpose remains obscure.
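For reference, here is a minimal sketch of the yacc side, where $1 and friends are meaningful (all names here are illustrative, not taken from the question):
%{
#include <stdio.h>
int yylex(void);
void yyerror(const char *s);
%}
%union { int intValue; }
%token IF ELSE
%token <intValue> INTEGER
%type <intValue> expr
%%
stmt: IF expr ELSE expr { printf("%d %d\n", $2, $4); }
    ;
expr: INTEGER { $$ = $1; /* $1 is yylval.intValue, set by the lexer */ }
    ;
%%
Here $1, $2 and so on denote the semantic values of the symbols in the rule being reduced; there is no equivalent notation in a lex action, where the matched text is available only as yytext.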

Related

Flex find substring until character

This is my lexer.l file:
%{
#include "../h/Tokens.h"
%}
%option yylineno
%%
[+-]?([1-9]*\.[0-9]+)([eE][+-]?[0-9])? return FLOAT;
[+-]?[1-9]([eE][+-]?[0-9])? return INTEGER;
\"(\\\\|\\\"|[^\"])*\" return STRING;
(true|false) return BOOLEAN;
(func|val|if|else|while|for)* return KEYWORD;
[A-Za-z_][A-Za-z_0-9]* return IDENTIFIER;
"+" return PLUS;
"-" return MINUS;
"*" return MULTI;
"." return DOT;
"," return COMMA;
":" return COLON;
";" return SEMICOLON;
. printf("Unexpected or invalid token: '%s'\n", yytext);
%%
int yywrap(void)
{
return 1;
}
Now, if my lexer finds an unexpected token, it prints an error for every character. I want it to print a single error message for the whole substring, up to the next whitespace or operator.
Example:
Input:
foo bar baz
~±`≥ hello
Output:
Identifier.
Identifier.
Identifier.
Unexpected or invalid token: '~±`≥'
Identifier.
Is there a way to do this with a regex pattern?
Thanks.
Certainly it is possible to do with a regex. But you can't do it with a regex independent of your other token rules. And it may not be trivial to find a correct regex.
In this fairly simple example, though, it's reasonably easy, although there is a corner case. Since there are no multicharacter operators, a character cannot start a token unless it is alphabetic, numeric, one of the operators (-+*.,:;), or a double quote. Therefore any sequence of characters not in that set is an invalid sequence. Also, I think that you really want to ignore whitespace characters (based on the example output), even though your question doesn't show any rule which matches whitespace. So on the assumption that you just left out the whitespace rule, which would be something like
[[:space:]]+ { /* Ignore whitespace */ }
your regex to match a sequence of illegal characters would be
[^-+*.,:;[:alnum:][:space:]]+ { fprintf(stderr, "Invalid sequence %s\n", yytext); }
The corner case is an unterminated string literal; that is, a token which starts with a " but does not include the matching closing quote. Such a token must necessarily extend to the end of the input, and it can easily be matched by using your string pattern with the final " left out. (That works because (f)lex always uses the longest matching pattern, so if there is a terminating " the correct string literal will be matched.)
There are a number of errors in your patterns:
It's almost always a bad idea to match +- at the start of a numeric literal. If you do that, then x+2 will not be correctly analysed; your lexer will return two tokens, an IDENTIFIER and an INTEGER, instead of the correct three tokens (IDENTIFIER, PLUS, INTEGER).
Your FLOAT pattern won't accept numbers which contain a 0 before the decimal point, so 0.5 and 10.3 will both fail. Also, you force the exponent to be a single digit, so 1.3E11 won't be matched either. And you force the user to put a digit after the decimal point; most languages accept 3. as equivalent to 3.0. (That last one is not necessarily an error, but it's unconventional.)
Your INTEGER pattern won't accept numbers containing a 0, such as 10. But it will accept scientific notation, which is a little odd; in most languages 3E10 is a floating point constant, not an integer.
Your KEYWORD pattern accepts keywords which are made up of a concatenated series of words, such as forwhilefuncif. You probably didn't intend to put a * at the end of the pattern.
Your string literal pattern allows any sequence of characters other than ", which means a backslash \ will be allowed to match as a single character, even if it is followed by a quote or a backslash. That will result in some string literals not being correctly terminated. For example, given the string literal
"\\"
(which is a string literal containing a single backslash), the regex will match the initial ", then the \ as a single character, and then the \" sequence, and then whatever follows the string literal until it encounters another quote.
This error may come from the fact that flex requires \ to be escaped inside bracket expressions, unlike POSIX regular expressions, where \ loses its special significance inside brackets.
So that would leave you with something like this:
%{
#include "../h/Tokens.h"
%}
%option yylineno noyywrap
%%
[[:space:]]+ /* Ignore whitespace */
(\.[0-9]+|[0-9]+\.[0-9]*)([eE][+-]?[0-9]+)? {
return FLOAT;
}
0|[1-9][0-9]* return INTEGER;
true|false return BOOLEAN;
func|val|if|else|while|for return KEYWORD;
[A-Za-z_][A-Za-z_0-9]* return IDENTIFIER;
"+" return PLUS;
"-" return MINUS;
"*" return MULTI;
"." return DOT;
"," return COMMA;
":" return COLON;
";" return SEMICOLON;
\"(\\\\|\\\"|[^\\"])*\" return STRING;
\"(\\\\|\\\"|[^\\"])* { fprintf(stderr,
"Unterminated string literal\n"); }
[^-+*.,:;[:alnum:][:space:]]+ { fprintf(stderr,
"Invalid sequence %s\m", yytext); }
(If any of those patterns look mysterious, you might want to review the description of flex patterns in the flex manual.)
But I have a feeling that you were looking for something different: a way of magically adapting to any change in the token patterns without excess analysis.
That's possible, too, but I don't know how to do it without code repetition. The basic idea is simple enough: when we encounter an unmatchable character, we just append it to the end of an error token and when we find a valid token, we emit the error message and clear the error token.
The problem is the "when we find a valid token" part, because that means that we need to insert an action at the beginning of every rule other than the error rule. The easiest way to do that is to use a macro, which at least avoids writing out the code for every action.
(F)lex does provide us with some useful tools we can build this on. We'll use one of (f)lex's special actions, yymore(), which causes the current match to be appended to the token being built, which is useful to build up the error token.
In order to know the length of the error token (and therefore to know if there is one), we need an additional variable. Fortunately, (f)lex allows us to define our own local variables inside the scanner. Then we define the macro E_ (whose name was chosen to be short, in order to avoid cluttering the rule actions), which prints the error message, moves yytext over the error token, and resets the error count.
Putting that together:
%{
#include "../h/Tokens.h"
%}
%option yylineno noyywrap
%%
  int nerrors = 0; /* To keep track of the length of the error token */
  /* This macro must be inserted at the beginning of every rule,
   * except the fallback error rule. The indentation matters: indented
   * lines at the top of the rules section are copied into the scanner.
   */
  #define E_ \
    if (nerrors > 0) { \
      fprintf(stderr, "Invalid sequence %.*s\n", nerrors, yytext); \
      yytext += nerrors; yyleng -= nerrors; nerrors = 0; \
    } else /* Absorb the following semicolon */
[[:space:]]+ { E_; /* Ignore whitespace */ }
(\.[0-9]+|[0-9]+\.[0-9]*)([eE][+-]?[0-9]+)? { E_; return FLOAT; }
0|[1-9][0-9]* { E_; return INTEGER; }
true|false { E_; return BOOLEAN; }
func|val|if|else|while|for { E_; return KEYWORD; }
[A-Za-z_][A-Za-z_0-9]* { E_; return IDENTIFIER; }
"+" { E_; return PLUS; }
"-" { E_; return MINUS; }
"*" { E_; return MULTI; }
"." { E_; return DOT; }
"," { E_; return COMMA; }
":" { E_; return COLON; }
";" { E_; return SEMICOLON; }
\"(\\\\|\\\"|[^\\"])*\" { E_; return STRING; }
\"(\\\\|\\\"|[^\\"])* { E_;
fprintf(stderr,
"Unterminated string literal\n"); }
. { yymore(); ++nerrors; }
That all assumes that we're happy to just produce an error message inside the scanner, and otherwise ignore the erroneous characters. But it may be better to actually return an error indication and let the caller decide how to handle the error. That introduces an extra wrinkle because it requires us to return two tokens in a single action.
For a simple solution, we use another (f)lex feature, yyless(), which allows us to rescan part or all of the current token. We can use that to remove the error token from the current token, instead of adjusting yytext and yyleng. (yyless will do that adjustment for us.) That means that after an error, the next correct token is scanned twice. That may seem inefficient, but it's probably acceptable because:
Most tokens are short,
There's not really much point in optimising for errors. It's much more useful to optimise processing of correct inputs.
To accomplish that, we just need a small change to the E_ macro:
#define E_ \
if (nerrors > 0) { \
yyless(nerrors); \
fprintf(stderr, "Invalid sequence %s\n", yytext); \
nerrors = 0; \
return BAD_INPUT; \
} else /* Absorb the following semicolon */
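A possible caller, sketched on the assumption that BAD_INPUT is declared in Tokens.h alongside the other token codes:
/* driver.c -- a minimal sketch; BAD_INPUT and the other token
 * codes are assumed to come from Tokens.h */
#include <stdio.h>
#include "../h/Tokens.h"

extern int yylex(void);
extern char *yytext;

int main(void) {
    int token;
    while ((token = yylex()) != 0) { /* yylex() returns 0 at end of input */
        if (token == BAD_INPUT) {
            /* the invalid characters were already reported to stderr;
             * decide here whether to keep going or give up */
            continue;
        }
        /* hand any other token to the parser, or process it here */
    }
    return 0;
}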

PEG grammar to accept late definition

I want to write a PEG parser with PackCC (though peg/leg or other libraries would also be possible) which can calculate fields whose variables may be defined at an arbitrary position in the input, including after their use.
The first simplified approach is the following grammar:
%source {
int vars[256];
}
statement <- e:term EOL { printf("answer=%d\n", e); }
term <- l:primary
( '+' r:primary { l += r; }
/ '-' r:primary { l -= r; }
)* { $$ = l; }
/ i:var '=' s:term { $$ = vars[i] = s; }
/ e:primary { $$ = e; }
primary <- < [0-9]+ > { $$ = atoi($1); }
/ i:var !'=' { $$ = vars[i]; }
var <- < [a-z] > { $$ = $1[0]; }
EOL <- '\n' / ';'
%%
When tested with definitions in sequential order, it works fine:
a=42;a+1
answer=42
answer=43
But when the variable definition comes after its use, it fails:
a=42;a+b;b=1
answer=42
answer=42
answer=1
Even deeper chained late definitions should work, like:
a=42;a+b;b=c;c=1
answer=42
answer=42
answer=0
answer=1
Let's think of the input not as a sequential programming language, but more as an Excel-like spreadsheet, e.g.:
A1: 42
A2: =A1+A3
A3: 1
Is it possible to parse and handle such kind of text with a PEG grammar?
Is two-pass or multi-pass an option here?
Or do I need to switch over to old-style lex/yacc or flex/bison?
I'm not familiar with PEG per se, but it looks like what you have is an attributed grammar where you perform the execution logic directly within the semantic action.
That won't work if you have use before definition.
You can use the same parser generator but you'll probably have to define some sort of abstract syntax tree to capture the semantics and postpone evaluation until you've parsed all input.
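To make that concrete, here is a minimal sketch in C of the deferred-evaluation idea (the node layout and the evaluation policy are illustrative assumptions, not PackCC output): the semantic actions would only build nodes, and evaluation runs after the whole input has been parsed, so a later b=1 is visible to an earlier a+b.
/* sketch.c -- illustrative two-pass evaluation, not generated code.
 * Pass 1 (the parser's actions) would only build these nodes;
 * pass 2 evaluates them once every definition has been seen. */
#include <stdio.h>
#include <stdlib.h>

typedef enum { NUM, VAR, ADD, ASSIGN } Kind;

typedef struct Node {
    Kind kind;
    int value;              /* NUM */
    char name;              /* VAR, ASSIGN target */
    struct Node *lhs, *rhs; /* ADD operands; ASSIGN uses rhs */
} Node;

static int vars[256];

static Node *node(Kind k, int v, char n, Node *l, Node *r) {
    Node *p = malloc(sizeof *p);
    p->kind = k; p->value = v; p->name = n; p->lhs = l; p->rhs = r;
    return p;
}

static int eval(Node *p) {
    switch (p->kind) {
    case NUM:    return p->value;
    case VAR:    return vars[(unsigned char)p->name];
    case ADD:    return eval(p->lhs) + eval(p->rhs);
    case ASSIGN: return vars[(unsigned char)p->name] = eval(p->rhs);
    }
    return 0;
}

int main(void) {
    /* statements for "a=42;a+b;b=1", as the parser would have built them */
    Node *stmts[] = {
        node(ASSIGN, 0, 'a', NULL, node(NUM, 42, 0, NULL, NULL)),
        node(ADD, 0, 0, node(VAR, 0, 'a', NULL, NULL),
                        node(VAR, 0, 'b', NULL, NULL)),
        node(ASSIGN, 0, 'b', NULL, node(NUM, 1, 0, NULL, NULL)),
    };
    /* one possible policy: run all definitions first, then print each answer */
    for (int i = 0; i < 3; i++)
        if (stmts[i]->kind == ASSIGN) eval(stmts[i]);
    for (int i = 0; i < 3; i++)
        printf("answer=%d\n", eval(stmts[i]));
    return 0;
}
With chained definitions like b=c;c=1 you would additionally need to re-run the assignment pass until the values stop changing, or order the definitions by their dependencies, as a spreadsheet engine does.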
Yes, it is possible to parse this with a PEG grammar. PEG is effectively greedy LL(*) with infinite lookahead. Expressions like this are easy.
But the grammar you have written is left recursive, which is not PEG. Although some PEG parsers can handle left recursion, until you're an expert it's best to avoid it, and use only right recursion if needed.
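For illustration, a right-recursive additive rule in PackCC-style notation might look like this (a sketch, not a drop-in fix for the grammar above):
# right recursion: the rule consumes a primary before referring to itself
sum <- n:primary '+' s:sum { $$ = n + s; }
     / n:primary { $$ = n; }
Note that right recursion associates to the right, which is one reason the repetition form primary ('+' primary)* used in the question is usually preferable for left-associative operators like subtraction.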

ANTLR4: semantic predicate depending on state does not work

In order to have the lexer of ANTLR4 recognize different kinds of tokens in one rule I use a semantic predicate. This predicate evaluates a static field of a helper class. Have a look at some grammar excerpts:
// very simplified
@header {
import static ParserAndLexerState.*;
}
@members {
private boolean fooAllowed() {
    System.out.println(fooAllowed);
    return fooAllowed;
}
...
methodField
: t = type
{ fooAllowed = false; }
id = Identifier
{ fooAllowed = true; /* do something with t and id*/ }
...
fragment CHAR_NO_OUT_1 : [a-eg-zA-Z_] ;
fragment CHAR_NO_OUT_2 : [a-nq-zA-Z_0-9] ;
fragment CHAR_NO_OUT_3 : [a-nq-zA-Z_0-9] ;
fragment CHAR_1 : [a-zA-Z_] ;
fragment CHAR_N : CHAR_1 | [0-9] ;
Identifier
// returns every possible identifier
: { fooAllowed() }? (CHAR_1 CHAR_N*)
// returns everything but 'foo'
| { !fooAllowed() }? CHAR_NO_OUT_1 (CHAR_NO_OUT_2 (CHAR_NO_OUT_3 CHAR_N*)?)? ;
Identifier now always behaves as if fooAllowed still had the initial value it was given in ParserAndLexerState. So if that value was true, Identifier only ever uses the first alternative of the rule, otherwise always the second. This is weird behavior, especially considering that fooAllowed() prints the right values to the console.
Is there anything about ANTLR4 that should discourage me from using global state within semantic predicates? How can I avoid this behavior?
ANTLR 4 uses unbounded lookahead with non-deterministic termination conditions for the prediction process. While the TokenStream implementations do call TokenSource.nextToken lazily, it is not safe to ever assume that the number of tokens consumed so far is bounded.
In other words, the actual semantics of using a parser action to change the behavior of the lexer are undefined. Different versions of ANTLR 4, or even subtle changes in the input you give it, could produce completely different results.
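If the goal is just to reject a particular spelling in certain parser contexts, one way to sidestep the problem is to always lex the full Identifier and do the check in a parser rule, where the decision point is well-defined. A sketch (the rule name, and reusing fooAllowed() as a parser member, are assumptions):
identifier
    : t=Identifier { fooAllowed() || !$t.text.equals("foo") }?
    ;
Because the predicate refers to the matched token, it is evaluated while the parser is matching, after the token has been produced, so it no longer depends on how far ahead the lexer has already run.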

Flex use constants Container

I have to program a compiler with flex, but I don't like the given code and want to write my own.
lexfile.l:
%{
typedef enum { EQ=0, NE, PLUS, MINUS, SEMICOLON } punctuationType;
typedef enum { PRINT=100, WHILE, IDENT } keywordType;
%}
%%
"!=" { return NEQ; }
"=" { return EQ; }
"+" { return PLUS; }
"-" { return MINUS; }
";" { return SEMICOLON; }
%%
Is there a better solution? The only alternative I have found is to define the constants directly:
#define EQ 0
#define NE 1
...
Output Example:
Operator EQ
Operator NE
The question is only whether there is a better type to use than the enum.
Whatever you return has to be understood by whatever consumes the tokens. If you're using yacc, you don't get the choice: you have to abide by the token codes that %token generates, which are defined for you in y.tab.h: you don't have to do anything at all.
On the other hand there's no need to have either names or flex rules for the single-char special characters: you can just return yytext[0] for all of them and use the corresponding literals in the .y file.
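To illustrate (a sketch, assuming a yacc grammar consumes the tokens):
%%
  /* multi-character operators still need named token codes */
"!="    { return NE; }
  /* single-character operators: return the character itself */
[-+=;]  { return yytext[0]; }
%%
The .y file can then use the character literals directly, e.g. stmt: expr ';' ;. There is no collision, because yacc assigns named token codes starting above the character range.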
You don't really give enough details for further comment.

How to parse subnodes that depend on parents' information?

If I write a grammar file in Yacc/Bison like this:
Module
    : ModuleName "=" Functions
      { $$ = Builder::concat($1, $3, ","); }
    ;
Functions
    : Functions Function
      { $$ = Builder::concat($1, $2, ","); }
    | Function
      { $$ = $1; }
    ;
Function
    : DEF ID ARGS BODY
      {
        /** Lacks module name to do name mangling for the function **/
        /** How can I obtain the "parent" node's module name here ?? **/
        module_name = ; //????
        $$ = Builder::def_function(module_name, $ID, $ARGS, $BODY);
      }
    ;
And this parser should parse code like this:
main_module:
def funA (a,b,c) { ... }
In my AST, the name "funA" should be renamed to main_module.funA, but I can't get the module's information while the parser is processing the Function node!
Are there any Yacc/Bison facilities that can help me handle this problem, or should I change my parsing style to avoid such embarrassing situations?
There is a bison feature, but as the manual says, use it with care:
$N with N zero or negative is allowed for reference to tokens and groupings on the stack before those that match the current rule. This is a very risky practice, and to use it reliably you must be certain of the context in which the rule is applied. Here is a case in which you can use this reliably:
foo: expr bar '+' expr { ... }
| expr bar '-' expr { ... }
;
bar: /* empty */
{ previous_expr = $0; }
;
As long as bar is used only in the fashion shown here, $0 always refers to the expr which precedes bar in the definition of foo.
More cleanly, you could use a mid-rule action (in Module) to push the module name on a name stack (which would have to be part of the parsing context). You would then pop the stack at the end of the rule.
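A sketch of that approach (the name stack and its helper functions are illustrative, not bison built-ins):
Module
    : ModuleName { push_module_name($1); } "=" Functions
      { $$ = Builder::concat($1, $4, ","); pop_module_name(); }
    ;
Function
    : DEF ID ARGS BODY
      { $$ = Builder::def_function(current_module_name(), $ID, $ARGS, $BODY); }
    ;
Note that the mid-rule action counts as a component of the rule, so Functions is now $4 rather than $3.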
For more information and examples of mid-rule actions, see the manual.
