Using Lex tokenizers in an embedded system - parsing

I'm trying to write a config-file parser for use in a non-standard C environment. Specifically, I can't rely on the utilities provided by <stdio.h>.
I'm looking to use Flex, but I need to use my own input structures rather than <stdio.h>'s FILE pointers.

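You can supply your own input routine by defining the YY_INPUT macro. The example below reads with getchar(); replace that call with whatever read function your environment provides: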
%{
#define YY_INPUT(buf,result,max_size) \
{ \
int c = getchar(); \
result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
}
%}
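Since the question rules out <stdio.h>, here is a minimal sketch of a reader that pulls bytes from an in-memory buffer instead. The names MemSource, mem_source_read and config_source are hypothetical and do not come from flex. The sketch also assumes memcpy is available in your environment; if it is not, a byte-copy loop does the same job.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory source; neither the type nor its fields come from flex. */
typedef struct MemSource {
    const char *data;   /* start of the config text */
    size_t len;         /* total number of bytes    */
    size_t pos;         /* next byte to hand out    */
} MemSource;

/* Copy up to max_size bytes into buf; return the count, or 0 at end of input. */
size_t mem_source_read(MemSource *src, char *buf, size_t max_size)
{
    size_t remaining, n;

    assert(src != NULL && buf != NULL);
    remaining = src->len - src->pos;
    n = remaining < max_size ? remaining : max_size;
    memcpy(buf, src->data + src->pos, n);
    src->pos += n;
    return n;
}

/*
 * In the definitions section of the .l file (assuming a MemSource named
 * config_source is visible to the scanner):
 *
 *   #define YY_INPUT(buf, result, max_size) \
 *       { result = mem_source_read(&config_source, buf, max_size); }
 *
 * A result of 0 has the same value as YY_NULL, which flex treats as end of input.
 */
```

Handing flex a whole block per call, rather than one character, also avoids a function call per input byte.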

Ragel is a general-purpose state machine compiler whose generated code can be embedded inside a C function. It has special support for building tokenizers.

Related

Implementing Macro in a Rascal language project

Any idea on how to implement macro syntax with Rascal and also how to implement the typing and expansion(translation) of the macro syntax in Rascal? Any link to projects or repositories on this problem would also be appreciated.
Macros are definitions of code substitutions in syntax trees, which is definitely one of the main features of Rascal. Questions I would have before advising specific techniques:
adding macros to an existing language, or to a new language?
macros at refactoring time, at compile time, or at run time?
The answers would inform whether to implement macros on concrete syntax trees or abstract syntax trees.
I would not say macros are a "problem" per se. The raw substitutions in syntax trees are trivial with Rascal. However, "hygienic macros" are more involved. Here we have to consider the capturing of variables by the expanded macro bodies, and what we can do (renaming) to avoid it. There is plenty of literature on how to make macros hygienic. The complexity of hygienic macros depends on the type and name analysis (scoping) system of the base language that macros are added to.
If you have a DSL that you want to translate in stages to the target code, that can also be called "macros", but you will not find that name in the documentation. Here is an example: https://github.com/usethesource/flybytes/blob/main/src/lang/flybytes/macros/ControlFlow.rsc where "macro" is used to rewrite an additional AST node to its semantics in the "core" language.
The basic mechanisms are:
pattern matching: detects what you want to expand, with macros this is often a single ADT constructor but it can also be a more complex special case like matching i+=1 to substitute it with i++ .
substitution: at the location where the match was found, we create a new AST value in a simpler language but with the same semantics. This is done with AST expressions in Rascal, the => operator in visit and insert statements, and return and = in functions.
traversal: guiding the pattern matching and substitution without having to write too many boilerplate recursive functions.
Small example:
data Bool(loc src=|unknown:///|)
= \and(Bool l, Bool r)
| \or(Bool l, Bool r)
| \true()
| \false()
| \not(Bool a)
;
I extend the language with a "macro":
data Bool = impl(Bool l, Bool r);
A first option is to rewrite the constructor immediately and always with an overloaded function:
Bool impl(Bool l, Bool r) = or(not(l), r);
However, we lose some information here for debugging purposes, so let's try to keep the information intact:
Bool impl(Bool l, Bool r, src=loc s) = or(not(l), r, src=s);
Sometimes we want to delay the expansion until a specific stage in the compiler. In particular, with the above "rewrite rule" a type-checker can no longer see the difference between ==> and ||, which sometimes creates usability issues with error messages.
In that case we wrap the expansion in a visit and stage it as a function:
Bool macroExpansion(Bool input) = visit(input) {
case impl(Bool l, Bool r, src=loc s) => or(not(l), r, src=s)
// add more rules here
}
It is also possible to encapsulate rewrite rules as reusable functions:
Bool expand1(impl(Bool l, Bool r, src=loc s)) = or(not(l), r, src=s);
Bool expand2(not(not(Bool b))) = b;
and then pass those or apply those: (expand1 + expand2)(myBool)
So to wrap this up:
pattern matching is the key to macro expansion, patterns can be wrapped in functions or visit cases or both, and functions can be passed around and combined.
take care to do some "origin tracking" and forward src fields to the right-hand sides of rewrite rules, otherwise the generated code does not know where it came from.

flex scanner push-back overflow with automata

I am having a hard time with this problem.
"Write flex code that recognizes strings over the alphabet {0,1} with at least 5 characters, in which every 5 consecutive characters contain at least three 1's."
I thought I had solved it, but I am new to flex, and I am getting the error "flex scanner push-back overflow".
Here's my code:
%{
#define ACCEPT 1
#define DONT 2
%}
delim [ \t\n\r]
ws {delim}+
comb01 00111|{comb06}1
comb02 01011|{comb07}1
comb03 01101|{comb08}1
comb04 01110|({comb01}|{comb09})0
comb05 01111|({comb01}|{comb09})1
comb06 10011|{comb10}1
comb07 10101|{comb11}1
comb08 10110|({comb02}|{comb12})0
comb09 10111|({comb02}|{comb12})1
comb10 11001|{comb13}1
comb11 11010|({comb03}|{comb14})0
comb12 11011|({comb03}|{comb14})1
comb13 11100|({comb04}|{comb15})0
comb14 11101|({comb04}|{comb15})1
comb15 11110|({comb05}|{comb16})0
comb16 11111|({comb05}|{comb16})1
accept {comb01}|{comb02}|{comb03}|{comb04}|{comb05}|{comb06}|{comb07}|{comb08}|{comb09}|{comb10}|{comb11}|{comb12}|{comb13}|{comb14}|{comb15}|{comb16}
string [^ \t\n\r]+
%%
{ws} { ;}
{accept} {return ACCEPT;}
{string} {return DONT;}
%%
int main (void) {
    int i;
    while ((i = yylex ()) != 0)
        switch (i) {
        case ACCEPT:
            printf ("%-20s: ACCEPT\n", yytext);
            break;
        case DONT:
            printf ("%-20s: Reject\n", yytext);
            break;
        }
    return 0;
}
Flex definitions are macros, and flex implements them that way: when it sees {defn} in a pattern, it replaces it with whatever defn was defined as (in parentheses, usually, to avoid operator precedence issues). It doesn't expand the macros in the macro definition, so the macro substitution might contain more definition references which in turn need to be substituted.
Since macro substitution is unconditional, it is not possible to use recursive macros, including macros which are indirectly recursive. Which yours are. Flex doesn't check for this condition, unlike the C preprocessor; it just continues substituting in an endless loop until it runs out of space.
(Flex is implemented using itself; it does the macro substitution using unput. unput will not resize the input buffer, so "runs out of space" here means that flex's internal input buffer became full of macro substitutions.)
The strategy you are using would work fine as a context-free grammar. But that's not flex. Flex is about regular expressions. The pattern you want to match can be described by a regular expression -- the "grammar" you wrote with flex macros is a regular grammar -- but it is not a regular expression and flex won't make one out of it for you, unfortunately. That's your job.
I don't think it's going to be a very pretty regular expression. In fact, I think it's likely to be enormous. But I didn't try working it out.
There are flex tricks you could use to avoid constructing the regular expression. For example, you could build your state machine out of flex start conditions and then scan one character at a time, where each character scanned does a state transition or throws an error. (Use yymore() if you want to return the entire string scanned at the end.)
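As a rough illustration of the "one character at a time" idea, here is a sketch in plain C rather than in flex start conditions. The only state it keeps is the number of 1's in the last five characters, which is enough to decide each transition. The function name accepts is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 if s consists only of '0' and '1', has at least 5 characters,
 * and every window of 5 consecutive characters contains at least three '1's;
 * return 0 otherwise.
 */
int accepts(const char *s)
{
    size_t i;
    int ones = 0;                        /* '1's in the current window */

    assert(s != NULL);
    for (i = 0; s[i] != '\0'; ++i) {
        if (s[i] != '0' && s[i] != '1')
            return 0;                    /* not in the alphabet */
        ones += (s[i] == '1');
        if (i >= 5)
            ones -= (s[i - 5] == '1');   /* character leaving the window */
        if (i >= 4 && ones < 3)
            return 0;                    /* a full window with too few '1's */
    }
    return i >= 5;                       /* length requirement */
}
```

One way to combine this with the scanner in the question would be a single rule such as [01]+ { return accepts(yytext) ? ACCEPT : DONT; }, which keeps the pattern trivial and moves the counting into C.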

Is it possible for yylval to be a struct instead of a union?

In Bison, is it possible for yylval to be a struct instead of a union? I know that I can define yylval as a union with %union{}, but is there a way to define yylval as a struct, for example to return the line and the string of an identifier and access this information in the action of a grammar rule?
Yes, you can #define YYSTYPE to be any type you want instead of using %union. However, it is rarely useful to do so [1]: if you want source position info, you're much better off using %locations in combination with %union.
It's also possible (and common) to use structs within the %union declaration. This makes it easy for some rules to effectively return multiple values.
[1] The main problem is that if you use %type to select one struct field, it's painful to use other fields in the same action. You need to do everything manually, thus losing the benefit of bison's union type checking.
If you want to keep location information (line number and column number) for your tokens, you can use Bison's location facility, which keeps a location object for each token and non-terminal separately from the semantic value. In an action, you refer to a symbol's location as @n (and to the location of the left-hand side as @$).
The location stack is created and maintained automatically by bison if it sees that you have referred to a location anywhere in a rule.
By default, the location datatype is:
typedef struct YYLTYPE {
int first_line;
int first_column;
int last_line;
int last_column;
} YYLTYPE;
The location information for tokens must be set by the lexer. If you are using the default API, it is stored in the global variable yylloc. The parser will create location information for non-terminals by using the range from the beginning of the first item of a production up to the end of the last item. (For empty productions, a zero-length location object is generated, starting and ending at the end position of the symbol just before the reduction.)
Both of these defaults can be overridden if necessary. See the Bison manual for details.
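The default range computation can be sketched as an ordinary C function. This is an illustration of the rule, not Bison's actual macro (which is YYLLOC_DEFAULT); the names Loc and merge_locations are hypothetical. For an empty rule it follows the Bison manual's default, which collapses the range to the end of the symbol just before the reduction.

```c
#include <assert.h>
#include <stddef.h>

/* Same fields as Bison's default YYLTYPE; the name Loc is hypothetical. */
typedef struct Loc {
    int first_line;
    int first_column;
    int last_line;
    int last_column;
} Loc;

/*
 * rhs[0] is the location of the symbol just before the rule's right-hand
 * side; rhs[1..n] are the locations of the n right-hand-side symbols.
 */
Loc merge_locations(const Loc *rhs, size_t n)
{
    Loc cur;

    assert(rhs != NULL);
    if (n > 0) {
        /* Non-empty rule: start of the first symbol to end of the last. */
        cur.first_line   = rhs[1].first_line;
        cur.first_column = rhs[1].first_column;
        cur.last_line    = rhs[n].last_line;
        cur.last_column  = rhs[n].last_column;
    } else {
        /* Empty rule: zero-length range at the end of the previous symbol. */
        cur.first_line   = cur.last_line   = rhs[0].last_line;
        cur.first_column = cur.last_column = rhs[0].last_column;
    }
    return cur;
}
```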
Flex will track line numbers if asked to with %option yylineno, but it does not track column positions, which is a bit annoying. Also, yylloc requires both a starting and an ending line number; yylineno in a flex action will be the line number at the end of the token. Most commonly, you will use the YY_USER_ACTION macro to maintain the value of yylloc; an example implementation (taken from this answer, which you should read if you use this code) is:
%option yylineno
%{
#define YY_USER_ACTION \
yylloc.first_line = yylloc.last_line; \
yylloc.first_column = yylloc.last_column; \
if (yylloc.first_line == yylineno) \
yylloc.last_column += yyleng; \
else { \
int col; \
for (col = 1; yytext[yyleng - col] != '\n'; ++col) {} \
yylloc.last_column = col; \
yylloc.last_line = yylineno; \
}
%}

Binary expression in Lua

Does Lua contain binary expressions like PHP? For example:
$v = 5;
for ($i=0; $i < $v; $i++) {
if($v & $i) {
echo $i." ";
}
}
Echo result:
1 3 4
If so, how to use them?
Since version 5.2, Lua comes with the bit32 library. bit32.band is the counterpart of the & operator in PHP. LuaJIT also has bit operations.
Edit
Well, they're not exactly equivalent, but serve the same purpose.
See Lua logical operators as described at http://www.lua.org/manual/5.1/manual.html#2.5.3.

Question in Flex (parser)

I want to ask you a question about Flex, the program for parsing code.
Supposing I have an instruction like this one, in the rules part:
"=" BEGIN(attribution);
<attribution>{var_name} { fprintf(yyout, "="); ECHO; }
<attribution>";" BEGIN(INITIAL);
{var_name} is a regular expression that matches a variable's name, and all I want to do is to copy at the output all the attribution instructions, such as
a = 3;
or
b = a;
My rule though cannot write with fprintf the left member of the attribution, but only
= 3;
or
=a;
One solution for that might be that, after I make the match "=" and I am in the attribution state, to go 2 positions back as to get the left operand as well.
How can I do that in Flex?
Why are you using flex for syntactic analysis?
What you are doing sounds like a job for bison rather than flex.
In a parser you would be able to store the previous token.
If you still want to use flex, you can use the trailing-context operator /.
Using it may make the scanner less efficient, and the lexer may not behave as you expect; it depends on the whole rule set.
{var_name}/"=" { ECHO; BEGIN(attribution); }
See the flex manual.
