Match this text but only if it comes at the end of the file - flex-lexer

Page 78 of the Flex user's manual says:
There is no way to write a rule which is "match this text, but only if
it comes at the end of the file". You can fake it, though, if you
happen to have a character lying around that you don't allow in your
input. Then you can redefine YY_INPUT to call your own routine which,
if it sees an EOF, returns the magic character first (and remembers to
return a real EOF next time it's called).
I am trying to implement that approach. In fact I managed to get it working (see below). For this input:
Hello, world
How are you?#
I get this (correct) output:
Here's some text Hello, world
Saw this string at EOF How are you?
But I had to do two things in my implementation to get it to work; two things that I shouldn't have to do:
1. I had to call yyterminate(). If I don't call yyterminate() then the output is this:
Here's some text Hello, world
Saw this string at EOF How are you?
Saw this string at EOF
I shouldn't be getting that last line. Why am I getting that last line?
2. I don't understand why I had to do this: tmp[yyleng-1] = '\0'; (subtract 1). I should be able to do this: tmp[yyleng] = '\0'; (not subtract 1). Why do I need to subtract 1?
%option noyywrap
%{
int sawEOF = 0;
#define YY_INPUT(buf,result,max_size) \
{ \
if (sawEOF == 1) \
result = YY_NULL; \
else { \
int c = fgetc(yyin); \
if (c == EOF) { \
sawEOF = 1; \
buf[0] = '#'; \
result = 1; \
} \
else { \
buf[0] = c; \
result = 1; \
} \
} \
}
%}
EOF_CHAR #
%%
[^\n#]*{EOF_CHAR} { char *tmp = strdup(yytext);
tmp[yyleng-1] = '\0';
printf("Saw this string at EOF %s\n", tmp);
yyterminate();
}
[^\n#]+ { printf("Here's some text %s\n", yytext); }
\n { }
%%
int main(int argc, char *argv[])
{
yyin = fopen(argv[1], "r");
yylex();
fclose(yyin);
return 0;
}
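The manual's trick can be tried outside Flex first. The following is a minimal C++ sketch, not Flex code: `SentinelReader` and `read_all` are hypothetical names, and -1 merely stands in for YY_NULL. It illustrates the two-step behaviour the manual describes: hand out the magic character once when the underlying stream is exhausted, then report end of input on every later call.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of Flex): wraps a stream and hands out one
// extra "magic" character just before reporting end of input.
class SentinelReader {
public:
    static constexpr int kEndOfInput = -1;  // stands in for YY_NULL / EOF

    SentinelReader(std::istream& in, char sentinel)
        : in_(in), sentinel_(sentinel) {}

    // Returns the next character, then the sentinel exactly once,
    // then kEndOfInput on every later call.
    int next() {
        if (sentinel_sent_) return kEndOfInput;
        int c = in_.get();
        if (c == std::char_traits<char>::eof()) {
            sentinel_sent_ = true;
            return static_cast<unsigned char>(sentinel_);
        }
        return c;
    }

private:
    std::istream& in_;
    char sentinel_;
    bool sentinel_sent_ = false;
};

// Drains a reader into a string so the effect is easy to inspect.
std::string read_all(const std::string& text, char sentinel) {
    std::istringstream in(text);
    SentinelReader reader(in, sentinel);
    std::string out;
    for (int c = reader.next(); c != SentinelReader::kEndOfInput; c = reader.next())
        out.push_back(static_cast<char>(c));
    return out;
}
```

Under these assumptions, an input that already ends in '#' comes out with two of them, which is presumably why the manual asks for a character that never appears in real input. The sentinel is also part of whatever token matches it, so it occupies the last position of the matched text.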

Related

Match rule only at the beginning of a file

Problem
I'm writing a sort of a script language interpreter.
I would like it to be able to handle (ignore) things like a shebang, a UTF BOM, or other such things that can appear at the beginning of a file.
The problem is that I cannot be sure that my growing grammar won't at some point have a rule that could match one of those things. (It's unlikely but you don't get reliable programs by just ignoring unlikely problems.)
Therefore, I would like to do it properly and ignore those things only if they are at the beginning of a file.
Let's focus on a shebang in the example.
I've written some simple grammar that illustrates the problems I'm facing.
Lexer:
%%
#!.+ { printf("shebang: \"%s\"\n", yytext + 2); }
[[:alnum:]_]+ { printf("id: %s\n", yytext); return 1; }
#[^#]*# { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]] ;
. { printf("error: '%c'\n", yytext[0]); }
%%
int main() { while (yylex()); return 0; }
int yywrap() { return 1; }
Input file:
#!my-program
# some multiline
thingy #
aaa bbb
ccc#!not a shebang#ddd
eee
Expected output:
shebang: "my-program"
thingy: "# some multiline
thingy #"
id: aaa
id: bbb
id: ccc
thingy: "#!not a shebang#"
id: ddd
id: eee
Actual output:
thingy: "#!my-program
#"
id: some
id: multiline
id: thingy
thingy: "#
aaa bbb
ccc#"
error: '!'
id: not
id: a
id: shebang
error: '#'
id: ddd
id: eee
My (bad?) solution
I figured that this is a good case to use start conditions.
I managed to use them to write a lexer that does work, however, it's rather ugly:
%s MAIN
%%
<INITIAL>#!.+ { printf("shebang: \"%s\"\n", yytext + 2); BEGIN(MAIN); }
<INITIAL>""/(?s:.) { BEGIN(MAIN); }
[[:alnum:]_]+ { printf("id: %s\n", yytext); return 1; }
<MAIN>#[^#]*# { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]] ;
. { printf("error: '%c'\n", yytext[0]); }
%%
int main() { while (yylex()); return 0; }
int yywrap() { return 1; }
Notice that I had to specify the start condition MAIN before the rule #[^#]*#.
It's because it would otherwise collide with the shebang rule #!.+.
Unfortunately, the INITIAL start condition is inclusive, which means I had to specifically exclude from it any rule that would cause problems. I have to remember about it every time I write a new rule (AKA I'll forget about it).
Is there some way to make the INITIAL exclusive or choose a different start condition to be the default?
Here's a simpler solution, assuming you're using Flex (as per your flex-lexer tag):
%option noinput nounput noyywrap nodefault yylineno
%{
#define YY_USER_INIT BEGIN(STARTUP);
%}
%x STARTUP
%%
<STARTUP>#!.* { BEGIN(INITIAL); printf("Shebang: \"%s\"\n", yytext+2); }
<STARTUP>.|\n { BEGIN(INITIAL); yyless(0); }
/* Rest is INITIAL */
[[:alnum:]_]+ { printf("id: %s\n", yytext); return 1; }
#[^#]*# { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]] ;
. { printf("error: '%c'\n", yytext[0]); }
Test:
rici$ flex -o shebang.c shebang.l
rici$ gcc -Wall -o shebang shebang.c -lfl
rici$ ./shebang <<"EOF"
> #!my-program
> # some multiline
> thingy #
> aaa bbb
> ccc#!not a shebang#ddd
> eee
> EOF
Shebang: "my-program"
thingy: "# some multiline
thingy #"
id: aaa
id: bbb
id: ccc
thingy: "#!not a shebang#"
id: ddd
id: eee
Notes:
1. The %option line:
   - prevents "Unused function" warnings;
   - removes the need for yywrap;
   - shows an error if there's some possible input which doesn't match any pattern;
   - counts input lines in the global yylineno.
2. The macro YY_USER_INIT is executed precisely once, when the scanner starts up. It executes before any of Flex's initialization code; fortunately, Flex's initialization code does not change the start condition if it's already been set.
3. yyless(0) causes the current token to be rescanned. (The argument doesn't have to be 0; it truncates the current token to that length and efficiently puts the rest back into the input stream.)
4. The library -lfl includes yywrap() (although in this case, it's not used), and a simple main() definition rather similar to the one in your example.
(1) and (2) are Flex extensions. (3) and (4) should be available in any lex which conforms to POSIX, with the exception that the POSIX lex library is linked with -ll.
There is an indirect way to select a different start condition.
The start condition is an integer variable (e.g. INITIAL is 0).
You can get its current value using a macro YY_START and if it equals INITIAL, change it to another value, effectively replacing it.
%x BEGINNING
%s MAIN
%%
%{
if (YY_START == INITIAL)
BEGIN(BEGINNING);
%}
<BEGINNING>#!.+ { printf("shebang: \"%s\"\n", yytext + 2); BEGIN(MAIN); }
<BEGINNING>""/(?s:.) { BEGIN(MAIN); }
[[:alnum:]_]+ { printf("id: %s\n", yytext); return 1; }
#[^#]*# { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]] ;
. { printf("error: '%c'\n", yytext[0]); }
%%
int main() { while (yylex()); return 0; }
int yywrap() { return 1; }
The disadvantages of this solution are:
The code block will execute every time you call yylex (not only the first time, when it's actually needed). The overhead is small enough to be ignored, though.
Flex will still generate the whole state machine as if INITIAL was used. I have no idea if this creates a lot of code or if it would matter in a big parser.

Parsing an integer and HEX value in Ragel

I am trying to design a parser using Ragel and C++ as the host language.
There is a particular case where a parameter can be defined in two formats :
a. Integer : eg. SignalValue = 24
b. Hexadecimal : eg. SignalValue = 0x18
I have the below code to parse such a parameter :
INT = ((digit+)$incr_Count) %get_int >!(int_error); #[0-9]
HEX = (([0].'x'.[0-9A-F]+)$incr_Count) %get_hex >!(hex_error); #[hexadecimal]
SIGNAL_VAL = ( INT | HEX ) %/getSignalValue;
However, with the parser definition above, only integer values (as defined in case a) get recognized and parsed correctly.
If a hexadecimal number (e.g. 0x24) is provided, then the number gets stored as '0'. No error action is called for the hexadecimal number. The parser recognizes the hexadecimal, but the value stored is '0'.
I seem to be missing some minor detail with Ragel. Has anyone faced a similar situation?
The remaining part of the code:
//Global
int lInt = -1;
action incr_Count {
iGenrlCount++;
}
action get_int {
int channel = 0xFF;
std::stringstream str;
while(iGenrlCount > 0)
{
str << *(p - iGenrlCount);
iGenrlCount--;
}
str >> lInt; //push the values
str.clear();
}
action get_hex {
std::stringstream str;
while(iGenrlCount > 0)
{
str << std::hex << *(p - iGenrlCount);
iGenrlCount--;
}
str >> lInt; //push the values
}
action getSignalValue {
cout << "lInt = " << lInt << endl;
}
It's not a problem with your FSM (which looks fine for the task you have), it's more of a C++ coding issue. Try this implementation of get_hex():
action get_hex {
std::stringstream str;
cout << "get_hex()" << endl;
while(iGenrlCount > 0)
{
str << *(p - iGenrlCount);
iGenrlCount--;
}
str >> std::hex >> lInt; //push the values
}
Notice that it uses str purely as a string buffer and applies std::hex to the extraction (>>) from the std::stringstream into the int. So in the end you get:
$ ./a.out 245
lInt = 245
$ ./a.out 0x245
lInt = 581
Which probably is what you want.
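The same point can be seen without Ragel. Below is a minimal C++ sketch, assuming the matched text is already available as a std::string; `parse_signal_value` is a hypothetical helper, not something from the question. It applies std::hex to the extraction only when the text carries a 0x prefix. Without the manipulator, a decimal extraction of "0x18" stops at the 'x', which matches the stored 0 reported in the question.

```cpp
#include <ios>
#include <sstream>
#include <string>

// Hypothetical helper: converts "24" or "0x18" to an int.
// Returns false if the text is not one complete number.
bool parse_signal_value(const std::string& text, int& value) {
    std::istringstream in(text);
    const bool is_hex = text.size() > 2 && text[0] == '0' &&
                        (text[1] == 'x' || text[1] == 'X');
    if (is_hex)
        in >> std::hex;  // the manipulator affects the extraction that follows
    in >> value;
    // Success means a number was read and nothing was left over.
    return !in.fail() && in.eof();
}
```

Keeping the base decision on the extraction side means the characters can be buffered unchanged, as the revised get_hex action does.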

Cling API available?

How to use Cling in my app via API to interpret C++ code?
I expect it to provide a terminal-like way of interaction without the need to compile/run an executable. Let's say I have a hello world program:
void main() {
cout << "Hello world!" << endl;
}
I expect to have API to execute char* = (program code) and get char *output = "Hello world!". Thanks.
PS. Something similar to the Ch interpreter example:
/* File: embedch.c */
#include <stdio.h>
#include <embedch.h>
char *code = "\
int func(double x, int *a) { \
printf(\"x = %f\\n\", x); \
printf(\"a[1] in func=%d\\n\", a[1]);\
a[1] = 20; \
return 30; \
}";
int main () {
ChInterp_t interp;
double x = 10;
int a[] = {1, 2, 3, 4, 5}, retval;
Ch_Initialize(&interp, NULL);
Ch_AppendRunScript(interp,code);
Ch_CallFuncByName(interp, "func", &retval, x, a);
printf("a[1] in main=%d\n", a[1]);
printf("retval = %d\n", retval);
Ch_End(interp);
}
There is finally a better answer: example code! See https://github.com/root-project/cling/blob/master/tools/demo/cling-demo.cpp
And the answer to your question is: no. cling takes code and returns C++ values or objects, across compiled and interpreted code. It's not a "string in / string out" kinda thing. There's perl for that ;-) This is what code in, value out looks like:
// We could use a header, too...
interp.declare("int aGlobal;\n");
cling::Value res; // Will hold the result of the expression evaluation.
interp.process("aGlobal;", &res);
std::cout << "aGlobal is " << res.getAs<long long>() << '\n';
Apologies for the late reply!
Usually the way one does it is:
[cling$] #include "cling/Interpreter/Interpreter.h"
[cling$] const char* someCode = "int i = 123;"
[cling$] gCling->declare(someCode);
[cling$] i // You will have i declared:
(int) 123
The API is documented in: http://cling.web.cern.ch/cling/doxygen/classcling_1_1Interpreter.html
Of course you can create your own 'nested' interpreter in cling's runtime too. (See the doxygen link above)
I hope it helps and answers the question; you can find more usage examples under the test/ folder.
Vassil

Is there a way to localize error messages from bison/flex?

Do bison and flex allow the user to natively localize error messages?
For example, I would like to translate the following message: syntax error, unexpected NUMBER, expecting $end into another language and replace NUMBER/$end with something more human-readable.
Use yyerror and YY_USER_ACTION for additional data.
void yyerror(const char *s) {
    /* dummmy is a char buffer declared elsewhere */
    sprintf(dummmy, "%s line %d col %d word '%s'\n", s, myline, mycolumn, yytext);
    print_error(dummmy);
}
In the lex file:
#define YY_USER_ACTION \
    addme(yy_start, yytext); \
    mycolumn += yyleng; \
    if(*yytext == '\n') { myline++; mycolumn = 0; } else 0;

How to find functions in a cpp file that contain a specific word

Using grep, vim's grep, or another Unix shell command, I'd like to find the functions in a large cpp file that contain a specific word in their body.
In the files that I'm working with, the word I'm looking for is on an indented line; the corresponding function header is the first line above the indented line that starts at position 0 and is not a '{'.
For example searching for JOHN_DOE in the following code snippet
int foo ( int arg1 )
{
    /// code
}
void bar ( std::string arg2 )
{
    /// code
    aFunctionCall( JOHN_DOE );
    /// more code
}
should give me
void bar ( std::string arg2 )
The algorithm that I hope to catch in grep/vim/unix shell scripts would probably best use the indentation and formatting assumptions, rather than attempting to parse C/C++.
Thanks for your suggestions.
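The layout rule described in the question can be written down directly. This is a minimal C++ sketch under the question's own formatting assumptions, not a C++ parser; `find_enclosing_headers` is a hypothetical name. One extra assumption is added: a line starting with '}' in column 0 is skipped as well, so the closing brace of the previous function is never taken for a header.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Returns the header line of every function whose body mentions `word`,
// relying only on layout:
//   - the word sits on an indented line;
//   - the header is the nearest earlier line that starts in column 0
//     and does not begin with '{' (or '}', an added assumption).
std::vector<std::string> find_enclosing_headers(const std::string& source,
                                                const std::string& word) {
    std::vector<std::string> headers;
    std::istringstream in(source);
    std::string line;
    std::string candidate;  // most recent column-0 line that is not a brace
    bool reported = false;  // avoid listing the same function twice

    while (std::getline(in, line)) {
        if (line.empty()) continue;
        const bool indented = line[0] == ' ' || line[0] == '\t';
        if (!indented) {
            if (line[0] != '{' && line[0] != '}') {
                candidate = line;
                reported = false;
            }
            continue;
        }
        if (!reported && !candidate.empty() &&
            line.find(word) != std::string::npos) {
            headers.push_back(candidate);
            reported = true;
        }
    }
    return headers;
}
```

For the snippet in the question, searching for JOHN_DOE would yield the single line `void bar ( std::string arg2 )`. Column-0 lines that are not headers (preprocessor directives, globals) would confuse it if an indented line containing the word followed them.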
I'll probably get voted down for this!
I am an avid (G)VIM user but when I want to review or understand some code I use Source Insight. I almost never use it as an actual editor though.
It does exactly what you want in this case, e.g. show all the functions/methods that use some highlighted data type/define/constant/etc... in a relations window...
(source: sourceinsight.com)
Ouch! There goes my rep.
As far as I know, this can't be done. Here's why:
First, you have to search across lines. No problem: in vim, adding \_ to a character class tells it to include newlines, so {\_.*} would match everything between those braces across multiple lines.
So now you need to match whatever the pattern is for a function header (brittle even if you get it to work), then, and here's the problem, whatever lines are between it and your search string, and finally match your search string. So you might have a regex like
/^\(void \+\a\+ *(.*)\)\_.*JOHN_DOE
But what happens is the first time vim finds a function header, it starts matching. It then matches every character until it finds JOHN_DOE. Which includes all the function headers in the file.
So the problem is that, as far as I know, there's no way to tell vim to match every character except for this regex pattern. And even if there was, a regex is not the tool for this job. It's like opening a beer with a hammer. What we should do is write a simple script that gives you this info, and I have.
fun! FindMyFunction(searchPattern, funcPattern)
  call search(a:searchPattern)
  let lineNumber = line(".")
  let lineNumber = lineNumber - 1
  "call setpos(".", [0, lineNumber, 0, 0])
  let lineString = getline(lineNumber)
  while lineString !~ a:funcPattern
    let lineNumber = lineNumber - 1
    if lineNumber < 1
      echo "Function not found :/"
      " stop here instead of looping forever past the top of the buffer
      return
    endif
    let lineString = getline(lineNumber)
  endwhile
  echo lineString
endfunction
That should give you the result you want and it's way easier to share, debug, and repurpose than a regular expression spit from the mouth of Cthulhu himself.
Tough call, although as a starting point I would suggest this wonderful VIM Regex Tutorial.
You cannot do that reliably with a regular expression, because code is not a regular language. You need a real parser for the language in question.
Arggh! I admit this is a bit over the top:
A little program to filter stdin, strip comments, and put function bodies on the same line. It'll get fooled by namespaces and function definitions inside class declarations, besides other things. But it might be a good start:
#include <stdio.h>
#include <assert.h>
int main() {
    enum {
        NORMAL,
        LINE_COMMENT,
        MULTI_COMMENT,
        IN_STRING,
    } state = NORMAL;
    unsigned depth = 0;
    for(char c=getchar(),prev=0; !feof(stdin); prev=c,c=getchar()) {
        switch(state) {
        case NORMAL:
            if('/'==c && '/'==prev)
                state = LINE_COMMENT;
            else if('*'==c && '/'==prev)
                state = MULTI_COMMENT;
            else if('#'==c)
                state = LINE_COMMENT;
            else if('\"'==c) {
                state = IN_STRING;
                putchar(c);
            } else {
                if(('}'==c && !--depth) || (';'==c && !depth)) {
                    putchar(c);
                    putchar('\n');
                } else {
                    if('{'==c)
                        depth++;
                    else if('/'==prev && NORMAL==state)
                        putchar(prev);
                    else if('\t'==c)
                        c = ' ';
                    if(' '==c && ' '!=prev)
                        putchar(c);
                    else if(' '<c && '/'!=c)
                        putchar(c);
                }
            }
            break;
        case LINE_COMMENT:
            if(' '>c)
                state = NORMAL;
            break;
        case MULTI_COMMENT:
            if('/'==c && '*'==prev) {
                c = '\0';
                state = NORMAL;
            }
            break;
        case IN_STRING:
            if('\"'==c && '\\'!=prev)
                state = NORMAL;
            putchar(c);
            break;
        default:
            assert(!"bug");
        }
    }
    putchar('\n');
    return 0;
}
It's C++, so just put it in a file, compile it to a file named 'stripper', and then:
cat my_source.cpp | ./stripper | grep JOHN_DOE
So consider the input:
int foo ( int arg1 )
{
    /// code
}
void bar ( std::string arg2 )
{
    /// code
    aFunctionCall( JOHN_DOE );
    /// more code
}
The output of "cat example.cpp | ./stripper" is:
int foo ( int arg1 ) { }
void bar ( std::string arg2 ){ aFunctionCall( JOHN_DOE ); }
The output of "cat example.cpp | ./stripper | grep JOHN_DOE" is:
void bar ( std::string arg2 ){ aFunctionCall( JOHN_DOE ); }
The job of finding the function name (I guess it's the last identifier to precede a "(") is left as an exercise to the reader.
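One way that exercise might go, as a minimal C++ sketch: `function_name` is a hypothetical helper that takes one line of stripper output and returns the identifier directly before the first "(". Treating ':' and '~' as identifier characters is an added assumption, so that qualified names such as Foo::bar stay in one piece.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: returns the identifier that directly precedes the
// first '(' on a line, or an empty string if there is none.
std::string function_name(const std::string& line) {
    const std::size_t paren = line.find('(');
    if (paren == std::string::npos) return "";

    // Skip whitespace between the name and the parenthesis.
    std::size_t end = paren;
    while (end > 0 && std::isspace(static_cast<unsigned char>(line[end - 1])))
        --end;

    // Walk left over identifier characters (plus ':' and '~', an assumption).
    auto is_ident = [](char ch) {
        const unsigned char u = static_cast<unsigned char>(ch);
        return std::isalnum(u) || ch == '_' || ch == ':' || ch == '~';
    };
    std::size_t begin = end;
    while (begin > 0 && is_ident(line[begin - 1]))
        --begin;

    return line.substr(begin, end - begin);
}
```

This is only a heuristic: overloaded operators, function pointers, and macros that expand to declarations would all defeat it.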
For that kind of stuff, although it comes down to primitive searching again, I would recommend the compview plugin. It will open up a search window, so you can see the entire line where the match occurred and automatically jump to it. Gives a nice overview.
(source: axisym3.net)
Like Robert said, regex will help. In command mode, start a regex search by typing the "/" character followed by your regex.
Ctags may also be of use to you. It can generate a tag file for a project. This tag file allows a user to jump directly from a function call to its definition, even if it's in another file, using "CTRL+]".
You can use grep -r -n -H JOHN_DOE * which will look for "JOHN_DOE" in the files recursively, starting from the current directory.
You can use the following code to find the function which contains the text expression:
// Needs java.io.BufferedReader, java.io.File, java.io.FileReader, java.io.IOException.
public void findFunction(File file, String expression) {
    String lineWithNameOfFunction = "";
    boolean matchFound = false;
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader br = new BufferedReader(new FileReader(file))) {
        String match;
        // readLine() returns null at end of file; reading whole lines
        // avoids swallowing the first character of each line
        while ((match = br.readLine()) != null) {
            if (match.endsWith(") {") ||
                match.endsWith("){") ||
                match.endsWith("()") ||
                match.endsWith(")") ||
                match.endsWith("( )")) {
                // this here is because i guessed that method will start
                // at the 0
                if (!match.startsWith(" ") && !match.startsWith("\t")) {
                    lineWithNameOfFunction = match;
                }
            }
            if (match.contains(expression)) {
                matchFound = true;
                break;
            }
        }
        if (matchFound)
            System.out.println(lineWithNameOfFunction);
        else
            System.out.println("No matching function found");
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}
I wrote this in Java, tested it, and it works like a charm. It has a few drawbacks, but for starters it's fine: I didn't add support for multiple functions containing the same expression, and maybe some other things. Try it.
