How to find functions in a cpp file that contain a specific word - grep

using grep, vim's grep, or another unix shell command, I'd like to find the functions in a large cpp file that contain a specific word in their body.
In the files that I'm working with the word I'm looking for is on an indented line, the corresponding function header is the first line above the indented line that starts at position 0 and is not a '{'.
For example searching for JOHN_DOE in the following code snippet
int foo ( int arg1 )
{
/// code
}
void bar ( std::string arg2 )
{
/// code
aFunctionCall( JOHN_DOE );
/// more code
}
should give me
void bar ( std::string arg2 )
The algorithm that I hope to catch in grep/vim/unix shell scripts would probably best use the indentation and formatting assumptions, rather than attempting to parse C/C++.
Thanks for your suggestions.

I'll probably get voted down for this!
I am an avid (G)VIM user but when I want to review or understand some code I use Source Insight. I almost never use it as an actual editor though.
It does exactly what you want in this case, e.g. show all the functions/methods that use some highlighted data type/define/constant/etc... in a relations window...
(source: sourceinsight.com)
Ouch! There goes my rep.

As far as I know, this can't be done. Here's why:
First, you have to search across lines. No problem, in vim adding a _ to a character class tells it to include new lines. so {_.*} would match everything between those brackets across multiple lines.
So now you need to match whatever the pattern is for a function header(brittle even if you get it to work), then , and here's the problem, whatever lines are between it and your search string, and finally match your search string. So you might have a regex like
/^\(void \+\a\+ *(.*)\)\_.*JOHN_DOE
But what happens is the first time vim finds a function header, it starts matching. It then matches every character until it finds JOHN_DOE. Which includes all the function headers in the file.
So the problem is that, as far as I know, there's no way to tell vim to match every character except for this regex pattern. And even if there was, a regex is not the tool for this job. It's like opening a beer with a hammer. What we should do is write a simple script that gives you this info, and I have.
fun! FindMyFunction(searchPattern, funcPattern)
call search(a:searchPattern)
let lineNumber = line(".")
let lineNumber = lineNumber - 1
"call setpos(".", [0, lineNumber, 0, 0])
let lineString = getline(lineNumber)
while lineString !~ a:funcPattern
let lineNumber = lineNumber - 1
if lineNumber < 0
echo "Function not found :/"
endif
let lineString = getline(lineNumber)
endwhile
echo lineString
endfunction
That should give you the result you want and it's way easier to share, debug, and repurpose than a regular expression spit from the mouth of Cthulhu himself.

Tough call, although as a starting point I would suggest this wonderful VIM Regex Tutorial.

You cannot do that reliably with a regular expression, because code is not a regular language. You need a real parser for the language in question.

Arggh! I admit this is a bit over the top:
A little program to filter stdin, strip comments, and put function bodies on the same line. It'll get fooled by namespaces and function definitions inside class declarations, besides other things. But it might be a good start:
#include <stdio.h>
#include <assert.h>
int main() {
enum {
NORMAL,
LINE_COMMENT,
MULTI_COMMENT,
IN_STRING,
} state = NORMAL;
unsigned depth = 0;
for(char c=getchar(),prev=0; !feof(stdin); prev=c,c=getchar()) {
switch(state) {
case NORMAL:
if('/'==c && '/'==prev)
state = LINE_COMMENT;
else if('*'==c && '/'==prev)
state = MULTI_COMMENT;
else if('#'==c)
state = LINE_COMMENT;
else if('\"'==c) {
state = IN_STRING;
putchar(c);
} else {
if(('}'==c && !--depth) || (';'==c && !depth)) {
putchar(c);
putchar('\n');
} else {
if('{'==c)
depth++;
else if('/'==prev && NORMAL==state)
putchar(prev);
else if('\t'==c)
c = ' ';
if(' '==c && ' '!=prev)
putchar(c);
else if(' '<c && '/'!=c)
putchar(c);
}
}
break;
case LINE_COMMENT:
if(' '>c)
state = NORMAL;
break;
case MULTI_COMMENT:
if('/'==c && '*'==prev) {
c = '\0';
state = NORMAL;
}
break;
case IN_STRING:
if('\"'==c && '\\'!=prev)
state = NORMAL;
putchar(c);
break;
default:
assert(!"bug");
}
}
putchar('\n');
return 0;
}
Its c++, so just it in a file, compile it to a file named 'stripper', and then:
cat my_source.cpp | ./stripper | grep JOHN_DOE
So consider the input:
int foo ( int arg1 )
{
/// code
}
void bar ( std::string arg2 )
{
/// code
aFunctionCall( JOHN_DOE );
/// more code
}
The output of "cat example.cpp | ./stripper" is:
int foo ( int arg1 ) { }
void bar ( std::string arg2 ){ aFunctionCall( JOHN_DOE ); }
The output of "cat example.cpp | ./stripper | grep JOHN_DOE" is:
void bar ( std::string arg2 ){ aFunctionCall( JOHN_DOE ); }
The job of finding the function name (guess its the last identifier to precede a "(") is left as an exercise to the reader.

For that kind of stuff, although it comes to primitive searching again, I would recommend compview plugin. It will open up a search window, so you can see the entire line where the search occured and automatically jump to it. Gives a nice overview.
(source: axisym3.net)

Like Robert said Regex will help. In command mode start a regex search by typing the "/" character followed by your regex.
Ctags1 may also be of use to you. It can generate a tag file for a project. This tag file allows a user to jump directly from a function call to it's definition even if it's in another file using "CTRL+]".

u can use grep -r -n -H JOHN_DOE * it will look for "JOHN_DOE" in the files recursively starting from the current directory
you can use the following code to practically find the function which contains the text expression:
public void findFunction(File file, String expression) {
Reader r = null;
try {
r = new FileReader(file);
} catch (FileNotFoundException ex) {
ex.printStackTrace();
}
BufferedReader br = new BufferedReader(r);
String match = "";
String lineWithNameOfFunction = "";
Boolean matchFound = false;
try {
while(br.read() > 0) {
match = br.readLine();
if((match.endsWith(") {")) ||
(match.endsWith("){")) ||
(match.endsWith("()")) ||
(match.endsWith(")")) ||
(match.endsWith("( )"))) {
// this here is because i guessed that method will start
// at the 0
if((match.charAt(0)!=' ') && !(match.startsWith("\t"))) {
lineWithNameOfFunction = match;
}
}
if(match.contains(expression)) {
matchFound = true;
break;
}
}
if(matchFound)
System.out.println(lineWithNameOfFunction);
else
System.out.println("No matching function found");
} catch (IOException ex) {
ex.printStackTrace();
}
}
i wrote this in JAVA, tested it and works like a charm. has few drawbacks though, but for starters it's fine. didn't add support for multiple functions containing same expression and maybe some other things. try it.

Related

Implement heredocs with trim indent using PEG.js

I working on a language similar to ruby called gaiman and I'm using PEG.js to generate the parser.
Do you know if there is a way to implement heredocs with proper indentation?
xxx = <<<END
hello
world
END
the output should be:
"hello
world"
I need this because this code doesn't look very nice:
def foo(arg) {
if arg == "here" then
return <<<END
xxx
xxx
END
end
end
this is a function where the user wants to return:
"xxx
xxx"
I would prefer the code to look like this:
def foo(arg) {
if arg == "here" then
return <<<END
xxx
xxx
END
end
end
If I trim all the lines user will not be able to use a string with leading spaces when he wants. Does anyone know if PEG.js allows this?
I don't have any code yet for heredocs, just want to be sure if something that I want is possible.
EDIT:
So I've tried to implement heredocs and the problem is that PEG doesn't allow back-references.
heredoc = "<<<" marker:[\w]+ "\n" text:[\s\S]+ marker {
return text.join('');
}
It says that the marker is not defined. As for trimming I think I can use location() function
I don't think that's a reasonable expectation for a parser generator; few if any would be equal to the challenge.
For a start, recognising the here-string syntax is inherently context-sensitive, since the end-delimiter must be a precise copy of the delimiter provided after the <<< token. So you would need a custom lexical analyser, and that means that you need a parser generator which allows you to use a custom lexical analyser. (So a parser generator which assumes you want a scannerless parser might not be the optimal choice.)
Recognising the end of the here-string token shouldn't be too difficult, although you can't do it with a single regular expression. My approach would be to use a custom scanning function which breaks the here-string into a series of lines, concatenating them as it goes until it reaches a line containing only the end-delimiter.
Once you've recognised the text of the literal, all you need to normalise the spaces in the way you want is the column number at which the <<< starts. With that, you can trim each line in the string literal. So you only need a lexical scanner which accurately reports token position. Trimming wouldn't normally be done inside the generated lexical scanner; rather, it would be the associated semantic action. (Equally, it could be a semantic action in the grammar. But it's always going to be code that you write.)
When you trim the literal, you'll need to deal with the cases in which it is impossible, because the user has not respected the indentation requirement. And you'll need to do something with tab characters; getting those right probably means that you'll want a lexical scanner which computes visible column positions rather than character offsets.
I don't know if peg.js corresponds with those requirements, since I don't use it. (I did look at the documentation, and failed to see any indication as to how you might incorporate a custom scanner function. But that doesn't mean there isn't a way to do it.) I hope that the discussion above at least lets you check the detailed documentation for the parser generator you want to use, and otherwise find a different parser generator which will work for you in this use case.
Here is the implementation of heredocs in Peggy successor to PEG.js that is not maintained anymore. This code was based on the GitHub issue.
heredoc = "<<<" begin:marker "\n" text:($any_char+ "\n")+ _ end:marker (
&{ return begin === end; }
/ '' { error(`Expected matched marker "${begin}", but marker "${end}" was found`); }
) {
const loc = location();
const min = loc.start.column - 1;
const re = new RegExp(`\\s{${min}}`);
return text.map(line => {
return line[0].replace(re, '');
}).join('\n');
}
any_char = (!"\n" .)
marker_char = (!" " !"\n" .)
marker "Marker" = $marker_char+
_ "whitespace"
= [ \t\n\r]* { return []; }
EDIT: above didn't work with another piece of code after heredoc, here is better grammar:
{ let heredoc_begin = null; }
heredoc = "<<<" beginMarker "\n" text:content endMarker {
const loc = location();
const min = loc.start.column - 1;
const re = new RegExp(`^\\s{${min}}`, 'mg');
return {
type: 'Literal',
value: text.replace(re, '')
};
}
__ = (!"\n" !" " .)
marker 'Marker' = $__+
beginMarker = m:marker { heredoc_begin = m; }
endMarker = "\n" " "* end:marker &{ return heredoc_begin === end; }
content = $(!endMarker .)*

How to work with char types in Dart? (Print alphabet)

I am trying to learn the Dart language, by transposing the exercices given by my school for C programming.
The very first exercice in our C pool is to write a function print_alphabet() that prints the alphabet in lowercase; it is forbidden to print the alphabet directly.
In POSIX C, the straightforward solution would be:
#include <unistd.h>
void print_alphabet(void)
{
char c;
c = 'a';
while (c <= 'z')
{
write(STDOUT_FILENO, &c, 1);
c++;
}
}
int main(void)
{
print_alphabet();
return (0);
}
However, as far as I know, the current version of Dart (1.1.1) does not have an easy way of dealing with characters. The farthest I came up with (for my very first version) is this:
void print_alphabet()
{
var c = "a".codeUnits.first;
var i = 0;
while (++i <= 26)
{
print(c.toString());
c++;
}
}
void main() {
print_alphabet();
}
Which prints the ASCII value of each character, one per line, as a string ("97" ... "122"). Not really what I intended…
I am trying to search for a proper way of doing this. But the lack of a char type like the one in C is giving me a bit of a hard time, as a beginner!
Dart does not have character types.
To convert a code point to a string, you use the String constructor String.fromCharCode:
int c = "a".codeUnitAt(0);
int end = "z".codeUnitAt(0);
while (c <= end) {
print(String.fromCharCode(c));
c++;
}
For simple stuff like this, I'd use "print" instead of "stdout", if you don't mind the newlines.
There is also:
int char_a = 'a'.codeUnitAt(0);
print(String.fromCharCodes(new Iterable.generate(26, (x) => char_a + x)));
or, using newer list literal syntax:
int char_a = 'a'.codeUnitAt(0);
int char_z = 'z'.codeUnitAt(0);
print(String.fromCharCodes([for (var i = char_a; i <= char_z; i++) i]));
As I was finalizing my post and rephrasing my question’s title, I am no longer barking up the wrong tree thanks to this question about stdout.
It seems that one proper way of writing characters is to use stdout.writeCharCode from the dart:io library.
import 'dart:io';
void ft_print_alphabet()
{
var c = "a".codeUnits.first;
while (c <= "z".codeUnits.first)
stdout.writeCharCode(c++);
}
void main() {
ft_print_alphabet();
}
I still have no clue about how to manipulate character types, but at least I can print them.

Any suggestions about how to implement a BASIC language parser/interpreter?

I've been trying to implement a BASIC language interpreter (in C/C++) but I haven't found any book or (thorough) article which explains the process of parsing the language constructs. Some commands are rather complex and hard to parse, especially conditionals and loops, such as IF-THEN-ELSE and FOR-STEP-NEXT, because they can mix variables with constants and entire expressions and code and everything else, for example:
10 IF X = Y + Z THEN GOTO 20 ELSE GOSUB P
20 FOR A = 10 TO B STEP -C : PRINT C$ : PRINT WHATEVER
30 NEXT A
It seems like a nightmare to be able to parse something like that and make it work. And to make things worse, programs written in BASIC can easily be a tangled mess. That's why I need some advice, read some book or whatever to make my mind clear about this subject. What can you suggest?
You've picked a great project - writing interpreters can be lots of fun!
But first, what do we even mean by an interpreter? There are different types of interpreters.
There is the pure interpreter, where you simply interpret each language element as you find it. These are the easiest to write, and the slowest.
A step up, would be to convert each language element into some sort of internal form, and then interpret that. Still pretty easy to write.
The next step, would be to actually parse the language, and generate a syntax tree, and then interpret that. This is somewhat harder to write, but once you've done it a few times, it becomes pretty easy.
Once you have a syntax tree, you can fairly easily generate code for a custom stack virtual machine. A much harder project is to generate code for an existing virtual machine, such as the JVM or CLR.
In programming, like most engineering endeavors, careful planning greatly helps, especially with complicated projects.
So the first step is to decide which type of interpreter you wish to write. If you have not read any of a number of compiler books (e.g., I always recommend Niklaus Wirth's "Compiler Construction" as one of the best introductions to the subject, and is now freely available on the web in PDF form), I would recommend that you go with the pure interpreter.
But you still need to do some additional planning. You need to rigorously define what it is you are going to be interpreting. EBNF is great for this. For a gentile introduction EBNF, read the first three parts of a Simple Compiler at http://www.semware.com/html/compiler.html It is written at the high school level, and should be easy to digest. Yes, I tried it on my kids first :-)
Once you have defined what it is you want to be interpreting, you are ready to write your interpreter.
Abstractly, you're simple interpreter will be divided into a scanner (technically, a lexical analyzer), a parser, and an evaluator. In the simple pure interpolator case, the parser and evaluator will be combined.
Scanners are easy to write, and easy to test, so we won't spend any time on them. See the aforementioned link for info on crafting a simple scanner.
Lets (for example) define your goto statement:
gotostmt -> 'goto' integer
integer -> [0-9]+
This tells us that when we see the token 'goto' (as delivered by the scanner), the only thing that can follow is an integer. And an integer is simply a string a digits.
In pseudo code, we might handle this as so:
(token - is the current token, which is the current element just returned via the scanner)
loop
if token == "goto"
goto_stmt()
elseif token == "gosub"
gosub_stmt()
elseif token == .....
endloop
proc goto_stmt()
expect("goto") -- redundant, but used to skip over goto
if is_numeric(token)
--now, somehow set the instruction pointer at the requested line
else
error("expecting a line number, found '%s'\n", token)
end
end
proc expect(s)
if s == token
getsym()
return true
end
error("Expecting '%s', found: '%s'\n", curr_token, s)
end
See how simple it is? Really, the only hard thing to figure out in a simple interpreter is the handling of expressions. A good recipe for handling those is at: http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm Combined with the aforementioned references, you should have enough to handle the sort of expressions you would encounter in BASIC.
Ok, time for a concrete example. This is from a larger 'pure interpreter', that handles a enhanced version of Tiny BASIC (but big enough to run Tiny Star Trek :-) )
/*------------------------------------------------------------------------
Simple example, pure interpreter, only supports 'goto'
------------------------------------------------------------------------*/
#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>
#include <string.h>
#include <setjmp.h>
#include <ctype.h>
enum {False=0, True=1, Max_Lines=300, Max_Len=130};
char *text[Max_Lines+1]; /* array of program lines */
int textp; /* used by scanner - ptr in current line */
char tok[Max_Len+1]; /* the current token */
int cur_line; /* the current line number */
int ch; /* current character */
int num; /* populated if token is an integer */
jmp_buf restart;
int error(const char *fmt, ...) {
va_list ap;
char buf[200];
va_start(ap, fmt);
vsprintf(buf, fmt, ap);
va_end(ap);
printf("%s\n", buf);
longjmp(restart, 1);
return 0;
}
int is_eol(void) {
return ch == '\0' || ch == '\n';
}
void get_ch(void) {
ch = text[cur_line][textp];
if (!is_eol())
textp++;
}
void getsym(void) {
char *cp = tok;
while (ch <= ' ') {
if (is_eol()) {
*cp = '\0';
return;
}
get_ch();
}
if (isalpha(ch)) {
for (; !is_eol() && isalpha(ch); get_ch()) {
*cp++ = (char)ch;
}
*cp = '\0';
} else if (isdigit(ch)) {
for (; !is_eol() && isdigit(ch); get_ch()) {
*cp++ = (char)ch;
}
*cp = '\0';
num = atoi(tok);
} else
error("What? '%c'", ch);
}
void init_getsym(const int n) {
cur_line = n;
textp = 0;
ch = ' ';
getsym();
}
void skip_to_eol(void) {
tok[0] = '\0';
while (!is_eol())
get_ch();
}
int accept(const char s[]) {
if (strcmp(tok, s) == 0) {
getsym();
return True;
}
return False;
}
int expect(const char s[]) {
return accept(s) ? True : error("Expecting '%s', found: %s", s, tok);
}
int valid_line_num(void) {
if (num > 0 && num <= Max_Lines)
return True;
return error("Line number must be between 1 and %d", Max_Lines);
}
void goto_line(void) {
if (valid_line_num())
init_getsym(num);
}
void goto_stmt(void) {
if (isdigit(tok[0]))
goto_line();
else
error("Expecting line number, found: '%s'", tok);
}
void do_cmd(void) {
for (;;) {
while (tok[0] == '\0') {
if (cur_line == 0 || cur_line >= Max_Lines)
return;
init_getsym(cur_line + 1);
}
if (accept("bye")) {
printf("That's all folks!\n");
exit(0);
} else if (accept("run")) {
init_getsym(1);
} else if (accept("goto")) {
goto_stmt();
} else {
error("Unknown token '%s' at line %d", tok, cur_line); return;
}
}
}
int main() {
int i;
for (i = 0; i <= Max_Lines; i++) {
text[i] = calloc(sizeof(char), (Max_Len + 1));
}
setjmp(restart);
for (;;) {
printf("> ");
while (fgets(text[0], Max_Len, stdin) == NULL)
;
if (text[0][0] != '\0') {
init_getsym(0);
if (isdigit(tok[0])) {
if (valid_line_num())
strcpy(text[num], &text[0][textp]);
} else
do_cmd();
}
}
}
Hopefully, that will be enough to get you started. Have fun!
I will certainly get beaten by telling this ...but...:
First, I am actually working on a standalone library ( as a hobby ) that is made of:
a tokenizer, building linear (flat list) of tokens from the source text and following the same sequence as the text ( lexems created from the text flow ).
A parser by hands (syntax analyse; pseudo-compiler )
There is no "pseudo-code" nor "virtual CPU/machine".
Instructions(such as 'return', 'if' 'for' 'while'... then arithemtic expressions ) are represented by a base c++-struct/class and is the object itself. The base object, I name it atom, have a virtual method called "eval", among other common members, that is the "execution/branch" also by itself. So no matter I have an 'if' statement with its possible branchings ( single statement or bloc of statements/instructions ) as true or false condition, it will be called from the base virtual atom::eval() ... and so on for everything that is an atom.
Even 'objects' such as variables are 'atom'. 'eval()' will simply return its value from a variant container held by the atom itself ( pointer, refering to the 'local' variant instance (the instance variant iself) held the 'atom' or to another variant held by an atom that is created in a given 'bloc/stack'. So 'atom' are 'inplace' instructions/objects.
As of now, as an example, chunk of not really meaningful 'code' as below just works:
r = 5!; // 5! : (factorial of 5 )
Response = 1 + 4 - 6 * --r * ((3+5)*(3-4) * 78);
if (Response != 1){ /* '<>' also is not equal op. */
return r^3;
}
else{
return 0;
}
Expressions ( arithemtics ) are built into binary tree expression:
A = b+c; =>
=
/ \
A +
/ \
b c
So the 'instruction'/statement for expression like above is the tree-entry atom that in the above case, is the '=' (binary) operator.
The tree is built with atom::r0,r1,r2 :
atom 'A' :
r0
|
A
/ \
r1 r2
Regarding 'full-duplex' mecanism between c++ runtime and the 'script' library, I've made class_adaptor and adaptor<> :
ex.:
template<typename R, typename ...Args> adaptor_t<T,R, Args...>& import_method(const lstring& mname, R (T::*prop)(Args...)) { ... }
template<typename R, typename ...Args> adaptor_t<T,R, Args...>& import_property(const lstring& mname, R (T::*prop)(Args...)) { ... }
Second: I know there are plenty of tools and libs out there such as lua, boost::bind<*>, QML, JSON, etc... But in my situation, I need to create my very own [edit] 'independant' [/edit] lib for "live scripting". I was scared that my 'interpreter' could take a huge amount of RAM, but I am surprised that it is not as big as using QML,jscript or even lua :-)
Thank you :-)
Don't bother with hacking a parser together by hand. Use a parser generator. lex + yacc is the classic lexer/parser generator combination, but a Google search will reveal plenty of others.

ANTLR Tree Grammar for loops

I'm trying to implement a parser by directly reading a treeWalker and implementing the commands needed for the compiler on the fly. So if I have a command like:
statement
:
^('WRITE' expression)
{
//Here is the command that is created by my Tree Parser
ch.emitRO("OUT",0,0,0,"write out the value of ac");
//and then I handle it in my other classes
}
;
I want it to write OUT 0,0,0; to a file. That's my grammar.
I have a problem though with the loop section in my grammar it is:
'WHILE'^ expression 'DO' stat_seq 'ENDDO'
and in the tree parser:
doWhileStatement
:
^('WHILE' expression 'DO' stat_seq 'ENDDO')
;
What I want to do is directly parse the code from the while loop into the commands I need. I came up with this solution but it doesn't work:
doWhileStatement
:
^('WHILE' e=expression head='DO'
{
int loopHead =((CommonTree) head).getTokenStartIndex();
}
stat_seq
{
if ($e.result==1) {
input.seek(loopHead);
doWhileStatement();
}
}
'ENDDO')
;
for the record here are some of the other commands I've written:
(ignore the code written in brackets, it's for the generation of the commands in a text file.)
stat_seq
:
(statement)+
;
statement
:
^(':=' ID e=expression) { variables.put($ID.text,e); }
| ^('WRITE' expression)
{
ch.emitRM("LDC",ac,$expression.result,0,"pass the expression value to the ac reg");
ch.emitRO("OUT",ac,0,0,"write out the value of ac");
}
| ^('READ' ID)
{
ch.emitRO("IN",ac,0,0,"read value");
}
| ^('IF' expression 'THEN'
{
ch.emitRM("LDC",ac1,$expression.result,0,"pass the expression result to the ac reg");
int savedLoc1 = ch.emitSkip(1);
}
sseq1=stat_seq
'ELSE'
{
int savedLoc2 = ch.emitSkip(1);
ch.emitBackup(savedLoc1);
ch.emitRM("JEQ",ac1,savedLoc2+1,0,"skip as many places as needed depending on the expression");
ch.emitRestore();
}
sseq2=stat_seq
{
int savedLoc3 = ch.emitSkip(0);
ch.emitBackup(savedLoc2);
ch.emitRM("LDC",PC_REG,savedLoc3,0,"skip for the else command");
ch.emitRestore();
}
'ENDIF')
| doWhileStatement
;
Any help would be appreciated, thank you
I found it for everyone who has the same problem I did it like this and it's working:
^('WHILE'
{int c = input.index();}
expression
{int s=input.index();}
.* )// .* is a sequence of statements
{
int next = input.index(); // index of node following WHILE
input.seek(c);
match(input, Token.DOWN, null);
pushFollow(FOLLOW_expression_in_statement339);
int condition = expression();
state._fsp--;
//there is a problem here
//expression() seemed to be reading from the grammar file and I couldn't
//get it to read from the tree walker rule somehow
//It printed something like no viable alt at input 'DOWN'
//I googled it and found this mistake
// So I copied the code from the normal while statement
// And pasted it here and it works like a charm
// Normally there should only be int condition = expression()
while ( condition == 1 ) {
input.seek(s);
stat_seq();//stat_seq is a sequence of statements: (statement ';')+
input.seek(c);
match(input, Token.DOWN, null); //Copied value from EvaluatorWalker.java
//cause couldn't find another way to do it
pushFollow(FOLLOW_expression_in_statement339);
condition = expression();
state._fsp--;
System.out.println("condition:"+condition + " i:"+ variables.get("i"));
}
input.seek(next);
}
I wrote the problem at the comments of my code. If anyone can help me out and answer this for me how to do it I would be grateful. It's so weird that there is nearly no feedback on a correct way to implement loops within a tree grammar on the fly.
Regards,
Alex

How to parse template languages in Ragel?

I've been working on a parser for simple template language. I'm using Ragel.
The requirements are modest. I'm trying to find [[tags]] that can be embedded anywhere in the input string.
I'm trying to parse a simple template language, something that can have tags such as {{foo}} embedded within HTML. I tried several approaches to parse this but had to resort to using a Ragel scanner and use the inefficient approach of only matching a single character as a "catch all". I feel this is the wrong way to go about this. I'm essentially abusing the longest-match bias of the scanner to implement my default rule ( it can only be 1 char long, so it should always be the last resort ).
%%{
machine parser;
action start { tokstart = p; }
action on_tag { results << [:tag, data[tokstart..p]] }
action on_static { results << [:static, data[p..p]] }
tag = ('[[' lower+ ']]') >start #on_tag;
main := |*
tag;
any => on_static;
*|;
}%%
( actions written in ruby, but should be easy to understand ).
How would you go about writing a parser for such a simple language? Is Ragel maybe not the right tool? It seems you have to fight Ragel tooth and nails if the syntax is unpredictable such as this.
Ragel works fine. You just need to be careful about what you're matching. Your question uses both [[tag]] and {{tag}}, but your example uses [[tag]], so I figure that's what you're trying to treat as special.
What you want to do is eat text until you hit an open-bracket. If that bracket is followed by another bracket, then it's time to start eating lowercase characters till you hit a close-bracket. Since the text in the tag cannot include any bracket, you know that the only non-error character that can follow that close-bracket is another close-bracket. At that point, you're back where you started.
Well, that's a verbatim description of this machine:
tag = '[[' lower+ ']]';
main := (
(any - '[')* # eat text
('[' ^'[' | tag) # try to eat a tag
)*;
The tricky part is, where do you call your actions? I don't claim to have the best answer to that, but here's what I came up with:
static char *text_start;
%%{
machine parser;
action MarkStart { text_start = fpc; }
action PrintTextNode {
int text_len = fpc - text_start;
if (text_len > 0) {
printf("TEXT(%.*s)\n", text_len, text_start);
}
}
action PrintTagNode {
int text_len = fpc - text_start - 1; /* drop closing bracket */
printf("TAG(%.*s)\n", text_len, text_start);
}
tag = '[[' (lower+ >MarkStart) ']]' #PrintTagNode;
main := (
(any - '[')* >MarkStart %PrintTextNode
('[' ^'[' %PrintTextNode | tag) >MarkStart
)* #eof(PrintTextNode);
}%%
There are a few non-obvious things:
The eof action is needed because %PrintTextNode is only ever invoked on leaving a machine. If the input ends with normal text, there will be no input to make it leave that state. Because it will also be called when the input ends with a tag, and there is no final, unprinted text node, PrintTextNode tests that it has some text to print.
The %PrintTextNode action nestled in after the ^'[' is needed because, though we marked the start when we hit the [, after we hit a non-[, we'll start trying to parse anything again and remark the start point. We need to flush those two characters before that happens, hence that action invocation.
The full parser follows. I did it in C because that's what I know, but you should be able to turn it into whatever language you need pretty readily:
/* ragel so_tag.rl && gcc so_tag.c -o so_tag */
#include <stdio.h>
#include <string.h>
static char *text_start;
%%{
machine parser;
action MarkStart { text_start = fpc; }
action PrintTextNode {
int text_len = fpc - text_start;
if (text_len > 0) {
printf("TEXT(%.*s)\n", text_len, text_start);
}
}
action PrintTagNode {
int text_len = fpc - text_start - 1; /* drop closing bracket */
printf("TAG(%.*s)\n", text_len, text_start);
}
tag = '[[' (lower+ >MarkStart) ']]' #PrintTagNode;
main := (
(any - '[')* >MarkStart %PrintTextNode
('[' ^'[' %PrintTextNode | tag) >MarkStart
)* #eof(PrintTextNode);
}%%
%% write data;
int
main(void) {
char buffer[4096];
int cs;
char *p = NULL;
char *pe = NULL;
char *eof = NULL;
%% write init;
do {
size_t nread = fread(buffer, 1, sizeof(buffer), stdin);
p = buffer;
pe = p + nread;
if (nread < sizeof(buffer) && feof(stdin)) eof = pe;
%% write exec;
if (eof || cs == %%{ write error; }%%) break;
} while (1);
return 0;
}
Here's some test input:
[[header]]
<html>
<head><title>title</title></head>
<body>
<h1>[[headertext]]</h1>
<p>I am feeling very [[emotion]].</p>
<p>I like brackets: [ is cool. ] is cool. [] are cool. But [[tag]] is special.</p>
</body>
</html>
[[footer]]
And here's the output from the parser:
TAG(header)
TEXT(
<html>
<head><title>title</title></head>
<body>
<h1>)
TAG(headertext)
TEXT(</h1>
<p>I am feeling very )
TAG(emotion)
TEXT(.</p>
<p>I like brackets: )
TEXT([ )
TEXT(is cool. ] is cool. )
TEXT([])
TEXT( are cool. But )
TAG(tag)
TEXT( is special.</p>
</body>
</html>
)
TAG(footer)
TEXT(
)
The final text node contains only the newline at the end of the file.

Resources