Non-greedy multi-line and single-line matching - flex-lexer

I'm trying to modify a flex+bison generator to allow the inclusion of code snippets denoted by surrounding '{{' and '}}'. Unlike the multi-line comment case, I must capture all of the content.
My attempts either fail in the case where the '{{' and the '}}' are on the same line or they are painfully slow.
My first attempt was something like this:
%{
#include <stdio.h>
// sscce implementation of a growing string buffer
char codeBlock[4096];
int codeOffset;
const char* curFilename = "file.l";
extern int yylineno;
void add_code_line(const char* yytext)
{
codeOffset += sprintf(codeBlock + codeOffset, "#line %u \"%s\"\n\t%s\n", yylineno, curFilename, yytext);
}
%}
%option stack
%option yylineno
%x CODE_FRAG
%%
"{{"[ \n]* { codeOffset = 0; yy_push_state(CODE_FRAG); }
<CODE_FRAG>"}}" { codeBlock[codeOffset] = 0; printf("// code\n%s\n", codeBlock); yy_pop_state(); }
<CODE_FRAG>[^\n]* { add_code_line(yytext); }
<CODE_FRAG>\n
\n
.
Note: the "codeBlock" implementation is a contrivance for the purpose of an SSCCE only. It's not what I'm actually using.
This works for a simple test case:
{{ from line 1
from line 2
}}
{{
from line 7
}}
Outputs
// code
#line 1 "file.l"
from line 1
#line 2 "file.l"
from line 2
// code
#line 7 "file.l"
from line 7
But it can't handle
{{ hello }}
The two solutions I can think of are:
/* capture character-by-character */
<CODE_FRAG>. { add_code_character(yytext[0]); }
And
<INITIAL>"{{".*?"}}" { int n = strlen(yytext); yytext + (n - 2) = 0; add_code(yytext + 2); }
The former seems likely to be slow, and the latter just feels wrong.
Any ideas?
--- EDIT ---
The following appears to achieve the result desired, but I'm not sure if it's a "good" Flex way to do this:
"{{"[ \n]* { codeOffset = 0; yy_push_state(CODE_FRAG); }
<CODE_FRAG>"}}" { codeBlock[codeOffset] = 0; printf("// code\n%s\n", codeBlock); yy_pop_state(); }
<CODE_FRAG>.*?/"}}" { add_code_line(yytext); }
<CODE_FRAG>.*? { add_code_line(yytext); }
<CODE_FRAG>\n

Flex doesn't implement non-greedy matches. So .*? won't work the way you expect it to in flex. (It will be an optional .*, which is indistinguishable from .*)
Here's a regular expression which will match from {{ as far as possible without a }}:
"{{"([}]?[^}])*
That might not be what you want, since it won't allow nested {{...}} within your code blocks. However, you didn't mention that as a requirement and none of your examples functions that way.
The above regular expression does not match the closing }}, which appears to be what you want since it lets you call add_code(yytext+2) without modifying the temporary buffer. However, you do need to deal with the }} in your action. See below.
The regular expression above will match to the end of the file if there is no matching }}. You probably want to deal with that as an error; the simplest way is to check if EOF is encountered while you are trying to ignore the }}
"{{"([}]?[^}])* { add_code(yytext+2);
if (input() == EOF || input() == EOF) {
/* Produce an error, unclosed {{ */
}
}

Related

Match rule only at the begining of a file

Problem
I'm writing a sort of a script language interpreter.
I would like it to be able to handle (ignore) things like shebang, utf-bom, or other such thing that can appear on the beginning of a file.
The problem is that I cannot be sure that my growing grammar won't at some point have a rule that could match one of those things. (It's unlikely but you don't get reliable programs by just ignoring unlikely problems.)
Therefore, I would like to do it properly and ignore those things only if they are at the beginning of a file.
Let's focus on a shebang in the example.
I've written some simple grammar that illustrates the problems I'm facing.
Lexer:
%%
#!.+ { printf("shebang: \"%s\"\n", yytext + 2); }
[[:alnum:]_]+ { printf("id: %s\n", yytext); return 1; }
#[^#]*# { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]] ;
. { printf("error: '%c'\n", yytext[0]); }
%%
int main() { while (yylex()); return 0; }
int yywrap() { return 1; }
Input file:
#!my-program
# some multiline
thingy #
aaa bbb
ccc#!not a shebang#ddd
eee
Expected output:
shebang: "my-program"
thingy: "# some multiline
thingy #"
id: aaa
id: bbb
id: ccc
thingy: "#!not a shebang#"
id: ddd
id: eee
Actual output:
thingy: "#!my-program
#"
id: some
id: multiline
id: thingy
thingy: "#
aaa bbb
ccc#"
error: '!'
id: not
id: a
id: shebang
error: '#'
id: ddd
id: eee
My (bad?) solution
I figured that this is a good case to use start conditions.
I managed to use them to write a lexer that does work, however, it's rather ugly:
%s MAIN
%%
<INITIAL>#!.+ { printf("shebang: \"%s\"\n", yytext + 2); BEGIN(MAIN); }
<INITIAL>""/(?s:.) { BEGIN(MAIN); }
[[:alnum:]_]+ { printf("id: %s\n", yytext); return 1; }
<MAIN>#[^#]*# { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]] ;
. { printf("error: '%c'\n", yytext[0]); }
%%
int main() { while (yylex()); return 0; }
int yywrap() { return 1; }
Notice that I had to specify the start condition MAIN before the rule #[^#]*#.
It's because it would otherwise collide with the shebang rule #!.+.
Unfortunately, the INITIAL start condition is inclusive, which means I had to specifically exclude from it any rule that would cause problems. I have to remember about it every time I write a new rule (AKA I'll forget about it).
Is there some way to make the INITIAL exclusive or choose a different start condition to be the default?
Here's a simpler solution, assuming you're using Flex (as per your flex-lexer tag):
%option noinput nounput noyywrap nodefault yylineno
%{
#define YY_USER_INIT BEGIN(STARTUP);
%}
%x STARTUP
%%
<STARTUP>#!.* { BEGIN(INITIAL); printf("Shebang: \"%s\"\n", yytext+2); }
<STARTUP>.|\n { BEGIN(INITIAL); yyless(0); }
/* Rest is INITIAL */
[[:alnum:]_]+ { printf("id: %s\n", yytext); return 1; }
#[^#]*# { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]] ;
. { printf("error: '%c'\n", yytext[0]); }
Test:
rici$ flex -o shebang.c shebang.l
rici$ gcc -Wall -o shebang shebang.c -lfl
rici$ ./shebang <<"EOF"
> #!my-program
> # some multiline
> thingy #
> aaa bbb
> ccc#!not a shebang#ddd
> eee
> EOF
Shebang: "my-program"
thingy: "# some multiline
thingy #"
id: aaa
id: bbb
id: ccc
thingy: "#!not a shebang#"
id: ddd
id: eee
Notes:
The %option line:
prevents "Unused function" warnings;
removes the need for yywrap;
shows an error if there's some possible input which doesn't match any pattern;
counts input lines in the global yylineno
The macro YY_USER_INIT is executed precisely once, when the scanner starts up. It executes before any of Flex's initialization code; fortunately, Flex's initialization code does not change the start condition if it's already been set.
yyless(0) causes the current token to be rescanned. (The argument doesn't have to be 0; it truncates the current token to that length and efficiently puts the rest back into the input stream.)
The library -lfl includes yywrap() (although in this case, it's not used), and a simple main() definition rather similar to the one in your example.
(1) and (2) are Flex extensions. (3) and (4) should be available in any lex which conforms to Posix, with the exception that the Posix lex libary is linked with -ll.
There is an indirect way to select a different start condition.
The start condition is an integer variable (e.g. INITIAL is 0).
You can get its current value using a macro YY_START and if it equals INITIAL, change it to another value, effectively replacing it.
%x BEGINNING
%s MAIN
%%
%{
if (YY_START == INITIAL)
BEGIN(BEGINNING);
%}
<BEGINNING>#!.+ { printf("shebang: \"%s\"\n", yytext + 2); BEGIN(MAIN); }
<BEGINNING>""/(?s:.) { BEGIN(MAIN); }
[[:alnum:]_]+ { printf("id: %s\n", yytext); return 1; }
#[^#]*# { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]] ;
. { printf("error: '%c'\n", yytext[0]); }
%%
int main() { while (yylex()); return 0; }
int yywrap() { return 1; }
The disadvantages of this solution are:
The code block will execute every time you call yylex (not only the first time, when it's actually needed). The overhead is small enough to be ignored, though.
Flex will still generate the whole state machine as if INITIAL was used. I have no idea if this creates a lot of code or if it would matter in a big parser.

How to achieve capturing groups in flex lex?

I wanted to match for a string which starts with a '#', then matches everything until it matches the character that follows '#'. This can be achieved using capturing groups like this: #(.)[^(?1)]*(?1)(EDIT this regex is also erroneous). This matches #$foo$, does not match #%bar&, matches first 6 characters of #"foo"bar.
But since flex lex does not support capturing groups, what is the workaround here?
As you say, (f)lex does not support capturing groups, and it certainly doesn't support backreferences.
So there is no simple workaround, but there are workarounds. Here are a few possibilities:
You can read the input one character at a time using the input() function, until you find the matching character (but you have to create your own buffer to store the characters, because characters read by input() are not added to the current token). This is not the most efficient because reading one character at a time is a bit clunky, but it's the only interface that (f)lex offers. (The following snippet assumes you have some kind of expandable stringBuilder; if you are using C++, this would just be replaced with a std::string.)
#. { StringBuilder sb = string_builder_new();
int delim = yytext[1];
for (;;) {
int next = input();
if (next == delim) break;
if (next == EOF ) { /* Signal error */; break; }
string_builder_addchar(next);
}
yylval = string_builder_release();
return DELIMITED_STRING;
}
Even less efficiently, but perhaps more conveniently, you can get (f)lex to accumulate the characters in yytext using yymore(), matching one character at a time in a start condition:
%x DELIMITED
%%
int delim;
#. { delim = yytext[1]; BEGIN(DELIMITED); }
<DELIMITED>.|\n { if (yytext[0] == delim) {
yylval = strdup(yytext);
BEGIN(INITIAL);
return DELIMITED_STRING;
}
yymore();
}
<DELIMITED><<EOF>> { /* Signal unterminated string error */ }
The most efficient solution (in (f)lex) is to just write one rule for each possible delimiter. While that's a lot of rules, they could be easily generated with a small script in whatever scripting language you prefer. And, actually, there are not that many rules, particularly if you don't allow alphabetic and non-printing characters to be delimiters. This has the additional advantage that if you want Perl-like parenthetic delimiters (#(Hello) instead of #(Hello(), you can just modify the individual pattern to suit (as I've done below). [Note 1] Since all the actions are the same; it might be easier to use a macro for the action, making it easier to modify.
/* Ordinary punctuation */
#:[^:]*: { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
#:[^:]*: { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
#![^!]*! { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
#\.[^.]*\. { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
/* Matched pairs */
#<[^>]*> { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
#\[[^]]*] { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
/* Trap errors */
# { /* Report unmatched or invalid delimiter error */ }
If I were writing a script to generate these rules, I would use hexadecimal escapes for all the delimiter characters rather than trying to figure out which ones needed escapes.
Notes:
Perl requires nested balanced parentheses in constructs like that. But you can't do that with regular expressions; if you wanted to reproduce Perl behaviour, you'd need to use some variation on one of the other suggestions. I'll try to revisit this answer later to address that feature.

Reducing insane flex lexer expansion?

I have written a flex lexer to handle the text in BYOND's .dmi file format. The contents inside are (key, value) pairs delimited by '='. Valid keys are all essentially keywords (such as "width"), and invalid keys are not errors: they are just ignored.
Interestingly, the current state of BYOND's .dmi parser uses everything prior to the '=' as its keyword, and simply ignores any excess junk. This means "\twidth123" is recognized as "width".
The crux of my problem is in allowing for this irregularity. In doing so my generated lexer expands from ~40-50KB to ~13-14MB. For reference, I present the following contrived example:
%option c++ noyywrap
fill [^=#\n]*
%%
{fill}version{fill} { return 0; }
{fill}width{fill} { return 0; }
{fill}height{fill} { return 0; }
{fill}state{fill} { return 0; }
{fill}dirs{fill} { return 0; }
{fill}frames{fill} { return 0; }
{fill}delay{fill} { return 0; }
{fill}loop{fill} { return 0; }
{fill}rewind{fill} { return 0; }
{fill}movement{fill} { return 0; }
{fill}hotspot{fill} { return 0; }
%%
fill is the rule that is used to merge the keywords with "anything before the =". Running flex on the above yields a ~13MB lex.yy.cc on my computer. Simply removing the kleene star (*) in the fill rule yields a 45KB lex.yy.cc file; however, obviously, this then makes the lexer incorrect.
Are there any tricks, flex options, or lexer hacks to avoid this insane expansion? The only things I can think of are:
Disallow "width123" to represent "width", which is undesirable as then technically-correct files could not be parsed.
Make one rule that is simply [^=\n]+ to return some identifier token, and pick out the keyword in the parser. This seems suboptimal to me as well, particularly because different keywords have different value types and it seems most natural to be able to handle "'width' '=' INT" and "'version' '=' FLOAT" in the parser instead of "ID '=' VALUE" followed by picking out the keyword in the identifier, making sure the value is of the right type, etc.
I could make the rule {fill}(width|height|version|...){fill}, which does indeed keep the generated file small. However, while regular expression parsers tend to produce "captures," flex just gives me yytext and re-parsing that for a keyword to produce the desired token seems to be very undesirable in terms of algorithmic complexity.
Make fill a separate rule of its own that does nothing, and remove it from all the other rules, and separate its definition from whitespace for clarity:
whitespace [ \t\f]
fill [^#=\n]
%%
{whitespace}+ ;
{fill}+ ;
I would probably also avoid building the keywords into the lexer and just use an identifier [a-zA-Z]+ rule that does a table lookup. And finally add a rule to catch the =:
. return yytext[0];
to let the parser handle all special characters.
This is not really a problem flex is "good at", but it can be solved if it is precisely defined. In particular, it is important to know which of the keywords should be returned if the random string of letters before the = contains more than one keyword. For example, suppose the input is:
garbage_widtheight_moregarbage = 42
Now, is that setting the width or the height?
Remember that flex scanners will choose the rule with longest match, and of rules with equally long matches, the first one in the lexical description.
So the model presented in the OP:
fill [^=#\n]*
%%
{fill}width{fill} { return 0; }
{fill}height{fill} { return 0; }
/* SNIP */
will always prefer width to height, because the matches will be the same length (both terminate at the last character before the =), and the width pattern comes first in the file. If the rules were written in the opposite order, height would be preferred.
On the other hand, if you removed the second {fill}:
{fill}width{fill} { return 0; }
{fill}height{fill} { return 0; }
then the last keyword in the input (in this case, height) will be preferred, because that one has the longer match.
The most likely requirement, however, is that the first keyword be recognized, so neither of the preceding will work. In order to match the first keyword, it is necessary to first match the shortest possible sequence of {fill}. And since flex does not implement non-greedy repetition, that can only be done with a character-by-character span.
Here's an example, using start conditions. Note that we hold onto the keyword token until we actually find the =, in case the = is not found.
/* INITIAL: beginning of a line
* FIND_EQUAL: keyword recognized, looking for the =
* VALUE: = recognized, lexing the right-hand side
* NEXT_LINE: find the next line and continue the scan
*/
%x FIND_EQUAL VALUE
%%
int keyword;
"[#=]".* /* Skip comments and lines with no recognizable keyword */
version { keyword = KW_VERSION; BEGIN(FIND_EQUAL); }
width { keyword = KW_WIDTH; BEGIN(FIND_EQUAL); }
height { keyword = KW_HEIGHT; BEGIN(FIND_EQUAL); }
/* etc. */
.|\n /* Skip any other single character, or newline */
<FIND_EQUAL>{
[^=#\n]*"=" { BEGIN(VALUE); return keyword; }
"#".* { BEGIN(INITIAL); }
\n { BEGIN(INITIAL); }
}
<VALUE>{
"#".* { BEGIN(INITIAL); }
\n { BEGIN(INITIAL); }
[[:blank:]]+ ; /* Ignore space and tab characters */
[[:digit:]]+ { yylval.ival = atoi(yytext);
BEGIN(NEXT_LINE); return INTEGER;
}
[[:digit:]]+"."[[:digit:]]*|"."[[:digit:]]+ {
yylval.fval = atod(yytext);
BEGIN(NEXT_LINE); return FLOAT;
}
\"([^"]|\\.)*\" { char* s = malloc(yyleng - 1);
yylval.sval = s;
/* Remove quotes and escape characters */
yytext[yyleng - 1] = '\0';
do {
if (*++yytext == '\\') ++yytext;
*s++ = *yytext;
} while (*yytext);
BEGIN(NEXT_LINE); return STRING;
}
/* Other possible value token types */
. BEGIN(NEXT_LINE); /* bad character in value */
}
<NEXT_LINE>.*\n? BEGIN(INITIAL);
In the escape-removal code, you might want to translate things like \n. And you might also want to avoid string values with physical newlines. And a bunch of etceteras. It's only intended as a model.

How to parse template languages in Ragel?

I've been working on a parser for simple template language. I'm using Ragel.
The requirements are modest. I'm trying to find [[tags]] that can be embedded anywhere in the input string.
I'm trying to parse a simple template language, something that can have tags such as {{foo}} embedded within HTML. I tried several approaches to parse this but had to resort to using a Ragel scanner and use the inefficient approach of only matching a single character as a "catch all". I feel this is the wrong way to go about this. I'm essentially abusing the longest-match bias of the scanner to implement my default rule ( it can only be 1 char long, so it should always be the last resort ).
%%{
machine parser;
action start { tokstart = p; }
action on_tag { results << [:tag, data[tokstart..p]] }
action on_static { results << [:static, data[p..p]] }
tag = ('[[' lower+ ']]') >start #on_tag;
main := |*
tag;
any => on_static;
*|;
}%%
( actions written in ruby, but should be easy to understand ).
How would you go about writing a parser for such a simple language? Is Ragel maybe not the right tool? It seems you have to fight Ragel tooth and nails if the syntax is unpredictable such as this.
Ragel works fine. You just need to be careful about what you're matching. Your question uses both [[tag]] and {{tag}}, but your example uses [[tag]], so I figure that's what you're trying to treat as special.
What you want to do is eat text until you hit an open-bracket. If that bracket is followed by another bracket, then it's time to start eating lowercase characters till you hit a close-bracket. Since the text in the tag cannot include any bracket, you know that the only non-error character that can follow that close-bracket is another close-bracket. At that point, you're back where you started.
Well, that's a verbatim description of this machine:
tag = '[[' lower+ ']]';
main := (
(any - '[')* # eat text
('[' ^'[' | tag) # try to eat a tag
)*;
The tricky part is, where do you call your actions? I don't claim to have the best answer to that, but here's what I came up with:
static char *text_start;
%%{
machine parser;
action MarkStart { text_start = fpc; }
action PrintTextNode {
int text_len = fpc - text_start;
if (text_len > 0) {
printf("TEXT(%.*s)\n", text_len, text_start);
}
}
action PrintTagNode {
int text_len = fpc - text_start - 1; /* drop closing bracket */
printf("TAG(%.*s)\n", text_len, text_start);
}
tag = '[[' (lower+ >MarkStart) ']]' #PrintTagNode;
main := (
(any - '[')* >MarkStart %PrintTextNode
('[' ^'[' %PrintTextNode | tag) >MarkStart
)* #eof(PrintTextNode);
}%%
There are a few non-obvious things:
The eof action is needed because %PrintTextNode is only ever invoked on leaving a machine. If the input ends with normal text, there will be no input to make it leave that state. Because it will also be called when the input ends with a tag, and there is no final, unprinted text node, PrintTextNode tests that it has some text to print.
The %PrintTextNode action nestled in after the ^'[' is needed because, though we marked the start when we hit the [, after we hit a non-[, we'll start trying to parse anything again and remark the start point. We need to flush those two characters before that happens, hence that action invocation.
The full parser follows. I did it in C because that's what I know, but you should be able to turn it into whatever language you need pretty readily:
/* ragel so_tag.rl && gcc so_tag.c -o so_tag */
#include <stdio.h>
#include <string.h>
static char *text_start;
%%{
machine parser;
action MarkStart { text_start = fpc; }
action PrintTextNode {
int text_len = fpc - text_start;
if (text_len > 0) {
printf("TEXT(%.*s)\n", text_len, text_start);
}
}
action PrintTagNode {
int text_len = fpc - text_start - 1; /* drop closing bracket */
printf("TAG(%.*s)\n", text_len, text_start);
}
tag = '[[' (lower+ >MarkStart) ']]' #PrintTagNode;
main := (
(any - '[')* >MarkStart %PrintTextNode
('[' ^'[' %PrintTextNode | tag) >MarkStart
)* #eof(PrintTextNode);
}%%
%% write data;
int
main(void) {
char buffer[4096];
int cs;
char *p = NULL;
char *pe = NULL;
char *eof = NULL;
%% write init;
do {
size_t nread = fread(buffer, 1, sizeof(buffer), stdin);
p = buffer;
pe = p + nread;
if (nread < sizeof(buffer) && feof(stdin)) eof = pe;
%% write exec;
if (eof || cs == %%{ write error; }%%) break;
} while (1);
return 0;
}
Here's some test input:
[[header]]
<html>
<head><title>title</title></head>
<body>
<h1>[[headertext]]</h1>
<p>I am feeling very [[emotion]].</p>
<p>I like brackets: [ is cool. ] is cool. [] are cool. But [[tag]] is special.</p>
</body>
</html>
[[footer]]
And here's the output from the parser:
TAG(header)
TEXT(
<html>
<head><title>title</title></head>
<body>
<h1>)
TAG(headertext)
TEXT(</h1>
<p>I am feeling very )
TAG(emotion)
TEXT(.</p>
<p>I like brackets: )
TEXT([ )
TEXT(is cool. ] is cool. )
TEXT([])
TEXT( are cool. But )
TAG(tag)
TEXT( is special.</p>
</body>
</html>
)
TAG(footer)
TEXT(
)
The final text node contains only the newline at the end of the file.

How to find functions in a cpp file that contain a specific word

using grep, vim's grep, or another unix shell command, I'd like to find the functions in a large cpp file that contain a specific word in their body.
In the files that I'm working with the word I'm looking for is on an indented line, the corresponding function header is the first line above the indented line that starts at position 0 and is not a '{'.
For example searching for JOHN_DOE in the following code snippet
int foo ( int arg1 )
{
/// code
}
void bar ( std::string arg2 )
{
/// code
aFunctionCall( JOHN_DOE );
/// more code
}
should give me
void bar ( std::string arg2 )
The algorithm that I hope to catch in grep/vim/unix shell scripts would probably best use the indentation and formatting assumptions, rather than attempting to parse C/C++.
Thanks for your suggestions.
I'll probably get voted down for this!
I am an avid (G)VIM user but when I want to review or understand some code I use Source Insight. I almost never use it as an actual editor though.
It does exactly what you want in this case, e.g. show all the functions/methods that use some highlighted data type/define/constant/etc... in a relations window...
(source: sourceinsight.com)
Ouch! There goes my rep.
As far as I know, this can't be done. Here's why:
First, you have to search across lines. No problem, in vim adding a _ to a character class tells it to include new lines. so {_.*} would match everything between those brackets across multiple lines.
So now you need to match whatever the pattern is for a function header(brittle even if you get it to work), then , and here's the problem, whatever lines are between it and your search string, and finally match your search string. So you might have a regex like
/^\(void \+\a\+ *(.*)\)\_.*JOHN_DOE
But what happens is the first time vim finds a function header, it starts matching. It then matches every character until it finds JOHN_DOE. Which includes all the function headers in the file.
So the problem is that, as far as I know, there's no way to tell vim to match every character except for this regex pattern. And even if there was, a regex is not the tool for this job. It's like opening a beer with a hammer. What we should do is write a simple script that gives you this info, and I have.
fun! FindMyFunction(searchPattern, funcPattern)
call search(a:searchPattern)
let lineNumber = line(".")
let lineNumber = lineNumber - 1
"call setpos(".", [0, lineNumber, 0, 0])
let lineString = getline(lineNumber)
while lineString !~ a:funcPattern
let lineNumber = lineNumber - 1
if lineNumber < 0
echo "Function not found :/"
endif
let lineString = getline(lineNumber)
endwhile
echo lineString
endfunction
That should give you the result you want and it's way easier to share, debug, and repurpose than a regular expression spit from the mouth of Cthulhu himself.
Tough call, although as a starting point I would suggest this wonderful VIM Regex Tutorial.
You cannot do that reliably with a regular expression, because code is not a regular language. You need a real parser for the language in question.
Arggh! I admit this is a bit over the top:
A little program to filter stdin, strip comments, and put function bodies on the same line. It'll get fooled by namespaces and function definitions inside class declarations, besides other things. But it might be a good start:
#include <stdio.h>
#include <assert.h>
int main() {
enum {
NORMAL,
LINE_COMMENT,
MULTI_COMMENT,
IN_STRING,
} state = NORMAL;
unsigned depth = 0;
for(char c=getchar(),prev=0; !feof(stdin); prev=c,c=getchar()) {
switch(state) {
case NORMAL:
if('/'==c && '/'==prev)
state = LINE_COMMENT;
else if('*'==c && '/'==prev)
state = MULTI_COMMENT;
else if('#'==c)
state = LINE_COMMENT;
else if('\"'==c) {
state = IN_STRING;
putchar(c);
} else {
if(('}'==c && !--depth) || (';'==c && !depth)) {
putchar(c);
putchar('\n');
} else {
if('{'==c)
depth++;
else if('/'==prev && NORMAL==state)
putchar(prev);
else if('\t'==c)
c = ' ';
if(' '==c && ' '!=prev)
putchar(c);
else if(' '<c && '/'!=c)
putchar(c);
}
}
break;
case LINE_COMMENT:
if(' '>c)
state = NORMAL;
break;
case MULTI_COMMENT:
if('/'==c && '*'==prev) {
c = '\0';
state = NORMAL;
}
break;
case IN_STRING:
if('\"'==c && '\\'!=prev)
state = NORMAL;
putchar(c);
break;
default:
assert(!"bug");
}
}
putchar('\n');
return 0;
}
Its c++, so just it in a file, compile it to a file named 'stripper', and then:
cat my_source.cpp | ./stripper | grep JOHN_DOE
So consider the input:
int foo ( int arg1 )
{
/// code
}
void bar ( std::string arg2 )
{
/// code
aFunctionCall( JOHN_DOE );
/// more code
}
The output of "cat example.cpp | ./stripper" is:
int foo ( int arg1 ) { }
void bar ( std::string arg2 ){ aFunctionCall( JOHN_DOE ); }
The output of "cat example.cpp | ./stripper | grep JOHN_DOE" is:
void bar ( std::string arg2 ){ aFunctionCall( JOHN_DOE ); }
The job of finding the function name (guess its the last identifier to precede a "(") is left as an exercise to the reader.
For that kind of stuff, although it comes to primitive searching again, I would recommend compview plugin. It will open up a search window, so you can see the entire line where the search occured and automatically jump to it. Gives a nice overview.
(source: axisym3.net)
Like Robert said Regex will help. In command mode start a regex search by typing the "/" character followed by your regex.
Ctags1 may also be of use to you. It can generate a tag file for a project. This tag file allows a user to jump directly from a function call to it's definition even if it's in another file using "CTRL+]".
u can use grep -r -n -H JOHN_DOE * it will look for "JOHN_DOE" in the files recursively starting from the current directory
you can use the following code to practically find the function which contains the text expression:
public void findFunction(File file, String expression) {
Reader r = null;
try {
r = new FileReader(file);
} catch (FileNotFoundException ex) {
ex.printStackTrace();
}
BufferedReader br = new BufferedReader(r);
String match = "";
String lineWithNameOfFunction = "";
Boolean matchFound = false;
try {
while(br.read() > 0) {
match = br.readLine();
if((match.endsWith(") {")) ||
(match.endsWith("){")) ||
(match.endsWith("()")) ||
(match.endsWith(")")) ||
(match.endsWith("( )"))) {
// this here is because i guessed that method will start
// at the 0
if((match.charAt(0)!=' ') && !(match.startsWith("\t"))) {
lineWithNameOfFunction = match;
}
}
if(match.contains(expression)) {
matchFound = true;
break;
}
}
if(matchFound)
System.out.println(lineWithNameOfFunction);
else
System.out.println("No matching function found");
} catch (IOException ex) {
ex.printStackTrace();
}
}
i wrote this in JAVA, tested it and works like a charm. has few drawbacks though, but for starters it's fine. didn't add support for multiple functions containing same expression and maybe some other things. try it.

Resources