I've been working on a Lua fslex lexer in my spare time, using the ocamllex manual as a reference.
I hit a few snags while trying to tokenize long strings correctly. "Long strings" are delimited by '[' ('=')* '[' and ']' ('=')* ']' tokens, and the number of = signs in the opening and closing delimiters must match; that count is the string's "level" (for example, [[foo]] is level 0 and [==[foo]==] is level 2).
In the first implementation, the lexer did not recognize [[ patterns, producing two LBRACKET tokens despite the longest-match rule, whereas [=[ and its variations were recognized correctly. In addition, the regular expression failed to ensure that the matching closing delimiter is used, stopping at the first ']' ('=')* ']' occurrence regardless of the long string's actual level. Also, fslex does not seem to support "as" constructs in regular expressions.
let lualongstring = '[' ('=')* '[' ( escapeseq | [^ '\\' '[' ] )* ']' ('=')* ']'
(* ... *)
| lualongstring { (* ... *) }
| '[' { LBRACKET }
| ']' { RBRACKET }
(* ... *)
I've been trying to solve the issue with another rule in the lexer:
rule tokenize = parse
(* ... *)
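(* getLongStringLevel is a helper (not shown here) assumed to count the '=' signs in the lexeme *)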
| '[' ('=')* '[' { longstring (getLongStringLevel(lexeme lexbuf)) lexbuf }
(* ... *)
and longstring level = parse
| ']' ('=')* ']' { (* check level, do something *) }
| _ { (* aggregate other chars *) }
(* or *)
| _ {
let c = lexbuf.LexemeChar(0);  (* LexemeChar, assuming the PowerPack LexBuffer API *)
(* ... *)
}
But I'm stuck, for two reasons: first, I don't think I can "push", so to speak, a token to the next rule once I'm done reading the long string; second, I don't like the idea of reading char by char until the right closing token is found, making the current design useless.
How can I tokenize Lua long strings in fslex? Thanks for reading.
Apologies if I answer my own question, but I'd like to contribute with my own solution to the problem for future reference.
I am keeping state across lexer function calls with the LexBuffer<_>.BufferLocalStore property, which is simply a writeable IDictionary instance.
Note: long brackets are used both by long strings and by multiline comments (e.g. --[==[ ... ]==]). This is an often-overlooked part of the Lua grammar.
let beginlongbracket = '[' ('=')* '['
let endlongbracket = ']' ('=')* ']'
rule tokenize = parse
| beginlongbracket
    { longstring (longBracketLevel(lexeme lexbuf)) lexbuf }
(* ... *)
and longstring level = parse
| endlongbracket
    { if longBracketLevel(lexeme lexbuf) = level then
          LUASTRING(endLongString(lexbuf))
      else
          longstring level lexbuf }
| _
    { toLongString lexbuf (lexeme lexbuf); longstring level lexbuf }
| eof
    { failwith "Unexpected end of file in string." }
Here are the functions I use to simplify storing data into the BufferLocalStore:
(* opens assumed for these helpers: Count comes from LINQ, LexBuffer from the F# PowerPack lexing runtime *)
open System.Linq
open System.Text
open Microsoft.FSharp.Text.Lexing

(* The level of a long bracket is the number of '=' signs it contains. *)
let longBracketLevel (str : string) =
    str.Count(fun c -> c = '=')

(* Create the StringBuilder that accumulates the string contents and stash it
   in the buffer-local store under the "longstring" key. *)
let createLongStringStorage (lexbuf : LexBuffer<_>) =
    let sb = new StringBuilder(1000)
    lexbuf.BufferLocalStore.["longstring"] <- box sb
    sb

(* Append the current lexeme's character, creating the storage on first use. *)
let toLongString (lexbuf : LexBuffer<_>) (c : string) =
    let hasString, sb = lexbuf.BufferLocalStore.TryGetValue("longstring")
    let storage = if hasString then (sb :?> StringBuilder) else (createLongStringStorage lexbuf)
    storage.Append(c.[0]) |> ignore

(* Return the accumulated string and clear the buffer-local store. *)
let endLongString (lexbuf : LexBuffer<_>) : string =
    let hasString, sb = lexbuf.BufferLocalStore.TryGetValue("longstring")
    let ret = if not hasString then "" else (sb :?> StringBuilder).ToString()
    lexbuf.BufferLocalStore.Remove("longstring") |> ignore
    ret
Perhaps it's not very functional, but it seems to be getting the job done. In summary:

- use the tokenize rule until the beginning of a long bracket is found
- switch to the longstring rule and loop until a closing long bracket of the same level is found
- store every lexeme that does not match a closing long bracket of the same level into a StringBuilder, which is in turn stored in the LexBuffer's BufferLocalStore
- once the long string is over, clear the BufferLocalStore
Edit: You can find the project at http://ironlua.codeplex.com. Lexing and parsing should be okay. I am planning on using the DLR. Comments and constructive criticism welcome.
I am trying to parse logical BNF statements and to apply parentheses to them.
For example:
I am trying to parse the statement a=>b<=>c&d as ((a)=>(b))<=>((c)&(d)), and similar statements as well.
Problem: some statements work fine, while others do not. The example above is not working; the output printed is ((c)&(d))<=>((c)&(d)). The second expr seems to be overriding the first one.
What works: other, simpler examples like a<=>b and a|(b&c) work fine.
I think I have made some basic error in my code that I cannot figure out.
Here is my code
lex file
letters [a-zA-Z]
identifier {letters}+
operator (?:<=>|=>|\||&|!)
separator [\(\)]
%%
{identifier} {
yylval.s = strdup(yytext);
return IDENTIFIER; }
{operator} { return *yytext; }
{separator} { return *yytext; }
[\n] { return *yytext; }
%%
yacc file
%start program
%union {char* s;}
%type <s> program expr IDENTIFIER
%token IDENTIFIER
%left '<=>'
%left '=>'
%left '|'
%left '&'
%right '!'
%left '(' ')'
%%
program : expr '\n'
{
cout<<$$;
exit(0);
}
;
expr : IDENTIFIER {
cout<<" Atom ";
cout<<$1<<endl;
string s1 = string($1);
cout<<$$<<endl;
}
| expr '<=>' expr {
cout<<"Inside <=>\n";
string s1 = string($1);
string s2 = string($3);
string s3 = "(" + s1 +")" +"<=>"+"(" + s2 +")";
$$ = (char *)s3.c_str();
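/* NOTE: s3 is a local std::string that is destroyed when this action returns,
   so the pointer stored in $$ dangles (see the answer below). */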
cout<<s3<<endl;
}
| expr '=>' expr {
cout<<"Inside =>\n";
string s1 = string($1);
string s2 = string($3);
string s3 = "(" + s1 +")" +"=>"+"(" + s2 +")";
$$ = (char *)s3.c_str();
cout<<$$<<endl;
}
| expr '|' expr {
cout<<"Inside |\n";
string s1 = string($1);
string s2 = string($3);
string s3 = "(" + s1 +")" +"|"+"(" + s2 +")";
$$ = (char *)s3.c_str();
cout<<$$<<endl;
}
| expr '&' expr {
cout<<"Inside &\n";
string s1 = string($1);
string s2 = string($3);
string s3 = "(" + s1 +")" +"&"+"(" + s2 +")";
$$ = (char *)s3.c_str();
cout<<$$<<endl;
}
| '!' expr {
cout<<"Inside !\n";
string s1 = string($2);
cout<<s1<<endl;
string s2 = "!" + s1;
$$ = (char *)s2.c_str();
cout<<$$<<endl;
}
| '(' expr ')' { $$ = $2; cout<<"INSIDE BRACKETS"; }
;
%%
Please let me know the mistake I have made.
Thank you
The basic problem you have is that you save the pointer returned by string::c_str() on the yacc value stack, but after the action finishes and the string object is destroyed, that pointer is no longer valid.
To fix this you need to either not use std::string at all, or change your %union to { std::string *s; } (instead of char *). In either case you will have to manage the memory to avoid leaks. If you are using Linux (or anything else with the GNU asprintf extension), the former is pretty easy. Your actions would become something like:
| expr '<=>' expr {
cout<<"Inside <=>\n";
asprintf(&$$, "(%s)<=>(%s)", $1, $3);
cout<<$$<<endl;
free($1);
free($3);
}
For the latter, the action would look like:
| expr '<=>' expr {
cout<<"Inside <=>\n";
$$ = new string("(" + *$1 + ")" + "<=>" + "(" + *$3 + ")");
cout<<*$$<<endl;
delete $1;
delete $3;
}
I'm writing a parser for a specific file format using FParsec, as a first-ish foray into learning F#. Part of the file has the following format:
{ 123 456 789 333 }
Where the numbers in the brackets are pairs of values and there can be an arbitrary number of spaces to separate them. So these would also be valid things to parse:
{ 22 456 7 333 }
And of course the content of the brackets might be empty, i.e. {}
In addition, I want the parser to be able to handle the case where the content is a bit malformed, e.g. { some descriptive text } or maybe more likely { 12 3 4} (invalid, since the 4 wouldn't be paired with anything). In this case I just want the contents saved to be processed separately.
I have this so far:
type DimNummer = int
type ObjektNummer = int
type DimObjektPair = DimNummer * ObjektNummer
type ObjektListResult = Result<DimObjektPair list, string>
let sieObjektLista =
    let pnum = numberLiteral NumberLiteralOptions.None "dimOrObj"
    let ws = spaces
    let pobj = pnum .>> ws |>> fun x ->
        let on: ObjektNummer = int x.String
        on
    let pdim = pnum |>> fun x ->
        let dim: DimNummer = int x.String
        dim
    let pdimObj = (pdim .>> spaces1) .>>. pobj |>> DimObjektPair
    let toObjektLista (objList: list<DimObjektPair>) =
        let res: ObjektListResult = Result.Ok objList
        res
    let pdimObjs = sepBy pdimObj spaces1
    let validList = pdimObjs |>> toObjektLista
    let toInvalid (str: string) =
        let res: ObjektListResult =
            match str.Trim(' ') with
            | "" -> Result.Ok []
            | _ -> Result.Error str
        res
    let invalidList = manyChars anyChar |>> toInvalid
    // '{', optional spaces, then either a well-formed pair list or the raw
    // contents as a fallback, optional trailing spaces, '}'
    let pres = between (pchar '{') (pchar '}') (ws >>. (validList <|> invalidList) .>> ws)
    pres
let parseSieObjektLista = run sieObjektLista
However, running this on a valid sample I get an error:
{ 53735 7785 86231 36732 }
^
Expecting: whitespace or '}'
You're trying to consume too many spaces.
Look: pdimObj is a pdim, followed by some spaces, followed by pobj, which is itself a pnum followed by some spaces. So if you look at the first part of the input:
{ 53735 7785 86231 36732 }
  \___/^\__/^
    |  |  | |
    |  |  | +-- ws      (the space before 86231, consumed as part of pobj)
    |  |  +---- pnum    (part of pobj)
    |  +------- spaces1
    +---------- pdim    (a pnum)
  \_________/
    pdimObj
One can clearly see from here that pdimObj consumes everything up to 86231, including the space just before it. And therefore, when sepBy inside pdimObjs looks for the next separator (which is spaces1), it can't find any. So it fails.
The smallest way to fix this is to make pdimObjs use many instead of sepBy: since pobj already consumes trailing spaces, there is no need to also consume them in sepBy:
let pdimObjs = many pdimObj
But a cleaner way, in my opinion, would be to remove ws from pobj, because, intuitively, trailing spaces aren't part of the number representing your object (whatever that is), and instead handle possible trailing spaces in pdimObjs via sepEndBy:
let pobj = pnum |>> fun x ->
    let on: ObjektNummer = int x.String
    on
...
let pdimObjs = sepEndBy pdimObj spaces1
The main problem here is in pdimObjs. The sepBy parser fails because the separator spaces following each number have already been consumed by pobj, so spaces1 cannot succeed. Instead, I suggest you try this:
let pdimObjs = many pdimObj
Which gives the following result on your test input:
Success: Ok [(53735, 7785); (86231, 36732)]
Well, I'm writing my first parser, in OCaml, and I immediately somehow managed to make one with an infinite loop.
Of particular note, I'm trying to lex identifiers according to the rules of the Scheme specification (I have no idea what I'm doing, obviously) — and there's some language in there about identifiers requiring that they are followed by a delimiter. My approach, right now, is to have a delimited_identifier regex that includes one of the delimiter characters, that should not be consumed by the main lexer … and then once that's been matched, the reading of that lexeme is reverted by Sedlexing.rollback (well, my wrapper thereof), before being passed to a sublexer that only eats the actual identifier, hopefully leaving the delimiter in the buffer to be eaten as a different lexeme by the parent lexer.
I'm using Menhir and Sedlex, mostly synthesizing the examples from @smolkaj's ocaml-parsing example repo and RWO's parsing chapter; here's the simplest reduction of my current parser and lexer:
%token LPAR RPAR LVEC APOS TICK COMMA COMMA_AT DQUO SEMI EOF
%token <string> IDENTIFIER
(* %token <bool> BOOL *)
(* %token <int> NUM10 *)
(* %token <string> STREL *)
%start <Parser.AST.t> program
%%
program:
| p = list(expression); EOF { p }
;
expression:
| i = IDENTIFIER { Parser.AST.Atom i }
%%
… and …
(** Regular expressions *)
let newline = [%sedlex.regexp? '\r' | '\n' | "\r\n" ]
let whitespace = [%sedlex.regexp? ' ' | newline ]
let delimiter = [%sedlex.regexp? eof | whitespace | '(' | ')' | '"' | ';' ]
let digit = [%sedlex.regexp? '0'..'9']
let letter = [%sedlex.regexp? 'A'..'Z' | 'a'..'z']
let special_initial = [%sedlex.regexp?
'!' | '$' | '%' | '&' | '*' | '/' | ':' | '<' | '=' | '>' | '?' | '^' | '_' | '~' ]
let initial = [%sedlex.regexp? letter | special_initial ]
let special_subsequent = [%sedlex.regexp? '+' | '-' | '.' | '#' ]
let subsequent = [%sedlex.regexp? initial | digit | special_subsequent ]
let peculiar_identifier = [%sedlex.regexp? '+' | '-' | "..." ]
let identifier = [%sedlex.regexp? initial, Star subsequent | peculiar_identifier ]
let delimited_identifier = [%sedlex.regexp? identifier, delimiter ]
(** Swallow whitespace and comments. *)
let rec swallow_atmosphere buf =
  match%sedlex buf with
  | Plus whitespace -> swallow_atmosphere buf
  | ";" -> swallow_comment buf
  | _ -> ()

and swallow_comment buf =
  match%sedlex buf with
  | newline -> swallow_atmosphere buf
  | any -> swallow_comment buf
  | _ -> assert false
(** Return the next token. *)
let rec token buf =
  swallow_atmosphere buf;
  match%sedlex buf with
  | eof -> EOF
  | delimited_identifier ->
      Sedlexing.rollback buf;
      identifier buf
  | '(' -> LPAR
  | ')' -> RPAR
  | _ -> illegal buf (Char.chr (next buf))

and identifier buf =
  match%sedlex buf with
  | _ -> IDENTIFIER (Sedlexing.Utf8.lexeme buf)
(Yes, it's basically a no-op / the simplest thing possible rn. I'm trying to learn! :x)
Unfortunately, this combination results in an infinite loop in the parsing automaton:
State 0:
Lookahead token is now IDENTIFIER (1-1)
Shifting (IDENTIFIER) to state 1
State 1:
Lookahead token is now IDENTIFIER (1-1)
Reducing production expression -> IDENTIFIER
State 5:
Shifting (IDENTIFIER) to state 1
State 1:
Lookahead token is now IDENTIFIER (1-1)
Reducing production expression -> IDENTIFIER
State 5:
Shifting (IDENTIFIER) to state 1
State 1:
...
I'm new to parsing and lexing and all this; any advice would be welcome. I'm sure it's just a newbie mistake, but …
Thanks!
As said before, implementing too much logic inside the lexer is a bad idea.
However, the infinite loop does not come from the rollback but from your definition of identifier:
identifier buf =
match%sedlex buf with
| _ -> IDENTIFIER (Sedlexing.Utf8.lexeme buf)
Within this definition, _ matches the shortest possible word in the language of all possible characters, which is the empty word ε. In other words, _ always succeeds without consuming any input, so the lexer keeps returning empty IDENTIFIER tokens and sends the parser into an infinite loop.
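A minimal fix along those lines (a sketch, assuming the identifier regexp defined earlier is in scope) is to have the sublexer match that pattern explicitly, so it must consume at least one character:

and identifier buf =
  match%sedlex buf with
  | identifier -> IDENTIFIER (Sedlexing.Utf8.lexeme buf)
  | _ -> assert false (* unreachable: token only calls this after matching delimited_identifier *)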
Trying to build a grammar that will parse simple bool expressions.
I am running into an issue when there are multiple expressions.
I need to be able to parse 1..n and/or'ed expressions.
Each example below is a complete expression:
(myitem.isavailable("1234") or myitem.ispresent("1234")) and
myitem.isready("1234")
myitem.value > 4 and myitem.value < 10
myitem.value = yes or myotheritem.value = no
Grammar:
#start = conditionalexpression* | expressiontypes;
conditionalexpression = condition expressiontypes;
expressiontypes = expression | functionexpression;
expression = itemname dot property condition value;
functionexpression = itemname dot functionproperty;
itemname = Word;
propertytypes = property | functionproperty;
property = Word;
functionproperty = Word '(' value ')';
value = Word | QuotedString | Number;
condition = textcondition;
dot = '.';
textcondition = 'or' | 'and' | '<' | '>' | '=';
Developer of ParseKit here.
Here is a ParseKit grammar that matches your example input:
#start = expr;
expr = orExpr;
orExpr = andExpr orTerm*;
orTerm = 'or' andExpr;
// 'and' should bind more tightly than 'or'
andExpr = relExpr andTerm*;
andTerm = 'and' relExpr;
// relational expressions should bind more tightly than 'and'/'or'
relExpr = callExpr relTerm*;
relTerm = relOp callExpr;
// func calls should bind more tightly than relational expressions
callExpr = primaryExpr ('(' argList ')')?;
argList = Empty | atom (',' atom)*;
primaryExpr = atom | '(' expr ')';
atom = obj | literal;
// member access should bind most tightly
obj = id member*;
member = ('.' id);
id = Word;
literal = Number | QuotedString | bool;
bool = 'yes' | 'no';
relOp = '<' | '>' | '=';
To give you an idea of how I arrived at this grammar:
I realized that your language is a simple, composable expression language.
I remembered that XPath 1.0 is also a relatively simple expression language with an easily available, readable grammar.
I visited the XPath 1.0 spec online and quickly scanned the basic XPath language grammar. That served as a quick jumping-off point for designing your language grammar. If you ignore the path-expression part, XPath is a very good template for a basic expression language.
My grammar above successfully parses all of your example inputs (see below). Hope this helps.
[foo, ., bar, (, "hello", ), or, (, bar, or, baz, >, bat, )]foo/./bar/(/"hello"/)/or/(/bar/or/baz/>/bat/)^
[myitem, ., value, >, 4, and, myitem, ., value, <, 10]myitem/./value/>/4/and/myitem/./value/</10^
[myitem, ., value, =, yes, or, myotheritem, ., value, =, no]myitem/./value/=/yes/or/myotheritem/./value/=/no^
I've been using regexes to go through a pile of Verilog files and pull out certain statements. Currently, regexes are fine for this, however, I'm starting to get to the point where a real parser is going to be needed in order to deal with nested structures so I'm investigating ocamllex/ocamlyacc. I'd like to first duplicate what I've got in my regex implementation and then slowly add more to the grammar.
Right now I'm mainly interested in pulling out module declarations and instantiations. To keep this question a bit more brief, let's look at module declarations only.
In Verilog a module declaration looks like:
module modname ( ...other statements ) endmodule;
My current regex implementation simply checks that there is a module declared with a particular name (checking against a list of names that I'm interested in; I don't need to find all module declarations, just ones with certain names). So basically, I get each line of the Verilog file I want to parse and do a match like this (pseudo-OCaml with Pythonish and Rubyish elements):
foreach file in list_of_files:
    let found_mods = Hashtbl.create 17;
    open file
    foreach line in file:
        foreach modname in modlist:
            let mod_patt = Str.regexp ("module"^space^"+"^modname^"\\("^space^"+\\|(\\)") in
            try
                Str.search_forward (mod_patt) line 0;
                found_mods[file] = modname; (* map filename to modname *)
            with Not_found -> ()
That works great. The module declaration can occur anywhere in the Verilog file; I'm just wanting to find out if the file contains that particular declaration, I don't care about what else may be in that file.
My first attempt at converting this over to ocamllex/ocamlyacc:
verLexer.mll:
rule lex = parse
| [' ' '\n' '\t'] { lex lexbuf }
| ['0'-'9']+ as s { INT(int_of_string s) }
| '(' { LPAREN }
| ')' { RPAREN }
| "module" { MODULE }
| ['A'-'Z''a'-'z''0'-'9''_']+ as s { IDENT(s) }
| eof { EOF }
verParser.mly:
%{ type expr = Module of expr | Ident of string | Int of int %}
%token <int> INT
%token <string> IDENT
%token LPAREN RPAREN MODULE EOF
%start expr1
%type <expr> expr1
%%
expr:
| MODULE IDENT LPAREN { Module( Ident $2) };
expr1:
| expr EOF { $1 };
Then trying it out in the REPL:
# #use "verLexer.ml" ;;
# #use "verParser.ml" ;;
# expr1 lex (Lexing.from_string "module foo (" ) ;;
- : expr = Module (Ident "foo")
That's great, it works!
However, a real Verilog file will have more than a module declaration in it:
# expr1 lex (Lexing.from_string "//comment\nmodule foo ( \nstuff" ) ;;
Exception: Failure "lexing: empty token".
I don't really care about what appears before or after that module definition. Is there a way to just extract that part of the grammar, to determine that the Verilog file contains the 'module foo (' statement? Yes, I realize that regexes are working fine for this; however, as stated above, I am planning to grow this grammar slowly and add more elements to it, and regexes will start to break down.
EDIT: I added a match any char to the lex rule:
| _ { lex lexbuf }
Thinking that it would skip any characters that weren't matched so far, but that didn't seem to work:
# expr1 lex (Lexing.from_string "fof\n module foo (\n" ) ;;
Exception: Parsing.Parse_error.
First, a brief advertisement: instead of ocamlyacc you should consider using François Pottier's Menhir, which is like a "yacc, upgraded": better in all aspects (more readable grammars, more powerful constructs, easier to debug...) while still very similar. It can of course be used in combination with ocamllex.
Your expr1 rule only allows the input to begin and end with an expr. You should enlarge it to allow "stuff" before or after the expr. Something like:
junk:
| /* empty */ { () }
| junk LPAREN { () }
| junk RPAREN { () }
| junk INT { () }
| junk IDENT { () }
expr1:
| junk expr junk EOF { $2 }
Note that this grammar does not allow the MODULE token to appear in the junk section. Allowing it would be problematic, as it would make the grammar ambiguous (the structure you're looking for could be embedded either in expr or in junk). If a module token can occur outside the form you're looking for, you should consider changing the lexer to capture the whole "module ident (" structure of interest in a single token, so that it can be matched atomically by the grammar. In the long term, however, having finer-grained tokens is probably better.
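For illustration, here is a rough ocamllex sketch of that single-token idea (MODULE_DECL is a hypothetical token invented for this example, carrying the module name):

| "module" [' ' '\t' '\n']+ (['A'-'Z' 'a'-'z' '0'-'9' '_']+ as name)
    [' ' '\t' '\n']* '(' { MODULE_DECL name }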
As suggested by @gasche I tried Menhir and am already getting much better results. I changed verLexer.mll to:
{
open VerParser
}
rule lex = parse
| [' ' '\n' '\t'] { lex lexbuf }
| ['0'-'9']+ as s { INT(int_of_string s) }
| '(' { LPAREN }
| ')' { RPAREN }
| "module" { MODULE }
| ['A'-'Z''a'-'z''0'-'9''_']+ as s { IDENT(s) }
| _ as c { lex lexbuf }
| eof { EOF }
And changed verParser.mly to:
%{ type expr = Module of expr | Ident of string | Int of int
             | Lparen | Rparen | Junk %}
%token <int> INT
%token <string> IDENT
%token LPAREN RPAREN MODULE EOF
%start expr1
%type <expr> expr1
%%
expr:
| MODULE IDENT LPAREN { Module( Ident $2) };
junk:
| LPAREN { }
| RPAREN { }
| INT { }
| IDENT { } ;
expr1:
| junk* expr junk* EOF { $2 };
The key here is that Menhir allows applying '*' to a symbol, as in the line above where I've got 'junk*' in a rule, meaning "match junk zero or more times". ocamlyacc doesn't seem to allow that.
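(For comparison, plain ocamlyacc can express the same repetition with an explicit recursive rule; a rough sketch, not tested against the rest of this grammar:)

junk_list:
| /* empty */ { () }
| junk_list junk { () }
;
expr1:
| junk_list expr junk_list EOF { $2 };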
Now when I tried it in the REPL I get:
# #use "verParser.ml" ;;
# #use "verLexer.ml" ;;
# expr1 lex (Lexing.from_string "module foo ( " ) ;;
- : expr = Module (Ident "foo")
# expr1 lex (Lexing.from_string "some module foo ( " ) ;;
- : expr = Module (Ident "foo")
# expr1 lex (Lexing.from_string "some module foo (\nbar " ) ;;
- : expr = Module (Ident "foo")
# expr1 lex (Lexing.from_string "some module foo (\n//comment " ) ;;
- : expr = Module (Ident "foo")
# expr1 lex (Lexing.from_string "some module fot foo (\n//comment " ) ;;
Exception: Error.
# expr1 lex (Lexing.from_string "some module foo (\n//comment " ) ;;
Which seems to work exactly as I want it to.