I'm looking at reducers.
There is a nice example in the Tutor for counting words:
(0 | it + 1 | /\w+/ := S)
where S is some longer string with several words. The reducer returns the count of such words.
I was wondering how to capture the matched substring and use it in the accumulating expression, something like
("" | it + e | str e ... /\w+/ := S)
so that the result would be the concatenation of all matched substrings.
Any idea?
Yes, the capture syntax is with the <name:regex> notation:
("" | it + e | /<e:\w+>/ := S)
rascal>S ="Jabberwocky by Lewis Carroll";
str: "Jabberwocky by Lewis Carroll"
rascal>("" | "<it>,<e>" | /<e:\w+>/ := S)[1..]
str: "Jabberwocky,by,Lewis,Carroll"
or use the for-template syntax instead of a reducer expression:
rascal>x = "<for (/<e:\w+>/ := S) {><e>;
>>>>>>> '<}>";
str: "Jabberwocky;\nby;\nLewis;\nCarroll;\n"
rascal>import IO;
ok
rascal>println(x)
Jabberwocky;
by;
Lewis;
Carroll;
ok
rascal>
My task is to write a grammar for a custom query language in which users can write some basic queries.
My grammar so far:
grammar EAQL;
prog: cond;
cond: cond logical_operator cond | elexpr comparison_operator VALUE;
elexpr: ELSTEREOTYPE '.' eattribute;
conexpr: CSTEREOTYPE '.' cattribute;
eattribute: 'Name' | 'Path' | 'GUID' | conexpr;
cattribute: 'Name' | 'GUID' | elexpr;
VALUE: QUOTATION ( ~([QUOTATION]) | ~('\n'))+ QUOTATION;
ELSTEREOTYPE: 'EG_ApplicationComponent' | 'EG_ApplicationFunction';
CSTEREOTYPE: 'EG_Flow';
SPACE: ' ';
QUOTATION: '"';
EOL: '\n';
WS : (' ' | '\t')+ -> channel(HIDDEN);
AND: 'AND';
OR: 'OR';
logical_operator: AND | OR;
EQUALS: '=';
GREATER_THAN: '>';
SMALLER_THAN: '<';
comparison_operator: GREATER_THAN | SMALLER_THAN | EQUALS;
When I try to parse this string
EG_ApplicationComponent.Name= "name1" AND EG_ApplicationFunction.Name="name2"
ANTLR creates the following children in the tree:
'EG_ApplicationComponent'
'.'
'Name'
'='
'"name1" AND EG_ApplicationFunction.Name= "name2"'
I am an absolute beginner at writing parsers, and I do not understand why VALUE matches greedily until the end of the string when I specified that matching should end when QUOTATION is found. I expected it to match 'name1' as VALUE in the first branch of the tree and then create another branch for EG_ApplicationFunction.Name= "name2", parsed the same way as the first one.
This would be my expected result:
'EG_ApplicationComponent'
'.'
'Name'
'='
'"name1"'
AND
EG_ApplicationFunction
'.'
'Name'
'='
'"name2"'
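Inside a lexer rule, [QUOTATION] is a character set, not a reference to the QUOTATION rule, so ~[QUOTATION] matches any character other than Q, U, O, T, A, I and N. Because the second alternative, ~('\n'), accepts every character except a newline, the loop also consumes the quotes, and the greedy lexer runs on to the last " on the line. What you need is ~["].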
Your VALUE rule could look like this:
VALUE
: QUOTATION ~["\r\n]* QUOTATION
;
I tried to use this code to scramble the characters into different characters and return a new list with those new characters. However, I keep getting errors saying "a list but here has type char" on line 3 and "a list list but given a char list" on line 13. I'm not sure how to fix this. Thanks in advance for the help.
let _scram x =
    match x with
    | [] -> [] // line 3
    | 's' -> 'v'
    | 'a' -> 's'
    | 'e' -> 'o'
    | '_' -> '_'
let rec scramble L P =
    match L with
    | [] -> P
    | hd::t1 -> scramble t1 (P @ (_scram hd))
let L =
    let p = ['h'; 'e'; 'l'; 'l'; 'o'] //line 13
    scramble p []
That's because you are passing _scram hd as the second operand of the (@) operator, which concatenates two lists, so the compiler infers that the whole expression has to be a list.
A quick fix is to wrap it in a list: (P @ [_scram hd]). This way _scram hd is inferred to be an element (in this case a char).
Then you will discover your next error: the catch-all wildcard is in quotes, so it only matches the underscore character, and even if it weren't quoted, a wildcard does not bind a value that you can use later.
So you can change it to | c -> c.
Then your code will be like this:
let _scram x =
    match x with
    | 's' -> 'v'
    | 'a' -> 's'
    | 'e' -> 'o'
    | c -> c
let rec scramble L P =
    match L with
    | [] -> P
    | hd::t1 -> scramble t1 (P @ [_scram hd])
let L =
    let p = ['h'; 'e'; 'l'; 'l'; 'o']
    scramble p []
F# code is checked sequentially. The first error indicates a problem with the code up to that point, that is, with the definition of _scram. The line | [] -> [] implies that _scram takes lists to lists. The next line, | 's' -> 'v', implies that _scram takes chars to chars. The two are incompatible, and that explains the error.
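The same type reasoning can be sketched in OCaml, a close relative of F# (this is not the F# code from the question; the names scram and scramble only mirror it). Every branch of the match returns a char, so the function has type char -> char, and List.map applies it element by element, which also avoids appending to the end of the accumulator on every step:

```ocaml
(* Map one character to its scrambled counterpart; any other character is
   returned unchanged by the catch-all case. *)
let scram (c : char) : char =
  match c with
  | 's' -> 'v'
  | 'a' -> 's'
  | 'e' -> 'o'
  | other -> other

(* Apply scram to every element; length and order are preserved. *)
let scramble (chars : char list) : char list = List.map scram chars

let () =
  (* 'e' becomes 'o'; 'h' and 'l' have no rule and stay as they are. *)
  assert (scramble ['h'; 'e'; 'l'; 'l'; 'o'] = ['h'; 'o'; 'l'; 'l'; 'o'])
```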
I need to replace multiple consecutive characters with a single character:
RETURN LOWER(REPLACE("ranchod-das-chanchad-240190---Funshuk--Wangdu",'--', '-'))
Is there any regex to do this in Neo4j 2.2.2?
There's no function similar to REPLACE taking a regex as a parameter.
Since you're using Neo4j 2.2, you can't implement it as a procedure either.
The only way to do it is by splitting and joining (using a combination of reduce and substring):
RETURN substring(reduce(s = '', e IN filter(e IN split('ranchod-das-chanchad-240190---Funshuk--Wangdu', '-') WHERE e <> '') | s + '-' + e), 1);
It can be easier to read if you decompose it:
WITH split('ranchod-das-chanchad-240190---Funshuk--Wangdu', '-') AS elems
WITH filter(e IN elems WHERE e <> '') AS elems
RETURN substring(reduce(s = '', e IN elems | s + '-' + e), 1);
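For comparison outside the database, the same split, filter and join idea can be sketched in OCaml. The function name collapse_runs is hypothetical. As with the Cypher version, separators at the very start or end of the input are dropped as well:

```ocaml
(* Collapse runs of a separator character by splitting on it, dropping the
   empty pieces, and joining what remains with a single separator. *)
let collapse_runs (sep : char) (s : string) : string =
  String.split_on_char sep s
  |> List.filter (fun piece -> piece <> "")
  |> String.concat (String.make 1 sep)

let () =
  "ranchod-das-chanchad-240190---Funshuk--Wangdu"
  |> collapse_runs '-'
  |> String.lowercase_ascii
  |> print_endline
```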
sexp is like this: type sexp = Atom of string | List of sexp list, e.g., "((a b) ((c d) e) f)".
I have written a parser to parse a sexp string to the type:
let of_string s =
  let len = String.length s in
  let empty_buf () = Buffer.create 16 in
  let rec parse_atom buf i =
    if i >= len then failwith "cannot parse"
    else
      match s.[i] with
      | '(' -> failwith "cannot parse"
      | ')' -> Atom (Buffer.contents buf), i-1
      | ' ' -> Atom (Buffer.contents buf), i
      | c when i = len-1 -> (Buffer.add_char buf c; Atom (Buffer.contents buf), i)
      | c -> (Buffer.add_char buf c; parse_atom buf (i+1))
  and parse_list acc i =
    if i >= len || (i = len-1 && s.[i] <> ')') then failwith "cannot parse"
    else
      match s.[i] with
      | ')' -> List (List.rev acc), i
      | '(' ->
        let list, j = parse_list [] (i+1) in
        parse_list (list::acc) (j+1)
      | c ->
        let atom, j = parse_atom (empty_buf()) i in
        parse_list (atom::acc) (j+1)
  in
  if s.[0] <> '(' then
    let atom, j = parse_atom (empty_buf()) 0 in
    if j = len-1 then atom
    else failwith "cannot parse"
  else
    let list, j = parse_list [] 1 in
    if j = len-1 then list
    else failwith "cannot parse"
But I think it is too verbose and ugly.
Can someone help me with an elegant way to write such a parser?
Actually, I always have problems writing parser code, and an ugly one like this is all I manage.
Are there any tricks for this kind of parsing? How do I deal effectively with symbols such as ( and ) that imply recursive parsing?
You can use a lexer+parser discipline to separate the details of lexical syntax (skipping spaces, mostly) from the actual grammar structure. That may seem overkill for such a simple grammar, but it pays off as soon as the data you parse has the slightest chance of being wrong: you really want error locations (and you do not want to implement them yourself).
A technique that is easy and gives short parsers is to use stream parsers (using the Camlp4 extension for them described in the book Developing Applications with Objective Caml); you may even get a lexer for free by using the Genlex module.
If you really want to do it manually, as in your example above, here is my recommendation for a nice parser structure. Have mutually recursive parsers, one for each category of your syntax, with the following interface:
parsers take as input the index at which to start parsing
they return a pair of the parsed value and the first index not part of the value
nothing more
Your code does not respect this structure. For example, your parser for atoms will fail if it sees a (. That is not its role or responsibility: it should simply consider that this character is not part of the atom and return the atom parsed so far, indicating that this position is no longer in the atom.
Here is a code example in this style for your grammar. I have split the parsers with accumulators into triples (start_foo, parse_foo and finish_foo) to factor out multiple start or return points, but that is only an implementation detail.
I have used a new feature of OCaml 4.02 just for fun, match with exception, instead of explicitly testing for the end of the string. It is of course trivial to revert to something less fancy.
Finally, the current parser does not fail if the valid expression ends before the end of the input, it only returns the end of the input on the side. That's helpful for testing but you would do it differently in "production", whatever that means.
let of_string str =
  let rec parse i =
    match str.[i] with
    | exception _ -> failwith "unfinished input"
    | ')' -> failwith "extraneous ')'"
    | ' ' -> parse (i+1)
    | '(' -> start_list (i+1)
    | _ -> start_atom i
  and start_list i = parse_list [] i
  and parse_list acc i =
    match str.[i] with
    | exception _ -> failwith "unfinished list"
    | ')' -> finish_list acc (i+1)
    | ' ' -> parse_list acc (i+1)
    | _ ->
      let elem, j = parse i in
      parse_list (elem :: acc) j
  and finish_list acc i =
    List (List.rev acc), i
  and start_atom i = parse_atom (Buffer.create 3) i
  and parse_atom acc i =
    match str.[i] with
    | exception _ -> finish_atom acc i
    | ')' | ' ' -> finish_atom acc i
    | _ -> parse_atom (Buffer.add_char acc str.[i]; acc) (i + 1)
  and finish_atom acc i =
    Atom (Buffer.contents acc), i
  in
  let result, rest = parse 0 in
  result, String.sub str rest (String.length str - rest)
Note that it is an error to reach the end of input when parsing a valid expression (you must have read at least one atom or list) or when parsing a list (you must have encountered the closing parenthesis), yet it is valid at the end of an atom.
This parser does not return location information. All real-world parsers should do so, and this is enough of a reason to use a lexer/parser approach (or your preferred monadic parser library) instead of doing it by hand. Returning location information here is not terribly difficult, though, just duplicate the i parameter into the index of the currently parsed character, on one hand, and the first index used for the current AST node, on the other; whenever you produce a result, the location is the pair (first index, last valid index).
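As a rough illustration of that last remark, here is one way the location-tracking variant might look. It is only a sketch under assumptions: the types loc and located, the constructors LAtom and LList, and the function name of_string_loc are invented for this example, and atoms are read with String.sub instead of a Buffer. Each parser carries the first index of the node being built alongside the current index:

```ocaml
type loc = int * int (* (first index, last valid index) *)

type located =
  | LAtom of string * loc
  | LList of located list * loc

let of_string_loc (str : string) : located =
  (* Every parser takes a start index and returns the parsed value together
     with the first index that is not part of it. *)
  let rec parse i =
    match str.[i] with
    | exception Invalid_argument _ -> failwith "unfinished input"
    | ')' -> failwith (Printf.sprintf "extraneous ')' at index %d" i)
    | ' ' -> parse (i + 1)
    | '(' -> parse_list i [] (i + 1)
    | _ -> parse_atom i i
  (* [first] is the index of the opening parenthesis of the current list. *)
  and parse_list first acc i =
    match str.[i] with
    | exception Invalid_argument _ ->
      failwith (Printf.sprintf "unfinished list opened at index %d" first)
    | ')' -> (LList (List.rev acc, (first, i)), i + 1)
    | ' ' -> parse_list first acc (i + 1)
    | _ ->
      let elem, j = parse i in
      parse_list first (elem :: acc) j
  (* [first] is the index of the first character of the current atom. *)
  and parse_atom first i =
    match str.[i] with
    | exception Invalid_argument _ -> finish_atom first i
    | ')' | ' ' -> finish_atom first i
    | _ -> parse_atom first (i + 1)
  and finish_atom first i =
    (LAtom (String.sub str first (i - first), (first, i - 1)), i)
  in
  fst (parse 0)
```

For "(a (bc d))" this yields a list spanning (0, 9) whose elements are the atom a at (1, 1) and a nested list at (3, 8) holding bc at (4, 5) and d at (7, 7). Unlike the parser above, this sketch discards the unparsed remainder of the input.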
I have a task to write a (toy) parser for a (toy) grammar using OCaml, and I'm not sure how to start (or how to proceed with) this problem.
Here's a sample Awk grammar:
type ('nonterm, 'term) symbol = N of 'nonterm | T of 'term;;
type awksub_nonterminals = Expr | Term | Lvalue | Incrop | Binop | Num;;
let awksub_grammar =
  (Expr,
   function
   | Expr ->
     [[N Term; N Binop; N Expr];
      [N Term]]
   | Term ->
     [[N Num];
      [N Lvalue];
      [N Incrop; N Lvalue];
      [N Lvalue; N Incrop];
      [T"("; N Expr; T")"]]
   | Lvalue ->
     [[T"$"; N Expr]]
   | Incrop ->
     [[T"++"];
      [T"--"]]
   | Binop ->
     [[T"+"];
      [T"-"]]
   | Num ->
     [[T"0"]; [T"1"]; [T"2"]; [T"3"]; [T"4"];
      [T"5"]; [T"6"]; [T"7"]; [T"8"]; [T"9"]]);;
And here's some fragments to parse:
let frag1 = ["4"; "+"; "3"];;
let frag2 = ["9"; "+"; "$"; "1"; "+"];;
What I'm looking for is a rulelist that is the result of the parsing a fragment, such as this one for frag1 ["4"; "+"; "3"]:
[(Expr, [N Term; N Binop; N Expr]);
(Term, [N Num]);
(Num, [T "3"]);
(Binop, [T "+"]);
(Expr, [N Term]);
(Term, [N Num]);
(Num, [T "4"])]
The restriction is to not use any OCaml libraries other than List... :/
Here is a rough sketch: straightforwardly descend into the grammar and try each branch in order. Possible optimization: tail recursion for a single non-terminal in a branch.
exception Backtrack
let parse l =
  let rules = snd awksub_grammar in
  let rec descend gram l =
    let rec loop = function
      | [] -> raise Backtrack
      | x::xs -> try attempt x l with Backtrack -> loop xs
    in
    loop (rules gram)
  and attempt branch (path,tokens) =
    match branch, tokens with
    | T x :: branch' , h::tokens' when h = x ->
      attempt branch' ((T x :: path),tokens')
    | N n :: branch' , _ ->
      let (path',tokens) = descend n ((N n :: path),tokens) in
      attempt branch' (path', tokens)
    | [], _ -> path,tokens
    | _, _ -> raise Backtrack
  in
  let (path,tail) = descend (fst awksub_grammar) ([],l) in
  tail, List.rev path
OK, so the first thing you should do is write a lexical analyser. That's the
function that takes the ‘raw’ input, like ["3"; "-"; "("; "4"; "+"; "2"; ")"],
and splits it into a list of tokens (that is, representations of terminal symbols).
You can define a token to be
type token =
  | TokInt of int (* an integer *)
  | TokBinOp of binop (* a binary operator *)
  | TokOParen (* an opening parenthesis *)
  | TokCParen (* a closing parenthesis *)
and binop = Plus | Minus
The type of the lexer function would be string list -> token list and the output of
lexer ["3"; "-"; "("; "4"; "+"; "2"; ")"]
would be something like
[ TokInt 3; TokBinOp Minus; TokOParen; TokInt 4;
  TokBinOp Plus; TokInt 2; TokCParen ]
This will make the job of writing the parser easier, because you won't have to
worry about recognising what is an integer, what is an operator, etc.
This is a first, not too difficult step because the tokens are already separated.
All the lexer has to do is identify them.
When this is done, you can write a more realistic lexical analyser, of type string -> token list, that takes an actual raw input, such as "3-(4+2)", and turns it into a token list.
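Both steps might look roughly like the sketch below. The function names lex_piece and lex_string are made up for this example, and the second step relies on string indexing, String.sub and Printf, which may not be acceptable if only the List module is allowed:

```ocaml
type binop = Plus | Minus

type token =
  | TokInt of int
  | TokBinOp of binop
  | TokOParen
  | TokCParen

(* First step: the input is already split into pieces; classify each one. *)
let lex_piece (s : string) : token =
  match s with
  | "+" -> TokBinOp Plus
  | "-" -> TokBinOp Minus
  | "(" -> TokOParen
  | ")" -> TokCParen
  | _ ->
    (match int_of_string_opt s with
     | Some n -> TokInt n
     | None -> failwith ("unknown token: " ^ s))

let lexer (pieces : string list) : token list = List.map lex_piece pieces

(* Second step: scan a raw string character by character. *)
let lex_string (s : string) : token list =
  let len = String.length s in
  let is_digit c = '0' <= c && c <= '9' in
  let rec go i acc =
    if i >= len then List.rev acc
    else
      match s.[i] with
      | ' ' | '\t' -> go (i + 1) acc
      | '+' -> go (i + 1) (TokBinOp Plus :: acc)
      | '-' -> go (i + 1) (TokBinOp Minus :: acc)
      | '(' -> go (i + 1) (TokOParen :: acc)
      | ')' -> go (i + 1) (TokCParen :: acc)
      | c when is_digit c ->
        (* Consume the whole run of digits as a single integer token. *)
        let rec stop j = if j < len && is_digit s.[j] then stop (j + 1) else j in
        let j = stop i in
        go j (TokInt (int_of_string (String.sub s i (j - i))) :: acc)
      | c -> failwith (Printf.sprintf "unexpected character %c at index %d" c i)
  in
  go 0 []
```

With these definitions, lexer ["3"; "-"; "("; "4"; "+"; "2"; ")"] and lex_string "3-(4+2)" produce the same token list.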
I'm not sure if you specifically require the derivation tree, or if this is just a first step in parsing. I'm assuming the latter.
You could start by defining the structure of the resulting abstract syntax tree by defining types. It could be something like this:
type expr =
  | Operation of term * binop * term
  | Term of term
and term =
  | Num of num
  | Lvalue of expr
  | Incrop of incrop * expr
and incrop = Incr | Decr
and binop = Plus | Minus
and num = int
Then I'd implement a recursive descent parser. Of course it would be much nicer if you could use streams combined with the preprocessor camlp4of...
By the way, there's a small example about arithmetic expressions in the OCaml documentation here.
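A recursive descent parser over a token list could be sketched as follows. This covers only a simplified subset (numbers, parentheses and binary operators, with no Lvalue or Incrop), and the AST, the Parse_error exception and the function names are assumptions made for the example. It follows the grammar's right-recursive rule Expr -> Term Binop Expr, so 3 - 4 + 2 groups as 3 - (4 + 2):

```ocaml
type binop = Plus | Minus

type token = TokInt of int | TokBinOp of binop | TokOParen | TokCParen

type expr =
  | Term of term
  | Operation of term * binop * expr
and term =
  | Num of int
  | Paren of expr

exception Parse_error of string

(* Each function consumes a prefix of the token list and returns the parsed
   node together with the tokens that remain. *)
let rec parse_expr (tokens : token list) : expr * token list =
  let left, rest = parse_term tokens in
  match rest with
  | TokBinOp op :: rest' ->
    let right, rest'' = parse_expr rest' in
    (Operation (left, op, right), rest'')
  | _ -> (Term left, rest)

and parse_term (tokens : token list) : term * token list =
  match tokens with
  | TokInt n :: rest -> (Num n, rest)
  | TokOParen :: rest ->
    (match parse_expr rest with
     | e, TokCParen :: rest' -> (Paren e, rest')
     | _ -> raise (Parse_error "expected ')'"))
  | _ -> raise (Parse_error "expected a number or '('")

(* Entry point: the whole input must be consumed. *)
let parse (tokens : token list) : expr =
  match parse_expr tokens with
  | e, [] -> e
  | _ -> raise (Parse_error "unexpected trailing tokens")
```

For example, parse [TokInt 4; TokBinOp Plus; TokInt 3] evaluates to Operation (Num 4, Plus, Term (Num 3)).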