Writing Custom Expression parser or using ANTLR library?

I have expressions like the following:
eg 1: (f1 AND f2)
eg 2: ((f1 OR f2) AND f3)
eg 3: ((f1 OR f2) AND (f3 OR (f4 AND f5)))
Each f(n) is used to generate a fragment of SQL, and these fragments are joined using the OR / AND operators given in the expression.
Now I want to:
1) Parse this expression
2) Validate it
3) Generate an expression tree for the expression and use this tree to generate the final SQL.
I found this series of articles on writing tokenizers, parsers, etc., for example:
http://cogitolearning.co.uk/2013/05/writing-a-parser-in-java-the-expression-tree/
I also came across the ANTLR library and am wondering whether I can use it for my case.
Any tips?

I'm guessing you might only be interested in Java (it would be good to say so in future), but if you have a choice of languages, then I would recommend using Python and parsy for a task like this. It is much more lightweight than things like ANTLR.
Here is some example code I knocked together that parses your samples into appropriate data structures:
import attr
from parsy import string, regex, generate
@attr.s
class Variable():
    name = attr.ib()
@attr.s
class Compound():
    left_value = attr.ib()
    right_value = attr.ib()
    operator = attr.ib()
@attr.s
class Expression():
    value = attr.ib()
    # You could put an `evaluate` method here,
    # or `generate_sql` etc.
# Every token may be surrounded by optional whitespace.
whitespace = regex(r'\s*')
lexeme = lambda p: whitespace >> p << whitespace
AND = lexeme(string('AND'))
OR = lexeme(string('OR'))
OPERATOR = AND | OR
LPAREN = lexeme(string('('))
RPAREN = lexeme(string(')'))
# A variable is any word that is not a keyword or a parenthesis.
variable = lexeme((AND | OR | LPAREN | RPAREN).should_fail("not AND OR ( )") >> regex(r"\w+")).map(Variable)
@generate
def compound():
    yield LPAREN
    left = yield variable | compound
    op = yield OPERATOR
    right = yield variable | compound
    yield RPAREN
    return Compound(left_value=left,
                    right_value=right,
                    operator=op)
expression = (variable | compound).map(Expression)
I also use attrs for simple data structures.
The result of parsing is a hierarchy of expressions:
>>> expression.parse("((f1 OR f2) AND (f3 OR (f4 AND f5)))")
Expression(value=Compound(left_value=Compound(left_value=Variable(name='f1'), right_value=Variable(name='f2'), operator='OR'), right_value=Compound(left_value=Variable(name='f3'), right_value=Compound(left_value=Variable(name='f4'), right_value=Variable(name='f5'), operator='AND'), operator='OR'), operator='AND'))
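If you would rather avoid third-party packages, the same three steps (parse, validate, generate SQL) can be sketched with a small hand-written recursive-descent parser using only the standard library. This is a rough sketch, not a definitive implementation: the names tokenize, parse, parse_node and to_sql, and the fragments mapping from f(n) names to SQL text, are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class Variable:
    name: str


@dataclass
class Compound:
    left: "Node"
    operator: str  # "AND" or "OR"
    right: "Node"


Node = Union[Variable, Compound]

# A token is "(", ")" or a run of word characters, with optional leading spaces.
TOKEN_RE = re.compile(r"\s*(\(|\)|\w+)")


def tokenize(text: str) -> List[str]:
    """Split the input into tokens; raise ValueError on anything unexpected."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character near position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_node(tokens: List[str]) -> Tuple[Node, List[str]]:
    """Parse one variable or one '(left OP right)' group; return (node, rest)."""
    if not tokens:
        raise ValueError("unexpected end of input")
    head, rest = tokens[0], tokens[1:]
    if head == "(":
        left, rest = parse_node(rest)
        if not rest or rest[0] not in ("AND", "OR"):
            raise ValueError("expected AND or OR")
        operator = rest[0]
        right, rest = parse_node(rest[1:])
        if not rest or rest[0] != ")":
            raise ValueError("expected ')'")
        return Compound(left, operator, right), rest[1:]
    if head in (")", "AND", "OR"):
        raise ValueError(f"unexpected token {head!r}")
    return Variable(head), rest


def parse(text: str) -> Node:
    """Parse a whole expression; validation failures surface as ValueError."""
    node, rest = parse_node(tokenize(text))
    if rest:
        raise ValueError(f"unexpected trailing token {rest[0]!r}")
    return node


def to_sql(node: Node, fragments: Dict[str, str]) -> str:
    """Walk the tree and join the SQL fragments with the parsed operators."""
    if isinstance(node, Variable):
        if node.name not in fragments:
            raise ValueError(f"no SQL fragment for {node.name!r}")
        return fragments[node.name]
    left = to_sql(node.left, fragments)
    right = to_sql(node.right, fragments)
    return f"({left} {node.operator} {right})"


# Hypothetical fragments, purely for demonstration.
fragments = {"f1": "age > 30", "f2": "country = 'NL'", "f3": "active = 1"}
print(to_sql(parse("((f1 OR f2) AND f3)"), fragments))
# prints: ((age > 30 OR country = 'NL') AND active = 1)
```

Validation falls out of parsing here: any input that does not fit the grammar raises ValueError, so a tree only exists for well-formed expressions.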

Related

Parse String to Datatype in Haskell

I'm taking a Haskell course at school, and I have to define a logical proposition datatype in Haskell. Everything so far works fine (definition and functions), and I've declared it as an instance of Ord, Eq and Show. The problem comes when I'm required to write a program which interacts with the user: I have to parse the user's input into my datatype:
type Var = String
data FProp = V Var
           | No FProp
           | Y FProp FProp
           | O FProp FProp
           | Si FProp FProp
           | Sii FProp FProp
where the formula: ¬q ^ p would be: (Y (No (V "q")) (V "p"))
I've been researching, and found that I can declare my datatype as an instance of Read.
Is this advisable? If it is, can I get some help in order to define the parsing method?

Not a complete answer, since this is a homework problem, but here are some hints.
The other answer suggested getLine followed by splitting at words. It sounds like you instead want something more like a conventional tokenizer, which would let you write things like:
(Y
(No (V q))
(V p))
Here’s one implementation that turns a string into tokens that are either a string of alphanumeric characters or a single, non-alphanumeric printable character. You would need to extend it to support quoted strings:
import Data.Char
type Token = String
tokenize :: String -> [Token]
{- Here, a token is either a string of alphanumeric characters, or else one
 - non-spacing printable character, such as "(" or ")".
 -}
tokenize [] = []
tokenize (x:xs) | isSpace x = tokenize xs
                | not (isPrint x) = error $
                    "Invalid character " ++ show x ++ " in input."
                | not (isAlphaNum x) = [x]:(tokenize xs)
                | otherwise = let (token, rest) = span isAlphaNum (x:xs)
                              in token:(tokenize rest)
It turns the example into ["(","Y","(","No","(","V","q",")",")","(","V","p",")",")"]. Note that you have access to the entire repertoire of Unicode.
The main function that evaluates this interactively might look like:
main = interact ( unlines . map show . map evaluate . parse . tokenize )
Where parse turns a list of tokens into a list of ASTs and evaluate turns an AST into a printable expression.
As for implementing the parser, your language appears to have similar syntax to LISP, which is one of the simplest languages to parse; you don’t even need precedence rules. A recursive-descent parser could do it, and is probably the easiest to implement by hand. You can pattern-match on parse ("(":xs) =, but pattern-matching syntax can also implement lookahead very easily, for example parse ("(":x1:xs) = to look ahead one token.
If you’re calling the parser recursively, you would define a helper function that consumes only a single expression, and that has a type signature like :: [Token] -> (AST, [Token]). This lets you parse the inner expression, check that the next token is ")", and proceed with the parse. However, externally, you’ll want to consume all the tokens and return an AST or a list of them.
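Since this is homework, here is the shape of that helper in Python rather than Haskell, as a rough sketch only. The function names, the ARITY table and the tuple-based AST are assumptions made for illustration; the point is that the helper returns both the parsed node and the tokens it did not consume, which is what the :: [Token] -> (AST, [Token]) signature expresses.

```python
from typing import List, Tuple

# Number of sub-expressions each constructor takes (V is handled separately).
ARITY = {"No": 1, "Y": 2, "O": 2, "Si": 2, "Sii": 2}


def tokenize(text: str) -> List[str]:
    """Alphanumeric runs become one token; other non-space characters stand alone."""
    tokens, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch.isalnum():
            j = i
            while j < len(text) and text[j].isalnum():
                j += 1
            tokens.append(text[i:j])
            i = j
        else:
            tokens.append(ch)
            i += 1
    return tokens


def parse_one(tokens: List[str]) -> Tuple[tuple, List[str]]:
    """Consume exactly one parenthesised expression; return (ast, remaining tokens)."""
    if len(tokens) < 2 or tokens[0] != "(":
        raise ValueError("expected '(' followed by a constructor")
    head, rest = tokens[1], tokens[2:]
    if head == "V":
        if not rest or not rest[0].isalnum():
            raise ValueError("expected a variable name after V")
        node, rest = ("V", rest[0]), rest[1:]
    elif head in ARITY:
        args = []
        for _ in range(ARITY[head]):
            # Each recursive call hands back the tokens it left untouched.
            arg, rest = parse_one(rest)
            args.append(arg)
        node = (head, *args)
    else:
        raise ValueError(f"unknown constructor {head!r}")
    # After the inner expressions, the next token must close this one.
    if not rest or rest[0] != ")":
        raise ValueError("expected ')'")
    return node, rest[1:]


def parse_all(text: str) -> tuple:
    """External entry point: the whole input must be a single expression."""
    ast, rest = parse_one(tokenize(text))
    if rest:
        raise ValueError(f"unexpected trailing token {rest[0]!r}")
    return ast


print(parse_all("(Y (No (V q)) (V p))"))
# prints: ('Y', ('No', ('V', 'q')), ('V', 'p'))
```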
The stylish way to write a parser is with monadic parser combinators. (And maybe someone will post an example of one.) The industrial-strength solution would be a library like Parsec, but that’s probably overkill here. Still, parsing is (mostly!) a solved problem, and if you just want to get the assignment done on time, using a library off the shelf is a good idea.

The read part of a REPL interpreter typically looks like this:
repl :: ForthState -> IO () -- the read-eval-print loop
repl state
  = do putStr "> " -- puts a > character to indicate it's waiting for input
       input <- getLine -- this is what you're looking for, to read a line.
       if input == "quit" -- allows user to quit the interpreter
         then do putStrLn "Bye!"
                 return ()
         else let (is, cs, d, output) = eval (words input) state -- your grammar definition is somewhere down the chain when eval is called on input
              in do mapM_ putStrLn output
                    repl (is, cs, d, [])
main = do putStrLn "Welcome to your very own interpreter!"
          repl initialForthState -- runs the parser, starting with read
Your eval function will have various loops, stack manipulations, conditionals, etc. to actually figure out what the user entered. Hope this helps you with at least the reading-input part.

What is the | symbol for in F#?

I'm pretty new to functional programming and I've started looking at the documentation for match statements. I came across the example below here gitpages and copied it into my question:
let rec fib n =
    match n with
    | 0 -> 0
    | 1 -> 1
    | _ -> fib (n - 1) + fib (n - 2)
I understand that let is for static binding, in this case of a recursive function called fib which takes a parameter n. It tries to match n against 3 cases: 0, 1, or anything else.
What I don't understand is what the | symbol is called in this context, or why it is used. Anything I search for pertaining to the F# pipe takes me to |>, which is the piping operator in F#.
What is this | used for in this case? Is it required or optional? And when should and shouldn't I be using |?

The | symbol is used for several things in F#, but in this case, it serves as a separator of cases of the match construct.
The match construct lets you pattern match on some input and handle different values in different ways - in your example, you have one case for 0, one for 1 and one for all other values.
Generally, the syntax of match looks like this:
match <input> with <case_1> | ... | <case_n>
Where each <case> has the following structure:
<case> = <pattern> -> <expression>
Here, the | symbol simply separates multiple cases of the pattern matching expression. Each case then has a pattern and an expression that is evaluated when the input matches the pattern.

To expand on Tomas's excellent answer, here are some more of the various uses of | in F#:
Match expressions
In match expressions, | separates the various patterns, as Tomas has pointed out. While you can write the entire match expression on a single line, it's conventional to write each pattern on a separate line, lining up the | characters, so that they form a visual indicator of the scope of the match statement:
match n with
| 0 -> "zero"
| 1 -> "one"
| 2 -> "two"
| 3 -> "three"
| _ -> "something else"
Discriminated Unions
Discriminated Unions (or DUs, since that's a lot shorter to type) are very similar to match expressions in style: defining them means listing the possibilities, and | is used to separate the possibilities. As with match expressions, you can (if you want to) write DUs on a single line:
type Option<'T> = None | Some of 'T
but unless your DU has just two possibilities, it's usually better to write it on multiple lines:
type ContactInfo =
    | Email of string
    | PhoneNumber of areaCode : string * number : string
    | Facebook of string
    | Twitter of string
Here, too, the | ends up forming a vertical line that draws the eye to the possibilities of the DU, and makes it very clear where the DU definition ends.
Active patterns
Active patterns also use | to separate the possibilities, but they also are wrapped inside an opening-and-closing pair of | characters:
let (Even|Odd) n = if n % 2 = 0 then Even else Odd // <-- Wrong!
let (|Even|Odd|) n = if n % 2 = 0 then Even else Odd // <-- Right!
Active patterns are usually written in the way I just showed, with the | coming immediately inside the parentheses, which is why some people talk about "banana clips" (because the (| and |) pairs look like bananas if you use your imagination). But in fact, it's not necessary to write the (| and |) characters together: it's perfectly valid to have spaces separating the parentheses from the | characters:
let (|Even|Odd|) n = if n % 2 = 0 then Even else Odd // <-- Right!
let ( |Even|Odd| ) n = if n % 2 = 0 then Even else Odd // <-- ALSO right!
Unrelated things
The pipe operator |> and the Boolean-OR operator || are not at all the same thing as uses of the | symbol. F# allows operators to be any combination of symbols, and they can have very different meanings from an operator that looks almost the same. For example, >= is a standard operator that means "greater than or equal to". And many F# programs will define a custom operator >>=. But although >>= is not defined in the F# core library, it has a standard meaning, and that standard meaning is NOT "a lot greater than". Rather, >>= is the standard way to write an operator for the bind function. I won't get into what bind does right now, as that's a concept that could take a whole answer all on its own to go through. But if you're curious about how bind works, you can read Scott Wlaschin's series on computation expressions, which explains it all very well.

Layout in Rascal

When I import the Lisra recipe,
import demo::lang::Lisra::Syntax;
This creates the syntax:
layout Whitespace = [\t-\n\r\ ]*;
lexical IntegerLiteral = [0-9]+ !>> [0-9];
lexical AtomExp = (![0-9()\t-\n\r\ ])+ !>> ![0-9()\t-\n\r\ ];
start syntax LispExp
  = IntegerLiteral
  | AtomExp
  | "(" LispExp* ")"
  ;
Through the start syntax-definition, layout should be ignored around the input when it is parsed, as is stated in the documentation: http://tutor.rascal-mpl.org/Rascal/Declarations/SyntaxDefinition/SyntaxDefinition.html
However, when I type:
rascal>(LispExp)` (something)`
This gives me a concrete syntax fragment error (or a ParseError when using the parse-function), in contrast to:
rascal>(LispExp)`(something)`
Which successfully parses. I tried this both with one of the latest versions of Rascal as well as the Eclipse plugin version. Am I doing something wrong here?
Thank you.
Ps. Lisra's parse-function:
public Lval parse(str txt) = build(parse(#LispExp, txt));
Also fails on the example:
rascal>parse(" (something)")
|project://rascal/src/org/rascalmpl/library/ParseTree.rsc|(10329,833,<253,0>,<279,60>): ParseError(|unknown:///|(0,1,<1,0>,<1,1>))
at *** somewhere ***(|project://rascal/src/org/rascalmpl/library/ParseTree.rsc|(10329,833,<253,0>,<279,60>))
at parse(|project://rascal/src/org/rascalmpl/library/demo/lang/Lisra/Parse.rsc|(163,3,<7,44>,<7,47>))
at $shell$(|stdin:///|(0,13,<1,0>,<1,13>))

When you define a start non-terminal, Rascal defines two non-terminals in one go:
rascal>start syntax A = "a";
ok
One non-terminal is A, the other is start[A]. Given a layout non-terminal in scope, say L, the latter is automatically defined by (something like) this rule:
syntax start[A] = L before A top L after;
If you call a parser or wish to parse a concrete fragment, you can use either non-terminal:
parse(#start[A], " a ") // parse using the start non-terminal and extra layout
parse(#A, "a") // parse only an A
(start[A]) ` a ` // concrete fragment for the start-non-terminal
(A) `a` // concrete fragment for only an A
[start[A]] " a "
[A] "a"

Haskell: Traverse through a String/Text File

I am trying to read a script file, then process it and output it to an HTML file. In my script file, whenever there is a #title(this is a title), I will add the tags [header] this is a title [/header] to my HTML output. So my approach is to first read the script file, write the content to a string, process the string, then write the string to the HTML file.
In order to recognize the #title, I will need to read the string character by character. When I read '#', I will need to check whether the next characters are t i t l e.
QUESTION: How do I traverse through a string (which is a list of char) in Haskell?

You could use a simple recursion trick, for example:
findTag [] = -- end of list code.
findTag ('#':xs)
  | take 5 xs == "title" = -- your code for #title
  | otherwise = findTag xs
findTag (_:xs) = findTag xs
So basically you pattern match to check whether the next char (the head of the list) is '#', and then you check whether the next 5 characters form "title". If so, you can continue with your parsing code. If the next character isn't '#', you just continue recursing. Once the list is empty, you reach the first pattern match.
Someone else might have a better solution.
I hope this answers your question.
edit:
For a bit more flexibility, if you want to find a specific tag you could do this:
findTag [] _ = -- end of list code.
findTag ('#':xs) tagName
  | take (length tagName) xs == tagName = -- your code for #title
  | otherwise = findTag xs tagName
findTag (_:xs) tagName = findTag xs tagName
This way if you do
findTag text "title"
You'll specifically look for the title, and you can always change the tag name to whatever you want.
Another edit:
findTag [] _ = -- end of list code.
findTag ('#':xs) tagName
  | take tLength xs == tagName = getTagContents tLength xs
  | otherwise = findTag xs tagName
  where tLength = length tagName
findTag (_:xs) tagName = findTag xs tagName
getTagContents :: Int -> String -> String
getTagContents len = takeWhile (/=')') . drop (len + 1)
To be honest, it's getting a bit messy, but here's what's happening:
You first drop the length of the tagName, then one more character for the opening bracket, and then you finish off by using takeWhile to take the characters up to the closing bracket.

Evidently your problem falls into parsing category. As wisely stated by Daniel Wagner, for maintainability reasons you're much better off approaching it generally with a parser.
Another thing is if you want to work with textual data efficiently, you're better off using Text instead of String.
Here's how you could solve your problem using the Attoparsec parser library:
-- For autocasting of hardcoded strings to `Text` type
{-# LANGUAGE OverloadedStrings #-}
-- Import a way more convenient prelude, excluding symbols conflicting
-- with the parser library. See
-- http://hackage.haskell.org/package/classy-prelude
import ClassyPrelude hiding (takeWhile, try)
-- Exclude the standard Prelude
import Prelude ()
import Data.Attoparsec.Text
-- A parser and an inplace converter for title
title = do
  string "#title("
  r <- takeWhile $ notInClass ")"
  string ")"
  return $ "[header]" ++ r ++ "[/header]"
-- A parser which parses the whole document to parts which are either
-- single-character `Text`s or modified titles
parts =
  (try endOfInput >> return []) ++
  ((:) <$> (try title ++ (singleton <$> anyChar)) <*> parts)
-- The topmost parser which concats all parts into a single text
top = concat <$> parts
-- A sample input
input = "aldsfj#title(this is a title)sdlfkj#title(this is a title2)"
-- Run the parser and output result
main = print $ parseOnly top input
This outputs
Right "aldsfj[header]this is a title[/header]sdlfkj[header]this is a title2[/header]"
P.S. ClassyPrelude reimplements ++ as an alias for Monoid's mappend, so you can replace it with mappend, <> or Alternative's <|> if you want.
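For comparison, the same transformation can be sketched in Python with a regular expression instead of a parser library. This swaps in a different technique (regex substitution), and the names TITLE_RE and render_titles are made up for illustration. Like the takeWhile-based parser above, it accepts an empty title and does not handle a ")" inside the title text.

```python
import re

# "#title(" followed by any run of characters other than ")" and a closing ")".
TITLE_RE = re.compile(r"#title\(([^)]*)\)")


def render_titles(text: str) -> str:
    """Replace every #title(...) occurrence with [header]...[/header]."""
    # A function is used as the replacement so that backslashes in the
    # captured title are copied literally rather than treated as escapes.
    return TITLE_RE.sub(lambda m: "[header]" + m.group(1) + "[/header]", text)


sample = "aldsfj#title(this is a title)sdlfkj#title(this is a title2)"
print(render_titles(sample))
# prints: aldsfj[header]this is a title[/header]sdlfkj[header]this is a title2[/header]
```

A parser scales better once the markup grows beyond one tag, which is the maintainability argument made above; the regex is only reasonable while the format stays this simple.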

For pattern search-and-replace, you can use streamEdit.
import Control.Monad (void)
import Data.Void (Void)
import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char
-- Matches "#title(" and returns everything up to the closing ")"
title :: Parsec Void String String
title = do
  void $ string "#title("
  someTill anySingle $ string ")"
editor t = "[header]" ++ t ++ "[/header]"
streamEdit title editor " #title(this is a title) "
" [header]this is a title[/header] "

F# how to write an empty statement

How can I write a no-op statement in F#?
Specifically, how can I improve the second clause of the following match statement:
match list with
| [] -> printfn "Empty!"
| _ -> ignore 0

Use unit for empty side effect:
match list with
| [] -> printfn "Empty!"
| _ -> ()

The answer from Stringer is, of course, correct. I thought it may be useful to clarify how this works, because "()" isn't really an empty statement or empty side effect...
In F#, every valid piece of code is an expression. Constructs like let and match consist of some keywords, patterns and several sub-expressions. The F# grammar for let and match looks like this:
<expr> ::= let <pattern> = <expr>
           <expr>
       ::= match <expr> with
           | <pat> -> <expr>
This means that the body of let or the body of a match clause must be some expression. It can be some function call such as ignore 0 or it can be some value - in your case it must be some expression of type unit, because printfn ".." is also of type unit.
The unit type is a type that has only one value, which is written as () (and it also means empty tuple with no elements). This is, indeed, somewhat similar to void in C# with the exception that void doesn't have any values.
BTW: The following code may look like a sequence of statements, but it is also an expression:
printf "Hello "
printf "world"
The F# compiler implicitly adds ; between the two lines and ; is a sequencing operator, which has the following structure: <expr>; <expr>. It requires that the first expression returns unit and returns the result of the second expression.
This is a bit surprising when you're coming from a C# background, but it makes the language surprisingly elegant and concise. It doesn't limit you in any way - you can, for example, write:
if (a < 10 && (printfn "demo"; true)) then // ...
(This example isn't really useful - just a demonstration of the flexibility)
