I've built a lexer and parser with Alex and Happy which produce an abstract syntax tree of the language I'm parsing (Solidity). My problem now is how to properly traverse the tree and match certain aspects of the language. The aim is to create a rule engine that performs code analysis on the resulting AST, checking for specific issues like improper uses of functions, dangerous calls, or the lack of certain elements.
This is the layout of my data, which Happy outputs as the AST. (This isn't the full AST, just a snapshot.)
data SourceUnit = SourceUnit PragmaDirective
                | ImportUnit ImportDirective
                | ContractDef ContractDefinition
                deriving (Show, Eq, Data, Typeable, Ord)

-- Version information
data PragmaDirective = PragmaDirective PragmaName Version Int
  deriving (Show, Eq, Data, Typeable, Ord)

data Version = Version String
  deriving (Show, Eq, Data, Typeable, Ord)

data PragmaName = PragmaName Ident
  deriving (Show, Eq, Typeable, Data, Ord)

data PragmaValue = PragmaValue Dnum
  deriving (Show, Eq, Data, Typeable, Ord)

-- File imports/contract imports
data ImportDirective = ImportDir String
                     | ImportMulti Identifier Identifier Identifier String
                     deriving (Show, Eq, Data, Typeable, Ord)

-- The definition of an actual contract code block
data ContractDefinition = Contract Identifier [InheritanceSpec] [ContractConts]
  deriving (Show, Eq, Data, Typeable, Ord)

data ContractConts = StateVarDec StateVarDeclaration
                   | FunctionDefinition FunctionDef
                   | UsingFor UsingForDec
                   deriving (Show, Eq, Data, Typeable, Ord)
My current train of thought is to use pattern matching: pass the [SourceUnit] to a function and match for specific cases. For instance, the following function matches the AST and returns the data for a state variable declaration.
getStateVar :: [SourceUnit] -> Maybe StateVarDeclaration
getStateVar [SourceUnit _, ContractDef (Contract _ _ [StateVarDec x])] = Just x
getStateVar _ = Nothing
This outputs the following, which is in part what I need. Unfortunately, the source could contain multiple contract declarations, each with multiple state variable declarations, so I don't think it's possible to match everything in this fashion.
Main> getStateVar $ runTest "pragma solidity ^0.5.0; contract test { address owner = msg.send;}"
Just (StateVariableDeclaration (ElementaryTypeName (AddrType "address")) [] (Identifier "owner") [MemberAccess (IdentExpression "msg") "." (Identifier "send")])
I've read a bit about generic programming and "scrap your boilerplate", but I don't understand exactly how it works or what the best way to implement it here would be.
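From what I've read, I think something like this would collect every state variable declaration no matter how deeply it's nested (a sketch using listify from the syb package; it assumes StateVarDeclaration also derives Data, which isn't shown above):

import Data.Generics (listify)

-- Collect every StateVarDeclaration anywhere in the tree,
-- across any number of contracts.
allStateVars :: [SourceUnit] -> [StateVarDeclaration]
allStateVars = listify (const True)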
The question is: am I on the right track pattern matching this way, or is there a better alternative?
Hi, I am attempting to make a combinator parser, and I am currently trying to make it read headers and create parsers based on which header is parsed. E.g. a header of int, float, string will result in Parser<Parser<int> * Parser<float> * Parser<string>>.
I am wondering, however, how you would unpack the "inner" parsers and end up with something like Parser<int * float * string>?
The Parser type is:
type Parser<'a> = Parser of (string -> Result<'a * string, string>)
I'm not sure that your idea with nested parsers is going to work - if you parse a header dynamically, then you'll need to produce a list of parsers of the same type. The way you wrote this suggests that the type of the parser will depend on the input, which is not possible in F#.
So, I'd expect that you will need to define a value like this:
type Value = Int of int | String of string | Float of float
And then your parser that parses a header will produce something like:
let parseHeaders args : Parser<Parser<Value> list> = (...)
The next question is, what do you want to do with the nested parsers? Presumably, you'll need to turn them into a single parser that parses the whole line of data (if this is something like a CSV file). Typically, you'd define a function sequence:
val sequence : sep:Parser<unit> -> parsers:Parser<'a> list -> Parser<'a list>
This takes a separator (say, a parser to recognize a comma) and a list of parsers, and produces a single parser that runs all the parsers in a sequence with the separator in between.
Then you can do:
parseHeaders input |> map (fun parsers -> sequence (map ignore (char ',')) parsers)
And you get a single parser Parser<Parser<Value list>>. You now want to run the nested parser on the rest of the input that is left after running the outer parser, which recognizes the headers. The following function does the trick:
let unwrap (Parser f : Parser<Parser<'a>>) = Parser (fun s ->
    match f s with
    | Result.Ok (Parser nested, rest) -> nested rest
    | Result.Error e -> Result.Error e)
I am writing a Delphi code parser using Parsec, my current AST data structures look like this:
module Text.DelphiParser.Ast where
data TypeName = TypeName String [String] deriving (Show)
type UnitName = String
data ArgumentKind = Const | Var | Out | Normal deriving (Show)
data Argument = Argument ArgumentKind String TypeName deriving (Show)
data MethodFlag = Overload | Override | Reintroduce | Static | StdCall deriving (Show)
data ClassMember
  = ConstField String TypeName
  | VarField String TypeName
  | Property String TypeName String (Maybe String)
  | ConstructorMethod String [Argument] [MethodFlag]
  | DestructorMethod String [Argument] [MethodFlag]
  | ProcMethod String [Argument] [MethodFlag]
  | FunMethod String [Argument] TypeName [MethodFlag]
  | ClassProcMethod String [Argument] [MethodFlag]
  | ClassFunMethod String [Argument] TypeName [MethodFlag]
  deriving (Show)
data Visibility = Private | Protected | Public | Published deriving (Show)
data ClassSection = ClassSection Visibility [ClassMember] deriving (Show)
data Class = Class String [ClassSection] deriving (Show)
data Type = ClassType Class deriving (Show)
data Interface = Interface [UnitName] [Type] deriving (Show)
data Implementation = Implementation [UnitName] deriving (Show)
data Unit = Unit String Interface Implementation deriving (Show)
I want to preserve comments in my AST data structures and I'm currently trying to figure out how to do this.
My parser is split into a lexer and a parser (both written with Parsec) and I have already implemented lexing of comment tokens.
unit SomeUnit;

interface

uses
  OtherUnit1, OtherUnit2;

type
  // This is my class that does blabla
  TMyClass = class
  var
    FMyAttribute: Integer;
  public
    procedure SomeProcedure;
    { The constructor takes an argument ... }
    constructor Create(const Arg1: Integer);
  end;

implementation

end.
The token stream looks like this:
[..., Type, LineComment " This is my class that does blabla", Identifier "TMyClass", Equals, Class, ...]
The parser translates this into:
Class "TMyClass" ...
The Class data type doesn't have any way to attach comments, and since comments (especially block comments) can appear almost anywhere in the token stream, it seems I would have to add an optional comment field to every data type in the AST.
How can I deal with comments in my AST?
A reasonable approach for dealing with annotated data on an AST is to thread an extra type parameter through that can contain whatever metadata you like. Apart from being able to selectively include or ignore comments, this will also let you include other sorts of information with your tree.
First, you would rewrite all your AST types with an extra parameter:
data TypeName a = TypeName a String [String]
{- ... -}
data ClassSection a = ClassSection a Visibility [ClassMember a]
{- ... -}
It would be useful to add deriving Functor to all of them as well, making it easy to transform the annotations on a given AST.
Now an AST with the comments remaining would have the type Class Comment or something to that effect. You could also reuse this for additional information like scope analysis, where you would include the current scope with the relevant part of the AST.
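As a minimal, self-contained sketch of that shape (Comment and stripAnnotations are illustrative names, not from the original code):

{-# LANGUAGE DeriveFunctor #-}

type Comment = String

data TypeName a = TypeName a String [String]
  deriving (Show, Functor)

-- Forget all annotations, e.g. to compare two ASTs modulo comments.
stripAnnotations :: Functor f => f a -> f ()
stripAnnotations = fmap (const ())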
If you wanted multiple annotations at once, the simplest solution would be to use a record, although that's a bit awkward because (at least for now¹) we can't easily write code polymorphic over record fields. (I.e., we can't easily write the type "any record with a comments :: Comment field".)
One additional neat thing you can do is use PatternSynonyms (available from GHC 7.8) to have a suite of patterns that work just like your current unannotated AST, letting you reuse your existing case statements. (To do this, you'll also have to rename the constructors for the annotated types so they don't overlap.)
pattern TypeName a as <- TypeName' _ a as
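In context, that one-liner would sit alongside the renamed constructor (a sketch; TypeName' is the hypothetical renamed annotated constructor, and typeNameString is just illustrative):

{-# LANGUAGE PatternSynonyms #-}

data TypeName a = TypeName' a String [String]

pattern TypeName a as <- TypeName' _ a as

-- Code that matched the old unannotated shape keeps working:
typeNameString :: TypeName ann -> String
typeNameString (TypeName n _) = n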
Footnotes
¹ Hopefully part 2 of the revived overloaded record fields proposal will help in this regard when it actually gets added to the language.
I'm new to Haskell, and am working through the Haskell LLVM tutorial. In it, the author defines a simple algebraic data type to represent the AST.
type Name = String
data Expr
= Float Double
| BinOp Op Expr Expr
| Var String
| Call Name [Expr]
| Function Name [Expr] Expr
| Extern Name [Expr]
deriving (Eq, Ord, Show)
data Op
= Plus
| Minus
| Times
| Divide
deriving (Eq, Ord, Show)
However, this is not an ideal structure, because the parser actually expects that the list of Expr in an Extern will only ever contain expressions representing variables (i.e. parameters in this situation cannot be arbitrary expressions). I would like to make the types reflect this constraint (making it easier to generate random valid ASTs using QuickCheck); however, for the sake of consistency in the parser functions (which all have type Parser Expr), I don't just want to say | Extern Name [Name]. I would like to do something like this:
data Expr
  = ...
  | Var String
  ...
  | Function Name [Expr] Expr
  | Extern Name [Var] -- enforce constraint here
  deriving (Eq, Ord, Show)
But this isn't possible in Haskell.
To summarize, Extern and Var should both be Expr, and Extern should have a list of Vars representing parameters. Would the best way be to split all of these out and make them instances of an Expr typeclass (that wouldn't have any methods)? Or is there a more idiomatic method (or would it be better to scrap these types and do something totally different)?
Disclaimer, I'm the author of the LLVM tutorial you mentioned.
Just use Extern Name [Name]; everything from Chapter 3 onward in the tutorial uses that exact definition anyway. I think I just forgot to make the Chapter 2 Syntax.hs consistent with the others.
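Concretely, that means the Extern constructor becomes (a sketch; the other constructors stay as in the question):

data Expr
  = Float Double
  | BinOp Op Expr Expr
  | Var String
  | Call Name [Expr]
  | Function Name [Expr] Expr
  | Extern Name [Name] -- parameters are plain names, not arbitrary expressions
  deriving (Eq, Ord, Show)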
I wouldn't worry about making the parser definitions consistent; it's fine for them to return different types. Here's what the later parsers use. identifier is just the Parsec built-in for the alphanumeric identifier from the LanguageDef, which becomes the Name type in the AST.
extern :: Parser Expr
extern = do
  reserved "extern"
  name <- identifier
  args <- parens $ many identifier
  return $ Extern name args
This question is related to both Parsec and uu-parsinglib. When we write parser combinators, they process character streams. Is it somehow possible to parse a character and put it back (or put another character back) into the input stream?
I want, for example, to parse the input "test + 5": parse t, e, s, t, and after recognizing the test pattern, put, say, a v character back into the character stream, so that as parsing continues we are matching against v + 5.
I do not want to use this in any particular case for now - I want to deeply learn the possibilities.
I'm not sure if it's possible with these parsers directly, but in general you can accomplish it by combining parsers with some streaming that allows injecting leftovers.
For example, using attoparsec-conduit you can turn a parser into a conduit using
sinkParser :: (AttoparsecInput a, MonadThrow m)
           => Parser a b -> Consumer a m b
where Consumer is a special kind of conduit that doesn't produce any output, only receives input and returns a final value.
Since conduits support leftovers, you can create a helper method that converts a parser that optionally returns a value to be pushed into the stream into a conduit:
import Data.Attoparsec.Types
import Data.Conduit
import Data.Conduit.Attoparsec
import Data.Functor
reinject :: (AttoparsecInput a, MonadThrow m)
         => Parser a (Maybe a, b) -> Consumer a m b
reinject p = do
  (lo, r) <- sinkParser p
  maybe (return ()) leftover lo
  return r
Then you convert standard parsers to conduits using sinkParser and these special parsers using reinject, and then combine conduits instead of parsers.
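For example, the composition might look like this (a sketch with made-up Header/Body types and parsers; the point is that whatever reinject pushes back is what the next sink sees first):

-- Hypothetical pieces, just to show the shape:
-- parseHeader :: Parser ByteString (Maybe ByteString, Header)
-- parseBody   :: Parser ByteString Body
headerThenBody :: MonadThrow m => Consumer ByteString m (Header, Body)
headerThenBody = do
  h <- reinject parseHeader   -- may push a replacement back into the stream
  b <- sinkParser parseBody   -- consumes the reinjected input first
  return (h, b)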
I think the simplest way to achieve this is to build a multi-layered parser. Think of a lexer + parser combination. This is a clean approach to the problem.
You have to separate the two kinds of parsing: the search-and-replace parsing goes into the first parser and the build-the-AST parsing into the second. Or you can create an intermediate token representation.
import Text.Parsec
import Text.Parsec.String

parserLvl1 :: Parser String
parserLvl1 = many (try (string "test" >> return 'v') <|> anyChar)

parserLvl2 :: Parser Plus
parserLvl2 = do
    text1 <- many (noneOf "+")
    char '+'
    text2 <- many (noneOf "+")
    return $ Plus text1 text2

data Plus = Plus String String
    deriving Show

wholeParse :: String -> Either ParseError Plus
wholeParse source = do
    res1 <- parse parserLvl1 "lvl1" source
    res2 <- parse parserLvl2 "lvl2" res1
    return res2
Now you can parse your example. wholeParse "test+5" results in Right (Plus "v" "5").
Possible variations:
Create a class and an instance for combining wrapped parser stages. (Possibly carrying parser state.)
Create an intermediate representation, a stream of tokens
This is easily done in uu-parsinglib using the pSwitch function. But the question is why you want to do so. Because the v is missing from the input? In that case uu-parsinglib will perform error correction automatically, so you do not need something like this. Otherwise you can write:
pSwitch :: (st1 -> (st2, st2 -> st1)) -> P st2 a -> P st1 a
pInsert_v = pSwitch (\st1 -> (prepend 'v' st1, id)) (pSucceed ())
How the v is actually added depends on your actual state type, so you will have to define the function prepend yourself. I do not know, e.g., how such an insertion would influence the current position in the file, etc.
Doaitse Swierstra
Given an LL(1) grammar, what is an appropriate data structure or algorithm for producing an immutable concrete syntax tree in a functionally pure manner? Please feel free to write example code in whatever language you prefer.
My Idea
symbol : either a token or a node
result : success or failure
token : a lexical token from source text
    value -> string : the value of the token
    type -> integer : the named type code of the token
    next -> token : reads the next token and keeps position of the previous token
    back -> token : moves back to the previous position and re-reads the token
node : a node in the syntax tree
    type -> integer : the named type code of the node
    symbols -> linkedList : the symbols found at this node
    append -> symbol -> node : appends the new symbol to a new copy of the node
Here is an idea I have been thinking about. The main issue here is handling syntax errors.
I mean I could stop at the first error but that doesn't seem right.
let program token =
    sourceElements (node nodeType.program) token

let sourceElements node token =
    let (n, r) = sourceElement (node.append (node nodeType.sourceElements)) token
    match r with
    | success -> (n, r)
    | failure -> // ???

let sourceElement node token =
    match token.value with
    | "function" ->
        functionDeclaration (node.append (node nodeType.sourceElement)) token
    | _ ->
        statement (node.append (node nodeType.sourceElement)) token
Please Note
I will be offering up a nice bounty for the best answer, so don't feel rushed. Answers that simply post a link will carry less weight than answers that show code or contain detailed explanations.
Final Note
I am really new to this kind of stuff so don't be afraid to call me a dimwit.
You want to parse something into an abstract syntax tree.
In the purely functional programming language Haskell, you can use parser combinators to express your grammar. Here is an example that parses a tiny expression language:
EDIT: Use monadic style to match Graham Hutton's book.
-- import a library of *parser combinators*
import Parsimony
import Parsimony.Char
import Parsimony.Error

(+++) = (<|>)

-- abstract syntax tree
data Expr = I Int
          | Add Expr Expr
          | Mul Expr Expr
          deriving (Eq, Show)

-- parse an expression
parseExpr :: String -> Either ParseError Expr
parseExpr = Parsimony.parse pExpr
  where
    -- grammar
    pExpr :: Parser String Expr
    pExpr = do
        a <- pNumber +++ parentheses pExpr  -- first argument
        (do f <- pOp                        -- operation symbol
            b <- pExpr                      -- second argument
            return (f a b))
         +++ return a                       -- or just the first argument

    parentheses parser = do  -- parse inside parentheses
        string "("
        x <- parser
        string ")"
        return x

    pNumber = do  -- parse an integer
        digits <- many1 digit
        return . I . read $ digits

    pOp =  -- parse an operation symbol
        do string "+"
           return Add
        +++
        do string "*"
           return Mul
Here is an example run:
*Main> parseExpr "(5*3)+1"
Right (Add (Mul (I 5) (I 3)) (I 1))
To learn more about parser combinators, see for example chapter 8 of Graham Hutton's book "Programming in Haskell" or chapter 16 of "Real World Haskell".
Many parser combinator libraries can be used with different token types, as you intend to do. Token streams are usually represented as lists of tokens, [Token].
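For instance, here is a minimal sketch of a parser running over a token stream instead of characters (using Parsec rather than Parsimony, with a made-up Token type; tokenPrim is Parsec's primitive for consuming a single token, and Expr/I are the AST from above):

import Text.Parsec

data Token = TInt Int | TPlus | TTimes | TLParen | TRParen
  deriving (Eq, Show)

type TokenParser a = Parsec [Token] () a

-- Consume one token if the predicate accepts it. (Source positions
-- are left unchanged for brevity; a real lexer would carry them.)
satisfyTok :: (Token -> Maybe a) -> TokenParser a
satisfyTok = tokenPrim show (\pos _ _ -> pos)

pNumber :: TokenParser Expr
pNumber = satisfyTok (\t -> case t of TInt n -> Just (I n); _ -> Nothing)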
Definitely check out the monadic parser combinator approach; I've blogged about it in C# and in F#.
Eric Lippert's blog series on immutable binary trees may be helpful. Obviously, you need a tree which is not binary, but it will give you the general idea.
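To make the non-binary case concrete, a persistent node along the lines of the node/append idea in the question might look like this (a Haskell sketch; the Token record just mirrors the value/type fields described above):

data Token = Token
  { tokValue :: String  -- the value of the token
  , tokType  :: Int     -- the named type code of the token
  }

-- A persistent (immutable) n-ary syntax tree node: "appending"
-- builds a new node and shares the existing children.
data Symbol = Tok Token
            | Sub Node

data Node = Node
  { nodeType :: Int       -- named type code of the node
  , symbols  :: [Symbol]  -- symbols found at this node
  }

append :: Node -> Symbol -> Node
append (Node t ss) s = Node t (ss ++ [s])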