I am writing a Delphi code parser using Parsec. My current AST data structures look like this:
module Text.DelphiParser.Ast where
data TypeName = TypeName String [String] deriving (Show)
type UnitName = String
data ArgumentKind = Const | Var | Out | Normal deriving (Show)
data Argument = Argument ArgumentKind String TypeName deriving (Show)
data MethodFlag = Overload | Override | Reintroduce | Static | StdCall deriving (Show)
data ClassMember =
    ConstField String TypeName
  | VarField String TypeName
  | Property String TypeName String (Maybe String)
  | ConstructorMethod String [Argument] [MethodFlag]
  | DestructorMethod String [Argument] [MethodFlag]
  | ProcMethod String [Argument] [MethodFlag]
  | FunMethod String [Argument] TypeName [MethodFlag]
  | ClassProcMethod String [Argument] [MethodFlag]
  | ClassFunMethod String [Argument] TypeName [MethodFlag]
  deriving (Show)
data Visibility = Private | Protected | Public | Published deriving (Show)
data ClassSection = ClassSection Visibility [ClassMember] deriving (Show)
data Class = Class String [ClassSection] deriving (Show)
data Type = ClassType Class deriving (Show)
data Interface = Interface [UnitName] [Type] deriving (Show)
data Implementation = Implementation [UnitName] deriving (Show)
data Unit = Unit String Interface Implementation deriving (Show)
I want to preserve comments in my AST data structures and I'm currently trying to figure out how to do this.
My parser is split into a lexer and a parser (both written with Parsec), and I have already implemented lexing of comment tokens. Consider this example Delphi source:
unit SomeUnit;
interface
uses
  OtherUnit1, OtherUnit2;
type
  // This is my class that does blabla
  TMyClass = class
  var
    FMyAttribute: Integer;
  public
    procedure SomeProcedure;
    { The constructor takes an argument ... }
    constructor Create(const Arg1: Integer);
  end;
implementation
end.
The token stream looks like this:
[..., Type, LineComment " This is my class that does blabla", Identifier "TMyClass", Equals, Class, ...]
The parser translates this into:
Class "TMyClass" ...
The Class data type doesn't have any way to attach comments, and since comments (especially block comments) could appear almost anywhere in the token stream, would I have to add an optional comment to every data type in the AST?
How can I deal with comments in my AST?
A reasonable approach for dealing with annotated data on an AST is to thread an extra type parameter through that can contain whatever metadata you like. Apart from being able to selectively include or ignore comments, this will also let you include other sorts of information with your tree.
First, you would rewrite all your AST types with an extra parameter:
data TypeName a = TypeName a String [String]
{- ... -}
data ClassSection a = ClassSection a Visibility [ClassMember a]
{- ... -}
It would be useful to add deriving Functor to all of them as well (this needs the DeriveFunctor extension), making it easy to transform the annotations on a given AST.
Now an AST with the comments remaining would have the type Class Comment or something to that effect. You could also reuse this for additional information like scope analysis, where you would include the current scope with the relevant part of the AST.
If you wanted multiple annotations at once, the simplest solution would be to use a record, although that's a bit awkward because (at least for now¹) we can't easily write code polymorphic over record fields. (I.e., we can't easily write the type "any record with a comments :: Comment field".)
One additional neat thing you can do is use PatternSynonyms (available from GHC 7.8) to have a suite of patterns that work just like your current unannotated AST, letting you reuse your existing case statements. (To do this, you'll also have to rename the constructors for the annotated types so they don't overlap.)
pattern TypeName a as <- TypeName' _ a as
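Putting these pieces together, a minimal sketch might look like the following. It covers only a few of the AST types, assumes a hypothetical Comment annotation type, and uses primed constructor names (TypeName' and so on, my choice) so that the pattern synonym can take over the original name. The helper names baseName, stripAnnotations and exampleSection are also made up for illustration.

```haskell
{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE PatternSynonyms #-}

import Data.Functor (void)

-- Hypothetical annotation payload; any other type would work as well.
newtype Comment = Comment String
  deriving (Show, Eq)

-- Annotated AST nodes: the first field of each constructor is the annotation.
data TypeName' a = TypeName' a String [String]
  deriving (Show, Eq, Functor)

data Visibility = Private | Protected | Public | Published
  deriving (Show, Eq)

data ClassMember' a = VarField' a String (TypeName' a)
  deriving (Show, Eq, Functor)

data ClassSection' a = ClassSection' a Visibility [ClassMember' a]
  deriving (Show, Eq, Functor)

-- Unidirectional pattern synonym: it matches like the old unannotated
-- constructor and ignores the annotation.
pattern TypeName :: String -> [String] -> TypeName' a
pattern TypeName name params <- TypeName' _ name params

-- Code written against the unannotated constructor keeps its shape.
baseName :: TypeName' a -> String
baseName (TypeName name _) = name

-- Thanks to the derived Functor instances, dropping all annotations is
-- a single call to void (that is, fmap (const ())).
stripAnnotations :: ClassSection' a -> ClassSection' ()
stripAnnotations = void

-- An example tree annotated with lists of comments (empty where none apply).
exampleSection :: ClassSection' [Comment]
exampleSection =
  ClassSection' [] Public
    [ VarField' [Comment " The attribute"] "FMyAttribute"
        (TypeName' [] "Integer" [])
    ]
```

Using a list of comments as the annotation is just one option; it lets nodes without comments carry an empty list instead of forcing a Maybe on every type.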
Footnotes
¹ Hopefully part 2 of the revived overloaded record fields proposal will help in this regard when it actually gets added to the language.
Let's say there are two unions where one is a strict subset of another.
type Superset =
    | A of int
    | B of string
    | C of decimal
type Subset =
    | A of int
    | B of string
Is it possible to automatically upcast a Subset value to Superset value without resorting to explicit pattern matching? Like this:
let x : Subset = A 1
let y : Superset = x // this won't compile :(
Ideally, if the Subset type were altered so that it is no longer a subset, the compiler should complain:
type Subset =
    | A of int
    | B of string
    | D of bool // - no longer a subset of Superset!
I believe it's not possible, but it's still worth asking (at least to understand why it's impossible).
WHY I NEED IT
I use this style of set/subset typing extensively in my domain to restrict valid parameters in different states of entities and to make invalid states non-representable. I find the approach very beneficial; the only downside is the very tedious upcasting between subsets.
Sorry, no
Sorry, but this is not possible. Take a look at https://fsharpforfunandprofit.com/posts/fsharp-decompiled/#unions and you'll see that F# compiles discriminated unions to .NET classes, each separate from the others, with no common ancestors (apart from Object, of course). The compiler makes no effort to identify subsets or supersets between different DUs. If it did work the way you suggested, it would be a breaking change, because the only way to do this would be to make the subset DU a base class, and the superset class its derived class with an extra property. And that would make the following code change behavior:
type PhoneNumber =
    | Valid of string
    | Invalid
type EmailAddress =
    | Valid of string
    | ValidButOutdated of string
    | Invalid
let identifyContactInfo (info : obj) =
    // This came from external code we don't control, but it should be contact info
    match info with
    | :? PhoneNumber as phone -> () // Do something
    | :? EmailAddress as email -> () // Do something
    | _ -> () // Not contact info at all
Yes, this is bad code and should be written differently, but it illustrates the point. Under current compiler behavior, if identifyContactInfo gets passed an EmailAddress object, the :? PhoneNumber test will fail, so it will enter the second branch of the match and treat that object (correctly) as an email address. If the compiler were to guess supersets/subsets based on DU case names as you're suggesting here, then PhoneNumber would be considered a subset of EmailAddress and so would become its base class. Then, when this function received an EmailAddress object, the :? PhoneNumber test would succeed (because an instance of a derived class can always be cast to the type of its base class), the code would enter the first branch of the match expression, and your code might then try to send a text message to an email address.
But wait...
What you're trying to do might be achievable by pulling out the subsets into their own DU category:
type AorB =
    | A of int
    | B of string
type ABC =
    | AorB of AorB
    | C of decimal
type ABD =
    | AorB of AorB
    | D of bool
Then your match expressions for an ABC might look like:
// Both ABC and ABD define a case named AorB, so qualify it with the type name
match foo with
| ABC.AorB (A num) -> printfn "%d" num
| ABC.AorB (B s) -> printfn "%s" s
| C num -> printfn "%M" num
And if you need to pass data between an ABC and an ABD:
let (bar : ABD option) =
    match foo with
    | ABC.AorB data -> Some (ABD.AorB data)
    | C _ -> None
That's not a huge savings if your subset has only two common cases. But if your subset is a dozen cases or so, being able to pass those dozen around as a unit makes this design attractive.
I've built a parser and lexer with Alex and Happy which produces an abstract syntax tree of the language I'm parsing (Solidity). My problem now is how to properly traverse the tree and match certain aspects of the language. The aim is to create a rule engine which will perform code analysis on the resulting AST, checking for specific issues such as improper uses of functions, dangerous calls, or missing elements.
This is the layout of my data, which happy outputs as the AST. (This isn't the full AST but just a snapshot)
data SourceUnit = SourceUnit PragmaDirective
                | ImportUnit ImportDirective
                | ContractDef ContractDefinition
                deriving (Show, Eq, Data, Typeable, Ord)
-- Version Information
data PragmaDirective = PragmaDirective PragmaName Version Int
                       deriving (Show, Eq, Data, Typeable, Ord)
data Version = Version String
               deriving (Show, Eq, Data, Typeable, Ord)
data PragmaName = PragmaName Ident
                  deriving (Show, Eq, Typeable, Data, Ord)
data PragmaValue = PragmaValue Dnum
                   deriving (Show, Eq, Data, Typeable, Ord)
-- File imports/Contract Imports
data ImportDirective = ImportDir String
                     | ImportMulti Identifier Identifier Identifier String
                     deriving (Show, Eq, Data, Typeable, Ord)
-- The definition of an actual Contract Code Block
data ContractDefinition = Contract Identifier [InheritanceSpec] [ContractConts]
                          deriving (Show, Eq, Data, Typeable, Ord)
data ContractConts = StateVarDec StateVarDeclaration
                   | FunctionDefinition FunctionDef
                   | UsingFor UsingForDec
                   deriving (Show, Eq, Data, Typeable, Ord)
My current train of thought is to use pattern matching by passing in the [SourceUnit] to a function and matching for specific cases. For instance the following function matches the code and returns the data type for a state variable declaration.
getStateVar :: [SourceUnit] -> Maybe StateVarDeclaration
getStateVar [SourceUnit _ , ContractDef (Contract _ _ [StateVarDec x]) ] = Just x
getStateVar _ = Nothing
This outputs the following, which is in part what I need. Unfortunately the language could contain multiple contract declarations, with multiple state variable declarations so I don't think it's entirely possible to match it in this fashion.
Main> getStateVar $ runTest "pragma solidity ^0.5.0; contract test { address owner = msg.send;}"
Just (StateVariableDeclaration (ElementaryTypeName (AddrType "address")) [] (Identifier "owner") [MemberAccess (IdentExpression "msg") "." (Identifier "send")])
I've read a bit about generic programming and "scrap your boilerplate", but I don't understand exactly how it works or what the best way to implement it would be.
The question is, am I on the right track in terms of pattern matching this way or is there a better alternative?
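One way to avoid hard-coding the shape of the list is a generic query in the "scrap your boilerplate" style. Since the AST types already derive Data, a small collector can be written with gmapQ and cast from Data.Data in base; the listify function from the syb package does a similar job. The sketch below uses cut-down, hypothetical versions of the AST types (PragmaUnit, OtherMember and the String fields are stand-ins for the real ones), so treat it as an illustration, not as the actual Solidity AST.

```haskell
{-# LANGUAGE DeriveDataTypeable #-}

import Data.Data (Data, Typeable, cast, gmapQ)

-- Simplified stand-ins for the real AST types (assumptions for this sketch).
data StateVarDeclaration = StateVarDeclaration String String
  deriving (Show, Eq, Data)

data ContractConts
  = StateVarDec StateVarDeclaration
  | OtherMember String
  deriving (Show, Eq, Data)

data ContractDefinition = Contract String [ContractConts]
  deriving (Show, Eq, Data)

data SourceUnit
  = PragmaUnit String
  | ContractDef ContractDefinition
  deriving (Show, Eq, Data)

-- Collect every subterm of type b, visiting the tree in pre-order.
-- cast succeeds only when the current node has exactly type b;
-- gmapQ applies the query to each immediate child.
collectAll :: (Data a, Typeable b) => a -> [b]
collectAll x = maybe [] (: []) (cast x) ++ concat (gmapQ collectAll x)

-- All state variable declarations, however many contracts there are.
stateVars :: [SourceUnit] -> [StateVarDeclaration]
stateVars = collectAll
```

Because the result type drives the query, the same collectAll could be reused for function definitions or any other node type, and a rule then becomes an ordinary function over the collected list.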
I have an abstract base class named Token and some subtypes such as NumToken and StrToken.
I want to put their instances in the same list.
I can't declare a variable using let l = list<'a when 'a :> Token>.
So I wrote:
let extractToken<'a when 'a :> Token>(lineNum:int, line:string) : 'a list option =
    let mutable result : 'a list = []
This compiles, but I cannot add an element: result <- new NumToken(lineNum, value) :: result fails, saying it needs 'a but got NumToken.
For now I use new NumToken(lineNum, value) :> Token and declare a Token list.
It works but looks ugly (I know F# doesn't upcast automatically).
list<_ :> Token> doesn't work either; it only accepts one subtype.
Thanks for any help.
When you model tokens using a class hierarchy and you create a list of tokens, the type of the list needs to be specific. You can either return list<Token> or list<NumToken>.
The flexible types with when constraints are useful only quite rarely - typically, when you have a function that takes some other function and returns whatever the other function produces, so I do not think you need them here. You can use list<Token> and write:
result <- (NumToken(lineNum, value) :> Token) :: result
That said, modelling tokens in F# using a class hierarchy is not a very good idea. F# supports discriminated unions, which are a much better fit for this kind of problem:
type Token =
    | NumToken of int
    | StrToken of string
Then you can write a function that returns list<Token> just by writing
result <- (NumToken 42) :: result
Depending on what you are doing, though, it might also be a good idea to avoid the mutation.
I'm new to Haskell, and am working through the Haskell LLVM tutorial. In it, the author defines a simple algebraic data type to represent the AST.
type Name = String
data Expr
  = Float Double
  | BinOp Op Expr Expr
  | Var String
  | Call Name [Expr]
  | Function Name [Expr] Expr
  | Extern Name [Expr]
  deriving (Eq, Ord, Show)
data Op
  = Plus
  | Minus
  | Times
  | Divide
  deriving (Eq, Ord, Show)
However, this is not an ideal structure, because the parser actually expects that the list of Expr in an Extern will only ever contain expressions representing variables (i.e. parameters in this situation cannot be arbitrary expressions). I would like to make the types reflect this constraint (making it easier to generate random valid ASTs using QuickCheck); however, for the sake of consistency in the parser functions (which all have type Parser Expr), I don't just want to say | Extern Name [Name]. I would like to do something like this:
data Expr
  = ...
  | Var String
  ...
  | Function Name [Expr] Expr
  | Extern Name [Var] -- enforce constraint here
  deriving (Eq, Ord, Show)
But it's not possible in Haskell.
To summarize, Extern and Var should both be Expr, and Extern should have a list of Vars representing parameters. Would the best way be to split all of these out and make them instances of an Expr typeclass (that wouldn't have any methods)? Or is there a more idiomatic method (or would it be better to scrap these types and do something totally different)?
Disclaimer, I'm the author of the LLVM tutorial you mentioned.
Just use Extern Name [Name]; everything from Chapter 3 onward in the tutorial uses that exact definition anyway. I think I just forgot to make Chapter 2's Syntax.hs consistent with the others.
I wouldn't worry about making the parser definitions consistent; it's fine for them to return different types. Here's what the later parsers use. identifier is just the Parsec builtin for the alphanumeric identifier from the LanguageDef that becomes the Name type in the AST.
extern :: Parser Expr
extern = do
  reserved "extern"
  name <- identifier
  args <- parens $ many identifier
  return $ Extern name args
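For reference, the AST from the question with only that one change applied might look like the sketch below. The helper paramExprs is hypothetical (it is not from the tutorial); it only shows that the parameter names can still be turned back into Var expressions wherever an Expr is required.

```haskell
type Name = String

data Expr
  = Float Double
  | BinOp Op Expr Expr
  | Var String
  | Call Name [Expr]
  | Function Name [Expr] Expr
  | Extern Name [Name] -- parameters are now plain names, not arbitrary Exprs
  deriving (Eq, Ord, Show)

data Op
  = Plus
  | Minus
  | Times
  | Divide
  deriving (Eq, Ord, Show)

-- Hypothetical helper: view the parameters of an Extern as Var expressions.
-- Any other kind of expression has no extern parameters.
paramExprs :: Expr -> [Expr]
paramExprs (Extern _ params) = map Var params
paramExprs _ = []
```

With this definition an Extern holding a non-variable parameter can no longer be constructed, so a random generator for valid ASTs only needs to produce names for that field.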
I was wondering if it is possible to do something like this in Haskell
data Word = Det String | Adj String | Noun String | Verb String | Adverb String
data NounPhrase = Noun | Det Noun
If I'm going about this wrong, what I am trying to say is - a "Word" is either a "Det", "Adj", "Noun", etc. and a "NounPhrase" is a "Noun" OR a "Det" followed by a "Noun".
When I try to do this I get the error: "Undefined type constructor "Noun""
How can I go about this so that it behaves as described above?
When you define an algebraic data type like
data MyType = Con1 String | Con2 Int
then Con1 and Con2 are data constructors, which are themselves functions: Con1 :: String -> MyType and Con2 :: Int -> MyType.
Your example has two problems:
1. You're using the same data constructors for different types. Since a data constructor is a function that yields a value of one specific type, you can't use the same data constructors (Det and Noun) for both Word and NounPhrase. You need to choose different names for the constructors of NounPhrase.
2. Det Noun does not make sense, since Noun is a data constructor, whereas the argument of Det needs to be a type, e.g., String.
See Constructors in Haskell to help clear things up.
You are confusing types and value constructors. When you say
data Word = Det String | Adj String | Noun String | Verb String | Adverb String
you define the type Word as being one of a number of forms. For example,
Det String
says that Det is a constructor that takes a String and gives you back a value of type Word. Same goes with Noun which is defined to be a constructor for Words and not a type.
There are various ways you might encode what you want in Haskell. By far the simplest is to use
data Word = Det String | Adj String | Noun String | Verb String | Adverb String
data NounPhrase = JustNoun String | Compound String String
and this is the "learning Haskell" way of doing it. It is "stringly typed", but is probably sufficient for your purposes.
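If the stringly typed version becomes too loose, one possible next step is to give each word class its own type, so that NounPhrase can refer to those types directly. The following is only a sketch: the names Lexeme, DetNoun and render are mine, and Lexeme stands in for Word only because Word clashes with the Word type exported by the Prelude.

```haskell
-- One type per word class, so each can be used as a field type.
newtype Det = Det String deriving (Show, Eq)
newtype Adj = Adj String deriving (Show, Eq)
newtype Noun = Noun String deriving (Show, Eq)
newtype Verb = Verb String deriving (Show, Eq)
newtype Adverb = Adverb String deriving (Show, Eq)

-- A word is any one of the classes above.
data Lexeme
  = LDet Det
  | LAdj Adj
  | LNoun Noun
  | LVerb Verb
  | LAdverb Adverb
  deriving (Show, Eq)

-- A noun phrase is a noun, or a determiner followed by a noun.
data NounPhrase
  = JustNoun Noun
  | DetNoun Det Noun
  deriving (Show, Eq)

-- Turn a noun phrase back into text.
render :: NounPhrase -> String
render (JustNoun (Noun n)) = n
render (DetNoun (Det d) (Noun n)) = d ++ " " ++ n
```

With these types, DetNoun (Det "the") (Noun "cat") type checks, while passing a Verb where a Noun is expected is rejected by the compiler.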