Generate parser from EBNF

Generate parser from EBNF - parsing

I want to create tools make developer work more easy (autocomplete, linting, refactoring etc).
The first step would be to create a tool that can take grammar in some form (I'm thinking about EBNF) and create parser for it.
The most pleasant option is to use Parsec, but I've as it is top-down parser wouldn't it be less performant then bottom up one?
Are there bottom-up parser generators for Haskell that can be created easily from BNF?
Also how parse only part of file that changes as user changes the sources?
type Grammar = String
type Label = String
data ParseTree a = Node
    { label :: Label
      start :: Integer
    , childrens :: [ParseTree]
    } | Leaf {value :: a} deriving Show
mkParser :: Grammar -> Parser ParseTree
mkParser = undefined
From ParseTree various data can be taken and stored persistently like function definition places, usages etc.
Next step is type inference, but for that I believe someone should define for each language what are function and data types definitions.
After that I think the algorithm would be language independent.
And from that there would be able to give suggestions and autocompletion and linting.
So is my train  of thought is going into right direction? And what are the answers to a more concrete questions in the start?

Related

When would a parser append its input string?

Given a parser
newtype Parser a = Parser { parse :: String -> [(a,String)] }
(>>=) :: Parser a -> (a -> Parser b) -> Parser b
p >>= f = Parser $ \s -> concat [ parse (f a) s' | (a, s') <- parse p s ]
return :: a -> Parser a
return a = Parser (\s -> [(a,s)])
item :: Parser Char
item = Parser $ \s -> case cs of
"" -> []
(c:cs) -> [(c,cs)]
We can see that item consumes part of the input string given to it ("abc" -> [('a', "bc")]). Is there ever a case where a parser would produce additional string output or replace/modify it (e.g. Parser $ \s -> [((), 'a':s)])? I suspect that this might be the case with context-sensitive grammars but have trouble coming up with a sensible example.
Is there a reason why it would make sense to do this for a real-world problem?
References
Monadic Parsing in Haskell

Here are a couple of cases where it is convenient to inject tokens into the input stream. (How this is actually integrated into the parsing pipeline is another question.)
Macro expansion, in the style of the C/C++ preprocessing phase. This is arguably not the best model for macro expansion; hygienic macros would more likely be expanded using a tree transformation, as with C++ template resolution. But the token-oriented preprocessor is not going away soon. Since it is not tightly coupled with the language syntax, the easiest implementation is to substitute the macro (and arguments if applicable) with the tokens from its expansion.
Ecmascript-style automatic semi-colon insertion (ASI). The language syntax requires a semi-colon to be inserted into the token stream under certain precisely-defined circumstances, which are difficult (at least) to incorporate in a CFG. Since ASI is only possible if the next token in the input stream cannot be shifted (and done other conditions), it can certainly be integrated into the parser loop.
Similarly, indentation-aware block syntax (as in Haskell and Python, for example) can certainly be implemented by replacing leading whitespace with an injected INDENT token or some number of injected DEDENTs. Since this substitution is dependent on parse context (it isn't done inside parentheses, for example), injection inside the parser could be a reasonable approach.
That's not an exhaustive list, but it might be at least indicative. Not all of those cases necessarily involve context-sensitivity (I believe ASI could, in theory, be handled with a context-free grammar although I have no intention of trying) and not all instances of context-sensitivity necessarily require token injection (the ambiguity in C between type and variable names only requires selecting the correct token).

How to parse arbitrary lists with Haskell parsers?

Is it possible to use one of the parsing libraries (e.g. Parsec) for parsing something different than a String? And how would I do this?
For the sake of simplicity, let's assume the input is a list of ints [Int]. The task could be
drop leading zeros
parse the rest into the pattern (S+L+)*, where S is a number less than 10, and L is a number larger or equal to ten.
return a list of tuples (Int,Int), where fst is the product of the S and snd is the product of the L integers
It would be great if someone could show how to write such a parser (or something similar).

Yes, as user5402 points out, Parsec can parse any instance of Stream, including arbitrary lists. As there are no predefined token parsers (as there are for text) you have to roll your own, (myToken below) using e.g. tokenPrim
The only thing I find a bit awkward is the handling of "source positions". SourcePos is an abstract type (rather than a type class) and forces me to use its "filename/line/column" format, which feels a bit unnatural here.
Anyway, here is the code (without the skipping of leading zeroes, for brevity)
import Text.Parsec
myToken :: (Show a) => (a -> Bool) -> Parsec [a] () a
myToken test = tokenPrim show incPos $ justIf test where
incPos pos _ _ = incSourceColumn pos 1
justIf test x = if (test x) then Just x else Nothing
small = myToken (< 10)
large = myToken (>= 10)
smallLargePattern = do
smallints <- many1 small
largeints <- many1 large
let prod = foldl1 (*)
return (prod smallints, prod largeints)
myIntListParser :: Parsec [Int] () [(Int,Int)]
myIntListParser = many smallLargePattern
testMe :: [Int] -> [(Int, Int)]
testMe xs = case parse myIntListParser "your list" xs of
Left err -> error $ show err
Right result -> result
Trying it all out:
*Main> testMe [1,2,55,33,3,5,99]
[(2,1815),(15,99)]
*Main> testMe [1,2,55,33,3,5,99,1]
*** Exception: "your list" (line 1, column 9):
unexpected end of input
Note the awkward line/column format in the error message
Of course one could write a function sanitiseSourcePos :: SourcePos -> MyListPosition

There is very likely a way to get Parsec to use [a] as the stream type, but the idea behind parser combinators is actually very simple, and it's not very difficult to roll your own library.
A very accessible resource I would recommend is Monadic Parsing in Haskell by Graham Hutton and Erik Meijer.
Indeed, right now Erik Meijer is teaching an intro Haskell/functional programming course on edx.org (link) and Lecture 7 is all about functional parsers. As he states in the intro to the lecture:
"... No one can follow the path towards mastering functional programming without writing their own parser combinator library. We start by explaining what parsers are and how they can naturally be viewed as side-effecting functions. Next we define a number of basic parsers and higher-order functions for combining parsers. ..."

Approaches For Deeply Extending a Parser Library

The material on parser combinators I have found covers building up complex parsers though composition, but I would like to know if there are any good approaches for defining parsers by tweaking the composed parsers of a library without completely duplicating the original library's logic.
For example, here is a simplified CSV parser defined in Real world Haskell
import Text.ParserCombinators.Parsec
csvFile = endBy line eol
line = sepBy cell (char ',')
cell = many (noneOf ",\n")
eol = char '\n'
Assuming csvFile is defined in one library, can another library create its own CSV parser using a custom version of the cell parser without having to rewrite the line and csvFile parsers as well? Can the source library be rewritten to make this possible? This is simple enough for the CSV parser but I am interested in a broadly applicable solution.

Generally you'd need to abstract over the signature of the components you want to replace. For instance, in the CSV example we'd need to extend the type of csvFile to allow us to slot in a custom cell.
line cell = sepBy cell (char ',')
csvFile cell = endBy (line cell) eol
Obviously this gets unwieldy quickly. We can package all of the extension points desired up into a dictionary however and pass it around
data LanguageDefinition =
LanguageDefinition { cell :: Parser Cell
, ...
}
Then we parameterize the entire parser combinator library over this LanguageDefinition
data Parsers = Parsers { line :: Parser Line, csvFile :: Parser [Line], ... }
mkParsers :: LanguageDefinition -> Parsers
This is exactly the approach taken by the generalized Token parsing modules of Parsec: see Text.Parsec.Token and Text.Parsec.Language.
Even more generic approaches can be taken which abstract more and more things into the dictionary that's getting passed around. Effectively this becomes an object- or OCaml module oriented method of organizing code and can be very effective.
The folk Expression Problem states that there's a tension between introducing more functionality and introducing more variants. In this case, you're asking for a new variant so you need to fix the functionality (and list it all out in a dictionary). This will open a path to introduce new variants.

Using Haskell's Parsec for Programming Language Converter

Say I have two languages (A & B). My goal is to write some type of program to convert the syntax found in A to the equivalent of B. Currently my solution has been to use Haskell's Parsec to perform this task. As someone who is new to Haskell and functional programming for that matter however, finding just a simple example in Parsec has been quite difficult. The examples I have found on the web are either incomplete examples (frustrating for a new Haskell programmer) or too much removed from my goal.
So can someone provide me with an amazingly trivial and explicit example of using Parsec for something related to what I'd like to achieve? Or perhaps even some tutorial that parallels my goal as well.
Thanks.

Consider the following simple grammar of a CSV document (In ABNF):
csv = *crow
crow = *(ccell ',') ccell CR
ccell = "'" *(ALPHA / DIGIT) "'"
We want to write a converter that converts this grammar into a TSV (tabulator separated values) document:
tsv = *trow
trow = *(tcell HTAB) tcell CR
tcell = DQUOTE *(ALPHA / DIGIT) DQUOTE
First of all, let's create an algebraic data type that descibes our abstract syntax tree. Type synonyms are included to ease understandment:
data XSV = [Row]
type Row = [Cell]
type Cell = String
Writing a parser for this grammar is pretty simple. We write a parser as if we would describe the ABNF:
csv :: Parser XSV
csv = XSV <$> many crow
crow :: Parser Row
crow = do cells <- ccell `sepBy` (char ',')
newline
return cells
ccell :: Parser Cell
ccell = do char '\''
content <- many (digit <|> letter)
char '\''
return content
This parser uses do-notation. After a do, a sequence of statements follows. For parsers, these statements are simply other parsers. One can use <- to bind the result of a parser. This way, one builds a big parser by chaining multiple smaller parsers. To obtain interesting effects, one can also combine parser using special combinators (such as a <|> b, which parses either a or b or many a, which parses as many as as possible). Please be aware that Parsec does not backtrack by default. If a parser might fail after consuming characters, prefix it with try to enable backtracking for one instance. try slows down parsing.
The result is a parser csv that parses our CSV document into an abstract syntax tree. Now it is easy to turn that into another language (such as TSV):
xsvToTSV :: XSV -> String
xsvToTSV xst = unlines (map toLines xst) where
toLines = intersperse '\t'
Connecting these two things one gets a conversion function:
csvToTSV :: String -> Maybe String
csvToTSV document = case parse csv "" document of
Left _ -> Nothing
Right xsv -> xsvToTSV xsv
And that is all! Parsec has lots of other functions to build up extremely sophisticated parsers. The book Real World Haskell has a nice chapter about parsers, but it's a little bit outdated. Most of that is still true, though. If you have further questions, feel free to ask.

IndentParser example

Could someone please post a small example of IndentParser usage? I am looking to parse YAML-like input like the following:
fruits:
apples: yummy
watermelons: not so yummy
vegetables:
carrots: are orange
celery raw: good for the jaw
I know there is a YAML package. I would like to learn the usage of IndentParser.

I've sketched out a parser below, for your problem you probably only need the block
parser from IndentParser. Note I haven't tried to run it so it might have elementary errors.
The biggest problem for your parser is not really indenting, but that you only have strings and colon as tokens. You might find the code below takes quite a bit of debugging as it will have to be very sensitive about not consuming too much input, though I have tried to be careful about left-factoring. Because you only have two tokens there isn't much benefit you can get from Parsec's Token module.
Note that there is a strange truth to parsing that simple looking formats are often not simple to parse. For learning, writing a parser for simple expressions will teach you much more that an more-or-less arbitrary text format (that might only cause you frustration).
data DefinitionTree = Nested String [DefinitionTree]
| Def String String
deriving (Show)
-- Note - this might need some testing.
--
-- This is a tricky one, the parser has to parse trailing
-- spaces and tabs but not a new line.
--
category :: IndentCharParser st String
category = do
{ a <- body
; rest
; return a
}
where
body = manyTill1 (letter <|> space) (char ':')
rest = many (oneOf [' ', '\t'])
-- Because the DefinitionTree data type has two quite
-- different constructors, both sharing the same prefix
-- 'category' this combinator is a bit more complicated
-- than usual, and has to use an Either type to descriminate
-- between the options.
--
definition :: IndentCharParser st DefinitionTree
definition = do
{ a <- category
; b <- (textL <|> definitionsR)
; case b of
Left ss -> return (Def a ss)
Right ds -> return (Nested a ds)
}
-- Note this should parse a string *provided* it is on
-- the same line as the category.
--
-- However you might find this assumption needs verifying...
--
textL :: IndentCharParser st (Either DefinitionTrees a)
textL = do
{ ss <- manyTill1 anyChar "\n"
; return (Left ss)
}
-- Finally this one uses an indent parser.
--
definitionsR :: IndentCharParser st (Either a [DefinitionTree])
definitionsR = block body
where
body = do { a <- many1 definition; return (Right a) }

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart