Re-define "stringLiteral" token in Parsec.Token - parsing

I am developing a Pascal language parser in Haskell using the Parsec library, and I need to redefine some tokens defined in the Parsec.Token module.
Here is my case:
I need to change how the stringLiteral token is matched. In the default definition, it is something between two '"' characters (see this), but I need it to be between '\'' characters (apostrophes). How can I modify Parsec's behavior this way?
Thanks!!!

You are talking about adjusting a field of the data type named GenTokenParser. It looks like you are using a function that automatically fills in the data type with sensible defaults, and you just want to adjust one thing. Here you go:
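myMakeTokenParser langDef = def { stringLiteral = newStringLit }
  where
    -- Named def because default is a reserved word in Haskell.
    def = makeTokenParser langDef
    -- lexeme is a field of the token parser, so take it from def.
    -- stringChar is local to makeTokenParser in Parsec's source and is not
    -- exported: copy its definition next to this function, adapted to '\''.
    newStringLit = lexeme def (
      do { str <- between (char '\'')
                          (char '\'' <?> "end of string")
                          (many stringChar)
         ; return (foldr (maybe id (:)) "" str)
         }
      <?> "literal string")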
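Since stringChar is not exported from Parsec.Token, a self-contained variant may be easier to start from. The following is only a sketch under stated assumptions: pascalString and pascalLexer are hypothetical names, emptyDef stands in for a real Pascal language definition, and the string syntax is assumed to be the simple Pascal one (no backslash escapes, an embedded apostrophe written as two apostrophes).

```haskell
import Text.Parsec
import Text.Parsec.Language (emptyDef)
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Token as Tok

-- Assumption: strings are delimited by apostrophes, cannot span lines,
-- and write an embedded apostrophe as two apostrophes ('').
pascalString :: Parser String
pascalString =
  between (char '\'') (char '\'' <?> "end of string") (many strChar)
    <?> "literal string"
  where
    -- Either an ordinary character or a doubled apostrophe.
    -- try makes sure a lone closing apostrophe is left for between.
    strChar = noneOf "'\n" <|> ('\'' <$ try (string "''"))

-- Build the default token parser, then override a single field.
-- emptyDef is a placeholder; a real Pascal LanguageDef would go here.
pascalLexer :: Tok.TokenParser ()
pascalLexer = base { Tok.stringLiteral = Tok.lexeme base pascalString }
  where
    base = Tok.makeTokenParser emptyDef
```

In GHCi, parse (Tok.stringLiteral pascalLexer) "" "'it''s'" should then accept the apostrophe-delimited literal, while a double-quoted literal is rejected.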

Related

FParsec match string which have one of 2 patterns

I'm trying to learn FParsec and am trying to match strings which follow one of two patterns.
The string can either be an ordinary string like "string" or it can be a string with one dot in it, like "st.ring".
The parser should look like this: Parser<(string Option * string),unit>. The first string is optional, depending on whether the string is split by a dot. The optional string represents the part of the string before the ".".
I have tried a few different things, but I feel this attempt was the closest:
let charstilldot = manyCharsTill anyChar (pstring ".")
let parser = opt(charstilldot) .>>. (many1Chars anyChar)
This works with input like "st.ring" but not "string", since no dot exists in the latter.
I would very much appreciate some help, thank you!
EDIT:
I have a solution which basically parses the arguments in order and swaps them depending on whether there is a dot in the string:
let colTargetWithoutDot : Parser<string Option,unit> = spaces |>> fun _ -> None
let colTargetWithDot = (pstring "." >>. alphastring) |>> Some
let specificColumn = alphastring .>>. (colTargetWithDot <|> colTargetWithoutDot) |>> (fun (h,t) ->
match h,t with
| h,None -> (None,h)
| h,Some(t) -> (Some(h),t))
However, this is not pretty, so I would still appreciate another solution!
I think the main problem here is that charstilldot consumes characters even when it fails. opt only returns None when its argument fails without consuming input, so in that situation the whole parser fails and many1Chars never gets a chance to run. The easiest way to address this is to use attempt to roll back when there is no dot:
let charstilldot = attempt (manyCharsTill anyChar (pstring "."))
let parser = opt(charstilldot) .>>. (many1Chars anyChar)
Result:
"str.ing" -> (Some "str", "ing")
"string" -> (None, "string")
I think there are other good solutions as well, but I've tried to give you one that requires the least change to your current code.

How to fail a nested megaparsec parser?

I am stuck at the following parsing problem:
Parse some text string that may contain zero or more elements from a limited character set, up to but not including one of a set of termination characters. Content/no content should be indicated through Maybe. Termination characters may appear in the string in escaped form. Parsing should fail on any inadmissible character.
This is what I came up with (simplified):
import qualified Text.Megaparsec as MP
-- Predicate for admissible characters, not including the control characters.
isAdmissibleChar :: Char -> Bool
...
-- Predicate for control characters that need to be escaped.
isControlChar :: Char -> Bool
...
-- The escape character.
escChar :: Char
...
pComponent :: Parser (Maybe Text)
pComponent = do
  t <- MP.many (escaped <|> regular)
  if null t then return Nothing else return $ Just (T.pack t)
  where
    regular = MP.satisfy isAdmissibleChar <|> fail "Inadmissible character"
    escaped = do
      _ <- MC.char escChar
      MP.satisfy isControlChar -- only control characters may be escaped
Say, admissible characters are uppercase ASCII, escape is '\', and control is ':'.
Then, the following parses correctly: ABC\:D:EF to yield ABC:D.
However, parsing ABC&D, where & is inadmissible, does yield ABC whereas I would expect an error message instead.
Two questions:
Why does fail end parsing instead of failing the parser?
Is the above a sensible approach to the problem, or is there a "proper", canonical way to parse such terminated strings that I am not aware of?
many has to allow its sub-parser to fail once without the whole parse
failing - for example many (char 'A') *> char 'B', while parsing
"AAAB", has to fail to parse the B to know it got to the end of the
As.
You might want manyTill which allows you to recognise the terminator
explicitly. Something like this:
MP.manyTill (escaped <|> regular) (MP.satisfy isControlChar)
"ABC&D" would give an error here assuming '&' isn't accepted by isControlChar.
Or if you want to parse more than one component you might keep your
existing definition of pComponent and use it with sepBy or similar, like:
MP.sepBy pComponent (MP.satisfy isControlChar)
If you also check for end-of-file after this, like:
MP.sepBy pComponent (MP.satisfy isControlChar) <* MP.eof
then "ABC&D" should give an error again, because the '&' will end the first component but will not be accepted as a separator.
What a parser object normally does is to extract from the input stream whatever subset it is supposed to accept. That's the usual rule.
Here, it seems you want the parser to accept strings that are followed by something specific. From your examples, it is either end of file (eof) or character ':'. So you might want to consider look ahead.
Environment and auxiliary functions:
import Data.Void (Void)
import qualified Data.Text as T
import qualified Text.Megaparsec as MP
import qualified Text.Megaparsec.Char as MC
type Parser = MP.Parsec Void T.Text
-- Predicate for admissible characters, not including the control characters.
isAdmissibleChar :: Char -> Bool
isAdmissibleChar ch = elem ch ['A' .. 'Z']
-- Predicate for control characters that need to be escaped.
isControlChar :: Char -> Bool
isControlChar ch = elem ch ":"
-- The escape character:
escChar :: Char
escChar = '\\'
Termination parser, to be used for look ahead:
termination :: Parser ()
termination = MP.eof MP.<|> do
  _ <- MP.satisfy isControlChar
  return ()
Modified pComponent parser:
pComponent :: Parser (Maybe T.Text)
pComponent = do
  txt <- MP.many (escaped MP.<|> regular)
  MP.lookAhead termination -- **CHANGE HERE**
  if (null txt) then (return Nothing) else (return $ Just (T.pack txt))
  where
    regular = (MP.satisfy isAdmissibleChar) MP.<|> (fail "Inadmissible character")
    escaped = do
      _ <- MC.char escChar
      MP.satisfy isControlChar -- only control characters may be escaped
Testing utility:
tryParse :: String -> IO ()
tryParse str = do
  let res = MP.parse pComponent "(noname)" (T.pack str)
  putStrLn $ (show res)
Let's try to rerun your examples:
$ ghci
λ>
λ> :load q67809465.hs
λ>
λ> str1 = "ABC\\:D:EF"
λ> putStrLn str1
ABC\:D:EF
λ>
λ> tryParse str1
Right (Just "ABC:D")
λ>
So that is successful, as desired.
λ>
λ> tryParse "ABC&D"
Left (ParseErrorBundle {bundleErrors = TrivialError 3 (Just (Tokens ('&' :| ""))) (fromList [EndOfInput]) :| [], bundlePosState = PosState {pstateInput = "ABC&D", pstateOffset = 0, pstateSourcePos = SourcePos {sourceName = "(noname)", sourceLine = Pos 1, sourceColumn = Pos 1}, pstateTabWidth = Pos 8, pstateLinePrefix = ""}})
λ>
So that fails, as desired.
Trying our 2 acceptable termination contexts:
λ> tryParse "ABC:&D"
Right (Just "ABC")
λ>
λ>
λ> tryParse "ABCDEF"
Right (Just "ABCDEF")
λ>
fail does not end parsing in general. It just continues with the next alternative. In this case it selects the empty list alternative introduced by the many combinator, so it stops parsing without an error message.
I think the best way to solve your problem is to specify that the input must end in a termination character, that means that it cannot "succeed" halfway like this. You can do that with the notFollowedBy or lookAhead combinators. Here is the relevant part of the megaparsec tutorial.
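The difference between stopping silently and failing can be reproduced in a few lines. This is a minimal sketch that uses Parsec rather than Megaparsec so that it does not depend on the rest of the question's code; for this particular behaviour of many the two libraries agree. The names lenient and strict are hypothetical, and the character sets (uppercase ASCII, '\' as escape, ':' as terminator) are taken from the question's example.

```haskell
import Control.Monad (void)
import Data.Char (isAsciiUpper)
import Text.Parsec
import Text.Parsec.String (Parser)

-- Stand-ins for the question's parsers: uppercase ASCII is admissible,
-- and "\:" is an escaped terminator.
regular :: Parser Char
regular = satisfy isAsciiUpper <|> fail "Inadmissible character"

escaped :: Parser Char
escaped = char '\\' *> char ':'

-- many stops at the first character its argument rejects without
-- consuming input, so an inadmissible character just ends the result.
lenient :: Parser String
lenient = many (escaped <|> regular)

-- Requiring a terminator (':' or end of input) right after the component
-- turns the same situation into a parse error. lookAhead leaves the
-- terminator in the input.
strict :: Parser String
strict = many (escaped <|> regular) <* lookAhead (eof <|> void (char ':'))
```

With these definitions, parse lenient "" "ABC&D" succeeds with the prefix before '&', while parse strict "" "ABC&D" returns a Left value.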

How to parse a between when the right delimiter could come after a repeating pattern?

How would you use existing FParsec functionality to find a repeating consecutive pattern in the rightmost tag?
It's a legitimate possibility in this context. Pre-parsing + escaping might work, but is there a better solution? Do we need to write a new forward combinator, and if so, what does it look like?
#r"""bin\debug\FParsecCS.dll"""
#r"""bin\debug\FParsec.dll"""
open FParsec
let str = pstring
let phraseEscape = pchar '\\' >>. pchar '"'
let phraseChar = phraseEscape <|> (noneOf "|\"\r\n]") // <- this right square bracket needs to be removed
let phrase = manyChars phraseChar
let wrapped = between (str"[[") (str"]]".>>newline) phrase
run wrapped "[[some text]]\n" // <- works fine
// !! problem
run wrapped "[[array[] d]]\n" // <- that means we can't make ']' invalid in phraseChar
// !! problem
run wrapped "[[array[]]]\n" // <- and this means that the first ]] gets match leaving a floating one to break the parser
Sorry to be answering my own question, but...
See the composable function phraseTill and the pend parser that is passed to it, (notFollowedBy(s"]]]")>>.(s"]]")):
#r"""bin\debug\FParsecCS.dll"""
#r"""bin\debug\FParsec.dll"""
open FParsec
let s = pstring
let phraseChar = (noneOf "\r\n")
let phrase = manyChars phraseChar
/// keep eating characters until the pend parser is successful
let phraseTill pend = manyCharsTill phraseChar pend
/// when not followed by a triple, a double will truly be the end
let repeatedTo repeatedPtrn ptrn = notFollowedBy(s repeatedPtrn)>>.(s ptrn)
let wrapped = (s"[[")>>.phraseTill (repeatedTo "]]]" "]]")
run wrapped "[[some text]]]"
run wrapped "[[some text]]"
NB. if you try this out in FSharp Interactive (FSI), make sure you have at least one "run wrapped" line when you send your text to FSI to be evaluated (ie. right-click 'Execute In Interactive'). The type only gets inferred / pinned on application in this example. We could have provided explicit definitions at the risk of being more verbose.

Position information in fparsec

My AST model needs to carry location information (filename, line, index). Is there any built-in way to access this information? From the reference docs, the stream seems to carry the position, but I'd prefer not to have to implement a dummy parser just to save the position and add it everywhere.
Thanks in advance
Parsers are actually type abbreviations for functions from streams to replies:
Parser<_,_> is just CharStream<_> -> Reply<_>
Keeping that in mind, you can easily write a custom parser for positions:
let position : CharStream<_> -> Reply<Position> = fun stream -> Reply(stream.Position)
(* OR *)
let position : Parser<_,_> = fun stream -> Reply stream.Position
and attach position information to every bit you parse using
position .>>. yourParser (*or tuple2 position yourParser*)
The position parser does not consume any input, so it is safe to combine it in that way.
You can keep the code change required restricted to a single line and avoid uncontrollable code spread:
type AST = Slash of int64
         | Hash of int64
let slash : Parser<AST,_> = pchar '/' >>. pint64 |>> Slash
let hash : Parser<AST,_> = pchar '#' >>. pint64 |>> Hash
let ast : Parser<AST,_> = slash <|> hash
(*if this is the final parser used for parsing lists of your ASTs*)
let manyAst : Parser< AST list,_> = many (ast .>> spaces)
let manyAstP : Parser<(Position * AST) list,_> = many ((position .>>. ast) .>> spaces)
(*you can opt in to parse position information for every bit
you parse just by modifying only the combined parser *)
Update: FParsec has a predefined parser for positions:
http://www.quanttec.com/fparsec/reference/charparsers.html#members.getPosition

Pattern matching for custom read function

I am writing a custom read function for one of the data types in my module. For example, when I do read "(1 + 1)" :: Data, I want it to return Plus 1 1. My data declaration is data Data = Plus Int Int. Thanks
This sounds like something better suited to a parser; Parsec is a powerful Haskell parser combinator library, which I would recommend.
I'd like to second the notion of using a parser. However, if you absolutely have to use pattern matching, go like this:
import Data.List
data Expr = Plus Int Int | Minus Int Int deriving Show
test = [ myRead "(1 + 1)", myRead "(2-1)" ]
myRead = match . lexer
  where
    match ["(",a,"+",b,")"] = Plus (read a) (read b)
    match ["(",a,"-",b,")"] = Minus (read a) (read b)
    match garbage = error $ "Cannot parse " ++ show garbage
    lexer = unfoldr next_lexeme
      where
        next_lexeme "" = Nothing
        next_lexeme str = Just $ head $ lex str
You could use GHC's ReadP.
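A rough sketch of what the ReadP route could look like, using only Text.ParserCombinators.ReadP from base. The Expr type is borrowed from the pattern-matching answer above; number, symbol and expr are hypothetical helper names, and only non-negative integer operands are handled.

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP
import Text.Read (readMaybe)

data Expr = Plus Int Int | Minus Int Int deriving (Show, Eq)

-- A non-negative integer literal, skipping leading whitespace.
number :: ReadP Int
number = skipSpaces *> (read <$> munch1 isDigit)

-- A single punctuation character, skipping leading whitespace.
symbol :: Char -> ReadP Char
symbol c = skipSpaces *> char c

-- "(a + b)" or "(a - b)".
expr :: ReadP Expr
expr = do
  _  <- symbol '('
  a  <- number
  op <- (Plus <$ symbol '+') +++ (Minus <$ symbol '-')
  b  <- number
  _  <- symbol ')'
  return (op a b)

-- Hook the parser into Read so that read and readMaybe use it.
instance Read Expr where
  readsPrec _ = readP_to_S expr
```

With this instance, read "(1 + 1)" :: Expr goes through expr, and readMaybe "(1 * 1)" :: Maybe Expr gives Nothing instead of throwing an exception.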

Resources