How do I parse/skip specific characters in attoparsec?

I'm trying to learn how to use the attoparsec module, so I'm practicing by parsing one of my log files.
I have the following code for the beginning of a line parser:
lineParser :: Parser LogEntry
lineParser = do
    theMonth <- monthParser
    skipSpace
    theDate <- decimal
    skipSpace
    theHour <- decimal
    skipColon
    theMinute <- decimal
    skipColon
    theSecond <- decimal
    return LogEntry { monthOfYear = theMonth, dayOfMonth = theDate,
                      hourOfDay = theHour, minuteOfHour = theMinute, secondOfMinute = theSecond }
but I'm having trouble with the skipColon function. I've tried various versions such as
skipColon = isColon
    where isColon c = c == ':'
but I just get type errors.
I would have liked to have simply written something like
skipChar ':'
but haven't been able to figure that out either.
I ran into all sorts of type errors around Char vs Word8, but most search results were about converting Word8 to Char, not the other way round.
Would appreciate some guidance. Many thanks.

AJFarmer posted the best answer.
Simply use
char ':'
However, you'll get a warning about a discarded value so to avoid that warning write
_ <- char ':'
Thanks for answering.
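For reference, here is a minimal sketch of the whole line parser with that fix applied. It assumes the Char8 flavour of attoparsec (Data.Attoparsec.ByteString.Char8), whose char takes a Char rather than a Word8. The LogEntry field types and the monthParser definition are my own guesses for illustration, since the question does not show them.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.ByteString.Char8
  (Parser, char, decimal, isAlpha_ascii, parseOnly, skipSpace, takeWhile1)
import Data.ByteString (ByteString)

-- Assumed record layout; the question only shows the field names.
data LogEntry = LogEntry
  { monthOfYear    :: ByteString
  , dayOfMonth     :: Int
  , hourOfDay      :: Int
  , minuteOfHour   :: Int
  , secondOfMinute :: Int
  } deriving (Eq, Show)

-- Hypothetical month parser: accepts any run of ASCII letters, e.g. "Jan".
monthParser :: Parser ByteString
monthParser = takeWhile1 isAlpha_ascii

lineParser :: Parser LogEntry
lineParser = do
  theMonth <- monthParser
  skipSpace
  theDate <- decimal
  skipSpace
  theHour <- decimal
  _ <- char ':'   -- parse the colon and discard it explicitly
  theMinute <- decimal
  _ <- char ':'
  theSecond <- decimal
  return LogEntry
    { monthOfYear = theMonth, dayOfMonth = theDate
    , hourOfDay = theHour, minuteOfHour = theMinute
    , secondOfMinute = theSecond }
```

In GHCi, parseOnly lineParser "Jan 5 12:34:56" can be used to try it on a sample line.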

Related

How to fail a nested megaparsec parser?

I am stuck at the following parsing problem:
Parse some text string that may contain zero or more elements from a limited character set, up to but not including one of a set of termination characters. Content/no content should be indicated through Maybe. Termination characters may appear in the string in escaped form. Parsing should fail on any inadmissible character.
This is what I came up with (simplified):
import Control.Applicative ((<|>))
import qualified Data.Text as T
import qualified Text.Megaparsec as MP
import qualified Text.Megaparsec.Char as MC
-- Predicate for admissible characters, not including the control characters.
isAdmissibleChar :: Char -> Bool
...
-- Predicate for control characters that need to be escaped.
isControlChar :: Char -> Bool
...
-- The escape character.
escChar :: Char
...
pComponent :: Parser (Maybe T.Text)
pComponent = do
    t <- MP.many (escaped <|> regular)
    if null t then return Nothing else return $ Just (T.pack t)
  where
    regular = MP.satisfy isAdmissibleChar <|> fail "Inadmissible character"
    escaped = do
        _ <- MC.char escChar
        MP.satisfy isControlChar -- only control characters may be escaped
Say, admissible characters are uppercase ASCII, escape is '\', and control is ':'.
Then, the following parses correctly: ABC\:D:EF to yield ABC:D.
However, parsing ABC&D, where & is inadmissible, does yield ABC whereas I would expect an error message instead.
Two questions:
Why does fail end parsing instead of failing the parser?
Is the above a sensible approach to the problem, or is there a "proper", canonical way to parse such terminated strings that I am not aware of?
many has to allow its sub-parser to fail once without the whole parse
failing - for example many (char 'A') *> char 'B', while parsing
"AAAB", has to fail to parse the B to know it got to the end of the
As.
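A small sketch of that behaviour, assuming megaparsec over a plain String stream (the names asThenB and onlyAs are just for illustration):

```haskell
import Data.Void (Void)
import Text.Megaparsec (Parsec, eof, many, parse)
import Text.Megaparsec.Char (char)

type Parser = Parsec Void String

-- many stops at the first character that is not 'A'; that failure is
-- swallowed, and the following parser gets to look at the same character.
asThenB :: Parser Char
asThenB = many (char 'A') *> char 'B'

-- On its own, many never fails: it just returns what it collected so far.
onlyAs :: Parser String
onlyAs = many (char 'A')
```

Here parse onlyAs "" "AA&" succeeds with "AA" and leaves "&" unconsumed, which is the same thing that happens with ABC&D in the question. Only when something after the many insists on more (for example onlyAs <* eof) does the '&' turn into a parse error.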
You might want manyTill which allows you to recognise the terminator
explicitly. Something like this:
MP.manyTill (escaped <|> regular) (MP.satisfy isControlChar)
"ABC&D" would give an error here assuming '&' isn't accepted by isControlChar.
Or if you want to parse more than one component you might keep your
existing definition of pComponent and use it with sepBy or similar, like:
MP.sepBy pComponent (MP.satisfy isControlChar)
If you also check for end-of-file after this, like:
MP.sepBy pComponent (MP.satisfy isControlChar) <* MP.eof
then "ABC&D" should give an error again, because the '&' will end the first component but will not be accepted as a separator.
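Both variants can be sketched as follows. This assumes the concrete character classes from the question's example (uppercase ASCII admissible, '\' as escape, ':' as the only control character) and a String stream to keep it short; terminated and components are illustrative names.

```haskell
import Data.Char (isAsciiUpper)
import Data.Void (Void)
import qualified Text.Megaparsec as MP
import qualified Text.Megaparsec.Char as MC

type Parser = MP.Parsec Void String

-- An escaped control character, e.g. "\:" yields ':'.
escaped :: Parser Char
escaped = MC.char '\\' *> MC.char ':'

-- An ordinary admissible character.
regular :: Parser Char
regular = MP.satisfy isAsciiUpper

-- Variant 1: read up to and including an explicit ':' terminator.
terminated :: Parser String
terminated = MP.manyTill (escaped MP.<|> regular) (MC.char ':')

-- Variant 2: ':'-separated components that must cover the whole input.
components :: Parser [String]
components =
  MP.sepBy (MP.many (escaped MP.<|> regular)) (MC.char ':') <* MP.eof
```

Note that terminated requires the ':' to be present, so an input like "ABCDEF" with no terminator is rejected by variant 1 but accepted as a single component by variant 2.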
What a parser object normally does is to extract from the input stream whatever subset it is supposed to accept. That's the usual rule.
Here, it seems you want the parser to accept strings that are followed by something specific. From your examples, it is either end of file (eof) or character ':'. So you might want to consider look ahead.
Environment and auxiliary functions:
import Data.Void (Void)
import qualified Data.Text as T
import qualified Text.Megaparsec as MP
import qualified Text.Megaparsec.Char as MC
type Parser = MP.Parsec Void T.Text
-- Predicate for admissible characters, not including the control characters.
isAdmissibleChar :: Char -> Bool
isAdmissibleChar ch = elem ch ['A' .. 'Z']
-- Predicate for control characters that need to be escaped.
isControlChar :: Char -> Bool
isControlChar ch = elem ch ":"
-- The escape character:
escChar :: Char
escChar = '\\'
Termination parser, to be used for look ahead:
termination :: Parser ()
termination = MP.eof MP.<|> do
    _ <- MP.satisfy isControlChar
    return ()
Modified pComponent parser:
pComponent :: Parser (Maybe T.Text)
pComponent = do
    txt <- MP.many (escaped MP.<|> regular)
    MP.lookAhead termination -- **CHANGE HERE**
    if (null txt) then (return Nothing) else (return $ Just (T.pack txt))
  where
    regular = (MP.satisfy isAdmissibleChar) MP.<|> (fail "Inadmissible character")
    escaped = do
        _ <- MC.char escChar
        MP.satisfy isControlChar -- only control characters may be escaped
Testing utility:
tryParse :: String -> IO ()
tryParse str = do
    let res = MP.parse pComponent "(noname)" (T.pack str)
    putStrLn $ (show res)
Let's try to rerun your examples:
$ ghci
λ>
λ> :load q67809465.hs
λ>
λ> str1 = "ABC\\:D:EF"
λ> putStrLn str1
ABC\:D:EF
λ>
λ> tryParse str1
Right (Just "ABC:D")
λ>
So that is successful, as desired.
λ>
λ> tryParse "ABC&D"
Left (ParseErrorBundle {bundleErrors = TrivialError 3 (Just (Tokens ('&' :| ""))) (fromList [EndOfInput]) :| [], bundlePosState = PosState {pstateInput = "ABC&D", pstateOffset = 0, pstateSourcePos = SourcePos {sourceName = "(noname)", sourceLine = Pos 1, sourceColumn = Pos 1}, pstateTabWidth = Pos 8, pstateLinePrefix = ""}})
λ>
So that fails, as desired.
Trying our 2 acceptable termination contexts:
λ> tryParse "ABC:&D"
Right (Just "ABC")
λ>
λ>
λ> tryParse "ABCDEF"
Right (Just "ABCDEF")
λ>
fail does not end parsing in general. It just continues with the next alternative. In this case it selects the empty list alternative introduced by the many combinator, so it stops parsing without an error message.
I think the best way to solve your problem is to specify that the input must end in a termination character, that means that it cannot "succeed" halfway like this. You can do that with the notFollowedBy or lookAhead combinators. Here is the relevant part of the megaparsec tutorial.
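As a rough sketch of the notFollowedBy variant: after the many, require that the next character is not "anything other than the terminator". This assumes the example character classes from the question (uppercase ASCII, '\' escape, ':' terminator) and a String stream; component is an illustrative name.

```haskell
import Data.Char (isAsciiUpper)
import Data.Void (Void)
import qualified Text.Megaparsec as MP
import qualified Text.Megaparsec.Char as MC

type Parser = MP.Parsec Void String

component :: Parser (Maybe String)
component = do
  txt <- MP.many (escaped MP.<|> regular)
  -- Succeeds only at end of input or when the next character is ':'.
  -- Like lookAhead, notFollowedBy consumes no input.
  MP.notFollowedBy (MP.satisfy (/= ':'))
  pure (if null txt then Nothing else Just txt)
  where
    regular = MP.satisfy isAsciiUpper
    escaped = MC.char '\\' *> MC.char ':'
```

With this, "ABC&D" is rejected at the '&', while "ABC:EF" and "ABC" both give Just "ABC". The terminator itself is left in the input for whatever parser comes next.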

Tracking locations in Genlex

I'm writing a parser for a language that is sufficiently simple for Genlex + camlp4 stream parsers to take care of it. However, I'd still be interested in having a more or less precise location (i.e. at least a line number) in case of parsing error.
My idea is to use an intermediate stream between the original char Stream and the token Stream of Genlex, that takes care of line counts, like in the code below, but I'm wondering whether there's a simpler solution?
let parse_file s =
  let num_lines = ref 1 in
  let bol = ref 0 in
  let print_pos fmt i =
    (* Emacs-friendly location *)
    Printf.fprintf fmt "File %S, line %d, characters %d-%d:"
      s !num_lines (i - !bol) (i - !bol)
  in
  (* Normal stream *)
  let chan =
    try open_in s
    with
      Sys_error e -> Printf.eprintf "Cannot open %s: %s\n%!" s e; exit 1
  in
  let chrs = Stream.of_channel chan in
  (* Capture newlines and move num_lines and bol accordingly *)
  let next i =
    try
      match Stream.next chrs with
      | '\n' -> bol := i; incr num_lines; Some '\n'
      | c -> Some c
    with Stream.Failure -> None
  in
  let chrs = Stream.from next in
  (* Pass that to the Genlex's lexer *)
  let toks = lexer chrs in
  let error s =
    Printf.eprintf "%a\n%s %a\n%!"
      print_pos (Stream.count chrs) s print_top toks;
    exit 1
  in
  try
    parse toks
  with
  | Stream.Failure -> error "Failure"
  | Stream.Error e -> error ("Error " ^ e)
  | Parsing.Parse_error -> error "Unexpected symbol"
A much simpler solution is to use Camlp4 grammars.
Parsers built this way allow one to get decent error messages "for free", unlike the case with stream parsers (which are a low level tool).
It could be that there is no need to define your own lexer, because OCaml's lexer suits your needs already. But if you really need your own lexer, then you can easily plug in a custom one:
module Camlp4Loc = Camlp4.Struct.Loc
module Lexer = MyLexer.Make(Camlp4Loc)
module Gram = Camlp4.Struct.Grammar.Static.Make(Lexer)
open Lexer
let entry = Gram.Entry.mk "entry"
EXTEND Gram
  entry: [ [ ... ] ];
END
let parse str =
  Gram.parse entry (Loc.mk file) (Stream.of_string str)
If you are new to OCaml, then all this module system trickery might seem at first like black voodoo magic :-) The fact that Camlp4 is a severely underdocumented beast might also contribute to the surreality of experience.
So never hesitate to ask a question (even a stupid one) on the mailing list.

Position information in fparsec

My AST model needs to carry location information (filename, line, index). Is there any built-in way to access this information? From the reference docs, the stream seems to carry the position, but I'd prefer not to have to implement a dummy parser just to save the position and add it everywhere.
Thanks in advance
Parsers are actually type abbreviations for functions from streams to replies:
Parser<_,_> is just CharStream<_> -> Reply<_>
Keeping that in mind, you can easily write a custom parser for positions:
let position : CharStream<_> -> Reply<Position> = fun stream -> Reply(stream.Position)
(* OR *)
let position : Parser<_,_> = fun stream -> Reply stream.Position
and attach position information to every bit you parse using
position .>>. yourParser (*or tuple2 position yourParser*)
The position parser does not consume any input, so it is safe to combine it in this way.
You can keep the code change required restricted to a single line and avoid uncontrollable code spread:
type AST = Slash of int64
         | Hash of int64
let slash : Parser<AST,_> = pchar '/' >>. pint64 |>> Slash
let hash : Parser<AST,_> = pchar '#' >>. pint64 |>> Hash
let ast : Parser<AST,_> = slash <|> hash
(* if this is the final parser used for parsing lists of your ASTs *)
let manyAst : Parser<AST list,_> = many (ast .>> spaces)
let manyAstP : Parser<(Position * AST) list,_> = many ((position .>>. ast) .>> spaces)
(* you can opt in to parse position information for every bit
   you parse just by modifying only the combined parser *)
Update: FParsec has a predefined parser for positions:
http://www.quanttec.com/fparsec/reference/charparsers.html#members.getPosition

Haskell Parsec items numeration

I'm using Text.ParserCombinators.Parsec and Text.XHtml to parse an input like this:
- First type A\n
-- First type B\n
- Second type A\n
-- First type B\n
--Second type B\n
And my output should be:
<h1>1 First type A\n</h1>
<h2>1.1 First type B\n</h2>
<h1>2 Second type A\n</h1>
<h2>2.1 First type B\n</h2>
<h2>2.2 Second type B\n</h2>
I have come to this part, but I cannot get any further:
title1 = do{
    ;count 1 (char '-')
    ;s <- manyTill anyChar newline
    ;return (h1 << s)
    }
title2 = do{
    ;count 2 (char '-')
    ;s <- manyTill anyChar newline
    ;return (h2 << s)
    }
text = do {
    -- title2 is tried first: a line starting with "--" would also match title1
    ;many (choice [try title2, try title1])
    }
main :: IO ()
main = do t putStr "Error: " >> print err
Right x -> putStrLn $ prettyHtml x
This is ok, but it does not include the numbering.
Any ideas?
Thanks!
You probably want to use GenParser with a state containing the current section numbers as a list in reverse order, so section 1.2.3 will be represented as [3,2,1], and maybe the length of the list to avoid repeatedly counting it. Something like
data SectionState = SectionState {nums :: [Int], depth :: Int}
Then give your parser actions the type "GenParser Char SectionState a". You can access the current state in your parser actions using "getState" and "setState". When you get a series of "-" at the start of a line, count them and compare the count with "depth" in the state, manipulate the "nums" list appropriately, and then emit "nums" in reverse order. (I suggest keeping nums in reverse order because most of the time you want to access the least significant item, so putting it at the head of the list is both easier and more efficient.)
See Text.ParserCombinators.Parsec.Prim for details of GenParser. The more usual Parser type is just "type Parser a = GenParser Char () a". You probably want to say
type MyParser a = GenParser Char SectionState a
somewhere near the start of your code.
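Here is a minimal sketch of that idea. To keep it short it makes a few simplifying assumptions that are mine, not part of the question: the state is just the reversed list of section numbers (its length serves as the depth), the number of leading dashes is used directly as the heading level, and the tags are built as plain strings instead of going through Text.XHtml. The names bump, heading and numberHeadings are made up for the example.

```haskell
import Data.List (intercalate)
import Text.ParserCombinators.Parsec

-- Section numbers in reverse order: section 1.2 is stored as [2,1].
type SectionState = [Int]
type MyParser a = GenParser Char SectionState a

-- Update the counters for a heading at the given level (1 = top level).
bump :: Int -> SectionState -> SectionState
bump level ns
  | level > len = replicate (level - len) 1 ++ ns   -- going deeper: start at 1
  | otherwise   = case drop (len - level) ns of     -- same level or back up
      (n : rest) -> n + 1 : rest
      []         -> []
  where
    len = length ns

-- One heading line: dashes, optional spaces, then the title up to the newline.
heading :: MyParser String
heading = do
  dashes <- many1 (char '-')
  skipMany (char ' ')
  title <- manyTill anyChar newline
  updateState (bump (length dashes))
  ns <- getState
  let tag    = "h" ++ show (length dashes)
      number = intercalate "." (map show (reverse ns))
  return ("<" ++ tag ++ ">" ++ number ++ " " ++ title ++ "</" ++ tag ++ ">")

numberHeadings :: String -> Either ParseError [String]
numberHeadings = runParser (many heading <* eof) [] "(input)"
```

Calling numberHeadings on the sample input from the question should give one numbered tag string per line; swapping the string concatenation for h1 << ... and h2 << ... from Text.XHtml is then a local change inside heading.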

Pattern matching for custom read function

I am writing a custom read function for one of the data types in my module. For example, when I do read "(1 + 1)" :: Data, I want it to return Plus 1 1. My data declaration is data Data = Plus Int Int. Thanks
This sounds like something better suited to a parser; Parsec is a powerful Haskell parser combinator library, which I would recommend.
I'd like to second the notion of using a parser. However, if you absolutely have to use pattern matching, you could do it like this:
import Data.List
data Expr = Plus Int Int | Minus Int Int deriving Show
test = [ myRead "(1 + 1)", myRead "(2-1)" ]
myRead = match . lexer
  where
    match ["(",a,"+",b,")"] = Plus (read a) (read b)
    match ["(",a,"-",b,")"] = Minus (read a) (read b)
    match garbage = error $ "Cannot parse " ++ show garbage
lexer = unfoldr next_lexeme
  where
    next_lexeme "" = Nothing
    next_lexeme str = Just $ head $ lex str
You could use GHC's ReadP.
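A minimal sketch of what that might look like, using Text.ParserCombinators.ReadP from base. The Expr type mirrors the one in the other answer; number, expr and parseExpr are illustrative names, and only non-negative integer literals are handled.

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

data Expr = Plus Int Int | Minus Int Int deriving (Eq, Show)

-- A non-negative integer literal.
number :: ReadP Int
number = read <$> munch1 isDigit

-- "(a + b)" or "(a - b)", with optional whitespace between tokens.
expr :: ReadP Expr
expr = do
  _ <- char '(' <* skipSpaces
  a <- number <* skipSpaces
  op <- (Plus <$ char '+') +++ (Minus <$ char '-')
  skipSpaces
  b <- number <* skipSpaces
  _ <- char ')'
  return (op a b)

-- Returns Nothing instead of calling error on malformed input.
parseExpr :: String -> Maybe Expr
parseExpr s =
  case readP_to_S (skipSpaces *> expr <* skipSpaces <* eof) s of
    [(e, "")] -> Just e
    _         -> Nothing
```

To hook this into read itself, a Read instance can be written with readsPrec _ = readP_to_S expr, though returning a Maybe as above is usually nicer than the runtime error that read gives on bad input.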
