Haskell: Parsing escape characters in single quotes - parsing

I'm currently making a scanner for a basic compiler I'm writing in Haskell. One of the requirements is that any character enclosed in single quotes (') is translated into a character literal token (type T_Char), and this includes escape sequences such as '\n' and '\t'. I've defined this part of the scanner function which works okay for most cases:
scanner ('\'':cs) | (length cs) == 0 = error "Illegal character!"
| head cs == '\\' = mkEscape (head (drop 1 cs)) : scanner (drop 3 cs)
| head (drop 1 cs) == '\'' = T_Char (head cs) : scanner (drop 2 cs)
where
mkEscape :: Char -> Token
mkEscape 'n' = T_Char '\n'
mkEscape 'r' = T_Char '\r'
mkEscape 't' = T_Char '\t'
mkEscape '\\' = T_Char '\\'
mkEscape '\'' = T_Char '\''
However, this comes up when I run it in GHCi:
Main> scanner "abc '\\' def"
[T_Id "abc", T_Char '\'', T_Id "def"]
It can recognise everything else but gets escaped backslashes confused with escaped single quotes. Is this something to do with character encodings?

I don't think there's anything wrong with the parser regarding your problem. To Haskell, the string will be read as
abc '\' def
because Haskell also has string escapes. So when it reaches the first quotation mark, cs contains the char sequence \' def. Obviously head cs is a backslash, so it will run mkEscape.
The argument given is head (drop 1 cs), which is ', thus mkEscape will return T_Char '\'', which is what you saw.
Perhaps you should call
scanner "abc '\\\\' def"
The 1st level of \ is for the Haskell interpreter, and the 2nd level is for scanner.

Related

Using makeExprParser with ambiguity

I'm currently encountering a problem while translating a parser from a CFG-based tool (antlr) to Megaparsec.
The grammar contains lists of expressions (handled with makeExprParser) that are enclosed in brackets (<, >) and separated by ,.
Stuff like <>, <23>, <23,87> etc.
The problem now is that the expressions may themselves contain the > operator (meaning "greater than"), which causes my parser to fail.
<1223>234> should, for example, be parsed into [BinaryExpression ">" (IntExpr 1223) (IntExpr 234)].
I presume that I have to strategically place try somewhere, but the places I tried (to the first argument of sepBy and the first argument of makeExprParser) did unfortunately not work.
Can I use makeExprParser in such a situation or do I have to manually write the expression parser?:
This is the relevant part of my parser:
-- uses megaparsec, text, and parser-combinators
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Control.Monad.Combinators.Expr
import Data.Text
import Data.Void
import System.Environment
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L
type BinaryOperator = Text
type Name = Text
data Expr
= IntExpr Integer
| BinaryExpression BinaryOperator Expr Expr
deriving (Eq, Show)
type Parser = Parsec Void Text
lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc
symbol :: Text -> Parser Text
symbol = L.symbol sc
sc :: Parser ()
sc = L.space space1 (L.skipLineComment "//") (L.skipBlockCommentNested "/*" "*/")
parseInteger :: Parser Expr
parseInteger = do
number <- some digitChar
_ <- sc
return $ IntExpr $ read number
parseExpr :: Parser Expr
parseExpr = makeExprParser parseInteger [[InfixL (BinaryExpression ">" <$ symbol ">")]]
parseBracketList :: Parser [Expr]
parseBracketList = do
_ <- symbol "<"
exprs <- sepBy parseExpr (symbol ",")
_ <- symbol ">"
return exprs
main :: IO ()
main = do
text : _ <- getArgs
let res = runParser parseBracketList "stdin" (pack text)
case res of
(Right suc) -> do
print suc
(Left err) ->
putStrLn $ errorBundlePretty err
You've (probably) misdiagnosed the problem. Your parser fails on <1233>234> because it's trying to parse > as a left associative operator, like +. In other words, the same way:
1+2+
would fail, because the second + has no right-hand operand, your parser is failing because:
1233>234>
has no digit following the second >. Assuming you don't want your > operator to chain (i.e., 1>2>3 is not a valid Expr), you should first replace InfixL with InfixN (non-associative) in your makeExprParser table. Then, it will parse this example fine.
Unfortunately, with or without this change your parser will still fail on the simpler test case:
<1233>
because the > is interpreted as an operator within a continuing expression.
In other words, the problem isn't that your parser can't handle expressions with > characters, it's that it's overly aggressive in treating > characters as part of an expression, preventing them from being recognized as the closing angle bracket.
To fix this, you need to figure out exactly what you're parsing. Specifically, you need to resolve the ambiguity in your parser by precisely characterizing the situations where > can be part of a continuing expression and where it can't.
One rule that will probably work is to only consider a > as an operator if it is followed by a valid "term" (i.e., a parseInteger). You can do this with lookAhead. The parser:
symbol ">" <* lookAhead term
will parse a > operator only if it is followed by a valid term. If it fails to find a term, it will consume some input (at least the > symbol itself), so you must surround it with a try:
try (symbol ">" <* lookAhead term)
With the above two fixes applied to parseExpr:
parseExpr :: Parser Expr
parseExpr = makeExprParser term
[[InfixN (BinaryExpression ">" <$ try (symbol ">" <* lookAhead term))]]
where term = parseInteger
you'll get the following parses:
λ> parseTest parseBracketList "<23>"
[IntExpr 23]
λ> parseTest parseBracketList "<23,87>"
[IntExpr 23,IntExpr 87]
λ> parseTest parseBracketList "<23,87>18>"
[IntExpr 23,BinaryExpression ">" (IntExpr 87) (IntExpr 18)]
However, the following will fail:
λ> parseTest parseBracketList "<23,87>18"
1:10:
|
1 | <23,87>18
| ^
unexpected end of input
expecting ',', '>', or digit
λ>
because the fact that the > is followed by 18 means that it is a valid operator, and it is parse failure that the valid expression 87>18 is followed by neither a comma nor a closing > angle bracket.
If you need to parse something like <23,87>18, you have bigger problems. Consider the following two test cases:
<1,2>3,4,5,6,7,...,100000000000,100000000001>
<1,2>3,4,5,6,7,...,100000000000,100000000001
It's a challenge to write an efficient parser that will parse the first one as a list of 10000000000 expressions but the second one as a list of two expression:
[IntExpr 1, IntExpr 2]
followed by some "extra" text. Hopefully, the underlying "language" you're trying to parse isn't so hopelessly broken that this will be an issue.

How to combine two parsers in Haskell ReadP anonymously?

The Problem
I want to chain two ReadP parsers anonymously.
Example
Input characters are in {'a', 'o', '+', ' '} where ' ' is a space.
I want to parse this input according to the following rules:
the first o or a is not preceeded by a space/plus
every other o is preceeded by a space
every other a is preceeded by a plus
The BNF
In case this is not clear, I came up with the following BNF:
{-
<Input> :: = <O> <Expr> | <A> <Expr>
<Expr> ::= <Space> <O> <Expr> | <Plus> <A> <Expr> | <EOL>
<O> ::= "o"
<A> ::= "a"
<Space> ::= " "
<Plus> ::= "+"
-}
Specific questions
The idea is, that the rules "o preceeded by space" and "a preceeded by plus"
(if not at the beginning) should not be part of the o or a parser, nor do
I want to create an explicit/named parser, because that is a part of the
definition of an expression (Expr).
How do I chain the oParser with the spaceParser?
Composition (does not work): spaceParser . oParser
(+++) is the symmetric choice: oParser (+++) spaceParser. And related to that: Is (+++) equivalent to <|> from Control.Applicative?
ReadP.choice [oParser, spaceParser] creates a new parser, which obviously does not work for a string, but I did that previously and the behaviour of choice surprised me. How does choice work?
The Code
--------------------------------------------------------------------------------
module Parser (main) where
--------------------------------------------------------------------------------
-- modules
import Text.ParserCombinators.ReadP as ReadP
import Data.Char as D
import Control.Applicative
--------------------------------------------------------------------------------
-- this parses one o
oParser :: ReadP Char
oParser = satisfy (== 'o')
-- this parses one a
aParser :: ReadP Char
aParser = satisfy (== 'a') --(\v -> v == 'a')
-- this parses one plus
pParser :: ReadP Char
pParser = satisfy (== '+')
spaceParser :: ReadP Char
spaceParser = satisfy D.isSpace
parseExpr :: ReadP String
parseExpr = -- ReadP.choice [oParser, spaceParser]
main :: IO ()
main = print [ x | (x, "") <- ReadP.readP_to_S parseExpr "o +a o+a+a o o"]
Thank you a lot for reading all of that :) (And: Which haskell parsing libraries would you recommend?)

ANTLR4 - How to tokenize differently inside quotes?

I am defining an ANTLR4 grammar and I'd like it to tokenize certain - but not all - things differently when they appear inside double-quotes than when they appear outside double-quotes. Here's the grammar I have so far:
grammar SimpleGrammar;
AND: '&';
TERM: TERM_CHAR+;
PHRASE_TERM: (TERM_CHAR | '%' | '&' | ':' | '$')+;
TRUNCATION: TERM '!';
WS: WS_CHAR+ -> skip;
fragment TERM_CHAR: 'a' .. 'z' | 'A' .. 'Z';
fragment WS_CHAR: [ \t\r\n];
// Parser rules
expr:
expr AND expr
| '"' phrase '"'
| TERM
| TRUNCATION
;
phrase:
(TERM | PHRASE_TERM | TRUNCATION)+
;
The above grammar works when parsing a! & b, which correctly parses to:
AND
/ \
/ \
a! b
However, when I attempt to parse "a! & b", I get:
line 1:4 extraneous input '&' expecting {'"', TERM, PHRASE_TERM, TRUNCATION}
The error message makes sense, because the & is getting tokenized as AND. What I would like to do, however, is have the & get tokenized as a PHRASE_TERM when it appears inside of double-quotes (inside a "phrase"). Note, I do want the a! to tokenize as TRUNCATION even when it appears inside the phrase.
Is this possible?
It is possible if you use lexer modes. It is possible to change mode after encounter of specific token. But lexer rules must be defined separately, not in combined grammar.
In your case, after encountering quote, you will change mode and after encountering another quote, you will change mode back to the default one.
LBRACK : '[' -> pushMode(CharSet);
RBRACK : ']' -> popMode;
For more information google 'ANTLR lexer Mode'

Parsing single qoute char in a single-quoted string in parsec

I've got a silly situation in my parsec parsers that I would like your help on.
I need to parse a sequence of strongs / chars that are separated by | characters.
So, we could have a|b|'c'|'abcd'
which should be turned into
[a,b,c,abcd]
Space is not allowed, unless inside of a ' ' string. Now, in my naïve attempt, I got the situation now where I can parse strings like a'a|'bb' to [a'a,bb] but not aa|'b'b' to [aa,b'b].
singleQuotedChar :: Parser Char
singleQuotedChar = noneOf "'" <|> try (string "''" >> return '\'')
simpleLabel = do
whiteSpace haskelldef
lab <- many1 (noneOf "|")
return $ lab
quotedLabel = do
whiteSpace haskelldef
char '\''
lab <- many singleQuotedChar
char '\''
return $ lab
Now, how do I tell the parser to consider ' a stoping ' iff it is followed by a | or white space?
(Or, get some ' char counting into this). The input is user generated, so I cannot rely on them \'-ing chars.
Note that allowing a quote in the middle of a string delimited by quotes is very confusing to read, but I believe this should allow you to parse it.
quotedLabel = do -- reads the first quote.
whiteSpace
char '\''
quotedLabel2
quotedLabel2 = do -- reads the string and the finishing quote.
lab <- many singleQuotedChar
try (do more <- quotedLabel3
return $ lttrace "quotedLabel2" (lab ++ more))
<|> (do char '\''
return $ lttrace "quotedLabel2" lab)
quotedLabel3 = do -- handle middle quotes
char '\''
lookAhead $ noneOf ['|']
ret <- quotedLabel2
return $ lttrace "quotedLabel3" $ "'" ++ ret

BNFC parser and bracket Mathematica like syntax

I played a bit with the BNF Converter and tried to re-engineer parts of the Mathematica language. My BNF had already about 150 lines and worked OK, until I noticed a very basic bug. Brackets [] in Mathematica are used for two different things
expr[arg] to call a function
list[[spec]] to access elements of an expression, e.g. a List
Let's assume I want to create the parser for a language which consists only of identifiers, function calls, element access and sequence of expressions as arguments. These forms would be valid
f[]
f[a]
f[a,b,c]
f[[a]]
f[[a,b]]
f[a,f[b]]
f[[a,f[x]]]
A direct, but obviously wrong input-file for BNFC could look like
entrypoints Expr ;
TSymbol. Expr1 ::= Ident ;
FunctionCall. Expr ::= Expr "[" [Sequence] "]" ;
Part. Expr ::= Expr "[[" [Sequence] "]]" ;
coercions Expr 1 ;
separator Sequence "," ;
SequenceExpr. Sequence ::= Expr ;
This BNF does not work for the last two examples of the first code-block.
The problem seems to be located in the created Yylex lexer file, which matches ] and ]] separately. This is wrong, because as can be seen in the last to examples, whether or not it's a closing ] or ]] depends on the context. So either you have to create a stack of braces to ensure the right matching or you leave that to the parser.
Can someone enlighten me whether it's possible to realize this with BNFC?
(Btw, other hints would be gratefully taken too)
Your problem is the token "]]". If the lexer collects this without having
any memory of its past, it might be mistaken. So just don't do that!
The parser by definition remembers its left context, so you can get
it to do the bracket matching correctly.
I would define your grammar this way:
FunctionCall. Expr ::= Expr "[" [Sequence] "]" ;
Part. Expr ::= Expr "[" "[" [Sequence] "]" "]" ;
with the lexer detecting only single "[" "]" as tokens.
An odd variant:
FunctionCall. Expr ::= Expr "[" [Sequence] "]" ;
Part. Expr ::= Expr "[[" [Sequence] "]" "]" ;
with the lexer also detecting "[[" as a token, since it can't be mistaken.

Resources