How to convert a wildcard pattern to regex in Erlang? - erlang

Wildcard patterns are file system standards that match any characters to a ( ? ) and any sequence characters to an ( * ).
I am trying to use the erlang re:replace/3 function to replace:
a) * into .*
b) ? into .
c) . into \.
d) if a wildcard pattern does not start in a wildcard, then add a ^ (start-match in regex) to the end of the pattern
e) if a wildcard pattern does not end in a wildcard, then add a $ (end-match in regex) to the end of the pattern
Somehow I am unable to get the re:replace to achieve this.
Examples:
trying to replace based on item a) above
re:replace("something*.log","\*","\.\*").
exception error: bad argument

If you are confident in the completeness of your spec, you can write the conversion directly (I guess there is no performance problem because regular expression are generally short list)
-module(rep).
-compile([export_all]).
replace(L) when is_list(L) -> lists:reverse(replace(L,wildcard(hd(L)))).
% take care of the first character
replace(L,W={true,_}) -> replace(L,W,[]);
replace(L,W={false,_}) -> replace(L,W,[$^]).
% take care of the last character
replace([_],{true,R},Res) -> R ++ Res;
replace([_],{false,R},Res) -> [$$|R] ++ Res;
% middle characters
replace([_|Q],{_,R},Res) -> replace(Q,wildcard(hd(Q)),R++Res).
wildcard($*) -> {true,[$*,$.]};
wildcard($?) -> {true,[$.]};
wildcard($.) -> {true,[$.,$\\]};
wildcard(C) -> {false,[C]}.
with your example:
11> rep:replace("something*.log").
"^something.*\\.log$"
Note that the \\ is one single character as you can verify with:
12> length(rep:replace("something*.log")).
18

In your re:replace call:
re:replace("something*.log","\*","\.\*").
the backslashes don't actually end up in the strings, since they just escape the following characters. Some backslash escapes have special meanings, such as "\n" meaning a newline, but the ones that don't only let the character through unchanged:
4> "\*".
"*"
So you need a double backslash for the backslash to actually be part of the string:
5> re:replace("something*.log","\\*","\.\*").
[<<"something">>,<<".*">>|<<".log">>]
Note that the backslashes in "\.\*" are not needed.
The return value above is an iolist, which is usually perfectly useful (in particular if you want to write the result to a file or a socket), but sometimes you may want a plain string at additional memory and CPU cost. You can pass a fourth argument to re:replace:
7> re:replace("something*.log","\\*","\.\*", [{return, list}]).
"something.*.log"

Related

Unary minus messes up parsing

Here is the grammar of the language id' like to parse:
expr ::= val | const | (expr) | unop expr | expr binop expr
var ::= letter
const ::= {digit}+
unop ::= -
binop ::= /*+-
I'm using an example from the haskell wiki.
The semantics and token parser are not shown here.
exprparser = buildExpressionParser table term <?> "expression"
table = [ [Prefix (m_reservedOp "-" >> return (Uno Oppo))]
,[Infix (m_reservedOp "/" >> return (Bino Quot)) AssocLeft
,Infix (m_reservedOp "*" >> return (Bino Prod)) AssocLeft]
,[Infix (m_reservedOp "-" >> return (Bino Diff)) AssocLeft
,Infix (m_reservedOp "+" >> return (Bino Somm)) AssocLeft]
]
term = m_parens exprparser
<|> fmap Var m_identifier
<|> fmap Con m_natural
The minus char appears two times, once as unary, once as binary operator.
On input "1--2", the parser gives only
Con 1
instead of the expected
"Bino Diff (Con 1) (Uno Oppo (Con 2))"
Any help welcome.Full code here
The purpose of reservedOp is to create a parser (which you've named m_reservedOp) that parses the given string of operator symbols while ensuring that it is not the prefix of a longer string of operator symbols. You can see this from the definition of reservedOp in the source:
reservedOp name =
lexeme $ try $
do{ _ <- string name
; notFollowedBy (opLetter languageDef) <?> ("end of " ++ show name)
}
Note that the supplied name is parsed only if it is not followed by any opLetter symbols.
In your case, the string "--2" can't be parsed by m_reservedOp "-" because, even though it starts with the valid operator "-", this string occurs as the prefix of a longer valid operator "--".
In a language with single-character operators, you probably don't want to use reservedOp at all, unless you want to disallow adjacent operators without intervening whitespace. Just use symbol "-", which will always parse "-", no matter what follows (and consume following whitespace, if any). Also, in a language with a fixed set of operators (i.e., no user-defined operators), you probably won't use the operator parser, so you won't need opStart, or reservedOpNames. Without reservedOp or operator, the opLetter parser isn't used, so you can drop it too.
This is probably pretty confusing, and the Parsec documentation does a terrible job of explaining how the "reserved" mechanism is supposed to work. Here's a primer:
Let's start with identifiers, instead of operators. In a typical language that allows user-defined identifiers (i.e., pretty much any language, since "variables" and "functions" have user-defined names) and may also have some reserved words that aren't allowed as identifiers, the relevant settings in the GenLanguageDef are:
identStart -- parser for first character of valid identifier
identLetter -- second and following characters of valid identifier
reservedNames -- list of reserved names not allowed as identifiers
The lexeme (whitespace-absorbing) parsers created using the GenTokenParser object are:
identifier - Parses an unknown, user-defined identifier. It parses a character from identStart followed by zero or more identLetters up to the first non-identLetter. (It never parses a partial identifier, so it'll never leave more identLetters on the table.) Additionally, it checks that the identifier is not in the list reservedNames.
symbol - Parses the given string. If the string is a reserved word, no check is made that it isn't part of a larger valid identifier. So, symbol "for" would match the beginning of foreground = "black", which is rarely what you want. Note that symbol makes no use of identStart, identLetter, or reservedNames.
reserved - Parses the given string, and then ensures that it's not followed by an identLetter. So, m_reserved "for" will parse for (i=1; ... but not parse foreground = "black". Usually, the supplied string will be a valid identifier, but no check is made for this, so you can write m_reserved "15" if you want -- in a language with the usual sorts of alphanumeric identifiers, this would parse "15" provided it wasn't following by a letter or another digit. Also, maybe somewhat surprisingly, no check is made that the supplied string is in reservedNames.
If that makes sense to you, then the operator settings follow the exact same pattern. The relevant settings are:
opStart -- parser for first character of valid operator
opLetter -- valid second and following operator chars, for multichar operators
reservedOpNames -- list of reserved operator names not allowed as user-defined operators
and the relevant parsers are:
operator - Parses an unknown, user-defined operator starting with an opStart and followed by zero or more opLetters up to the first non-opLetter. So, operator applied to the string "--2" will always take the whole operator "--", never just the prefix "-". An additional check is made that the resulting operator is not in the reservedOpNames list.
symbol - Exactly as for identifiers. It parses a string with no checks or reference to opStart, opLetter, or reservedOpNames, so symbol "-" will parse the first character of the string "--" just fine, leaving the second "-" character for a later parser.
reservedOp - Parses the given string, ensuring it's not followed by opLetter. So, m_reservedOp "-" will parse the start of "-x" but not "--2", assuming - matches opLetter. As before, no check is made that the string is in reservedOpNames.

Parse String to Datatype in Haskell

I'm taking a Haskell course at school, and I have to define a Logical Proposition datatype in Haskell. Everything so far Works fine (definition and functions), and i've declared it as an instance of Ord, Eq and show. The problem comes when I'm required to define a program which interacts with the user: I have to parse the input from the user into my datatype:
type Var = String
data FProp = V Var
| No FProp
| Y FProp FProp
| O FProp FProp
| Si FProp FProp
| Sii FProp FProp
where the formula: ¬q ^ p would be: (Y (No (V "q")) (V "p"))
I've been researching, and found that I can declare my datatype as an instance of Read.
Is this advisable? If it is, can I get some help in order to define the parsing method?
Not a complete answer, since this is a homework problem, but here are some hints.
The other answer suggested getLine followed by splitting at words. It sounds like you instead want something more like a conventional tokenizer, which would let you write things like:
(Y
(No (V q))
(V p))
Here’s one implementation that turns a string into tokens that are either a string of alphanumeric characters or a single, non-alphanumeric printable character. You would need to extend it to support quoted strings:
import Data.Char
type Token = String
tokenize :: String -> [Token]
{- Here, a token is either a string of alphanumeric characters, or else one
- non-spacing printable character, such as "(" or ")".
-}
tokenize [] = []
tokenize (x:xs) | isSpace x = tokenize xs
| not (isPrint x) = error $
"Invalid character " ++ show x ++ " in input."
| not (isAlphaNum x) = [x]:(tokenize xs)
| otherwise = let (token, rest) = span isAlphaNum (x:xs)
in token:(tokenize rest)
It turns the example into ["(","Y","(","No","(","V","q",")",")","(","V","p",")",")"]. Note that you have access to the entire repertoire of Unicode.
The main function that evaluates this interactively might look like:
main = interact ( unlines . map show . map evaluate . parse . tokenize )
Where parse turns a list of tokens into a list of ASTs and evaluate turns an AST into a printable expression.
As for implementing the parser, your language appears to have similar syntax to LISP, which is one of the simplest languages to parse; you don’t even need precedence rules. A recursive-descent parser could do it, and is probably the easiest to implement by hand. You can pattern-match on parse ("(":xs) =, but pattern-matching syntax can also implement lookahead very easily, for example parse ("(":x1:xs) = to look ahead one token.
If you’re calling the parser recursively, you would define a helper function that consumes only a single expression, and that has a type signature like :: [Token] -> (AST, [Token]). This lets you parse the inner expression, check that the next token is ")", and proceed with the parse. However, externally, you’ll want to consume all the tokens and return an AST or a list of them.
The stylish way to write a parser is with monadic parser combinators. (And maybe someone will post an example of one.) The industrial-strength solution would be a library like Parsec, but that’s probably overkill here. Still, parsing is (mostly!) a solved problem, and if you just want to get the assignment done on time, using a library off the shelf is a good idea.
the read part of a REPL interpreter typically looks like this
repl :: ForthState -> IO () -- parser definition
repl state
= do putStr "> " -- puts a > character to indicate it's waiting for input
input <- getLine -- this is what you're looking for, to read a line.
if input == "quit" -- allows user to quit the interpreter
then do putStrLn "Bye!"
return ()
else let (is, cs, d, output) = eval (words input) state -- your grammar definition is somewhere down the chain when eval is called on input
in do mapM_ putStrLn output
repl (is, cs, d, [])
main = do putStrLn "Welcome to your very own interpreter!"
repl initialForthState -- runs the parser, starting with read
your eval method will have various loops, stack manipulations, conditionals, etc to actually figure out what the user inputted. hope this helps you with at least the reading input part.

Maximal munch in Text.ParserCombinators.ReadP

The Read instance for Double behaves in a very straightforward way:
reads "34.567e8 foo" :: [(Double, String)] = [(3.4567e9," foo")]
However the Read instance for Scientific does something different:
reads "34.567e8 foo" :: [(Scientific, String)] =
[(34.0,".567e8 foo"),(34.567,"e8 foo"),(3.4567e9," foo")]
Strictly this is correct, in that it is presenting a list of possible parses of the input. In fact it could equally well have included (3.0, "4.567e8 foo") in the list, as well as some others. However the usual behaviour in cases like this (which the Double instance follows) is "maximal munch", meaning that the longest valid prefix is parsed.
I'm updating my Decimal library, which has a similar behaviour, and I'm wondering what the Right Thing is here. Both Scientific and Decimal are using Text.ParserCombinators.ReadP, which was designed to make it easy to write Read instances, and this seems to be a characteristic of ReadP parsers.
So my questions:
1: What is the Right Thing for "reads" to return in these cases? Should I file a bug for Data.Scientific?
2: If it should only return the maximal munch (like the Double instance does) then how do you get ReadP to do that?
I've decided that maximal munch is the Right Thing. Given "1.23" a parser that returns 1 is just wrong. I've been tripped up by this myself because I once tried to write a "maybeRead" looking like this:
maybeRead :: (Read a) => String -> Maybe a
maybeRead str = case reads str of
[v, ""] -> Just v
_ => Nothing
This worked fine for Double but failed for Decimal and Scientific. (Obviously it can be fixed to handle multiple return results, but I didn't expect to need to do this).
The problem turned out to be the implementation of "optional" in Text.ParserCombinators.ReadP. This uses the symmetric choice operator "+++", which returns the parse with and without the optional component. Hence when I wrote something like
expPart <- optional "" $ do {...}
the results included a parse without the expPart.
I wrote a different version of "optional" using the left-biased choice operator:
myOpt d p = p <++ return d
If the parser "p" consumes any text then the default is not used. This does the Right Thing if you want maximal munch.
For #2, you could change the scientific package to use this parser defined in terms of the old one: scientificPmaxmuch = scientificP <* eof :: ReadP Scientific.
I don't think there is much of a convention for #1: it doesn't make a difference for people using read or Text.Read.readMaybe. readS_to_P reads :: ReadP Double is probably faster than readS_to_P reads :: ReadP Scientific, but if efficiency mattered at all you would keep everything as ReadP until the end.

Better way to build lists of matches in parser generator

(I use Yecc, an Erlang parser generator similar to Yacc, so the syntax is different from Yacc)
The problem is simple, say we want parse a lispy syntax, i want match on expressions-lists.
An expression list is a list of expressions separated with blank space.
In Erlang, [1,3,4] is a list and ++ concatenates two lists.
We want match this 1 (1+2) 3. expression will match 1, (1+2) and 3. So, there i match on the list followed by one more expression, and if there is no match i end to match on a single expression. This builds a list recursively.
expressionlist -> expressionlist expression : '$1' ++ ['$2'].
expressionlist -> expression : ['$1'].
But i can do this too (invert the order):
expressionlist -> expression expressionlist : ['$1'] ++ '$2'.
expressionlist -> expression : ['$1'].
Both of theese seem to work, i would like to know if there was any difference.
With a separator
I want to match {name = albert , age = 43}. propdef matches name = value. So it is the same problem but with an additional separator ,. Is there any difference there from the first problem ?
proplist -> propdef ',' proplist : ['$1'] ++ '$3'.
proplist -> propdef : ['$1'].
proplist -> '{' proplist '}' : '$2'.
proplist -> '{' '}' : [].
%% Could write this
%% proplist -> proplist ',' propdef : '$1' ++ ['$3'].
Thank you.
Since Yecc is an LALR parser generator, the use of left recursion or right recursion doesn't matter much. In old times, people would prefer left recursion in Yacc/Bison and similar tools because it allows the parser to keep reducing instead of shifting everything onto the stack until the end of the list, but these days, stack space and speed isn't that important, so you can pick whatever suits you best. (Note that in an LL parser, left recursion causes an infinite loop, so for such parsers, right recursion is necessary.)
The more important thing, for your example above, is that '$1' ++ ['$2'] in the left recursive version will cause quadratic time complexity, since the "expressionlist" part is the one that's going to be longer and longer. You should never have the growing component on the left when you use the ++ operator. If you parse a list of thousands of elements, this complexity will hurt you. Using the right recursive version instead, ['$1'] ++ '$2' will give you linear time, even if the parser has to shift the whole list onto the stack before it starts reducing. You could try both versions and parse a really long list to see the difference.
Using a separator as in "propdef ',' proplist" does not change the problem.

Using Parsec to write a Read instance

Using Parsec, I'm able to write a function of type String -> Maybe MyType with relative ease. I would now like to create a Read instance for my type based on that; however, I don't understand how readsPrec works or what it is supposed to do.
My best guess right now is that readsPrec is used to build a recursive parser from scratch to traverse a string, building up the desired datatype in Haskell. However, I already have a very robust parser who does that very thing for me. So how do I tell readsPrec to use my parser? What is the "operator precedence" parameter it takes, and what is it good for in my context?
If it helps, I've created a minimal example on Github. It contains a type, a parser, and a blank Read instance, and reflects quite well where I'm stuck.
(Background: The real parser is for Scheme.)
However, I already have a very robust parser who does that very thing for me.
It's actually not that robust, your parser has problems with superfluous parentheses, it won't parse
((1) (2))
for example, and it will throw an exception on some malformed inputs, because
singleP = Single . read <$> many digit
may use read "" :: Int.
That out of the way, the precedence argument is used to determine whether parentheses are necessary in some place, e.g. if you have
infixr 6 :+:
data a :+: b = a :+: b
data C = C Int
data D = D C
you don't need parentheses around a C 12 as an argument of (:+:), since the precedence of application is higher than that of (:+:), but you'd need parentheses around C 12 as an argument of D.
So you'd usually have something like
readsPrec p = needsParens (p >= precedenceLevel) someParser
where someParser parses a value from the input without enclosing parentheses, and needsParens True thing parses a thing between parentheses, while needsParens False thing parses a thing optionally enclosed in parentheses [you should always accept more parentheses than necessary, ((((((1)))))) should parse fine as an Int].
Since the readsPrec p parsers are used to parse parts of the input as parts of the value when reading lists, tuples etc., they must return not only the parsed value, but also the remaining part of the input.
With that, a simple way to transform a parsec parser to a readsPrec parser would be
withRemaining :: Parser a -> Parser (a, String)
withRemaining p = (,) <$> p <*> getInput
parsecToReadsPrec :: Parser a -> Int -> ReadS a
parsecToReadsPrec parsecParser prec input
= case parse (withremaining $ needsParens (prec >= threshold) parsecParser) "" input of
Left _ -> []
Right result -> [result]
If you're using GHC, it may however be preferable to use a ReadPrec / ReadP parser (built using Text.ParserCombinators.ReadP[rec]) instead of a parsec parser and define readPrec instead of readsPrec.

Resources