I'm trying to implement a very simple markup language. I have an intermediate representation that looks like:
data Token = Str Text
| Explode Text
type Rep = [Token]
So, the idea is to turn an arbitrary text of the form:
The quick brown %%fox%% %%jumps%% over the %%lazy%% dog.
into:
[Str "The quick brown", Explode "fox", Explode "jumps", Str "over the", Explode "lazy", Str "dog"]
for further processing. Also, it is important that we treat:
%%fox%% %%jumps%%
differently than
%%fox jumps%%
The latter should become Explode "fox jumps".
I tried to implement this using attoparsec, but I don't think I have the tools I need. But I'm not so good with parsing theory (I studied math, not CS). What kind of grammar is this? What kind of parser combinator library should I use? I considered using Parsec with a stateful monad transformer stack to keep track of the context. Does that sound sensible?
You can take the cheap and easy way, without a proper parser. The important thing to recognise is that this grammar is actually fairly simple – it has no recursion or such. It is just a flat listing of Strs and Explodes.
The easy way
So we can start by breaking the string down into a list containing the text and the separators as separate values. We need a data type to separate the separators (%%) from actual text (everything else.)
data ParserTokens = Sep | T Text
Breaking it down
Then we need to break the list into its constituents.
tokenise = intersperse Sep . map T . Text.splitOn "%%"
This will first split the string on %%, so in your example it'll become
["The quick brown ","fox"," ","jumps"," over the ","lazy"," dog."]
then we map T over it, to turn it from a [Text] to a [ParserTokens]. Finally, we intersperse Sep over it, to reintroduce the %% separators but in a shape that's easier to deal with. The result is, in your example,
[T "The quick brown ",Sep,T "fox",Sep,T " ",Sep,T "jumps",Sep,T " over the ",Sep,T "lazy",Sep,T " dog."]
Building it up
With this done, all that remains is parsing this thing into the shape you want it. Parsing this amounts to finding the 1-2-3 punch of Sep–T "something"–Sep and replacing it with Explode "something". We write a recursive function to do this.
construct [] = []
construct (T s : rest) = Str s : construct rest
construct (Sep : T s : Sep : rest) = Explode s : construct rest
construct _ = error "Mismatched '%%'!"
This converts T s to Str s and the combination of separators and a T s into an Explode s. If the pattern matching fails, it's because there was a stray separator somewhere, so I've just set it to crash the program. You might want better error handling there – such as wrapping the result in Either String or something similar.
With this done, we can create the function
parseTemplate = construct . tokenise
and in the end, if we run your example through parseTemplate, we get the expected result
[Str "The quick brown ",Explode "fox",Str " ",Explode "jumps",Str " over the ",Explode "lazy",Str " dog."]
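For readers who want to prototype the same split-and-alternate idea outside Haskell, here is a minimal sketch in Python. The names Str, Explode and parse_template mirror the answer but are assumptions of this sketch; unlike construct, it reports a stray %% by raising an exception and it drops empty Str pieces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Str:
    text: str


@dataclass(frozen=True)
class Explode:
    text: str


def parse_template(source, sep="%%"):
    """Split on the separator; pieces at odd positions sit between a pair of separators."""
    pieces = source.split(sep)
    # A balanced input has an even number of separators, hence an odd number of pieces.
    if len(pieces) % 2 == 0:
        raise ValueError("Mismatched '%%'!")
    tokens = []
    for index, piece in enumerate(pieces):
        if index % 2 == 1:
            tokens.append(Explode(piece))
        elif piece:
            # Skip the empty strings that appear around leading or trailing separators.
            tokens.append(Str(piece))
    return tokens
```

With this sketch, "%%fox%% %%jumps%%" yields two Explode tokens separated by Str " ", while "%%fox jumps%%" yields a single Explode "fox jumps".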
For such a simple parser even Attoparsec seems to be overkill:
parse = map (\w -> case w of
'%':'%':expl -> Explode $ init $ init expl
str -> Str str) . words
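Of course, this code needs some sanity checks for the Explode case. Because it splits on whitespace, it also cannot handle a multi-word span such as %%fox jumps%%.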
This doesn't handle whitespace the way you specified, but it should get you on the right track.
parseMU = zipWith ($) (cycle [Str,Explode]) . splitps where
splitps :: String -> [String]
splitps [] = [[]]
splitps ('%':'%':r) = [] : splitps r
splitps (c:r) = let
(a:r') = splitps r
in ((c:a):r')
I'm making a parser for a DSL in Haskell using Alex + Happy.
My DSL uses dice rolls as part of the possible expressions.
Sometimes I have an expression that I want to parse that looks like:
[some code...] 3D6 [... rest of the code]
Which should translate roughly to:
TokenInt {... value = 3}, TokenD, TokenInt {... value = 6}
My DSL also uses variables (basically, Strings), so I have a special token that handles variable names.
So, with these tokens:
"D" { \pos str -> TokenD pos }
$alpha [$alpha $digit \_ \']* { \pos str -> TokenName pos str}
$digit+ { \pos str -> TokenInt pos (read str) }
The result I'm getting when using my parser now is:
TokenInt {... value = 3}, TokenName { ... , name = "D6"}
Which means that my lexer "reads" an Integer and a Variable named "D6".
I have tried many things; for example, I changed the token D to:
$digit "D" $digit { \pos str -> TokenD pos }
But that just consumes the digits :(
Can I parse the dice roll with the numbers?
Or at least parse TokenInt-TokenD-TokenInt?
PS: I'm using PosN as a wrapper, not sure if relevant.
The way I'd go about it would be to extend the TokenD type to TokenD Int Int so using the basic wrapper for convenience I would do
$digit+ D $digit+ { dice }
...
dice :: String -> Token
dice s = TokenD (read $ head ls) (read $ last ls)
where ls = split 'D' s
split can be found here.
This is an extra step that would usually be done during syntactic analysis, but it doesn't hurt much here.
Also I can't make Alex parse $alpha for TokenD instead of TokenName. If we had Di instead of D that'd be no problem. From Alex's docs:
When the input stream matches more than one rule, the rule which matches the longest prefix of the input stream wins. If there are still several rules which match an equal number of characters, then the rule which appears earliest in the file wins.
But then your code should work. I don't know if this is an issue with Alex.
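To see how the longest-match rule quoted above plays out on 3D6, here is a minimal sketch in Python that imitates the rule with the standard re module. It is not Alex, and the rule tables and the function name lex_longest_match are assumptions made for illustration. Whitespace handling is left out.

```python
import re


def lex_longest_match(source, rules):
    """Pick the rule with the longest match at each position; earlier rules win ties."""
    tokens = []
    pos = 0
    while pos < len(source):
        best_name, best_match = None, None
        for name, pattern in rules:
            match = re.compile(pattern).match(source, pos)
            # Strict '>' keeps the earliest rule when two matches have equal length.
            if match and (best_match is None or match.end() > best_match.end()):
                best_name, best_match = name, match
        if best_match is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, best_match.group()))
        pos = best_match.end()
    return tokens


# Rules in the spirit of the question: names may start with any letter.
ORIGINAL_RULES = [
    ("D", r"D"),
    ("NAME", r"[A-Za-z][A-Za-z0-9_']*"),
    ("INT", r"[0-9]+"),
]

# Variant where names must start with a lowercase letter.
LOWERCASE_NAME_RULES = [
    ("D", r"D"),
    ("NAME", r"[a-z][A-Za-z0-9_']*"),
    ("INT", r"[0-9]+"),
]
```

Under ORIGINAL_RULES the name rule matches the two characters D6 while the D rule matches only one, so the name rule wins even though D is listed first. Under LOWERCASE_NAME_RULES the name rule no longer applies at D, and the input splits into an integer, a D, and another integer.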
I decided that I could survive with variables starting with lowercase letters (like Haskell variables), so I changed my lexer to parse variables only if they start with a lowercase letter.
That also solved some possible problems with some other reserved words.
I'm still curious to know if there were other solutions, but the problem in itself was solved.
Thank you all!
I'm working on an EDI file parser, and I'm having considerable difficulty implementing an escape for the 'segment terminator'. For anyone fortunate enough to not work with EDI, the segment terminator (usually an apostrophe) is the delimiter between segments, which are like cells.
The desired behaviour looks something like this:
ABC+123'DEF+567' -> ["ABC+123", "DEF+567"]
ABC+123?'DEF+567' -> ["ABC+123?'DEF+567"]
Using FParsec, without escaping the apostrophe (and, for simplicity, ignoring parameterisation), the parser looks something like this:
let pSegment = //logic to parse the contents of a segment
let pAllSegments = sepEndBy pSegment (str "'")
This approach with the above example would yield ["ABC+123?", "DEF+567"].
My next consideration was to use a regex:
let pAllSegments = sepEndBy pSegment (regex @"[^\?]'")
The problem here is that the character prior to the apostrophe is also consumed, leading to incomplete messages.
I'm fairly certain I just don't understand FParsec well enough here. Does anyone have any pointers?
The issue is in the parse contents step.
The parser is working 'bottom up'. It finds the contents of the segments, which are not permitted to contain the terminator, then finds that all these segments are separated by the terminator, and constructs the list.
My error was in the pSegment step, which was using a parameterised version of (?:[A-Za-z0-9 \\.]|\?[\?\+:\?])*. See that second \? inside the final character class? That should have been a '.
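The principle here, that the segment content itself must accept an escaped terminator, can be sketched without FParsec. The following Python is only an illustration of the desired behaviour from the question; the function name split_segments and its parameters are assumptions, and it keeps the release character in the output exactly as the question's expected result does.

```python
def split_segments(message, terminator="'", release="?"):
    """Split on the terminator, treating release + any character as segment content."""
    segments = []
    current = []
    chars = iter(message)
    for ch in chars:
        if ch == release:
            # Keep the release character and whatever it escapes as ordinary content.
            current.append(ch)
            escaped = next(chars, None)
            if escaped is not None:
                current.append(escaped)
        elif ch == terminator:
            segments.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        # Tolerate a final segment without a trailing terminator.
        segments.append("".join(current))
    return segments
```

Because the escape is consumed while reading the content, the separator logic never sees the escaped apostrophe, which is the same division of labour as fixing pSegment rather than the sepEndBy separator.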
Could someone help me with using context-free grammars? Up until now I've used regular expressions to remove comments, block comments and empty lines from a string so that it can be used to count the PLOC. This seems to be extremely slow, so I was looking for a different, more efficient method.
I saw the following post: What is the best way to ignore comments in a java file with Rascal?
I have no idea how to use this, and the help doesn't get me far either. When I try to define the line used in the post I immediately get an error.
lexical SingleLineComment = "//" ~[\n] "\n";
Could someone help me out with this and also explain a bit about how to setup such a context free grammar and then to actually extract the wanted data?
Kind regards,
Bob
First, this will help: the ~ in that rule is not part of Rascal's CFG notation. The negation of a character class is written like so: ![\n].
Using a context-free grammar in Rascal takes three steps:
Write it, for example like the syntax definition of the Func language here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func
Use it to parse input, like so:
// This is the basic parse command, but be careful it will not accept spaces and newlines before and after the TopNonTerminal text:
Prog myParseTree = parse(#Prog, "example string");
// you can do the same directly to an input file:
Prog myParseTree = parse(#Prog, |home:///myProgram.func|);
// if you need to accept layout before and after the program, use a "start nonterminal":
start[Prog] myParseTree = parse(#start[Prog], |home:///myProgram.func|);
Prog myProgram = myParseTree.top;
// shorthand for parsing stuff:
myProgram = [Prog] "example";
myProgram = [Prog] |home:///myLocation.txt|;
Once you have the tree you can start using visit and / deepmatch to extract information from the tree, or write recursive functions if you like. Examples can be found here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func , but here are some common idioms as well to extract information from a parse tree:
// produces the source location of each node in the tree:
myParseTree@\loc
// produces a set of all nodes of type Stat
{ s | /Stat s := myParseTree }
// pattern match an if-then-else and bind the three expressions and collect them in a set:
{ e1, e2, e3 | (Stat) `if <Exp e1> then <Exp e2> else <Exp e3> end` <- myExpressionList }
// collect all locations of all sub-trees (every parse tree is of a non-terminal type, which is a sub-type of Tree). It uses |unknown:///| for small sub-trees which have not been annotated for efficiency's sake, like literals and character classes:
[ t@\loc?|unknown:///| | /Tree t := myParseTree ]
That should give you a start. I'd go try out some stuff and look at more examples. Writing a grammar is a nice thing to do, but it does require some trial and error, like writing a regex, but even more so.
For the grammar you might be writing, which finds source code comments but leaves the rest as "any character" you will need to use the longest match disambiguation a lot:
lexical Identifier = [a-z]+ !>> [a-z]; // means do not accept an Identifier if there is still [a-z] to add to it; so only the longest possible Identifier will match.
This kind of context-free grammar is called an "Island Grammar" metaphorically, because you will write precise rules for the parts you want to recognize (the comments are "Islands") while leaving the rest as everything else (the rest is "Water"). See https://dl.acm.org/citation.cfm?id=837160
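The island idea can be illustrated outside Rascal with a hand-written scanner: comments and string literals are the islands with precise rules, and every other character is water that passes through unchanged. The Python below is a rough sketch for Java-like input, not a Rascal grammar; the function names are assumptions, and it deliberately ignores cases such as the character literal '"'.

```python
def strip_comments(source):
    """Remove // and /* */ comments while leaving string literals untouched."""
    out = []
    i = 0
    n = len(source)
    while i < n:
        if source.startswith("//", i):
            # Island: line comment. Skip to the newline but keep the newline itself.
            end = source.find("\n", i)
            i = n if end == -1 else end
        elif source.startswith("/*", i):
            # Island: block comment. Skip past the closing delimiter.
            end = source.find("*/", i + 2)
            i = n if end == -1 else end + 2
        elif source[i] == '"':
            # Island: string literal. Copy it whole so "//" inside it is not a comment.
            j = i + 1
            while j < n and source[j] != '"':
                j += 2 if source[j] == "\\" else 1
            out.append(source[i:j + 1])
            i = j + 1
        else:
            # Water: any other character is copied as-is.
            out.append(source[i])
            i += 1
    return "".join(out)


def count_code_lines(source):
    """Count the lines that still contain something after comments are removed."""
    return sum(1 for line in strip_comments(source).splitlines() if line.strip())
```

A single left-to-right pass like this avoids repeatedly running regular expressions over the whole file, which is one plausible reason the regex approach felt slow.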
Rebol2 has an /ANY refinement on the FIND function that can do wildcard searches:
>> find/any "here is a string" "s?r"
== "string"
I use this extensively in tight loops that need to perform well. But the refinement was removed in Rebol3.
What's the most efficient way of doing this in Rebol3? (I'm guessing a parse solution of some sort.)
Here's a stab at handling the "*" case:
like: funct [
series [series!]
search [series!]
][
rule: copy []
remove-each s b: parse/all search "*" [empty? s]
foreach s b [
append rule reduce ['to s]
]
append rule [to end]
all [
parse series rule
find series first b
]
]
used as follows:
>> like "abcde" "b*d"
== "bcde"
I had edited your question for "clarity" and changed it to say 'was removed'. That made it sound like it was a deliberate decision. Yet it actually turns out it may just not have been implemented.
BUT if anyone asks me, I don't think it should be in the box...and not just because it's a lousy use of the word "ANY". Here's why:
You're looking for patterns in strings...so if you're constrained to using a string to specify that pattern you get into "meta" problems. Let's say I want to extract the word *Rebol* or ?Red?, now there has to be escaping and things get ugly all over again. Back to RegEx. :-/
So what you might actually want isn't a STRING! pattern like s?r but a BLOCK! pattern like ["s" ? "r"]. This would permit constructs like ["?" ? "?"] or [{?} ? {?}]. That's better than rehashing the string hackery that every other language uses.
And that's what PARSE does, albeit in a slightly-less-declarative way. It also uses words instead of symbols, as Rebol likes to do. [{?} skip {?}] is a match rule where skip is an instruction that moves the parse position past any single element of the parse series between the question marks. It could also do so if it were parsing a block as input, and would match [{?} 12-Dec-2012 {?}].
I don't know entirely what the behavior of /ANY would-or-should be with something like "ab??cd e?*f"... if it provided alternate pattern logic or what. I'm assuming the Rebol2 implementation is brief? So likely it only matches one pattern.
To set a baseline, here's a possibly-lame PARSE solution for the s?r intent:
>> parse "here is a string" [
some [ ; match rule repeatedly
to "s" ; advance to *before* "s"
pos: ; save position as potential match
skip ; now skip the "s"
[ ; [sub-rule]
skip ; ignore any single character (the "?")
"r" ; match the "r", and if we do...
return pos ; return the position we saved
| ; | (otherwise)
none ; no-op, keep trying to match
]
]
fail ; have PARSE return NONE
]
== "string"
If you wanted it to be s*r you would change the skip "r" return pos into a to "r" return pos.
On an efficiency note, I'll mention that it is indeed the case that characters are matched against characters faster than strings. So to #"s" and #"r" to end make a measurable difference in the speed when parsing strings in general. Beyond that, I'm sure others can do better.
The rule is certainly longer than "s?r". But it's not that long when comments are taken out:
[some [to #"s" pos: skip [skip #"r" return pos | none]] fail]
(Note: It does leak pos: as written. Is there a USE in PARSE, implemented or planned?)
Yet a nice thing about it is that it offers hook points at all the moments of decision, and without the escaping defects a naive string solution has. (I'm tempted to give my usual "Bad LEGO alligator vs. Good LEGO alligator" speech.)
But if you don't want to code in PARSE directly, it seems the real answer would be some kind of "Glob Expression"-to-PARSE compiler. It might be the best interpretation of glob Rebol would have, because you could do a one-off:
>> parse "here is a string" glob "s?r"
== "string"
Or if you are going to be doing the match often, cache the compiled expression. Also, let's imagine our block form uses words for literacy:
s?r-rule: glob ["s" one "r"]
pos-1: parse "here is a string" s?r-rule
pos-2: parse "reuse compiled RegEx string" s?r-rule
It might be interesting to see such a compiler for regex as well. These also might accept not only string input but also block input, so that both "s.r" and ["s" . "r"] were legal...and if you used the block form you wouldn't need escaping and could write ["." . "."] to match ".A."
Fairly interesting things would be possible. Given that in RegEx:
(abc|def)=\g{1}
matches abc=abc or def=def
but not abc=def or def=abc
Rebol could be modified to take either the string form or compile into a PARSE rule with a form like:
regex [("abc" | "def") "=" (1)]
Then you get a dialect variation that doesn't need escaping. Designing and writing such compilers is left as an exercise for the reader. :-)
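The "compile the glob once, then reuse it" idea from this answer can be sketched in Python. Note the swap: this sketch compiles to a regular expression from Python's re module rather than to a PARSE rule, and the names compile_glob and find_any are hypothetical. It only aims to mirror the FIND/ANY results shown in the question.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def compile_glob(pattern):
    """Translate * (any run of characters) and ? (any one character) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            # Everything else is literal text, so escape regex metacharacters.
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def find_any(series, pattern):
    """Return the series from the start of the first match, or None if there is none."""
    match = compile_glob(pattern).search(series)
    return None if match is None else series[match.start():]
```

The lru_cache decorator plays the role of caching the compiled expression, so a tight loop that reuses the same pattern pays the translation cost only once. The string form still has the escaping problem the answer describes: there is no way here to search for a literal * or ?.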
I've broken this into two functions: one that creates a rule to match the given search value, and the other to perform the search. Separating the two allows you to reuse the same generated parse block where one search value is applied over multiple iterations:
expand-wildcards: use [literal][
literal: complement charset "*?"
func [
{Creates a PARSE rule matching VALUE expanding * (any characters) and ? (any one character)}
value [any-string!] "Value to expand"
/local part
][
collect [
parse value [
; empty search string FAIL
end (keep [return (none)])
|
; only wildcard return HEAD
some #"*" end (keep [to end])
|
; everything else...
some [
; single char matches
#"?" (keep 'skip)
|
; textual match
copy part some literal (keep part)
|
; indicates the use of THRU for the next string
some #"*"
; but first we're going to match single chars
any [#"?" (keep 'skip)]
; it's optional in case there's a "*?*" sequence
; in which case, we're going to ignore the first "*"
opt [
copy part some literal (
keep 'thru keep part
)
]
]
]
]
]
]
like: func [
{Finds a value in a series and returns the series at the start of it.}
series [any-string!] "Series to search"
value [any-string! block!] "Value to find"
/local skips result
][
; shortens the search a little where the search starts with a regular char
skips: switch/default first value [
#[none] #"*" #"?" ['skip]
][
reduce ['skip 'to first value]
]
any [
block? value
value: expand-wildcards value
]
parse series [
some [
; we have our match
result: value
; and return it
return (result)
|
; step through the string until we get a match
skips
]
; at the end of the string, no matches
fail
]
]
Splitting the function also gives you a base to optimize the two different concerns: finding the start and matching the value.
I went with PARSE because, even though * and ? are seemingly simple rules, there is nothing quite as expressive and quick as PARSE for effectively implementing such a search.
It might yet be worth, as per @HostileFork, considering a dialect instead of strings with wildcards, indeed to the point where Regex is replaced by a compile-to-parse dialect, but that is perhaps beyond the scope of the question.
I have read the GOLD Homepage ( http://www.devincook.com/goldparser/ ) docs, FAQ and Wikipedia to find out what practical application there could possibly be for GOLD. I was thinking along the lines of having a programming language (easily) available to my systems such as ABAP on SAP or X++ on Axapta - but it doesn't look feasible to me, at least not easily - even if you use GOLD.
The final use of the parsed result produced by GOLD escapes me - what do you do with the result of the parse?
EDIT: A practical example (description) would be great.
Parsing really consists of two phases. The first is "lexing", which converts the raw string of characters into something that the program can more readily understand (commonly called tokens).
Simple example, lex would convert:
if (a + b > 2) then
In to:
IF_TOKEN LEFT_PAREN IDENTIFIER(a) PLUS_SIGN IDENTIFIER(b) GREATER_THAN NUMBER(2) RIGHT_PAREN THEN_TOKEN
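As a rough illustration of that lexing step, here is a small Python sketch built on the standard re module. It is not how GOLD does it; the token names are taken from the example above, while TOKEN_SPEC, MASTER and lex are names invented for this sketch.

```python
import re

# Order matters: keywords are listed before IDENTIFIER so that "if" is not lexed as a name.
TOKEN_SPEC = [
    ("IF_TOKEN", r"if\b"),
    ("THEN_TOKEN", r"then\b"),
    ("NUMBER", r"[0-9]+"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("LEFT_PAREN", r"\("),
    ("RIGHT_PAREN", r"\)"),
    ("PLUS_SIGN", r"\+"),
    ("GREATER_THAN", r">"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def lex(source):
    """Turn raw text into a list of token descriptions, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at position {pos}")
        kind = match.lastgroup
        if kind in ("IDENTIFIER", "NUMBER"):
            # These tokens carry their text along, as in IDENTIFIER(a).
            tokens.append(f"{kind}({match.group()})")
        elif kind != "SKIP":
            tokens.append(kind)
        pos = match.end()
    return tokens
```

Joining the result of lex("if (a + b > 2) then") with spaces reproduces the token line shown above.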
The parser takes that stream of tokens, and attempts to make yet more sense out of them. In this case, it would try and match up those tokens to an IF_STATEMENT. To the parser, the IF_STATEMENT may well look like this:
IF ( BOOLEAN_EXPRESSION ) THEN
Where the result of the lexing phase is a token stream, the result of the parsing phase is a Parse Tree.
So, a parser could convert the above in to:
if_statement
|
v
boolean_expression.operator = GREATER_THAN
| |
| v
V numeric_constant.string="2"
expression.operator = PLUS_SIGN
| |
| v
v identifier.string = "b"
identifier.string = "a"
Here you see we have an IF_STATEMENT. An IF_STATEMENT has a single argument, which is a BOOLEAN_EXPRESSION. This was explained in some manner to the parser. When the parser is converting the token stream, it "knows" what an IF looks like, and knows what a BOOLEAN_EXPRESSION looks like, so it can make the proper assignments when it sees the code.
For example, if you have just:
if (a + b) then
The parser could know that it's not a boolean expression (because the + is arithmetic, not a boolean operator) and the parser could throw an error at this point.
Next, we see that a BOOLEAN_EXPRESSION has 3 components, the operator (GREATER_THAN), and two sides, the left side and the right side.
On the left side, it points to yet another expression, the "a + b", while on the right it points to a NUMERIC_CONSTANT, in this case the string "2". Again, the parser "knows" this is a NUMERIC constant because we told it about strings of numbers. If it wasn't numbers, it would be an IDENTIFIER (like "a" and "b" are).
Note, that if we had something like:
if (a + b > "XYZ") then
That "parses" just fine (expression on the left, string constant on the right). We don't know from looking at this whether this is a valid expression or not. We don't know if "a" or "b" reference Strings or Numbers at this point. So, this is something the parser can't decide for us and can't flag as an error, as it simply doesn't know. That will happen when we evaluate (either execute or try to compile into code) the IF statement.
If we did:
if [a > b ) then
The parser can readily see that syntax error as a problem, and will throw an error. That string of tokens doesn't look like anything it knows about.
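The three cases just discussed can be tried against a toy parser. The Python below is a hand-written recursive-descent sketch, not what GOLD generates; the token tuples, the ParseError class and the function parse_if are assumptions made for illustration, and the grammar covers only IF ( expression > expression ) THEN with + inside expressions.

```python
class ParseError(Exception):
    """Raised when the token sequence does not look like an if statement."""


def parse_if(tokens):
    """Parse a list of (kind, text) tokens into a nested-tuple parse tree."""
    pos = 0

    def expect(kind):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != kind:
            found = tokens[pos][0] if pos < len(tokens) else "end of input"
            raise ParseError(f"expected {kind}, found {found}")
        pos += 1

    def operand():
        nonlocal pos
        if pos < len(tokens) and tokens[pos][0] in ("IDENTIFIER", "NUMBER", "STRING"):
            token = tokens[pos]
            pos += 1
            return token
        found = tokens[pos][0] if pos < len(tokens) else "end of input"
        raise ParseError(f"expected an operand, found {found}")

    def expression():
        nonlocal pos
        node = operand()
        # Fold "a + b + c" into a left-leaning tree of PLUS_SIGN nodes.
        while pos < len(tokens) and tokens[pos][0] == "PLUS_SIGN":
            pos += 1
            node = ("PLUS_SIGN", node, operand())
        return node

    expect("IF_TOKEN")
    expect("LEFT_PAREN")
    left = expression()
    expect("GREATER_THAN")  # the condition must be a comparison
    right = expression()
    expect("RIGHT_PAREN")
    expect("THEN_TOKEN")
    if pos != len(tokens):
        raise ParseError("unexpected tokens after THEN")
    return ("if_statement", ("GREATER_THAN", left, right))
```

In this sketch, the tokens for if (a + b) then fail at the point where a comparison is required, the tokens for if (a + b > "XYZ") then parse without complaint because type checking is not the parser's job, and a bracket token where LEFT_PAREN is expected fails immediately.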
So, the point being that when you get a complete parse tree, you have some assurance that at first cut the "code looks good". Now during execution, other errors may well come up.
To evaluate the parse tree, you just walk the tree. You'll have some code associated with the major nodes of the parse tree during the compile or evaluation part. Let's assume that we have an interpreter.
public void execute_if_statement(ParseTreeNode node) {
// We already know we have a IF_STATEMENT node
Value value = evaluate_expression(node.getBooleanExpression());
if (value.getBooleanResult() == true) {
// we do the "then" part of the code
}
}
public Value evaluate_expression(ParseTreeNode node) {
Value result = null;
if (node.isConstant()) {
result = evaluate_constant(node);
return result;
}
if (node.isIdentifier()) {
result = lookupIdentifier(node);
return result;
}
Value leftSide = evaluate_expression(node.getLeftSide());
Value rightSide = evaluate_expression(node.getRightSide());
if (node.getOperator() == '+') {
if (!leftSide.isNumber() || !rightSide.isNumber()) {
throw new RuntimeError("Must have numbers for adding");
}
int l = leftSide.getIntValue();
int r = rightSide.getIntValue();
int sum = l + r;
return new Value(sum);
}
if (node.getOperator() == '>') {
if (leftSide.getType() != rightSide.getType()) {
throw new RuntimeError("You can only compare values of the same type");
}
if (leftSide.isNumber()) {
int l = leftSide.getIntValue();
int r = rightSide.getIntValue();
boolean greater = l > r;
return new Value(greater);
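} else {
// compare strings lexicographically (getStringValue() is a hypothetical accessor on Value)
boolean greater = leftSide.getStringValue().compareTo(rightSide.getStringValue()) > 0;
return new Value(greater);
}
}
// any other operator is not handled by this sketch
throw new RuntimeError("Unknown operator: " + node.getOperator());
}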
So, you can see that we have a recursive evaluator here. You see how we're checking the run time types, and performing the basic evaluations.
What will happen is that execute_if_statement will evaluate its main expression. Even though we wanted only a BOOLEAN_EXPRESSION in the parse, all expressions are mostly the same for our purposes. So, execute_if_statement calls evaluate_expression.
In our system, all expressions have an operator and a left and right side. Each side of an expression is ALSO an expression, so you can see how we immediately try and evaluate those as well to get their real value. The one note is that if the expression consists of a CONSTANT, then we simply return the constant's value; if it's an identifier, we look it up as a variable (and that would be a good place to throw an "I can't find the variable 'a'" message); otherwise we're back to the left side/right side thing.
I hope you can see how a simple evaluator can work once you have a parse tree from a parser. Note how during evaluation, the major elements of the language are in place, otherwise we'd have got a syntax error and never got to this phase. We can simply expect to "know" that when we have, for example, a PLUS operator, we're going to have 2 expressions, the left and right side. Or when we execute an IF statement, that we already have a boolean expression to evaluate. The parser is what does that heavy lifting for us.
Getting started with a new language can be a challenge, but you'll find once you get rolling, the rest becomes pretty straightforward and it's almost "magic" that it all works in the end.
Note, pardon the formatting, but underscores are messing things up -- I hope it's still clear.
I would recommend antlr.org for information and the 'free' tool I would use for any parser use.
GOLD can be used for any kind of application where you have to apply context-free grammars to input.
elaboration:
Essentially, CFGs apply to all programming languages. So if you wanted to develop a scripting language for your company, you'd need to write a parser, or get a parsing program. Alternatively, if you wanted to have a semi-natural language for input for non-programmers in the company, you could use a parser to read that input and spit out more "machine-readable" data. Essentially, a context-free grammar allows you to describe far more inputs than a regular expression. The GOLD system apparently makes the parsing problem somewhat easier than lex/yacc (the UNIX standard programs for parsing).