How do I convert PEG parser into ambiguous one? - parsing

As far as I understand, most languages are context-free with some exceptions. For instance, a * b may stand for type * pointer_declaration or multiplication in C++. Which one takes place depends on the context, the meaning of the first identifier. Another example is name production in VHDL
enum_literal ::= char_literal | identifer
physical_literal ::= [num] unit_identifier
func_call ::= func_identifier [parenthized_args]
array_indexing ::= arr_name (index_expr)
name ::= func_call | physical_literal | enum_litral | array_indexing
You see that syntactic forms are different but they can match if optional parameters are omitted, like f, does it stand for func_call, physical_literal, like 1 meter with optional amount 1 is implied, or enum_literal.
Talking to Scala plugin designers, I was educated to know that you build AST to re-evaluate it when dependencies change. There is no need to re-parse the file if you have its AST. AST also worth to display the file contents. But, AST is invalidated if grammar is context-sensitive (suppose that f was a function, defined in another file, but later user requalified it into enum literal or undefined). AST changes in this case. AST changes on whenever you change the dependencies. Another option, that I am asking to evaluate and let me know how to make it, is to build an ambiguous AST.
As far as I know, parser combinators are of PEG kind. They hide the ambiguity by returning you the first matched production and f would match a function call because it is the first alternative in my grammar. I am asking for a combinator that instead of falling back on the first success, it proceeds to the next alternative. In the end, it would return me a list of all matching alternatives. It would return me an ambiguity.
I do not know how would you display the ambiguous file contents tree to the user but it would eliminate the need to re-parse the dependent files. I would also be happy to know how modern language design solve this problem.
Once ambiguous node is parsed and ambiguity of results is returned, I would like the parser to converge because I would like to proceed parsing beyond the name and I do not want to parse to the end of file after every ambiguity. The situation is complicated by situations like f(10), which can be a function call with a single argument or a nullary function call, which return an array, which is indexed afterwards. So, f(10) can match name two ways, either as func_call directly or recursively, as arr_indexing -> name ~ (expr). So, it won't be ambiguity like several parallel rules, like fcall | literal. Some branches may be longer than 1 parser before re-converging, like fcall ~ (expr) | fcall.
How would you go about solving it? Is it possible to add ambiguating combinator to PEG?

First you claim that "most languages are context-free with some exceptions", this is not totally true. When designing a computer language, we mostly try to keep it as context-free as possible, since CFGs are the de-facto standard for that. It will ease a lot of work. This is not always feasible, though, and a lot[?] of languages depend on the semantic analysis phase to disambiguate any possible ambiguities.
Parser combinators do not use a formal model usually; PEGs, on the other hand, are a formalism for grammars, as are CFGs. On the last decade a few people have decided to use PEGs over CFGs due to two facts: PEGs are, by design, unambiguous, and they might always be parsed in linear time. A parser combinator library might use PEGs as underlying formalism, but might as well use CFGs or even none.
PEGs are attractive for designing computer languages because we usually do not want to handle ambiguities, which is something hard (or even impossible) to avoid when using CFGs. And, because of that, they might be parsed O(n) time by using dynamic programming (the so called packrat parser). It's not simple to "add ambiguities to them" for a few reasons, most importantly because the language they recognize depend on the fact that the options are deterministic, which is used for example when checking for lookahead. It isn't as simple as "just picking the first choice". For example, you could define a PEG:
S = "a" S "a" / "aa"
Which only parse sequences of N "a", where N is a power of 2. So it recognizes a sequence of 2, 4, 8, 16, 32, 64, etc, letter "a". By adding ambiguity, as a CFG would have, then you would recognize any even number of "a" (2, 4, 6, 8, 10, etc), which is a different language.
To answer your question,
How would you go about solving it? Is it possible to add ambiguating combinator to PEG?
First I must say that this is probably not a good idea. If you wish to keep ambiguity on the AST, you probably should use a CFG parser instead.
One could, for example, make a parser for PEGs which is similar to a parser for boolean grammars, but then our asymptotic parsing time would grow from O(n) to O(n3) by keeping all alternatives alive while keeping the same language. And we actually lose both good things about PEGs at once.
Another way would be to keep a packrat parser in memory, and transverse its table to handle the semantics from the AST. Not really a good idea either, since this would imply a large memory footprint.
Ideally, one should build an AST which already has information regarding possible ambiguities by changing the grammar structure. While this requires manual work, and usually isn't simple, you wouldn't have to go back a phase to check the grammar again.

Related

Can we use BNF for parsing AND lexing instead of regex?

With a Backus-Naur form grammar (BNF), we can specify the syntax of the programming language in order to parse it and produce an abstract syntax tree (AST).
<if> ::= "if" <expression> "then" <action> "end"
But we can also specify the tokens with a BNF grammar, as the first usage of BNF did for ALGOL-60:
<digit> ::= "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
<digit_with_zero> ::= <digit> | "0"
<integer> ::= <digit> | <digit_with_zero> <integer>
However, this usage of the BNF in order to lex (= produce a list of minimal meaningful units aka tokens) has been deprecated in favor of regular expressions (like [1-9][0-9]*).
It seems clear that the regex are much more concise.
It seems also that keeping the structure of an if statement is interesting for the interpreter or the compiler which will handle the AST produced by the parser, but keeping the structure of an integer (or a float) is not.
But do you agree that BNF could be used for both lexing and parsing?
And do you agree with the reasons which make regex much more suited for lexing?
Or are there others?
Regular expressions (in the mathematical sense) are equivalent in power to regular grammars and regular grammars can be written in BNF. So in that sense, it is clearly possible to write a full grammar for any context-free language in pure BNF.
Indeed, it is not even necessary to maintain the lexer/parser dichotomy. Some programmers find it convenient to use scannerless parsing (the article is not great but it has some interesting references), although many of these are based on the PEG formalism (which is not context-free) rather than BNF. (These are not the same despite the superficial resemblance.)
That said, it might not be convenient. In general, like most questions related to the structure of parsers, the answer is going to be based less on theory and more on a combination of practicality (with reference to a specific use case) and programmer prejudice.
As is well known, purity is rarely the most practical. Most real-life parser and scanner generators deviate from the pure theoretical models in order to provide mechanisms which are easier to use, easier to implement efficiently, or more powerful. For example, the character class syntax ([a-zA-Z]), which is almost universal in scanner generators, is a clear extension to regular expression syntax which deliberately avoids the need to explicitly list the entire contents of the set. One could say that the listing is implicit and unambiguous in the example I just presented, but most scanner generators also allow the use of classes like [[:alnum:]] ("alphanumeric symbols"), where the precise list of matched symbols is either locale-dependent or, in the Unicode world, extensible in the future. Regardless, this is obviously a useful extension.
While it is true that some aspects of regular expressions are more compact than their equivalent regular grammars -- especially the Kleene star operator, which in BNF requires an additional non-terminal and thus an additional name -- there are also cases where the ability to name subexpressions makes regular grammars more compact. Many scanner generators, starting with Lex, allowed named subpatterns as another regular expression extension. Furthermore, it is possible (with some caveats) to add the Kleene star and other operators to BNF as macros, and many parser generators do so. So there is a certain convergence of notation.
As you say, one difference between scanners and parsers is that the scanner generally makes no attempt to parse the substructure of a lexeme. But it is not true that no lexeme has substructure, and these substructures often do need to be analysed. The most notorious example is probably floating point numbers, which have to be analysed into a multiplier and an exponent, and the multiplier also analysed into an integer part and a fractional part. This analysis is commonly done using primitive functions available in the scanner implementation language (such as strtod for C scanners), but that does mean a second lexical scan. (Using the built-in avoids the considerable inconvenience of writing a mathematically correct string-to-internal converter, which is a much more difficult problem than it first appears. Rolling your own number converter is not recommended.)
Other lexemes with internal structure include string literals (which may contain escape sequences) and a large variety of more complex lexemes available in certain languages (dates and times, IP addresses, HTML tags, etc., etc.). All of these things tend to blur the boundary between scanning and parsing. Which is fine, because, as I said, the boundary is situational and not restrained by any absolute law of nature.
Still, it is certainly the case that many lexemes do not have any interesting internal structure, and furthermore that while it is easy to rewrite a regular expression as a regular grammar, it is considerably harder to rewrite it as an unambiguous, deterministic regular grammar, much less an LALR(1) regular grammar. (This is one of the reasons scannerless parsing is often associated with PEG, but it can also be solved with GLL or GLR parsers, at a slight loss of efficiency.)

Can this be parsed by a LALR(1) parser?

I am writing a parser in Bison for a language which has the following constructs, among others:
self-dispatch: [identifier arguments]
dispatch: [expression . identifier arguments]
string slicing: expression[expression,expression] - similar to Python.
arguments is a comma-separated list of expressions, which can be empty too. All of the above are expressions on their own, too.
My problem is that I am not sure how to parse both [method [other_method]] and [someString[idx1, idx2].toInt] or if it is possible to do this at all with an LALR(1) parser.
To be more precise, let's take the following example: [a[b]] (call method a with the result of method b). When it reaches the state [a . [b]] (the lookahead is the second [), it won't know whether to reduce a (which has already been reduced to identifier) to expression because something like a[b,c] might follow (which could itself be reduced to expression and continue with the second construct from above) or to keep it identifier (and shift it) because a list of arguments will follow (such as [b] in this case).
Is this shift/reduce conflict due to the way I expressed this grammar or is it not possible to parse all of these constructs with an LALR(1) parser?
And, a more general question, how can one prove that a language is/is not parsable by a particular type of parser?
Assuming your grammar is unambiguous (which the part you describe appears to be) then your best bet is to specify a %glr-parser. Since in most cases, the correct parse will be forced after only a few tokens, the overhead should not be noticeable, and the advantage is that you do not need to complicate either the grammar or the construction of the AST.
The one downside is that bison cannot verify that the grammar is unambiguous -- in general, this is not possible -- and it is not easy to prove. If it turns out that some input is ambiguous, the GLR parser will generate an error, so a good test suite is important.
Proving that the language is not LR(1) would be tricky, and I suspect that it would be impossible because the language probably is recognizable with an LALR(1) parser. (Impossible to tell without seeing the entire grammar, though.) But parsing (outside of CS theory) needs to create a correct parse tree in order to be useful, and the sort of modifications required to produce an LR grammar will also modify the AST, requiring a post-parse fixup. The difficultly in creating a correct AST spring from the difference in precedence between
a[b[c],d]
and
[a[b[c],d]]
In the first (subset) case, b binds to its argument list [c] and the comma has lower precedence; in the end, b[c] and d are sibling children of the slice. In the second case (method invocation), the comma is part of the argument list and binds more tightly than the method application; b, [c] and d are siblings in a method application. But you cannot decide the shape of the parse tree until an arbitrarily long input (since d could be any expression).
That's all a bit hand-wavey since "precedence" is not formally definable, and there are hacks which could make it possible to adjust the tree. Since the LR property is not really composable, it is really possible to provide a more rigorous analysis. But regardless, the GLR parser is likely to be the simplest and most robust solution.
One small point for future reference: CFGs are not just a programming tool; they also serve the purpose of clearly communicating the grammar in question. Nirmally, if you want to describe your language, you are better off using a clear CFG than trying to describe informally. Of course, meaningful non-terminal names will help, and a few examples never hurt, but the essence of the grammar is in the formal description and omitting that makes it harder for others to "be helpful".

Implementing "*?" (lazy "*") regexp pattern in combinatorial GLR parsers

I have implemented combinatorial GLR parsers. Among them there are:
char(·) parser which consumes specified character or range of characters.
many(·) combinator which repeats specified parser from zero to infinite times.
Example: "char('a').many()" will match a string with any number of "a"-s.
But many(·) combinator is greedy, so, for example, char('{') >> char('{') >> char('a'..'z').many() >> char('}') >> char('}') (where ">>" is sequential chaining of parsers) will successfully consume the whole "{{foo}}some{{bar}}" string.
I want to implement the lazy version of many(·) which, being used in previous example, will consume "{{foo}}" only. How can I do that?
Edit:
May be I confused ya all. In my program a parser is a function (or "functor" in terms of C++) which accepts a "step" and returns forest of "steps". A "step" may be of OK type (that means that parser has consumed part of input successfully) and FAIL type (that means the parser has encountered error). There are more types of steps but they are auxiliary.
Parser = f(Step) -> Collection of TreeNodes of Steps.
So when I parse input, I:
Compose simple predefined Parser functions to get complex Parser function representing required grammar.
Form initial Step from the input.
Give the initial Step to the complex Parser function.
Filter TreeNodes with Steps, leaving only OK ones (or with minimum FAIL-s if there were errors in input).
Gather information from Steps which were left.
I have implemented and have been using GLR parsers for 15 years as language front ends for a program transformation system.
I don't know what a "combinatorial GLR parser" is, and I'm unfamiliar with your notation so I'm not quite sure how to interpret it. I assume this is some kind of curried function notation? I'm imagining your combinator rules are equivalent to definining a grammer in terms of terminal characters, where "char('a').many" corresponds to grammar rules:
char = "a" ;
char = char "a" ;
GLR parsers, indeed, produce all possible parses. The key insight to GLR parsing is its psuedo-parallel processing of all possible parses. If your "combinators" can propose multiple parses (that is, they produce grammar rules sort of equivalent to the above), and you indeed have them connected to a GLR parser, they will all get tried, and only those sequences of productions that tile the text will survive (meaning all valid parsess, e.g., ambiguous parses) will survive.
If you have indeed implemented a GLR parser, this collection of all possible parses should have been extremely clear to you. The fact that it is not hints what you have implemented is not a GLR parser.
Error recovery with a GLR parser is possible, just as with any other parsing technology. What we do is keep the set of live parses before the point of the error; when an error is found, we try (in psuedo-parallel, the GLR parsing machinery makes this easy if it it bent properly) all the following: a) deleting the offending token, b) inserting all tokens that essentially are FOLLOW(x) where x is live parse. In essence, delete the token, or insert one expected by a live parse. We then turn the GLR parser loose again. Only the valid parses (e.g., repairs) will survive. If the current token cannot be processed, the parser processing the stream with the token deleted survives. In the worst case, the GLR parser error recovery ends up throwing away all tokens to EOF. A serious downside to this is the GLR parser's running time grows pretty radically while parsing errors; if there are many in one place, the error recovery time can go through the roof.
Won't a GLR parser produce all possible parses of the input? Then resolving the ambiguity is a matter of picking the parse you prefer. To do that, I suppose the elements of the parse forest need to be labeled according to what kind of combinator produced them, eager or lazy. (You can't resolve the ambiguity incrementally before you've seen all the input, in general.)
(This answer based on my dim memory and vague possible misunderstanding of GLR parsing. Hopefully someone expert will come by.)
Consider the regular expression <.*?> and the input <a>bc<d>ef. This should find <a>, and no other matches, right?
Now consider the regular expression <.*?>e with the same input. This should find <a>bc<d>e, right?
This poses a dilemma. For the user's sake, we want the behavior of the combinator >> to be understood in terms of its two operands. Yet there is no way to produce the second parser's behavior in terms of what the first one finds.
One answer is for each parser to produce a sequence of all parses, ordered by preference, rather than the unordered set of all parsers. Greedy matching would return matches sorted longest to shortest; non-greedy, shortest to longest.
Non-greedy functionality is nothing more than a disambiguation mechanism. If you truly have a generalized parser (which does not require disambiguation to produce its results), then "non-greedy" is meaningless; the same results will be returned whether or not an operator is "non-greedy".
Non-greedy disambiguation behavior could be applied to the complete set of results provided by a generalized parser. Working left-to-right, filter the ambiguous sub-groups corresponding to a non-greedy operator to use the shortest match which still led to a successful parse of the remaining input.

Practical consequences of formal grammar power?

Every undergraduate Intro to Compilers course reviews the commonly-implemented subsets of context-free grammars: LL(k), SLR(k), LALR(k), LR(k). We are also taught that for any given k, each of those grammars is a subset of the next.
What I've never seen is an explanation of what sorts of programming language syntactic features might require moving to a different language class. There's an obvious practical motivation for GLR parsers, namely, avoiding an unholy commingling of parser and symbol table when parsing C++. But what about the differences between the two "standard" classes, LL and LR?
Two questions:
What (general) syntactic constructions can be parsed with LR(k) but not LL(k')?
In what ways, if any, do those constructions manifest as desirable language constructs?
There's a plausible argument for reducing language power by making k as small as possible, because a language requiring many, many tokens of lookahead will be harder for humans to parse, as well as "harder" for machines to parse. Question (2) implicitly asks if the same reasoning ends up holding between classes, as well as within a class.
edit: Here's one example to illustrate the sorts of answers I'm looking for, but for regular languages instead of context-free:
When describing a regular language, one usually gets three operators: +, *, and ?. Now, you can remove + without reducing the power of the language; instead of writing x+, you write xx*, and the effect is the same. But if x is some big and hairy expression, the two xs are likely to diverge over time due to human forgetfulness, yielding a syntactically correct regular expression that doesn't match the original author's intent. Thus, even though adding + doesn't strictly add power, it does make the notation less error-prone.
Are there constructs with similar practical (human?) effects that must be "removed" when switching from LR to LL?
Parsing (I claim) is a bit like sorting: a problem that was the focus of a lot of thought in the early days of CS, leading to a set of well-understood solutions with some nice theoretical results.
My claim is that the picture that we get (or give, for those of us who teach) in a compilers class is, to some degree, a beautiful answer to the wrong question.
To answer your question more directly, an LL(1) grammar can't parse all kinds of things that you might want to parse; the "natural" formulation of an 'if' with an optional 'else', for instance.
But wait! Can't I reformulate my grammar as an LL(1) grammar and then patch up the source tree by walking over it afterward? Sure you can! To some degree, this is what makes the question of what kind of grammar your parser uses largely moot.
Also, back when I was an undergraduate (1990-94), whitespace-sensitive grammars were clearly the work of the Devil; now, Python and Haskell's designs are bringing whitespace-sensitivity back into the light. Also, Packrat parsing says "to heck with your theoretical purity: I'm just going to define a parser as a set of rules, and I don't care what class my grammar belongs to." (paraphrased)
In summary, I would agree with what I believe to be your implied suggestion: in 2009, a clear understanding of the difference between the classes LL(k) and LR(k) is less important in itself than the ability to formulate and debug a grammar that makes your parser generator happy.
The difference between LL and LR is primarily in the lookahead mechanism. People generally say that LR parsers carry more "context". To see this practically, consider a recursive grammar definition with S as the starting symbol:
A -> Ax | x
B -> Ay
C -> Az
S -> B | C
When k is a small fixed value, parsing a string like xxxxxxy is a task better suited to an LR parser. However, these days the popular LL parsers such as ANTLR do not restrict k to such small values and most people no longer care.
I hope this is more or less in line with your question. Of course Knuth showed that any unambiguous context-free language can be recognized by some LR(1) grammar. However, in practice we are also concerned with translation.
As a side note: You might also enjoy reading http://www.antlr.org/article/needlook.html.
This is by no means proven, but I have always questioned whether LR-like parsing is really similar to how the brain works when reading certain notations. For example, when reading an English sentence it is pretty obvious that we read from left-to-right. But, consider the pattern bellow:
. . . . . | . . . . .
I rather expect that with short patterns such as this one people do not literally read "dot dot dot dot dot bar dot dot dot dot dot" from left to right, but rather processes the pattern in parallel or at least in some kind of fuzzy iterative manner. In other words, I do not believe we necessarily read all patterns in a left-to-right manner with the kind of linear lookahead that a LL/LR parser employs.
Furthermore, if we can describe any context-free language using an LR(1) grammar then it is clear that simply recognizing a string is not the same as "understanding" it.
well, for one, Left recursive definitions are impossible in LL(k) grammars (as far as i know), don't know about others. This doesn't make itimpossible to define other things just a massive pain to do otherwise. For instance, putting together expressions can be easy in a left-recursive language (in pseudocode):
lexer rule expression = other rules
| expression
| '(' expression ')';
As far as syntactically useful things that can be made with left-recursion, um does simpler grammars count as syntactically useful?
The capabilities of a language are not limited by its syntax and grammar.
It's possible to define any language feature with an LL(k) grammar, it just might not be very readable to humans.

Parsing rules - how to make them play nice together

So I'm doing a Parser, where I favor flexibility over speed, and I want it to be easy to write grammars for, e.g. no tricky workaround rules (fake rules to solve conflicts etc, like you have to do in yacc/bison etc.)
There's a hand-coded Lexer with a fixed set of tokens (e.g. PLUS, DECIMAL, STRING_LIT, NAME, and so on) right now there are three types of rules:
TokenRule: matches a particular token
SequenceRule: matches an ordered list of rules
GroupRule: matches any rule from a list
For example, let's say we have the TokenRule 'varAccess', which matches token NAME (roughly /[A-Za-z][A-Za-z0-9_]*/), and the SequenceRule 'assignment', which matches [expression, TokenRule(PLUS), expression].
Expression is a GroupRule matching either 'assignment' or 'varAccess' (the actual ruleset I'm testing with is a bit more complete, but that'll do for the example)
But now let's say I want to parse
var1 = var2
And let's say the Parser begins with rule Expression (the order in which they are defined shouldn't matter - priorities will be solved later). And let's say the GroupRule expression will first try 'assignment'. Then since 'expression' is the first rule to be matched in 'assignment', it will try to parse an expression again, and so on until the stack is filled up and the computer - as expected - simply gives up in a sparkly segfault.
So what I did is - SequenceRules add themselves as 'leafs' to their first rule, and become non-roôt rules. Root rules are rules that the parser will first try. When one of those is applied and matches, it tries to subapply each of its leafs, one by one, until one matches. Then it tries the leafs of the matching leaf, and so on, until nothing matches anymore.
So that it can parse expressions like
var1 = var2 = var3 = var4
Just right =) Now the interesting stuff. This code:
var1 = (var2 + var3)
Won't parse. What happens is, var1 get parsed (varAccess), assign is sub-applied, it looks for an expression, tries 'parenthesis', begins, looks for an expression after the '(', finds var2, and then chokes on the '+' because it was expecting a ')'.
Why doesn't it match the 'var2 + var3' ? (and yes, there's an 'add' SequenceRule, before you ask). Because 'add' isn't a root rule (to avoid infinite recursion with the parse-expresssion-beginning-with-expression-etc.) and that leafs aren't tested in SequenceRules otherwise it would parse things like
reader readLine() println()
as
reader (readLine() println())
(e.g. '1 = 3' is the expression expected by add, the leaf of varAccess a)
whereas we'd like it to be left-associative, e.g. parsing as
(reader readLine()) println()
So anyway, now we've got this problem that we should be able to parse expression such as '1 + 2' within SequenceRules. What to do? Add a special case that when SequenceRules begin with a TokenRule, then the GroupRules it contains are tested for leafs? Would that even make sense outside that particular example? Or should one be able to specify in each element of a SequenceRule if it should be tested for leafs or not? Tell me what you think (other than throw away the whole system - that'll probably happen in a few months anyway)
P.S: Please, pretty please, don't answer something like "go read this 400pages book or you don't even deserve our time" If you feel the need to - just refrain yourself and go bash on reddit. Okay? Thanks in advance.
LL(k) parsers (top down recursive, whether automated or written by hand) require refactoring of your grammar to avoid left recursion, and often require special specifications of lookahead (e.g. ANTLR) to be able to handle k-token lookahead. Since grammars are complex, you get to discover k by experimenting, which is exactly the thing you wish to avoid.
YACC/LALR(1) grammars aviod the problem of left recursion, which is a big step forward. The bad news is that there are no real programming langauges (other than Wirth's original PASCAL) that are LALR(1). Therefore you get to hack your grammar to change it from LR(k) to LALR(1), again forcing you to suffer the experiments that expose the strange cases, and hacking the grammar reduction logic to try to handle K-lookaheads when the parser generators (YACC, BISON, ... you name it) produce 1-lookahead parsers.
GLR parsers (http://en.wikipedia.org/wiki/GLR_parser) allow you to avoid almost all of this nonsense. If you can write a context free parser, under most practical circumstances, a GLR parser will parse it without further effort. That's an enormous relief when you try to write arbitrary grammars. And a really good GLR parser will directly produce a tree.
BISON has been enhanced to do GLR parsing, sort of. You still have to write complicated logic to produce your desired AST, and you have to worry about how to handle failed parsers and cleaning up/deleting their corresponding (failed) trees. The DMS Software Reengineering Tookit provides standard GLR parsers for any context free grammar, and automatically builds ASTs without any additional effort on your part; ambiguous trees are automatically constructed and can be cleaned up by post-parsing semantic analyis. We've used this to do define 30+ language grammars including C, including C++ (which is widely thought to be hard to parse [and it is almost impossible to parse with YACC] but is straightforward with real GLR); see C+++ front end parser and AST builder based on DMS.
Bottom line: if you want to write grammar rules in a straightforward way, and get a parser to process them, use GLR parsing technology. Bison almost works. DMs really works.
My favourite parsing technique is to create recursive-descent (RD) parser from a PEG grammar specification. They are usually very fast, simple, and flexible. One nice advantage is you don't have to worry about separate tokenization passes, and worrying about squeezing the grammar into some LALR form is non-existent. Some PEG libraries are listed [here][1].
Sorry, I know this falls into throw away the system, but you are barely out of the gate with your problem and switching to a PEG RD parser, would just eliminate your headaches now.

Resources