Simple XML parser in bison/flex

I would like to create a simple XML parser using bison/flex. I don't need validation, comments, or arguments, only <tag>value</tag>, where value can be a number, a string, or another <tag>value</tag>.
So for example:
If it helps, I know the names of all tags that may occur, and I know how many sub-tags a given tag can hold. Is it possible to create a bison parser that would do something like this:
- new Tag("num", 1) // tag1
- new Tag("num", 5) // tag2
- new Tag("add", tag1, tag2) // tag3
- new Tag("num", 20) // tag4
- new Tag("mul", tag4, tag3)
- root = top_tag
Tag & number of sub-tags:
num: 1 (only value)
str: 1 (only value)
add | sub | mul | div: 2 (num | str | tag, num | str | tag)
Could you help me with a grammar that can build an AST like the one given above?

For your requirements, I think the yax system would work well.
From the README:
The goal of the yax project is to allow the use of YACC (Gnu Bison actually) to parse/process XML documents.
The key piece of software for achieving the above goal is to provide a library that can produce an XML lexical token stream from an XML document.
This stream can be wrapped to create an instance of yylex() to feed tokens to a Bison grammar to parse and process the XML document.
Using the stream plus a Bison grammar, it is possible to carry out at least the following kinds of activities.
Validate XML documents,
Directly parse XML documents to create internal data structures,
Construct DOM trees.

I do not think that bison/flex is the best tool for writing an XML parser.
If I had to do this job, I would do it by hand.
That said, the Flex code would contain these tokens:
NUM matches an integer in this example.
STR matches any string which does not contain a '<' or '>'.
STOP matches all closing tags.
START matches all opening tags.
"<?".*"?>"      { /* skip processing instructions such as <?xml ... ?> */ }
"<"[a-z]+">"    { return START; }
"</"[a-z]+">"   { return STOP; }
[0-9]+          { return NUM; }
[^><]+          { return STR; }
The Bison code will look like this:
simple_xml : START value STOP
           ;
value : NUM
      | STR
      | simple_xml
      | value simple_xml
      ;
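For comparison, here is a minimal Python sketch of the hand-written route mentioned above. It is only an illustration under stated assumptions: the token kinds mirror the START, STOP, NUM and STR tokens described in this answer, while the Tag class and the tokenize and parse functions are hypothetical names that do not come from the question.

```python
import re

# Token patterns mirroring the Flex rules. NUM only matches when the digits run
# up to the next tag, which approximates Flex's longest-match behaviour.
TOKEN_RE = re.compile(
    r"<(?P<START>[a-z]+)>|</(?P<STOP>[a-z]+)>|(?P<NUM>[0-9]+(?=<))|(?P<STR>[^<>]+)"
)


class Tag:
    """Hypothetical AST node: a tag name plus its children (values or Tags)."""

    def __init__(self, name, *children):
        self.name = name
        self.children = list(children)

    def __repr__(self):
        return "Tag(%r, %s)" % (self.name, ", ".join(map(repr, self.children)))


def tokenize(text):
    """Yield (kind, value) pairs; whitespace-only text between tags is dropped."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError("unexpected character at offset %d" % pos)
        pos = m.end()
        kind = m.lastgroup
        value = m.group(kind)
        if kind == "STR" and not value.strip():
            continue
        yield kind, value


def parse(text):
    """Recursive descent over: simple_xml : START value* STOP."""
    tokens = list(tokenize(text))
    pos = 0

    def element():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != "START":
            raise SyntaxError("expected an opening tag")
        name = tokens[pos][1]
        pos += 1
        children = []
        while pos < len(tokens) and tokens[pos][0] != "STOP":
            kind, value = tokens[pos]
            if kind == "START":
                children.append(element())
            else:
                children.append(int(value) if kind == "NUM" else value)
                pos += 1
        if pos == len(tokens) or tokens[pos][1] != name:
            raise SyntaxError("missing </%s>" % name)
        pos += 1
        return Tag(name, *children)

    root = element()
    if pos != len(tokens):
        raise SyntaxError("trailing input after the root element")
    return root
```

Calling parse("<mul><num>20</num><add><num>1</num><num>5</num></add></mul>") builds the nested Tag objects bottom-up, in the same order as the list in the question. The sketch does not check how many children each tag is allowed to have; that check could be added in element().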


What is the best way to handle overlapping lexer patterns that are sensitive to context?

I'm attempting to write an ANTLR grammar for parsing the C4 DSL. However, the DSL has a number of places where the grammar is very open-ended, resulting in overlapping lexer rules (in the sense that multiple token rules match).
For example, the workspace rule can have a child properties element defining <name> <value> pairs. This is a valid file:
workspace "Name" "Description" {
    properties {
        xyz "a string property"
        nonstring nodoublequotes
    }
}
The issue I'm running into is that the rules for <name> and <value> have to be very broad, basically anything except whitespace. Also, property values that contain spaces are written in double quotes and will match my STRING token.
My current solution is the grammar below, using property_value: BLOB | STRING; to match values and BLOB to match names. Is there a better way here? If I could make context-sensitive lexer tokens, I would make NAME and VALUE tokens instead. In the actual grammar I define case-insensitive name tokens for things like workspace and properties. This allows me to easily match the existing DSL semantics, but raises the wrinkle that a property name or value of workspace will tokenize to K_WORKSPACE.
grammar c4mce;
workspace : 'workspace' (STRING (STRING)?)? '{' NL workspace_body '}';
workspace_body : (workspace_element NL)* ;
workspace_element: 'properties' '{' NL (property_element NL)* '}';
property_element: BLOB property_value;
property_value : BLOB | STRING;
BLOB: [\p{Alpha}]+;
STRING: '"' (~('\n' | '\r' | '"' | '\\') | '\\\\' | '\\"')* '"';
NL: '\r'? '\n';
WS: [ \t]+ -> skip;
This tokenizes to
[#9,62:80='"a string property"',<STRING>,3:12]
This is all fine, and I suppose it's as much as the DSL grammar gives me. Is there a better way to handle situations like this?
As I expand the grammar I expect to have a lot of BLOB tokens simply because creating a narrower token in the lexer would be pointless because BLOB would match instead.
This is the classic keywords-as-identifiers problem. If a character sequence that is lexed as a keyword should also be usable as a normal identifier in certain places, then you have to list that keyword token as a possible alternative in those places. For example:
property_element: (BLOB | K_WORKSPACE) property_value;
property_value : BLOB | STRING | K_WORKSPACE;
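The same idea can be sketched outside ANTLR. The following Python snippet is only an illustration, not part of any C4 tooling: the lexer always classifies workspace and properties as keyword tokens, and the rule that reads a property simply accepts those keyword tokens wherever a name or value may appear. The token names K_WORKSPACE, K_PROPERTIES, BLOB and STRING follow the question; everything else (function names, the looser BLOB pattern) is an assumption made for the example.

```python
import re

# Keywords are matched case-insensitively, as described in the question.
KEYWORDS = {"workspace": "K_WORKSPACE", "properties": "K_PROPERTIES"}

# STRING allows only the \\ and \" escapes, like the STRING rule in the grammar.
TOKEN_RE = re.compile(
    r'[ \t]*(?:(?P<STRING>"(?:[^"\\\r\n]|\\[\\"])*")|(?P<BLOB>[^\s"{}]+))'
)

NAME_KINDS = {"BLOB"} | set(KEYWORDS.values())   # BLOB | K_WORKSPACE | ...
VALUE_KINDS = NAME_KINDS | {"STRING"}            # names plus quoted strings


def tokenize(line):
    """Context-free lexer: a keyword is always reported as a keyword token."""
    tokens = []
    line = line.strip()
    pos = 0
    while pos < len(line):
        m = TOKEN_RE.match(line, pos)
        if m is None:
            raise SyntaxError("unexpected character at offset %d" % pos)
        kind = m.lastgroup
        text = m.group(kind)
        if kind == "BLOB":
            kind = KEYWORDS.get(text.lower(), "BLOB")
        tokens.append((kind, text))
        pos = m.end()
    return tokens


def parse_property(line):
    """property_element: name value, where keywords are allowed in both slots."""
    tokens = tokenize(line)
    if len(tokens) != 2:
        raise SyntaxError("expected exactly a name and a value")
    (name_kind, name), (value_kind, value) = tokens
    if name_kind not in NAME_KINDS or value_kind not in VALUE_KINDS:
        raise SyntaxError("malformed property")
    return name, value
```

With this arrangement parse_property("xyz workspace") succeeds even though workspace was lexed as K_WORKSPACE, because the parser, not the lexer, decides what is acceptable in that position.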

How can I get mandatory whitespace in a specific rule while having other whitespace ignored?

The relevant part of my grammar is structured like this:
someRule: subrule1 | WS sign=('+' | '-') subrule2 ; // whitespace required here
// ... etc
WS: [ \t\r\n]+ -> channel(HIDDEN) ; // whitespace is usually ignored
I want to ignore whitespace, but require it in a specific rule. I'm pretty sure there was a way to do it in a previous ANTLR version (though I don't remember exactly, I think there was a syntax that allowed not hiding it in a specific rule). I don't know how to do it in ANTLR4, or if it can be done at all without using language-specific actions.
I thought about making WS a parser rule somehow, but I don't think that's the right approach...
(and obviously I don't want to put WS? everywhere in the grammar)
Is there a (preferably language-independent) way to either (a) ensure that a specific point has whitespace, or (b) ensure both ends on a specific point are not "touching" on that channel, or (c) selectively choose the WS channel (default or hidden) depending on which rule it appears in somehow?
I'm guessing (c) is impossible and (a|b) would require language-dependent actions, unless I'm missing something?
I don't believe there is any way to have parser rules evaluate tokens on the HIDDEN channel (or any channel other than 0). Maybe I'm missing something, but I couldn't find it.
The question I can't answer from your excerpts is whether there is another parser rule that should match if there is NOT a WS before your sign. That makes a big difference.
I tend to think of a successful grammar as one that will produce a parse tree that represents the correct way to interpret the input stream. IMHO, too many people complicate grammars by trying to encode "all the rules" into the grammar. If you have an accurate tree of the only way to interpret the input (whether it's "error free" or not), then you can write a Listener (or maybe a Visitor) that walks the tree and performs checks for additional rules (such as "the 'sign' must be preceded by whitespace").
This accomplishes a couple of things:
keeps the grammar simpler
allows you to be very specific in your error messages.
ANTLR is pretty good about error messages, for what information it has, but "expected WS, but saw '+'", is just not going to be as good an error message as "signs must follow whitespace".
With that in mind, you can get to the HIDDEN channel inside a listener.
First of all you'll need to make the token Stream available in your Listener:
class TestListener extends TestBaseListener {
    BufferedTokenStream tokens;

    public TestListener(BufferedTokenStream tokens) {
        this.tokens = tokens;
    }
    // ...
}
and pass it to the constructor of your listener:
CommonTokenStream tokens = new CommonTokenStream(lexer);
TestListener listener = new TestListener(tokens);
then, in the enter* method for whatever rule you need to add this to, you can do something like the following:
static final int HIDDEN = 1; // number of the hidden channel (Token.HIDDEN_CHANNEL)

public void enterAddSub(TxlParser.AddSubContext ctx) {
    Token op = ctx.op;
    int opIndex = op.getTokenIndex();
    // Returns null when no hidden-channel tokens sit directly to the left
    List<Token> hiddenChannel = tokens.getHiddenTokensToLeft(opIndex, HIDDEN);
    if (hiddenChannel != null) {
        Token ws = hiddenChannel.get(0);
        if (ws != null) {
            System.out.println("Found Ws (" + ws.getText() + ")");
        }
    } else {
        System.out.println("There was no WS to the left of the operator");
        // Your code here to add an error
    }
}
For reference, this is the rule I used for AddSub:
expr (MULT | DIV) expr # MulDiv
| lExpr = expr op = (PLUS | MINUS) rExpr = expr # AddSub
// ...
If I run this with input a=x+y I get:
There was no WS to the left of the operator
But the input a=x +y gives me:
Found Ws ( )
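The underlying idea does not depend on ANTLR: keep the whitespace tokens in the full token list, let the parser ignore them, and check adjacency afterwards. Here is a small Python sketch of just that check. It does not use the ANTLR runtime; the token kinds and function names are made up for the illustration.

```python
import re

# WS is kept as a token here; a parser would simply skip it (the "hidden channel").
TOKEN_RE = re.compile(
    r"(?P<WS>[ \t\r\n]+)|(?P<ID>[A-Za-z_]\w*)|(?P<NUM>\d+)|(?P<OP>[-+*/=])"
)


def tokenize(text):
    """Return every token, whitespace included, as (kind, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError("unexpected character at offset %d" % pos)
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def signs_without_leading_ws(text):
    """Return indices (into the full token list) of '+'/'-' tokens that are
    not immediately preceded by a whitespace token."""
    tokens = tokenize(text)
    bad = []
    for i, (kind, value) in enumerate(tokens):
        if kind == "OP" and value in "+-":
            if i == 0 or tokens[i - 1][0] != "WS":
                bad.append(i)
    return bad
```

For the two inputs above, signs_without_leading_ws("a=x+y") reports the operator at token index 3, while signs_without_leading_ws("a=x +y") returns an empty list. Each reported index can then be turned into a specific error message such as "signs must follow whitespace".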

Make lexer consider parser before determining tokens?

I'm writing a lexer and parser in ocamllex and ocamlyacc as follows. function_name and table_name are the same regular expression, i.e., a string containing only English letters. The only way to determine whether a string is a function_name or a table_name is to check its surroundings. For example, if such a string is surrounded by [ and ], then we know that it is a table_name. Here is the current code:
In lexer.mll,
... ...
let function_name = ['a'-'z' 'A'-'Z']+
let table_name = ['a'-'z' 'A'-'Z']+
rule token = parse
| function_name as s { FUNCTIONNAME s }
| table_name as s { TABLENAME s }
... ...
In parser.mly:
... ...
... ...
As I wrote | function_name as s { FUNCTIONNAME s } before | table_name as s { TABLENAME s }, the above code failed to parse [haha]; the lexer first treated haha as a function_name, and the parser then could not find any corresponding rule for it. If the lexer could treat haha as a table_name, the parser would match [haha] as a table.
One workaround for this is to be more precise in the lexer. For example, we define let table_name_with_brackets = '[' ['a'-'z' 'A'-'Z']+ ']' and | table_name_with_brackets as s { TABLENAMEWITHBRACKETS s } in the lexer. But I would like to know if there are any other options. Is it not possible to make the lexer and parser work together to determine the tokens and the reductions?
You should avoid trying to get the lexer to do the parser's work. The lexer should just identify lexemes; it should not try to figure out where a lexeme fits into the syntax. So in your (simplified) example, there should be only one lexical type, name. The parser will figure it out from there.
But it seems, from the comments, that in the unsimplified original, the two patterns are overlapping rather than identical. That's more annoying, although it's only slightly more complicated. Basically, you need to separate out the common pattern as one lexical type, and then add the additional matches as one or two other lexical types (depending on whether or not one pattern is a strict superset of the other).
That might not be too difficult, depending on the precise relationship between the two patterns. You might be able to find a very simple solution by writing the patterns in the correct order, for example, because of the longest match rule:
If several regular expressions match a prefix of the input, the “longest match” rule applies: the regular expression that matches the longest prefix of the input is selected. In case of tie, the regular expression that occurs earlier in the rule is selected.
Most of the time, that's all it takes: first define the intersection of the two patterns as a base lexeme, and then add the full lexical patterns of each contextual type to provide additional matches. Your parser will then have to match name | function_name in one context and name | table_name in the other context. But that's not too bad.
Where it will fail is when an input stream cannot be unambiguously divided into lexemes. For example, suppose that in a function context, a name could include a ? character, but in a table context the ? is a valid postfix operator. In that case, you have to actively prevent foo? from being analysed as a single token in the table context, which means that the lexer does have to be aware of parser context.
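To make the "one lexical type, let the parser decide" advice concrete, here is a small Python sketch (Python rather than OCaml, purely for illustration). The lexer emits a single NAME token for every run of letters, and the parser classifies it from its surroundings. The assumption that a bare name is a function name and a bracketed one is a table name is a simplification made for the example, as are all the function and token names.

```python
import re

# One lexical type for names; brackets are separate tokens.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<NAME>[A-Za-z]+)|(?P<LBRACKET>\[)|(?P<RBRACKET>\]))"
)


def tokenize(text):
    """Return (kind, text) pairs; the lexer never guesses table vs. function."""
    tokens = []
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError("unexpected character at offset %d" % pos)
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens


def parse(text):
    """Classify each NAME by context: '[' NAME ']' is a table, a bare NAME is
    treated as a function name (an assumption made for this sketch)."""
    tokens = tokenize(text)
    result = []
    i = 0
    while i < len(tokens):
        kind, value = tokens[i]
        if kind == "NAME":
            result.append(("function", value))
            i += 1
        elif [k for k, _ in tokens[i:i + 3]] == ["LBRACKET", "NAME", "RBRACKET"]:
            result.append(("table", tokens[i + 1][1]))
            i += 3
        else:
            raise SyntaxError("unexpected token %r" % value)
    return result
```

Here parse("[haha]") yields [("table", "haha")] without the lexer ever having to choose between FUNCTIONNAME and TABLENAME.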

Parse a block where each line starts with a specific symbol

I need to parse a block of code which looks like this:
* Block
| Line 1
| Line 2
| ...
It is easy to do:
block : head lines;
head : '*' line;
lines : lines '|' line
| '|' line
Now I wonder, how can I add nested blocks, e.g.:
* Block
| Line 1
| * Subblock
| | Line 1.1
| | ...
| Line 2
| ...
Can this be expressed as a LALR grammar?
I can, of course, parse the top-level blocks and then run my parser again to deal with each of these top-level blocks. However, I'm just learning this topic, so I'm interested in avoiding such an approach.
The nested-block language is not context-free [Note 2], so it cannot be parsed with an LALR(k) parser.
However, nested parenthesis languages are, of course, context-free and it is relatively easy to transform the input into a parenthetic form by replacing the initial | sequences in the lexical scanner. The transformation is simple:
when the initial sequence of |s is longer than that of the previous line, insert a BEGIN_BLOCK. (The initial sequence must be exactly one | longer; otherwise it is presumably a syntax error.)
when the initial sequence of |s is shorter than that of the previous line, enough END_BLOCKs are inserted to bring the expected length to the correct value.
The |s themselves are not passed through to the parser.
This is very similar to the INDENT/DEDENT strategy used to parse layout-aware languages like Python and Haskell. The main difference is that here we don't need a stack of indent levels.
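As a rough illustration of that transformation, separate from any flex implementation, here is a Python sketch that turns the bar-prefixed lines into a flat token list. BEGIN_BLOCK and END_BLOCK are the tokens named above; the HEAD and LINE token kinds and the function name are assumptions made for the example.

```python
def to_block_tokens(text):
    """Replace leading '|' prefixes with BEGIN_BLOCK / END_BLOCK tokens.

    Returns a list of (kind, payload) pairs where kind is one of
    'BEGIN_BLOCK', 'END_BLOCK', 'HEAD' (a '*' line) or 'LINE'.
    """
    tokens = []
    depth = 0  # number of '|' prefixes on the previous line
    for raw in text.splitlines():
        if not raw.strip():
            continue  # ignore blank lines
        # Count the leading bars, each optionally followed by blanks.
        rest = raw
        new_depth = 0
        while rest.startswith("|"):
            new_depth += 1
            rest = rest[1:].lstrip(" \t")
        if new_depth > depth + 1:
            raise SyntaxError("block nested more than one level at once: %r" % raw)
        if new_depth == depth + 1:
            tokens.append(("BEGIN_BLOCK", None))
        for _ in range(depth - new_depth):  # empty range when not dedenting
            tokens.append(("END_BLOCK", None))
        depth = new_depth
        # The bars themselves are dropped; only the line content is passed on.
        if rest.startswith("*"):
            tokens.append(("HEAD", rest[1:].strip()))
        else:
            tokens.append(("LINE", rest.strip()))
    # Close any blocks still open at the end of the input.
    tokens.extend([("END_BLOCK", None)] * depth)
    return tokens
```

For the nested example in the question (with the "..." lines left out), the token kinds come out as HEAD, BEGIN_BLOCK, LINE, HEAD, BEGIN_BLOCK, LINE, END_BLOCK, LINE, END_BLOCK, which is a properly parenthesized sequence that a context-free grammar can handle. Only a single depth counter is needed, not a stack.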
Once that transformation is finished, the grammar will look something like:
content: /* empty */
| content line
| content block
block : head BEGIN_BLOCK content END_BLOCK
| head
head : '*' line
A rough outline of a flex implementation would be something like this: (see Note 1, below).
%x INDENT CONTENT
%%
  static int depth = 0, new_depth = 0;
  /* Handle pending END_BLOCKs */
  send_end:
    if (new_depth < depth) {
      --depth;
      return END_BLOCK;
    }
^"|"[[:blank:]]*   { new_depth = 1; BEGIN(INDENT); }
^.                 { new_depth = 0; yyless(0); BEGIN(CONTENT);
                     goto send_end; }
^\n                /* Ignore blank lines */
<INDENT>{
  "|"[[:blank:]]*  ++new_depth;
  .                { yyless(0); BEGIN(CONTENT);
                     if (new_depth > depth) {
                       ++depth;
                       if (new_depth > depth) { /* Report syntax error */ }
                       return BEGIN_BLOCK;
                     } else goto send_end;
                   }
  \n               BEGIN(INITIAL); /* Maybe you care about this blank line? */
}
  /* Put whatever you use here to lexically scan the lines */
Note 1: Not everyone will be happy with the goto, but it saves some code duplication. The fact that the state variables (depth and new_depth) are local static variables makes the lexer non-reentrant and non-restartable (after an error). That's only acceptable for toy code; for anything real, you should make the lexical scanner re-entrant and put the state variables into the extra data structure.
Note 2: The terms "context-free" and "context-sensitive" are technical descriptions of grammars, and are therefore a bit misleading. Intuitions based on what the words seem to mean are often wrong. One very common source of context-sensitivity is a language where validity depends on two different derivations of the same non-terminal producing the same token sequence. (Assuming the non-terminal could derive more than one token sequence; otherwise, the non-terminal could be eliminated.)
There are lots of examples of such context-sensitivity in normal programming languages; usually, the grammar will allow these constructs and the check will be performed later in some semantic analysis phase. These include the requirement that an identifier be declared (two derivations of IDENTIFIER produce the same string) or the requirement that a function be called with the correct number of parameters (here, it is only necessary that the length of the derivations of the non-terminals match, but that is sufficient to trigger context-sensitivity).
In this case, the requirement is that two instances of what might be called bar-prefix in consecutive lines produce the same string of |s. In this case, since the effect is really syntactic, deferring to a later semantic analysis defeats the point of parsing. Whether the other examples of context-sensitivity are "syntactic" or "semantic" is a debate which produces a surprising amount of heat without casting much light on the discussion.
If you write an explicit end-of-block token, things become clearer:
|Line 1
| line 1.1
| line 1.2
|Line 2
and the grammar becomes:
block : '*' ID '{' lines '}'
lines : /* empty */
      | lines '|' line
      | lines block

Can nested parentheticals be parsed in chemical formulae?

I am trying to create a parser for simple chemical formulae. Meaning, they have no states of matter, charge, or anything like that. The formulae only have strings representing compounds, quantities, and parentheses.
Following this answer to a similar question, and some rudimentary knowledge of discrete math, I hoped that I could write a simple Recursive Descent Parser to generate the number of each atom inside of the formula. I already have a really simple answer for this that involves single parentheses, but not nested parentheses.
Here are the productions of the grammar without parentheses:
Compound: Component { Component };
Component: Atom [Quantity]
Atom: 'H' | 'He' | 'Li' | 'Be' ...
Quantity: Digit { Digit }
Digit: '0' | '1' | ... '9'
[...] is read as optional, and will be an if test in the program (either it is there or missing)
| is alternatives, and so is an if .. else if .. else or switch 'test', it is saying the input must match one of these
{ ... } is read as repetition of 0 or more, and will be a while loop in the program
Characters between quotes are literal characters which will be in the string. All the other words are names of rules, and for a recursive descent parser, end up being the names of the functions which get called to chop up, and handle the input.
With nested parentheses, I have no idea what to do. By nested parentheses I mean something like (Fe2(OH)2(H2O)8)2, or something fictitious and complicated like (Ab(CD2(Ef(G2H)3)(IJ2)4)3)2
Because now there is a production that I don't really understand how to articulate, but here is my best attempt:
Parenthetical: Compound { Parenthetical } [Quantity]
So the basic rules parse any simple sequence of chemical symbols and quantities without parentheses.
I assume the Quantity is defining the quantity of the whole chunk of stuff between '(' ... ')'.
So, '(' ... ')' [Quantity] needs to be parsed as exactly the same thing as the Component, i.e. as an alternative to: Atom [Quantity]
So the only thing to change is the Component rule; it becomes:
Component: Atom [Quantity] | '(' Compound ')' [Quantity]
In the code function (or procedure) which is parsing Component, it will have a look at the next character (token), and if it is an '(', it will consume it, then call the function (or procedure) responsible for parsing Compound, and after that, check the next character (token) is a ')' (if not, it's a syntax error), then handle the optional Quantity, and then it is finished.
I am assuming you are using a programming language which supports recursive function (or procedure) calls. That housekeeping, done by code behind the scenes for your program, will make this 'just work' (TM).
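Here is a minimal Python sketch of a recursive descent parser for the extended grammar, with one function per rule. It is an illustration under a simplifying assumption: instead of a table of real element symbols, an Atom is taken to be one uppercase letter followed by any number of lowercase letters. The function names are made up for the example.

```python
from collections import Counter


def parse_formula(formula):
    """Return a dict mapping each atom symbol to its total count."""
    pos = 0

    def peek():
        # Next character, or "" at the end of the input.
        return formula[pos] if pos < len(formula) else ""

    def quantity():
        # Quantity: Digit { Digit }   (optional; defaults to 1)
        nonlocal pos
        start = pos
        while peek().isdecimal():
            pos += 1
        return int(formula[start:pos]) if pos > start else 1

    def atom():
        # Atom: uppercase letter followed by lowercase letters (assumption).
        nonlocal pos
        start = pos
        pos += 1
        while peek().islower():
            pos += 1
        return formula[start:pos]

    def component():
        # Component: Atom [Quantity] | '(' Compound ')' [Quantity]
        nonlocal pos
        if peek() == "(":
            pos += 1
            counts = compound()  # the recursion handles arbitrary nesting
            if peek() != ")":
                raise SyntaxError("expected ')' at offset %d" % pos)
            pos += 1
        elif peek().isupper():
            counts = Counter({atom(): 1})
        else:
            raise SyntaxError("unexpected input at offset %d" % pos)
        n = quantity()
        return Counter({symbol: count * n for symbol, count in counts.items()})

    def compound():
        # Compound: Component { Component }
        counts = component()
        while peek() == "(" or peek().isupper():
            counts += component()
        return counts

    total = compound()
    if pos != len(formula):
        raise SyntaxError("unexpected input at offset %d" % pos)
    return dict(total)
```

For example, parse_formula("(Fe2(OH)2(H2O)8)2") gives 4 Fe, 20 O and 36 H. The only place the parentheses appear is in component(), exactly as described above; the call stack does the bookkeeping for the nesting.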
Alternatively, you could solve the problem in a different way. Add a new rule, which says:
Stuff: Atom | '(' Compound ')'
Then modify the Component rule:
Component: Stuff [Quantity]
Then write a new function (or procedure) for Stuff, and change the Component code to simply call Stuff, then handle the optional Quantity.
There are good technical reasons for doing this to support some parsing technology. However you're using recursive descent where it won't really matter.
The type of grammar which works very well for a recursive descent parser is called LL(1): the input is parsed from left to right, producing a leftmost derivation, using one token of lookahead. That is a 'natural' way to parse when the code and its function calls are the control flow. To find the theory of how to check that a grammar is LL(1), search the web for "parsing LL(1)" or "grammar follow sets".
It is pretty uncommon to see nested brackets in chemical formulae. But maybe, for instance, ammonium carbonate and barium nitrate in a 2:3 ratio could be written as "( (NH4)2 CO3)2 ( Ba(NO3)2 )3".
I found that a right-to-left parser that pushes the multiplier onto a multiplier stack worked really well for me:
double multiplier[8];
double num = 1.0;
int multdepth = 0;
multiplier[0] = 1;
char molecule[1024]; // contains molecular formula
// readnum and parseatom are helpers (not shown); they are assumed to return
// the index of the leftmost character they consumed.
// parse the molecular formula right-to-left whilst keeping track of multiplier
for (int i = strlen(molecule) - 1; i >= 0; i--)
{
    if (isdigit(molecule[i]) || molecule[i] == '.')
        i = readnum(i, &num);
    if (isalpha(molecule[i]))
    {
        i = parseatom(i, num * multiplier[multdepth]);
        num = 1.0; // need to reset the multiplier here
    }
    if (molecule[i] == ')')
    {
        // entering a bracketed group: push its combined multiplier
        multdepth++;
        multiplier[multdepth] = num * multiplier[multdepth - 1];
        num = 1.0;
    }
    if (molecule[i] == '(')
    {
        // leaving the group: pop the multiplier
        multdepth--;
        if (multdepth < 0)
            error("Opening bracket not terminated");
    }
}
