Can nested parentheticals be parsed in chemical formulae? - parsing

I am trying to create a parser for simple chemical formulae. Meaning, they have no states of matter, charge, or anything like that. The formulae only have strings representing compounds, quantities, and parentheses.
Following this answer to a similar question, and some rudimentary knowledge of discrete math, I hoped that I could write a simple Recursive Descent Parser to generate the number of each atom inside of the formula. I already have a really simple answer for this that involves single parentheses, but not nested parentheses.
Here are the productions of the grammar without parentheses:
Compound: Component { Component };
Component: Atom [Quantity]
Atom: 'H' | 'He' | 'Li' | 'Be' ...
Quantity: Digit { Digit }
Digit: '0' | '1' | ... '9'
[...] is read as optional, and will be an if test in the program (either it is there or missing)
| is alternatives, and so is an if .. else if .. else or switch 'test', it is saying the input must match one of these
{ ... } is read as repetition of 0 or more, and will be a while loop in the program
Characters between quotes are literal characters which will be in the string. All the other words are names of rules, and for a recursive descent parser, end up being the names of the functions which get called to chop up, and handle the input.
With nested parentheses, I have no idea what to do. By nested parentheses I mean something like (Fe2(OH)2(H2O)8)2, or something fictitious and complicated like (Ab(CD2(Ef(G2H)3)(IJ2)4)3)2
Because now there is a production that I don't really understand how to articulate, but here is my best attempt:
Parenthetical: Compound { Parenthetical } [Quantity]

So the basic rules parse any simple sequence of chemical symbols and quantities without parenthesis.
I assume the Quantity is defining the quantity of the whole chunk of stuff between '(' ... ')'
So, '(' ... ') [Quantity] needs to be parsed as exactly the same thing as the Component, i.e. as an alternative to: Atom [Quantity]
So the only thing to change is the Component rule; it becomes:
Component: Atom [Quantity] | '(' Compound ')' [Quantity]
In the code function (or procedure) which is parsing Component, it will have a look at the next character (token), and if it is an '(', it will consume it, then call the function (or procedure) responsible for parsing Compound, and after that, check the next character (token) is a ')' (if not, it's a syntax error), then handle the optional Quantity, and then it is finished.
I am assuming you are using a programming language which supports recursive function (or procedure) calls. That housekeeping, done by code behind the scenes for your program, will make this 'just work' (TM).
Alternatively, you could solve the problem in a different way. Add a new rule, which says:
Stuff: Atom | '(' Compound ')'
Then modify the rule:
Compound: Stuff [Quantity]
Then write a new function (or procedure) for Stuff, and change the Compound code to simply call Stuff, then handle the optional Quantity.
There are good technical reasons for doing this to support some parsing technology. However you're using recursive descent where it won't really matter.
Edit:
The type of grammar which works very well for a recursive decent parser is called LL(1), which means parse from left-to-right, and create the left-most derivation. That is a 'natural' way to parse when the code and function calls is the control flow. To find the theory of how to check grammars are LL(1) search the web for "parsing LL(1)" or "grammar follow sets".

It is pretty uncommon to see nested brackets in chemical formula. But maybe, for instance ammonium carbonate and barium nitrate in a 2:3 ratio could be written as "( (NH4)2 CO3)2 ( Ba(NO3)2 )3"
I found a right-to-left parser that pushes the multiplier onto a multiplier stack worked really well for me:
double multiplier[8];
double num = 1.0;
int multdepth = 0;
multiplier[0] = 1;
char molecule[1024]; // contains molecular formula
//parse the molecular formula right-to-left whilst keeping track of multiplier
for (int i = strlen(molecule) - 1; i >= 0; i--)
{
if (isdigit(molecule[i]) || molecule[i] == '.')
i = readnum(i, &num);
if (isalpha(molecule[i]))
{
i = parseatom(i, num * multiplier[multdepth]);
num = 1.0; // need to reset the multiplier here
}
if (molecule[i] == ')')
{
multdepth++;
multiplier[multdepth] = num * multiplier[multdepth - 1];
num = 1.0;
}
if (molecule[i] == '(')
{
multdepth--;
if (multdepth < 0)
error("Opening bracket not terminated");
}
}

Related

How to parse dot operator in language syntax?

Let's say I'm writing a parser that parses the following syntax:
foo.bar().baz = 5;
The grammar rules look something like this:
program: one or more statement
statement: expression followed by ";"
expression: one of:
- identifier (\w+)
- number (\d+)
- func call: expression "(" ")"
- dot operator: expression "." identifier
Two expressions have a problem, the func call and the dot operator. This is because the expressions are recursive and look for another expression at the start, causing a stack overflow. I will focus on the dot operator for this quesition.
We face a similar problem with the plus operator. However, rather than using an expression you would do something like this to solve it (look for a "term" instead):
add operation: term "+" term
term: one of:
- number (\d+)
- "(" expression ")"
The term then includes everything except the add operation itself. To ensure that multiple plus operators can be chained together without using parenthesis, one would rather do:
add operation: term, one or more of ("+" followed by term)
I was thinking a similar solution could for for the dot operator or for function calls.
However, the dot operator works a little differently. We always evaluate from left-to-right and need to allow full expressions so that you can do function calls etc. in-between. With parenthesis, an example might be:
(foo.bar()).baz = 5;
Unfortunately, I do not want to require parenthesis. This would end up being the case if following the method used for the plus operator.
How could I go about implementing this?
Currently my parser never peeks ahead, but even if I do look ahead, it still seems tricky to accomplish.
The easy solution would be to use a bottom-up parser which doesn't drop into a bottomless pit on left recursion, but I suppose you have already rejected that solution.
I don't understand your objection to using a looping construct, though. Postfix modifiers like field lookup and function call are not really different from binary operators like addition (except, of course, for the fact that they will not need to claim an eventual right operand). Plus and minus intermingle freely, which you can parse with a repetition like:
additive: term ( '+' term | '-' term )*
Similarly, postfix modifiers can be easily parsed with something like:
postfixed: atom ( '.' ID | '(' opt-expr-list `)` )*
I'm using a form of extended BNF: parentheses group; | separates alternatives and binds less stringly than concatenation; and * means "zero or more repetitions" of the atom on its left.
Another postfix operator which falls into the same category is array/map subscripting ('[' expr ']'), although you might also have other postfix operators.
Note that like the additive syntax above, selecting the appropriate alternative does not require looking beyond the next token. It's hard to parse without being able to peek one token into the future. Fortunately, that's very little overhead.
One way could be for the dot operator to parse a non-dot expression, that is, a rule that is the same as expression but without the dot operator. This prevents recursion.
Then, when the non-dot expression has been parsed, check if a dot and an identifier follows. If this is not the case, we are done. If this is the case, wrap the current node up in a dot operation node. Then, keep track of the entire string text that has been parsed for this operation so far. Then revert everything back to before the operation was being parsed, and now re-parse a "custom expression", where the first directly-nested expression would really be trying to match the exact string that was parsed before rather than a real expression. Repeat until there are no more dot-identifier pairs (this should happen automatically by the new "custom expression").
This is messy, complicated and possibly slow, and I'm not entirely sure if it'll work but I'll try it out. I'd appreciate alternative solutions.

Parse a block where each line starts with a specific symbol

I need to parse a block of code which looks like this:
* Block
| Line 1
| Line 2
| ...
It is easy to do:
block : head lines;
head : '*' line;
lines : lines '|' line
| '|' line
;
Now I wonder, how can I add nested blocks, e.g.:
* Block
| Line 1
| * Subblock
| | Line 1.1
| | ...
| Line 2
| ...
Can this be expressed as a LALR grammar?
I can, of course, parse the top-level blocks and than run my parser again to deal with each of these top-level blocks. However, I'm just learning this topic so it's interesting for me to avoid such approach.
The nested-block language is not context-free [Note 2], so it cannot be parsed with an LALR(k) parser.
However, nested parenthesis languages are, of course, context-free and it is relatively easy to transform the input into a parenthetic form by replacing the initial | sequences in the lexical scanner. The transformation is simple:
when the initial sequence of |s is longer than the previous line, insert an BEGIN_BLOCK. (The initial sequence must be exactly one | longer; otherwise it is presumably a syntax error.)
when the initial sequence of |s, is shorter then the previous line, enough END_BLOCKs are inserted to bring the expected length to the correct value.
The |s themselves are not passed through to the parser.
This is very similar to the INDENT/DEDENT strategy used to parse layout-aware languages like Python an Haskell. The main difference is that here we don't need a stack of indent levels.
Once that transformation is finished, the grammar will look something like:
content: /* empty */
| content line
| content block
block : head BEGIN_BLOCK content END_BLOCK
| head
head : '*' line
A rough outline of a flex implementation would be something like this: (see Note 1, below).
%x INDENT CONTENT
%%
static int depth = 0, new_depth = 0;
/* Handle pending END_BLOCKs */
send_end:
if (new_depth < depth) {
--depth;
return END_BLOCK;
}
^"|"[[:blank:]]* { new_depth = 1; BEGIN(INDENT); }
^. { new_depth = 0; yyless(0); BEGIN(CONTENT);
goto send_end; }
^\n /* Ignore blank lines */
<INDENT>{
"|"[[:blank:]]* ++new_depth;
. { yyless(0); BEGIN(CONTENT);
if (new_depth > depth) {
++depth;
if (new_depth > depth) { /* Report syntax error */ }
return BEGIN_BLOCK;
} else goto send_end;
}
\n BEGIN(INITIAL); /* Maybe you care about this blank line? */
}
/* Put whatever you use here to lexically scan the lines */
<CONTENT>{
\n BEGIN(INITIAL);
}
Notes:
Not everyone will be happy with the goto but it saves some code-duplication. The fact that the state variable (depth and new_depth) are local static variables makes the lexer non-reentrant and non-restartable (after an error). That's only useful for toy code; for anything real, you should make the lexical scanner re-entrant and put the state variables into the extra data structure.
The terms "context-free" and "context-sensitive" are technical descriptions of grammars, and are therefore a bit misleading. Intuitions based on what the words seem to mean are often wrong. One very common source of context-sensitivity is a language where validity depends on two different derivations of the same non-terminal producing the same token sequence. (Assuming the non-terminal could derive more than one token sequence; otherwise, the non-terminal could be eliminated.)
There are lots of examples of such context-sensitivity in normal programming languages; usually, the grammar will allow these constructs and the check will be performed later in some semantic analysis phase. These include the requirement that an identifier be declared (two derivations of IDENTIFIER produce the same string) or the requirement that a function be called with the correct number of parameters (here, it is only necessary that the length of the derivations of the non-terminals match, but that is sufficient to trigger context-sensitivity).
In this case, the requirement is that two instances of what might be called bar-prefix in consecutive lines produce the same string of |s. In this case, since the effect is really syntactic, deferring to a later semantic analysis defeats the point of parsing. Whether the other examples of context-sensitivity are "syntactic" or "semantic" is a debate which produces a surprising amount of heat without casting much light on the discussion.
If you write an explicit end-of-block token, things become clearer:
*Block{
|Line 1
*SubBlock{
| line 1.1
| line 1.2
}
|Line 2
|...
}
and grammar becomes:
block : '*' ID '{' lines '}'
lines : lines '|' line
| lines block
|

bison and grammar: replaying the parse stack

I have not messed with building languages or parsers in a formal way since grad school and have forgotten most of what I knew back then. I now have a project that might benefit from such a thing but I'm not sure how to approach the following situation.
Let's say that in the language I want to parse there is a token that means "generate a random floating point number" in an expression.
exp: NUMBER
{$$ = $1;}
| NUMBER PLUS exp
{$$ = $1 + $3;}
| R PLUS exp
{$$ = random() + $3;}
;
I also want a "list" generating operator that will reevaluate an "exp" some number of times. Maybe like:
listExp: NUMBER COLON exp
{
for (int i = 0; i < $1; i++) {
print $3;
}
}
;
The problem I see is that "exp" will have already been reduced by the time the loop starts. If I have the input
2 : R + 2
then I think the random number will be generated as the "exp" is parsed and 2 added to it -- lets say the result is 2.0055. Then in the list expression I think 2.0055 would be printed out twice.
Is there a way to mark the "exp" before evaluation and then parse it as many times as the list loop count requires? The idea being to get a different random number in each evaluation.
Your best bet is to build an AST and evaluate the entire AST at the end of the parse. In-line evaluation is only possible for very simple (i.e. "calculator-like") projects.
Instead of an AST, you could construct code for a stack- or three-address- virtual machine. That's generally more efficient, particularly if you intend to execute the code frequently, but the AST is a lot simpler to construct, and executing it is a single depth-first scan.
Depending on your language design there are at least 5 different points at which a token in the language could be bound to a value. They are:
Pre-processor (like C #define)
Lexer: recognise tokens
Parser: recognise token structure, output AST
Semantic analysis: analyse AST, assign types and conversions etc
Code generation: output executable code or execute code directly.
If you have a token that can occur multiple times and you want to assign it a different random value each time, then phase 4 is the place to do it. If you generate an AST, walk the tree and assign the values. If you go straight to code generation (or an interpreter) do it then.

Parsing optional semicolon at statement end

I was writing a parser to parse C-like grammars.
First, it could now parse code like:
a = 1;
b = 2;
Now I want to make the semicolon at the end of line optional.
The original YACC rule was:
stmt: expr ';' { ... }
Where the new line is processed by the lexer that written by myself(the code are simplified):
rule(/\r\n|\r|\n/) { increase_lineno(); return :PASS }
the instruction :PASS here is equivalent to return nothing in LEX, which drop current matched text and skip to the next rule, just like what is usually done with whitespaces.
Because of this, I can't just simply change my YACC rule into:
stmt: expr end_of_stmt { ... }
;
end_of_stmt: ';'
| '\n'
;
So I chose to change the lexer's state dynamically by the parser correspondingly.
Like this:
stmt: expr { state = :STATEMENT_END } ';' { ... }
And add a lexer rule that can match new line with the new state:
rule(/\r\n|\r|\n/, :STATEMENT_END) { increase_lineno(); state = nil; return ';' }
Which means when the lexer is under :STATEMENT_END state. it will first increase the line number as usual, and then set the state into initial one, and then pretend itself is a semicolon.
It's strange that it doesn't actually work with following code:
a = 1
b = 2
I debugged it and got it is not actually get a ';' as expect when scanned the newline after the number 1, and the state specified rule is not really executed.
And the code to set the new state is executed after it already scanned the new line and returned nothing, that means, these works is done as following order:
scan a, = and 1
scan newline and skip, so get the next value b
the inserted code({ state = :STATEMENT_END }) is executed
raising error -- unexpected b here
This is what I expect:
scan a, = and 1
found that it matches the rule expr, so reduce into stmt
execute the inserted code to set the new lexer state
scan the newline and return a ; according the new state matching rule
continue to scan & parse the following line
After introspection I found that might caused as YACC uses LALR(1), this parser will read forward for one token first. When it scans to there, the state is not set yet, so it cannot get a correct token.
My question is: how to make it work as expected? I have no idea on this.
Thanks.
The first thing to recognize is that having optional line terminators like this introduces ambiguity into your language, and so you first need to decide which way you want to resolve the ambiguity. In this case, the main ambiguity comes from operators that may be either infix or prefix. For example:
a = b
-c;
Do you want to treat the above as a single expr-statement, or as two separate statements with the first semicolon elided? A similar potential ambiguity occurs with function call syntax in a C-like language:
a = b
(c);
If you want these to resolve as two statements, you can use the approach you've tried; you just need to set the state one token earlier. This gets tricky as you DON'T want to set the state if you have unclosed parenthesis, so you end up needing an additional state var to record the paren nesting depth, and only set the insert-semi-before-newline state when that is 0.
If you want to resolve the above cases as one statement, things get tricky, as you actually need more lookahead to decide when a newline should end a statement -- at the very least you need to look at the token AFTER the newline (and any comments or other ignored stuff). In this case you can have the lexer do the extra lookahead. If you were using flex (which you're apparently not?), I would suggest either using the / operator (which does lookahead directly), or defer returning the semicolon until the lexer rule that matches the next token.
In general, when doing this kind of token state recording, I find it easiest to do it entirely within the lexer where possible, so you don't need to worry about the extra token of lookahead sometimes (but not always) done by the parser. In this specific case, an easy approach would be to have the lexer record the parenthesis seen (+1 for (, -1 for )), and the last token returned. Then, in the newline rule, if the paren level is 0 and the last token was something that could end an expression (ID or constant or ) or postfix-only operator), return the extra ;
An alternate approach is to have the lexer return NEWLINE as its own token. You would then change the parser to accept stmt: expr NEWLINE as well as optional newlines between most other tokens in the grammar. This exposes the ambiguity directly to the parser (its now not LALR(1)), so you need to resolve it either by using yacc's operator precedence rules (tricky and error prone), or using something like bison's %glr-parser option or btyacc's backtracking ability to deal with the ambiguity directly.
What you are attempting is certainly possible.
Ruby, in fact, does exactly this, and it has a yacc parser. Newlines soft-terminate statements, semicolons are optional, and statements are automatically continued on multiple lines "if they need it".
Communicating between the parser and lexical analyzer may be necessary, and yes, legacy yacc is LALR(1).
I don't know exactly how Ruby does it. My guess has always been that it doesn't actually communicate (much) but rather the lexer recognizes constructs that obviously aren't finished and silently just treats newlines as spaces until the parens and brackets balance. It must also notice when lines end with binary operators or commas and eat those newlines too.
Just a guess, but I believe this technique would work. And Ruby is open source... if you want to see exactly how Matz did it.

Gold Parsing System - What can it be used for in programming?

I have read the GOLD Homepage ( http://www.devincook.com/goldparser/ ) docs, FAQ and Wikipedia to find out what practical application there could possibly be for GOLD. I was thinking along the lines of having a programming language (easily) available to my systems such as ABAP on SAP or X++ on Axapta - but it doesn't look feasible to me, at least not easily - even if you use GOLD.
The final use of the parsed result produced by GOLD escapes me - what do you do with the result of the parse?
EDIT: A practical example (description) would be great.
Parsing really consists of two phases. The first is "lexing", which convert the raw strings of character in to something that the program can more readily understand (commonly called tokens).
Simple example, lex would convert:
if (a + b > 2) then
In to:
IF_TOKEN LEFT_PAREN IDENTIFIER(a) PLUS_SIGN IDENTIFIER(b) GREATER_THAN NUMBER(2) RIGHT_PAREN THEN_TOKEN
The parse takes that stream of tokens, and attempts to make yet more sense out of them. In this case, it would try and match up those tokens to an IF_STATEMENT. To the parse, the IF _STATEMENT may well look like this:
IF ( BOOLEAN_EXPRESSION ) THEN
Where the result of the lexing phase is a token stream, the result of the parsing phase is a Parse Tree.
So, a parser could convert the above in to:
if_statement
|
v
boolean_expression.operator = GREATER_THAN
| |
| v
V numeric_constant.string="2"
expression.operator = PLUS_SIGN
| |
| v
v identifier.string = "b"
identifier.string = "a"
Here you see we have an IF_STATEMENT. An IF_STATEMENT has a single argument, which is a BOOLEAN_EXPRESSION. This was explained in some manner to the parser. When the parser is converting the token stream, it "knows" what a IF looks like, and know what a BOOLEAN_EXPRESSION looks like, so it can make the proper assignments when it sees the code.
For example, if you have just:
if (a + b) then
The parser could know that it's not a boolean expression (because the + is arithmetic, not a boolean operator) and the parse could throw an error at this point.
Next, we see that a BOOLEAN_EXPRESSION has 3 components, the operator (GREATER_THAN), and two sides, the left side and the right side.
On the left side, it points to yet another expression, the "a + b", while on the right is points to a NUMERIC_CONSTANT, in this case the string "2". Again, the parser "knows" this is a NUMERIC constant because we told it about strings of numbers. If it wasn't numbers, it would be an IDENTIFIER (like "a" and "b" are).
Note, that if we had something like:
if (a + b > "XYZ") then
That "parses" just fine (expression on the left, string constant on the right). We don't know from looking at this whether this is a valid expression or not. We don't know if "a" or "b" reference Strings or Numbers at this point. So, this is something the parser can't decided for us, can't flag as an error, as it simply doesn't know. That will happen when we evaluate (either execute or try to compile in to code) the IF statement.
If we did:
if [a > b ) then
The parser can readily see that syntax error as a problem, and will throw an error. That string of tokens doesn't look like anything it knows about.
So, the point being that when you get a complete parse tree, you have some assurance that at first cut the "code looks good". Now during execution, other errors may well come up.
To evaluate the parse tree, you just walk the tree. You'll have some code associated with the major nodes of the parse tree during the compile or evaluation part. Let's assuming that we have an interpreter.
public void execute_if_statment(ParseTreeNode node) {
// We already know we have a IF_STATEMENT node
Value value = evaluate_expression(node.getBooleanExpression());
if (value.getBooleanResult() == true) {
// we do the "then" part of the code
}
}
public Value evaluate_expression(ParseTreeNode node) {
Value result = null;
if (node.isConstant()) {
result = evaluate_constant(node);
return result;
}
if (node.isIdentifier()) {
result = lookupIdentifier(node);
return result;
}
Value leftSide = evaluate_expression(node.getLeftSide());
Value rightSide = evaluate_expression(node.getRightSide());
if (node.getOperator() == '+') {
if (!leftSide.isNumber() || !rightSide.isNumber()) {
throw new RuntimeError("Must have numbers for adding");
}
int l = leftSide.getIntValue();
int r = rightSide.getIntValue();
int sum = l + r;
return new Value(sum);
}
if (node.getOperator() == '>') {
if (leftSide.getType() != rightSide.getType()) {
throw new RuntimeError("You can only compare values of the same type");
}
if (leftSide.isNumber()) {
int l = leftSide.getIntValue();
int r = rightSide.getIntValue();
boolean greater = l > r;
return new Value(greater);
} else {
// do string compare instead
}
}
}
So, you can see that we have a recursive evaluator here. You see how we're checking the run time types, and performing the basic evaluations.
What will happen is the execute_if_statement will evaluate it's main expression. Even tho we wanted only BOOLEAN_EXPRESION in the parse, all expressions are mostly the same for our purposes. So, execute_if_statement calls evaluate_expression.
In our system, all expressions have an operator and a left and right side. Each side of an expression is ALSO an expression, so you can see how we immediately try and evaluate those as well to get their real value. The one note is that if the expression consists of a CONSTANT, then we simply return the constants value, if it's an identifier, we look it up as a variable (and that would be a good place to throw a "I can't find the variable 'a'" message), otherwise we're back to the left side/right side thing.
I hope you can see how a simple evaluator can work once you have a token stream from a parser. Note how during evaluation, the major elements of the language are in place, otherwise we'd have got a syntax error and never got to this phase. We can simply expect to "know" that when we have a, for example, PLUS operator, we're going to have 2 expressions, the left and right side. Or when we execute an IF statement, that we already have a boolean expression to evaluate. The parse is what does that heavy lifting for us.
Getting started with a new language can be a challenge, but you'll find once you get rolling, the rest become pretty straightforward and it's almost "magic" that it all works in the end.
Note, pardon the formatting, but underscores are messing things up -- I hope it's still clear.
I would recommend antlr.org for information and the 'free' tool I would use for any parser use.
GOLD can be used for any kind of application where you have to apply context-free grammars to input.
elaboration:
Essentially, CFGs apply to all programming languages. So if you wanted to develop a scripting language for your company, you'd need to write a parser- or get a parsing program. Alternatively, if you wanted to have a semi-natural language for input for non-programmers in the company, you could use a parser to read that input and spit out more "machine-readable" data. Essentially, a context-free grammar allows you to describe far more inputs than a regular expression. The GOLD system apparently makes the parsing problem somewhat easier than lex/yacc(the UNIX standard programs for parsing).

Resources