How to support different language versions in my lexer/parser - parsing

I am wondering what is the best way to support different versions of a language in my grammar.
I am working on modifying an existing grammar for a language, and there is a new version of the language introducing new keywords and additional syntax that I should be able to parse. However, existing codebases written in the language may already use these new keywords as identifiers, for example, so I have to make this extension optional.
So my question is: what is the preferred way to write conditional lexer and parser rules, based on a boolean value? Semantic predicates came to my mind, but I am relatively new to ANTLR and I'm not sure whether it is a good idea to use them for such a purpose.

I had very good success with semantic predicates in the MySQL grammar, to support various MySQL versions. This includes new features, removed features and features that were valid only for a certain MySQL version range. Additionally, you can use the semantic predicates to tell the user in which version a specific syntax would be valid. But you have to parse the predicates yourself for that.
As an example, in this line a new import statement is conditionally added:
simpleStatement:
// DDL
...
| {serverVersion >= 80000}? importStatement
I have a field serverVersion in my common recognizer class from which both generated lexer and parser classes derive. This field is set with a valid version, right before the parsing process is triggered.
Also, in the lexer you can guard keywords with this approach, as shown in these and the surrounding lines in the MySQL lexer:
MASTER_SYMBOL: M A S T E R;
MASTER_TLS_VERSION_SYMBOL: M A S T E R '_' T L S '_' V E R S I O N {serverVersion >= 50713}?;
MASTER_USER_SYMBOL: M A S T E R '_' U S E R;
MASTER_HEARTBEAT_PERIOD_SYMBOL: M A S T E R '_' H E A R T B E A T '_' P E R I O D;
MATCH_SYMBOL: M A T C H; // SQL-2003-R
MAX_CONNECTIONS_PER_HOUR_SYMBOL: M A X '_' C O N N E C T I O N S '_' P E R '_' H O U R;
MAX_QUERIES_PER_HOUR_SYMBOL: M A X '_' Q U E R I E S '_' P E R '_' H O U R;
MAX_ROWS_SYMBOL: M A X '_' R O W S;
MAX_SIZE_SYMBOL: M A X '_' S I Z E;
MAX_STATEMENT_TIME_SYMBOL:
M A X '_' S T A T E M E N T '_' T I M E {50704 < serverVersion && serverVersion < 50708}?
;
MAX_SYMBOL: M A X { setType(determineFunction(MAX_SYMBOL)); }; // SQL-2003-N
MAX_UPDATES_PER_HOUR_SYMBOL: M A X '_' U P D A T E S '_' P E R '_' H O U R;
MAX_USER_CONNECTIONS_SYMBOL: M A X '_' U S E R '_' C O N N E C T I O N S;
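Outside ANTLR, the same idea can be hand-rolled. Here is a minimal Python sketch (the keyword names and version numbers are hypothetical, chosen only to mirror the examples above) of demoting a version-guarded keyword to a plain identifier:

```python
# Version-guarded keywords: each keyword records the first server version
# that knows about it (hypothetical values for illustration).
KEYWORD_MIN_VERSION = {
    "IMPORT": 80000,
    "MASTER_TLS_VERSION": 50713,
}

def classify(word, server_version):
    """Return (token_type, text). A word counts as a keyword only if the
    configured server version already supports it; otherwise it stays an
    ordinary identifier, so old code keeps parsing."""
    upper = word.upper()
    if upper in KEYWORD_MIN_VERSION and server_version >= KEYWORD_MIN_VERSION[upper]:
        return ("KEYWORD", upper)
    return ("IDENTIFIER", word)
```

An old codebase tokenized with a server version below 80000 would thus still be free to use `import` as an identifier, which is exactly what the semantic predicate achieves in the grammar.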

There are two approaches you can take:
1. If the additional syntax is not valid in the earlier version of the grammar, and the interpretation of previously valid expressions does not change, then (and only then) you can consider using something like semantic predicates to gauge which part of the input is parsed with the new grammar and which with the old one.
An example: extending an integer calculator to support floats.
1.0 is invalid under the earlier grammar, and the new grammar does not change the semantics of integer calculations such as 1.
This condition is not as easy to meet as it may seem; the conditions can be quite nuanced, particularly if the grammar or its new versions are complex.
2. Have two versions of the lexer/parser and switch between them, as #lex-li suggests. This is the safe path: it does not have to deal with semantic changes to old expressions caused by the additions of the new grammar syntax.
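The first approach (the integer-calculator-to-floats example) can be sketched as a tokenizer with a version switch. This is a hypothetical illustration in Python, not taken from any particular grammar:

```python
import re

def tokenize(text, allow_floats=False):
    """Tokenize a calculator expression. Float literals are only
    recognized when the newer grammar version is enabled."""
    pattern = r"\d+\.\d+|\d+|[-+*/()]" if allow_floats else r"\d+|[-+*/()]"
    tokens = re.findall(pattern, text)
    # Anything the pattern skipped over is a syntax error in this version.
    if "".join(tokens) != re.sub(r"\s+", "", text):
        raise SyntaxError("input not valid in this grammar version")
    return tokens
```

With `allow_floats=False` the input `1.0 + 2` is rejected, while `1 + 2` tokenizes identically in both versions, satisfying the "old expressions keep their meaning" condition.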

Related

parsing and semantic analysis using CUP - Access parser stack

I have a rule in my grammar such as
A -> B C D E {: ...some actions... :}
;
D -> /*empty*/ {: some actions using attributes of B and C :}
;
To implement the actions associated with production rule of D, I need to access the parser stack. How can I do that in CUP?
Rewrite your grammar:
A -> A1 E
A1 -> B C D
If the action for the first production requires B and C as well, then the semantic value of A1 will have to be more complicated in order to pass the semantic values through.
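The point about A1's semantic value can be illustrated with a small sketch (hypothetical action code, written in Python rather than CUP's Java action blocks): the action for A1 bundles up the values of B and C so that A's action can still reach them.

```python
def reduce_A1(b, c, d):
    # A1 -> B C D: keep b and c alongside d so they survive to A's action
    return {"b": b, "c": c, "d": d}

def reduce_A(a1, e):
    # A -> A1 E: unpack everything the old action for A -> B C D E needed
    return ("A", a1["b"], a1["c"], a1["d"], e)
```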

How to learn pocketsphinx for bi-lingual system?

I created a dictionary with two languages (English/Persian) in one file, like this:
بگو B E G U
خزنده KH A Z A N D E
قدت GH A D E T
چنده CH A N D E
قد GH A D
من M A N
شب SH A B
hi H AA Y
hello H E L L O
how H O V
are AA R
you Y U
what V AA T
is I Z
your Y O R
name N E Y M
old O L D
where V E R
from F E R AA M
I used http://www.speech.cs.cmu.edu/tools/lmtool-new.html to build the language model. Then I tried to train an acoustic model with that language model and test it.
It works well for Persian audio but not for English words. After some trial and error I found that the problem is my phoneset. I used my own phoneset, as you can see above, but it seems pocketsphinx doesn't accept this phoneset for English words and only accepts its own phoneset for English!
So I want to know: have I identified the problem correctly? Should I use the pocketsphinx phoneset for my Persian words as well? And where can I find its complete phoneset, plus a guide on how to use it for Persian words?
You have to build a new acoustic model with a joined phoneset that covers both languages.

Operator precedence with LR(0) parser

A typical BNF defining arithmetic operations:
E :- E + T
| T
T :- T * F
| F
F :- ( E )
| number
Is there any way to re-write this grammar so it could be implemented with an LR(0) parser, while still retaining the precedence and left-associativity of the operators?
I'm thinking it should be possible by introducing some sort of disambiguation non-terminals, but I can't figure out how to do it.
Thanks!
A language can only have an LR(0) grammar if it's prefix-free, meaning that no string in the language is a prefix of another. In this case, the language you're describing isn't prefix-free. For example, the string number + number is a prefix of number + number + number.
A common workaround to address this would be to "endmark" your language by requiring all strings generated to end in a special "done" character. For example, you could require that all strings generated end in a semicolon. If you do that, you can build an LR(0) parser for the language with this grammar:
S → E;
E → E + T | T
T → T * F | F
F → number | (E)

How do I rewrite a context free grammar so that it is LR(1)?

For the given context free grammar:
S -> G $
G -> PG | P
P -> id : R
R -> id R | epsilon
How do I rewrite the grammar so that it is LR(1)?
The current grammar has shift/reduce conflicts when parsing the input "id : .id", where "." is the input pointer for the parser.
This grammar produces the language satisfying the regular expression (id:(id)*)+
It's easy enough to produce an LR(1) grammar for the same language. The trick is finding one which has a similar parse tree, or at least from which the original parse tree can be recovered easily.
Here's a manually generated grammar, which is slightly simplified from the general algorithm. In effect, we rewrite the regular expression:
(id:id*)+
to:
id(:id+)*:id*
which induces the grammar:
S → id G $
G → P G | P'
P' → : R'
P → : R
R' → ε | id R'
R → ε | id R
which is LALR(1).
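The regular-expression rewrite above is easy to spot-check by brute force. A small Python sketch, using the single letter i to stand for the token id:

```python
import itertools
import re

# The original and the shifted regular expressions, with i standing for id.
orig = re.compile(r"(?:i:i*)+")        # (id:id*)+
shifted = re.compile(r"i(?::i+)*:i*")  # id(:id+)*:id*

def find_witness(max_len=10):
    """Return a string over {i, :} that one regex accepts and the other
    rejects, or None if no such witness exists up to max_len."""
    for n in range(max_len + 1):
        for chars in itertools.product("i:", repeat=n):
            s = "".join(chars)
            if bool(orig.fullmatch(s)) != bool(shifted.fullmatch(s)):
                return s
    return None
```

Enumerating all strings over {i, :} up to a modest length and finding no witness gives reasonable confidence that the two expressions denote the same language.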
In effect, we've just shifted all the productions one token to the right, and there is a general algorithm which can be used to create an LR(1) grammar from an LR(k+1) grammar for any k≥1. (The version of this algorithm I'm using comes from Parsing Theory by S. Sippu & E. Soisalon-Soininen, Vol II, section 6.7.)
The non-terminals of the new grammar will have the form (x, V, y), where V is a symbol from the original grammar (either a terminal or a non-terminal) and x and y are terminal sequences of length at most k such that:
y ∈ FOLLOW_k(V)
x ∈ FIRST_k(Vy)
(The lengths of y and consequently x might be less than k if the end of input is included in the follow set. Some people avoid this issue by adding k end symbols, but I think this version is just as simple.)
A non-terminal (x, V, y) will generate the x-derivative of the strings derived from Vy from the original grammar. Informally, the entire grammar is shifted k tokens to the right; each non-terminal matches a string which is missing the first k tokens but is augmented with the following k tokens.
The productions are generated mechanically from the original productions. First, we add a new start symbol, S' with productions:
S' → x (x, S, ε)
for every x ∈ FIRST_k(S). Then, for every production
T → V_0 V_1 … V_m
we generate the set of productions:
(x_0, T, x_{m+1}) → (x_0, V_0, x_1) (x_1, V_1, x_2) … (x_m, V_m, x_{m+1})
and for every terminal A we generate the set of productions
(Ax,A,xB) → B if |x| = k
(Ax,A,x) → ε if |x| ≤ k
Since there is an obvious homomorphism from the productions in the new grammar to the productions in the old grammar, we can directly create the original parse tree, although we need to play some tricks with the semantic values in order to correctly attach them to the parse tree.

Bottleneck in math parser Haskell

I got the code below from the wikibooks page here. It parses math expressions, and it works very well for the code I'm working on. There is one problem, though: when I start to add layers of brackets to my expression, the program slows down dramatically, at some point crashing my computer. It has something to do with the number of operators I have it check for; the more operators I have, the fewer brackets I can parse. Is there any way to get around or fix this bottleneck?
Any help is much appreciated.
import Text.ParserCombinators.ReadP

-- slower
operators = [("Equality",'='), ("Sum",'+'), ("Product",'*'), ("Division",'/'), ("Power",'^')]
-- faster
-- operators = [("Sum",'+'), ("Product",'*'), ("Power",'^')]

skipWhitespace = do
  many (choice (map char [' ','\n']))
  return ()

brackets p = do
  skipWhitespace
  char '('
  r <- p
  skipWhitespace
  char ')'
  return r

data Tree op = Apply (Tree op) (Tree op) | Branch op (Tree op) (Tree op) | Leaf String deriving Show

leaf = chainl1
  (brackets tree
   +++ do
     skipWhitespace
     s <- many1 (choice (map char "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.-[]"))
     return (Leaf s))
  (return Apply)

tree = foldr
  (\(op,name) p ->
    let this = p +++ do
          a <- p +++ brackets tree
          skipWhitespace
          char name
          b <- this
          return (Branch op a b)
    in this)
  (leaf +++ brackets tree)
  operators

readA str = fst $ last $ readP_to_S tree str

main = loop

loop = do
  -- try this: (a+b+(c*d))
  str <- getLine
  print $ last $ readP_to_S tree str
  loop
This is a classic problem with backtracking (or parallel parsing; they are essentially the same thing). Backtracking grows (at worst) exponentially with the size of the input, so parse time can suddenly explode. In practice backtracking works acceptably for most language-parsing input, but it blows up on recursive infix operator notation. You can see why by considering how many possible ways this could be parsed (using made-up & and % operators):
a & b % c & d
could be parsed as
a & (b % (c & d))
a & ((b % c) & d)
(a & b) % (c & d)
(a & (b % c)) & d
((a & b) % c) & d
The number of such groupings grows exponentially with the number of operators (for n operators it is the nth Catalan number). The solution is to add operator precedence information earlier in the parse and throw away all but the sensible cases. You will need an extra stack to hold pending operators, but you can then process infix operator expressions in a single linear pass, doing O(1) work per token.
LR parsers like yacc do this for you. With a parser combinator library you need to do it by hand. In parsec, the Text.Parsec.Expr module has a buildExpressionParser function that builds such a parser for you from a table of operators and precedences.
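The pending-operator-stack idea can be sketched as a shunting-yard parser. This Python version treats every operator as left-associative (an assumption for simplicity; the Haskell code above actually builds right-associative chains) and runs in linear time, doing constant work per token:

```python
# Operator precedences, mirroring the operator list in the Haskell code.
PREC = {"=": 1, "+": 2, "*": 3, "/": 3, "^": 4}

def parse(tokens):
    """Shunting-yard: build an AST of (op, left, right) tuples in one
    left-to-right pass, using a pending-operator stack instead of
    backtracking over every possible grouping."""
    out, ops = [], []          # output AST nodes and pending operators

    def reduce_top():
        op = ops.pop()
        b, a = out.pop(), out.pop()
        out.append((op, a, b))

    for t in tokens:
        if t in PREC:
            # Reduce pending operators of higher or equal precedence
            # (equal => left-associative).
            while ops and ops[-1] != "(" and PREC[ops[-1]] >= PREC[t]:
                reduce_top()
            ops.append(t)
        elif t == "(":
            ops.append(t)
        elif t == ")":
            while ops[-1] != "(":
                reduce_top()
            ops.pop()          # discard the "("
        else:
            out.append(t)      # operand
    while ops:
        reduce_top()
    return out[0]
```

Because each token is pushed and popped at most once, the whole parse is O(n) regardless of how deeply the brackets nest, which is exactly what the backtracking ReadP parser fails to guarantee.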
