What does the ⇒ mean in Context Free Grammar? Also, how do I derive/get the L(CFG) from a CFG?

I'm a bit confused. Does the ⇒ simply denote going through the CFG step by step? To use a fairly basic example:
sentence ⇒ subject | predicate
subject ⇒ article | noun
predicate ⇒ verb | direct object
article ⇒ THE | A
noun ⇒ CAT | MAN
And so on and so forth? Is the above an L(CFG)? Or does L(CFG) just refer to any string within the language defined by the CFG? Not asking for anything but a concrete answer, because I've already looked at like ten guides and not understood. Thanks!

Related

Can this language be described by a non-ambiguous BNF grammar?

I'm getting back into language design/specification (via BNF/EBNF grammars) after 20+ years of leaving the topic alone (since my CS undergrad degree).
I only vaguely recall the various related terms in this space, like LR(1), LALR, etc. I've been attempting to refresh via some googling and reading, but it's coming slowly (probably because I didn't fully understand this stuff back in school). So I'm probably doing things quite roughly.
I decided I would describe a toy language with a grammar, and then try to analyze and possibly optimize it, as a part of my re-learning.
NOTE: all the snippets below can also be found in a gist here.
I started with an EBNF representation (as processed/validated by this tool):
Program := WhSp* (StmtSemi WhSp*)* StmtSemiOpt? WhSp*;
Stmt := AStmt | BStmt | CStmt | DStmt;
StmtSemi := Stmt? (WhSp* ";")+;
StmtSemiOpt := Stmt? (WhSp* ";")*;
WhSp := "_";
AStmt := "a";
BStmt := "b";
CStmt := "c";
DStmt := "d";
Here are some valid matches for this language (one match per line):
_____
;;;;;
_;_;_
a
__a__
a;
a;b;
a;_b;
_a;_b;_
_a_;_b_;_
__a__;;
_;_a;_b;c;;;__;;__d;___a___
And here are some values that wouldn't be in the language (again, one per line):
ab
a_b
a;_b_c
I then hand-converted this to the following BNF form (as processed/analyzed by this tool):
Program -> StmtSemi FinalStmtSemiOpt .
StmtSemi -> WhSp StmtSemiOpt | StmtSemiOpt .
FinalStmtSemiOpt -> StmtOpt SemiOpt WhSpOpt | WhSpOpt .
Stmt -> AStmt | BStmt | CStmt | DStmt .
StmtOpt -> Stmt | .
StmtSemiOpt -> StmtOpt Semi | StmtOpt Semi WhSpOpt StmtSemiOpt | .
Semi -> WhSpOpt ; | WhSpOpt ; Semi .
SemiOpt -> Semi | .
WhSp -> _ | _ WhSp .
WhSpOpt -> WhSp | .
AStmt -> a .
BStmt -> b .
CStmt -> c .
DStmt -> d .
That tool's analysis says my grammar is ambiguous. I guess that's not surprising or necessarily a bad outcome, but I know ambiguous grammars limit some kinds of analysis and automatic conversion or parser generation.
So... finally, here are my questions:
1. Is this a context-free grammar? What specifically makes it so, or would make it non-CFG?
   [Edit: "Yes", see @rici's answer]
2. Can the language I'm describing be specified in a non-ambiguous grammar (BNF or EBNF)? Or is it just inherently ambiguous?
3. If it's inherently ambiguous, what specific aspects of the language make it so? In other words, what minimally would I have to change/remove to arrive at a language that had a non-ambiguous grammar?
4. Are there meaningful ways my BNF grammar form could be simplified and still describe the same language as the EBNF?
5. Does the BNF grammar currently have left-recursion, right-recursion, or both? I'm having trouble convincing myself of the answer. Could the BNF be re-arranged to avoid one or the other, and what would the impacts (performance, etc) of that be?
   [Edit: I believe the updated BNF only has right-recursion, according to the analysis tool.]
Apologies if I'm fumbling around with incorrect terminology or asking imprecise questions. Thanks for any insight you might be able to offer.
[EDIT: Here's a new BNF that I believe is equivalent but isn't ambiguous -- thanks to @rici for confirming it was possible. I didn't use any particular algorithm/strategy for this, I just kept fiddling by trial and error.]
Leading -> WhSp Leading Program | Semi Leading Program | Program .
Program -> Stmt | Stmt WhSp | Stmt WhSpOptSemi Program | Stmt WhSpOptSemi WhSp Program | .
Stmt -> AStmt | BStmt | CStmt | DStmt .
WhSpOptSemi -> Semi | WhSp Semi | Semi WhSpOptSemi | WhSp Semi WhSpOptSemi .
WhSp -> _ | _ WhSp .
Semi -> ; .
AStmt -> a .
BStmt -> b .
CStmt -> c .
DStmt -> d .
So that seems to answer questions (2), (3), and partly (4) I guess.
It's not always easy (or even possible) to demonstrate that a grammar is ambiguous, but if there is a short ambiguous sentence, then it can be found with brute-force enumeration, which is what I believe that tool does. And the output is revealing; the shortest ambiguous sentence is the empty string.
So it remains only to figure out why the empty string can be parsed in two (or more) ways, and a quick look at the productions reveals that FinalStmtSemiOpt has two nullable productions, which means that it has two ways of deriving the empty string. That's evident by inspection, if you believe that every production whose name ends with Opt in fact describes an optional sequence, since FinalStmtSemiOpt has two productions, each of which consists only of XOpts. A little more effort is required to verify that the optional non-terminals are, in fact, optional, which is as it should be.
So the solution is to rework FinalStmtSemiOpt without using two nullable right-hand sides. That shouldn't be too hard.
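To make the "two nullable productions" observation concrete, here is a minimal Python sketch that computes the nullable non-terminals of the first BNF above and reports any non-terminal with more than one nullable alternative. The dictionary encoding and the function names are my own assumptions for illustration, not something the analysis tool exposes.

```python
# The first BNF from the question, encoded by hand: each non-terminal maps to
# a list of alternatives, and each alternative is a list of symbols.
GRAMMAR = {
    "Program": [["StmtSemi", "FinalStmtSemiOpt"]],
    "StmtSemi": [["WhSp", "StmtSemiOpt"], ["StmtSemiOpt"]],
    "FinalStmtSemiOpt": [["StmtOpt", "SemiOpt", "WhSpOpt"], ["WhSpOpt"]],
    "Stmt": [["AStmt"], ["BStmt"], ["CStmt"], ["DStmt"]],
    "StmtOpt": [["Stmt"], []],
    "StmtSemiOpt": [["StmtOpt", "Semi"],
                    ["StmtOpt", "Semi", "WhSpOpt", "StmtSemiOpt"], []],
    "Semi": [["WhSpOpt", ";"], ["WhSpOpt", ";", "Semi"]],
    "SemiOpt": [["Semi"], []],
    "WhSp": [["_"], ["_", "WhSp"]],
    "WhSpOpt": [["WhSp"], []],
    "AStmt": [["a"]], "BStmt": [["b"]], "CStmt": [["c"]], "DStmt": [["d"]],
}

def nullable_nonterminals(grammar):
    """Fixed-point computation of the non-terminals that can derive the empty string."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, alternatives in grammar.items():
            if head in nullable:
                continue
            # An alternative is nullable when every symbol in it is nullable
            # (trivially true for the empty alternative).
            if any(all(sym in nullable for sym in alt) for alt in alternatives):
                nullable.add(head)
                changed = True
    return nullable

def multiply_nullable(grammar):
    """Map each non-terminal with two or more nullable alternatives to those alternatives."""
    nullable = nullable_nonterminals(grammar)
    result = {}
    for head, alternatives in grammar.items():
        empty_ways = [alt for alt in alternatives
                      if all(sym in nullable for sym in alt)]
        if len(empty_ways) > 1:
            result[head] = empty_ways
    return result

print(multiply_nullable(GRAMMAR))
```

For this grammar the only non-terminal reported is FinalStmtSemiOpt. Note that this check only catches ambiguity on the empty string; a grammar that passes it can still be ambiguous elsewhere.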
Almost all the other questions you raise are answerable by inspection:
A grammar is context-free (by definition) iff every production has precisely one symbol on the left hand side. (If it had more than one symbol, then the extra symbols define a context in which the substitution is valid. If no production has a context, the grammar is context-free.)
A non-terminal is left-recursive if some production for the non-terminal either starts with the non-terminal itself (direct left-recursion), or starts with a sequence of nullable non-terminals followed by the non-terminal itself (indirect left-recursion). In other words, the non-terminal is or could be the left-most symbol in a production. Similarly, a non-terminal is right-recursive if some production ends with the non-terminal (again, possibly followed by nullable non-terminals).
There is no algorithm which can tell you in general if a language is inherently ambiguous, but in the particular case that a language is regular, then it is definitely not inherently ambiguous. A language with a regular grammar is regular, but a non-regular grammar could still describe a regular language. Your grammar is not regular, but it can be made regular without too much work (mostly refactoring). A hint that it is probably regular is that there is no recursively nested syntax (eg. parentheses or statement blocks). Most useful grammars are recursively nested, and nesting in no way implies ambiguity. So that doesn't take you too far, but it happens to be the case here.
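As a rough illustration of the regularity point, the EBNF from the question can be transcribed almost symbol for symbol into a regular expression that uses only concatenation, alternation and repetition. This is a sketch under my own reading of the EBNF (the names `PROGRAM` and `in_language` are made up for the example), checked against the sample strings listed in the question.

```python
import re

# Direct transcription of:
#   Program     := WhSp* (StmtSemi WhSp*)* StmtSemiOpt? WhSp*
#   StmtSemi    := Stmt? (WhSp* ";")+
#   StmtSemiOpt := Stmt? (WhSp* ";")*
PROGRAM = re.compile(
    r"_*"                           # WhSp*
    r"(?:[abcd]?(?:_*;)+_*)*"       # (StmtSemi WhSp*)*
    r"(?:[abcd]?(?:_*;)*)?"         # StmtSemiOpt?
    r"_*"                           # WhSp*
)

def in_language(text):
    """Return True when the whole string matches the Program pattern."""
    return PROGRAM.fullmatch(text) is not None

valid = ["_____", ";;;;;", "_;_;_", "a", "__a__", "a;", "a;b;", "a;_b;",
         "_a;_b;_", "_a_;_b_;_", "__a__;;", "_;_a;_b;c;;;__;;__d;___a___"]
invalid = ["ab", "a_b", "a;_b_c"]

print(all(in_language(s) for s in valid))
print(any(in_language(s) for s in invalid))
```

Since the language is describable by a regular expression, an unambiguous grammar for it must exist; the regex itself says nothing about which parse tree you want, though.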

Parse grammar alternating and repeating

I was able to add support to my parser's grammar for alternating characters (e.g. ababa or baba) by following along with this question.
I'm now looking to extend that by allowing repeats of characters.
For example, I'd like to be able to support abaaabab and aababaaa as well. In my particular case, only the a is allowed to repeat but a solution that allows for repeating b's would also be useful.
Given the rules from the other question:
expr ::= A | B
A ::= "a" B | "a"
B ::= "b" A | "b"
... I tried extending it to support repeats, like so:
expr ::= A | B
# support 1 or more "a"
A_one_or_more = A_one_or_more "a" | "a"
A ::= A_one_or_more B | A_one_or_more
B ::= "b" A | "b"
... but that grammar is ambiguous. Is it possible for this to be made unambiguous, and if so could anyone help me disambiguate it?
I'm using the lemon parser which is an LALR(1) parser.
The point of parsing, in general, is to parse; that is, determine the syntactic structure of an input. That's significantly different from simply verifying that an input belongs to a language.
For example, the language which consists of arbitrary repetitions of a and b can be described with the regular expression (a|b)*, which can be written in BNF as
S ::= /* empty */ | S a | S b
But that probably does not capture the syntactic structure you are trying to define. On the other hand, since you don't specify that structure, it is hard to know.
Here are a couple more possibilities, which build different parse trees:
S ::= E | S E
E ::= A b | E b
A ::= a | A a
S ::= E | S E
E ::= A B
A ::= a | A a
B ::= b | B b
When writing a grammar to parse a language, it is useful to start by drawing your proposed parse trees. Usually, you can write the grammar directly from the form of the trees, which shows that a formal grammar is primarily a documentation tool, since it clearly describes the language in a way that informal descriptions cannot. Using a parser generator to turn that grammar into a parser ensures that the parser implements the described language. Or, at least, that is the goal.
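To show what "different parse trees" means in practice, here is a small hand-written Python sketch that builds the tree implied by the second grammar above (S is a sequence of E, each E is a run of a's followed by a run of b's). It is not generated by lemon or any other tool; the nested-tuple tree shape and the function name `parse` are assumptions made for the example.

```python
import re

# One E ::= A B group: a run of a's (A) followed by a run of b's (B).
GROUP = re.compile(r"(a+)(b+)")

def parse(text):
    """Build ("S", [("E", ("A", run), ("B", run)), ...]) or raise ValueError."""
    groups = []
    pos = 0
    while pos < len(text):
        match = GROUP.match(text, pos)
        if match is None:
            raise ValueError(f"no E group starts at offset {pos}")
        groups.append(("E", ("A", match.group(1)), ("B", match.group(2))))
        pos = match.end()
    if not groups:
        raise ValueError("S needs at least one E")
    return ("S", groups)

print(parse("aabab"))
```

With this structure "aabab" comes out as two E nodes, one holding "aa" and "b" and one holding "a" and "b". The first grammar accepts the same strings but would attach the b's one at a time instead of grouping them in a B node.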
Here is a nice tool for checking your grammar online: http://smlweb.cpsc.ucalgary.ca/start.html. It actually accepts the grammar you provided as a valid LALR(1) grammar.
A different LALR(1) grammar that allows repeating a's would be:
S ::= "a" S | "a" | "b" A | "b"
A ::= "a" S | "a" .

Converting given ambiguous arithmetic expression grammar to unambiguous LL(1)

This term I have a course on compilers and we are currently studying syntax: different grammars and types of parsers. I came across a problem which I can't quite figure out, or at least I can't be sure I'm doing it correctly. I have already made 2 attempts, and counterexamples were found for both.
I am given this ambiguous grammar for arithmetic expressions:
E → E+E | E-E | E*E | E/E | E^E | -E | (E)| id | num , where ^ stands for power.
I figured out what the priorities should be. Highest priority are parentheses, followed by power, followed by unary minus, followed by multiplication and division, and then there is addition and subtraction. I am asked to convert this into an equivalent LL(1) grammar. So I wrote this:
E → E+A | E-A | A
A → A*B | A/B | B
B → -C | C
C → D^C | D
D → (E) | id | num
The problem seems to be that this grammar is not equivalent to the first one, although it is unambiguous. For example, the given grammar can recognize the input --5 while my grammar can't. How can I make sure I'm covering all cases? How should I modify my grammar to be equivalent to the given one? Thanks in advance.
Edit: Also, I would of course do elimination of left recursion and left factoring to make this LL(1), but first I need to figure out this main part I asked above.
Here's one that should work for your case:
E = E+A | E-A | A
A = A*C | A/C | C
C = C^B | B
B = -B | D
D = (E) | id | num
As a side note: also pay attention to the requirements of your task, since some applications might assign higher priority to the unary minus operator than to the binary power operator.
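For what it's worth, here is a minimal recursive-descent sketch in Python of the grammar in this answer. The left-recursive rules (E, A, C) are written as loops, which is the usual hand-coded counterpart of eliminating left recursion. The tuple-based tree, the tokenizer regex and the names `Parser` and `parse` are my own assumptions for illustration. It follows this answer's precedence, so unary minus binds tighter than ^, and ^ is left-associative as written in C = C^B | B.

```python
import re

class Parser:
    def __init__(self, text):
        # Numbers, identifiers, or any single non-space character (operators, parens).
        self.tokens = re.findall(r"\d+|[A-Za-z_]\w*|\S", text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        if token is None:
            raise ValueError("unexpected end of input")
        self.pos += 1
        return token

    def left_assoc(self, operand, operators):
        # X = X op Y | Y  becomes  Y (op Y)*, folding to the left.
        node = operand()
        while self.peek() in operators:
            op = self.take()
            node = (op, node, operand())
        return node

    def parse_e(self):   # E = E+A | E-A | A
        return self.left_assoc(self.parse_a, ("+", "-"))

    def parse_a(self):   # A = A*C | A/C | C
        return self.left_assoc(self.parse_c, ("*", "/"))

    def parse_c(self):   # C = C^B | B
        return self.left_assoc(self.parse_b, ("^",))

    def parse_b(self):   # B = -B | D
        if self.peek() == "-":
            self.take()
            return ("neg", self.parse_b())
        return self.parse_d()

    def parse_d(self):   # D = (E) | id | num
        token = self.take()
        if token == "(":
            node = self.parse_e()
            if self.take() != ")":
                raise ValueError("expected ')'")
            return node
        if token[0].isalnum() or token[0] == "_":
            return token
        raise ValueError(f"unexpected token {token!r}")

def parse(text):
    parser = Parser(text)
    tree = parser.parse_e()
    if parser.peek() is not None:
        raise ValueError(f"unexpected token {parser.peek()!r}")
    return tree

print(parse("--5"))
print(parse("1+2*3"))
```

The input --5 that defeated the grammar in the question is accepted here because B refers to itself after the minus sign.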

A grammar expression for representing comma-delimited lists

Based on my experience, formal grammars typically express comma-delimited lists in a form similar to this:
foo_list -> foo ("," foo)*
What alternatives are there to avoid mentioning foo twice? Although this contrived example may seem innocent enough, I am encountering non-trivial expressions instead of foo. For example:
foo_list -> ( ( bar | baz | cat ) ) ( "," ( bar | baz | cat ) )*
I remember a (proprietary) parser generator that I once worked with, which would have this production written as
foo_list ::= <* bar | baz | cat ; "," *>
Yes, exactly like that. The actual metacharacters above are disputable, but I deem the general approach acceptable.
When writing another parser generator, I considered something alike for a while, but dropped it in favor of keeping the model simple.
A syntax diagram can, of course, represent it nicely without the unwanted repetition.
During my experimentation, this syntax showed some potential:
foo_list -> ( bar | baz | cat ) ("," ...)*
The ... token refers to the preceding expression (in this case, ( bar | baz | cat )).
This is not a perfect solution, but I am putting it out there for discussion.
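One way to think about any such shorthand is as a macro that expands to the familiar repeated form, so the grammar author writes the item only once. A tiny Python sketch of that expansion, where `separated_list` is a hypothetical helper name and not part of any parser generator I know of:

```python
def separated_list(item: str, separator: str) -> str:
    """Expand a separated-list shorthand into the `item (sep item)*` form."""
    return f"{item} ({separator} {item})*"

# The item expression is written once; the expansion repeats it for us.
print("foo_list -> " + separated_list("( bar | baz | cat )", '","'))
```

This prints the same production as the long-hand version above, minus the redundant outer parentheses.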

Hints on parsing

I want to implement a minimal templating language like Template Toolkit but much more simple.
I don't want to use an existing implementation/library, but start from scratch because I want to learn something from it and I want to completely understand it in order to adapt it to my needs.
The end product should be in C but I will probably try to make a prototype in Perl first.
To begin with, I only want it to handle including other files, substituting variables and, now comes the hard part, arbitrarily nestable if/elseif/else/endif constructs, which require some advanced parsing.
Here is an example illustrating its intended usage:
<h1>[% substitute title %]</h1>
<p>
[% if foo %]
foo is true
[% elseif bar %]
[% if baz %]
bar and baz are true
[% endif %]
bar is true
[% else %]
<em>none</em> is true
[% endif %]
</p>
I have decent C and some Perl skills but absolutely no knowledge in parsing, so I don't even know what exactly I am looking for.
So I would be interested in
which algorithms can handle parsing like this
reading recommendations on such algorithms, minimal introductions to parsing relevant here, or tutorials
minimal, well documented/commented examples (I could not make much sense from TT source)
TIA.
If you are using C, try (f)lex and yacc/bison. They are not that hard to use.
Besides, there are several questions on the basics of compilers on SO.
Just the basics:
The first step is to translate the character stream to a token stream.
For example [% and %] are two tokens. But an identifier is also a token.
The next step is to recognize the grammar and execute it. You can do this by building a syntax tree:
              [if]
            /  |  \
          /    |    \
         |    Exp    |
         |     |     |
         |    foo    |
         |           |
    "foo is.."     elsif
                  /  |  \
                /    |    \
               |    Exp    |
               |     |     |
               |    bar    |
               |           |
              if    "none is true"
            /  |  \
          /    |    \
         |    Exp    |
         |     |     |
         |    baz    |
         |           |
   "bar and..."    empty
Then execute the tree. That means: for each (else)if node, evaluate the expression, and execute the true branch if it is true and the false branch if it is false.
I've written a general answer to a similar question some time before. Hopefully, it can help you find a starting point.
JavaCC is the Java Compiler Compiler; it's for making compilers in Java. Quite a useful bit of software if you want to make a programming language or interpreter.
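The steps above (tokenize, build a tree, execute it) can be sketched in a few dozen lines. This is a rough Python prototype, not a reference implementation: the tag syntax is taken from the example in the question, while the token and node shapes and all function names are assumptions of mine. Conditions are just variable names looked up in a dictionary, and includes are left out.

```python
import re

TAG = re.compile(r"\[%\s*(.*?)\s*%\]", re.S)

def tokenize(source):
    """Split the template into ("text", str) and ("tag", [words]) tokens."""
    tokens, pos = [], 0
    for match in TAG.finditer(source):
        if match.start() > pos:
            tokens.append(("text", source[pos:match.start()]))
        tokens.append(("tag", match.group(1).split()))
        pos = match.end()
    if pos < len(source):
        tokens.append(("text", source[pos:]))
    return tokens

def parse_block(tokens, pos, stop_words):
    """Collect nodes until a tag whose first word is in stop_words, or the end."""
    nodes = []
    while pos < len(tokens):
        kind, value = tokens[pos]
        if kind == "text":
            nodes.append(("text", value))
            pos += 1
        elif value and value[0] in stop_words:
            break
        elif len(value) == 2 and value[0] == "substitute":
            nodes.append(("var", value[1]))
            pos += 1
        elif value and value[0] == "if":
            node, pos = parse_if(tokens, pos)
            nodes.append(node)
        else:
            raise ValueError(f"unexpected tag: {value}")
    return nodes, pos

def parse_if(tokens, pos):
    """Parse from an if/elseif tag through its endif; return (node, next_pos)."""
    tag = tokens[pos][1]
    if len(tag) != 2:
        raise ValueError(f"expected a single condition name in: {tag}")
    then_nodes, pos = parse_block(tokens, pos + 1, ("elseif", "else", "endif"))
    if pos >= len(tokens):
        raise ValueError("missing endif")
    word = tokens[pos][1][0]
    if word == "elseif":
        # Treat elseif as an if nested in the false branch; it shares our endif.
        nested, pos = parse_if(tokens, pos)
        return ("if", tag[1], then_nodes, [nested]), pos
    else_nodes = []
    if word == "else":
        else_nodes, pos = parse_block(tokens, pos + 1, ("endif",))
        if pos >= len(tokens):
            raise ValueError("missing endif")
    return ("if", tag[1], then_nodes, else_nodes), pos + 1

def render(nodes, variables):
    """Execute the tree: emit text, substitute variables, pick if branches."""
    out = []
    for node in nodes:
        if node[0] == "text":
            out.append(node[1])
        elif node[0] == "var":
            out.append(str(variables.get(node[1], "")))
        else:
            _, condition, then_nodes, else_nodes = node
            branch = then_nodes if variables.get(condition) else else_nodes
            out.append(render(branch, variables))
    return "".join(out)

def render_template(source, variables):
    nodes, _ = parse_block(tokenize(source), 0, ())
    return render(nodes, variables)

template = "[% if foo %]F[% elseif bar %][% if baz %]BB[% endif %]B[% else %]N[% endif %]!"
print(render_template(template, {"bar": True, "baz": True}))
```

The nesting falls out of the recursion: parse_block calls parse_if for every if tag, and parse_if calls parse_block for each branch, so arbitrary depth costs nothing extra. The same structure carries over to C with a token array and a small node struct.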
