All viable prefixes of a Context Free Grammer - parsing

I am stuck to a problem from the famous dragon Book of Compiler Design.How to find all the viable prefixes of the following grammar:
S -> 0S1 | 01
The grammar is actually the language of the regex 0n1n.
I presume the set of all viable prefixes might come as a regex too.I came up with the following solution
0+
0+S
0+S1
0+1
S
(By plus , I meant no of zeroes is 1..inf)
after reducing string 000111 with the following steps:
stack input
000111
0 00111
00 0111
000 111
0001 11
00S 11
00S1 1
0S 1
0S1 $
S $
Is my solution correct or I am missing something?

0n1n is not a regular language; regexen don't have variables like n and they cannot enforce an equal number of repetitions of two distinct subsequences. Nonetheless, for any context-free grammar, the set of viable prefixes is a regular language. (A proof of this fact, in some form, appears at the beginning of Part II of Donald Knuth's seminal 1965 paper, On the Translation of Languages from Left to Right, which demonstrated both a test for the LR(k) property and an algorithm for parsing LR(k) grammars in linear time.)
OK, to the actual question. A viable prefix for a grammar is (by definition) the prefix of a sentential form which can appear on the stack during a parse using that grammar. It's called "viable" (which means "still alive" or "could continue growing") precisely because it must be the prefix of some right sentential form whose suffix contains no non-terminal symbol. In other words, there exists a sequence of terminals which can be appended to the viable prefix in order to produce a right-sentential form; the viable prefix can grow.
Knuth shows how to create a DFA which produces all viable prefixes, but it's easier to see this DFA if we already have the LR(k) parser produced by an LR(k) algorithm. That parser is a finite-state machine whose alphabet is the set of terminal and non-terminal symbols of a grammar. To get the viable-prefix grammar, we use exactly the same state machine, but we remove the stack (so that it becomes just a state machine) and the reduce actions, leaving only the shift and goto actions as transitions. All states in the viable-prefix machine are accepting states, since any prefix of a viable prefix is itself a viable prefix.
A key feature of this new automaton is that it cannot extend a prefix with a reduce action (since we removed all the reduce actions). A prefix with a reduce action is a prefix which ends in a handle -- recall that a handle is the right-hand side of some production -- so another definition of a viable prefix is that it is a right-sentential form (that is, a possible step in a derivation) which does not extend beyond the right-most handle.
The grammar you are working with has only two productions, so there are only two handles, 01 and 0S1. Note that 10 and 1S cannot be subsequences of any right-sentential form, nor can a right-sentential form contain more than one S. Any right-sentential form must either be a sentence 0n1n or a sentential form 0nS1n where n>0. But every handle ends at the first 1 of a sentential form, and so a viable prefix must end at or before the first 1. This produces precisely the four possibilities you list, which we can condense to the regular expression 0*0(S1?)?.
Chopping off the suffix removed the second n from the formula, so there is no longer a requirement of concordance and the language is regular.
Note:
Questions like this and their answers are begging to be rendered using MathJax. StackOverflow, unfortunately, does not provide this extension, which is apparently considered unnecessary for programming. However, there is a site in the StackExchange constellation dedicated to computing science questions, http://cs.stackexchange.com, and another one dedicated to mathematical questions, http://math.stackexchange.com. Formal language theory is part of both computing science and mathematics. Both of those sites permit MathJax, and questions on those sites will not be closed because they are not programming questions. I suggest you take this information into account for questions like this one.

Related

What is a valid character in an identifier called?

Identifiers typically consist of underscores, digits; and uppercase and lowercase characters where the first character is not a digit. When writing lexers, it is common to have helper functions such as is_digit or is_alnum. If one were to implement such a function to scan a character used in an identifier, what would it be called? Clearly, is_identifier is wrong as that would be the entire token that the lexer scans and not the individual character. I suppose is_alnum_or_underscore would be accurate though quite verbose. For something as common as this, I feel like there should be a single word for it.
Unicode Annex 31 (Unicode Identifier and Pattern Syntax, UAX31) defines a framework for the definition of the lexical syntax of identifiers, which is probably as close as we're going to come to a standard terminology. UAX31 is used (by reference) by Python and Rust, and has been approved for C++23. So I guess it's pretty well mainstream.
UAX31 defines three sets of identifier characters, which it calls Start, Continue and Medial. All Start characters are also Continue characters; no Medial character is a Continue character.
That leads to the simple regular expression (UAX31-D1 Default Identifier Syntax):
<Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*
A programming language which claims conformance with UAX31 does not need to accept the exact membership of each of these sets, but it must explicitly spell out the deviations in what's called a "profile". (There are seven other requirements, which are not relevant to this question. See the document if you want to fall down a very deep rabbit hole.)
That can be simplified even more, since neither UAX31 nor (as far as I know) the profile for any major language places any characters in Medial. So you can go with the flow and just define two categories: identifier-start and identifier-continue, where the first one is a subset of the second one.
You'll see that in a number of grammar documents:
Pythonidentifier ::= xid_start xid_continue*
RustIDENTIFIER_OR_KEYWORD : XID_Start XID_Continue*
| _ XID_Continue+
C++identifier:
identifier-start
identifier identifier-continue
So that's what I'd suggest. But there are many other possibilities:
SwiftCalls the sets identifier-head and identifier-characters
JavaCalls them JavaLetter and JavaLetterOrDigit
CDefines identifier-nondigit and identifier-digit; Continue would be the union of the two sets.

Antlr: lookahead and lookbehind examples

I'm having a hard time figuring out how to recognize some text only if it is preceded and followed by certain things. The task is to recognize AND, OR, and NOT, but not if they're part of a word:
They should be recognized here:
x AND y
(x)AND(y)
NOT x
NOT(x)
but not here:
xANDy
abcNOTdef
AND gets recognized if it is surrounded by spaces or parentheses. NOT gets recognized if it is at the beginning of the input, preceded by a space, and followed by a space or parenthesis.
The trouble is that if I include parentheses as part of the definition of AND or NOT, they get consumed, and I need them to be separate tokens.
Is there some kind of lookahead/lookbehind syntax I can use?
EDIT:
Per the comments, here's some context. The problem is related to this problem: Antlr: how to match everything between the other recognized tokens? My working solution there is just to recognize AND, OR, etc. and skip everything else. Then, in a second pass over the text, I manually grab the characters not otherwise covered, and run a totally different tokenizer on it. The reason is that I need a custom, human-language-specific tokenizer for this content, which means that I can't, in advance, describe what is an ID. Each human language is different. I want to combine, in stages, a single query-language tokenizer, and then apply a human-language tokenizer to what's left.
ANTLR is not the right tool for this task. A normal parser is designed for a specific language, that is, a set of sentences consisting of elements that are known at parser creation time. There are ways to make this more flexible, e.g. by using a runtime function in a predicate to recognize words not defined in the grammar, but this has other (negative) implications.
What you should consider is NLP for a different approach to process natural language. It's more than just skipping things between two known tokens.

Do we have to assume a parser is pre-reading tokens ahead of time?

Coming back to lexers and parsers after many years away, I find myself puzzled over the concept of a state change, for the purposes of context. I'm using Lemon as a parser and putting together my own lexer.
Let's take an example input like this one:
[groups]
syscon:
0x000 sysmemremap
0x004 presetctrl
[registers]
sysmemremap:
map 1-0
rsvd 31-2
presetctrl:32
mux 2-0
..etc...
So "syscon:" and "sysmemremap:" look the same but one is a GROUPNAME and the other is a REGISTERNAME. There's a context change between [groups] and [registers] that determines what each token is in reality.
Is it the parser that is in the best position to make that contexual change? As the parser doesn't have a sectional grammar, where one set of grammar applies in one set of circumstances and another in a different set, I presume the lexer should be the one deciding that "syscon:" generates a GROUPNAME if the mode is such that it should.
EDIT: Just spotted "the lexer hack" entry in Wikipedia that summarises the issue:
Without added context, the lexer cannot distinguish type identifiers
from other identifiers because all identifiers have the same format.
.... The solution generally consists of feeding information from the
semantic symbol table back into the lexer. That is, rather than
functioning as a pure one-way pipeline from the lexer to the parser,
there is a backchannel from semantic analysis back to the lexer.
Except (and this is the question I have) what can you assume about the parser's pre-reading of tokens? If the parser is bashing ahead and reading more tokens to do a better match - which I would expect it to do to some extent at least, it could well run into the situation that a state change in the parser is too late for the lexer as it already met and processed that token!
Or am I overthinking this?
I may be able to answer my own question. I think that any sort of reliance on what the parser is doing internally is likely a Bad Plan.
Now that I've found my Lex and Yacc book (the O'Reilly one), one of the examples in the Lex section is a state change one - If it see the word "verb" it starts defining verbs as opposed to looking them up. That work is done in the lexer so I guess that's the way it is - do it in the lexer.

Converting a function (as a string) to be graphed by TChart?

I am getting the user to input a function, e.g. y = 2x^2 + 3, as a string. What I am looking to do is to enter that string into TChart and for TChart to graph the function.
As far as I know, TChart/TeeChart will only accept X values that are assigned values, e.g. -10 to 10 for X, so the X value would need to be calculated each time - this isn't an issue.
The issue is getting each part of the inputted function and substituting the X-values into each part. The workaround I have found is to get the user to enter the degree for each part of the function, e.g. 2 for X^2, 3 for X^3, etc. but is there a cleaner way of doing this?
If I could convert the inputted string into a Mathematical formula which TeeChart would accept, that would be the ideal outcome.
Saying that you can't use external units effectively makes your question unanswerable in the SO format, because the topic is far broader (and deeper) that can comfortably be dealt with in SO's Q&A format. So the following is at best an outline:
If you want to, or have to, write a DIY expression evaluator, one way to do it is to proceed as follows:
Write yourself a class that takes a string as input and snips it up into a series of symbols, aka "tokens" which represent the component parts of the expression, e.g, numbers, operators, parentheses, names of functions, names of variables, etc; these tokens might themselves be records or class instances and need to include a mechanisms for storing values associated with particular symbols (e.g. the tokens that represent numbers in the input). This step is called "tokenisation" or "lexing". Store the resulting list of symbols in a list or similiar structure. This class needs to implement a mechanism to retrieve the next symbol from the list (usually, this method is called something like "NextToken") and indicate whether there are any symbols left. This class also needs a mechanism to "put back" a symbol (or, equivalently, "peek" the symbol following the current one).
Then, write yourself a s/ware machine which takes the tokenised symbols and "evaluates" the list of symbols to produce the (mathematical) result you're after. This step is an order of magnitude or two more difficult than the tokenisation step. There are numerous ways to do it. As I said an a comment earlier, a recursive descent parser is probably the most tractable approach if you've never done anything like this before. There are countless examples in textbooks, but here's a link to an article about a Delphi implementation that should be understandable as an intro:
http://www8.umoncton.ca/umcm-deslierres_michel/Calcs/ParsingMathExpr-1.html
That article begins by noting that there are numerous pre-existing Delphi expression evaluators but makes the point that they are not necessarily the best place to start for someone wanting to learn how to write an evaluator/parser rather than just use one. Instead it goes through the coding of an evaluator to implement this simple expression grammar:
expression : term | term + term | term − term
term : factor | factor * factor | factor / factor
factor : number | ( expression ) | + factor | − factor
(the vertical bar | denotes ‘or’)
The article has a link to a second part which shows had to add exponentiation to the evaluator - this is trickier than it might sound and involves issues of ambiguity: e.g. how to evaluate - and what does it mean to write - an expression like
x^y^z
? This relates to the issue of "associativity": most operators are "left associative" which means that they bind more tightly to what's on the left of them than what's on their right. The exponentiation operator is an example of the reverse, where the operator binds more tightly to what's on its right.
Have fun!
By the way, you used to see suggestions to implement an evaluator using the "shunting yard algorithm"
http://en.wikipedia.org/wiki/Shunting-yard_algorithm
to convert an "infix" expression where the operators are between the operands, as in 1 + 3 * 4 to RPN (reverse Polish notation), as used on older HP calculators. The reason to do that was that RPN makes for much more efficient evaluation of an expression that the infix equivalent. Ymmv, but personally I found that implementing the SY algorithm properly was actually trickier than learning how to write an evaluator in the expression/term/factor style.
Fwiw, RPN is the basis of the Forth programming language, http://en.wikipedia.org/wiki/Forth_%28programming_language%29, so you could write a Forth implementation in Delphi if you wanted!

Parsing special cases

If I understand correctly, parsing turns a sequence of symbols into a tree. My question is, is it possible to use some standard procedure (LR, LL, PEG, ..?) to parse the following two examples or is it necessary to write a specialized parser by hand?
Python source code, i.e. the whitespace-indented blocks
I think I read somewhere that the parser keeps track of the number of leading spaces, and pretends to replace them with curly brackets to delimitate the blocks. Is it fundamentally required because the standard parsing techniques are not powerful enough or is it for performance reasons?
PNG image format, where a block starts with a header and block size, after which there is the content of the block
The content could contain bytes which resemble some header so it is necessary to "know" that the next x bytes are not to be "parsed", i.e. they should be skipped. How to express this, say, with PEG? In other words, the "closing bracket" is represented by the length of the content.
Neither of the examples in the question are context-free, so strictly speaking they cannot be parsed with context-free grammars. But in practical terms, they are both pretty easy to parse.
The python algorithm is well-described in the Python reference manual (although you need to read that in context.) What's described there is a pre-processing step in which whitespace at the beginning of a line is systematically replaced with INDENT and DEDENT tokens.
To clarify: It's not really a preprocessing step, and it's important to observe that it happens after implicit and explicit line joining. (There are previous sections in the reference manual which describe these procedures.) In particular, lines are implicitly joined inside parentheses, braces and brackets, so the process is intertwined with parsing.
In practical terms, both the line-joining and indentation algorithms can be accomplished programmatically; typically, these would be done inside a custom scanner (tokenizer) which maintains both a stack of parentheses and indent levels. The token stream can then be parsed with normal context-free algorithms, but the tokenizer -- although it might use regular expressions -- needs context-sensitive logic (counting spaces, for example). [Note 1]
Similarly, formats which contain explicit sizes (such as most serialization formats, including PNG files, Google protobufs, and HTTP chunked encoding) are not context-free, but are obviously easy to tokenize since the tokenizer simply has to read the length and then read that many bytes.
There are a variety of context-sensitive formalisms, and these definitely have their uses, but in practical parsing the most common strategy is to use a Turing-equivalent formalism (such as any programming language, possibly augmented with a scanner-generator like flex) for the tokenizer and a context-free formalism for the parser. [Note 2]
Notes:
It may not be immediately obvious that Python indenting is not context-free, since context-free grammars can accept some categories of agreement. For example, {ωω-1 | ω∈Σ*} (the language of all even-length palindromes) is context-free, as is {anbn}.
However, these examples can't be extended, because the only count-agreement possible in a context-free language is bracketing. So while palindromes are context-free (you can implement the check with a single stack), the apparently very similar {ωω | ω∈Σ*} is not, and neither is {anbncn}
One such formalism is back-references in "regular" expressions, which might be available in some PEG implementation. Back-references allow the expression of a variety of context-sensitive languages, but do not allow the expression of all context-free languages. Unfortunately, regular expressions with back-references really suck in practice, because the problem of determining whether a string matches a regex with back-references is NP complete. You might find this question on a sister SE site interesting. (And you might want to reformulate your question in a way that could be asked on that site, http://cs.stackexchange.com.)
As a practical matter, almost all parser construction requires some clever hacks around the edges to overcome the limitations of the parsing machinery.
Pure context free parsers can't do Python; all the parser technologies you have listed are weaker than pure-context free, so they can't do it either. A hack in the lexer to keep track of indentation, and generate INDENT/DEDENT tokens, turns the indenting problem into explicit "parentheses", which are easily handled by context-free parsers.
Most binary files can't be processed either, as they usually contain, somewhere, a list of length N, where N is provided before the list body is encountered (this is kind of the example you gave). Again, you can get around this, with a more complicated hack; something must keep a stack of nested list lengths, and the parser has to signal when it moves from one list element to the next. The top-most length counter gets decremented, and the parser gets back a signal "reduce" or "shift". Other more complex linked structures are generally pretty hard to parse this way. Getting the parser to cooperate this way isn't always easy.

Resources