Deterministic Finite Automata - automata

I am new to automata theory. This question below is for practice:
Let there be a language that is made of words that start and end with different symbols and have the alphabet {0,1}. For example, 001, 10110101010100, 10 and 01 are all accepted. But 101, 1, 0, and 1010001101 are rejected.
How do I:
Construct a Deterministic Finite Automata (DFA) diagram?
Find the regular expression for the DFA?
I tried to post an image of the DFA I drew, but I need 10 reputations to post images unfortunately, which I do not yet have.

To answer this question, I think it's easier to identify the regular expression first.
Regular Expression
1(1|0)*0 | 0(1|0)*1
(* denotes Kleene's star operation)
Now we convert this regular expression into an equivalent finite automata.
Constructing a DFA
You can design the NFA-∧(or NFA-ε in some texts) easily using Thompson constructors[1] for a given language(regex) which is then converted into an NFA without lambda transitions.
This NFA can then be mapped to an equivalent DFA using subset construction method. [2]
If you want, you can further reduce this DFA to obtain a minimal DFA which is unique for a given regular language. (Myhill-Nerode theorem) [3]
Regex → NFA-∧ → NFA → DFA → DFA(minimal),
This is the standard procedure.
[1]http://en.wikipedia.org/wiki/Thompson%27s_construction_algorithm
[2] http://www.cs.nuim.ie/~jpower/Courses/Previous/parsing/node9.html
[3]http://en.wikipedia.org/wiki/Myhill%E2%80%93Nerode_theorem

We can get two possibilities here-
1) String starts with 0 and ends with 1 => [0(0|1)*1]
2) Strings staring with 1 and ending with 0 => [1(0|1)*0]
Also from rejected strings we know that minimum length would be 2.
Therefore final expression would be [0(0|1)*1]|[1(0|1)*0]
NFA would be something like this
NFA for given language

Related

Starting a parser for scheme language

I am writing a basic parser for a Scheme interpreter and here are the definitions I have set up to define the various type of tokens:
# 1. Parens
Type:
PAREN
Subtype:
LEFT_PAREN
Value:
'('
# 2. Operators (<=, =, +, ...)
Type:
OPERATOR
Subtype:
EQUALS
Value:
'='
Arity:
2
# 3. Types (2.5, "Hello", #f, etc.)
Type:
DATA
Subtype:
NUMBER
Value:
2.4
# 4. Procedures, builtins, and such
Type:
KEYWORD
Subtype:
BUILTIN
Value:
"set"
Arity:
2
PROCEDURE:
... // probably need a new class for this
Does the above seem like it's a good starting place? Are there some obvious things I'm missing here, or does this give me a "good-enough" foundation?
Your approach makes distinctions which really don't exist in the syntax of the language, and also makes decisions far too early. For example consider this program:
(let ((x 1))
(with-assignment-notes
(set! x 2)
(set! x 3)
x))
When I run this:
> (let ((x 1))
(with-assignment-notes
(set! x 2)
(set! x 3)
x))
setting x to 2
setting x to 3
3
In order for this to work with-assignment-notes has to somehow redefine what (set! ...) means in its body. Here's a hacky and probably incorrect (Racket) implementation of that:
(define-syntax with-assignment-notes
(syntax-rules (set!)
[(_ form ...)
(let-syntax ([rewrite/maybe
(syntax-rules (set!)
[(_ (set! var val))
(let ([r val])
(printf "setting ~A to ~A~%" 'var r)
(set! var r))]
[(_ thing)
thing])])
(rewrite/maybe form) ...)]))
So the critical features of any parser for a Lisp-family language are:
it should not make any decision about the semantics of the language that it can avoid making;
the structure it constructs must be available to the language itself as first-class objects;
(and optionally) the parser should be modifiable from the language itself.
As examples:
it is probably inevitable that the parser needs to make decisions about what is and is not a number and what sort of number it is;
it would be nice if it had default handling for strings, but this should ideally be controllable by the user;
it should make no decision at all about what, say (< x y) means but rather should return a structure representing it for interpretation by the language.
The reason for the last, optional, requirement is that Lisp-family languages are used by people who are interested in using them for implementing languages. Allowing the reader to be altered from within the language makes that hugely easier, since you don't have to start from scratch each time you want to make a language which is a bit like the one you started with but not completely.
Parsing Lisp
The usual approach to parsing Lisp-family languages is to have machinery which will turn a sequence of characters into a sequence of s-expressions consisting of objects which are defined by the language itself, notably symbols and conses (but also numbers, strings &c). Once you have this structure you then walk over it to interpret it as a program: either evaluating it on the fly or compiling it. Critically, you can also write programs which manipulate this structure itself: macros.
In 'traditional' Lisps such as CL this process is explicit: there is a 'reader' which turns a sequence of characters into a sequence of s-expressions, and macros explicitly manipulate the list structure of these s-expressions, after which the evaluator/compiler processes them. So in a traditional Lisp (< x y) would be parsed as (a cons of a symbol < and (a cons of a symbol x and (a cons of a symbol y and the empty list object)), or (< . (x . (y . ()))), and this structure gets handed to the macro expander and hence to the evaluator or compiler.
In Scheme it is a little more subtle: macros are specified (portably, anyway) in terms of rules which turn a bit of syntax into another bit of syntax, and it's not (I think) explicit whether such objects are made of conses & symbols or not. But the structure which is available to syntax rules needs to be as rich as something made of conses and symbols, because syntax rules get to poke around inside it. If you want to write something like the following macro:
(define-syntax with-silly-escape
(syntax-rules ()
[(_ (escape) form ...)
(call/cc (λ (c)
(define (escape) (c 'escaped))
form ...))]
[(_ (escape val ...) form ...)
(call/cc (λ (c)
(define (escape) (c val ...))
form ...))]))
then you need to be able to look into the structure of what came from the reader, and that structure needs to be as rich as something made of lists and conses.
A toy reader: reeder
Reeder is a little Lisp reader written in Common Lisp that I wrote a little while ago for reasons I forget (but perhaps to help me learn CL-PPCRE, which it uses). It is emphatically a toy, but it is also small enough and simple enough to understand: certainly it is much smaller and simpler than the standard CL reader, and it demonstrates one approach to solving this problem. It is driven by a table known as a reedtable which defines how parsing proceeds.
So, for instance:
> (with-input-from-string (in "(defun foo (x) x)")
(reed :from in))
(defun foo (x) x)
Reeding
To read (reed) something using a reedtable:
look for the next interesting character, which is the next character not defined as whitespace in the table (reedtables have a configurable list of whitespace characters);
if that character is defined as a macro character in the table, call its function to read something;
otherwise call the table's token reader to read and interpret a token.
Reeding tokens
The token reader lives in the reedtable and is responsible for accumulating and interpreting a token:
it accumulates a token in ways known to itself (but the default one does this by just trundling along the string handling single (\) and multiple (|) escapes defined in the reedtable until it gets to something that is whitespace in the table);
at this point it has a string and it asks the reedtable to turn this string into something, which it does by means of token parsers.
There is a small kludge in the second step: as the token reader accumulates a token it keeps track of whether it is 'denatured' which means that there were escaped characters in it. It hands this information to the token parsers, which allows them, for instance, to interpret |1|, which is denatured, differently to 1, which is not.
Token parsers are also defined in the reedtable: there is a define-token-parser form to define them. They have priorities, so that the highest priority one gets to be tried first and they get to say whether they should be tried for denatured tokens. Some token parser should always apply: it's an error if none do.
The default reedtable has token parsers which can parse integers and rational numbers, and a fallback one which parses a symbol. Here is an example of how you would replace this fallback parser so that instead of returning symbols it returns objects called 'cymbals' which might be the representation of symbols in some embedded language:
Firstly we want a copy of the reedtable, and we need to remove the symbol parser from that copy (having previously checked its name using reedtable-token-parser-names).
(defvar *cymbal-reedtable* (copy-reedtable nil))
(remove-token-parser 'symbol *cymbal-reedtable*)
Now here's an implementation of cymbals:
(defvar *namespace* (make-hash-table :test #'equal))
(defstruct cymbal
name)
(defgeneric ensure-cymbal (thing))
(defmethod ensure-cymbal ((thing string))
(or (gethash thing *namespace*)
(setf (gethash thing *namespace*)
(make-cymbal :name thing))))
(defmethod ensure-cymbal ((thing cymbal))
thing)
And finally here is the cymbal token parser:
(define-token-parser (cymbal 0 :denatured t :reedtable *cymbal-reedtable*)
((:sequence
:start-anchor
(:register (:greedy-repetition 0 nil :everything))
:end-anchor)
name)
(ensure-cymbal name))
An example of this. Before modifying the reedtable:
> (with-input-from-string (in "(x y . z)")
(reed :from in :reedtable *cymbal-reedtable*))
(x y . z)
After:
> (with-input-from-string (in "(x y . z)")
(reed :from in :reedtable *cymbal-reedtable*))
(#S(cymbal :name "x") #S(cymbal :name "y") . #S(cymbal :name "z"))
Macro characters
If something isn't the start of a token then it's a macro character. Macro characters have associated functions and these functions get called to read one object, however they choose to do that. The default reedtable has two-and-a-half macro characters:
" reads a string, using the reedtable's single & multiple escape characters;
( reads a list or a cons.
) is defined to raise an exception, as it can only occur if there are unbalanced parens.
The string reader is pretty straightforward (it has a lot in common with the token reader although it's not the same code).
The list/cons reader is mildly fiddly: most of the fiddliness is dealing with consing dots which it does by a slightly disgusting trick: it installs a secret token parser which will parse a consing dot as a special object if a dynamic variable is true, but otherwise will raise an exception. The cons reader then binds this variable appropriately to make sure that consing dots are parsed only where they are allowed. Obviously the list/cons reader invokes the whole reader recursively in many places.
And that's all the macro characters. So, for instance in the default setup, ' would read as a symbol (or a cymbal). But you can just install a macro character:
(defvar *qr-reedtable* (copy-reedtable nil))
(setf (reedtable-macro-character #\' *qr-reedtable*)
(lambda (from quote table)
(declare (ignore quote))
(values `(quote ,(reed :from from :reedtable table))
(inch from nil))))
And now 'x will read as (quote x) in *qr-reedtable*.
Similarly you could add a more compllicated macro character on # to read objects depending on their next character in the way CL does.
An example of the quote reader. Before:
> (with-input-from-string (in "'(x y . z)")
(reed :from in :reedtable *qr-reedtable*))
\'
The object it has returned is a symbol whose name is "'", and it didn't read beyond that of course. After:
> (with-input-from-string (in "'(x y . z)")
(reed :from in :reedtable *qr-reedtable*))
`(x y . z)
Other notes
Everything works one-character-ahead, so all of the various functions get the stream being read, the first character they should be interested in and the reedtable, and return both their value and the next character. This avoids endlessly unreading characters (and probably tells you what grammar class it can handle natively (obviously macro character parsers can do whatever they like so long as things are sane when they return).
It probably doesn't use anything which isn't moderately implementable in non-Lisp languages. Some
Macros will cause pain in the usual way, but the only one is define-token-parser. I think the solution to that is the usual expand-the-macro-by-hand-and-write-that-code, but you could probably help a bit by having an install-or-replace-token-parser function which dealt with the bookkeeping of keeping the list sorted etc.
You'll need a language with dynamic variables to implement something like the cons reeder.
it uses CL-PPCRE's s-expression representation of regexps. I'm sure other languages have something like this (Perl does) because no-one wants to write stringy regexps: they must have died out decades ago.
It's a toy: it may be interesting to read but it's not suitable for any serious use. I found at least one bug while writing this: there will be many more.

i want to make a CFG in which letter b is never trippled

need help regarding context free grammar. I want a cfg in which letter b is never tripled. This means that no word contains the substring bbb in it. Language contains only letters {a,b}
Here is a helpful recursive definition of this language:
the empty string is in the language
b is in the language
bb is in the language
if x is a string in the language, then xa is in the language
if x is a string in the language, then xab is in the language
if x is a string in the language, then xabb is in the language
nothing else is in the language unless by the above rules
First, let's make sure this definition is right. It is not hard to argue that this definition defines strings that must be in our target language; no string with the substring bbb can satisfy the above rules. Does the definition cover all cases so that all strings without the substring bbb work? In fact, it must. Consider any string in the language. It either has length less than three (in this case, we can check all possible strings are treated correctly, which they are); for longer strings, they must end with a, ab or abb (they cannot end with bbb). Our rules imply the existence of a string x in the language without these suffixes, which can be recursively checked for membership. This can be reversed to yield a convincing proof by mathematical induction.
With a recursive definition like the above in hand, we can just write down the corresponding grammar:
S -> e | b | bb | Sa | Sab | Sabb
The real ingenuity here was getting the definition. How did I do that? I thought of the shortest strings in the language - the unique ones, that don't fit a pattern - and then I asked how to make longer strings from shorter ones. That is, given a string in the language, how do you make one bigger? That's the key to what context-free grammars let us do.

How can all keywords in a grammar be set to accept upper and lower case in an existing grammar?

I have a grammar that works, except the keywords must be upper case. Is there a way to shotgun all the keywords such that lower case equivalents will not be rejected? If not, how do I affect each of them individually?
I don't recommend input streams that convert case to make the keyword recognition case-insensitive. Such a stream will convert everything, strings, comments etc. even though that is a total waste of CPU cycles. A better approach is to tell explicitly in your grammar that you want (only) certain keywords to be case sensitive. The grammar is trivial:
fragment A: [aA];
fragment B: [bB];
...
fragment Z: [zZ];
KEYWORD1: K E Y W O R D '1';
...
The ATN for these rules is only marginally more complex (using 2 intervals instead of one for each letter, which is (in total) faster than a case conversion):
and as example the letter S:
Each node is a step the ATN simulator has to walk to parse a rule. Edge labels are symbols to match to allow this transition (with ɛ being the epsilon transition, i.e. an unconditional step without input consumption).

Grammar: start: (a b)? a c; Input: a d. Which error correct at position 2? 1. expected "b", "c". OR expected "c"

Grammar:
rule: (a b)? a c ;
Input:
a d
Question: Which error message correct at position 2 for given input?
1. expected "b", "c".
2. expected "c".
P.S.
I write parser and I have choice (dilemma) take into account that "b" expected at position or not take.
The #1 error (expected "b", "c") want to say that input "a b" expected but because it optional it may not expected but possible.
I don't know possible is the same as expected or not?
Which error message better and correct #1 or #2?
Thanks for answers.
P.S.
In first case I define marker of testing as limit of position.
if(_inputPos > testing) {
_failure(_inputPos, _code[cp + {{OFFSET_RESULT}}]);
}
Limit moved in optional expressions:
OPTIONAL_EXPRESSION:
testing = _inputPos;
The "b" expression move _inputPos above the testing pos and add failure at _inputPos.
In second case I can define marker of testing as boolean flag.
if(!testing) {
_failure(_inputPos, _code[cp + {{OFFSET_RESULT}}]);
}
The "b" expression in this case not add failure because it tested (inner for optional expression).
What you think what is better and correct?
Testing defined as specific position and if expression above this position (_inputPos > testing) it add failure (even it inside optional expression).
Testing defined as flag and if this flag set that the failures not takes into account. After executing optional expression it restore (not reset!) previous value of testing (true or false).
Also failures not takes into account if rule not fails. They only reported if parsing fails.
P.S.
Changes at 06 Jan 2014
This question raised because it related to two different problems.
First problem:
Parsing expression grammar (PEG) describe only three atomic items of input:
terminal symbol
nonterminal symbol
empty string
This grammar does not provide such operation as lexical preprocessing an thus it does not provide such element as the token.
Second problem:
What is a grammar? Are two grammars can be considred equal if they accept the same input but produce different result?
Assume we have two grammar:
Grammar 1
rule <- type? identifier
Grammar 2
rule <- type identifier / identifier
They both accept the same input but produce (in PEG) different result.
Grammar 1 results:
{type : type, identifier : identifier}
{type : null, identifier : identifier}
Grammar 2 results:
{type : type, identifier : identifier}
{identifier : identifier}
Quetions:
Both grammar equal?
It is painless to do optimization of grammars?
My answer on both questions is negative. No equal, Not painless.
But you may ask. "But why this happens?".
I can answer to you. "Because this is not a problem. This is a feature".
In PEG parser expression ALWAYS consists from these parts.
ORDERED_CHOICE => SEQUENCE => EXPRESSION
And this explanation is the my answer on question "But why this happens?".
Another problem.
PEG parser does not recognize WHITESPACES because it does not have tokens and tokens separators.
Now look at this grammar (in short):
program <- WHITESPACE expr EOF
expr <- ruleX
ruleX <- 'X' WHITESPACE
WHITESPACE < ' '?
EOF <- ! .
All PEG grammar desribed in this manner.
First WHITESPACE at begin and other WHITESPACE (often) at the end of rule.
In this case in PEG optional WHITESPACE must be assumed as expected.
But WHITESPACE not means only space. It may be more complex [\t\n\r] and even comments.
But the main rule of error messages is the following.
If not possible to display all expected elements (or not possible to display at least one from all set of expected elements) in this case is more correct do not display anything.
More precisely required to display "unexpected" error mesage.
How you in PEG will display expected WHITESPACE?
Parser error: expected WHITESPACE
Parser error: expected ' ', '\t', '\n' , 'r'
What about start charcters of comments? They also may be part of WHITESPACE in some grammars.
In this case optional WHITESPACE will be reject all other potential expected elements because not possible correctly to display WHITESPACE in error message because WHITESPACE is too complex to display.
Is this good or bad?
I think this is not bad and required some tricks to hide this nature of PEG parsers.
And in my PEG parser I not assume that the inner expression at first position of optional (optional & zero_or_more) expression must be treated as expected.
But all other inner (except at the first position) must treated as expected.
Example 1:
List<int list; // type? ident
Here "List<int" is a "type". But missing ">" is not at the first position in optional "type?".
This failure take into account and report as "expected '>'"
This is because we not skip "type" but enter into "type" and after really optional "List" we move position from first to next real "expected" (that already outside of testing position) element.
"List" was in "testing" position.
If inner expression (inside optional expression) "fits in the limitation" not continue at next position then it not assumed as the expected input.
From this assumption has been asked main question.
You must just take into account that we are talking about PEG parsers and their error messages.
Here is your grammar:
What is clear here is that after the first a there are two possible inputs: b or c. Your error message should not prioritize one over the other.
The basic idea to produce an error message for an invalid input is to find the most far place you failed (if your grammar where d | (a b)? a c, d wouldn't be part of the error) and determine what are all possible inputs that could make you advance and say "expected '...' but got '...'". There are other approaches to try to recover the parser and force it to continue. If there is only one possible expected token, let's temporarily insert it into the token stream and continue as if it where there since ever. This would lead to better error detection as you can find errors beyond the point where the parser first stopped.

REBOL path operator vs division ambiguity

I've started looking into REBOL, just for fun, and as a fan of programming languages, I really like seeing new ideas and even just alternative syntaxes. REBOL is definitely full of these. One thing I noticed is the use of '/' as the path operator which can be used similarly to the '.' operator in most object-oriented programming languages. I have not programmed in REBOL extensively, just looked at some examples and read some documentation, but it isn't clear to me why there's no ambiguity with the '/' operator.
x: 4
y: 2
result: x/y
In my example, this should be division, but it seems like it could just as easily be the path operator if x were an object or function refinement. How does REBOL handle the ambiguity? Is it just a matter of an overloaded operator and the type system so it doesn't know until runtime? Or is it something I'm missing in the grammar and there really is a difference?
UPDATE Found a good piece of example code:
sp: to-integer (100 * 2 * length? buf) / d/3 / 1024 / 1024
It appears that arithmetic division requires whitespace, while the path operator requires no whitespace. Is that it?
This question deserves an answer from the syntactic point of view. In Rebol, there is no "path operator", in fact. The x/y is a syntactic element called path. As opposed to that the standalone / (delimited by spaces) is not a path, it is a word (which is usually interpreted as the division operator). In Rebol you can examine syntactic elements like this:
length? code: [x/y x / y] ; == 4
type? first code ; == path!
type? second code
, etc.
The code guide says:
White-space is used in general for delimiting (for separating symbols).
This is especially important because words may contain characters such as + and -.
http://www.rebol.com/r3/docs/guide/code-syntax.html
One acquired skill of being a REBOler is to get the hang of inserting whitespace in expressions where other languages usually do not require it :)
Spaces are generally needed in Rebol, but there are exceptions here and there for "special" characters, such as those delimiting series. For instance:
[a b c] is the same as [ a b c ]
(a b c) is the same as ( a b c )
[a b c]def is the same as [a b c] def
Some fairly powerful tools for doing introspection of syntactic elements are type?, quote, and probe. The quote operator prevents the interpreter from giving behavior to things. So if you tried something like:
>> data: [x [y 10]]
>> type? data/x/y
>> probe data/x/y
The "live" nature of the code would dig through the path and give you an integer! of value 10. But if you use quote:
>> data: [x [y 10]]
>> type? quote data/x/y
>> probe quote data/x/y
Then you wind up with a path! whose value is simply data/x/y, it never gets evaluated.
In the internal representation, a PATH! is quite similar to a BLOCK! or a PAREN!. It just has this special distinctive lexical type, which allows it to be treated differently. Although you've noticed that it can behave like a "dot" by picking members out of an object or series, that is only how it is used by the DO dialect. You could invent your own ideas, let's say you make the "russell" command:
russell [
x: 10
y: 20
z: 30
x/y/z
(
print x
print y
print z
)
]
Imagine that in my fanciful example, this outputs 30, 10, 20...because what the russell function does is evaluate its block in such a way that a path is treated as an instruction to shift values. So x/y/z means x=>y, y=>z, and z=>x. Then any code in parentheses is run in the DO dialect. Assignments are treated normally.
When you want to make up a fun new riff on how to express yourself, Rebol takes care of a lot of the grunt work. So for example the parentheses are guaranteed to have matched up to get a paren!. You don't have to go looking for all that yourself, you just build your dialect up from the building blocks of all those different types...and hook into existing behaviors (such as the DO dialect for basics like math and general computation, and the mind-bending PARSE dialect for some rather amazing pattern matching muscle).
But speaking of "all those different types", there's yet another weirdo situation for slash that can create another type:
>> type? quote /foo
This is called a refinement!, and happens when you start a lexical element with a slash. You'll see it used in the DO dialect to call out optional parameter sets to a function. But once again, it's just another symbolic LEGO in the parts box. You can ascribe meaning to it in your own dialects that is completely different...
While I didn't find any written definitive clarification, I did also find that +,-,* and others are valid characters in a word, so clearly it requires a space.
x*y
Is a valid identifier
x * y
Performs multiplication. It looks like the path operator is just another case of this.

Resources