What is the point of the 4 grammars specified in the Chomsky hierarchy? - automata

I'm currently studying compilers and am on the topic of "Chomsky Hierarchy and the 4 languages," but it beats me what the practical purpose of all this is.
It'd be great if I could see real-life examples of where the 4 grammars (Unrestricted, CSG, CFG, Regular Grammar) come into play.
I found online that the Chomsky hierarchy, along with the 4 grammars, is used to evaluate proposals within cognitive science, but this goes way over my head. It'd be great if someone could break it down for me, thanks a lot!

There is no practical value. That's the whole point.
Let me try to break that down a bit. It's useful to remember that Chomsky is a linguist -- someone who studies human languages -- and that he was writing in the late 1950s when computational theory was not as well-developed as it is today. (To put it mildly.) His goal was to find a mathematical model which could provide some insights into the mechanisms by which human beings generate and understand sentences, and he took as his starting point a particular simple model of sentence generation.
In this model, a grammar is a function F which maps an arbitrary sequence of symbols over some alphabet onto another sequence of symbols over the same alphabet. F is defined by a finite set of pairs (called productions) α → β. We then say that F(ω) = ζ if the definition of F contains some pair α → β such that α is a substring of ω and ζ is the result of substituting a single instance of α in ω with β.
That's not very interesting in and of itself; we make it into a full language by starting with some designated starting sequence, normally represented as the single symbol S, and repeatedly applying F as many times as necessary. (In all interesting grammars, the set so constructed is infinite, so it cannot actually be constructed. But we can imagine proceeding from the starting point until we find the sentence we wanted to generate.)
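To make the model concrete, here is a minimal Python sketch (a toy for illustration, not anything from Chomsky's papers) that applies productions as string rewrites, using the grammar S → aSb | ab, which generates a^n b^n:

# A minimal sketch of the rewriting model described above (toy code, made up for
# illustration). Productions are plain (alpha, beta) string pairs; a derivation
# step replaces one occurrence of alpha in the current string with beta.

def rewrites(s, productions):
    """Yield every string reachable from s in a single derivation step."""
    for alpha, beta in productions:
        start = s.find(alpha)
        while start != -1:
            yield s[:start] + beta + s[start + len(alpha):]
            start = s.find(alpha, start + 1)

def derive(start, productions, max_steps):
    """Collect all strings derivable from `start` in at most max_steps steps."""
    frontier, seen = {start}, {start}
    for _ in range(max_steps):
        frontier = {t for s in frontier for t in rewrites(s, productions)} - seen
        seen |= frontier
    return seen

# Toy grammar: S -> aSb | ab, which generates a^n b^n.
productions = [("S", "aSb"), ("S", "ab")]
sentences = {s for s in derive("S", productions, 4) if "S" not in s}
print(sorted(sentences))   # ['aaaabbbb', 'aaabbb', 'aabb', 'ab']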
The problem with this model is that it can be used to describe an arbitrary Turing Machine. Or, if you like, an arbitrary computer program, although the equivalence is easier to see with a Turing Machine. In other words, it is at least theoretically possible to construct a finite grammar which will recognise strings consisting of the description of a Turing Machine (i.e., a program written in some programming language) followed by an input and an output only if the Turing Machine applied to the input would produce the output. In other words, there exists (in the mathematical sense) a grammar of this form which is computationally equivalent to a general purpose computer.
Unfortunately, that's not actually very useful if our goal is to understand sentences, because there is actually no algorithm for running computers backwards. The best we can do as a general solution is to enumerate all possible inputs and run the program on each of them until we find the output we hoped for. And that doesn't actually work because there is no limit to the amount of time the program might take to produce an output and no way to even know if the program will eventually come to an end. (This is called the "halting problem".) So we might get stuck on some possible input, and we'll never know if some other input might have produced the desired output.
As a result, we cannot tell whether the provided input was "grammatical", that is, whether it conformed to the grammar provided. And that's not just the case with the particular grammar we built to emulate Turing Machines. It means that we have no confidence that we can recognise sentences from any arbitrary grammar, and even if we stumble upon an answer, we have no way to limit how much time it might take to get there.
Clearly, this is not how human beings understand each other. So if it is to serve a practical purpose, we must restrict the possible grammars in some way to make them computationally feasible.
On the other end of the spectrum, a lot was known about finite-state machines. A finite-state machine is a Turing Machine without a tape; that is to say, it is simply a finite collection of states. In each state, the machine reads a single input symbol and uses it to decide what the next state will be. It turns out that finite-state machines can be modelled using a grammar (as above) restricted to very simple productions, each of which is either of the form A → a B or A → a, where a is a symbol from the "terminal alphabet" (that is, a word) and A and B are single grammatical symbols. These grammars are called "regular grammars" and they are computationally equivalent to what mathematicians call "regular expressions" (which are a small subset of what is recognised by "regex" libraries, but that's a whole other discussion).
Regular grammars are easy to parse. All that is needed is to trace through the state machine, so it can be done without backtracking in time proportional to the length of the input. But regular grammars are far too weak to be able to represent human language, or even most computer languages. As a simple example, algebraic expressions with parentheses cannot be recognised with a regular grammar (or with a finite-state machine) because there is no way to count the parenthesis depth; the finite-state machine has no memory at all (other than knowing which state it is in, and there are only a finite number of states).
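To make that concrete, here is a minimal Python sketch (grammar and names made up for illustration) that simulates a right-linear grammar as a finite-state machine; it reads the input once, in time proportional to its length, and its only memory is the set of states it might currently be in:

# Productions are either A -> a B, written ("A", "a", "B"), or A -> a, written
# ("A", "a", None). Toy grammar for the language a+b+ (one or more a's followed
# by one or more b's):  S -> a S | a B    B -> b B | b
productions = [
    ("S", "a", "S"), ("S", "a", "B"),
    ("B", "b", "B"), ("B", "b", None),
]

def accepts(grammar, text):
    """Simulate the grammar as a (nondeterministic) finite automaton, one symbol at a time."""
    states = {"S"}                       # the set of nonterminals we might be in
    for i, ch in enumerate(text):
        last = (i == len(text) - 1)
        next_states = set()
        for lhs, sym, rhs in grammar:
            if lhs in states and sym == ch:
                if rhs is not None:
                    next_states.add(rhs)
                elif last:
                    next_states.add("ACCEPT")
        states = next_states
    return "ACCEPT" in states

print(accepts(productions, "aaabb"))   # True
print(accepts(productions, "abab"))    # False: not of the form a+b+
# No grammar of this restricted shape can match nested parentheses, because the
# only memory available is which state the machine is in.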
So unrestricted grammars are too powerful to parse and regular grammars are too weak to be useful. (Useful for complex parsing problems, that is. There are certain applications for regular expressions, but parsing complete computer programs is not one of them.)
The next step, then, was to try to find a restriction on grammars which was still powerful enough to represent human language without being so powerful that parsing became impossible.
That, finally, is the origin of Chomsky's hierarchy. Between the two extremes described above (type 0 and type 3 grammars), Chomsky proposed two possible intermediate restrictions -- type 1 and type 2 grammars -- and proved a number of important properties about each of them.
While this work turned out to be fundamental in the development of formal language theory, it cannot really be said to have answered the question Chomsky started with. Type 2 grammars -- context-free grammars -- are indeed computationally tractable; they can be parsed with simple algorithms in polynomial time, and can represent a large number of useful languages. But they are still too weak to represent human language. In particular, context-free grammars cannot represent a language as simple as "all strings which contain two instances of the same substring". Type 1 grammars -- context-sensitive grammars -- can probably represent any useful language, and are not quite as unruly as unrestricted grammars, but they are still too powerful to parse. (Since the derivation steps in a context-sensitive grammar never get shorter, it is possible to enumerate all possible derivations from a starting point in order by length, which means that you can decide whether a sentence is generated by the grammar without running into the halting problem. But that's as good as it gets; that procedure takes exponential time and is not remotely feasible for non-trivial inputs.)
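To see why type 2 grammars are tractable, here is a minimal CYK sketch in Python (the classic O(n^3) algorithm), assuming a grammar already in Chomsky normal form; the toy grammar below recognises non-empty strings of balanced parentheses and is made up purely for illustration:

# CNF productions: terminal rules map a symbol to nonterminals, binary rules map
# a pair of nonterminals to nonterminals.
terminal_rules = {"(": {"L"}, ")": {"R"}}
binary_rules = {
    ("S", "S"): {"S"},   # S -> S S
    ("L", "A"): {"S"},   # S -> ( S )   (via A)
    ("L", "R"): {"S"},   # S -> ( )
    ("S", "R"): {"A"},   # A -> S )
}

def cyk(text, start="S"):
    n = len(text)
    if n == 0:
        return False
    # chart[i][j] holds the nonterminals that derive text[i : i + j + 1]
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(text):
        chart[i][0] = set(terminal_rules.get(ch, ()))
    for length in range(2, n + 1):                # span length
        for i in range(n - length + 1):           # span start
            for split in range(1, length):        # split point
                left = chart[i][split - 1]
                right = chart[i + split][length - split - 1]
                for (b, c), parents in binary_rules.items():
                    if b in left and c in right:
                        chart[i][length - 1] |= parents
    return start in chart[0][n - 1]

print(cyk("(()())"))   # True
print(cyk("(()"))      # False

The three nested loops over span length, start and split point are where the cubic bound comes from.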
In the six decades since Chomsky published his seminal papers, a lot of work has been done to try to find useful intermediate restrictions between type 1 and type 2 grammars. And there has been a lot of useful study into algorithms for parsing context-free languages, which is of enormous utility in building compilers. All of this builds on the crucial work done by Chomsky and the other computational theorists whose work he built on -- Markov, Turing, Church and Kleene, just to name a few worthy of study. But Chomsky's original project remains unsolved.
So if your goal is to build a simple parser for a programming language, the Chomsky hierarchy is probably just an interesting footnote. But if you are interested in the academic study of formal language theory, there are still lots of interesting unsolved problems to work on.

Related

Converting grammars to the most efficient parser

Is there an online algorithm which converts a given grammar to the most efficient parser possible?
For example, SLR or LR(k) with k >= 0.
For the class of grammars you are discussing (xLR(k)), they are all linear time anyway, and it is impossible to do sublinear time if you have to examine every character.
If you insist on optimizing parse time, you should get a very fast LR parsing engine. LRStar used to be the cat's meow on this topic, but the guy behind it got zero reward from the world of "I want it for free" and pulled all instances of it off the net. You can settle for Bison.
Frankly most of your parsing time will be determined by how fast your parser can process individual characters, e.g., the lexer. Tune that first and you may discover there's no need to tune the parser.
First, let's distinguish LR(k) grammars from LR(k) languages. A grammar may not be LR(1) but only, say, LR(2); the language it generates, however, must have an LR(1) grammar, and for that matter an LALR(1) grammar. The tables for such an LALR(1) grammar are essentially the same size as SLR(1) tables, and LALR(1) is the more powerful method (every SLR(1) grammar is LALR(1), but not vice versa). So there is really no reason not to use an LALR(1) parser generator if you are looking to do LR parsing.
Since parsing represents only a fraction of the compilation time in modern compilers, once lexical analysis and code generation with peephole and global optimizations are taken into consideration, I would just pick a tool based on its entire set of features. You should also remember that one parser generator may take a little longer than another to analyze a grammar and to generate the parsing tables, but once that job is done, the table-driven parsing algorithm that will run in the actual compiler thousands of times should not vary significantly from one parser generator to another.
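To make the last point concrete, here is a hedged Python sketch of such a table-driven LR driver, with hand-built SLR(1) tables for the toy grammar E -> E + n | n; the tables, rule numbering and token names are assumptions for illustration, not the output of any particular generator:

# ACTION[state][token] is ("shift", next_state), ("reduce", rule) or ("accept",).
ACTION = {
    0: {"n": ("shift", 2)},
    1: {"+": ("shift", 3), "$": ("accept",)},
    2: {"+": ("reduce", 2), "$": ("reduce", 2)},   # E -> n
    3: {"n": ("shift", 4)},
    4: {"+": ("reduce", 1), "$": ("reduce", 1)},   # E -> E + n
}
GOTO = {0: {"E": 1}}
RULES = {1: ("E", 3), 2: ("E", 1)}     # rule number -> (lhs, length of rhs)

def parse(tokens):
    stack = [0]                         # stack of parser states
    tokens = tokens + ["$"]
    i = 0
    while True:
        action = ACTION[stack[-1]].get(tokens[i])
        if action is None:
            return False                # syntax error
        if action[0] == "accept":
            return True
        if action[0] == "shift":
            stack.append(action[1])
            i += 1
        else:                           # reduce: pop the handle, then goto
            lhs, length = RULES[action[1]]
            del stack[len(stack) - length:]
            stack.append(GOTO[stack[-1]][lhs])

print(parse(["n", "+", "n"]))    # True
print(parse(["n", "+", "+"]))    # False

Whatever generator produced the tables, the driver loop itself is this same constant-work-per-token shift/reduce cycle.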
As far as tools for converting arbitrary grammars to LALR(1), for example (in theory this can be done), you could do a Google search (I have not done this). But since semantics are tied to the productions, I would want to have complete control over the grammar being used for parsing and would probably eschew such conversion tools.

Neural Networks For Generating New Programming Language Grammars

I have recently had the need to create an ANTLR language grammar for the purpose of a transpiler (converting one scripting language to another). It occurs to me that Google Translate does a pretty good job translating natural language, we have all manner of recurrent neural network models and LSTMs, and GPT-2 generates grammatically correct text.
Question: Is there a model sufficient to train on grammar/code example combinations for the purpose of then outputting a new grammar file given an arbitrary example source-code?
I doubt any such model exists.
The main issue is that a language is generated from its grammar, and it is next to impossible to go back the other way, due to the infinite number of parse trees (combinations) possible for various source programs.
So in your case, say you train on Python code (1000 sample programs): the target grammar is the same for every training example, so the model will always generate the same grammar irrespective of the example source code.
If you use training samples from a number of languages, the model still can't generate the grammar, as the space of possible grammars is infinite.
Your example of Google Translate works for real-life translation because small errors are acceptable, but such models don't rely on recovering the underlying grammar of each language. There are some tools that can translate between programming languages, but they don't generate the grammar; they work from an existing grammar.
Update
How to learn grammar from code.
After comparing this to some NLP concepts, I have a list of issues that may arise and ways to counter them.
Dealing with variable names, coding structures and tokens.
To understand the grammar, we'll have to break the code down to its bare minimum form. This means understanding what each and every term in the code means. Consider a simple arithmetic expression and its parse tree.
Even a simple expression is reduced to a parse tree. The tree breaks the expression down and tags each number as a factor. This is really important for getting rid of the human element of the code (such as variable names) and diving into the actual grammar. In NLP this concept is known as Part-of-Speech tagging. You'll have to develop your own method to do the tagging, but it's easy given that you know the grammar for the language.
Understanding the relations
For this, you can tokenize the reduced code and train a model based on the output you are looking for. In case you want to generate code, make use of an n-gram model or an LSTM. The model will learn the grammar, but extracting it is not a simple task; you'll have to run separate code to try to extract all the possible relations learned by the model.
Example
Code snippet
# Sample code
int a = 1 + 2;
cout<<a;
Tags
# Sample tags and tokens
int a = 1 + 2 ;
[int] [variable] [operator] [factor] [expr] [factor] [end]
Leaving the operator, expr and keyword tokens as they are shouldn't matter if there is enough data present; they will simply become part of the grammar.
This is a sample to help understand my idea. You can improve on this by having a deeper look at the Theory of Computation and understanding the working of the automata and the different grammars.
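As a rough Python sketch of the tagging step described above (the tag names mirror the example tags and are an assumed, made-up tag set, not an established standard):

import re

KEYWORDS = {"int", "float", "cout", "return"}   # assumed keyword list

def tag(token):
    if token in KEYWORDS:
        return f"[{token}]"
    if re.fullmatch(r"\d+", token):
        return "[factor]"
    if token == "=":
        return "[operator]"
    if token in "+-*/":
        return "[expr]"
    if token == ";":
        return "[end]"
    return "[variable]"                 # everything else treated as an identifier

def tag_line(line):
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", line)
    return list(zip(tokens, [tag(t) for t in tokens]))

print(tag_line("int a = 1 + 2;"))
# [('int', '[int]'), ('a', '[variable]'), ('=', '[operator]'),
#  ('1', '[factor]'), ('+', '[expr]'), ('2', '[factor]'), (';', '[end]')]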
What you're describing is 'just' learning the structure of Context-Free Grammars.
I'm not sure if this approach will actually work for your case, but it's a long-standing problem in NLP: grammar induction for Context-Free Grammars. An example introduction how to tackle this problem using statistical learning methods can be found in Charniak's Statistical Language Learning.
Note that what I described is about CFGs in general, but you might want to check induction for LL grammars, because parser generators mostly use these types of grammars.
I know nothing about ANTLR, but there are pretty good examples of translating natural language e.g. into valid SQL requests: http://nlpprogress.com/english/semantic_parsing.html#sql-parsing.

Cutting down on Stanford parser's time-to-parse by pruning the sentence

We are already aware that the parsing time of the Stanford Parser increases as the length of a sentence increases. I am interested in finding creative ways in which we can prune the sentence so that the parsing time decreases without compromising accuracy. For example, we can replace known noun phrases with one-word nouns. Similarly, can there be some other smart ways of guessing a subtree beforehand, say using the POS tag information? We have a huge corpus of unstructured text at our disposal, so we wish to learn some common patterns that can ultimately reduce the parsing time. Some references to publicly available literature in this regard would also be highly appreciated.
P.S. We already are aware of how to multi-thread using Stanford Parser, so we are not looking for answers from that point of view.
You asked for 'creative' approaches - the Cell Closure pruning method might be worth a look. See the series of publications by Brian Roark, Kristy Hollingshead, and Nathan Bodenstab. Papers: 1 2 3. The basic intuition is:
Each cell in the CYK parse chart 'covers' a certain span (e.g. the first 4 words of the sentence, or words 13-18, etc.)
Some words - particularly in certain contexts - are very unlikely to begin a multi-word syntactic constituent; others are similarly unlikely to end a constituent. For example, the word 'the' almost always precedes a noun phrase, and it's almost inconceivable that it would end a constituent.
If we can train a machine-learned classifier to identify such words with very high precision, we can thereby identify cells which would only participate in parses placing said words in highly improbable syntactic positions. (Note that this classifier might make use of a linear-time POS tagger, or other high-speed preprocessing steps.)
By 'closing' these cells, we can reduce both the asymptotic and average-case complexities considerably - in theory, from cubic complexity all the way to linear; practically, we can achieve approximately n^1.5 without loss of accuracy.
In many cases, this pruning actually increases accuracy slightly vs. an exhaustive search, because the classifier can incorporate information that isn't available to the PCFG. Note that this is a simple, but very effective form of coarse-to-fine pruning, with a single coarse stage (as compared to the 7-stage CTF approach in the Berkeley Parser).
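A much simplified Python sketch of the cell-closure idea, assuming a CYK-style chart indexed by spans and using stub predicates in place of the trained classifier (the word lists are invented for illustration):

def unlikely_to_begin_constituent(i, words):
    # Stub for the trained classifier described above.
    return words[i] in {")", ",", "of"}

def unlikely_to_end_constituent(i, words):
    # Stub: e.g. "the" almost never ends a constituent.
    return words[i] in {"the", "a", "("}

def open_cells(words):
    """Return the set of (start, end) spans whose chart cells stay open."""
    n = len(words)
    cells = set()
    for start in range(n):
        for end in range(start + 2, n + 1):          # multi-word spans only
            if unlikely_to_begin_constituent(start, words):
                continue
            if unlikely_to_end_constituent(end - 1, words):
                continue
            cells.add((start, end))
    return cells

words = "the cat sat on the mat".split()
total = len(words) * (len(words) - 1) // 2
print(f"{len(open_cells(words))} of {total} multi-word cells remain open")   # 11 of 15

The parser then only fills the open cells; the better the classifier, the closer the chart-filling work gets to linear in the sentence length.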
To my knowledge, the Stanford Parser doesn't currently implement this pruning technique; I suspect you'd find it quite effective.
Shameless plug
The BUBS Parser implements this approach, as well as a few other optimizations, and thus achieves throughput of around 2500-5000 words per second, usually with accuracy at least equal to that I've measured with the Stanford Parser. Obviously, if you're using the rest of the Stanford pipeline, the built-in parser is already well integrated and convenient. But if you need improved speed, BUBS might be worth a look, and it does include some example code to aid in embedding the engine in a larger system.
Memoizing Common Substrings
Regarding your thoughts on pre-analyzing known noun phrases or other frequently-observed sequences with consistent structure: I did some evaluation of a similar idea a few years ago (in the context of sharing common substructures across a large corpus, when parsing on a massively parallel architecture). The preliminary results weren't encouraging. In the corpora we looked at, there just weren't enough repeated substrings of substantial length to make it worthwhile. And the aforementioned cell closure methods usually make those substrings really cheap to parse anyway.
However, if your target domains involved a lot of repetition, you might come to a different conclusion (maybe it would be effective on legal documents with lots of copy-and-paste boilerplate? Or news stories that are repeated from various sources or re-published with edits?)

Deterministic Context-Free Grammar versus Context-Free Grammar?

I'm reading my notes for my comparative languages class and I'm a bit confused...
What is the difference between a context-free grammar and a deterministic context-free grammar? I'm specifically reading about how parsers are O(n^3) for CFGs and compilers are O(n) for DCFGs, and don't really understand how the difference in time complexities could be that great (not to mention I'm still confused about what characteristics make a CFG a DCFG).
Thank you so much in advance!
Conceptually they are quite simple to understand. The context-free grammars are those which can be expressed in BNF. The DCFGs are the subset for which a workable parser can be written.
In writing compilers we are only interested in DCFGs. The reason is that 'deterministic' means roughly that the next rule to be applied at any point in the parse is determined by the input so far and a finite amount of lookahead. Knuth invented LR(k) parsing back in the 1960s and proved it could handle any DCFG. Since then some refinements, especially LALR(1) and LL(1), have defined grammars that can be parsed in limited memory, along with techniques by which we can write them.
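To illustrate what 'deterministic' buys you, here is a minimal Python sketch of a recursive-descent parser for the LL(1) grammar S -> '(' S ')' S | empty (balanced parentheses, chosen purely for illustration); a single token of lookahead always decides which rule to apply, so the parse is one left-to-right scan with no backtracking:

def parse(tokens):
    pos = 0

    def lookahead():
        return tokens[pos] if pos < len(tokens) else "$"

    def parse_S():
        nonlocal pos
        if lookahead() == "(":          # lookahead alone selects S -> ( S ) S
            pos += 1
            parse_S()
            if lookahead() != ")":
                raise SyntaxError("expected ')'")
            pos += 1
            parse_S()
        # otherwise select S -> empty and consume nothing

    parse_S()
    if lookahead() != "$":
        raise SyntaxError("trailing input")
    return True

print(parse(list("(()())")))    # True
# parse(list("(()"))            # raises SyntaxError

For a nondeterministic grammar, no fixed amount of lookahead can make that choice, so a parser has to carry multiple candidate parses along (or backtrack), which is where the extra cost comes from.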
We also have techniques to derive parsers automatically from the BNF, if we know it's one of these grammars. Yacc, Bison and ANTLR are familiar examples.
I've never seen a parser for an NDCFG, but at any point in the parse it would potentially need to consider the whole of the input string and every possible parse that could be applied. It's not hard to see why that would get rather large and slow.
I should point out that many real languages are imperfect, in that they are not entirely context free, not unambiguous or otherwise depart from the ideal DCFG. C/C++ is a good example, but there are many others. These languages are usually handled by special purpose rules such as semantic or syntactic predicates, special case backtracking or other 'tricks' with no effect on performance.
The comments point out that certain kinds of NDCFG are common and many tools provide a way to handle them. One common problem is ambiguity. It is relatively easy to parse an ambiguous grammar by introducing a simple local semantic rule, but of course this can only ever generate one of the possible parse trees. A generalised parser for NDCFG would potentially produce all parse trees, and could perhaps allow those trees to be filtered on some arbitrary condition. I don't know any of those.
Left recursion is not a feature of NDCFG. It presents a particular challenge to the design of LL() parsers but no problems for LR() parsers.

How to check natural language sentence structure validity using a parser in Java?

I am working on a project in which there is a part where I will have to input a sentence to check whether it is a valid sentence or not.
For example, if I give the input as "I am working at home", then the output will give me "Valid Sentence" where if I give the input as "I working home am at", it will give me "Invalid Sentence".
I have searched some NLP parsing methods such as the Stanford Parser, but it would be helpful if someone could guide me through some Java examples for this kind of problem.
I will be grateful in advance for this help. Thank you.
Whether you use parse trees or not, you will need to use a Markov process to check validity. The features can be word sequences, part-of-speech tag sequences, parse tree segments (i.e. production rules and their extensions), etc. For these, you would use a tokenizer, a POS tagger and a natural language parser, respectively.
The validity check will also be a probabilistic score, not an absolute truth. All (or almost all) natural language parsers are statistical. Which means they require training data. These parsers use context-free grammars or mildly context-sensitive grammars such as CCG or TAG, which are among the best computational approximations of natural language grammars.
Essentially, the model will tell you how likely is it for a feature to appear in a valid sentence after a certain sequence of features has already been seen. That is, it will allow you to compute probabilities of the form P("at"|"am working") and P("at"|"home am"). The former should have a higher probability than the latter. You will need to experimentally determine how high a probability should be in order for a sentence to be considered as valid.
As qqlihq commented, these are under the broad definition of language models. For sentence validity, however, you will usually not need to measure perplexity. The conditional probability measures should suffice.
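A tiny Python sketch of that scoring idea (Python rather than Java, and the toy corpus, smoothing constant and vocabulary size are made up for illustration; a real system would train on a large corpus or use an off-the-shelf language model):

from collections import Counter

corpus = [
    "i am working at home",
    "i am working at the office",
    "she is working at home",
]

context_counts, continuation_counts = Counter(), Counter()
for sentence in corpus:
    words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    for i in range(2, len(words)):
        context = (words[i - 2], words[i - 1])
        context_counts[context] += 1
        continuation_counts[context + (words[i],)] += 1

def probability(sentence, alpha=0.01, vocab_size=1000):
    """Add-alpha smoothed product of P(w_i | w_{i-2}, w_{i-1})."""
    words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for i in range(2, len(words)):
        context = (words[i - 2], words[i - 1])
        numerator = continuation_counts[context + (words[i],)] + alpha
        denominator = context_counts[context] + alpha * vocab_size
        p *= numerator / denominator
    return p

print(probability("i am working at home"))   # comparatively high
print(probability("i working home am at"))   # unseen contexts, far lower

The absolute numbers mean little; what matters is the comparison, and the threshold for 'valid' has to be chosen experimentally, as noted above.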

Resources